A common question on the Weka forums is how to keep track of instances by name. Weka has no name field for instances, so you have to create a string ID attribute that holds the name of each instance. The catch is that most classifiers don't work with string attributes, and you wouldn't want to classify on the ID anyway. The official solution is to delete the ID attribute before calling the classifier. Of course, if you delete the ID, you lose the names of your instances! Oof!

The solution is to use the meta.FilteredClassifier classifier with RemoveType as its filter. When you hand a FilteredClassifier to Evaluation, the filter strips the string attributes from each instance before it reaches the wrapped classifier, while Evaluation keeps working with the original Instances (ID included), so each prediction can still be matched to its name. Great.

Now what if you want to know how individual instances were classified during cross-validation? The API for extracting those classifications is not obvious, but it's easy enough once you know where to look. Evaluation.crossValidateModel() accepts a StringBuffer that it fills with the predictions. That text can then be parsed to recover each prediction and the instance name it goes with. The source code to do this is below:
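import weka.classifiers.Classifier
import weka.classifiers.Evaluation
import weka.core.Range
import weka.core.Utils
import weka.core.converters.ConverterUtils.DataSource

// First command-line argument: path to the ARFF file
arffName = args[0]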
// Read arff file...
data = DataSource.read(arffName)
// Pick out the class attribute..
// (this step is missing from the listing; the line below assumes the
// class is the last attribute in the ARFF file)
data.setClassIndex(data.numAttributes() - 1)
// Create a classifier from the name...
// By using a filtered classifier to remove the ID, the cross-validation
// wrapper will keep the original dataset and keep track of the mapping
// between the original and the folds (minus ID).
// NOTE: the line defining the classifier spec string is missing from the
// listing. This reconstruction assumes the base classifier name
// (e.g. weka.classifiers.trees.J48) is the second command-line argument.
classifier = "weka.classifiers.meta.FilteredClassifier" +
    " -F weka.filters.unsupervised.attribute.RemoveType -W " + args[1]
options = Utils.splitOptions(classifier)
classname = options[0]
options[0] = ""   // blank out the class name so only the options remain
classifier = Classifier.forName(classname,options)
// Perform cross-validation of the model..
eval = new Evaluation(data)
predictions = new StringBuffer()
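// NOTE: the call is cut off in the listing. The trailing arguments below are
// assumptions based on the Weka 3.6 signature: a Random for shuffling, the
// StringBuffer to fill, a Range of attributes to print with each prediction
// ("first" assumes the ID is the first attribute), and a Boolean for
// printing class distributions.
eval.crossValidateModel(classifier, data, cvFolds = 5, new Random(1),
        predictions, new Range("first"), false)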
lines = predictions.toString().split("\n")
// Output of predictions looks like:
// inst# actual predicted error prediction (ID)
// 1 1:low 1:low 1 (P1)
// 2 2:high 1:low + 0.5 (P6)
// 3 2:high 2:high 1 (P0)
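// Parse out fields we're interested in..
lines.each { line ->
    m = line =~ /\d:(\w+).*\d:(\w+).*\((\w+)\)/
    if (!m) return          // skip the header and any line that is not a prediction
    actual = m[0][1]        // actual class label
    predicted = m[0][2]     // predicted class label
    sample = m[0][3]        // instance ID
}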
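
The parsing step can be tried on its own, without Weka installed. What follows is a minimal sketch in plain Java using only java.util.regex. The class name PredictionParser and its methods are hypothetical names chosen for illustration, and the sample text is copied from the output comment in the listing above, not produced by a real run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Weka: parses the predictions text
// that crossValidateModel() writes into the StringBuffer.
final class PredictionParser {
    // Same pattern as the listing above: actual label, predicted label,
    // then the ID in parentheses.
    private static final Pattern LINE =
        Pattern.compile("\\d:(\\w+).*\\d:(\\w+).*\\((\\w+)\\)");

    /** Returns {actual, predicted, id}, or null if the line is not a prediction row. */
    static String[] parseLine(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    /** Parses the whole predictions text, skipping the header and blank lines. */
    static List<String[]> parseAll(String predictions) {
        List<String[]> rows = new ArrayList<>();
        for (String line : predictions.split("\n")) {
            String[] row = parseLine(line);
            if (row != null) {
                rows.add(row);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Sample text taken from the output comment in the post.
        String sample = "inst# actual predicted error prediction (ID)\n"
            + "1 1:low 1:low 1 (P1)\n"
            + "2 2:high 1:low + 0.5 (P6)\n"
            + "3 2:high 2:high 1 (P0)\n";
        for (String[] row : parseAll(sample)) {
            System.out.println(row[2] + ": actual=" + row[0] + ", predicted=" + row[1]);
        }
    }
}
```

Returning null for lines that don't match is one simple way to skip the header row. Note that the pattern only captures IDs and class labels made of word characters, so names containing spaces or punctuation would need a looser expression.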