What is an experiment?
An experiment is a procedure that you perform in order to validate (or to reject) your hypothesis.
Your hypothesis might be that the selection strategy, the classifier (regressor), or a smart combination of those that you developed performs better than others. Or maybe you just want to let your approaches loose in the wild (on your data) and assess the results.
For the sake of simplicity, let’s assume that you have a paradigm H (your hypothesis), a data set X, and a performance measure E (this is how you assess the performance of your approach numerically; e.g. classification accuracy).
The following approach works for supervised learning too, not just for active learning.
A simple example
The main idea behind the design of experiments is that an experiment is similar to a contest.
The Contest: Alice has a data set consisting of 100 data points and wants to know whether Bob or Carl is the better data scientist. So she gives Bob and Carl 75 data points and asks each of them to provide the best model he can achieve. Afterwards, Alice will compare both models on the 25 data points she held back.
The Optimization: Now, both data scientists try to find the best parameters for their model. They also split the data: 60 for training and 15 for validation. After training several models with different parameters on the 60 data points, each of them chooses the model which performed best on the remaining 15 data points.
The Comparison: Finally, Alice will evaluate the final models of both data scientists on her held-out data. Bob wins if his model performs better; otherwise, Carl wins.
In the following, we use these terms to describe the different kinds of subsets (see also Wikipedia):
- Outer training set: the data Bob and Carl are given by Alice to find their best approach (75 data points)
- Outer test set (often: test or evaluation set): the data Alice held back to test Bob’s and Carl’s approach (25 data points)
- Inner training set (often: training set): the data Bob and Carl used to train a model with specific parameters of their approach (60 data points)
- Inner test set (often: validation set): the data Bob and Carl used to determine the best parameter set (15 data points)
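The two-level split from the contest can be sketched in a few lines of Python. This is only an illustration using the standard library: the helper `split`, the seed, and the idea of referring to data points by index are assumptions made for this example, not part of the original setup.

```python
import random

def split(indices, n_first, rng):
    """Shuffle a copy of `indices` and split it into two disjoint parts."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return shuffled[:n_first], shuffled[n_first:]

rng = random.Random(0)   # fixed seed so the split is reproducible
all_points = range(100)  # Alice's 100 data points, referenced by index

# Alice: 75 points for Bob and Carl (outer training set),
# 25 points held back (outer test set).
outer_train, outer_test = split(all_points, 75, rng)

# Bob (or Carl): 60 points for training (inner training set),
# 15 points for validation (inner test set).
inner_train, inner_test = split(outer_train, 60, rng)

print(len(outer_train), len(outer_test), len(inner_train), len(inner_test))
# prints: 75 25 60 15
```

Note that the inner split only ever touches the outer training set, so the 25 points Alice held back stay unseen.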
How can Bob and Carl do better (improve the generalization of their training procedure)?
So far, both data scientists had just one fixed training set (inner training set) and one validation set (inner test set). By chance, this validation set could be particularly difficult for one parameter setting and easy for another. Hence, we should ensure that every instance is used for testing once.
In k-fold cross validation, the data given by Alice (75 data points) is split into \(k=5\) folds, that is, 5 subsets with 15 instances each. To predict the labels of the first fold, the data from folds 2, 3, 4, and 5 is used for training. For the second fold, the algorithm is trained on folds 1, 3, 4, and 5, and so on. Because every instance is used for validation exactly once, the performance estimate is much more robust than with a single split. Hence, it is more probable that the parameter setting that performed best actually is the best for the given data.
But now a problem occurs: for the best parameter setting, each data scientist has 5 different models because of the k-fold cross validation. As Carl did not know what to do, he chose one at random. Bob had a better idea: he took the parameter setting that he had found to be best and trained the model on all the data he was given.
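As a rough sketch of how such folds can be generated, the following standard-library Python snippet splits 75 indices into 5 folds and lets each fold serve as the validation set once. The function name `k_fold_indices` and the seed are hypothetical choices for this illustration.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, validation) index lists for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i,
    # so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        # All other folds together form the training set.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

splits = list(k_fold_indices(n_samples=75, k=5))
for train, validation in splits:
    print(len(train), len(validation))  # prints "60 15" for each of the 5 folds
```

Every index appears in exactly one validation fold, which is the property the text asks for.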
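Bob's strategy can be sketched as follows. Everything concrete here is an assumption for illustration: the toy 1-D data with label noise, the k-nearest-neighbour classifier `fit_knn`, and the candidate values for its parameter. The point is only the last step, where the best parameter setting is used to train one final model on all 75 points.

```python
import random

def fit_knn(xs, ys, k):
    """'Train' a 1-D k-nearest-neighbour classifier; returns a predict function."""
    def predict(x):
        nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x))[:k]
        return int(2 * sum(ys[i] for i in nearest) > k)  # majority vote, k is odd
    return predict

def accuracy(model, xs, ys):
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(xs)

def cv_accuracy(xs, ys, k, n_folds=5):
    """Mean validation accuracy of parameter k over n_folds folds."""
    scores = []
    for f in range(n_folds):
        val = list(range(f, len(xs), n_folds))  # every n_folds-th point
        held_out = set(val)
        train = [i for i in range(len(xs)) if i not in held_out]
        model = fit_knn([xs[i] for i in train], [ys[i] for i in train], k)
        scores.append(accuracy(model, [xs[i] for i in val], [ys[i] for i in val]))
    return sum(scores) / n_folds

# Toy data standing in for the 75 points Alice handed out (10% label noise).
rng = random.Random(0)
xs = [rng.random() for _ in range(75)]
ys = [int(x > 0.5) if rng.random() > 0.1 else int(x <= 0.5) for x in xs]

candidates = [1, 5, 15]
cv_scores = {k: cv_accuracy(xs, ys, k) for k in candidates}
best_k = max(cv_scores, key=cv_scores.get)

# Bob's idea: keep the best parameter setting, refit on all 75 points.
final_model = fit_knn(xs, ys, best_k)
```

The five models trained during cross validation are discarded; they only served to score the parameter settings.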
How can Alice do better?
Alice faces a situation similar to that of Bob and Carl. Maybe one of them just got lucky, or the selection of training and test instances happened to favor one of the competitors. Hence, Alice also performs k-fold cross validation (here \(k=4\)): Bob and Carl are asked to provide 4 different models, one per outer training set, and Alice checks whether the results are consistent.
To be even more certain, she aggregates the results of one k-fold cross validation into a single performance value. She then repeats the cross validation multiple times with different random splits to make sure that the results are not due to chance.
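A minimal sketch of this repeated evaluation is shown below. The toy data, the nearest-class-mean classifier `fit_nearest_mean`, and the choice of 10 repetitions are assumptions made for illustration; any model and any number of repetitions could be plugged in instead.

```python
import random
import statistics

def fit_nearest_mean(xs, ys):
    """Toy classifier: predict the class whose training mean is closest."""
    means = {c: statistics.mean(x for x, y in zip(xs, ys) if y == c)
             for c in set(ys)}
    return lambda x: min(means, key=lambda c: abs(means[c] - x))

def one_cv_run(xs, ys, k, seed):
    """Run one k-fold cross validation and return a single accuracy value."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)  # a different seed gives a different split
    correct = 0
    for f in range(k):
        test = idx[f::k]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        model = fit_nearest_mean([xs[i] for i in train], [ys[i] for i in train])
        correct += sum(model(xs[i]) == ys[i] for i in test)
    return correct / len(xs)  # every point is tested exactly once

# Toy data: two overlapping classes with 50 points each.
rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(50)] + [rng.gauss(2, 1) for _ in range(50)]
ys = [0] * 50 + [1] * 50

# One performance value per 4-fold cross validation, repeated 10 times.
runs = [one_cv_run(xs, ys, k=4, seed=s) for s in range(10)]
print(f"mean={statistics.mean(runs):.3f} std={statistics.stdev(runs):.3f}")
```

A small standard deviation across the repetitions suggests that the result does not depend much on one particular split.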
Summary: How do you split your data?
The main idea of cross validation is to prevent the model from seeing the test data during training. This means that test data must be used neither for training nor for tuning. If we want to rank different algorithms with their best parameter settings, we need two-stage (nested) cross validation: algorithm selection is the outer cross validation, and on each outer training set we perform a separate inner cross validation to tune the parameters. More details can be found on the Wikipedia pages mentioned above.
If you are interested in how to evaluate active learning algorithms, please see the paper:
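Putting both stages together, the two-stage procedure might look like the following sketch. The toy data, the 1-D k-nearest-neighbour model, the candidate parameter values, and all function names are hypothetical; the structure (an outer loop with \(k=4\), an inner loop with \(k=5\) that only sees the outer training set) follows the description above.

```python
import random

def fit_knn(xs, ys, k):
    """'Train' a 1-D k-nearest-neighbour classifier; returns a predict function."""
    def predict(x):
        nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x))[:k]
        return int(2 * sum(ys[i] for i in nearest) > k)  # majority vote, k is odd
    return predict

def score(xs, ys, train, test, param):
    """Fit on the `train` indices and return the accuracy on the `test` indices."""
    model = fit_knn([xs[i] for i in train], [ys[i] for i in train], param)
    return sum(model(xs[i]) == ys[i] for i in test) / len(test)

def folds(indices, k):
    return [indices[i::k] for i in range(k)]

def select_param(xs, ys, train_idx, candidates, k_inner=5):
    """Inner cross validation: uses only the outer training set."""
    def cv_score(param):
        total = 0.0
        for val in folds(train_idx, k_inner):
            held_out = set(val)
            inner_train = [i for i in train_idx if i not in held_out]
            total += score(xs, ys, inner_train, val, param)
        return total / k_inner
    return max(candidates, key=cv_score)

def nested_cv(xs, ys, candidates, k_outer=4, seed=0):
    """Outer cross validation: one test accuracy per outer fold."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    outer_scores = []
    for test in folds(idx, k_outer):
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        best = select_param(xs, ys, train, candidates)  # tuning never sees `test`
        outer_scores.append(score(xs, ys, train, test, best))  # refit on all of `train`
    return outer_scores

# Toy data standing in for Alice's 100 points (10% label noise).
rng = random.Random(0)
xs = [rng.random() for _ in range(100)]
ys = [int(x > 0.5) if rng.random() > 0.1 else int(x <= 0.5) for x in xs]

outer_scores = nested_cv(xs, ys, candidates=[1, 5, 15])
print(outer_scores)  # four accuracies, one per outer fold
```

To compare two approaches, one would run `nested_cv` for each of them on the same outer splits (same seed) and compare the resulting scores.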
Challenges of Reliable, Realistic and Comparable Active Learning Evaluation by Kottke, Calma et al.