Machine learning has to address the challenge of using a finite and perhaps limited data set to generate predictions about unlimited, unknown data. The key techniques we use are explored below.
Machine learning essentially aims to find models that make predictions or inferences about data. If we had access to all the data relevant to a particular problem, arriving at accurate predictions would be relatively easy: with every possible example of the problem at our disposal, we could choose the model that gave the lowest error rate on the entire population, and that would be the true error rate, as accurate as we could ever get. In the real world, however, we almost always have access to only a limited set of examples. So we need a way of optimising our choice of model using one data set, and then evaluating its performance on another data set that is equally representative of the problem. This approach is called training and testing.
The training element of the process is where we take a representative set of data to help choose and set up the model we’re going to use. Successive training iterations progressively improve the model’s accuracy until an acceptable result is reached (or as good a result as can be achieved).
A testing data set is then used to validate the performance of the trained model. For this to be a valid means of evaluation, the testing data must be different from the training data, yet still representative of the problem, just as the training set is. The testing step, in other words, tests the wider applicability of the trained model to other, but similar, data sets.
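The training-and-testing split described above can be sketched in a few lines of Python. The 80/20 ratio, the shuffling step and the fixed seed are illustrative choices for this sketch, not requirements of the approach:

```python
import random

def train_test_split(data, test_fraction=0.2, seed=42):
    """Split a data set into a training set and a disjoint testing set.

    The data are shuffled first so that both sets remain representative
    of the underlying problem rather than reflecting the original order.
    """
    shuffled = data[:]                      # copy, leave the original intact
    random.Random(seed).shuffle(shuffled)   # deterministic shuffle for this sketch
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]   # (training set, testing set)

# Example: 100 labelled examples, 80 for training and 20 held back for testing.
examples = list(range(100))
train, test = train_test_split(examples)
print(len(train), len(test))  # 80 20
```

The essential property is that the two sets are disjoint: no example used to set up the model is also used to evaluate it.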
The diagram below captures key elements of the training and testing steps.
The next section looks at a particular example of this approach: the hold-out method for large data sets.
The hold-out method is an example of the training/testing approach, although it has an extra element in that the training step is divided into two phases: training and validation. These are completed before the testing step.
In this approach, we still separate the data into training and testing sets, with the testing set drawn from a completely different sample of the problem from the training set. In this case, however, the training data are themselves split into two sets (training and validation) so that, even within the training stage, optimising the model and selecting the best fit are carried out on different data from those used for the initial set-up. The validation step thus provides an element of pre-testing to hone the model’s performance.
The testing step then uses a different sample, again, to test performance and accuracy.
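The three-way division behind the hold-out method can be sketched as follows. The 60/20/20 proportions are an illustrative assumption; the text prescribes only that the three sets be disjoint samples of the same problem:

```python
import random

def holdout_split(data, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Three-way hold-out split: training, validation and testing sets.

    The validation set is carved out of the training portion and used to
    pre-test and tune the model during training; the testing set is used
    only once, at the end, to estimate performance on unseen data.
    """
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_fraction)
    n_val = int(n * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = holdout_split(list(range(1000)))
print(len(train), len(val), len(test))  # 600 200 200
```

Because the hold-out method discards a sizeable fraction of the examples from the training stage, it suits large data sets, as the next paragraph notes.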
The hold-out method is typically used for large data sets to train the machine learning model to classify, predict, associate or cluster.
We don’t always have the luxury of setting aside a meaningful portion of a data set for validation or testing purposes. Many data sets are sparse and assembled manually, e.g. annotated natural language text corpora or scanned medical images annotated for the presence of cancer, so having sufficient data for the different sets may be an issue. Test data may also be unavailable at the time of training, e.g. when predicting tomorrow’s weather or inflation in three months’ time.
If there is only a single train-validation pair, the resulting performance estimate may be insufficiently accurate for our purpose.
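One common remedy, sketched below under our own assumptions rather than prescribed by the text, is to repeat the train-validation split several times and average the scores, which reduces the variance a single split would leave us with. The `evaluate` function here is a hypothetical stand-in for training a model on one set and scoring it on the other:

```python
import random
import statistics

def repeated_validation(data, evaluate, n_repeats=5, val_fraction=0.2):
    """Average a validation score over several random train-validation splits.

    `evaluate(train, val)` is a caller-supplied function (hypothetical in
    this sketch) that trains a model on `train` and returns its score
    on `val`.
    """
    scores = []
    for seed in range(n_repeats):
        shuffled = data[:]
        random.Random(seed).shuffle(shuffled)   # a fresh split each repeat
        cut = int(len(shuffled) * (1 - val_fraction))
        scores.append(evaluate(shuffled[:cut], shuffled[cut:]))
    return statistics.mean(scores)
```

Averaging over several splits gives a more stable estimate of how the model will generalise than any single train-validation pair can.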
If you’d like to explore this topic further, here is a link to some related information on the web, produced by Tarang Shah on the Towards Data Science website.