Validation Sets
The validation set should be split off carefully, based on the task at hand.
Splitting the training, validation, and test datasets at random is not always a good bet.
Before we get ahead of ourselves, let us properly define what the training, validation, and test sets are.
Definition of Datasets
All of the definitions below refer to the data at hand.
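| Type of dataset | Purpose | Remarks |
|---|---|---|
| Training Set | Fit the model's parameters | The model learns directly from this data |
| Validation Set | Tune hyperparameters and compare candidate models | Evaluated repeatedly during development, so its scores can become optimistic over time |
| Test Set | Estimate final performance on unseen data | Use only after all modelling decisions have been made |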
With definitions in place, the most important thing about the validation and test datasets is that they should be representative of the data we will see in the future.
Examples of how to split datasets
Time series
For time series data, train the model on the oldest contiguous stretch of the series.
Use the next, more recent contiguous stretch for validation, and reserve the most recent contiguous stretch for testing.
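The chronological split described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name `chronological_split`, the 70/15/15 proportions, and the synthetic daily series are all assumptions made for the example.

```python
from datetime import date, timedelta


def chronological_split(records, train_frac=0.7, val_frac=0.15):
    """Split (timestamp, value) records into train / validation / test by time.

    Hypothetical helper: the oldest records go to training, the next block
    to validation, and the newest block to testing. Nothing is shuffled.
    """
    if train_frac <= 0 or val_frac <= 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must be positive and leave room for a test set")

    # Sort by timestamp so each split is one contiguous stretch of time.
    ordered = sorted(records, key=lambda record: record[0])
    n = len(ordered)
    train_end = round(n * train_frac)
    val_end = round(n * (train_frac + val_frac))
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Synthetic example data (an assumption): 100 consecutive daily observations.
start = date(2023, 1, 1)
series = [(start + timedelta(days=i), float(i)) for i in range(100)]

train, val, test = chronological_split(series)

# Every training timestamp precedes every validation timestamp,
# which in turn precedes every test timestamp.
print(train[-1][0], "<", val[0][0], "<=", val[-1][0], "<", test[0][0])
```

Because the records are sorted rather than shuffled, the model is always validated on data that is newer than anything it was trained on, which mirrors how it will be used in practice.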
Videos and photos
With a random split, frames that are nearly identical (for example, two adjacent frames in a video) can end up in both the training set and the validation set. The result is great scores, but poor generalizability.
Why?
Because our algorithm starts to pick up on features of the specific objects in the photos rather than features that correspond to the task at hand, and the near-duplicate validation frames reward it for doing so.
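One way to avoid this leakage is to split by source video rather than by individual frame, so that all frames from a given video land on the same side of the split. The sketch below illustrates the idea; the helper name `split_by_video`, the 20% validation fraction, and the synthetic `(video_id, frame_index)` pairs are assumptions made for the example.

```python
import random
from collections import defaultdict


def split_by_video(frames, val_frac=0.2, seed=0):
    """Split (video_id, frame_index) pairs so no video spans both sets.

    Hypothetical helper: whole videos, not individual frames, are assigned
    to either the training set or the validation set.
    """
    # Group frames by the video they came from.
    by_video = defaultdict(list)
    for video_id, frame_index in frames:
        by_video[video_id].append((video_id, frame_index))

    # Shuffle the video ids (not the frames) with a fixed seed.
    video_ids = sorted(by_video)
    random.Random(seed).shuffle(video_ids)

    # Hold out at least one whole video for validation.
    n_val = max(1, round(len(video_ids) * val_frac))
    val_ids, train_ids = video_ids[:n_val], video_ids[n_val:]

    train = [frame for vid in train_ids for frame in by_video[vid]]
    val = [frame for vid in val_ids for frame in by_video[vid]]
    return train, val


# Synthetic example data (an assumption): 10 videos with 30 frames each.
frames = [(f"video_{v}", i) for v in range(10) for i in range(30)]

train_frames, val_frames = split_by_video(frames)
train_videos = {video_id for video_id, _ in train_frames}
val_videos = {video_id for video_id, _ in val_frames}

# No video contributes frames to both sets.
print(train_videos.isdisjoint(val_videos))
```

If scikit-learn is already in use, its `GroupShuffleSplit` class applies the same grouping idea, with the video id passed as the group label.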
Conclusion
At the end of the day, a good validation set lets you iterate more quickly and know, with greater confidence, how much progress you are making.
Do not skimp on making a validation set!