
Model validation is the process of assessing how well a trained model performs on a test data set. We validate a model to check whether it works for the purpose it is intended for. Model validation is an integral part of the data science life cycle, which typically involves the following steps:
- Data Collection
- Data wrangling and preprocessing
- Data Analysis
- Feature Engineering
- Selecting a Model
- Model training
- Model Validation
- Result Interpretation
After selecting and training the model, we have to assess its performance and usability. If the model is validated and performs well on the validation set, it is considered a good model that has generalized well.
Model Stability
A machine learning model is expected to predict the correct output value across varying input values. If a model keeps making correct predictions as the inputs vary, it is said to be stable and to generalize well.
Two major problems affect the stability of a model:
- Overfitting – Overfitting occurs when a model is too complex and has learned the training data so closely that it has also captured the noise present in it. Such a model performs very well on the training dataset but fails to deliver comparable results on unseen data.
- Underfitting – Underfitting occurs when the model isn’t complex enough and fails to learn the patterns in the training data. Such a model performs poorly not only on the test dataset but also on the training dataset.
Validating the model ensures that it fits and generalizes well, and that it does not fail to make predictions on unseen data.
Importance of Model Validation
Model validation is one of the most crucial stages of the data science life cycle. Selecting the right machine learning method is essential for developing our model, and each method has its own strengths and shortcomings. For instance, certain algorithms perform better on smaller datasets, while others perform better on larger ones. Because two different models can produce different results with varying degrees of accuracy on the same data, model validation is necessary. It is also time-consuming, since scientists and engineers have to make sure the trained model is robust and accurate.
If model validation is not done properly, it may lead to:
- Poor performance on unseen data.
- Inability to adapt and make satisfactory predictions in volatile environments, such as during the pandemic.
- A model that is not robust enough to perform well.
An overfit model wouldn’t have much value in the real world, which is why it is important to validate your model.
There are several techniques available to validate your machine learning models. After going through this post, you will have adept knowledge of machine learning model validation techniques.
Broadly, there are two different sorts of ML model validation methods:
- In-sample validation – testing the model on data drawn from the same dataset that was used to create it.
- Out-of-sample validation – testing the model on data from a fresh dataset that was not used to create it.
ML Model Validation Methods
Let’s take a look at some of the most widely used techniques for validating machine learning models after they have been trained:
- Resubstitution
- Hold-out
- K-fold Cross-validation
- LOOCV
- Random Subsampling
- Bootstrapping
Now let’s go through these techniques in detail:
Resubstitution:
In this model validation method, the entire dataset is used for training the model. Then we calculate the error rate using the actual and predicted values from the same training dataset.
This error is known as the resubstitution error, and this technique is known as the resubstitution validation technique.
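As a minimal sketch, resubstitution amounts to scoring a model on the very data it was trained on. The toy iris dataset and logistic regression classifier below are purely illustrative assumptions, not part of the original method:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
# Toy data purely for illustration
X, y = load_iris(return_X_y=True)
# Train on the entire dataset -- there is no held-out split
model = LogisticRegression(max_iter=1000).fit(X, y)
# Score the model on the same data it was trained on;
# the resubstitution error is 1 minus this training accuracy
print(1 - model.score(X, y))
Because the model is evaluated on data it has already seen, this error estimate is optimistic and will usually understate the true generalization error.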
Hold-out:
The hold-out technique is one of the most commonly used validation methods and is very easy to implement. In this method, we split our data into two sets: a training set and a test set.
The split ratio is typically 60-40, 70-30, or 80-20. To ensure both sets get a similar distribution of each class, we apply a technique called stratification, which prevents the classes from being unevenly distributed across the two sets. This is considered one of the best practices for validating machine learning models.
However, this method is not well suited to imbalanced datasets, and it keeps a sizeable portion of the data out of training by setting it aside for testing.
From the image above, we can see that the entire original data is split. Some of it is used for training, while the remaining is held out to be used for validation or testing later.
Let’s take a look at the code below:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # 80-20 split
Where X contains our features and y contains our target variables.
If we want to apply stratification, simply make the following changes:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
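Putting it together, a minimal hold-out evaluation might look like the sketch below; the logistic regression classifier is purely an illustrative assumption:
from sklearn.linear_model import LogisticRegression
# Fit only on the training split
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Accuracy on the held-out test split, i.e. data the model never saw during training
print(model.score(X_test, y_test))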
K-fold Cross-Validation:
K-fold cross-validation is a technique in which we split our data into K folds. We then use K-1 of these folds for training the model, and the remaining fold for testing and validating it.
In this technique, the entire dataset eventually gets used for both training and testing, and we calculate the error rate at each iteration.
The reason for using cross-validation is that we often work with datasets that are not very large, yet we still want to estimate the generalization error, so we would like every model to be evaluated on data it did not see during training.
This approach is more computationally expensive than a single hold-out split, since the model is trained K times, and plain K-fold isn’t well suited to problems with an imbalanced dataset.
From the image above, we can see that the data is split into K folds, and each time a different fold is used for validation.
To implement it in python, take a look at the code below:
from sklearn.model_selection import KFold
folds = KFold(n_splits=5)
splits = folds.split(X)
Calling folds.split(X) yields the train and test indices for each of the five folds. You can use those indices to fit your model on the training folds and validate it on the held-out fold.
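If you only need the averaged score across the folds, scikit-learn’s cross_val_score helper wraps this loop for you. Here is a minimal sketch, again assuming a logistic regression classifier purely for illustration:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
# Train and evaluate the model once per fold, collecting the scores
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=KFold(n_splits=5))
# Average accuracy across the 5 folds; the error rate is 1 minus this value
print(scores.mean())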
LOOCV:
The LOOCV or Leave-One-Out Cross-Validation is a technique in which we use all of the data for training and leave out 1 sample for testing.
This is an iterative process, and it is repeated N times, once per sample. The technique is computationally expensive, although the resulting error estimate has low bias.
To calculate the overall error rate, we average the error over all N iterations.
In the above image, we can see that each time 1 sample is left out while the remaining is used for training the model. This sample is then used for validating the model.
We can implement this technique by using the Sci-kit learn library easily. Take a look at the code below:
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
loo.get_n_splits(X)
Using this, we get as many splits as there are samples in our dataset: each sample is used as the test set exactly once, so we end up with n splits, where n is the length of our dataset.
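LeaveOneOut can be passed as the cv argument in the same way as KFold. Below is a minimal sketch, again with an illustrative logistic regression, and assuming a fairly small dataset, since one model is trained per sample:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
# One model is trained per sample, so this is expensive on large datasets
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
# Each score is 0 or 1 (one test sample per split); the mean is the LOOCV accuracy
print(scores.mean())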
Random Subsampling:
Random Subsampling is a technique in which we choose subsets from the data at random. These randomly chosen subsets form the test set.
The remaining data is used for training the model. To calculate the error rate, we take the average over all the iterations of this experiment.
In this technique, some samples may never be selected for either the training or the validation set, and it is also not well suited to problems with an imbalanced dataset.
As you can see, in each iteration we randomly pick samples from the data which form our test set later.
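One way to sketch this in scikit-learn is with ShuffleSplit, which draws a fresh random test subset on every iteration; the classifier, number of iterations, and split size below are illustrative assumptions:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score
# 10 iterations; in each one a random 25% of the samples form the test set
subsampler = ShuffleSplit(n_splits=10, test_size=0.25, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=subsampler)
# Average the accuracy over all iterations (error rate = 1 - mean accuracy)
print(scores.mean())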
Bootstrapping:
In Bootstrapping, we select the training data randomly with replacement. The samples that are selected are used for training. The remaining unselected samples are then used for testing.
Similar to the above methods, we have to calculate the error rate by taking the average of all of the iterations.
As you can see from the image, it isn’t necessary that every sample ends up in the training set; some samples may remain unpicked.
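scikit-learn has no dedicated bootstrap validator, but the idea can be sketched with its resample utility, using the unpicked ("out-of-bag") samples as the test set. The snippet assumes X and y are NumPy arrays and uses an illustrative logistic regression:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample
errors = []
for i in range(100):
    # Draw a bootstrap sample: same size as the data, sampled with replacement
    indices = resample(np.arange(len(X)), replace=True, random_state=i)
    # Out-of-bag samples are the ones that were never picked in this iteration
    oob = np.setdiff1d(np.arange(len(X)), indices)
    if len(oob) == 0:
        continue  # extremely unlikely, but avoid scoring on an empty test set
    model = LogisticRegression(max_iter=1000).fit(X[indices], y[indices])
    errors.append(1 - model.score(X[oob], y[oob]))
# Average the error rate over all bootstrap iterations
print(np.mean(errors))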
Conclusion
In this article, you got an overview of model validation and why it is important. We also went through how model validation contributes to the stability of a model, allowing it to generalize to unseen data.
If a model fails to perform well on unseen data, it doesn’t hold much value, which is why model validation is important.
Then we covered different model validation techniques and went through each of them in depth. The hold-out and K-fold cross-validation techniques are two of the most popular and commonly used methods; depending on your data, you may choose either of them.