Data Validation

What is data validation?

Data validation is the process of evaluating the correctness and quality of the source data before a new model version is trained. It ensures that rare abnormalities or anomalies reflected in the incremental data are not overlooked, and it focuses on confirming that the statistics of the new data are as expected.

Depending on the goals and constraints, different kinds of validation can be performed. The following are examples of such goals in a machine learning pipeline.

  • Is the incremental data free of anomalies or data errors? If not, alert the team and request an investigation.
  • Are any data assumptions made during model training broken when serving? If so, alert the team and request an investigation.
  • Is there a significant difference between the data used for training and the data used for serving, or are there discrepancies in the data being added to the training set? If so, issue an alert to investigate differences between the training and serving code stacks (a minimal skew check is sketched after this list).
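
The training/serving skew check in the last point can be approximated with simple summary statistics. The snippet below is a minimal sketch, assuming the data lives in pandas DataFrames; the 10% drift threshold and the column names in the usage example are illustrative assumptions, not fixed recommendations.

    import pandas as pd

    def detect_skew(train_df: pd.DataFrame, serving_df: pd.DataFrame,
                    threshold: float = 0.10) -> list:
        """Return numeric columns whose mean shifted by more than `threshold`
        (relative to the training mean) between training and serving data."""
        alerts = []
        for col in train_df.select_dtypes(include="number").columns:
            if col not in serving_df.columns:
                alerts.append(f"{col}: missing from serving data")
                continue
            train_mean = train_df[col].mean()
            serving_mean = serving_df[col].mean()
            if train_mean == 0:
                continue  # avoid division by zero; handle zero-mean columns separately
            drift = abs(serving_mean - train_mean) / abs(train_mean)
            if drift > threshold:
                alerts.append(f"{col}: mean drifted by {drift:.1%}")
        return alerts

    # Toy usage: the income distribution has shifted between training and serving.
    train = pd.DataFrame({"age": [25, 32, 40, 51], "income": [40e3, 52e3, 61e3, 75e3]})
    serving = pd.DataFrame({"age": [24, 33, 41, 50], "income": [90e3, 95e3, 99e3, 110e3]})
    print(detect_skew(train, serving))  # ['income: mean drifted by 72.8%']

In practice, a distributional test or a domain-specific distance metric would replace the simple mean comparison, but the shape of the check stays the same.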

The output of the data validation stage should give a data engineer enough information to take action. It also needs to be highly accurate, since too many false alarms erode confidence in the component.

How does data validation work?

Consider the data validation component as a checkpoint in an ML application that prevents bad data from entering. It inspects each new batch of data before it is added to the training data. The methodology can be summarized in five phases (a minimal end-to-end sketch follows the list):

  1. Compute statistics on the training data according to a set of criteria.
  2. Compute the same statistics on the newly ingested data that needs to be validated.
  3. Compare the statistics of the validation data against the statistics of the training data.
  4. Based on the validation findings, take automatic actions such as dropping the row or capping and flooring the values.
  5. Send notifications and alerts for approval.
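
The sketch below strings these five phases together for numeric features, assuming pandas DataFrames; the choice of statistics, the four-standard-deviation bounds, and the print-based alerting are illustrative placeholders rather than the API of any particular validation library.

    import pandas as pd

    def compute_stats(df: pd.DataFrame) -> pd.DataFrame:
        """Phases 1-2: per-column summary statistics (mean, std, min, max)."""
        return df.describe().loc[["mean", "std", "min", "max"]]

    def validate(new_df: pd.DataFrame, train_stats: pd.DataFrame, n_std: float = 4.0):
        """Phases 3-5: compare new data against training statistics, cap/floor
        out-of-range values, and collect alert messages for human review."""
        alerts = []
        cleaned = new_df.copy()
        for col in train_stats.columns:
            mean, std = train_stats.loc["mean", col], train_stats.loc["std", col]
            lower, upper = mean - n_std * std, mean + n_std * std
            out_of_range = (cleaned[col] < lower) | (cleaned[col] > upper)
            if out_of_range.any():
                alerts.append(f"{col}: {int(out_of_range.sum())} value(s) capped/floored")
                cleaned[col] = cleaned[col].clip(lower, upper)  # phase 4: automatic action
        return cleaned, alerts  # phase 5: alerts go on to a notification channel

    train_stats = compute_stats(pd.DataFrame({"price": [10.0, 12.0, 11.0, 13.0]}))
    cleaned, alerts = validate(pd.DataFrame({"price": [11.5, 500.0]}), train_stats)
    print(alerts)  # ['price: 1 value(s) capped/floored']

A production component would persist the training statistics alongside the model and route alerts to an on-call channel for approval instead of printing them.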

Importance of data validation

Machine learning models are highly sensitive to the quality of the data they are trained on.

In production, the model is retrained regularly (as often as daily) with a fresh set of incremental data, and the revised model is pushed to the serving layer. While serving, the model generates predictions on fresh data, and that same data is later combined with real labels and used for retraining. This ensures that the newly trained model adapts to changes in data properties.

However, the data arriving at the serving layer can deviate, owing to factors such as code changes that introduce bugs in the serving data-ingestion component or differences between the training and serving stacks. Over time, the incorrectly ingested data becomes part of the training data, and the model’s accuracy deteriorates. Because newly added data makes up only a small portion of the overall training data in each iteration, the resulting drop in model accuracy is easy to overlook, and errors accumulate over time.

Thus, identifying data errors early is critical: the cost of an error only rises as it spreads farther down the pipeline.

While developing a data validation component, a data scientist encounters several obstacles, including:

  • Creating data validation rules for a dataset with a few columns appears straightforward. As the number of columns grows, however, it becomes a massive undertaking.
  • A data scientist must spend a significant amount of time tracking and comparing metrics from previous datasets to detect anomalies against past patterns for each column.
  • In today’s systems, which must run 24 hours a day, seven days a week, data validation has to be automated, and the validation component should refresh its own rules as new data arrives (a sketch of automatic rule inference follows this list).
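
One way to keep rules current without hand-writing them per column is to infer a simple schema from a reference dataset on every retraining run and check each new batch against it. The following is a minimal sketch under that assumption; the schema fields (dtype, value range, null fraction) and the tolerance are illustrative choices, not a standard.

    import pandas as pd

    def infer_schema(df: pd.DataFrame) -> dict:
        """Derive a simple rule set (dtype, value range, null fraction) per column."""
        schema = {}
        for col in df.columns:
            rules = {"dtype": str(df[col].dtype),
                     "max_null_frac": float(df[col].isna().mean())}
            if pd.api.types.is_numeric_dtype(df[col]):
                rules["min"] = float(df[col].min())
                rules["max"] = float(df[col].max())
            schema[col] = rules
        return schema

    def check_batch(df: pd.DataFrame, schema: dict, tol: float = 0.05) -> list:
        """Flag columns in a new batch that violate the inferred rules."""
        issues = []
        for col, rules in schema.items():
            if col not in df.columns:
                issues.append(f"{col}: column missing")
                continue
            if str(df[col].dtype) != rules["dtype"]:
                issues.append(f"{col}: dtype changed to {df[col].dtype}")
            if df[col].isna().mean() > rules["max_null_frac"] + tol:
                issues.append(f"{col}: null fraction increased")
            if "min" in rules and (df[col].min() < rules["min"] or df[col].max() > rules["max"]):
                issues.append(f"{col}: values outside the training range")
        return issues

Regenerating the schema on each retraining run keeps the rules aligned with the data the current model actually saw, which addresses the refresh requirement above.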

Final thoughts

Validating a dataset gives the user confidence in their model’s stability. With machine learning permeating many aspects of society and our daily lives, it is more important than ever that models faithfully reflect the world they are deployed in. Overfitting and underfitting are the two most prevalent problems a data scientist encounters during model construction, and validation is the first step toward optimizing a model’s performance and keeping it stable for as long as possible before it has to be retrained.