Data Validation for Machine Learning

The data validation stage has three main components: the data analyzer computes statistics over the new data batch, the data validator checks properties of the data against a schema, and the model unit tester looks for errors in the training code using synthetic data (schema-led fuzzing). Model unit testing is a little different, because it doesn't validate the incoming data; it checks that the training code can handle the variety of data it may see. To flush out such errors in the training code, the schema is used to generate synthetic inputs in a manner similar to fuzz testing, and the generated data is then used to drive a few iterations of the training code.
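To make the three roles concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the `SCHEMA` layout, the feature names, and the functions `analyze`, `validate`, `synthesize`, `train_step` and `unit_test_training_code` are hypothetical and are not the API of any real system. The toy `train_step` takes the log of a feature, so it quietly assumes the value is positive even though the schema allows zero.

```python
import math
import random

# Hypothetical schema: feature name -> expected type and constraints.
SCHEMA = {
    "age": {"type": "int", "min": 0, "max": 120},
    "country": {"type": "str", "domain": ["US", "DE", "IN"]},
}


def analyze(batch):
    """Data analyzer: per-feature summary statistics for one batch."""
    stats = {}
    for name, spec in SCHEMA.items():
        values = [row[name] for row in batch if name in row]
        stats[name] = {"count": len(values), "missing": len(batch) - len(values)}
        nums = [v for v in values if isinstance(v, int)]
        if spec["type"] == "int" and nums:
            stats[name]["min"] = min(nums)
            stats[name]["max"] = max(nums)
    return stats


def validate(batch):
    """Data validator: list (row index, feature, problem) for schema violations."""
    anomalies = []
    for i, row in enumerate(batch):
        for name, spec in SCHEMA.items():
            if name not in row:
                anomalies.append((i, name, "missing"))
                continue
            value = row[name]
            if spec["type"] == "int":
                if not isinstance(value, int) or isinstance(value, bool):
                    anomalies.append((i, name, "wrong type"))
                elif not spec["min"] <= value <= spec["max"]:
                    anomalies.append((i, name, "out of range"))
            elif value not in spec["domain"]:
                anomalies.append((i, name, "not in domain"))
    return anomalies


def synthesize(n, seed=0):
    """Schema-led fuzzing: boundary rows first, then random schema-valid rows."""
    rng = random.Random(seed)
    rows = []
    for bound in ("min", "max"):
        rows.append({
            name: spec[bound] if spec["type"] == "int" else spec["domain"][0]
            for name, spec in SCHEMA.items()
        })
    for _ in range(n):
        rows.append({
            name: rng.randint(spec["min"], spec["max"])
            if spec["type"] == "int" else rng.choice(spec["domain"])
            for name, spec in SCHEMA.items()
        })
    return rows


def train_step(row):
    """Toy training code with a hidden assumption: age is strictly positive."""
    return math.log(row["age"])


def unit_test_training_code(n=100, seed=0):
    """Drive the training code with synthetic rows; collect the rows that crash it."""
    failures = []
    for row in synthesize(n, seed):
        try:
            train_step(row)
        except ValueError as exc:
            failures.append((row, repr(exc)))
    return failures
```

Every synthetic row passes `validate`, yet `unit_test_training_code()` still reports failures for rows with `age == 0`. That is the kind of mismatch between schema and training code that this step is meant to surface before real data triggers it.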

Source: blog.acolyer.org