Monday, October 4, 2021

Speeding up large-scale data quality validation: the Gilberto project

In previous posts we already discussed the importance of data quality validation as part of the data operations toolbox, along with metadata management solutions such as Mastro. However, we never really discussed how these two aspects integrate. Data catalogues and feature stores are becoming standard design patterns: the former enable data annotation, discovery and lineage tracking, while the latter enforce basic quality constraints on produced features, so that data assets can be reused and stewarded by continuously monitoring them for deviations from expected requirements. While catalogues allow for building a dependency graph of data assets, they do not go into the statistical details of the monitored assets. Similarly, feature stores are meant to abstract repetitive feature extraction processes that may occur in different environments, be it a data science notebook or a scheduled pipeline, often written with different frameworks and languages; they also allow for versioning those computations in order to achieve repeatability. Model registries pursue a similar goal for ML models.

As it appears, catalogues and feature stores target lineage and computational aspects rather than data-centric ones; that is, they do not address data quality matters such as completeness, accuracy, timeliness, integrity, validity (e.g. with respect to a type schema) and distribution. These are the target of so-called metrics stores, which persist the data quality metrics calculated by specific runs of quality validation pipelines.

Example data quality frameworks are:

  • Great Expectations
  • TensorFlow Data Validation (TFX DV)
  • Deequ

All of these require, to some extent, some customization for the data source of interest. Great Expectations is a Python-centric tool that provides a meta-language to define constraints which can be run by multiple backends. Similarly, TFX DV originated from the TensorFlow project and can extract meaningful statistics from data represented in common formats, such as TFRecord. Deequ is a library written in Scala on top of Spark and is, in my opinion, the most general-purpose tool of the three, especially when targeting data sitting in data warehouses, on either HDFS or S3, as is common nowadays and where Spark really excels. Deequ benefits from Spark's great integration with modern data processing technologies to offer mainly the following (a minimal usage sketch follows the list):

  • profiling - extraction of statistics on input Data(frames);
  • constraint suggestion - extraction of meaningful constraints based on profiled Data(frames);
  • validation - enforcement of Checks to detect deviations;
  • anomaly detection - detection of deviations over time from common characteristics.
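To make this concrete, here is a minimal sketch of the first three capabilities on a Spark DataFrame; the dataset path and column names are made up:

```scala
import org.apache.spark.sql.SparkSession
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import com.amazon.deequ.profiles.ColumnProfilerRunner
import com.amazon.deequ.suggestions.{ConstraintSuggestionRunner, Rules}

object DeequTour extends App {
  val spark = SparkSession.builder.appName("deequ-tour").master("local[*]").getOrCreate()
  // any DataFrame works; the path and column names here are made up
  val df = spark.read.parquet("s3a://some-bucket/orders/")

  // 1. profiling: per-column statistics (completeness, approx distinct count, ...)
  val profiles = ColumnProfilerRunner().onData(df).run()
  profiles.profiles.foreach { case (column, profile) => println(s"$column: $profile") }

  // 2. constraint suggestion: derive candidate checks from the profiles
  val suggestions = ConstraintSuggestionRunner().onData(df).addConstraintRules(Rules.DEFAULT).run()
  suggestions.constraintSuggestions.values.flatten.foreach(s => println(s.codeForConstraint))

  // 3. validation: enforce explicit checks and inspect the overall status
  val result = VerificationSuite()
    .onData(df)
    .addCheck(Check(CheckLevel.Error, "basic checks")
      .isComplete("order_id") // no NULL values
      .isUnique("order_id")   // behaves as a key
      .hasSize(_ > 0))        // dataset is not empty
    .run()
  if (result.status != CheckStatus.Success) println("some checks failed")
  spark.stop()
}
```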
Deequ is an amazing tool, but it still requires some customization to load data sources and define checks. Moreover, while computed metrics can be saved to so-called metrics repositories, only an InMemoryMetricsRepository and a FileSystemMetricsRepository are provided. The former is basically a ConcurrentHashMap, while the latter is a connector that writes a JSON file of kind metrics.json to HDFS or S3. Clearly, this has various drawbacks: above all, writing all metrics to a single file blob does not scale and does not allow for querying from Presto and similar engines.

To overcome these issues, we:
  • introduce the Gilberto project, meant to curtail the boilerplate coding required with Deequ; the developer can define checks in Scala code files, which can be deployed on a distributed FS along with the artifact or mounted on a local volume, for instance on a K8s cluster;
    An example Check file:
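    (The exact format expected by Gilberto is documented in its repository; the following is a hedged sketch, assuming a file that evaluates to a sequence of Deequ Checks, with made-up column names.)

```scala
// checks.scala - evaluated at runtime by Gilberto (a sketch; see the
// Gilberto repo for the exact contract expected by the check loader)
import com.amazon.deequ.checks.{Check, CheckLevel}

Seq(
  Check(CheckLevel.Error, "integrity")
    .isComplete("order_id")   // hypothetical column names
    .isUnique("order_id")
    .isNonNegative("amount"),
  Check(CheckLevel.Warning, "validity")
    .isContainedIn("currency", Array("EUR", "USD", "GBP"))
)
```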


    Gilberto is able to use reflection to dynamically load and enforce those checks, and returns standard exit codes. This makes the tool easy to integrate in workflow management systems, such as Argo Workflows. Gilberto is meant to be run on both YARN and K8s alike; check the sections YARN_DEPLOY and K8S_DEPLOY for an example. For K8s, the master branch contains a full-fledged deployment script. You can use that in combination with a version available in one of the other branches, such as Spark 3.1.2 or Spark 2.4.7, which you can either build locally or pull from Dockerhub.
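    This is not Gilberto's actual code, but dynamically evaluating a Scala source file at runtime can be sketched with the compiler's ToolBox:

```scala
import scala.io.Source
import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox // requires scala-compiler on the classpath
import com.amazon.deequ.checks.Check

// sketch of reflection-based loading (assumed mechanism, not Gilberto's code):
// parse and evaluate a check file into a sequence of Deequ Checks
def loadChecks(path: String): Seq[Check] = {
  val source = Source.fromFile(path).mkString
  val toolbox = currentMirror.mkToolBox()
  toolbox.eval(toolbox.parse(source)).asInstanceOf[Seq[Check]]
}
```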


  • introduce metrics stores in the Mastro project, meant to store various kinds of metric sets, including those generated by Deequ/Gilberto and sent via a PUT over a REST interface (a minimal client sketch closes this list);
    type definition for MetricSet
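    (Mastro is written in Go; the case class below is only a hedged, language-agnostic sketch of the JSON payload shape, with indicative field names.)

```scala
// sketch of a Mastro MetricSet payload (indicative field names, not
// Mastro's exact Go schema - check the repo for the actual type definition)
case class MetricSet(
  name: String,                // unique name for the metric set
  version: String,             // version of the producing pipeline
  description: String,
  labels: Map[String, String], // free-form tags used for later retrieval
  metrics: Seq[DeequMetric]    // the metrics themselves, in Deequ's format
)
```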

    As such, Metric Sets can be easily integrated with Mastro's Data Assets (stored in the catalogue) and Feature Sets (stored in the feature store), which closes the gap we discussed at the beginning of this post. Also, since the format is the same used by Deequ's existing metrics repositories, this enables anomaly detection use cases, as metrics can be retrieved by tags and time, also using a REST interface.
    type definition for DeequMetric
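    (Again a hedged sketch: the fields below follow Deequ's DoubleMetric; the surrounding result key carrying a timestamp and tags is omitted for brevity.)

```scala
// sketch of a single Deequ metric as found in its serialized repositories;
// fields mirror Deequ's DoubleMetric
case class DeequMetric(
  entity: String,   // "Column", "Dataset" or "Multicolumn"
  instance: String, // e.g. the column name, or "*" for dataset-level metrics
  name: String,     // analyzer name, e.g. "Completeness", "Size", "Uniqueness"
  value: Double     // the computed value
)
```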
    Getting started with mastro is super easy. Besides a docker-compose file, there is also a Helm Chart to help you get started on K8s. The only prerequisite is a DB; we use bitnami/mongodb for most of our tests.
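    Once the services are up, pushing a metric set is a single HTTP PUT. Here is a minimal client sketch, assuming a hypothetical endpoint path (check Mastro's docs for the actual route and schema):

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object PushMetricSet extends App {
  // hypothetical endpoint and payload - check Mastro's documentation
  // for the actual route and schema of the metrics store REST interface
  val payload = """{"name": "orders-quality", "version": "1.0.0",
                  | "labels": {"env": "dev"}, "metrics": []}""".stripMargin
  val request = HttpRequest.newBuilder()
    .uri(URI.create("http://localhost:8085/metricstore/")) // hypothetical
    .header("Content-Type", "application/json")
    .PUT(HttpRequest.BodyPublishers.ofString(payload))
    .build()
  val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
  println(s"${response.statusCode()} ${response.body()}")
}
```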

Should you be interested, please have a look at the projects. I am looking forward to hearing your feedback!

Andrea

