Thursday, April 15, 2021

The Data Operations toolbox

In spite of an existing manifesto and various active social media groups, Data Operations, like many approaches inspired by DevOps, does not have a perfectly clear definition. In fact, DataOps is not tied to any specific architecture, framework or tool whatsoever. The core idea, however, is to consider data as a living thing, or better, as a supply chain, with incoming data to be tracked throughout the entire manufacturing line, up to the very end, where they produce models, other data, dashboards, reports or any other sort of asset. So let us agree at least on the core idea: data (and the models that depend on them) must be monitored as we do with software artifacts, since it is on data that modern businesses establish their value.

In this post I would like to outline what is, in my opinion, the Data Operations toolbox. Throughout the post, the word data will be used to mean data assets, such as datasets, data sources, etc.

First of all let's recall some related work in this direction.

In [1], Uber lists the classic problems that emerge when working with data:
  • data duplication - with distributed teams working on use cases, there are often multiple solutions and results (i.e., data) for the same entity or problem;
  • discovery issues - without a shared definition of data attributes, different meanings may be given to the same dataset or source, for instance in terms of freshness or of the specific fields represented;
  • disconnected tools - without tracking the downstream usage of data, there is no understanding of the impact of changes, nor of the possible re-use of components for different use cases;
  • logging inconsistencies - non-standard logging across applications;
  • lack of process - lack of guidelines and common practices across data engineering teams;
  • lack of ownership and SLAs - lack of accountability, quality guarantees and SLAs.
They then sketch possible solutions to these problems: 
  1. data shall be treated as code - with similar review processes when modifications to their schema are required, as well as continuous integration testing of dependent artifacts (see the sketch after this list);
  2. data is owned - there is always somebody in charge of and accountable for their maintenance, until the data get deprecated and archived;
  3. data quality is known - SLAs on data must be set and enforced; to meet those SLAs, data quality is continuously monitored based on agreed metrics;
  4. data tools and processes - establishment of processes and tools allowing for the integration of data.
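
To make the first point concrete, here is a minimal sketch of what treating data as code can look like: a schema kept under review in version control and a test that continuous integration runs whenever the schema or a producing pipeline changes. The file path, column names and the check_schema helper are hypothetical, just to illustrate the idea.

    # test_trips_schema.py - hypothetical CI test guarding a versioned dataset schema
    import pandas as pd

    EXPECTED_SCHEMA = {  # would normally live in a reviewed, versioned schema file
        "trip_id": "int64",
        "city": "object",
        "started_at": "datetime64[ns]",
        "fare": "float64",
    }

    def check_schema(df: pd.DataFrame, expected: dict) -> list:
        """Return a list of human-readable schema violations."""
        errors = []
        for column, dtype in expected.items():
            if column not in df.columns:
                errors.append(f"missing column: {column}")
            elif str(df[column].dtype) != dtype:
                errors.append(f"{column}: expected {dtype}, got {df[column].dtype}")
        return errors

    def test_trips_schema():
        # in CI this would read a small, representative sample of the dataset
        df = pd.read_parquet("samples/trips_sample.parquet")
        assert check_schema(df, EXPECTED_SCHEMA) == []
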
For data quality, the post specifically mentions the following dimensions (a minimal sketch of how some of them can be computed follows the list):
  • freshness - the delay between data production and availability;
  • completeness - the actual data availability, i.e., the number of available rows out of the produced ones; depending on the systems used, this may also relate to noise and incomplete data;
  • duplication - rows with a duplicated primary or unique key;
  • cross-data-center consistency - the percentage of data lost when replicating across data centers (e.g., for disaster recovery);
  • semantic checks - addressing the semantics of specific fields, such as uniqueness, the number of null and distinct values, etc.
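
As announced above, here is a minimal sketch, assuming a pandas DataFrame with hypothetical event_time, trip_id, fare and city columns, of how some of these metrics could be computed as plain numbers to be compared against the agreed SLAs:

    # Hypothetical quality probe over a daily batch; column names are assumptions.
    import pandas as pd

    def quality_metrics(df: pd.DataFrame, expected_rows: int) -> dict:
        now = pd.Timestamp.now(tz="UTC")  # event_time is assumed to be tz-aware UTC
        return {
            # freshness: delay between the newest produced record and now
            "freshness_minutes": (now - df["event_time"].max()).total_seconds() / 60,
            # completeness: available rows out of the rows expected from upstream
            "completeness": len(df) / expected_rows if expected_rows else 0.0,
            # duplication: rows sharing the same primary key
            "duplicate_keys": int(df.duplicated(subset=["trip_id"]).sum()),
            # semantic checks: nulls and distinct values on specific fields
            "fare_null_ratio": float(df["fare"].isna().mean()),
            "city_distinct": int(df["city"].nunique()),
        }

    # metrics = quality_metrics(daily_batch, expected_rows=1_000_000)
    # each value can then be pushed to monitoring and checked against its SLA
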
Monitoring these quality dimensions implies annotating data - in a sort of catalogue, according to a shared data model - with information such as: i) static info (ownership, lineage such as related pipelines and code, tier), ii) usage (audit information, especially about who modifies the data), iii) quality (available tests and the provided metrics and SLAs), iv) cost (the resources needed to (re)compute the data), v) references to open issues and bugs.
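
As an illustration only, one possible shape for such an annotation record; the field names below are my own guesses and do not refer to any specific catalogue product.

    # Sketch of a catalogue entry following the annotations above; all names are illustrative.
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class DataAssetEntry:
        name: str
        owner: str                                                # i) static info: ownership
        tier: str                                                 #    criticality tier
        lineage: List[str] = field(default_factory=list)          #    related pipelines and code
        last_modified_by: Optional[str] = None                    # ii) usage: audit trail of writers
        sla_freshness_minutes: Optional[int] = None               # iii) quality: SLA ...
        quality_checks: List[str] = field(default_factory=list)   #      ... and available tests
        monthly_cost_eur: Optional[float] = None                  # iv) cost to (re)compute the data
        open_issues: List[str] = field(default_factory=list)      # v) links to bugs and tickets

    trips = DataAssetEntry(
        name="trips_daily",
        owner="mobility-data-team",
        tier="tier-1",
        lineage=["ingest_trips", "clean_trips"],
        sla_freshness_minutes=60,
        quality_checks=["schema", "freshness", "duplicates"],
    )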

The article at [2] also tries to clarify the importance of metadata management systems, a.k.a. catalogues, in improving data governance and in constituting the basis for a data marketplace, where developers of use cases can directly discover new data assets by themselves, along with a clear description of their dependability and of what can be done with each asset.

It is also important to mention that, while data catalogues provide a metadata layer for all kinds of data (e.g., users, databases, streams), data are produced at various abstraction levels. For instance, a growing trend is that of model registries [3] (such as mlflow [4]) and that of feature stores [5], which in turn address the problem of recording, versioning and querying experiments, configurations, code and results.
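
As a small example of the model registry side, here is a minimal MLflow [4] sketch that records the parameters, metrics and trained model of a run; the experiment name, dataset and model are placeholders of my own choosing.

    # Minimal experiment-tracking sketch with mlflow; data and model are stand-ins.
    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    mlflow.set_experiment("churn-baseline")  # illustrative experiment name
    with mlflow.start_run():
        C = 0.5
        model = LogisticRegression(C=C, max_iter=500).fit(X_train, y_train)
        mlflow.log_param("C", C)                                     # configuration
        mlflow.log_metric("accuracy", model.score(X_test, y_test))   # result
        mlflow.sklearn.log_model(model, "model")                     # versioned model artifact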

As data is produced at different levels, its quality also follows its peculiarities. While common data quality metrics are important, they are often strictly related to the application domain. Projects such as [6] attempt to provide common statistical metrics to analyze data at both the syntactic level (e.g., completeness, distinct values) and the semantic level (e.g., frequency, max/min/avg/std, distribution), thus covering most of the data quality metrics discussed by Uber in [1] and presented above. Their tool also provides means to extract validation rules from a reference dataset, which can then be applied to monitor the data as they evolve over time, for instance upon the arrival of new entries. As noted by [7], these can be extended to also allow for novelty detection - important because novelty may in turn lead to model drift when the data are used to train models - using one-class classification methods such as isolation forests [8].
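
A compressed sketch of those two ideas, learning simple validation bounds from a reference dataset and flagging novel rows with an isolation forest [8]; the tool in [6] offers a much richer rule-suggestion API, this only shows the principle, with made-up column and file names.

    # Learn simple validation rules from a reference batch, then screen new batches for
    # out-of-range values and for novel points (possible drift). Names are assumptions.
    import pandas as pd
    from sklearn.ensemble import IsolationForest

    def learn_rules(reference: pd.DataFrame, columns: list) -> dict:
        # one (min, max) bound per column, learned from the reference data
        return {c: (reference[c].min(), reference[c].max()) for c in columns}

    def validate(batch: pd.DataFrame, rules: dict) -> dict:
        # share of rows per column falling outside the range seen in the reference data
        return {c: float(((batch[c] < lo) | (batch[c] > hi)).mean())
                for c, (lo, hi) in rules.items()}

    numeric_cols = ["fare", "distance_km"]  # hypothetical numeric fields
    reference = pd.read_parquet("reference_batch.parquet")
    new_batch = pd.read_parquet("incoming_batch.parquet")

    violations = validate(new_batch, learn_rules(reference, numeric_cols))

    # one-class novelty detection: fit on the reference, score the new entries
    forest = IsolationForest(random_state=0).fit(reference[numeric_cols])
    novelty_ratio = float((forest.predict(new_batch[numeric_cols]) == -1).mean())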

Moving towards the conclusion of this post, these are the activities that constitute the toolbox of a DataOps Engineer:
  • data discovery - definition of means for the description and crawling of data assets (sources, processed data, features, models, etc.);
  • data quality - definition of means for the enforcement of pre/post computation constraints as well as monitoring of data;
  • data protection - definition of mechanisms to mask and/or encrypt data; typical examples are explorative analyses done on sensitive data such as critical infrastructure, dataset excerpts being "scrambled" (e.g., by randomizing within the distribution) for use in integration tests (a small scrambling sketch follows this list), and the encryption of specific fields for long-term historization;
  • data versioning - definition of mechanisms to manage the lifecycle of data (data, features, models) as well as their lineage - for instance to allow for the estimation of the impact on downstream dependencies;
  • process control - definition and enforcement of guidelines for data preparation, feature extraction and management, model training and serving.
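
To make the data protection point a bit more tangible, here is the scrambling sketch mentioned in the list: a sensitive numeric column is resampled within its own observed distribution, so that integration tests keep realistic value ranges without exposing real records. The column names are made up.

    # Scramble sensitive columns for integration tests: values are resampled from the
    # empirical distribution, so aggregates stay plausible but rows no longer match anyone.
    import numpy as np
    import pandas as pd

    def scramble_numeric(df: pd.DataFrame, columns: list, seed: int = 42) -> pd.DataFrame:
        rng = np.random.default_rng(seed)
        out = df.copy()
        for c in columns:
            # sample with replacement from the observed values of the column
            out[c] = rng.choice(df[c].to_numpy(), size=len(df), replace=True)
        return out

    # test_df = scramble_numeric(production_sample, ["salary", "account_balance"])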

Going back to the manifesto we started this post with, DataKitchen (the company behind the manifesto) has tried to assess DataOps maturity along six dimensions [9]: i) error rates, ii) cycle time, iii) measurement, iv) collaboration, v) team culture, vi) customer happiness. Although I am not aware of a full assessment framework built on them yet, those dimensions, if further extended, could lead to the calculation of a health radar like the one presented for DevOps in [10].

Although useful, I think something like the Joel Test [11] for dev teams would give decision makers a more technical roadmap.

Considering what we discussed above, such a test may look like the following:
  1. Do you enforce access control and protect data by masking and encryption? This is important to make sure different roles have different levels of access to the data, either full access or access to a limited, scrambled or encrypted version that still allows people to work while making sure no leakage can occur;
  2. Do you keep an audit log of data accesses? This is important to guarantee that any access is tracked and can be traced back in case of a potential breach;
  3. Do you employ data lineage to track data usage? This is important to track the data flow across the organization and to allow for its overall monitoring;
  4. Do you use data versioning tools? This is important to provide a basic level of dependability;
  5. Do you run periodic data quality validations? Are we building use cases on a moving target, or do we know what the data looks like?
  6. Do you have a specification to annotate data assets? Modeling data assets is a first step;
  7. Do you collect metadata in a data catalogue? Collecting data assets (either via push or pull approaches) is a second step towards having metadata in one place;
  8. Do developers rely on a centralized catalogue to autonomously develop use cases? This defines the level of autonomy of the team, towards the vision of a self-service data lab;
  9. Do you have guidelines for logging and established stacks for the collection of logs and metrics? This says a lot about whether the team relies on established stacks for the collection of logs and the definition and collection of metrics, or rather uses ad-hoc tools for each project;
  10. Do you use libraries to calculate features, or feature stores to manage their life cycle? This tells a lot about whether the team is replicating the same logic across multiple languages and applications, or rather centralizing the implementation and making sure every stakeholder relies on it;
  11. Do you have means to track your experiments and models (via registries)? This tells a lot about how reproducible experiments are and how models are managed;
  12. Do you have standardized mechanisms for running ML pipelines for data preparation, feature extraction and model training? It is important to understand whether the team has means to standardize the industrialization of pipelines, i.e., their monitoring and scheduling in production. What happens if a training step fails? Is the whole process reproducible? This is more of an end-to-end perspective on the previously presented points.
There you go, 12 steps towards improved data governance.
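
In the spirit of the Joel Test, the twelve answers can be reduced to a single number that a team can track over time; a trivial self-assessment sketch, with the answers obviously being placeholders:

    # Trivial self-assessment: one boolean per question above, score out of 12.
    ANSWERS = {
        "access_control_and_masking": True,
        "audit_log": True,
        "data_lineage": False,
        "data_versioning": False,
        "periodic_quality_validation": True,
        "asset_specification": False,
        "metadata_catalogue": False,
        "self_service_catalogue": False,
        "logging_guidelines_and_stack": True,
        "feature_store_or_shared_libraries": False,
        "experiment_and_model_tracking": True,
        "standardized_ml_pipelines": False,
    }

    print(f"DataOps score: {sum(ANSWERS.values())}/12")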

References:
