In spite of an existing manifesto and various active social media groups, Data Operations (DataOps), like many approaches inspired by DevOps, does not have a perfectly clear definition. Indeed, DataOps is not tied to any specific architecture, framework, or tool. The core idea, however, is to consider data as a living thing, or better, as a supply chain: incoming data is tracked throughout the entire manufacturing line, up to the very end, where it produces models, other data, dashboards, reports, or any other kind of asset. So let us agree at least on the core idea: data (and the models depending on it) must be monitored just as we monitor software artifacts, since it is on data that we build value and modern businesses.
In this post I would like to outline what is, in my opinion, the data operations toolbox. Throughout the post, "data" refers to data assets: datasets, data sources, features, models, and so on.
First of all, let's recall some related work in this direction. Uber's journey toward a better data culture (see the references), for instance, reports a recurring set of problems:
- data duplication - with distributed teams working on use cases, there are often multiple solutions and results (i.e., data) for the same entity or problem;
- discovery issues - without a shared definition of data attributes, different meanings may be given to the same dataset or source, for instance in terms of freshness or the specific fields represented;
- disconnected tools - without tracking the downstream usage of data, there is no understanding of the impact of changes, nor of the possible re-use of components across use cases;
- logging inconsistencies - non-standard logging across applications;
- lack of process - meaning a lack of guidelines and common practices across data engineering teams;
- lack of ownership and SLAs - which means a lack of accountability, quality, guarantees, and SLAs.
From these problems, a set of principles follows:
- data shall be treated as code - with similar review processes when modifications to its schema are required, as well as continuous integration testing of depending artifacts (a minimal CI-style schema check is sketched after this list);
- data is owned - there is always somebody in charge of and accountable for its maintenance, until it gets deprecated and archived;
- data quality is known - SLAs on data must be set and enforced; to meet those SLAs, data quality is continuously monitored based on agreed metrics;
- data tools and processes - processes and tools are established that allow for the integration of data.
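To illustrate the "data as code" principle, here is a minimal sketch, assuming a pandas DataFrame and a hand-written schema contract; the column names, dtypes, and helper function are hypothetical, and in practice such a check would run in a CI job whenever the schema definition or the producing code changes.

```python
# Minimal sketch: validate a dataset against a reviewed, versioned schema contract.
# The schema entries and column names are illustrative assumptions.
import pandas as pd

EXPECTED_SCHEMA = {
    "user_id": "int64",
    "event_time": "datetime64[ns]",
    "amount": "float64",
}

def validate_schema(df: pd.DataFrame, expected: dict) -> list:
    """Return a list of human-readable schema violations (empty if none)."""
    errors = []
    for column, dtype in expected.items():
        if column not in df.columns:
            errors.append(f"missing column: {column}")
        elif str(df[column].dtype) != dtype:
            errors.append(f"{column}: expected {dtype}, found {df[column].dtype}")
    return errors

if __name__ == "__main__":
    df = pd.DataFrame({
        "user_id": [1, 2],
        "event_time": pd.to_datetime(["2021-01-01", "2021-01-02"]),
        "amount": [9.99, 15.00],
    })
    violations = validate_schema(df, EXPECTED_SCHEMA)
    assert not violations, violations  # a CI job would fail the build here
```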
Data quality, in particular, must be continuously monitored along agreed metrics, such as (a minimal monitoring sketch follows this list):
- freshness - the delay between data production and availability;
- completeness - the actual data availability, i.e., the number of available rows out of the produced ones; depending on the systems in use, this may also relate to noisy and incomplete records;
- duplication - rows with a duplicated primary or unique key;
- cross-data-center consistency - the percentage of data lost when replicating across data centers (e.g., for disaster recovery);
- semantic checks - addressing the semantics of specific fields, such as uniqueness or the number of null and distinct values.
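To make these dimensions concrete, here is a minimal sketch computing a few of them with pandas; the column names and the expected row count are illustrative assumptions, and a production setup would more likely rely on a dedicated library such as Deequ (see references).

```python
# Minimal sketch: compute a few of the quality metrics above on a pandas DataFrame.
# Column names ("user_id", "event_time") and the expected row count are assumptions.
import pandas as pd

def quality_report(df: pd.DataFrame, key: str, ts_col: str, expected_rows: int) -> dict:
    now = pd.Timestamp.now()
    return {
        # freshness: delay between the latest produced record and now
        "freshness_minutes": (now - df[ts_col].max()).total_seconds() / 60,
        # completeness: available rows out of the rows expected from the producer
        "completeness": len(df) / expected_rows,
        # duplication: rows sharing a supposedly unique key
        "duplicate_keys": int(df.duplicated(subset=[key]).sum()),
        # semantic checks: nulls and distinct values of the key field
        "null_keys": int(df[key].isna().sum()),
        "distinct_keys": int(df[key].nunique()),
    }

if __name__ == "__main__":
    df = pd.DataFrame({
        "user_id": [1, 2, 2, None],
        "event_time": pd.to_datetime(["2021-01-01 10:00"] * 4),
    })
    report = quality_report(df, key="user_id", ts_col="event_time", expected_rows=5)
    print(report)  # an alerting job would compare these values against the agreed SLAs
```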
With this in mind, here is what I consider the DataOps toolbox:
- data discovery - definition of means for the description and crawling of data assets (sources, processed data, features, models, etc.);
- data quality - definition of means for the enforcement of pre/post-computation constraints, as well as for the monitoring of data;
- data protection - definition of mechanisms to mask and/or encrypt data; typical examples are exploratory analyses on sensitive data such as critical infrastructure, dataset excerpts being scrambled (e.g., by randomizing within the distribution) for use in integration tests, and the encryption of specific fields for long-term historization (see the sketch after this list);
- data versioning - definition of mechanisms to manage the lifecycle of data (datasets, features, models) as well as their lineage, for instance to allow estimating the impact of changes on downstream dependencies;
- process control - definition and enforcement of guidelines for data preparation, feature extraction and management, and model training and serving.
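As an illustration of the scrambling mentioned under data protection, here is a minimal sketch assuming a pandas DataFrame with a hypothetical identifier and a hypothetical sensitive measurement: the identifier is masked with a salted hash, and the measurement is replaced by draws from a normal distribution fitted to the original values, so that integration tests still run on realistically shaped data without exposing the real records.

```python
# Minimal sketch: scramble a dataset excerpt for use in integration tests.
# Column names and the salt value are illustrative assumptions.
import hashlib
import numpy as np
import pandas as pd

SALT = "some-secret-salt"  # in practice, fetched from a secret store

def mask_id(value: str) -> str:
    """Deterministically mask an identifier with a salted hash."""
    return hashlib.sha256((SALT + str(value)).encode("utf-8")).hexdigest()[:16]

def scramble(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # mask the identifier so records can no longer be linked to real entities
    out["customer_id"] = out["customer_id"].map(mask_id)
    # replace the sensitive measure with draws from its fitted distribution
    mu, sigma = df["power_load_kw"].mean(), df["power_load_kw"].std()
    out["power_load_kw"] = np.random.normal(mu, sigma, size=len(df)).round(2)
    return out

if __name__ == "__main__":
    df = pd.DataFrame({
        "customer_id": ["c-001", "c-002", "c-003"],
        "power_load_kw": [12.3, 48.7, 33.1],
    })
    print(scramble(df))  # safe to hand over to an integration-test environment
```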
Finally, in the spirit of the Joel Test and of the DataOps maturity dimensions cited in the references, here are a few questions to assess how mature a data team is:
- Do you enforce access control and protect data by masking and encryption? This is important to make sure different roles have different levels of access to the data, either in full or to a limited, scrambled, or encrypted version that still allows people to work while ensuring no leakage can occur;
- Do you keep an audit log of data accesses? This is important to guarantee that every access is tracked and can be investigated in case of a potential breach;
- Do you employ data lineage to track data usage? This is important to follow the flow of data across the organization and to allow for its overall monitoring;
- Do you use data versioning tools? This is a basic building block of dependability;
- Do you run periodic data quality validations? Are we building use cases on a moving target, or do we know what the data looks like?
- Do you have a specification to annotate data assets? Modeling data assets is a first step;
- Do you collect metadata in a data catalogue? Collecting data assets (either via push or pull approaches) is a second step towards having all metadata in one place;
- Do developers rely on a centralized catalogue to autonomously develop use cases? This defines the level of autonomy of the team, towards the vision of a self-service data lab;
- Do you have guidelines for logging, and established stacks for the collection of logs and metrics? This says a lot about whether the team relies on established stacks for collecting logs and for defining and collecting metrics, or rather picks random tools for each project;
- Do you use libraries to calculate features, or feature stores to manage their life cycle? This tells a lot about whether the team is replicating the same logic across multiple languages and applications, or rather centralizing the implementation and making sure every stakeholder relies on it;
- Do you use means to track your experiments and models (via registries)? This tells a lot about how reproducible experiments are and how models are managed (see the sketch after this list);
- Do you have standardized mechanisms for running ML pipelines for data preparation, feature extraction, and model training? It is important to understand whether the team has means to standardize the industrialization of pipelines, i.e., their monitoring and scheduling in production. What happens if a training step fails? Is the whole process reproducible? This is more of an end-to-end perspective on the previously presented points.
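As an example of the experiment-tracking and model-registry questions, here is a minimal sketch using MLflow (see references); the experiment name, parameters, and registered model name are illustrative assumptions rather than a prescription, and registering a model requires an MLflow tracking server with a database-backed store.

```python
# Minimal sketch: track a training run and register the resulting model with MLflow.
# Experiment name, parameters, and registered model name are illustrative assumptions.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

mlflow.set_experiment("churn-prediction")  # hypothetical experiment name

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

with mlflow.start_run():
    params = {"n_estimators": 100, "max_depth": 5}
    model = RandomForestClassifier(**params, random_state=42).fit(X, y)

    mlflow.log_params(params)                      # parameters, for reproducibility
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(                      # versioned entry in the model registry
        model,
        artifact_path="model",
        registered_model_name="churn-classifier",  # hypothetical registry name
    )
```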
References:
- https://eng.uber.com/ubers-journey-toward-better-data-culture-from-first-principles/
- https://gradientflow.com/the-growing-importance-of-metadata-management-systems/
- https://mlinproduction.com/model-registries-for-ml-deployment-deployment-series-06/
- https://mlflow.org/
- https://www.featurestore.org/
- https://www.amazon.science/publications/deequ-data-quality-validation-for-machine-learning-pipelines
- https://edbt2021proceedings.github.io/docs/p79.pdf
- https://towardsdatascience.com/outlier-detection-with-isolation-forest-3d190448d45e
- https://datakitchen.io/infographic-6-dimensions-of-dataops-maturity/
- https://agilityhealthradar.com/
- https://www.joelonsoftware.com/2000/08/09/the-joel-test-12-steps-to-better-code/