Tuesday, December 11, 2018

DataOps: A DevOps approach to Machine Learning Development

1. Motivation

In previous blog posts we discussed the benefits of the Site Reliability Engineering and DevOps approaches to software development and to the management of complex distributed systems. Unfortunately, these practices are nowadays not as widespread across data science and machine learning engineering teams for the development and deployment of algorithms. This creates a technical gap between the data scientists initially developing the model or algorithm and the subsequent industrialization phase that should bring the artifact to production. The main reason is the different skill set among team members, which implies a lack of knowledge transfer concerning the model in one direction and the infrastructure details in the other.

This video provides a very nice introduction to this issue:



Based on the video and personal experience, it seems clear that most analytics projects follow the typical lambda architecture (Fig. 1) while data scientists tend to follow a different workflow (Fig. 2).
Fig. 1: Lambda architecture

Fig. 2: Typical DS workflow

This makes the integration of DS activities into a CI/CD pipeline somewhat incomplete, with unit and integration tests only capturing either data preparation code or production-ready components. Consequently, new means are necessary to: i) automate the data quality assessment of ingested data (i.e., without waiting for a data scientist to perform an explorative phase once again), and ii) automate the model training process in order to achieve reproducible and comparable models. Pachyderm offers a cloud-native (read: K8s-based) means to version incoming data and manage their provenance, by also versioning the processing pipelines. Please see an introductory video below:



Consequently, we have an engineering problem: versioning both the incoming data (e.g. with Pachyderm) and the ongoing data science development workflow, which encompasses multiple aspects of ML model development. The main goal is to make sure that all stakeholders (e.g. product owners, managers) get a periodic and reproducible development status beside the classic agile methods that might already be in use (e.g. scrum retrospectives). As mentioned at the beginning of the post, data scientists are often left aside from the rest of the development team, with their tasks considered closer to a research practice than to engineering, i.e. something that might eventually work but not necessarily be usable. This lack of control can lead to information gaps and wasted resources, as well as to unexpected behaviours upon changes to data and models.

Speaking of DevOps in the AI/ML domain, I would then advocate the following requirements:
  1. Model reproducibility - an experimental setting is necessary so that all data scientists share the same runtime and can seamlessly access the data without further technical assistance. Algorithms and models should be implemented in a shared environment where not only the code is versioned, but also the trials on models together with their achieved performance, in order to track changes during the development process and to reproduce specific instances of those models; since both data and model structure are shared across all team members, they are inherently documented.
  2. Model integration - the interface used to instantiate and run the model should be standardized and documented in order to allow for continuous testing and integration with the rest of the architecture, so that flaws in the development process are spotted early.
  3. Model deployment and continuous monitoring - the deployment and delivery process should reach a greater level of automation, to allow new models to be rolled out at a higher frequency, collecting performance on actual usage as well as usage data on new functionalities; this allows the model to be validated against both its training metrics and actual user requirements (e.g. A/B testing).
Alongside those requirements we can also add a few optional functionalities:
  1. Workflow management - to schedule periodic tasks, such as complex data preparation or model updates;
  2. Hyperparameter tuning - to benchmark multiple values for input parameters via black-box optimization;

These requirements resulted in multiple frameworks to automate ML-related development processes, offered by both cloud providers (e.g., Amazon SageMaker and Google ML Engine) and the open source community, with projects such as MLflow and Kubeflow. In this blog post, I want to explore MLflow and Kubeflow.

2. MLflow

MLflow is an open source tool introduced by Databricks to manage the ML software lifecycle.
MLflow offers 3 main components:

  • Tracking - for tracking experiments in terms of parameters and results, to make them reproducible. This is an API to log metrics and results when running ML code; tracking can be done on a file store (even a remote one, e.g. on S3) or on an actual tracking server.
  • Projects - for packaging code and managing dependencies, to make it more easily shareable across team members and later on movable to production. Specifically, MLflow provides a YAML format to define projects.
  • Models - offering a common interface for the deployment (or serving) process across multiple ML libraries. To this end, MLflow defines an interface, i.e. a set of methods that can be implemented by the ML developer and called in the same way when serving the model on different target platforms, both on-premise and in the cloud.

MLflow is language agnostic (i.e., it offers APIs for the major programming languages) and can be installed using Python pip. It can be used on both on-premise clusters and cloud-based installations, as it integrates well with Azure ML and Amazon SageMaker. A CLI is also provided for common workflow operations (e.g., running experiments, uploading/downloading and serving models).

A quickstart is provided here and a full tutorial here. At the time I checked MLflow it seemed to be at a pretty early stage (early beta, version 0.8.0) and I had issues getting the CLI installed on my Mac. I found a Docker image here, which simply inherits from a Python base image and installs MLflow there using pip.

Starting the Docker container and checking the CLI help:

root@e0f9de7fb06c:/# mlflow --help
Usage: mlflow [OPTIONS] COMMAND [ARGS]...

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  artifacts    Upload, list, and download artifacts from an MLflow artifact...
  azureml      Serve models on Azure ML.
  download     Download the artifact at the specified DBFS or S3 URI into...
  experiments  Manage experiments.
  pyfunc       Serve Python models locally.
  rfunc        Serve R models locally.
  run          Run an MLflow project from the given URI.
  sagemaker    Serve models on SageMaker.
  server       Run the MLflow tracking server.
  ui           Launch the MLflow tracking UI.

Some differences between these commands are subtle; in more detail, the CLI allows for the management of:

  • artifacts

    root@e0f9de7fb06c:/# mlflow artifacts --help
    Usage: mlflow artifacts [OPTIONS] COMMAND [ARGS]...

      Upload, list, and download artifacts from an MLflow artifact repository.

      To manage artifacts for a run associated with a tracking server, set the
      MLFLOW_TRACKING_URI environment variable to the URL of the desired server.

    Options:
      --help  Show this message and exit.

    Commands:
      download       Download an artifact file or directory to a local...
      list           Return all the artifacts directly under run's root
                     artifact...
      log-artifact   Logs a local file as an artifact of a run, optionally...
      log-artifacts  Logs the files within a local directory as an artifact of
                     a...
  • experiments, grouping different runs of a certain source code

    root@e0f9de7fb06c:/# mlflow experiments --help
    Usage: mlflow experiments [OPTIONS] COMMAND [ARGS]...

      Manage experiments. To manage experiments associated with a tracking
      server, set the MLFLOW_TRACKING_URI environment variable to the URL of the
      desired server.

    Options:
      --help  Show this message and exit.

    Commands:
      create   Create an experiment in the configured tracking server.
      delete   Mark an experiment for deletion.
      list     List all experiments in the configured tracking server.
      rename   Renames an active experiment.
      restore  Restore a deleted experiment.
  • projects, directly starting code from a local folder or a git repository

    root@cbc963f1e596:/# mlflow run --help
    Usage: mlflow run [OPTIONS] URI

      Run an MLflow project from the given URI.

      For local runs, blocks the run completes. Otherwise, runs the project
      asynchronously.

      If running locally (the default), the URI can be either a Git repository
      URI or a local path. If running on Databricks, the URI must be a Git
      repository.

      By default, Git projects run in a new working directory with the given
      parameters, while local projects run from the project's root directory.

    Options:
      -e, --entry-point NAME       Entry point within project. [default: main]. If
                                   the entry point is not found, attempts to run
                                   the project file with the specified name as a
                                   script, using 'python' to run .py files and the
                                   default shell (specified by environment
                                   variable $SHELL) to run .sh files
      -v, --version VERSION        Version of the project to run, as a Git commit
                                   reference for Git projects.
      -P, --param-list NAME=VALUE  A parameter for the run, of the form -P
                                   name=value. Provided parameters that are not in
                                   the list of parameters for an entry point will
                                   be passed to the corresponding entry point as
                                   command-line arguments in the form `--name
                                   value`
      --experiment-id INTEGER      ID of the experiment under which to launch the
                                   run. Defaults to 0
      -m, --mode MODE              Execution mode to use for run. Supported
                                   values: 'local' (runs projectlocally) and
                                   'databricks' (runs project on a Databricks
                                   cluster).Defaults to 'local'. If running
                                   against Databricks, will run against the
                                   Databricks workspace specified in the default
                                   Databricks CLI profile. See
                                   https://github.com/databricks/databricks-cli
                                   for more info on configuring a Databricks CLI
                                   profile.
      -c, --cluster-spec FILE      Path to JSON file (must end in '.json') or JSON
                                   string describing the clusterto use when
                                   launching a run on Databricks. See https://docs
                                   .databricks.com/api/latest/jobs.html#jobscluste
                                   rspecnewcluster for more info. Note that MLflow
                                   runs are currently launched against a new
                                   cluster.
      --git-username USERNAME      Username for HTTP(S) Git authentication.
      --git-password PASSWORD      Password for HTTP(S) Git authentication.
      --no-conda                   If specified, will assume that
                                   MLModel/MLProject is running within a Conda
                                   environmen with the necessary dependencies for
                                   the current project instead of attempting to
                                   create a new conda environment.
      --storage-dir TEXT           Only valid when `mode` is local.MLflow
                                   downloads artifacts from distributed URIs
                                   passed to parameters of type 'path' to
                                   subdirectories of storage_dir.
      --run-id RUN_ID              If specified, the given run ID will be used
                                   instead of creating a new run. Note: this
                                   argument is used internally by the MLflow
                                   project APIs and should not be specified.
      --help                       Show this message and exit.
The first step is to start the tracking server:

root@e0f9de7fb06c:/# mlflow server --help
Usage: mlflow server [OPTIONS]

  Run the MLflow tracking server.

  The server which listen on http://localhost:5000 by default, and only
  accept connections from the local machine. To let the server accept
  connections from other machines, you will need to pass --host 0.0.0.0 to
  listen on all network interfaces (or a specific interface address).

Options:
  --file-store PATH            The root of the backing file store for
                               experiment and run data (default: ./mlruns).
  --default-artifact-root URI  Local or S3 URI to store artifacts in, for
                               newly created experiments. Note that this flag
                               does not impact already-created experiments.
                               Default: inside file store.
  -h, --host HOST              The network address to listen on (default:
                               127.0.0.1). Use 0.0.0.0 to bind to all
                               addresses if you want to access the tracking
                               server from other machines.
  -p, --port INTEGER           The port to listen on (default: 5000).
  -w, --workers INTEGER        Number of gunicorn worker processes to handle
                               requests (default: 4).
  --static-prefix TEXT         A prefix which will be prepended to the path of
                               all static paths.
  --gunicorn-opts TEXT         Additional command line options forwarded to
                               gunicorn processes.
  --help                       Show this message and exit.

As shown, the server receives a local path or a remote S3 URI where artifacts will be saved, along with a location where experiment information is stored (by default, a file store). MLflow uses gunicorn to expose a REST interface, so the number of worker processes can be set here, along with the port and host to listen on.


I got to the same exact result when running the mlflow ui instead of the server command, and have not seen anything in the documentation explaining the difference.

The webpage simply tells us that no experiment is present and it is time to create one:

root@9d3e4db7110a:/# mlflow experiments create --help
Usage: mlflow experiments create [OPTIONS] EXPERIMENT_NAME

  Create an experiment in the configured tracking server.

Options:
  -l, --artifact-location TEXT  Base location for runs to store artifact
                                results. Artifacts will be stored at
                                $artifact_location/$run_id/artifacts. See http
                                s://mlflow.org/docs/latest/tracking.html#where
                                -runs-get-recorded for more info on the
                                properties of artifact location. If no
                                location is provided, the tracking server will
                                pick a default.
  --help                        Show this message and exit.



So, for instance:

root@0cf24699bef0:/# mlflow experiments create gilberto
Created experiment 'gilberto' with id 1

An experiment with id 1 was created in our /mlruns folder, along with a meta.yaml describing it:

root@0cf24699bef0:/# cat /mlruns/1/meta.yaml
artifact_location: /mlruns/1
experiment_id: 1
lifecycle_stage: active
name: gilberto

The id can be passed explicitly when invoking the run command with --experiment-id, or by setting the environment variable MLFLOW_EXPERIMENT_ID=1.
As long as no run is performed, nothing is visible in the UI. Alternatively, runs can be logged to a remote tracking server by setting the MLFLOW_TRACKING_URI environment variable or by calling mlflow.set_tracking_uri() programmatically.
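
Below is a minimal sketch of how that looks from Python; the tracking server hostname is a placeholder and helper names may vary slightly across MLflow versions:

import mlflow

# point the client at a remote tracking server (host and port are placeholders)
mlflow.set_tracking_uri("http://tracking-server:5000")

# select the experiment created above by name (it is created if it does not exist)
mlflow.set_experiment("gilberto")

with mlflow.start_run():
    mlflow.log_param("alpha", 0.5)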

2.1 MLflow tracking

The Tracking module works on the concept of a run, i.e. an execution of the code, for which it is possible to collect: the code version, start and end time, the source file being run, the parameters passed as input, metrics collected explicitly in the code, and artifacts auxiliary to the run or created by it, such as specific data files (e.g. images) or models.

Typically, a run is structured as follows:

  1. start_run is used to initiate a run; this is especially useful inside notebooks or files where multiple runs are present and we want to delimit them;
  2. specific methods are then used to log parameters (log_param), metrics (log_metric) and output artifacts (log_artifact).


import mlflow

with mlflow.start_run():
    # log an input parameter and a metric for this run
    mlflow.log_param("param1", 1)
    mlflow.log_metric("metric1", 2)

    # write an auxiliary file and track it as an artifact of the run
    with open("results.csv", "w") as f:
        f.write("val1,val2,val3\n")
        f.write("1,2,3\n")
    mlflow.log_artifact("results.csv")


2.2 MLflow projects

In MLflow any directory (whose name is also the project name) or git repository can be a project, as long as specific configuration files are available:

  • A conda YAML environment specification file (e.g. conda.yaml);
  • An MLproject file, a YAML specification which locates the environment dependencies, along with the entry points, i.e. the commands to be run, for instance:

name: My Project
conda_env: conda.yaml
entry_points:
  main:
    parameters:
      data_file: path
      regularization: {type: float, default: 0.1}
    command: "python train.py -r {regularization} {data_file}"
  validate:
    parameters:
      data_file: path
    command: "python validate.py {data_file}"
This allows for running MLflow projects directly from the CLI using the run command, on either a local folder or a git repository, directly passing the parameters as arguments, for instance:

mlflow run tutorial -P alpha=0.5
mlflow run git@github.com:mlflow/mlflow-example.git -P alpha=5
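
The same can also be done programmatically through the projects API; a minimal sketch, assuming the example repository above (the exact signature may differ slightly between MLflow versions):

import mlflow.projects

# launch the example project locally, passing the alpha parameter
submitted = mlflow.projects.run(
    uri="git@github.com:mlflow/mlflow-example.git",
    parameters={"alpha": 0.5},
)

# block until the run completes; wait() returns True on success
print(submitted.wait())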

2.3 MLflow models

The MLflow Models component eases the storage and serving of ML models, by:

  • specifying the creation time and run_id for the model so that it can be related to the run that created it;
  • using tags (i.e., hashmaps providing model metadata) called flavours to list how the model can be used, for instance whether it is compatible with scikit-learn, whether it is implemented as a Python function, and so on. The flavour mechanism is the main strength of MLflow Models, since it allows for a standardization of the deployment process. Specifically, MLflow provides built-in flavours for the main ML frameworks (e.g. scikit-learn, Keras, PyTorch, Spark MLlib);

Models can be saved in any format and, through flavours, the developer defines how they can be packaged behind a standard interface. Additional flavours can be defined for a model as well, as sketched below.
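
A minimal sketch of logging a scikit-learn model with its built-in flavour (assuming scikit-learn is installed; function names may differ slightly between MLflow versions):

import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# train a toy classifier
X, y = load_iris(return_X_y=True)
model = LogisticRegression().fit(X, y)

with mlflow.start_run():
    # persist the model among the run's artifacts using the scikit-learn flavour;
    # the resulting MLmodel file also lists the generic python_function flavour
    mlflow.sklearn.log_model(model, "model")

# the logged model can then be served locally, e.g. with:
#   mlflow pyfunc serve -r <run_id> -m model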

A very common flavour is python_function, which for instance lets the developer expose a REST interface to interact with the model, using JSON or CSV for data serialization.

root@0cf24699bef0:/# mlflow pyfunc serve --help
Usage: mlflow pyfunc serve [OPTIONS]

  Serve a pyfunc model saved with MLflow by launching a webserver on the
  specified host and port. For information about the input data formats
  accepted by the webserver, see the following documentation:
  https://www.mlflow.org/docs/latest/models.html#pyfunc-deployment.

  If a ``run_id`` is specified, ``model-path`` is treated as an artifact
  path within that run; otherwise it is treated as a local path.

Options:
  -m, --model-path PATH  Path to the model. The path is relative to the run
                         with the given run-id or local filesystem path
                         without run-id.  [required]
  -r, --run-id ID        ID of the MLflow run that generated the referenced
                         content.
  -p, --port INTEGER     Server port. [default: 5000]
  -h, --host TEXT        Server host. [default: 127.0.0.1]
  --no-conda             If specified, will assume that MLModel/MLProject is
                         running within a Conda environmen with the necessary
                         dependencies for the current project instead of
                         attempting to create a new conda environment.
  --help                 Show this message and exit.


The predict command loads the input data and outputs the predictions computed by the ML model:

root@0cf24699bef0:/# mlflow pyfunc predict --help
Usage: mlflow pyfunc predict [OPTIONS]

  Load a pandas DataFrame and runs a python_function model saved with MLflow
  against it. Return the prediction results as a CSV-formatted pandas
  DataFrame.

  If a ``run-id`` is specified, ``model-path`` is treated as an artifact
  path within that run; otherwise it is treated as a local path.

Options:
  -m, --model-path PATH   Path to the model. The path is relative to the run
                          with the given run-id or local filesystem path
                          without run-id.  [required]
  -r, --run-id ID         ID of the MLflow run that generated the referenced
                          content.
  -i, --input-path TEXT   CSV containing pandas DataFrame to predict against.
                          [required]
  -o, --output-path TEXT  File to output results to as CSV file. If not
                          provided, output to stdout.
  --no-conda              If specified, will assume that MLModel/MLProject is
                          running within a Conda environmen with the necessary
                          dependencies for the current project instead of
                          attempting to create a new conda environment.
  --help                  Show this message and exit.
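
Once a model is served (e.g. with mlflow pyfunc serve), it can also be queried over HTTP; below is a minimal sketch assuming a model served locally on port 5001 (the /invocations route and the exact payload schema depend on the MLflow version, and the column names are placeholders):

import json
import requests

# a toy record to score; the columns must match the model's expected input
payload = json.dumps([{"x1": 1.0, "x2": 2.0}])

response = requests.post(
    "http://127.0.0.1:5001/invocations",
    data=payload,
    headers={"Content-Type": "application/json"},
)
print(response.text)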


3. KubeFlow

Kubeflow has the same aim of easing the typical ML workflow, but it started from the way Google ran TensorFlow computations internally on its Kubernetes infrastructure, and it has since evolved towards a more general purpose. Because of this, Kubeflow is in fact a collection of related libraries and frameworks:

  • Jupyter (Hub) - no need for further explanation;
  • TensorFlow - to natively train models on the Kubernetes distributed infrastructure;
  • TensorFlow Hub - to publish reusable ML modules, i.e., TF Graphs and Model Weights, this can also be directly used from wrapping libraries such as Keras;
  • TensorFlow Serving - to ramp up a scalable web service wrapping the model, i.e. to perform inference on the model using a standard API, as data is being ingested;
  • SeldonCore - to deploy ML models on K8s that are not necessarily related to TensorFlow (e.g., scikit-learn, Spark MLlib)
  • Katib/Vizier - for hyperparameter tuning on K8s via black-box optimization
  • Ambassador - a reverse proxy for K8s
  • Argo -  a container-native workflow management system for K8s (link)

And yes, it seems like a lot of stuff, and may feel like a lot of mess too.

3.1 Installation

Kubeflow can be installed on an existing K8s cluster. Should you need a test cluster, minikube is the usual suggestion, which basically installs K8s in a local VM.
At the time of writing, Kubeflow is installed using a download.sh and a kfctl.sh setup script. See the script I uploaded here if you want to jump right to the end of the setup.

Once installed, the available K8s namespaces can be listed with:

$ kubectl get namespaces
NAME          STATUS    AGE
default       Active    18m
kube-public   Active    18m
kube-system   Active    18m
kubeflow      Active    18m

And finally we can have a look at the pods:

$ kubectl --namespace=kubeflow get pods
NAME                                             READY     STATUS    RESTARTS   AGE
ambassador-7fb86f6bc5-plcsn                      3/3       Running   1          18m
argo-ui-7b6585d85d-r424v                         1/1       Running   0          18m
centraldashboard-79645788-pftdc                  1/1       Running   0          18m
minio-84969865c4-768fp                           1/1       Running   0          18m
ml-pipeline-7f4c96bfc4-vtckr                     1/1       Running   1          18m
ml-pipeline-persistenceagent-7ccd95bc65-pddhf    1/1       Running   1          18m
ml-pipeline-scheduledworkflow-7994cff76c-hdgxd   1/1       Running   0          18m
ml-pipeline-ui-6b767c487c-wc7bm                  1/1       Running   0          18m
mysql-c4c4c8f69-knkcp                            1/1       Running   0          18m
spartakus-volunteer-77458746f4-x92nq             1/1       Running   0          18m
tf-hub-0                                         1/1       Running   0          18m
tf-job-dashboard-7cddcdf9c4-4j6tq                1/1       Running   0          18m
tf-job-operator-v1alpha2-6566f45db-b9lsr         1/1       Running   0          18m
workflow-controller-59c7967f59-xlplb             1/1       Running   0          18m

And similarly for the services:

$ kubectl --namespace=kubeflow get services
NAME                         TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
ambassador                   ClusterIP   10.98.175.240    <none>        80/TCP              19m
ambassador-admin             ClusterIP   10.109.108.165   <none>        8877/TCP            19m
argo-ui                      NodePort    10.100.99.141    <none>        80:31861/TCP        18m
centraldashboard             ClusterIP   10.111.78.198    <none>        80/TCP              18m
k8s-dashboard                ClusterIP   10.99.70.75      <none>        443/TCP             19m
minio-service                ClusterIP   10.107.145.101   <none>        9000/TCP            18m
ml-pipeline                  ClusterIP   10.105.170.246   <none>        8888/TCP,8887/TCP   18m
ml-pipeline-tensorboard-ui   ClusterIP   10.102.53.154    <none>        80/TCP              18m
ml-pipeline-ui               ClusterIP   10.108.159.180   <none>        80/TCP              18m
mysql                        ClusterIP   10.98.154.235    <none>        3306/TCP            18m
statsd-sink                  ClusterIP   10.105.74.2      <none>        9102/TCP            19m
tf-hub-0                     ClusterIP   None             <none>        8000/TCP            18m
tf-hub-lb                    ClusterIP   10.106.122.192   <none>        80/TCP              18m
tf-job-dashboard             ClusterIP   10.104.122.161   <none>        80/TCP              18m

We can now port forward the pod or the service using kubectl port-forward:

$ kubectl port-forward --namespace=kubeflow svc/centraldashboard 9991:80
Forwarding from 127.0.0.1:9991 -> 8082
Forwarding from [::1]:9991 -> 8082

Handling connection for 9991
Handling connection for 9991

And I connect to my K8s cluster with: ssh -L 9991:localhost:9991 pilillo@192.168.1.9

So I can now access the service locally at localhost:9991. This is however limited to the dashboard service; we could do the same for all services in the namespace if we wanted to. A smarter approach is however to point our local kubectl to the remote K8s cluster. To do this we have to introduce the concept of a Kubernetes context: a context consists of a cluster, a namespace and a user used to access the cluster resources.

The current selected context can be shown with:
kubectl config current-context

The context can be selected with:
kubectl config use-context <context-name>

The context configuration is stored in the files listed in the colon-delimited KUBECONFIG environment variable, or in a default config file; its content can be shown with kubectl config view, which for this cluster returns:

apiVersion: v1
clusters:
- cluster:
    certificate-authority: /home/pilillo/.minikube/ca.crt
    server: https://192.168.39.249:8443
  name: minikube
contexts:
- context:
    cluster: minikube
    user: minikube
  name: minikube
current-context: minikube
kind: Config
preferences: {}
users:
- name: minikube
  user:
    client-certificate: /home/pilillo/.minikube/client.crt
    client-key: /home/pilillo/.minikube/client.key

As visible, the configuration file consists of a cluster and a context, along with authentication details.

The file is commonly located at $HOME/.kube/config.

An additional cluster (other than the local minikube one) can be added to my local kubectl config with:
kubectl config set-cluster NAME [--server=server] [--certificate-authority=path/to/certificate/authority][--insecure-skip-tls-verify=true] [options]

For instance:

kubectl config set-cluster 192.168.1.9 \
--insecure-skip-tls-verify=true \
--server=https://192.168.1.9


Similarly, a set-context command can be used to define a new context:
kubectl config set-context NAME [--cluster=cluster_nickname] [--user=user_nickname] [--namespace=namespace] [options]
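
The same kubeconfig and context machinery can also be used programmatically; below is a minimal sketch using the official Kubernetes Python client (an extra dependency, not something Kubeflow requires) to list the pods of the kubeflow namespace under a given context:

from kubernetes import client, config

# load ~/.kube/config (or the files listed in KUBECONFIG) and select the minikube context
config.load_kube_config(context="minikube")

v1 = client.CoreV1Api()
for pod in v1.list_namespaced_pod(namespace="kubeflow").items:
    print(pod.metadata.name, pod.status.phase)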

I will show in the following sections how services can be accessed from outside the cluster, and go through those that are of main interest to the scope of this post.

3.2 Accessing the services

While we showed how to port-forward services of the remote cluster to local ports, as well as how to connect our local kubectl to the remote cluster using contexts, the general approach to reach the services is to use Ambassador. Ambassador offers a single point of interaction with Kubeflow, acting as a reverse proxy, while allowing for a distributed (rather than centralized) configuration of the individual services. This means that to access any service it is enough to port-forward Ambassador:

kubectl port-forward --namespace=kubeflow svc/ambassador 9992:80 &

As shown, this can be done either directly from a client (e.g. my Mac), or we can tunnel to the K8s cluster with -L 9992:localhost:9992, or even with -D <port> and a utility like FoxyProxy to redirect the entire traffic through the tunnel.

The main Kubeflow interface:

The Jupyter Hub, where Jupyter lab sessions can be spawned:



The pipelines section, where experiments can also be managed:

3.4 SeldonCore

This is probably the nicest component available for K8s within Kubeflow. Seldon allows for packaging code from different ML libraries in a format that can be served and monitored alike. Specifically, it provides an API for:

  • Models - to wrap models from various known ML libraries (see the sketch after this list)
  • Routers - to control where requests to the gateway are directed, i.e. which model or model ensemble (e.g. for A/B testing or a multi-armed bandit selector)
  • Combiners - to group multiple models into an ensemble
  • Transformers - to preprocess the request (e.g. feature extraction) and post-process the response
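
As an idea of what the Models part looks like, below is a minimal sketch following Seldon's Python wrapper convention: a class exposing a predict method, which is then containerized (e.g. with s2i) and referenced from a SeldonDeployment resource; the joblib file name is just a placeholder:

import joblib

class IrisClassifier(object):

    def __init__(self):
        # load the serialized model once, when the microservice starts
        self.model = joblib.load("model.joblib")

    def predict(self, X, features_names=None):
        # X is a numpy array built by Seldon from the incoming request
        return self.model.predict_proba(X)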

I will probably write a dedicated post about this component later on. Until then, please enjoy the video below:




Hope this post gave an idea of the growing potential of DataOps. My feeling is that this space is still pretty much in alpha at this time, but we are getting there, eventually.
