Tuesday, December 11, 2018

DataOps: A DevOps approach to Machine Learning Development

1. Motivation

In previous blog posts we discussed the benefits of the Site Reliability Engineering and DevOps approaches to software development and to the management of complex distributed systems. Unfortunately, these practices are nowadays not as widespread across data science and machine learning engineering teams for the development and deployment of algorithms. This creates a technical gap between the data scientists initially developing the model or algorithm and the subsequent industrialization phase that should bring the artifact to production. The main reason is the different skill set among team members, which implies a lack of knowledge transfer concerning the model in one direction and the infrastructure details in the other.

This video provides a very nice introduction to this issue:



Based on the video and personal experience, it seems clear that most analytics projects follow the typical lambda architecture (Fig. 1) while data scientists tend to follow a different workflow (Fig. 2).
Fig. 1: Lambda architecture

Fig. 2: Typical DS workflow

This makes the integration of DS activities into a CI/CD pipeline somewhat incomplete, with unit and integration tests only capturing either data preparation code or production-ready components. Consequently, new means are necessary to: i) automate the data quality assessment of ingested data (i.e., without waiting for a data scientist to perform an explorative phase once again), and ii) automate the model training process in order to achieve reproducible and comparable models. Pachyderm offers a cloud-native (read: K8s-based) means to version incoming data and manage their provenance, by also versioning the processing pipelines. Please see an introductory video below:



Consequently, we have an engineering problem: versioning both the incoming data (e.g. with Pachyderm) and the ongoing data science development workflow, which encompasses multiple aspects of ML model development. The main goal is to make sure that all stakeholders (e.g. product owners, managers) get a periodic and reproducible development status beside the classic agile methods that might already be in use (e.g. scrum retrospectives). As mentioned at the beginning of the post, data scientists are often left aside from the rest of the development team, with their tasks considered closer to a research practice than to engineering, i.e. something that might eventually work but not necessarily be usable. This lack of control can lead to information gaps and wasted resources, as well as to unexpected behaviours upon changes to data and models.

Speaking of DevOps in the AI/ML domain, I would then advocate the following requirements:
  1. Model reproducibility - an experimental setting is necessary so that all data scientists share the same runtime and can seamlessly access the data without further technical assistance. Algorithms and models should be implemented in a shared environment where not only the code is versioned, but also the trials on models together with their achieved performance, in order to track changes during the development process and to reproduce specific instances of those models; since both data and model structure are shared across all team members, they are inherently documented.
  2. Model integration - the interface used to instantiate and run the model should be standardized and documented in order to allow for continuous testing and integration with the rest of the architecture, so that flaws in the development process are spotted early.
  3. Model deployment and continuous monitoring - the deployment and delivery process should reach a greater level of automation, to allow new models to be rolled out at a higher frequency, collecting performance on actual usage as well as usage data on new functionalities; this allows the model to be validated against both its training metrics and actual user requirements (e.g. A/B testing).
Alongside those requirements we can also add a few optional functionalities:
  1. Workflow management - to schedule periodic tasks, such as complex data preparation or model updates;
  2. Hyperparameter tuning - to benchmark multiple values for input parameters via black-box optimization;

These requirements resulted in multiple frameworks to automate ML-related development processes, offered by both cloud providers (e.g., Amazon SageMaker and Google ML Engine) and the open source community, with projects such as MLflow and Kubeflow. In this blog post, I want to explore MLflow and Kubeflow.

2. MLflow

MLflow is an open source tool introduced by Databricks to manage the ML software lifecycle.
MLflow offers 3 main components:

  • Tracking - for tracking experiments in terms of parameters and results, to make them reproducible. This is an API to log metrics and results when running ML code; tracking can be done on a file store (even a remote one, e.g. on S3) or on an actual tracking server.
  • Projects - for packaging code and managing dependencies, to make it more easily shareable across team members and later on movable to production. Specifically, MLflow provides a YAML format to define projects.
  • Models - offering a common interface for the deployment (or serving) process across multiple ML libraries. To this end, MLflow defines an interface, i.e. a set of methods that can be implemented by the ML developer and called in the same way when serving the model on different target platforms, both on-premise and in the cloud.

MLflow is language agnostic (i.e., it offers APIs for the major programming languages) and can be installed using Python pip. It can be used on both on-premise clusters and cloud-based installations, as it integrates well with Azure ML and Amazon SageMaker. A CLI is also provided for common workflow operations (e.g., running experiments, uploading/downloading and serving models).

A quickstart is provided here and a full tutorial here. At the time I checked MLflow it seemed to be at a pretty early stage (early beta, version 0.8.0) and I had issues getting the CLI installed on my Mac. I found a Docker image here, which simply inherits from a Python base image and installs MLflow there using pip.

Starting the Docker container and checking the CLI help:

root@e0f9de7fb06c:/# mlflow --help
Usage: mlflow [OPTIONS] COMMAND [ARGS]...

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  artifacts    Upload, list, and download artifacts from an MLflow artifact...
  azureml      Serve models on Azure ML.
  download     Download the artifact at the specified DBFS or S3 URI into...
  experiments  Manage experiments.
  pyfunc       Serve Python models locally.
  rfunc        Serve R models locally.
  run          Run an MLflow project from the given URI.
  sagemaker    Serve models on SageMaker.
  server       Run the MLflow tracking server.
  ui           Launch the MLflow tracking UI.

Some differences between these commands are subtle; in more detail, the CLI allows for the management of:

  • artifacts

    root@e0f9de7fb06c:/# mlflow artifacts --help
    Usage: mlflow artifacts [OPTIONS] COMMAND [ARGS]...

      Upload, list, and download artifacts from an MLflow artifact repository.

      To manage artifacts for a run associated with a tracking server, set the
      MLFLOW_TRACKING_URI environment variable to the URL of the desired server.

    Options:
      --help  Show this message and exit.

    Commands:
      download       Download an artifact file or directory to a local...
      list           Return all the artifacts directly under run's root
                     artifact...
      log-artifact   Logs a local file as an artifact of a run, optionally...
      log-artifacts  Logs the files within a local directory as an artifact of
                     a...
  • experiments, grouping different runs of a certain source code

    root@e0f9de7fb06c:/# mlflow experiments --help
    Usage: mlflow experiments [OPTIONS] COMMAND [ARGS]...

      Manage experiments. To manage experiments associated with a tracking
      server, set the MLFLOW_TRACKING_URI environment variable to the URL of the
      desired server.

    Options:
      --help  Show this message and exit.

    Commands:
      create   Create an experiment in the configured tracking server.
      delete   Mark an experiment for deletion.
      list     List all experiments in the configured tracking server.
      rename   Renames an active experiment.
      restore  Restore a deleted experiment.
  • projects, directly starting code from a local folder or a git repository

    root@cbc963f1e596:/# mlflow run --help
    Usage: mlflow run [OPTIONS] URI

      Run an MLflow project from the given URI.

      For local runs, blocks the run completes. Otherwise, runs the project
      asynchronously.

      If running locally (the default), the URI can be either a Git repository
      URI or a local path. If running on Databricks, the URI must be a Git
      repository.

      By default, Git projects run in a new working directory with the given
      parameters, while local projects run from the project's root directory.

    Options:
      -e, --entry-point NAME       Entry point within project. [default: main]. If
                                   the entry point is not found, attempts to run
                                   the project file with the specified name as a
                                   script, using 'python' to run .py files and the
                                   default shell (specified by environment
                                   variable $SHELL) to run .sh files
      -v, --version VERSION        Version of the project to run, as a Git commit
                                   reference for Git projects.
      -P, --param-list NAME=VALUE  A parameter for the run, of the form -P
                                   name=value. Provided parameters that are not in
                                   the list of parameters for an entry point will
                                   be passed to the corresponding entry point as
                                   command-line arguments in the form `--name
                                   value`
      --experiment-id INTEGER      ID of the experiment under which to launch the
                                   run. Defaults to 0
      -m, --mode MODE              Execution mode to use for run. Supported
                                   values: 'local' (runs projectlocally) and
                                   'databricks' (runs project on a Databricks
                                   cluster).Defaults to 'local'. If running
                                   against Databricks, will run against the
                                   Databricks workspace specified in the default
                                   Databricks CLI profile. See
                                   https://github.com/databricks/databricks-cli
                                   for more info on configuring a Databricks CLI
                                   profile.
      -c, --cluster-spec FILE      Path to JSON file (must end in '.json') or JSON
                                   string describing the clusterto use when
                                   launching a run on Databricks. See https://docs
                                   .databricks.com/api/latest/jobs.html#jobscluste
                                   rspecnewcluster for more info. Note that MLflow
                                   runs are currently launched against a new
                                   cluster.
      --git-username USERNAME      Username for HTTP(S) Git authentication.
      --git-password PASSWORD      Password for HTTP(S) Git authentication.
      --no-conda                   If specified, will assume that
                                   MLModel/MLProject is running within a Conda
                                   environmen with the necessary dependencies for
                                   the current project instead of attempting to
                                   create a new conda environment.
      --storage-dir TEXT           Only valid when `mode` is local.MLflow
                                   downloads artifacts from distributed URIs
                                   passed to parameters of type 'path' to
                                   subdirectories of storage_dir.
      --run-id RUN_ID              If specified, the given run ID will be used
                                   instead of creating a new run. Note: this
                                   argument is used internally by the MLflow
                                   project APIs and should not be specified.
      --help                       Show this message and exit.
The first step is to start the tracking server:

root@e0f9de7fb06c:/# mlflow server --help
Usage: mlflow server [OPTIONS]

  Run the MLflow tracking server.

  The server which listen on http://localhost:5000 by default, and only
  accept connections from the local machine. To let the server accept
  connections from other machines, you will need to pass --host 0.0.0.0 to
  listen on all network interfaces (or a specific interface address).

Options:
  --file-store PATH            The root of the backing file store for
                               experiment and run data (default: ./mlruns).
  --default-artifact-root URI  Local or S3 URI to store artifacts in, for
                               newly created experiments. Note that this flag
                               does not impact already-created experiments.
                               Default: inside file store.
  -h, --host HOST              The network address to listen on (default:
                               127.0.0.1). Use 0.0.0.0 to bind to all
                               addresses if you want to access the tracking
                               server from other machines.
  -p, --port INTEGER           The port to listen on (default: 5000).
  -w, --workers INTEGER        Number of gunicorn worker processes to handle
                               requests (default: 4).
  --static-prefix TEXT         A prefix which will be prepended to the path of
                               all static paths.
  --gunicorn-opts TEXT         Additional command line options forwarded to
                               gunicorn processes.
  --help                       Show this message and exit.

As shown, the server receives a local path or a remote S3 URI where artifacts will be saved, along with a location where experiment information is stored (by default, a file store). MLflow uses gunicorn to expose a REST interface, so the number of worker processes can be set here, along with the port and host to listen on.


I got to the same exact result when running the mlflow ui instead of the server command, and have not seen anything in the documentation explaining the difference.

The webpage simply tells us that no experiment is present and it is time to create one:

root@9d3e4db7110a:/# mlflow experiments create --help
Usage: mlflow experiments create [OPTIONS] EXPERIMENT_NAME

  Create an experiment in the configured tracking server.

Options:
  -l, --artifact-location TEXT  Base location for runs to store artifact
                                results. Artifacts will be stored at
                                $artifact_location/$run_id/artifacts. See http
                                s://mlflow.org/docs/latest/tracking.html#where
                                -runs-get-recorded for more info on the
                                properties of artifact location. If no
                                location is provided, the tracking server will
                                pick a default.
  --help                        Show this message and exit.



So, for instance:

root@0cf24699bef0:/# mlflow experiments create gilberto
Created experiment 'gilberto' with id 1

An experiment with id 1 was created in our /mlruns folder, along with a meta.yaml describing it:

root@0cf24699bef0:/# cat /mlruns/1/meta.yaml
artifact_location: /mlruns/1
experiment_id: 1
lifecycle_stage: active
name: gilberto

The id can be passed explicitly when invoking the run command with --experiment-id, or by setting the environment variable MLFLOW_EXPERIMENT_ID=1.
As long as no run is performed, nothing is visible in the UI. Alternatively, runs can be logged to a remote tracking server by setting the MLFLOW_TRACKING_URI environment variable or by calling mlflow.set_tracking_uri() programmatically.
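
Below is a minimal sketch of how that looks from Python; the tracking server hostname is a placeholder and helper names may vary slightly across MLflow versions:

import mlflow

# point the client at a remote tracking server (host and port are placeholders)
mlflow.set_tracking_uri("http://tracking-server:5000")

# select the experiment created above by name (it is created if it does not exist)
mlflow.set_experiment("gilberto")

with mlflow.start_run():
    mlflow.log_param("alpha", 0.5)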

2.1 MLflow tracking

The Tracking module works on the concept of a run, i.e. an execution of the code, for which it is possible to collect: the code version, start and end time, the source file being run, the parameters passed as input, metrics collected explicitly in the code, and artifacts auxiliary to the run or created by it, such as specific data files (e.g. images) or models.

Typically, a run is structured as follows:

  1. start_run is used to initiate a run; this is especially useful inside notebooks or files where multiple runs are present and we want to delimit them;
  2. specific methods are then used to log parameters (log_param), metrics (log_metric) and output artifacts (log_artifact).


import mlflow

with mlflow.start_run():
    # log an input parameter and a metric for this run
    mlflow.log_param("param1", 1)
    mlflow.log_metric("metric1", 2)

    # write an auxiliary file and track it as an artifact of the run
    with open("results.csv", "w") as f:
        f.write("val1,val2,val3\n")
        f.write("1,2,3\n")
    mlflow.log_artifact("results.csv")


2.2 MLflow projects

In MLflow any directory (whose name is also the project name) or git repository can be a project, as long as specific configuration files are available:

  • A conda YAML environment specification file (e.g. conda.yaml);
  • An MLproject file, a YAML specification which locates the environment dependencies, along with the entry points, i.e. the commands to be run, for instance:

name: My Project
conda_env: conda.yaml
entry_points:
  main:
    parameters:
      data_file: path
      regularization: {type: float, default: 0.1}
    command: "python train.py -r {regularization} {data_file}"
  validate:
    parameters:
      data_file: path
    command: "python validate.py {data_file}"
This allows for running MLflow projects directly from the CLI using the run command, on either a local folder or a git repository, directly passing the parameters as arguments, for instance:

mlflow run tutorial -P alpha=0.5
mlflow run git@github.com:mlflow/mlflow-example.git -P alpha=5
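
The same can also be done programmatically through the projects API; a minimal sketch, assuming the example repository above (the exact signature may differ slightly between MLflow versions):

import mlflow.projects

# launch the example project locally, passing the alpha parameter
submitted = mlflow.projects.run(
    uri="git@github.com:mlflow/mlflow-example.git",
    parameters={"alpha": 0.5},
)

# block until the run completes; wait() returns True on success
print(submitted.wait())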

2.3 MLflow models

The MLflow Models component eases the storage and serving of ML models, by:

  • specifying the creation time and run_id for the model so that it can be related to the run that created it;
  • using tags (i.e., hashmaps providing model metadata) called flavours to list how the model can be used, for instance whether it is compatible with scikit-learn, whether it is implemented as a Python function, and so on. The flavour mechanism is the main strength of MLflow Models, since it allows for a standardization of the deployment process. Specifically, MLflow provides built-in flavours for the main ML frameworks (e.g. scikit-learn, Keras, PyTorch, Spark MLlib);

Models can be saved in any format and, through flavours, the developer defines how they can be packaged behind a standard interface. Additional flavours can be defined for a model as well, as sketched below.
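
A minimal sketch of logging a scikit-learn model with its built-in flavour (assuming scikit-learn is installed; function names may differ slightly between MLflow versions):

import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# train a toy classifier
X, y = load_iris(return_X_y=True)
model = LogisticRegression().fit(X, y)

with mlflow.start_run():
    # persist the model among the run's artifacts using the scikit-learn flavour;
    # the resulting MLmodel file also lists the generic python_function flavour
    mlflow.sklearn.log_model(model, "model")

# the logged model can then be served locally, e.g. with:
#   mlflow pyfunc serve -r <run_id> -m model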

A very common flavour is python_function, which for instance lets the developer expose a REST interface to interact with the model, using JSON or CSV for data serialization.

root@0cf24699bef0:/# mlflow pyfunc serve --help
Usage: mlflow pyfunc serve [OPTIONS]

  Serve a pyfunc model saved with MLflow by launching a webserver on the
  specified host and port. For information about the input data formats
  accepted by the webserver, see the following documentation:
  https://www.mlflow.org/docs/latest/models.html#pyfunc-deployment.

  If a ``run_id`` is specified, ``model-path`` is treated as an artifact
  path within that run; otherwise it is treated as a local path.

Options:
  -m, --model-path PATH  Path to the model. The path is relative to the run
                         with the given run-id or local filesystem path
                         without run-id.  [required]
  -r, --run-id ID        ID of the MLflow run that generated the referenced
                         content.
  -p, --port INTEGER     Server port. [default: 5000]
  -h, --host TEXT        Server host. [default: 127.0.0.1]
  --no-conda             If specified, will assume that MLModel/MLProject is
                         running within a Conda environmen with the necessary
                         dependencies for the current project instead of
                         attempting to create a new conda environment.
  --help                 Show this message and exit.


The predict command loads the input data and outputs the predictions computed by the ML model:

root@0cf24699bef0:/# mlflow pyfunc predict --help
Usage: mlflow pyfunc predict [OPTIONS]

  Load a pandas DataFrame and runs a python_function model saved with MLflow
  against it. Return the prediction results as a CSV-formatted pandas
  DataFrame.

  If a ``run-id`` is specified, ``model-path`` is treated as an artifact
  path within that run; otherwise it is treated as a local path.

Options:
  -m, --model-path PATH   Path to the model. The path is relative to the run
                          with the given run-id or local filesystem path
                          without run-id.  [required]
  -r, --run-id ID         ID of the MLflow run that generated the referenced
                          content.
  -i, --input-path TEXT   CSV containing pandas DataFrame to predict against.
                          [required]
  -o, --output-path TEXT  File to output results to as CSV file. If not
                          provided, output to stdout.
  --no-conda              If specified, will assume that MLModel/MLProject is
                          running within a Conda environmen with the necessary
                          dependencies for the current project instead of
                          attempting to create a new conda environment.
  --help                  Show this message and exit.
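
Once a model is served (e.g. with mlflow pyfunc serve), it can also be queried over HTTP; below is a minimal sketch assuming a model served locally on port 5001 (the /invocations route and the exact payload schema depend on the MLflow version, and the column names are placeholders):

import json
import requests

# a toy record to score; the columns must match the model's expected input
payload = json.dumps([{"x1": 1.0, "x2": 2.0}])

response = requests.post(
    "http://127.0.0.1:5001/invocations",
    data=payload,
    headers={"Content-Type": "application/json"},
)
print(response.text)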


3. KubeFlow

Kubeflow has the same aim of easing the typical ML workflow, but it started from the way Google ran TensorFlow computations internally on its Kubernetes infrastructure, and it has since evolved towards a more general purpose. Because of this, Kubeflow is in fact a collection of related libraries and frameworks:

  • Jupyter (Hub) - no need for further explanation;
  • TensorFlow - to natively train models on the Kubernetes distributed infrastructure;
  • TensorFlow Hub - to publish reusable ML modules, i.e., TF Graphs and Model Weights, this can also be directly used from wrapping libraries such as Keras;
  • TensorFlow Serving - to ramp up a scalable web service wrapping the model, i.e. to perform inference on the model using a standard API, as data is being ingested;
  • SeldonCore - to deploy ML models on K8s that are not necessarily related to TensorFlow (e.g., scikit-learn, Spark MLlib)
  • Katib/Vizier - for hyperparameter tuning on K8s via black-box optimization
  • Ambassador - a reverse proxy for K8s
  • Argo -  a container-native workflow management system for K8s (link)

And yes, it seems like a lot of stuff, and may feel like a lot of mess too.

3.1 Installation

Kubeflow can be installed on an existing K8s cluster. Should you need a test cluster, minikube is the usual suggestion, which basically installs K8s in a local VM.
At the time of writing, Kubeflow is installed using a download.sh and a kfctl.sh setup script. See the script I uploaded here if you want to jump right to the end of the setup.

Once installed, the available K8s namespaces can be listed with:

$ kubectl get namespaces
NAME          STATUS    AGE
default       Active    18m
kube-public   Active    18m
kube-system   Active    18m
kubeflow      Active    18m

And finally we can have a look at the pods:

$ kubectl --namespace=kubeflow get pods
NAME                                             READY     STATUS    RESTARTS   AGE
ambassador-7fb86f6bc5-plcsn                      3/3       Running   1          18m
argo-ui-7b6585d85d-r424v                         1/1       Running   0          18m
centraldashboard-79645788-pftdc                  1/1       Running   0          18m
minio-84969865c4-768fp                           1/1       Running   0          18m
ml-pipeline-7f4c96bfc4-vtckr                     1/1       Running   1          18m
ml-pipeline-persistenceagent-7ccd95bc65-pddhf    1/1       Running   1          18m
ml-pipeline-scheduledworkflow-7994cff76c-hdgxd   1/1       Running   0          18m
ml-pipeline-ui-6b767c487c-wc7bm                  1/1       Running   0          18m
mysql-c4c4c8f69-knkcp                            1/1       Running   0          18m
spartakus-volunteer-77458746f4-x92nq             1/1       Running   0          18m
tf-hub-0                                         1/1       Running   0          18m
tf-job-dashboard-7cddcdf9c4-4j6tq                1/1       Running   0          18m
tf-job-operator-v1alpha2-6566f45db-b9lsr         1/1       Running   0          18m
workflow-controller-59c7967f59-xlplb             1/1       Running   0          18m

And similarly for the services:

$ kubectl --namespace=kubeflow get services
NAME                         TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
ambassador                   ClusterIP   10.98.175.240    <none>        80/TCP              19m
ambassador-admin             ClusterIP   10.109.108.165   <none>        8877/TCP            19m
argo-ui                      NodePort    10.100.99.141    <none>        80:31861/TCP        18m
centraldashboard             ClusterIP   10.111.78.198    <none>        80/TCP              18m
k8s-dashboard                ClusterIP   10.99.70.75      <none>        443/TCP             19m
minio-service                ClusterIP   10.107.145.101   <none>        9000/TCP            18m
ml-pipeline                  ClusterIP   10.105.170.246   <none>        8888/TCP,8887/TCP   18m
ml-pipeline-tensorboard-ui   ClusterIP   10.102.53.154    <none>        80/TCP              18m
ml-pipeline-ui               ClusterIP   10.108.159.180   <none>        80/TCP              18m
mysql                        ClusterIP   10.98.154.235    <none>        3306/TCP            18m
statsd-sink                  ClusterIP   10.105.74.2      <none>        9102/TCP            19m
tf-hub-0                     ClusterIP   None             <none>        8000/TCP            18m
tf-hub-lb                    ClusterIP   10.106.122.192   <none>        80/TCP              18m
tf-job-dashboard             ClusterIP   10.104.122.161   <none>        80/TCP              18m

We can now port forward the pod or the service using kubectl port-forward:

$ kubectl port-forward --namespace=kubeflow svc/centraldashboard 9991:80
Forwarding from 127.0.0.1:9991 -> 8082
Forwarding from [::1]:9991 -> 8082

Handling connection for 9991
Handling connection for 9991

And I connect to my K8s cluster with: ssh -L 9991:localhost:9991 pilillo@192.168.1.9

So I can now access the service locally at localhost:9991. This is however limited to the dashboard service; we could do the same for all services in the namespace if we wanted to. A smarter approach is however to point our local kubectl to the remote K8s cluster. To do this we have to introduce the concept of a Kubernetes context: a context consists of a cluster, a namespace and a user used to access the cluster resources.

The current selected context can be shown with:
kubectl config current-context

The context can be selected with:
kubectl config use-context <context-name>

The context configuration is stored in the files listed in the colon-delimited KUBECONFIG environment variable, or in a default config file; its content can be shown with kubectl config view, which for this cluster returns:

apiVersion: v1
clusters:
- cluster:
    certificate-authority: /home/pilillo/.minikube/ca.crt
    server: https://192.168.39.249:8443
  name: minikube
contexts:
- context:
    cluster: minikube
    user: minikube
  name: minikube
current-context: minikube
kind: Config
preferences: {}
users:
- name: minikube
  user:
    client-certificate: /home/pilillo/.minikube/client.crt
    client-key: /home/pilillo/.minikube/client.key

As visible, the configuration file consists of a cluster and a context, along with authentication details.

The file is commonly located at $HOME/.kube/config.

An additional cluster (other than the local minikube one) can be added to my local kubectl config with:
kubectl config set-cluster NAME [--server=server] [--certificate-authority=path/to/certificate/authority][--insecure-skip-tls-verify=true] [options]

For instance:

kubectl config set-cluster 192.168.1.9 \
--insecure-skip-tls-verify=true \
--server=https://192.168.1.9


Similarly, a set-context command can be used to define a new context:
kubectl config set-context NAME [--cluster=cluster_nickname] [--user=user_nickname] [--namespace=namespace] [options]
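
The same kubeconfig and context machinery can also be used programmatically; below is a minimal sketch using the official Kubernetes Python client (an extra dependency, not something Kubeflow requires) to list the pods of the kubeflow namespace under a given context:

from kubernetes import client, config

# load ~/.kube/config (or the files listed in KUBECONFIG) and select the minikube context
config.load_kube_config(context="minikube")

v1 = client.CoreV1Api()
for pod in v1.list_namespaced_pod(namespace="kubeflow").items:
    print(pod.metadata.name, pod.status.phase)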

I will show in the following sections how services can be accessed from outside the cluster, and go through those that are of main interest to the scope of this post.

3.2 Accessing the services

While we showed how to port-forward services of the remote cluster to local ports, as well as how to connect our local kubectl to the remote cluster using contexts, the general approach to reach the services is to use Ambassador. Ambassador offers a single point of interaction with Kubeflow, acting as a reverse proxy, while allowing for a distributed (rather than centralized) configuration of the individual services. This means that to access any service it is enough to port-forward Ambassador:

kubectl port-forward --namespace=kubeflow svc/ambassador 9992:80 &

As shown, this can be done either directly from a client (e.g. my Mac), or we can tunnel to the K8s cluster with -L 9992:localhost:9992, or even with -D <port> and a utility like FoxyProxy to redirect the entire traffic through the tunnel.

The main Kubeflow interface:

The Jupyter Hub, where Jupyter lab sessions can be spawned:



The pipelines section, where experiments can also be managed:

3.4 SeldonCore

This is probably the nicest component available for K8s within Kubeflow. Seldon allows for packaging code from different ML libraries in a format that can be served and monitored alike. Specifically, it provides an API for:

  • Models - to wrap models from various known ML libraries (see the sketch after this list)
  • Routers - to control where requests to the gateway are directed, i.e. which model or model ensemble (e.g. for A/B testing or a multi-armed bandit selector)
  • Combiners - to group multiple models into an ensemble
  • Transformers - to preprocess the request (e.g. feature extraction) and post-process the response
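
As an idea of what the Models part looks like, below is a minimal sketch following Seldon's Python wrapper convention: a class exposing a predict method, which is then containerized (e.g. with s2i) and referenced from a SeldonDeployment resource; the joblib file name is just a placeholder:

import joblib

class IrisClassifier(object):

    def __init__(self):
        # load the serialized model once, when the microservice starts
        self.model = joblib.load("model.joblib")

    def predict(self, X, features_names=None):
        # X is a numpy array built by Seldon from the incoming request
        return self.model.predict_proba(X)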

I will probably write a dedicated post about this component later on. Until then, please enjoy the video below:




Hope this post gave an idea of the growing potential of DataOps. My feeling is that this space is still pretty much in alpha at this time, but we are getting there, eventually.
