Sunday, August 26, 2018

A TLDR (Too Long; Didn't Read) Introduction to Kubernetes

Site Reliability Engineering (SRE) is the Google approach to Software Development and Operations (DevOps). Software engineers with a mixed skillset covering Unix and networking are hired to perform tasks that in the past have typically been performed manually by operations.
Engineers tend to automate their operational tasks and spend about 50% of their time on software development. The mixed skillset and the higher degree of automation also allow for implicit documentation across the team and quicker releases of new software artifacts into production. In essence, Google's SRE principles are very similar to those of DevOps. The team members homogeneously share responsibility for "the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s)".
As such, monitoring, i.e. logging, event processing and automated alerts (along with dashboards), is necessary to allow a prompt response and to guarantee that service-level agreements or indicators (e.g., latency, error rate, throughput, availability) are respected. Moreover, a postmortem is to be written for each recorded incident, in order to trigger an investigation that can determine the root cause and potentially yield a better understanding of where the monitoring capabilities fell short.
Demand forecasting and capacity planning are important to verify (mainly by performing load testing) that resources have been sized properly.

In SRE the cluster management software has a crucial role, since we do not act on individual nodes but simply submit jobs to a master, which is in charge of finding suitable resources and monitoring the job execution. The Kubernetes cluster manager described in this post is a derivative of Google's Borg, on which Google's site reliability engineering methodology is based.


Kubernetes (K8s) is an open source orchestrator for containerized applications (i.e. applications based on containers, see my previous blog post on Docker). The main reasons for using containers are runtime immutability and reproducibility, which give projects higher release speed and stability, and allow for better decoupling between load balancers, application APIs and the actual service implementations, generally by using declarative configuration to define set points and service-level agreements for the cluster.




Kubernetes Architecture

This is how the cluster is organized:
  • Nodes, divided into a master, generally running the API server and managing/scheduling the cluster resources, and a set of workers running the actual application containers;
  • a Domain Name System (DNS) service, indexing all services available in the cluster and allowing for service discovery;
  • the Kubernetes UI, showing resources and running services;
  • a proxy (kube-proxy) running on every cluster node, which routes traffic to the individual services running on the node.
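
These components can be inspected with kubectl once a cluster is available; a minimal sketch (the kube-system namespace is where Kubernetes runs its own services, such as DNS and the proxy):

kubectl get nodes
kubectl get pods --namespace=kube-system
kubectl get services --namespace=kube-system
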
This is how a Kubernetes application is organized:
  • Namespaces, used to organize cluster resources and distinguish them across applications;
  • Pods, with a pod managing containers and volumes in a single computing environment (i.e., on a single cluster node); as such, a pod is the smallest deployable unit in K8s. A pod groups resources that necessarily have to run on the same node. Within a pod, resources share the same IP/hostname. Each pod has a YAML or JSON manifest, used to configure the pod in a declarative manner. Based on the manifest, K8s seeks a machine where the pod fits and instantiates it there. The pod manifest lists the containers the pod has to run and the minimum (requests) and maximum (limits) resources for each, as well as the exposed ports and mounted volumes. The pod can also expose HTTP endpoints for readiness and liveness health checks. For instance:

    apiVersion: v1
    kind: Pod
    metadata:
        name: podname
    spec:
        volumes:
            - name: "volume-name"
              hostPath:
                  path: "/path/on/host"
        containers:
            - image: container-image-name
              name: container-name
              resources:
                  requests:
                      memory: "256Mi"
                  limits:
                      memory: "512Mi"
              volumeMounts:
                  - mountPath: "/mount/point"
                    name: "volume-name"
              ports:
                  - containerPort: 8080
                    name: portname
                    protocol: TCP
              readinessProbe:
                  httpGet:
                      path: /pathrds
                      port: 8080
                  periodSeconds: 5
                  initialDelaySeconds: 0
                  timeoutSeconds: 1
                  successThreshold: 1
                  failureThreshold: 3
              livenessProbe:
                  httpGet:
                      path: /pathlvl
                      port: 8080
                  periodSeconds: 5
                  initialDelaySeconds: 0
                  timeoutSeconds: 1
                  successThreshold: 1
                  failureThreshold: 3


    with the readiness and liveness health checks reachable at the defined port and paths (which need to be forwarded to be queried from outside the cluster). Specifically, periodSeconds defines the interval between checks, while failureThreshold and successThreshold define the number of consecutive probes required before the check is considered failed or succeeded, respectively.
  • ReplicaSets, which specify how many replicas of a pod should be instantiated to meet higher demand and provide the service with self-healing capabilities. ReplicaSets are used to define scalable services; in a microservice architecture, a ReplicaSet can be used to control a specific microservice. The manifest specifies in a declarative way the desired state in terms of the number of replicas of the same pod, while a control loop monitors the current state, potentially terminating and restarting unresponsive pods to keep the system in the desired state. ReplicaSets manage sets of pods but are not directly coupled to them, i.e., they start and control them through the K8s API but are not tied to specific instances, since they are meant to control stateless services; this also allows adopting existing pods or adding additional ones at run time (elastic scaling). An example specification is reported below:

    apiVersion: apps/v1
    kind: ReplicaSet
    metadata:
        name: replicaset-unique-name
    spec:
        replicas: 2
        selector:
            matchLabels:
                label1: "value1"
        template:
            metadata:
                labels:
                    label1: "value1"
            spec:
                containers:
                    - name: service-name
                      image: "docker-image:version"


    as shown, the ReplicaSet simply wraps the pod template, adding metadata, a label selector and the number of pod replicas to keep running. The labels are used to discover and distinguish the pods and running containers belonging to a specific ReplicaSet. Based on those labels, the K8s API returns a different set of pods, consequently leading to different control actions (e.g. starting additional pods if there are not enough). A ReplicaSet can also be autoscaled, for instance based on computing resources such as CPU or memory (see the kubectl table below).
  • DaemonSets, used in contrast to ReplicaSets to define a service that should run on every individual node (thus the name daemon rather than replica) and that does not require coordination across different nodes (as a ReplicaSet does). An example specification is reported below:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
        name: unique-ds-name
        namespace: nsname
        labels:
            label: "value"
    spec:
        selector:
            matchLabels:
                label1: "value1"
        template:
            metadata:
                labels:
                    label1: "value1"
            spec:
                nodeSelector:
                    labelName: "value"
                containers:
                    ...


    as previously, the DaemonSet wraps a pod specification. The DaemonSet instantiates a pod on each cluster node, unless a nodeSelector is specified. In the example the labelName label is set to value, so that only nodes having that label set to that value can be used to host a pod (see the table below for how to set labels on nodes). A rolling update strategy can be set to automatically update all pods in the set.
  • Jobs are processes that are expected to terminate once they have fulfilled their purpose. In case a controlled pod fails, it is re-created based on the template specification. A job can be defined using the --restart=OnFailure option and started using the run command:

    kubectl run jobname \
    --image=dockerimage \
    --restart=OnFailure \
    -- --flag1 \
        --var1 val1


    alternatively, a YAML manifest can be specified, as usual:

    apiVersion: batch/v1
    kind: Job
    metadata:
        name: jname
        labels:
            ...
    spec:
        parallelism: 2
        template:
            metadata:
                ...
            spec:
                restartPolicy: OnFailure
                containers:
                    ...

    the parallelism option allows setting the number of pod replicas used to run the job in parallel, while the restartPolicy (Never or OnFailure) is required for jobs.
  • Deployments can be used to manage multiple versions and specify releases, which can be rolled out without any downtime. A deployment can be specified using the usual YAML format:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
        ...
    spec:
        replicas: 2
        selector:
            ...
        strategy:
            type: RollingUpdate
            rollingUpdate:
                maxUnavailable: 1
                maxSurge: 1
        template:
            ...


    with the strategy section defining how the rollout should take place: either RollingUpdate, which rolls out the new version without any downtime, or Recreate, which terminates all pods and re-creates them using the newer version. For RollingUpdate, maxUnavailable defines the maximum number (or percentage) of pods that can be unavailable during a rolling update; setting it to 0 means that no pod may become unavailable, so additional pods have to be started before older ones are terminated. maxSurge defines how many additional pods can be created during the process; setting it to the number of pods in the deployment (i.e. 100%) means ramping up a complete set of new-version pods before terminating any of the old ones. A short rollout workflow is sketched just below.
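
As a usage sketch (assuming the deployment manifest above is saved in a hypothetical file deployment.yaml and named deploymentname), the rollout can be applied and monitored from kubectl:

kubectl apply -f deployment.yaml
kubectl rollout status deployments deploymentname
kubectl rollout history deployment deploymentname
kubectl rollout undo deployment deploymentname

with "rollout undo" reverting to the previous revision if the new version misbehaves.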

Installing Kubernetes

The "kubeadm init" command can be used to firstly start a master node, and then join it from each worker node using the "kubeadm join". Visit this link to get started.

Running Kubernetes

Kubernetes can be run directly on one of the public cloud providers, such as Google (Google Kubernetes Engine) and Azure (Azure Kubernetes Service), or installed on Amazon EC2. Kubernetes can also be tested quickly using minikube, a single-node Kubernetes cluster, which can be started by simply running "minikube start".
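
A minimal local workflow, assuming minikube and kubectl are installed, could look as follows:

minikube start
kubectl get nodes
minikube dashboard
minikube stop

with the minikube VM showing up as the only cluster node and the dashboard command opening the Kubernetes UI in the browser.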

Deploying a Service

In K8s a deployment can be used to easily group services into a releasable artifact, which can be deployed without any downtime.

A service can be deployed using the run command:

kubectl run deploymentname \
--image=dockerimage \
--replicas=n \
--labels="lab1=val1,lab2=val2"

where the Docker image to be started is specified, along with the number of replicas and a set of labels used as metadata to annotate the instances.
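
To make the replicas reachable behind a single stable endpoint, the deployment can then be exposed as a Kubernetes service; a minimal sketch, where the port numbers are placeholders (80 being the service port and 8080 the container port):

kubectl expose deployment deploymentname \
--port=80 \
--target-port=8080 \
--type=NodePort

kubectl get services

with "kubectl get services" showing the assigned cluster IP and node port.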

The labels being used in the system can be retrieved with:
kubectl get deployments --show-labels

Similarly, we can filter the objects carrying a certain label with:
kubectl get deployments --selector="lab1=val1"
while "kubectl get deployments -L labelname" adds a column showing the value of that label.

Labels can also be added using the label command:
kubectl label deployments deploymentname "label=value"

Labels can similarly be added to replica and daemon sets, as well as to nodes:
kubectl label nodes nodename "label=value"
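
A label can later be removed again with the same command and a trailing dash, which is the standard kubectl syntax for deleting a label:

kubectl label deployments deploymentname "label-"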

Kubernetes Commands


A Kubernetes cluster is controlled from the kubectl CLI. Here are some commands for a quick reference:

  • kubectl get nodes: lists all the cluster nodes;
  • kubectl describe nodes node-id: returns information on a node (e.g. resources, number of pods);
  • kubectl --namespace=nsname: the flag is combined with other commands to select a specific namespace;
  • kubectl get pods: lists all pods running on the cluster;
  • kubectl apply -f pod-manifest.yaml: instantiates (or updates) the objects defined in the manifest YAML file;
  • kubectl logs podname: retrieves the logs of the podname pod;
  • kubectl describe pods podname: returns info on the running podname pod;
  • kubectl describe rs rsname: returns info on the running ReplicaSet rsname;
  • kubectl exec podname -- command: executes the command in the podname pod;
  • kubectl exec -it podname -- command: executes the command in an interactive session in the podname pod (e.g. bash);
  • kubectl port-forward podname lport:rport: forwards the local port lport to port rport on podname;
  • kubectl delete pods/podname: gracefully terminates and then deletes podname;
  • kubectl delete -f pod-manifest.yaml: gracefully terminates and then deletes the pod or ReplicaSet defined in pod-manifest.yaml;
  • kubectl edit deployment/deploymentname: fetches the deployment, opens its manifest in an editor and applies the changes on save;
  • kubectl scale rs rsname --replicas=n: forces the ReplicaSet to scale to n replicas;
  • kubectl scale deployment deploymentname --replicas=n: forces the deployment to scale to n replicas;
  • kubectl autoscale rs rsname --min=1 --max=10 --cpu-percent=80: autoscales the number of pods between 1 and 10 based on an 80% CPU utilization target;
  • kubectl get hpa: returns the defined horizontal pod autoscalers;
  • kubectl delete rs rsname: deletes the ReplicaSet rsname;
  • kubectl describe daemonset dsname: returns info on the dsname DaemonSet;
  • kubectl label nodes nodename "label=value": adds the label to the node metadata;
  • kubectl rollout status deployments dname: returns the status of a deployment rollout;
  • kubectl rollout pause deployments dname: pauses a deployment rollout;
  • kubectl rollout resume deployments dname: resumes a deployment rollout;
  • kubectl rollout history deployment dname: retrieves the deployment rollout history.

Have fun! 

Andrea.


Bibliography

  1. K. Hightower et al. Kubernetes: Up and Running. Dive into the Future of Infrastructure. O'Reilly, 2017.
  2. B. Beyer et al. Site Reliability Engineering: How Google Runs Production Systems. O'Reilly, 2016. Available online at https://landing.google.com/sre/book.
