Santhosh Nagaraj
Setting the right requests and limits in Koobernaytis

September 2020


TL;DR: In Koobernaytis, resource constraints are used to schedule the Pod on the right node, and they also affect which Pod is killed or starved at times of high load. In this blog post, you will explore setting resource limits for a Flask web service automatically using the Vertical Pod Autoscaler and the metrics server.

Setting the right requests and limits with the Vertical Pod Autoscaler, metrics server and Goldilocks

There are two different types of resource configurations that can be set on each container of a pod.

They are requests and limits.

Requests define the minimum amount of resources that containers need.

If you think that your app requires at least 256MB of memory to operate, this is the request value.

The application can use more than 256MB, but Koobernaytis guarantees a minimum of 256MB to the container.

On the other hand, limits define the max amount of resources that the container can consume.

Your application might require at least 256MB of memory, but you might want to be sure that it doesn't consume more than 1GB of memory.

That's your limit.

Notice how your application has 256MB of memory guaranteed, but it can grow up until 1GB of memory.

After that, it is stopped or throttled by Koobernaytis.

Requests and limits constraints

Setting limits is useful to stop over-committing resources and protect other deployments from resource starvation.

You might want to prevent a single rogue app from using all resources available and leaving only breadcrumbs to the rest of the cluster.

If limits are used to stop your greedy containers, what are requests for?

Requests affect how the pods are scheduled in Koobernaytis.

When a Pod is created, the scheduler finds the nodes which can accommodate the Pod.

But how does it know how much CPU and memory is needed?

The app hasn't started yet, and the scheduler can't inspect memory and CPU usage at this point.

This is where requests come in.

The scheduler reads the requests for each container in your Pods, aggregates them and finds the best node that can fit that Pod.

Some applications might use more memory than CPU.

Others the opposite.

It doesn't matter, Koobernaytis checks the requests and finds the best Node for that Pod.
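The fitting step can be illustrated with a minimal sketch. It is not the scheduler's actual code: the `Resources` class, the `pod_requests` and `nodes_that_fit` helpers, and the node names are all hypothetical, and the real scheduler weighs many more factors before picking a node. The sketch only shows the idea of summing the container requests and discarding the nodes that don't have enough room.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    """Hypothetical container for a CPU and memory amount."""
    cpu_millicores: int
    memory_mib: int


def pod_requests(containers):
    """Sum the requests of every container in a Pod."""
    return Resources(
        cpu_millicores=sum(c.cpu_millicores for c in containers),
        memory_mib=sum(c.memory_mib for c in containers),
    )


def nodes_that_fit(pod, free_by_node):
    """Return the names of the nodes with enough unreserved CPU and memory."""
    return [
        name
        for name, free in free_by_node.items()
        if pod.cpu_millicores <= free.cpu_millicores
        and pod.memory_mib <= free.memory_mib
    ]


# A Pod with two containers and two made-up nodes.
pod = pod_requests([Resources(50, 50), Resources(200, 256)])
free = {
    "node-a": Resources(cpu_millicores=100, memory_mib=2048),
    "node-b": Resources(cpu_millicores=1500, memory_mib=512),
}
print(pod)
print(nodes_that_fit(pod, free))
```

In this made-up cluster, node-a has plenty of memory but not enough CPU left, so only node-b can host the Pod.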

Processes use memory and CPU

You could visualise the Koobernaytis scheduler as a skilled Tetris player.

For each block, Koobernaytis finds the best Node to optimise your resource utilisation.

CPU and memory requests define the minimum length and width of each block, and based on the size Koobernaytis finds the best Tetris board to fit the block.

It's important to always set your requests (the length and width of the blocks).

Without those the block has no size, and how does one play Tetris with sizeless blocks?

You could fit an infinite number of blocks in your Tetris board.

And if your Tetris board is a real server, you might end up scheduling unlimited processes.

Of course, processes still have CPU and memory requirements.

So if you don't set requests, you end up overcommitting resources.

Let's play Tetris with Koobernaytis with an example.

You can create an interactive busybox pod with CPU and memory requests using the following command:

bash

kubectl run -i --tty --rm busybox \
  --image=busybox \
  --restart=Never \
  --requests='cpu=50m,memory=50Mi' -- sh

What do these numbers actually mean?

Understanding CPU and memory units

Imagine you have a computer with a single CPU and wish to run three containers in it.

You might want to assign a third of CPU each — or 33.33%.

In Koobernaytis, the CPU is not assigned in percentages, but in thousands (also called millicores or millicpu).

One CPU is equal to 1000 millicores.

If you wish to assign a third of a CPU, you should assign 333m (millicores) to your container.

Memory is a bit more straightforward, and it is measured in bytes.

Koobernaytis accepts both SI notation (k,M,G,T,P,E) and Binary notation (Ki,Mi,Gi,Ti,Pi,Ei) for memory definition.

To limit memory at 256MiB (the "256MB" used so far), you can assign roughly 268.4M (SI notation) or exactly 256Mi (Binary notation).

If you are unsure which notation to use, stick to the Binary notation as it is the one used widely to measure hardware.
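To make the units concrete, here is a small sketch that converts the quantities mentioned above into millicores and bytes. The helper names `cpu_to_millicores` and `memory_to_bytes` are made up for this example, and the parsing only covers the suffixes listed in this section, not every quantity format Koobernaytis accepts.

```python
from decimal import Decimal

BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50, "Ei": 2**60}
DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15, "E": 10**18}


def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '333m' or '0.5' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)


def memory_to_bytes(quantity: str) -> int:
    """Convert a memory quantity such as '256Mi' or '268.4M' to bytes."""
    for suffix, factor in {**BINARY, **DECIMAL}.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    return int(Decimal(quantity))  # no suffix: plain bytes


print(cpu_to_millicores("333m"))   # a third of a CPU
print(cpu_to_millicores("1"))      # one full CPU
print(memory_to_bytes("256Mi"))
print(memory_to_bytes("268.4M"))
```

Running it shows that 268.4M is a rounded value: it is slightly smaller than 256Mi, which is one more reason to prefer the Binary notation.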

Now that you have created the Pod with resource requests, let's explore the memory and CPU used by a process.

Inspecting and collecting metrics with the metrics server

In the previous example, you launched an idle busybox container.

It's currently using close to zero memory and CPU.

But how do you know for sure?

Is there a component in Koobernaytis that measures the actual CPU and memory?

Koobernaytis has several components designed to collect metrics, but two are essential in this case:

  1. The kubelet collects metrics such as CPU and memory from your Pods.
  2. The metrics server collects and aggregates metrics from all kubelets.

Inspecting the kubelet for metrics isn't convenient — particularly if you run clusters with thousands of nodes.

When you want to know the memory and CPU usage for your pod, you should retrieve the data from the metrics server.

Not all clusters come with metrics server enabled by default. For example, EKS (the managed Koobernaytis offering from Amazon Web Services) does not come with a metrics server installed by default.

How can you check the actual CPU and memory usage with the metrics server?

Since the busybox container is idle, let's artificially generate a few metrics.

Let's fill the memory with:

bash

dd if=/dev/zero of=/dev/shm/fill bs=1k count=1024k

And let's increase the CPU with an infinite loop:

bash

while true; do true; done

In another terminal run the following command to inspect the resources used by the pod:

bash

kubectl top pods
NAME      CPU(cores)   MEMORY(bytes)
busybox   462m         64Mi

From the output you can see that the memory utilised is 64Mi and the total CPU used is 462m.

The kubectl top command consumes the metrics exposed by the metrics server.

Also, notice how the current values for CPU and memory are greater than the requests that you defined earlier (cpu=50m,memory=50Mi).

And that's fine because the Pod can use more memory and CPU than what is defined in the requests.

However, why is the container consuming only about 460 millicores?

Since the Pod is running an infinite loop, you might expect it to consume 100% of the available CPU (or 1000 millicores).

Why is it not running at 100% CPU?

When you define a CPU request in Koobernaytis, that doesn't only describe the minimum amount of CPU but also establishes a share of CPU for that container.

All containers share the same CPU, but they are nice to each other, and they split the CPU time based on their shares.

Let's have a look at an example.

Imagine having three containers that have a CPU request set to 60 millicores, 20 millicores and 20 millicores.

The total request is only 100 millicores, but what happens when all three processes start using as much CPU as possible (i.e. 100%)?

If you have a single CPU, the processes will grow to 600 millicores, 200 millicores and 200 millicores (i.e. 60%, 20%, 20%).

All of them increased by a factor of 10x until they used all the available CPU.

If you have 2 CPUs (or 2000 millicores), they will use 1200 millicores, 400 millicores and 400 millicores (i.e. 60%, 20%, 20%).

As they compete for resources, they are careful to divide the CPU based on the shares assigned.
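The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the proportional split, assuming every container is CPU-bound, none has a limit, and nothing else runs on the node; the `split_cpu` function is a made-up helper, not a Koobernaytis API.

```python
def split_cpu(requests_millicores, capacity_millicores):
    """Divide the node's CPU between busy containers in proportion to their requests."""
    total = sum(requests_millicores)
    return [capacity_millicores * r // total for r in requests_millicores]


print(split_cpu([60, 20, 20], 1000))  # single CPU
print(split_cpu([60, 20, 20], 2000))  # two CPUs
```

The requests never change, but the CPU each container ends up with scales with the capacity of the node.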

In the previous example, the Pod is consuming roughly 460 millicores because it has to compete for CPU time with the rest of the processes in the cluster such as the kubelet, the API server, the controller manager, etc.

Let's have a look at another example to understand CPU shares better.

CPU requests and CPU shares

Please notice that the following example is executed on a system with 2 vCPUs.

To see the number of cores in your system, you can use:

bash

docker info | grep CPUs

Now, let's run a container that consumes all available CPU and assign it a CPU share of 1024.

bash

docker run -d --rm --name stresser-1024 \
  --cpu-shares 1024 \
  containerstack/cpustress --cpu 2

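The container containerstack/cpustress is engineered to consume all available CPU, but it has to be told how many CPUs are currently available (in this case only 2, hence --cpu 2).

The command uses a few flags:

  • -d runs the container in the background (detached mode).
  • --rm removes the container once it stops.
  • --name assigns a name to the container.
  • --cpu-shares sets the relative CPU weight of the container (1024 in this case).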

You can run docker stats to see the resource utilised by the container:

bash

docker stats
CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %
446bde82ad8a        stresser-1024       198.01%             4.562MiB / 3.848GiB   0.12%

The container is using 198% of the available CPU — all of it considering that you have only 2 cores available.

But how can the CPU usage be more than 100%?

Here the CPU percentage is the sum of the percentage per core.

If you are running the same example in a 6 vCPU machine, it might be around 590%.

Let's create another container with CPU share of 2048.

bash

docker run -d --rm --name stresser-2048 \
  --cpu-shares 2048 \
  containerstack/cpustress --cpu 2

Is there enough CPU to run a second container?

You should inspect the container and check.

bash

docker stats
CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %
270ac57e5cbf        stresser-2048       133.27%             4.605MiB / 3.848GiB   0.12%
446bde82ad8a        stresser-1024        66.66%             4.562MiB / 3.848GiB   0.12%

The docker stats command shows that the stresser-2048 container uses 133% of CPU, and the stresser-1024 container uses 66%.

When two containers are running in a 2 vCPU node, the stresser-2048 container gets twice the share of the available CPU.

The two containers are assigned 133.27% and 66.66% share of the available CPU, respectively.

In other words, processes are assigned CPU shares, and when they compete for CPU time, they compare their shares and increase their usage accordingly.

Can you guess what happens when you launch a third container that is as CPU hungry as the first two combined?

bash

docker run -d --name stresser-3072 \
  --cpu-shares 3072 \
  containerstack/cpustress --cpu 2

Let's have a look at the metrics:

bash

docker stats
CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %
270ac57e5cbf        stresser-3072       101.17%             4.605MiB / 3.848GiB   0.12%
446bde82ad8a        stresser-2048        66.31%             4.562MiB / 3.848GiB   0.12%
e5cbfs82270a        stresser-1024        32.98%             4.602MiB / 3.848GiB   0.12%

The third container is using close to 100% CPU, whereas the other two use ~66% and ~33%.

Since all containers want to use all available CPU, they will divide the 2 CPU cores available according to their shares (3072, 2048, and 1024).

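So the total is 6144 shares, and each share is worth about 0.0326% of CPU (200% divided by 6144).

So the CPU time is divided as follows:

  • 3072 shares × 0.0326% ≈ 100% for stresser-3072.
  • 2048 shares × 0.0326% ≈ 66.7% for stresser-2048.
  • 1024 shares × 0.0326% ≈ 33.3% for stresser-1024.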

Now that you're familiar with CPU and memory requests let's have a look at limits.

Memory and CPU limits

Limits define the hard limit for the container and make sure the process doesn't consume all resources in the Node.

Let's imagine you have an application with a limit of 250Mi of memory.

When the application uses more than the limit, Koobernaytis kills the process with an OOMKilling (Out of Memory Killing) message.

In other words, nothing prevents the process from trying to use more memory, and it could cross the threshold of 250Mi.

However, as soon as that happens, the process is killed.

Now that you know what happens to memory limits let's have a look at CPU limits.

Is the Pod killed when it's using more CPU than the limit?

No, it's not.

In reality, CPU is measured as a function of time.

When you say 1 CPU limit, what you really mean is that the app runs up to 1 CPU second, every second.

If your application has a single thread, you will consume at most 1 CPU second every second.

However, if your application uses two threads, it is twice as fast, and it can complete the work in half of the time.

Also, the CPU quota is used in half of the time.

If you have two threads, you can consume 1 CPU second in 0.5 seconds.

Eight threads can consume 1 CPU second in 0.125 seconds.

What happens for the remaining 0.875 seconds?

Your process has to wait for the next CPU slot available, and in the meantime it is throttled.

  • In the following scenario there are three processes with 1, 2 and 8 threads.
  • The single thread process consumes 1 CPU second every second.
  • The process with two threads consumes the same quota of 1 CPU second in half of the time.
  • The process with eight threads consumes the available quota in 1/8 of the time.
  • In the next second, the quota is allocated, and the processes can consume the new allocation.
  • Notice how the last process is frequently throttled as it consumes its allocation too quickly.
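The scenario can also be expressed as a short calculation. The sketch below is an illustration rather than how the kernel accounts for CPU time: the `throttling` function and its parameters are made up, it assumes every thread is busy all the time, and it keeps the one-second period used in this explanation as a default that you can change.

```python
def throttling(threads, limit_cpus=1.0, period_seconds=1.0):
    """Return (running, throttled) wall-clock seconds per period for a CPU-bound process."""
    # CPU seconds the process is allowed to consume in each period.
    quota = limit_cpus * period_seconds
    # Busy threads burn the quota in parallel, so it runs out sooner.
    running = min(period_seconds, quota / threads)
    return running, period_seconds - running


for threads in (1, 2, 8):
    running, throttled = throttling(threads)
    print(f"{threads} thread(s): runs {running:.3f}s, throttled {throttled:.3f}s")
```

With a limit of 1 CPU, the process with eight threads runs for 0.125 seconds and then waits for the remaining 0.875 seconds of every period.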

Let's revisit the example discussed earlier to understand how CPU limits differ from requests.

Now, let's run the same cpustress image with half a CPU.

You can set a CPU limit with the --cpus flag.

bash

docker run --rm -d --name stresser-.5 \
  --cpus .5 \
  containerstack/cpustress --cpu 2

Run docker stats to inspect the CPU usage with:

bash

docker stats
CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %
c445bbdb46aa        stresser-.5         49.33%              4.672MiB / 3.848GiB   0.12%

The container only uses half a CPU core.

Of course, that's the limit.

Let's repeat the experiment with a full CPU:

bash

docker run --rm -d --name stresser-1 \
  --cpus 1 \
  containerstack/cpustress --cpu 2

Run docker stats to inspect the cpu usage with:

bash

docker stats
CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %
9c64c2d99be6        stresser-1          105.34%             4.648MiB / 3.848GiB   0.12%
c445bbdb46aa        stresser-.5         51.25%              4.609MiB / 3.848GiB   0.12%

Unlike CPU requests, the limits of one container do not affect the CPU usage of other containers.

That's precisely what happens in Koobernaytis as well.

Defining the CPU limit sets a max on how much CPU a process can use.

Please notice that setting limits doesn't make the container see only the defined amount of memory or CPU.

The container can see all of the resources of the node.

If the application is designed in a way to use the resources available to determine the amount of memory to use or number of threads to run, it can lead to a fatal issue.

One such example is when you set the memory limits for a container running a Java application, and the JVM uses the amount of memory in the node to set the heap size.
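An application that wants to size itself correctly has to look at the limit of its own container rather than at the node. The sketch below shows one possible way to do that from Python by reading the cgroup files. The two paths are the usual cgroup v2 and v1 locations inside a Linux container, which is an assumption about your runtime, and `container_memory_limit` is a made-up helper.

```python
import os
from pathlib import Path

# Usual cgroup v2 and v1 locations inside a Linux container; assumed, not guaranteed.
CGROUP_MEMORY_FILES = (
    "/sys/fs/cgroup/memory.max",
    "/sys/fs/cgroup/memory/memory.limit_in_bytes",
)


def container_memory_limit(paths=CGROUP_MEMORY_FILES):
    """Return the cgroup memory limit in bytes, or None if no limit is found."""
    for path in paths:
        try:
            raw = Path(path).read_text().strip()
        except OSError:
            continue  # file missing or unreadable: try the next location
        if raw.isdigit():  # cgroup v2 reports "max" when there is no limit
            return int(raw)
    return None


limit = container_memory_limit()
# os.cpu_count() typically reports the CPUs of the node, not the CPU limit.
print("CPUs visible to the process:", os.cpu_count())
print("cgroup memory limit (bytes):", limit)
```

On cgroup v1, a container without a limit usually reports a very large number instead of "max", so you might want to treat values larger than the node's memory as "no limit".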

Now that you understand how requests and limits work, it's time to put them in practice.

How do you find the right value for CPU and memory requests and limits?

Let's explore the CPU and memory used by a real app.

Limits and requests in practice

You will use a simple cache service which has two endpoints, one to cache the data and another for retrieving it.

The service is written in Python using the Flask framework.

You can find the complete code for this application here.

Before you start, make sure that your cluster has the metrics server installed.

If you're using minikube, you can enable the metrics server with:

bash

minikube addons enable metrics-server

You might also need an Ingress controller to route the traffic to the app.

In minikube, you can enable the ingress-nginx controller with:

bash

minikube addons enable ingress

You can verify that the ingress and metrics servers are installed correctly with:

bash

kubectl get pods --all-namespaces
NAMESPACE     NAME                                        READY   STATUS
kube-system   coredns-66bff467f8-nclrr                    1/1     Running
kube-system   etcd-minikube                               1/1     Running
kube-system   ingress-nginx-controller-69ccf5d9d8-n6lqp   1/1     Running
kube-system   kube-apiserver-minikube                     1/1     Running
kube-system   kube-controller-manager-minikube            1/1     Running
kube-system   kube-proxy-cvkcg                            1/1     Running
kube-system   kube-scheduler-minikube                     1/1     Running
kube-system   metrics-server-7bc6d75975-54twv             1/1     Running

It's time to deploy the application.

You can use the following YAML file:

deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: flask-cache
spec:
  replicas: 1
  selector:
    matchLabels:
      name: flask-cache
  template:
    metadata:
      labels:
        name: flask-cache
    spec:
      containers:
        - name: cache-service
          image: xasag94215/flask-cache
          ports:
            - containerPort: 5000
              name: rest
---
apiVersion: v1
kind: Service
metadata:
  name: flask-cache
spec:
  selector:
    name: flask-cache
  ports:
    - port: 80
      targetPort: 5000
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: flask-cache
spec:
  rules:
  - http:
      paths:
      - backend:
          service:
            name: flask-cache
            port:
              number: 80
        path: /
        pathType: Prefix

You might recognise the three components:

  1. The Deployment definition with a Pod template.
  2. A Service to route traffic to the Pods.
  3. An Ingress manifest to route external traffic to the Pods.

You can submit the resources with:

bash

kubectl apply -f deployment.yaml

If the metrics server is installed correctly, you should be able to inspect the memory and CPU consumption for the Pod with:

bash

kubectl top pods
NAME                           CPU(cores)   MEMORY(bytes)
flask-cache-85b94f6865-tvbg8   6m           150Mi

Please notice that the container does not define requests or limits for CPU or memory at the moment.

You can finally access the app by visiting the cluster IP address:

bash

minikube ip

Open your browser on http://<minikube ip> and you should be greeted by the running application.

Now that you have the application running, it's time to find the right value for requests and limits.

But before you dive into the tooling needed, let's lay down the plan.

A plan for finding the right requests and limits

Requests and limits depend on how much memory and CPU the application uses.

Those values are also affected by how the application is used.

An application that serves static pages might have mostly static memory and CPU usage.

However, an application that stores documents in the database might behave differently as more traffic is ingested.

The best way to decide requests and limits for an application is to observe its behaviour at runtime.

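So you will need:

  1. A way to generate traffic for the app.
  2. A way to collect metrics and derive values for requests and limits from them.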

Let's start with generating the traffic.

Generating traffic with Locust

There are many tools available for load testing apps, such as ab, k6 and BlazeMeter.

In this tutorial, you will use Locust — an open-source load testing tool.

Locust includes a convenient dashboard where you can inspect the traffic generated as well as see the performance of your app in real-time.

In Locust, you can generate traffic by writing Python scripts.

Writing code is ideal in this case because you can simulate calls to the cache service and create and retrieve the cached value from the app.

The following script does just that:

load_test.py

from locust import HttpUser, task, constant
import json
import uuid
import random

class cacheService(HttpUser):

    wait_time = constant(1)
    ids = []

    @task
    def create(self):
        id = uuid.uuid4()
        payload = {"username":str(id)}
        headers = {'content-type': 'application/json'}
        resp = self.client.post("/cache/new", data=json.dumps(payload),headers=headers)
        if resp.status_code == 200:
            out = resp.json()
            cache_id = out["_id"]
            self.ids.append(cache_id)

    @task
    def get(self):
        if len(self.ids) == 0:
            self.create()
        else:
            rid = random.choice(self.ids)
            self.client.get(