Dive into managing Kubernetes computational resources

Woj Sierakowski
HMH Engineering
May 25, 2022

A hamster in the middle of accumulating resources so that nobody can take them away. Image by Bierfritze@pixabay

Not that long ago, we used to develop our applications to run standalone on a single machine and use as many resources as they needed and could get. But we had to evolve to become more efficient in utilizing compute resources, and multi-tenancy became the standard.

Experience tells us that good fences make good neighbours, and this applies to the real world as well as to Kubernetes (K8s) deployments. K8s, like any other distributed system designed to share computational resources between applications, is responsible for sharing those resources fairly. But we, the users, need to help K8s do the hard work by carefully allocating resources to our applications to ensure they don’t interfere with other deployments.

As the systems built to manage resources got better at what they do, not surprisingly, they also got more advanced and more challenging to understand. It takes a bit of practice and experience to get a good handle on how to configure K8s deployments to provide just enough computational resources — enough to ensure the application runs efficiently while at the same time avoiding unnecessary expenditure for unused resources. Without good foundations, it is easy to make mistakes. Incorrect resource configuration will likely lead to many problems that aren’t easy to troubleshoot.

Here are some of the key questions that we need to answer:

  • how can we control resource allocation for an instance?
  • is it possible to let the application consume more resources if it needs them?
  • how does K8s handle application instances deployed on the same worker node so that they don’t compete for the same resources?

I hope this article will help you in your journey to becoming an efficient user of K8s. In the first part, we will start by discussing the types of computational resources used in K8s. Then we will learn about limits and requests, why they are there and how they are used by the scheduler and container runtime. Finally, we will look into the difference between the compressible and non-compressible types of resources, and find out what happens when the app uses more resources than it asked for and much more.

1. Resource types in K8s

1.1 Knowing current resource consumption

At any time (providing our account has been granted enough privileges) we can use kubectl to check how much CPU and memory is consumed by the applications deployed to our K8s cluster.

$ kubectl top pod prints information about all pods in a given namespace:

The output of kubectl top pod command

$ kubectl top pod --containers will also include information about containers in pods including sidecars:

The output of kubectl top pod command with view on containers

1.2 Setting resource constraints in K8s

K8s uses two types of constraints, set per container in a pod, to allocate CPU and memory: requests and limits. They can be set for all standard workload types such as Deployments, StatefulSets, Jobs and DaemonSets, or directly on a Pod as in the example below.

Resource requests and limits
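As a minimal sketch of such a Pod manifest (the names, image and values are illustrative, not the exact snippet from the original figure):

apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
    - name: demo-app
      image: demo-app:1.0.0 # illustrative image name
      resources:
        requests:
          cpu: 100m
          memory: 128Mi
        limits:
          cpu: 250m
          memory: 256Mi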

If you are new to managing resources in K8s, these are the right questions to ask when looking at the resource configuration section in the pod manifest above:

  • what is going to happen when:
    - we don’t specify resources?
    - we don’t request enough resources?
    - we request too many resources?
  • what are those strange units such as “m” for CPU and “Mi” for memory?

Keep reading and you will know the answers to those questions!

1.3 CPU units

Most commonly we use mCPU, also called millicores, as the CPU unit in K8s. One millicore is 1/1000 of a core.

But what is the K8s Core? How many cores do the worker nodes that run your deployments have?

In most cases, 1 K8s core is 1 CPU unit on the host which might represent 1 physical CPU core on the physical host or one virtual CPU on a virtual machine.

What is K8s core?

Ok, we now know what a K8s core is, but what is a vCPU on a cloud provider?

On AWS, a vCPU corresponds to a single thread (hyperthread) of a physical CPU core. Note that there are some exceptions, for example T2-type EC2 instances, which have one thread per core. To verify how many threads the EC2 instance type you use has, refer to this page: cpu-options-supported-instances-values.

The diagram below illustrates this for a processor with 4 physical cores: it has 8 threads and therefore provides 8 vCPUs.

Physical cores mapping into vCPUs in the cloud environment (ref: https://aws.amazon.com/ec2/physicalcores/)

When setting up our CPU requirements in the manifest file we provide values representing cores, most often fractions of cores. Here are some examples of “CPU math”:

  • 1000m (mCPU/millicores) = 1 core (1 vCPU)
  • 2000m = 2 cores, 500m = half core
  • 250m = ¼ of a core

A few more practical examples:

  • a worker node with 4 cores provides 4000m capacity in total
  • a single core can run 4 x 250m pods
  • A 4 core node can run 16 pods each having 250m

Note that CPU is requested as an absolute quantity, not as a relative quantity, 250m is the same amount of CPU on a single-core, dual-core, or 48-core machine (we will come back to this later in the CPU shares section).

Pods will not be scheduled if they require more than the node’s capacity. A pod cannot have a configuration that requires 3000m on a 2-core node.

In the example snippet below we are creating a resource requirement of 100 millicores (1/10 of a core) for the request and 250 millicores (1/4 of a core) for the limit. More on what requests and limits are later in the article.

Deployment manifest snippet specifying CPU resource requirements
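For illustration, such a CPU requirement could look like this (a sketch, not the exact snippet from the original figure):

...
resources:
  requests:
    cpu: 100m
  limits:
    cpu: 250m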

We will return to CPU resources in a later section of this article to learn about CPU shares and CPU quota.

1.4 Memory units

We are used to working with kilobytes, megabytes and gigabytes, but it has become standard practice to use kibibytes, mebibytes and gibibytes when working with distributed platforms.

Kibibytes were designed to replace kilobytes in those computer science contexts in which the term kilobyte is used to mean 1024 bytes. Interpreting kilobyte as 1024 bytes collides with the SI (International System of Units) definition of the prefix kilo, which means 1000. So to make this clearer, we now distinguish between kilo, mega, giga and tera using base 10 and kibi, mebi, gibi and tebi using base 2.

Here are a few examples of translation between those units:

  • Mebibyte (Mi): 1 MiB = 1.048 MB
  • Gibibyte (Gi): 1 GiB = 1.074 GB
  • Tebibyte (Ti): 1 TiB = 1.0995 TB

SI to Binary conversion table

K8s actually accepts both SI notation (k, M, G, T, P, E) and binary notation (Ki, Mi, Gi, Ti, Pi, Ei) for memory definitions. For example, 268.4M (SI notation) and 256Mi (binary notation) describe roughly the same amount of memory.

Google makes it easy to translate from one notation to another:

Units conversion on the Google search page

Here is an example of a configuration that specifies 128 mebibytes for the memory request and 256 mebibytes for the memory limit.

Deployment manifest snippet specifying memory resource requirements
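A sketch of the corresponding manifest fragment:

...
resources:
  requests:
    memory: 128Mi
  limits:
    memory: 256Mi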

1.5 Other types of resources

Apart from CPU and memory, there are other types of resources in K8s, such as local ephemeral storage (node-local scratch space used for things like emptyDir volumes, logs and the container’s writable layer), storage time, storage space, storage operations, network bandwidth and network operations (many of them are yet to be implemented in K8s). In this article, we will focus just on CPU and memory.

Pod manifest snippet with ephemeral storage resource requirements
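For example, an ephemeral-storage requirement could be expressed like this (illustrative values):

...
resources:
  requests:
    ephemeral-storage: 1Gi
  limits:
    ephemeral-storage: 2Gi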

2. Limits and requests

2.1 Requesting and limiting resources

As we saw in the little code snippets in the previous paragraph, K8s uses the concept of a “Resource Request” and a “Resource Limit” when defining how many resources a container within a pod should receive.

Deployment manifest snippet specifying resource requests and limits

Resource request — specifies the minimum amount of resources a container needs to successfully run. This is a guarantee from K8s that you’ll always have this amount of either CPU or memory allocated to the container.

We set resource requests to declare how many resources our containers need to run in a normal operation.

Resource limit — the maximum amount of CPU or memory that can be used by a container. Limits prevent containers from taking up more resources on the cluster than you’re willing to let them.

We set resource limits to declare how much memory or CPU our containers can occasionally use.

requests ≤ limits

Requests can never be higher than limits, but they can be equal; when requests equal limits for both CPU and memory on every container, the pod gets the Guaranteed QoS class.

If this still doesn’t make sense, it might be easier to understand this by looking at the visualisation below.

Requests and limits meaning for the worker node

The resource requirements are specified per container in a pod. If you have multiple containers, each of them needs its own resource requirements specification.

The CPU and memory requests values provide the guaranteed amount of resources that the container will always get. The limits value represents the maximum the container can ever get. Going over the limit on CPU means the container gets throttled; going over the limit on memory means the container gets killed. More on that later.

Another perspective on requests and limits is that requests matter at schedule time, as they allow K8s to find a worker node that can accommodate the pod, while limits matter at runtime, as they are enforced on the node to constrain how many resources a container can actually consume.

Deployment manifest snippet specifying the purpose of resource requests and limits

K8s delegates managing limits to the container runtime (Docker/containerd) and the container runtime delegates it to the Linux kernel, specifically to the cgroups (more on that later).

2.2 Resource request and the scheduler

To really understand how resource request works, we need to take a deeper dive under the hood of K8s and understand the role of the scheduler.

The scheduler is one of the key components of the K8s’ control plane responsible, as the name suggests, for scheduling pods into nodes.

K8s architecture highlighting Control Plane and the scheduler component

The scheduler watches the API server for newly created pods (to be precise, pod objects in etcd, the distributed key-value store) that have no nodeName field assigned, which means they haven’t been scheduled to a worker node yet; so far the pod is only an intention. For every pod that the scheduler discovers, it becomes responsible for finding the best node to run that pod on.

The role of the scheduler and its relationship with nodes and pods

But the question is — how to find the best node?

The scheduler will select the best node for that pod by asking two questions.

Question 1: Do you have what it takes to run this pod?

It will filter the list of available nodes to exclude those the pod can’t fit on. It goes through a list of checks (called predicates) that each resolve to either true (yes, deploy the pod on that node) or false (no, don’t deploy on that node).

Examples of predicates:

  • does a node have enough CPU, memory and ports
  • does a node have the desired label (if we specified label requirement)
    - for example, a dedicated node with high IOPS SSD for DBs only
  • and more
The scheduler scheduling a pod

To get a better idea of what a predicate looks like, we can look at the K8s scheduler source code and PodFitsResources:

The scheduler source code snippet with predicate

The scheduler doesn’t look at how much of each individual resource is being used at the time of scheduling but at the sum of resources requested by the existing pods deployed on the node.

Even though existing pods may be using less than what they’ve requested, scheduling another pod based on actual resource consumption would break the guarantee given to the already deployed pods (the values they put into their resource request). There is one exception to that — a higher priority pod can force eviction of lower priority pods.

Can’t schedule a pod on a node that has too much memory reserved

In the diagram above we can see two pods running on the worker node. Based on those pods' actual resource usage, it may seem that the node has enough room to accommodate another pod, but it doesn't, because the unused resources are "reserved" by the existing pods' requests.
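You can see this reservation for yourself: kubectl describe node shows the node's allocatable capacity next to the sum of requests already made by the pods scheduled on it (the output below is abbreviated and the numbers are illustrative):

$ kubectl describe node <node_name>
...
Allocatable:
  cpu:     1930m
  memory:  7474Mi
...
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests      Limits
  --------  --------      ------
  cpu       1750m (90%)   2600m (134%)
  memory    5120Mi (68%)  7168Mi (95%)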

Question 2: Are you a better candidate to run this pod?

The list of nodes that have passed the predicates checks now goes through the second question. This time the scheduler asks whether a given node is a better candidate to run this pod than others and uses so-called “priorities”. A “priorities” decision returns a score and the node with the highest score is chosen for the pod deployment.

Some examples of priorities:

  • are there pods for this deployment that are already on that node (spread function)?
  • has an image for containers in this pod been downloaded already on that node?
  • LeastRequestedPriority — favour nodes with fewer requested resources (with a greater amount of unallocated resources)
    - spread CPU load evenly across all nodes
  • MostRequestedPriority — favour nodes that have the most requested resources (a smaller amount of unallocated CPU and memory)
    - this is to guarantee that K8s will use the smallest possible number of nodes while still providing each pod with the amount of CPU/memory it requests
  • and more

2.3 Pod scheduling scenario

Imagine we have a cluster with two nodes and we want to schedule four pods on it.

Pod scheduling scenario (part 1)

The first three pods are scheduled without any problem as there are enough resources to accommodate them. But the fourth pod wasn't so lucky: it required 400 millicores and 300MiB of memory, but neither node 1 nor node 2 has enough CPU capacity left. This pod will stay in the Pending state until resources free up.

Pod scheduling scenario (part 2)
kubectl get pods output showing pod’s pending status
kubectl describe pod output showing a pod not scheduled due to insufficient resources
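For reference, the events section of the describe output for such a pod typically contains a message along these lines (an illustrative reconstruction; the exact wording depends on the K8s version):

Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  12s   default-scheduler  0/2 nodes are available: 2 Insufficient cpu.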

This pod may eventually get deployed if:

  • other pods on the existing nodes are terminated (a deployment is scaled in)
  • new nodes become available (scaling out manually or with autoscaling)

2.4 System pods on nodes

Not all of a node's resources are available for users' pods. Some are pre-allocated to system pods in the kube-system namespace and system tasks, as well as to other deployments that might be part of your K8s stack, such as a service mesh, ingress, observability agents, and key-value or secret storage.

kubectl describe node output showing system deployments

3. Resources configuration

Three Scenarios

In this part, we are going to cover three scenarios to learn what happens if we deploy our service with or without specifying the requests and the limits values.

Three resources configurations

Scenario 1: no request and no limits

As an illustration, imagine the scheduler’s responsibility is to align Tetris blocks (pods) on a board (node).

Tetris (championship edition) on ZX Spectrum — my first ever personal computer

If we didn't set the size of the Tetris blocks, the player would be able to fit an infinite number of blocks on the board. Similarly, if we don't set resource requests and limits for our containers, the K8s scheduler will keep scheduling more and more pods on a node. Containers will be able to use as many resources as are available on the node, until other pods with resource requests set in their configurations are scheduled on that same node and claim the resources they reserved.

Pods that are important to you should have containers with requests set; otherwise, their containers can be starved of CPU time by pods with a higher QoS class.

The illustration below presents this scenario for CPU resources: we have two pods, pod A with no requests specified and pod B with requests set. For as long as pod B doesn't need its reserved resources, pod A can use everything that is available. But as soon as pod B claims the resources it reserved, they are taken away from pod A.

Visualisation of a scenario when a pod is scheduled without specifying limits

Fortunately, K8s admins usually configure clusters so that if resources aren't specified in your deployment manifest, some predefined default values are applied. They may not be enough for what your app requires, or they could be more than is needed, so it is always best to be explicit and set them in the deployment configuration.

There are legitimate scenarios for having pods with no CPU requests set on their containers. One example is batch processing without any SLA: the batch process will use up any free CPU for as long as it is available and might get throttled at any time when more important pods need those resources.

I will end this paragraph by sharing a brilliant blog post from goteleport that describes an incident they ran into where none of the deployed services had requests or limits defined. One application consumed 100% of the CPU over a 40-minute period. As a consequence of "cannibalizing" all CPU resources, the node suffered severe degradation of its services. By using requests and limits, we give K8s a weapon to keep rogue applications in line and ensure stability.

Bad neighbour consuming all resources (Y scale) for 40 minutes (source)

Scenario 2a: with CPU request, no limits

In this scenario we have a deployment configuration with CPU requests value set and CPU limits value not set, as in the snippet below:

# Pod A
...
resources:
  requests:
    cpu: 200m
    memory: 50Mi
  # limits not set

# Pod B
...
resources:
  requests:
    cpu: 1000m
    memory: 50Mi
  # limits not set

To understand the consequences of this configuration, let's imagine we have a node with 2 cores (2000 millicores) and two pods scheduled on it, pod A requesting 200m and pod B requesting 1000m:

  • If one pod wants to use up as much CPU as it can, while the other one is sitting idle, the first one will be allowed to use the whole CPU time
  • If both pods consume as much CPU as they can, the first pod will get one-sixth of the CPU time and the other one the remaining five-sixths
  • Because your first pod requested 200m and the other one 1000m, any unused CPU will be split among the two pods in a 1 to 5 ratio
Example of a scenario when a pod is scheduled with requests but no limits (inspired by an example from the Kubernetes in Action book)

Scenario 2b: with memory request, no limits

In this scenario we have a deployment configuration with memory requests value set and memory limits value not set, as in the snippet below:

...
resources:
  requests:
    cpu: 200m
    memory: 50Mi
  # limits not set

With this configuration, a container (in a pod) running on a worker node may eat up all the available memory which in turn will affect other pods on that same node. Memory leaks aren’t uncommon and the consequences of not having the limit set will likely be dramatic.

Yummy memory, I want more!

Scenario 3: with limits but no requests

When we specify limits but no requests, the limits values will also be used as the requests.

Deployment manifest snippet specifying only the values for resource limits
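A sketch of such a configuration (illustrative values):

...
resources:
  limits:
    cpu: 250m
    memory: 256Mi
  # requests not set: K8s will copy the limits into the requests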

In this case, the application is guaranteed the requested resources and can never use more than that, since the limit value equals the request value. If all containers in a pod have requests equal to limits for both CPU and memory, the pod gets the Guaranteed QoS class.

4. Resource limits

4.1 Compressible vs incompressible resources

What will happen if the application tries to use more resources than what is set in the limits? This will depend on the type of resource.

  • CPU is a compressible resource (also called a shareable resource). This means a container's CPU usage can be throttled back without affecting the process in an adverse way. The application will continue to function, it just won't get more CPU time than its allowance.
  • Memory is incompressible (also called non-shareable). Once a process is given a chunk of memory, that memory can't be taken away from it until it's released by the process itself. Unlike CPU, memory can't be throttled.
Elastic vs rigid material analogy for compressible and in-compressible compute resources

4.2 Limits overcommitment

Resource limits can be overcommitted. Setting the limits values larger than the requests allows some over-subscription of resources as long as there is spare capacity on the node. The sum of all limits of all the pods on a node is allowed to exceed 100% of the node's capacity. But when 100% of the node's resources are actually in use, some containers will need to be CPU throttled or OOM-killed. To avoid this, admins might only allow deployments in the Guaranteed QoS class, which enforces requests equal to limits.

Overcommitting by reserving limits

4.3 Memory limits from the container’s perspective

Applications (or frameworks they run on) often need to know how much memory is available to them so that they can allocate and manage their memory requirements. Is it possible for an application running in a container environment to see the limits values?

What value will we see if we run the top command from within the container? Let’s check it with the exec command: $ kubectl exec -it your_pod_name -c your_container_name -- top.

If we do our test for the container with the following resource configuration:

...
resources:
  limits:
    cpu: 500m
    memory: 50Mi

This is the outcome we should get:

The top command screen executed from the container showing the resources of the entire worker node

What this means is that the top command shows the memory of the whole node the container is running on. Even though you set a limit on how much memory is available to the container, the app running within it will not be aware of this limit.

This has an unfortunate effect on apps that look up the amount of memory available on the system and use that information to decide how much memory they want to reserve. Fortunately, we can control the application’s memory allocation by setting memory limits when we execute it with its runtime. For example:

  • use -Xmx to define the max heap size in Java
  • use --max-old-space-size in Node.js
Helm chart example with Java options setting xmx value corresponding to the memory requests and limits
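A hedged sketch of the idea: pass the heap ceiling to the JVM through an environment variable in the container spec (JAVA_TOOL_OPTIONS is one common way to inject JVM flags; the values are illustrative):

...
env:
  - name: JAVA_TOOL_OPTIONS
    value: "-Xmx400m" # keep the heap comfortably below the container memory limit
resources:
  requests:
    memory: 512Mi
  limits:
    memory: 512Mi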

The limits are enforced by the container runtime through cgroups, so the limit value can be read from inside the container from this file: /sys/fs/cgroup/memory/memory.limit_in_bytes:

Reading the limits values from inside of the container
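For example (the pod and container names are placeholders; on nodes running cgroup v2 the equivalent file is /sys/fs/cgroup/memory.max):

$ kubectl exec -it your_pod_name -c your_container_name -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes
52428800

52428800 bytes is exactly the 50Mi limit from the snippet above.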

4.4 CPU limits from the container’s perspective

We know that applications might misread the information about available memory; how about reading CPU limits?

Setting the CPU limit to a full core doesn't expose that entire core to the container. Instead, the CPU limit constrains the amount of CPU time the container can use.

As an example, let's deploy a NodeJS app and set a 500m CPU limit.

...
resources:
  limits:
    cpu: 500m # (half a core)

If we exec into the container, enter the NodeJS REPL and invoke the built-in os.cpus() method, what we see is information about the CPUs of the worker node (just like with memory) rather than what has been allocated to the container.

NodeJS os.cpus() output in REPL

This matches exactly the resources provided by the worker node the app is running on:

Details of the node that is currently running the pod

5. CPU management

Understanding how K8s and the container engine handle CPUs is going to be very helpful when troubleshooting issues or interpreting infrastructure metrics for your application. It is worth diving a bit deeper to get familiar with CPU quota and shares.

5.1 CPU quota and CPU shares

CPU management is delegated to the OS scheduler on the node (specifically via cgroups). Two different mechanisms are used for enforcing requests and limits: CPU shares and CPU quota.

Deployment manifest snippet specifying CPU resource requests and limits and their relation to CPU shares and CPU quota systems

5.2 CPU shares system for managing requests

A container's CPU shares value is a relative weight used for scheduling CPU time between containers.

K8s treats one core as 1024 shares and guarantees that each container receives CPU time in proportion to its share of the total.

When containers compete for CPU time, the kernel divides the available time between them in proportion to their shares. Giving container A 512 shares and container B 1024 shares means that container B will get twice as much CPU time as container A.

Let’s review the scenario where we have three containers that will be scheduled on the same node and container A requests 1024 CPU shares and containers B and C request 512 CPU shares each.

If we deploy these containers to a worker node with two cores (2000 millicores), we will get the following breakdown:

  • container A with 1024 shares: 1000 millicores
  • containers B and C with 512 shares: 500 millicores each
Pod CPU shares as weights distributed among the CPU cores

Let's see a more real-life example. We have a deployment manifest with two containers, one to run Redis and another one to run a NodeJS app that uses Redis as a cache. We request 500m for Redis and 100m for the NodeJS app.

  • Redis container requests 500 millicores which is half of a core and half of the node’s total shares
    - 1024 * 0.5 = 512 shares
  • NodeJS app container requests 100 millicores which is one-tenth of a core and one-tenth of the node’s total shares
    - 1024 * 0.1 = 102 shares
Example shares calculation
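If you want to verify this translation from millicores to shares, you can read the value the runtime wrote into the container's cgroup (cgroup v1 path shown; the pod name is a placeholder, and on cgroup v2 nodes the weight is exposed as cpu.weight instead):

$ kubectl exec -it your_pod_name -c redis -- cat /sys/fs/cgroup/cpu/cpu.shares
512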

We can clearly see the relativity of the CPU shares mechanism, proportionally sharing the node's resources between the containers. But note that CPU shares don't say anything about the absolute amount of CPU time each container will get.

CPU shares help in allocating a worker node's resources to containers, but they don't enforce upper bounds (the limits). You might remember from an earlier paragraph that if we only set requests and no limits, then whenever one container doesn't use its share, another one is free to use it.

5.3 CPU Quota system for managing limits

K8s uses the Completely Fair Scheduler (CFS) to enforce CPU limits for the pods running an application. CFS is the Linux process scheduler that handles CPU allocation for running processes, and it enforces limits based on time periods rather than on available CPU power.

For example, a container with a one-core CPU limit running on a 4-core CPU will get 1/4th of the overall CPU time.

Even though its limit is set to one core, this doesn’t mean that the container’s processes will run on that one core only. At different points in time, its code will be executed on different cores.

Let’s understand the CPU time a bit better by looking at the following example.

We have a single-threaded job that requires 200 ms of processing time to finish its task. First, we deploy it with no limit set. Since the app can take all CPU time uninterrupted the task completes after 200 ms.

Unrestricted CPU time example

Now let's deploy the same pod but set the CPU limit to 250m. The container runtime controls CPU time by dividing it into periods of 100 milliseconds, and a 250m limit translates into 25 milliseconds of CPU time per period.

We can read those values from inside the container by exec-ing into it and reading cpu.cfs_period_us and cpu.cfs_quota_us:

Reading CPU total and allocated time slice durations
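For a 250m limit we would expect something along these lines (illustrative output; the paths are the cgroup v1 ones referenced above, and the pod name is a placeholder):

$ kubectl exec -it your_pod_name -- cat /sys/fs/cgroup/cpu/cpu.cfs_period_us
100000
$ kubectl exec -it your_pod_name -- cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
25000

That is 25,000 µs of CPU time allowed per 100,000 µs period, i.e. a quarter of a core.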

Now the task’s access to the CPU is interrupted after 25ms and then it needs to wait for the next 100ms time interval to get access again.

The task that previously completed within 200ms will now complete its work after 725ms, as the diagram below shows:

Restricted CPU time

It is important to note that K8s doesn't consider the CPU type or power here. As an example, 100 millicores on m5.large is much less powerful than 100 millicores on c5.large. If you have a powerful processor on your node, the task will do much more computation within the same time interval than on a node with a lower-spec processor.

As an interesting fact, many users complained that the default period (100ms) is too large and inefficient, hence the CPUCFSQuotaPeriod feature gate (configured through the kubelet's --cpu-cfs-quota-period flag) has been added as an experimental feature.

5.4 Demo 1: exceeding the CPU limit

Equipped with our fresh knowledge, let’s try to break something by deploying an application with CPU limits set that will run a CPU intensive task. We know that when we set the CPU limits for a container, the process isn’t given more CPU time than the configured limit and if it tries to use more, we should observe that the CPU usage is throttled.

You can deploy your own task, or as in the previous articles on health checks (part 1 and part 2), you can use the demo app I shared earlier (source, docker image). The app has a dedicated endpoint debug/cputask/{duration_ms} that you can call with the POST request to invoke the execution of a processor-intensive task for a specified amount of time.

Upon finishing execution, the endpoint will return the number of cycles it managed to complete within the requested timeframe.

demo-njs-app CPU intensive task source code

You can invoke this endpoint in a number of different ways, like reaching the app through the service, port-forwarding or calling curl from inside the container. Here is how you can make the demo app run that task for the duration of 10 seconds: $ kubectl exec -it demo-njs-app -- curl -X POST localhost:8080/debug/cputask/10000.

After running the app, you can also run the top command from that container in the following way: $ kubectl exec -it demo-njs-app -- top.

We will do three tests; each time we will increase the CPU limit which, as you know by now, translates into more CPU time.

Note that the highlighted %CPU column in the top utility output is the task's share of the CPU time, as a percentage of total CPU time.

Test 1 results:

Test 1 outcome screenshots

Test 2 results:

Test 2 outcome screenshots

Test 3 results:

Test 3 outcome screenshots

As you can notice from the screenshots above, every time we increased the CPU limit, the application was receiving more CPU time, thus more CPU cycles which in turn made it able to finish more loop iterations (represented by the count value in the endpoint response) in the given amount of time.

At any time, the process was able to consume as much CPU time as was available under the specified limit. Over the same 10-second timeframe we got:

  • 100m = 2,088,094 cycles
  • 250m = 5,639,719 cycles
  • 500m = 11,459,973 cycles

5.5 Demo 2: Exceeding the memory limit

When a process tries to allocate memory over its limit, it is OOM-killed. If the pod’s restart policy is set to Always or OnFailure, the process is restarted immediately.

We can again use demo-njs-app to demonstrate breaching the memory limit, this time using the memalloc endpoint to make the app allocate a given amount of memory: $ kubectl exec -it demo-njs-app -- curl -X POST localhost:8080/debug/memalloc/10.

The endpoint calls this basic method which uses the buffer to allocate a desired chunk of memory:
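Roughly, such a method could look like the sketch below (illustrative, not the exact demo-njs-app source):

// Keep references to the allocated buffers so the garbage collector
// cannot reclaim them between requests.
const allocatedChunks = [];

function allocateMemory(sizeInMiB) {
  // Buffer.alloc zero-fills the buffer, so the pages are actually
  // committed rather than lazily mapped.
  const chunk = Buffer.alloc(sizeInMiB * 1024 * 1024);
  allocatedChunks.push(chunk);
  // Return the total number of bytes held so far.
  return allocatedChunks.reduce((total, buf) => total + buf.length, 0);
}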

As before, you can also run the top command from the container in the following way: $ kubectl exec -it demo-njs-app -- top.

We will do this in three steps. First, let's deploy the app with a manifest limiting its memory to 200MiB.

Step 1 outcome screenshots

We can see that at first, the application consumes 41.6MiB of memory.

In step 2 we call the endpoint to allocate an additional 100MiB of memory.

Step 2 outcome screenshots

With 142 MiB consumed, we are still under the limit of 200 MiB so let’s request an additional 100 MiB and see what happens.

Step 3 outcome screenshot

As soon as the app attempted to allocate another block of memory, it was immediately terminated with SIGKILL, as indicated in the application log above.

If we also look at the output of kubectl get pods we should see that the pod got OOMKilled:

kubectl get pods output showing the pod was OOMKilled

We certainly don’t want this to happen to our production deployments on K8s!

6. Considerations for values for requests and limits

6.1 Memory requests and limits and OOM

Let’s review two cases for the following deployment resource configuration:

Deployment manifest snippet specifying resource requests and limits
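A sketch of the configuration these cases assume (mirroring the memory values used below):

...
resources:
  requests:
    memory: 128Mi
  limits:
    memory: 256Mi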

Case 1: the application consumes more than the requests but less than the limit

In this case, the application is going to be fine for as long as the node doesn't come under memory pressure. If that happens, the kubelet may evict pods that are using more memory than they requested, even while they stay under their limit.

Example: the container uses 129MiB, which is over the requested 128MiB but under the limit of 256MiB. The app might get evicted if the node runs short of memory.

Case 2: the application consumes over the limit

The container that exceeds the hard limit will be OOM-killed regardless of the amount of free memory on the worker node it is running on.

Example: the container uses 257MiB, over its limit of 256MiB, the pod will get terminated with OOMKilled.

6.2 Managing footprint

In most cases, developers set the initial values for requests and limits based on guesstimates and then either forget about them (as long as the deployment isn't crashing or underperforming) or iterate to fine-tune them based on metrics. Quite likely, pods will end up with a resource footprint that is higher than required, which isn't cost-effective.

If we request too much, worker nodes will be underutilized and those resources will be blocked, which means we will be burning money. The difference between the requested value and actual usage is sometimes called "slack".

Slack is the difference between the requested values and actual usage

Sometimes this leads to another issue where resources are stranded. We can have a node that has lots of free memory but hardly any unreserved CPU, because the CPU has been reserved by the currently running pods, so no other pod can be scheduled there.

When resources are stranded on a node they can’t be utilized

On the other hand, if we don't request enough, our apps will be either CPU-starved or OOM-killed.

6.3 Coming up with the right values for requests

The trick is to find the sweet spot, as there are consequences for going too low and for going too high.

Finding the sweet spot for resource requests

6.4 Coming up with the right values for limits

Similarly, there is a fair margin for the limits as well.

Finding the sweet spot for resource limits

Some teams set limits higher than requests to speed up application boot time; for example, Spring Boot tends to require a lot of CPU for launching (see here). The downside to overprovisioning just for a faster launch is that those resources won't be utilized after launch.

6.5 Testing with Docker

We can start testing our application resource requirements locally with Docker. This is exactly the same mechanism that K8s uses on worker nodes.

  • Use --cpus to set a CPU utilization limit. This is equivalent to the resource limits in K8s (CPU quota). In the example below we allocate half a core to the container:
    $ docker run --cpus 0.5 <image>
  • Use --cpu-shares to provide the container with a proportion of CPU cycles. By default, this is set to 1024. This is equivalent to the resource request in K8s (CPU shares). In the example below we allocate 700 shares to the container:
    $ docker run --cpu-shares 700 <image>
  • Use --memory to set the memory limit and test OOM behaviour:
    $ docker run --memory 1g <image>
  • Use docker stats to find out the current resource usage statistics for the containers running locally:
    $ docker stats
Docker stats output showing computational resources statistics for running containers

An interesting blog post on testing a Java/Spring application locally before deployment to K8s: java-application-optimization-on-kubernetes-on-the-example-of-a-spring-boot-microservice

6.6 Operators perspective

In the previous paragraphs, we mostly discussed the perspective and responsibilities of the K8s users in configuring their deployments and resource constraints. How about the K8s admins: do they just care about the worker nodes, cluster size and autoscaling, or do they also go down to the lower level to monitor individual pods and containers?

Right-sizing the infrastructure and optimizing the footprint of individual containers and their pods is one of the admin's and operator's concerns. One of the most important objectives from the operational and business perspectives is the system's stability and performance on one hand and cost containment on the other. We want to avoid paying for resources that aren't utilized (e.g. resource requests that are set too high). The key weapons in our hands are observability and automation tools.

What are the options?

  • manually: historical resource requests, usage and utilization data analysis
Observability service dashboard providing monitoring for K8s cluster and its deployments
  • automatically: Vertical Pod Autoscaler (VPA), see the sketch after this list
    - option A: get a report on usage with recommendations for rightsizing
    - option B: set resource requests automatically based on usage reports (this is not possible if HPA is also running)
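As a sketch of option A, a VPA object can point at a deployment in recommendation-only mode (the names are illustrative and assume the VPA CRDs and controllers are installed in the cluster):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: demo-njs-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-njs-app
  updatePolicy:
    updateMode: "Off" # only publish recommendations, don't modify the pods

The recommendations can then be read from the object's status, for example with kubectl describe vpa demo-njs-app-vpa.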

One such tool that helps in identifying ineffectively configured deployments at the pod level is Kube Resource Report: https://codeberg.org/hjacobs/kube-resource-report

Kube Resource Report screenshot showing resource cost report for deployments

7. Final thoughts

Allocation of resources for containers and their pods represents only one dimension of managing K8s deployments. Production applications run on multiple replicas, and with K8s features like the Horizontal Pod Autoscaler, we rely on performance testing and observability to find the sweet spot between application performance and cost savings. As with many other things, it will take some practice and iterations to get there.

We should always remember to review the resources allocated to our containers: as we release newer versions of our applications with new or updated features, it is likely that the resource requirements have changed as well.

In multiple paragraphs of this article, I mentioned concepts such as QoS classes and pod priority without going into detail about what they are and what role they play in scheduling, eviction and preemption. When discussing resources, it is also important to be aware of namespace constraints such as LimitRange and ResourceQuota and the role of the admission controller in the K8s API server. Those are set and managed by K8s admins to build fences between "neighbours" representing company departments or functions and to provide tighter resource and cost controls. If you want to see those topics covered in future articles, as always, please leave some feedback!

If you are interested in K8s, you may also want to see the other articles that we have published to date.

Don’t forget to check hmh.engineering for more interesting content from HMH engineers on various topics related to education, technology, accessibility and developer’s life.

Huge thanks to Francislainy Campos, Mickael Meausoone and Kris Iyer for spending their time reviewing this post!
