Photo by Eric Prouzet on Unsplash
EKS Autoscaling: Cluster Autoscaler
How to configure node autoscaling for EKS with Terraform: part 1
There is an Italian version of this article; if you'd like to read it click here.
A special use case for EKS
A while ago I had an interesting use case: an application to allow users to run simulations based on algorithms and parameters selected from a database.
The simulations were Kubernetes jobs that required a fixed amount of CPU and memory so that the results would be comparable: the final purpose was to evaluate which of the algorithms executed in the simulations was the best.
We wanted users to be able to run their workloads independently, without waiting for other workloads to finish because of a lack of free computational resources. At the same time, the ultimate goal was to keep costs as low as possible.
The solution was built on AWS EKS with two managed node groups: a more "stable" node group to host the application, and a more "volatile" node group to run the simulations, whose nodes have more computing resources but are instantiated only when new simulations are requested. This node scaling is managed by Cluster Autoscaler, which also takes care of scaling the node group down to 0.
Cluster Autoscaler
Cluster Autoscaler is a vendor-neutral, open-source tool included in the Kubernetes project and, to date, the de-facto standard for cluster autoscaling, with implementations available for most cloud providers.
Cloud providers integration
Depending on the cloud provider, it may or may not be integrated by default: Google Cloud Platform, for example, provides it out of the box in GKE.
This deep integration with the cloud provider is not surprising: Google basically invented Kubernetes, releasing in mid-2014 the open-source version of its Borg project, which was born internally in the company in 2003-2004.
AWS has historically released one innovative service after another over the years, and it chose to focus on its proprietary container orchestrator, ECS, which has the undeniable advantage of letting users run containers without the experience needed to manage Kubernetes; and it is, of course, well integrated with all AWS services.
However, more than any other service, ECS has highlighted the intrinsic lock-in to the platform that many users prefer to avoid. In addition, many enterprise customers use more than one cloud provider and appreciate being able to leverage the same technical and operational knowledge across multiple platforms, and Kubernetes arguably has the best feature set and the promise of portability between cloud and datacenter. The momentum of Kubernetes has grown to the point of becoming a deciding factor when companies evaluate which cloud providers to adopt; so AWS finally released EKS in GA in mid-2018, starting behind the competition.
This brief historical parenthesis explains the context: EKS is young compared to its competitor, but over the years it has been enriched with many features that make it fully capable of supporting applications in production.
Cluster Autoscaler, however, is something you have to install yourself :-)
How to configure Cluster Autoscaler for EKS
EKS uses the Auto Scaling group (ASG) functionality to integrate with Cluster Autoscaler and carry out its requests to add and remove nodes.
Cluster Autoscaler does not directly measure CPU and memory usage values to make a sizing decision. Instead, every 10 seconds it checks for pods in a pending state, from which it infers that the scheduler cannot assign them to a node due to insufficient computational capacity.
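When a workload cannot be scheduled, its pods stay in the Pending state, which is exactly what Cluster Autoscaler reacts to. You can list such pods across all namespaces with:
kubectl get pods --all-namespaces --field-selector=status.phase=Pending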
Even with the help of the excellent EKS module for Terraform, configuring the cluster for autoscaling to work requires a set of configurations that are not out-of-the-box.
You can find a complete configuration at this link. In this article, we'll see only some details of the code.
The key part is the node group configuration. For the use case shown at the beginning of this article, I have configured two managed node groups with different features:
the first node group has a lower autoscaling range because it has a more stable use and is not subject to particular load fluctuations;
the second node group uses machines with higher computational resources with a spot pricing model. The autoscaling group configuration indicates that, without any jobs to be performed, no machines are turned on. I also added a taint to ensure that only applications that explicitly request this node group run on it.
The values in this code are purely illustrative. Always check the most appropriate values for each specific use case.
The most notable piece of configuration is the explicit association of specific tags with each node group, without which autoscaling does not work.
locals {
  name             = "eks-cas"
  nodegroup1_label = "group1"
  nodegroup2_label = "group2"
}
eks_managed_node_groups = {
  # "Stable" node group hosting the application
  nodegroup1 = {
    desired_size   = 1
    max_size       = 2
    min_size       = 1
    instance_types = ["t3.medium"]
    capacity_type  = "ON_DEMAND"
    update_config = {
      max_unavailable_percentage = 50
    }
    labels = {
      role = local.nodegroup1_label
    }
    # Tags required by Cluster Autoscaler auto-discovery
    tags = {
      "k8s.io/cluster-autoscaler/enabled"                  = "true"
      "k8s.io/cluster-autoscaler/${local.name}"            = "owned"
      "k8s.io/cluster-autoscaler/node-template/label/role" = local.nodegroup1_label
    }
  }
  # "Volatile" node group for the simulations: spot instances, scales from 0
  nodegroup2 = {
    desired_size   = 0
    max_size       = 10
    min_size       = 0
    instance_types = ["c5.xlarge", "c5a.xlarge", "m5.xlarge", "m5a.xlarge"]
    capacity_type  = "SPOT"
    update_config = {
      max_unavailable_percentage = 100
    }
    labels = {
      role = local.nodegroup2_label
    }
    # Only pods that explicitly tolerate this taint can run on these nodes
    taints = [
      {
        key    = "dedicated"
        value  = local.nodegroup2_label
        effect = "NO_SCHEDULE"
      }
    ]
    # Tags required by Cluster Autoscaler auto-discovery,
    # including the node-template hints needed to scale from 0
    tags = {
      "k8s.io/cluster-autoscaler/enabled"                       = "true"
      "k8s.io/cluster-autoscaler/${local.name}"                 = "owned"
      "k8s.io/cluster-autoscaler/node-template/taint/dedicated" = "${local.nodegroup2_label}:NoSchedule"
      "k8s.io/cluster-autoscaler/node-template/label/role"      = local.nodegroup2_label
    }
  }
}
It's not over. These tags must be present not only in the node group configuration but also on the actual nodes created by the autoscaling group on AWS (whose lifecycle is managed by EKS and not by Terraform), so another piece of code is needed:
locals {
  eks_asg_tag_list_nodegroup1 = {
    "k8s.io/cluster-autoscaler/enabled" : true
    "k8s.io/cluster-autoscaler/${local.name}" : "owned"
    "k8s.io/cluster-autoscaler/node-template/label/role" : local.nodegroup1_label
  }
  eks_asg_tag_list_nodegroup2 = {
    "k8s.io/cluster-autoscaler/enabled" : true
    "k8s.io/cluster-autoscaler/${local.name}" : "owned"
    "k8s.io/cluster-autoscaler/node-template/label/role" : local.nodegroup2_label
    "k8s.io/cluster-autoscaler/node-template/taint/dedicated" : "${local.nodegroup2_label}:NoSchedule"
  }
}
resource "aws_autoscaling_group_tag" "nodegroup1" {
for_each = local.eks_asg_tag_list_nodegroup1
autoscaling_group_name = element(module.eks.eks_managed_node_groups_autoscaling_group_names, 0)
tag {
key = each.key
value = each.value
propagate_at_launch = true
}
}
resource "aws_autoscaling_group_tag" "nodegroup2" {
for_each = local.eks_asg_tag_list_nodegroup2
autoscaling_group_name = element(module.eks.eks_managed_node_groups_autoscaling_group_names, 1)
tag {
key = each.key
value = each.value
propagate_at_launch = true
}
}
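To double-check that the tags really ended up on the Auto Scaling groups, you can query them with the AWS CLI; here <asg-name> is a placeholder for the ASG name generated for your node group:
aws autoscaling describe-tags --filters "Name=auto-scaling-group,Values=<asg-name>"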
Finally, we use the Helm provider for Terraform to install cluster-autoscaler:
locals {
  k8s_service_account_namespace = "kube-system"
  k8s_service_account_name      = "cluster-autoscaler"
}

resource "helm_release" "cluster-autoscaler" {
  name             = "cluster-autoscaler"
  namespace        = local.k8s_service_account_namespace
  repository       = "https://kubernetes.github.io/autoscaler"
  chart            = "cluster-autoscaler"
  version          = "9.10.7"
  create_namespace = false

  set {
    name  = "awsRegion"
    value = local.region
  }
  set {
    name  = "autoDiscovery.clusterName"
    value = local.name
  }
  set {
    name  = "autoDiscovery.enabled"
    value = "true"
  }
  ...
}
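After terraform apply, a quick sanity check that the release is installed and its pod is running (the exact pod name depends on how the chart builds its resource names):
helm status cluster-autoscaler -n kube-system
kubectl -n kube-system get pods | grep cluster-autoscaler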
How to test Cluster Autoscaler? While the main purpose in my original use case was to run batch jobs, I will use a deployment here because it is easier to manipulate replicas manually (so it is more useful for illustrating functionality).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 0
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
          resources:
            requests:
              cpu: 1
      nodeSelector:
        role: "group2"
      tolerations:
        - key: "dedicated"
          operator: "Equal"
          value: "group2"
          effect: "NoSchedule"
The nodeSelector and tolerations properties force this application to run on nodegroup2.
When this deployment is created, its replica count is 0:
kubectl apply -f deployment.yaml
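At this point the deployment exists but no pod has been created yet; you can check with:
kubectl get deployment inflate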
Finally, everything is ready to see the autoscaling at work!
Scale-up on EKS with Cluster Autoscaler
To see autoscaling in action, let's increase the deployment's replica count:
kubectl scale deployment inflate --replicas 5
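While the new pods sit in the Pending state, you can follow the nodes of the second node group being created; the role=group2 label comes from the Terraform configuration shown earlier:
kubectl get pods -l app=inflate
kubectl get nodes -l role=group2 --watch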
General observations:
the ASG configuration determines the instance type of the node to create. In the configuration, we indicated a list of possible instance types: this list is always evaluated in order, so barring specific issues on the cloud provider side, a machine of the first type in the list will be created
if more than one workload requires resources at the same time (as in my example, where I changed the number of replicas from 0 to 5), Cluster Autoscaler uses a sequential logic: when it detects the first pod in the pending state, it requests a new node; for the second pod, it checks that there are enough resources on the node already requested to host it; and it continues in this way for the other pods until the node's resources are exhausted, at which point it requests an additional node. In our test, since the node is a c5.xlarge instance (4 vCPUs, part of which is reserved for system components and daemonsets), the container requests 1 CPU, and I asked for 5 pod replicas, the first pod triggers the creation of a node and the next two pods are assigned to that same node; the fourth pod does not have enough space and causes the creation of a second node, and finally the fifth pod is also assigned to this second node.
Observations about execution time:
a few seconds pass from the execution of the scaling command to the request to the cloud provider to change the ASG desired_size. The Cluster Autoscaler documentation declares these SLOs:
No more than 30 sec latency on small clusters (less than 100 nodes with up to 30 pods each), with the average latency of about 5 sec. No more than 60 sec latency on big clusters (100 to 1000 nodes), with average latency of about 15 sec.
the actual time taken by the cloud provider to create the node is 40-50 seconds, on average
if the VPC CNI plugin provided as an EKS add-on is active, there is an additional wait (up to 60 seconds) to instantiate a new Elastic Network Interface (ENI) and attach it to the node as a secondary interface
overall, under normal conditions, the time before the new node is Ready in the cluster is the sum of all these steps, which you can follow in the Cluster Autoscaler logs (see below)
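To watch what Cluster Autoscaler decides and when, you can tail its logs. The Deployment name below is the one the Helm chart generated in my setup; check the actual name first if yours differs:
kubectl -n kube-system get deployments
kubectl -n kube-system logs -f deployment/cluster-autoscaler-aws-cluster-autoscaler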
Scale-down on EKS with Cluster Autoscaler
What happens when the need for computational resources decreases?
Cluster Autoscaler periodically checks if some nodes are underutilized. A node is considered a candidate for removal when all the following conditions are met:
the sum of the CPU and memory requests of all pods running on the node is less than 50% of the node's allocatable capacity
all pods running on the node can be moved to other nodes
the node does not have an annotation to disable the scale-down
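The annotation mentioned in the last point is the standard one documented by Cluster Autoscaler; for example, to protect a specific node from being scaled down (replace the placeholder with a real node name):
kubectl annotate node <node-name> cluster-autoscaler.kubernetes.io/scale-down-disabled=true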
Cluster Autoscaler is thus able to consolidate workloads onto a minimum number of nodes, calculating which nodes to nominate for removal and evicting their pods so that they are rescheduled elsewhere, freeing up resources.
If a node is not needed (i.e. it has no pods) for more than 10 minutes (the default value), Cluster Autoscaler sends the cloud provider a request to decrease the ASG desired_size and remove the empty node. Once that value is changed, the node is excluded from the cluster and then terminated. Cluster Autoscaler nominates only one non-empty node for removal at a time, to reduce the risk of not having enough resources to restart the pods.
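To see the scale-down with the test deployment, bring the replicas back to zero and wait: after the default 10 minutes of inactivity, the nodes of nodegroup2 are removed.
kubectl scale deployment inflate --replicas 0
kubectl get nodes -l role=group2 --watch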
The use case I talked about at the beginning is relatively simple, because the pods running on the nodes are batch jobs, with a lifetime that is finite by definition; when all the jobs on a node have ended their execution, the node is empty and is terminated (always after 10 minutes of inactivity) without the need to move any workloads.
What's new: Karpenter
In the last few months, a new autoscaling tool has appeared on the scene: Karpenter. My curiosity pushed me to repeat my scenario using this new tool to evaluate any benefits in terms of:
ease of configuration
functionality
speed
How did it go? You can read about Karpenter in my next article!