EKS Autoscaling: Karpenter

How to configure node autoscaling for EKS with Terraform: part 2

There is an Italian version of this article; if you'd like to read it click here.

Karpenter

Karpenter is a new node lifecycle management solution developed by AWS Labs and announced as generally available (GA) at re:Invent 2021.

It is a young but promising open-source product, born as a response to some of the Cluster Autoscaler complexities.

Honestly, Cluster Autoscaler is quite tricky to configure: tags must match across several types of cloud resources, and a number of permissions and roles must be associated (so much so that our Google Cloud friends have seen fit to provide Cluster Autoscaler by default on GKE...). Once up and running it is reliable, but its initial configuration complexity, compared to other solutions, can be daunting.

Compared to Cluster Autoscaler (you can read my article about it here), Karpenter has a different approach to autoscaling:

  • it does not rely on Auto Scaling groups (ASGs); instead, it requests node creation directly from the cloud provider (on AWS, via a Launch Template);

  • it needs far less configuration than Cluster Autoscaler (at least on AWS);

  • when it detects the need to allocate computational resources in the cluster, it autonomously calculates the most appropriate instance type and launches a single new node that can, by itself, satisfy all the computational requirements; therefore you don't need to decide in advance which instance types to configure in the node group.

Let's see how it works!

How to configure Karpenter on EKS

OK, I said that there are fewer tags to configure than with Cluster Autoscaler, but you still have to assign some :-)

You can find a complete configuration at this link. In this article, we'll see only some details of the code.

As a prerequisite, an additional tag must be added to the subnets of the VPC where the EKS nodes are created. Therefore, including the tags that are normally assigned to work with EKS, the configuration will be:

public_subnet_tags = {
    "kubernetes.io/cluster/${local.name}" = "shared"
    "kubernetes.io/role/elb"              = 1
}

private_subnet_tags = {
    "kubernetes.io/cluster/${local.name}" = "shared"
    "kubernetes.io/role/internal-elb"     = 1
    # Tags subnets for Karpenter auto-discovery
    "karpenter.sh/discovery" = local.name
}
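Subnets are not the only resources Karpenter discovers by tag: the provisioner shown later in this article also uses a `securityGroupSelector` with the same `karpenter.sh/discovery` key, so at least one security group must carry that tag too. How this is done depends on the setup. The following is only a sketch, assuming the cluster is created with a module named `eks` (terraform-aws-modules/eks) and that the cluster primary security group is not already tagged in some other way:

```hcl
# Sketch (assumption): tag the cluster primary security group so that
# Karpenter's securityGroupSelector can discover it.
# "module.eks" is a hypothetical module name; adapt it to your configuration.
resource "aws_ec2_tag" "karpenter_discovery_sg" {
  resource_id = module.eks.cluster_primary_security_group_id
  key         = "karpenter.sh/discovery"
  value       = local.name
}
```

If several security groups end up with this tag, Karpenter will select all of them, so it is worth checking which groups actually match.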

The node group definition is simpler, because you only need to configure the first node of the cluster. Everything else will be handled by Karpenter:

eks_managed_node_groups = {
    karpenter = {
      instance_types                        = ["t3.medium"]
      create_security_group                 = false
      attach_cluster_primary_security_group = true
      enable_monitoring                     = true

      min_size     = 1
      max_size     = 1
      desired_size = 1

      iam_role_additional_policies = [
        # Required by Karpenter
        "arn:${local.partition}:iam::aws:policy/AmazonSSMManagedInstanceCore"
      ]
    }
  }

The Karpenter installation on this first node can again be done with Helm via Terraform:

```
resource "helm_release" "karpenter" {
  namespace        = "karpenter"
  create_namespace = true

  name       = "karpenter"
  repository = "https://charts.karpenter.sh"
  chart      = "karpenter"
  version    = "v0.13.2"

  ...
}
```

At this point, after Helm has installed the Karpenter Provisioner CRD, we can define a provisioner. The values below are purely illustrative: always check which ones are most appropriate for your specific use case.

resource "kubectl_manifest" "karpenter_provisioner" {
  yaml_body = <<-YAML
  apiVersion: karpenter.sh/v1alpha5
  kind: Provisioner
  metadata:
    name: nodegroup2
  spec:
    requirements:
      # Include only some instance families
      - key: karpenter.k8s.aws/instance-family
        operator: In
        values: [c5, m5, r5]
      # Exclude some instance sizes
      - key: karpenter.k8s.aws/instance-size
        operator: NotIn
        values: [nano, micro, small, 24xlarge, 18xlarge, 16xlarge, 12xlarge]
      # Use Spot capacity only
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot"]
    taints:
      - key: dedicated
        value: ${local.nodegroup2_label}
        effect: "NoSchedule"
    limits:
      resources:
        cpu: 16
    provider:
      subnetSelector:
        karpenter.sh/discovery: ${local.name}
      securityGroupSelector:
        karpenter.sh/discovery: ${local.name}
      tags:
        karpenter.sh/discovery: ${local.name}
        role: group2
    ttlSecondsAfterEmpty: 30
  YAML
}

This Karpenter provisioner configuration example includes some choices:

  • only a subset of instance families is allowed

  • some instance sizes are excluded (the smallest ones to avoid performance problems, the largest ones to avoid billing problems)

  • the provisioner is limited to managing up to 16 CPUs in total; once this limit is reached, it stops launching new nodes

Other configuration properties are available in the official documentation.
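One consequence of the `taints` block in the provisioner above is worth spelling out: every node it creates carries the `dedicated` taint with effect `NoSchedule`, so only pods that tolerate that taint can be scheduled there and trigger a scale-up through this provisioner. As a minimal sketch, a test workload could look like the following. The deployment name `inflate`, the pause image, and the CPU request are assumptions chosen for illustration; they are not part of the original configuration.

```hcl
# Hypothetical test workload: name, image and sizes are illustrative only.
resource "kubectl_manifest" "inflate" {
  yaml_body = <<-YAML
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: inflate
    namespace: default
  spec:
    replicas: 0
    selector:
      matchLabels:
        app: inflate
    template:
      metadata:
        labels:
          app: inflate
      spec:
        # Tolerate the taint applied by the "nodegroup2" provisioner
        tolerations:
          - key: dedicated
            operator: Equal
            value: "${local.nodegroup2_label}"
            effect: NoSchedule
        # Ask for nodes created by the "nodegroup2" provisioner
        nodeSelector:
          karpenter.sh/provisioner-name: nodegroup2
        containers:
          - name: inflate
            image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
            resources:
              requests:
                cpu: "1"
  YAML
}
```

Raising `replicas` and re-applying should leave the new pods pending until Karpenter launches capacity for them, as long as the total stays within the provisioner's CPU limit.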

Scale-up on EKS with Karpenter

Repeating the same test performed with Cluster Autoscaler shows that, with Karpenter, the time between the request for new computational resources and the launch of the new node is barely perceptible, whereas Cluster Autoscaler has some latency when performing this operation.

Also, when 5 new replicas were requested all at once, Karpenter calculated the best instance type at that time and for that region:

[Image: karpenter-logs.png]

Cluster Autoscaler, on the other hand, in the same situation would have created multiple nodes to satisfy the demand for computational resources.

Scale-down on EKS with Karpenter

Karpenter's scale-down is just as fast: when a node is empty, it is terminated after ttlSecondsAfterEmpty seconds (this value must always be set; otherwise there is no scale-down).

WARNING: the features described below are under development. This description reflects their state at the time of writing (July 19, 2022).

But pay attention to this: Karpenter DOES NOT manage the workload consolidation when nodes are underutilized.

The use case I mentioned in the Cluster Autoscaler article is peculiar: since the workloads are batch jobs, once all the jobs running on a node have finished, there are no pods to migrate, so the node is left unused and is terminated with no disruption.

In the case of non-batch workloads, such as the deployment used for testing, things are different: when the number of replicas is reduced, Karpenter does not consolidate the remaining pods onto the smallest set of nodes able to host them. The active pods stay spread across the existing nodes and, consequently, no node becomes empty and eligible for termination.

So, if at some point a high number of replicas is needed (for example, during a traffic peak), Karpenter creates a node with large computing resources, but this node will never be terminated because there is no consolidation mechanism.

The instance type limitations included in the provisioner configuration are therefore useful to reduce the chance of creating very expensive nodes that risk never being terminated.

In some cases, specific pod affinity, topology spread constraint, and pod disruption budget configurations can be added to mitigate unexpected results. Always check your use case.
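As a hedged illustration of one such mitigation, the sketch below adds a PodDisruptionBudget for a hypothetical workload labeled `app: inflate`. The resource name, the label, and the `minAvailable` value are assumptions, and whether a budget like this helps depends entirely on the workload.

```hcl
# Hypothetical example: the "app: inflate" label and minAvailable are illustrative.
# policy/v1 requires Kubernetes 1.21 or later.
resource "kubectl_manifest" "inflate_pdb" {
  yaml_body = <<-YAML
  apiVersion: policy/v1
  kind: PodDisruptionBudget
  metadata:
    name: inflate
    namespace: default
  spec:
    # Keep at least 2 pods running during voluntary disruptions such as node drains
    minAvailable: 2
    selector:
      matchLabels:
        app: inflate
  YAML
}
```

A budget like this only limits how many pods can be evicted at once; it does not by itself cause pods to be packed onto fewer nodes.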

Workload Consolidation: Preview

Luckily, the workload consolidation functionality is on the roadmap: it is tracked in an issue on GitHub that already has an associated pull request to lay the foundation.

We tried this patch in preview: after the installation, a new property is available in the configuration:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: nodegroup2
spec:
  consolidation:
    enabled: true
...

At the moment this feature preview does not include any configuration parameters. The result is that the consolidation function may not always be effective:

  • when Karpenter packs all pods onto as few nodes as possible, using all the resources available in the cluster, it sometimes leaves no free space on the nodes, so every CronJob run causes the cluster to scale up and down every few minutes

  • if two nodes have a similar number of pods, some pods keep being moved from one node to the other to balance the workload, because there is no control mechanism that selects a single node at a time as a candidate for removal. We are thus back to the situation in which no node is emptied, so none can be terminated

Scores, finally!

| | Cluster Autoscaler | Karpenter |
| --- | --- | --- |
| configuration complexity | medium/high (on AWS) | low |
| speed | medium/high (configurable) | very high |
| functionality maturity | high | medium/low |

Karpenter seems to be promising. Its reaction speed is impressive and, for sure, it will carve out its space among Kubernetes integrated tools, thanks to the AWS team that is developing it and that has every interest in enriching it. It can be an interesting alternative for non-critical clusters or specific use cases.

For clusters in a production environment, Karpenter gives the feeling that it is not yet mature enough, but it is developing rapidly, and it deserves to be kept on the radar to evaluate how it evolves.