AWS Karpenter (https://github.com/aws/karpenter)

Recently I came across this project called Karpenter (https://aws.amazon.com/blogs/aws/introducing-karpenter-an-open-source-high-performance-kubernetes-cluster-autoscaler/) and it intrigued me greatly.

Diving into the Documentation

Based on its official documentation (https://aws.github.io/aws-eks-best-practices/karpenter/), Karpenter is described as an open-source cluster autoscaler that automatically provisions new nodes in response to unschedulable pods: it evaluates the aggregate resource requirements of the pending pods and chooses the optimal instance type to run them. It also supports a consolidation feature that actively moves pods around and either deletes nodes or replaces them with cheaper ones to reduce costs.

If you have read my previous article on EKS clusters in AWS (https://alexlogy.io/creating-eks-cluster-in-aws-with-terraform/), you will see that my clusters consist of different managed node groups with a fixed instance type per group. A Cluster Autoscaler (CAS) was then used to scale the cluster according to the workload requirements. This method is tested and proven everywhere, but it doesn't optimize the costs of your cluster. You have to use different node groups and taints to separate different workload requirements, and you have to use larger instance types to house your bigger workloads, resulting in wasted capacity.

Diving deeper into the description of Karpenter, the documentation states:

Karpenter brings scaling management closer to Kubernetes native APIs than do Autoscaling Groups (ASGs) and Managed Node Groups (MNGs). ASGs and MNGs are AWS-native abstractions where scaling is triggered based on AWS level metrics, such as EC2 CPU load. Cluster Autoscaler bridges the Kubernetes abstractions into AWS abstractions, but loses some flexibility because of that, such as scheduling for a specific availability zone.
Karpenter removes a layer of AWS abstraction to bring some of the flexibility directly into Kubernetes. Karpenter is best used for clusters with workloads that encounter periods of high, spiky demand or have diverse compute requirements. MNGs and ASGs are good for clusters running workloads that tend to be more static and consistent. You can use a mix of dynamically and statically managed nodes, depending on your requirements.

Come to think of it, this is exactly what is lacking with CAS. In addition, Karpenter provides interruption handling, which is the very reason we had the AWS Node Termination Handler installed in our clusters. Based on the documentation:

Karpenter supports native interruption handling, enabled through the aws.interruptionQueue value in Karpenter settings. Interruption handling watches for upcoming involuntary interruption events that would cause disruption to your workloads such as:
- Spot Interruption Warnings
- Scheduled Change Health Events (Maintenance Events)
- Instance Terminating Events
- Instance Stopping Events
When Karpenter detects one of these events will occur to your nodes, it automatically cordons, drains, and terminates the node(s) ahead of the interruption event to give the maximum amount of time for workload cleanup prior to interruption. It is not advised to use AWS Node Termination Handler alongside Karpenter as explained here.

This means that with Karpenter, I can achieve flexible scaling, cost optimization and Spot termination handling, all in one!

Deploying Karpenter

As such, I decided to try out Karpenter on our testing cluster to see its capabilities. The documentation is pretty comprehensive, so I will not go through the steps in detail. You can just refer to https://karpenter.sh/v0.27.0/getting-started/.
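One of the steps in that guide is creating a default Provisioner and its AWSNodeTemplate. As a rough sketch of what this looks like on v0.27 (the karpenter.sh/discovery tag values below are placeholders, not my actual manifests; use whatever subnet and security group selectors match your cluster):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Allow both Spot and On-Demand capacity; Karpenter picks the cheapest fit
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
  # Actively consolidate underutilized nodes to reduce costs
  consolidation:
    enabled: true
  # Cap the total capacity this provisioner may create
  limits:
    resources:
      cpu: "100"
  providerRef:
    name: default
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  subnetSelector:
    karpenter.sh/discovery: devops-cluster   # placeholder tag
  securityGroupSelector:
    karpenter.sh/discovery: devops-cluster   # placeholder tag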

As I have CAS installed in my cluster, I need to follow the "Migrating from Cluster Autoscaler" documentation. However, I realised it doesn't include the AWS SQS queue and the EventBridge/CloudWatch event subscriptions for events such as Spot interruptions. Thus, I have modified the CloudFormation template in the documentation for the migration process.

AWSTemplateFormatVersion: "2010-09-09"
Description: Resources used by https://github.com/aws/karpenter for Interruption Queue
Parameters:
  ClusterName:
    Type: String
    Description: "EKS cluster name"
Resources:
  KarpenterControllerPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      ManagedPolicyName: !Sub "KarpenterControllerPolicy-${ClusterName}"
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Resource: "*"
            Action:
              # Write Operations
              - ec2:CreateFleet
              - ec2:CreateLaunchTemplate
              - ec2:CreateTags
              - ec2:DeleteLaunchTemplate
              - ec2:RunInstances
              - ec2:TerminateInstances
              # Read Operations
              - ec2:DescribeAvailabilityZones
              - ec2:DescribeImages
              - ec2:DescribeInstances
              - ec2:DescribeInstanceTypeOfferings
              - ec2:DescribeInstanceTypes
              - ec2:DescribeLaunchTemplates
              - ec2:DescribeSecurityGroups
              - ec2:DescribeSpotPriceHistory
              - ec2:DescribeSubnets
              - pricing:GetProducts
              - ssm:GetParameter
          - Effect: Allow
            Action:
              # Write Operations
              - sqs:DeleteMessage
              # Read Operations
              - sqs:GetQueueAttributes
              - sqs:GetQueueUrl
              - sqs:ReceiveMessage
            Resource: !GetAtt KarpenterInterruptionQueue.Arn
          - Effect: Allow
            Action:
              - iam:PassRole
            Resource: !Sub "arn:${AWS::Partition}:iam::${AWS::AccountId}:role/KarpenterNodeRole-${ClusterName}"
          - Effect: Allow
            Action:
              - eks:DescribeCluster
            Resource: !Sub "arn:${AWS::Partition}:eks:${AWS::Region}:${AWS::AccountId}:cluster/${ClusterName}"
  KarpenterInterruptionQueue:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: !Sub "${ClusterName}"
      MessageRetentionPeriod: 300
  KarpenterInterruptionQueuePolicy:
    Type: AWS::SQS::QueuePolicy
    Properties:
      Queues:
        - !Ref KarpenterInterruptionQueue
      PolicyDocument:
        Id: EC2InterruptionPolicy
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - events.amazonaws.com
                - sqs.amazonaws.com
            Action: sqs:SendMessage
            Resource: !GetAtt KarpenterInterruptionQueue.Arn
  ScheduledChangeRule:
    Type: 'AWS::Events::Rule'
    Properties:
      EventPattern:
        source:
          - aws.health
        detail-type:
          - AWS Health Event
      Targets:
        - Id: KarpenterInterruptionQueueTarget
          Arn: !GetAtt KarpenterInterruptionQueue.Arn
  SpotInterruptionRule:
    Type: 'AWS::Events::Rule'
    Properties:
      EventPattern:
        source:
          - aws.ec2
        detail-type:
          - EC2 Spot Instance Interruption Warning
      Targets:
        - Id: KarpenterInterruptionQueueTarget
          Arn: !GetAtt KarpenterInterruptionQueue.Arn
  RebalanceRule:
    Type: 'AWS::Events::Rule'
    Properties:
      EventPattern:
        source:
          - aws.ec2
        detail-type:
          - EC2 Instance Rebalance Recommendation
      Targets:
        - Id: KarpenterInterruptionQueueTarget
          Arn: !GetAtt KarpenterInterruptionQueue.Arn
  InstanceStateChangeRule:
    Type: 'AWS::Events::Rule'
    Properties:
      EventPattern:
        source:
          - aws.ec2
        detail-type:
          - EC2 Instance State-change Notification
      Targets:
        - Id: KarpenterInterruptionQueueTarget
          Arn: !GetAtt KarpenterInterruptionQueue.Arn

Deploy the CloudFormation template with the following command:

aws cloudformation deploy --parameter-overrides "ClusterName=${CLUSTER_NAME}" --template-file cloudformation.yaml --stack-name "Karpenter-devops-cluster" --capabilities CAPABILITY_NAMED_IAM
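As an optional sanity check once the stack reports CREATE_COMPLETE, you can confirm the interruption queue exists (the template above names the queue after the cluster, so the cluster name is used here):

aws sqs get-queue-url --queue-name "${CLUSTER_NAME}"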

Once the CloudFormation stack has been created, you have to edit Karpenter's global settings to add the SQS queue name.

kubectl edit configmap karpenter-global-settings -n karpenter
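For v0.27 the relevant ConfigMap key is aws.interruptionQueueName (newer releases rename this to aws.interruptionQueue, as quoted earlier). Roughly, the edited ConfigMap should end up looking like this; the queue name below is a placeholder and must match the queue created by the CloudFormation stack:

apiVersion: v1
kind: ConfigMap
metadata:
  name: karpenter-global-settings
  namespace: karpenter
data:
  # ... other settings ...
  aws.interruptionQueueName: "devops-cluster"   # placeholder; use your own cluster's queue name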

Test it out

To test it out, I followed the instructions in the documentation to use the pause image.

cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 0
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
          resources:
            requests:
              cpu: 1
EOF
kubectl scale deployment inflate --replicas 5
kubectl logs -f -n karpenter -l app.kubernetes.io/name=karpenter -c controller

Looking at the Karpenter controller logs, I can see the following:

2023-03-14T07:34:06.063Z	INFO	controller.provisioner	found provisionable pod(s)	{"commit": "dc3af1a", "pods": 5}
2023-03-14T07:34:06.063Z	INFO	controller.provisioner	computed new node(s) to fit pod(s)	{"commit": "dc3af1a", "nodes": 1, "pods": 5}
2023-03-14T07:34:06.063Z	INFO	controller.provisioner	launching machine with 5 pods requesting {"cpu":"5155m","memory":"120Mi","pods":"10"} from types t3a.2xlarge, t3.2xlarge	{"commit": "dc3af1a", "provisioner": "default"}
2023-03-14T07:34:06.329Z	DEBUG	controller.provisioner.cloudprovider	discovered kubernetes version	{"commit": "dc3af1a", "provisioner": "default", "kubernetes-version": "1.25"}
2023-03-14T07:34:06.367Z	DEBUG	controller.provisioner.cloudprovider	discovered new ami	{"commit": "dc3af1a", "provisioner": "default", "ami": "ami-033ca1a1a1e57d186", "query": "/aws/service/eks/optimized-ami/1.25/amazon-linux-2/recommended/image_id"}
2023-03-14T07:34:06.509Z	DEBUG	controller.provisioner.cloudprovider	created launch template	{"commit": "dc3af1a", "provisioner": "default", "launch-template-name": "Karpenter-devops-cluster-16268544510950142166", "launch-template-id": "lt-08dc1eeb40a739941"}
2023-03-14T07:34:08.641Z	INFO	controller.provisioner.cloudprovider	launched new instance	{"commit": "dc3af1a", "provisioner": "default", "id": "i-0c65db3e5c29dbf75", "hostname": "ip-10-150-101-82.ap-southeast-1.compute.internal", "instance-type": "t3.2xlarge", "zone": "ap-southeast-1b", "capacity-type": "spot"}
2023-03-14T07:37:28.222Z	DEBUG	controller.aws	deleted launch template	{"commit": "dc3af1a"}

Verifying in the AWS EC2 console, I can see that a t3.2xlarge instance was provisioned for the pause container deployment.
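The same can also be confirmed from kubectl, using the labels Karpenter applies to the nodes it provisions (a quick check, assuming the default v0.27 node labels):

kubectl get nodes -l karpenter.sh/provisioner-name -L node.kubernetes.io/instance-type,karpenter.sh/capacity-type

To trigger consolidation, the test deployment can then simply be deleted:

kubectl delete deployment inflate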

Upon deleting the deployment, I can see consolidation taking place automatically, and the newly provisioned instance was terminated.

2023-03-14T07:57:58.089Z	INFO	controller.deprovisioning	deprovisioning via consolidation delete, terminating 1 nodes ip-10-150-101-82.ap-southeast-1.compute.internal/t3.2xlarge/spot	{"commit": "dc3af1a"}
2023-03-14T07:57:58.109Z	INFO	controller.termination	cordoned node	{"commit": "dc3af1a", "node": "ip-10-150-101-82.ap-southeast-1.compute.internal"}
2023-03-14T07:57:58.462Z	INFO	controller.termination	deleted node	{"commit": "dc3af1a", "node": "ip-10-150-101-82.ap-southeast-1.compute.internal"}
2023-03-14T07:57:58.752Z	INFO	controller.termination	deleted node	{"commit": "dc3af1a", "node": "ip-10-150-101-82.ap-southeast-1.compute.internal"}

Karpenter Performance Dashboard in Prometheus

Final Thoughts

This was a short test of Karpenter's capabilities. I will need more time to understand the concepts and test it out on our development clusters to simulate real-life scenarios. I will update this article in due course when I'm ready.

Cheers!