Recently I came across this project called Karpenter (https://aws.amazon.com/blogs/aws/introducing-karpenter-an-open-source-high-performance-kubernetes-cluster-autoscaler/) and it intrigued me greatly.
Diving into the Documentation
Based on its official documentation (https://aws.github.io/aws-eks-best-practices/karpenter/), Karpenter is described as an open-source cluster autoscaler that automatically provisions new nodes in response to unschedulable pods: it evaluates the aggregate resource requirements of the pending pods and chooses the optimal instance type to run them. It also supports a consolidation feature that actively moves pods around and either deletes nodes or replaces them with cheaper alternatives to reduce costs.
If you have read my previous article on EKS clusters in AWS (https://alexlogy.io/creating-eks-cluster-in-aws-with-terraform/), you will see that my clusters consist of different managed node groups, each with a fixed instance type. A Cluster Autoscaler (CAS) was then used to scale the cluster according to workload requirements. This approach is tried and tested everywhere, but it doesn't optimize costs in your cluster: you have to use different node groups and taints to separate different workload requirements, and you have to use larger instance types to house your bigger workloads, resulting in wasted capacity.
A closer look at the description of Karpenter reveals the following:
Karpenter brings scaling management closer to Kubernetes native APIs than do Autoscaling Groups (ASGs) and Managed Node Groups (MNGs). ASGs and MNGs are AWS-native abstractions where scaling is triggered based on AWS level metrics, such as EC2 CPU load. Cluster Autoscaler bridges the Kubernetes abstractions into AWS abstractions, but loses some flexibility because of that, such as scheduling for a specific availability zone.
Karpenter removes a layer of AWS abstraction to bring some of the flexibility directly into Kubernetes. Karpenter is best used for clusters with workloads that encounter periods of high, spiky demand or have diverse compute requirements. MNGs and ASGs are good for clusters running workloads that tend to be more static and consistent. You can use a mix of dynamically and statically managed nodes, depending on your requirements.
Come to think of it, this is exactly what CAS lacks. In addition, Karpenter provides interruption handling, which was the reason we had the AWS Node Termination Handler installed in our clusters. From the documentation:
Karpenter supports native interruption handling, enabled through the aws.interruptionQueue value in Karpenter settings. Interruption handling watches for upcoming involuntary interruption events that would cause disruption to your workloads, such as:
- Spot Interruption Warnings
- Scheduled Change Health Events (Maintenance Events)
- Instance Terminating Events
- Instance Stopping Events
When Karpenter detects one of these events will occur to your nodes, it automatically cordons, drains, and terminates the node(s) ahead of the interruption event to give the maximum amount of time for workload cleanup prior to interruption. It is not advised to use AWS Node Termination Handler alongside Karpenter as explained here.
This means that with Karpenter, I can achieve flexible scaling, cost optimization, and Spot interruption handling all in one!
Deploying Karpenter
As such, I decided to try out Karpenter on our testing cluster to see its capabilities. The documentation is pretty comprehensive, so I will not go through the steps in detail; you can refer to https://karpenter.sh/v0.27.0/getting-started/.
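For context, the getting-started flow has you create a default Provisioner plus an AWSNodeTemplate, which together tell Karpenter which capacity types, subnets, and security groups it may use when launching nodes. A minimal sketch for the v0.27 API follows; the cluster name and discovery tag values are assumptions, so adjust them to your environment:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    # Allow both Spot and On-Demand; Karpenter picks the cheapest viable option
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
  providerRef:
    name: default
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  # Discover subnets and security groups via the karpenter.sh/discovery tag
  subnetSelector:
    karpenter.sh/discovery: devops-cluster   # assumed cluster name
  securityGroupSelector:
    karpenter.sh/discovery: devops-cluster   # assumed cluster name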
As I have CAS installed in my cluster, I need to follow the "Migrating from Cluster Autoscaler" documentation. However, I realised it doesn't include the AWS SQS queue and the EventBridge (CloudWatch Events) rules for events such as Spot interruption. Thus, I have modified the CloudFormation template in the documentation for the migration process.
AWSTemplateFormatVersion: "2010-09-09"
Description: Resources used by https://github.com/aws/karpenter for Interruption Queue
Parameters:
  ClusterName:
    Type: String
    Description: "EKS cluster name"
Resources:
  KarpenterControllerPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      ManagedPolicyName: !Sub "KarpenterControllerPolicy-${ClusterName}"
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Resource: "*"
            Action:
              # Write Operations
              - ec2:CreateFleet
              - ec2:CreateLaunchTemplate
              - ec2:CreateTags
              - ec2:DeleteLaunchTemplate
              - ec2:RunInstances
              - ec2:TerminateInstances
              # Read Operations
              - ec2:DescribeAvailabilityZones
              - ec2:DescribeImages
              - ec2:DescribeInstances
              - ec2:DescribeInstanceTypeOfferings
              - ec2:DescribeInstanceTypes
              - ec2:DescribeLaunchTemplates
              - ec2:DescribeSecurityGroups
              - ec2:DescribeSpotPriceHistory
              - ec2:DescribeSubnets
              - pricing:GetProducts
              - ssm:GetParameter
          - Effect: Allow
            Action:
              # Write Operations
              - sqs:DeleteMessage
              # Read Operations
              - sqs:GetQueueAttributes
              - sqs:GetQueueUrl
              - sqs:ReceiveMessage
            Resource: !GetAtt KarpenterInterruptionQueue.Arn
          - Effect: Allow
            Action:
              - iam:PassRole
            Resource: !Sub "arn:${AWS::Partition}:iam::${AWS::AccountId}:role/KarpenterNodeRole-${ClusterName}"
          - Effect: Allow
            Action:
              - eks:DescribeCluster
            Resource: !Sub "arn:${AWS::Partition}:eks:${AWS::Region}:${AWS::AccountId}:cluster/${ClusterName}"
  KarpenterInterruptionQueue:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: !Sub "${ClusterName}"
      MessageRetentionPeriod: 300
  KarpenterInterruptionQueuePolicy:
    Type: AWS::SQS::QueuePolicy
    Properties:
      Queues:
        - !Ref KarpenterInterruptionQueue
      PolicyDocument:
        Id: EC2InterruptionPolicy
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - events.amazonaws.com
                - sqs.amazonaws.com
            Action: sqs:SendMessage
            Resource: !GetAtt KarpenterInterruptionQueue.Arn
  ScheduledChangeRule:
    Type: 'AWS::Events::Rule'
    Properties:
      EventPattern:
        source:
          - aws.health
        detail-type:
          - AWS Health Event
      Targets:
        - Id: KarpenterInterruptionQueueTarget
          Arn: !GetAtt KarpenterInterruptionQueue.Arn
  SpotInterruptionRule:
    Type: 'AWS::Events::Rule'
    Properties:
      EventPattern:
        source:
          - aws.ec2
        detail-type:
          - EC2 Spot Instance Interruption Warning
      Targets:
        - Id: KarpenterInterruptionQueueTarget
          Arn: !GetAtt KarpenterInterruptionQueue.Arn
  RebalanceRule:
    Type: 'AWS::Events::Rule'
    Properties:
      EventPattern:
        source:
          - aws.ec2
        detail-type:
          - EC2 Instance Rebalance Recommendation
      Targets:
        - Id: KarpenterInterruptionQueueTarget
          Arn: !GetAtt KarpenterInterruptionQueue.Arn
  InstanceStateChangeRule:
    Type: 'AWS::Events::Rule'
    Properties:
      EventPattern:
        source:
          - aws.ec2
        detail-type:
          - EC2 Instance State-change Notification
      Targets:
        - Id: KarpenterInterruptionQueueTarget
          Arn: !GetAtt KarpenterInterruptionQueue.Arn
Deploy the CloudFormation template with the following command:
aws cloudformation deploy --parameter-overrides "ClusterName=${CLUSTER_NAME}" --template-file cloudformation.yaml --stack-name "Karpenter-devops-cluster" --capabilities CAPABILITY_NAMED_IAM
Once the CloudFormation stack is created, you have to edit the Karpenter global settings to add the SQS queue name.
kubectl edit configmap karpenter-global-settings -n karpenter
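In my v0.27 install, the relevant key in the karpenter-global-settings ConfigMap is aws.interruptionQueueName (some documentation refers to the setting as aws.interruptionQueue), and its value is the queue created by the stack above, which shares the cluster's name. A sketch of the resulting ConfigMap; the cluster name is an assumption:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: karpenter-global-settings
  namespace: karpenter
data:
  # Name of the SQS queue created by the CloudFormation stack above
  aws.interruptionQueueName: devops-cluster  # assumed cluster name
```

Karpenter starts polling the queue for interruption events once this value is set.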
Test it out
To test it out, I followed the instructions in the documentation and used the pause image.
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 0
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
          resources:
            requests:
              cpu: 1
EOF
kubectl scale deployment inflate --replicas 5
kubectl logs -f -n karpenter -l app.kubernetes.io/name=karpenter -c controller
Looking at the Karpenter controller logs, I can see the following:
2023-03-14T07:34:06.063Z INFO controller.provisioner found provisionable pod(s) {"commit": "dc3af1a", "pods": 5}
2023-03-14T07:34:06.063Z INFO controller.provisioner computed new node(s) to fit pod(s) {"commit": "dc3af1a", "nodes": 1, "pods": 5}
2023-03-14T07:34:06.063Z INFO controller.provisioner launching machine with 5 pods requesting {"cpu":"5155m","memory":"120Mi","pods":"10"} from types t3a.2xlarge, t3.2xlarge {"commit": "dc3af1a", "provisioner": "default"}
2023-03-14T07:34:06.329Z DEBUG controller.provisioner.cloudprovider discovered kubernetes version {"commit": "dc3af1a", "provisioner": "default", "kubernetes-version": "1.25"}
2023-03-14T07:34:06.367Z DEBUG controller.provisioner.cloudprovider discovered new ami {"commit": "dc3af1a", "provisioner": "default", "ami": "ami-033ca1a1a1e57d186", "query": "/aws/service/eks/optimized-ami/1.25/amazon-linux-2/recommended/image_id"}
2023-03-14T07:34:06.509Z DEBUG controller.provisioner.cloudprovider created launch template {"commit": "dc3af1a", "provisioner": "default", "launch-template-name": "Karpenter-devops-cluster-16268544510950142166", "launch-template-id": "lt-08dc1eeb40a739941"}
2023-03-14T07:34:08.641Z INFO controller.provisioner.cloudprovider launched new instance {"commit": "dc3af1a", "provisioner": "default", "id": "i-0c65db3e5c29dbf75", "hostname": "ip-10-150-101-82.ap-southeast-1.compute.internal", "instance-type": "t3.2xlarge", "zone": "ap-southeast-1b", "capacity-type": "spot"}
2023-03-14T07:37:28.222Z DEBUG controller.aws deleted launch template {"commit": "dc3af1a"}
Verifying in the AWS EC2 console, I can see that a t3.2xlarge instance was provisioned for the pause container deployment.
Upon deleting the deployment, I can see consolidation taking place automatically, and the newly provisioned instance was terminated.
2023-03-14T07:57:58.089Z INFO controller.deprovisioning deprovisioning via consolidation delete, terminating 1 nodes ip-10-150-101-82.ap-southeast-1.compute.internal/t3.2xlarge/spot {"commit": "dc3af1a"}
2023-03-14T07:57:58.109Z INFO controller.termination cordoned node {"commit": "dc3af1a", "node": "ip-10-150-101-82.ap-southeast-1.compute.internal"}
2023-03-14T07:57:58.462Z INFO controller.termination deleted node {"commit": "dc3af1a", "node": "ip-10-150-101-82.ap-southeast-1.compute.internal"}
2023-03-14T07:57:58.752Z INFO controller.termination deleted node {"commit": "dc3af1a", "node": "ip-10-150-101-82.ap-southeast-1.compute.internal"}
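Note that this consolidation behaviour is opt-in and is configured on the Provisioner. In v0.27 it is enabled roughly as follows; this is a sketch of the relevant field, not my full Provisioner spec:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  consolidation:
    enabled: true  # actively delete or replace under-utilised nodes to cut cost
```

With it enabled, Karpenter continuously looks for nodes whose pods could fit elsewhere, or that could be swapped for a cheaper instance type.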
Final Thoughts
This was a short test of Karpenter's capabilities. I will need more time to understand the concepts and to test it on our development clusters to simulate real-life scenarios. I will update this article in due course when I'm ready.
Cheers!