Use Karpenter to speed up Amazon EMR on EKS autoscaling


Amazon EMR on Amazon EKS is a deployment option for Amazon EMR that allows organizations to run Apache Spark on Amazon Elastic Kubernetes Service (Amazon EKS). With EMR on EKS, the Spark jobs run on the Amazon EMR runtime for Apache Spark. This increases the performance of your Spark jobs so that they run faster and cost less than open source Apache Spark. Additionally, you can run Amazon EMR-based Apache Spark applications with other types of applications on the same EKS cluster to improve resource utilization and simplify infrastructure management.

Karpenter was introduced at AWS re:Invent 2021 to provide a dynamic, high-performance, open-source cluster auto scaling solution for Kubernetes. It automatically provisions new nodes in response to unschedulable pods. It observes the aggregate resource requests of unscheduled pods and makes decisions to launch new nodes and terminate them to reduce scheduling latencies as well as infrastructure costs.

To configure Karpenter, you create provisioners that define how Karpenter manages pending pods and expires nodes. Although most use cases are addressed with a single provisioner, multiple provisioners are useful in multi-tenant use cases such as isolating nodes for billing, using different node constraints (such as no GPUs for a team), or using different deprovisioning settings. Karpenter launches nodes with minimal compute resources to fit unschedulable pods for efficient binpacking. It works in tandem with the Kubernetes scheduler to bind unschedulable pods to the new nodes that are provisioned. The following diagram illustrates how it works.
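As a concrete illustration of the multi-provisioner case, a second, more constrained provisioner can sit alongside the default one. The following is a minimal sketch under the Karpenter v1alpha5 API used later in this post; the provisioner name, taint, and instance constraints are hypothetical:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  # hypothetical provisioner for a team that must not receive GPU-capable nodes
  name: analytics-team
spec:
  # only pods that tolerate this taint are scheduled on these nodes
  taints:
    - key: team
      value: analytics
      effect: NoSchedule
  requirements:
    # restrict to CPU-only instance families
    - key: karpenter.k8s.aws/instance-family
      operator: In
      values: [c5, m5]
  # reclaim empty nodes after 30 seconds
  ttlSecondsAfterEmpty: 30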

This post shows how to integrate Karpenter into your EMR on EKS architecture to achieve faster and capacity-aware auto scaling capabilities to speed up your big data and machine learning (ML) workloads while reducing costs. We run the same workload using both Cluster Autoscaler and Karpenter, to see some of the improvements we discuss in the next section.

Improvements compared to Cluster Autoscaler

Like Karpenter, Kubernetes Cluster Autoscaler (CAS) is designed to add nodes when there are pending pods that can't be scheduled on existing capacity. Cluster Autoscaler is part of the Kubernetes project, with implementations by major Kubernetes cloud providers. By taking a fresh look at provisioning, Karpenter offers the following improvements:

  • No node group management overhead – Because you may have different resource requirements for different Spark workloads along with other workloads in your EKS cluster, you need to create separate node groups that can meet your requirements, like instance sizes, Availability Zones, and purchase options. This can quickly grow to tens and hundreds of node groups, which adds additional management overhead. Karpenter manages each instance directly, without additional orchestration mechanisms like node groups, taking a group-less approach by calling the EC2 Fleet API directly to provision nodes. This allows Karpenter to use diverse instance types, Availability Zones, and purchase options by simply creating a single provisioner, as shown in the following figure.
  • Quick retries – If the Amazon Elastic Compute Cloud (Amazon EC2) capacity isn't available, Karpenter can retry in milliseconds instead of minutes. This can be really helpful if you're using EC2 Spot Instances and you're unable to get capacity for specific instance types.
  • Designed to handle the full flexibility of the cloud – Karpenter has the ability to efficiently handle the full range of instance types available through AWS. Cluster Autoscaler wasn't originally built with the flexibility to handle hundreds of instance types, Availability Zones, and purchase options. We recommend being as flexible as you can be to let Karpenter get the just-in-time capacity you need.
  • Improves the overall node utilization by binpacking – Karpenter batches pending pods and then binpacks them based on the CPU, memory, and GPUs required, taking into account node overhead (for example, daemon set resources required). After the pods are binpacked on the most efficient instance type, Karpenter takes other instance types that are similar to or larger than the most efficient packing, and passes the instance type options to an API called EC2 Fleet, following some of the best practices of instance diversification to improve the chances of getting the requested capacity.

Best practices for using Karpenter with EMR on EKS

For general best practices with Karpenter, refer to Karpenter Best Practices. The following are additional considerations with EMR on EKS:

  • Avoid inter-AZ data transfer cost by either configuring the Karpenter provisioner to launch in a single Availability Zone or using a node selector or affinity and anti-affinity to schedule the driver and the executors of the same job in a single Availability Zone (see the affinity sketch after this list). See the following code:
    nodeSelector:
      topology.kubernetes.io/zone: us-east-1a

  • Cost optimize Spark workloads using EC2 Spot Instances for executors and On-Demand Instances for the driver by using the node selector with the label karpenter.sh/capacity-type in the pod templates. We recommend using pod templates to specify driver pods to run on On-Demand Instances and executor pods to run on Spot Instances. This allows you to consolidate provisioner specs because you don't need two specs per job type. It also follows the best practice of using customization defined on workload types and keeps provisioner specs able to support a broader number of use cases.
  • When using EC2 Spot Instances, maximize the instance diversification in the provisioner configuration to adhere to the best practices. To select suitable instance types, you can use the ec2-instance-selector, a CLI tool and Go library that recommends instance types based on resource criteria like vCPUs and memory (see the example after this list).
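To schedule the driver and executors of one job in the same Availability Zone without pinning the provisioner to a zone, a pod affinity rule on the executor pod template is one option. The following is a hedged sketch: spark-role is a label that Spark on Kubernetes applies to driver and executor pods, but as written the selector matches any driver, so scoping it to a single job would additionally require matching the job-specific spark-app-selector label:

# executor pod template fragment (sketch): place executors in the driver's zone
apiVersion: v1
kind: Pod
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            spark-role: driver
        topologyKey: topology.kubernetes.io/zone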
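To find a diversified set of instance types for the provisioner, an ec2-instance-selector invocation like the following can help; the resource values here are illustrative, not a recommendation:

# suggest x86_64 instance types with 4 vCPUs and 16 GiB of memory
ec2-instance-selector --vcpus 4 --memory 16 --cpu-architecture x86_64 --region us-east-1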

Solution overview

This post provides an example of how to set up both Cluster Autoscaler and Karpenter in an EKS cluster and compare the auto scaling improvements by running a sample EMR on EKS workload.

The following diagram illustrates the architecture of this solution.

We use the Transaction Processing Performance Council Decision Support (TPC-DS) benchmark, a decision support benchmark, to sequentially run three Spark SQL queries (q70-v2.4, q82-v2.4, q64-v2.4) with a fixed number of 50 executors, against 17.7 billion records, approximately 924 GB of compressed data in Parquet file format. For more details on TPC-DS, refer to the eks-spark-benchmark GitHub repo.

We submit the same job with different Spark driver and executor specs to mimic different jobs, purely to observe the auto scaling behavior and binpacking. We recommend you right-size your Spark executors based on the workload characteristics for production workloads.

The following code is an example Spark configuration that results in pod spec requests of 4 vCPU and 15 GB:

--conf spark.executor.instances=50 --conf spark.driver.cores=4 --conf spark.driver.memory=10g --conf spark.driver.memoryOverhead=5g --conf spark.executor.cores=4 --conf spark.executor.memory=10g --conf spark.executor.memoryOverhead=5g
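Because the pod is sized from the Spark memory setting plus the overhead, each executor pod under this configuration requests roughly the resources below. This fragment is an illustration of the arithmetic, not generated output:

# approximate per-executor pod resource requests implied by the settings above
resources:
  requests:
    cpu: "4"      # spark.executor.cores
    memory: 15Gi  # spark.executor.memory (10g) + spark.executor.memoryOverhead (5g)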

We use pod templates to schedule Spark drivers on On-Demand Instances and executors on EC2 Spot Instances (which can save up to 90% over On-Demand Instance prices). Spark's inherent resiliency has the driver launch new executors to replace the ones that fail due to Spot interruptions. See the following code:

# executor pod template (runs on Spot)
apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    karpenter.sh/capacity-type: spot
  containers:
  - name: spark-kubernetes-executor

# driver pod template (runs On-Demand)
apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    karpenter.sh/capacity-type: on-demand
  containers:
  - name: spark-kubernetes-driver

Prerequisites

We use an AWS Cloud9 IDE to run all the instructions throughout this post.

To create your IDE, run the following commands in AWS CloudShell. The default Region is us-east-1, but you can change it if needed.

# clone the repo
git clone https://github.com/black-mirror-1/karpenter-for-emr-on-eks.git
cd karpenter-for-emr-on-eks
./setup/create-cloud9-ide.sh

Navigate to the AWS Cloud9 IDE using the URL from the output of the script.

Install tools on the AWS Cloud9 IDE

Install the tools required on the AWS Cloud9 environment by running a script:

Run the following instructions in your AWS Cloud9 environment, not CloudShell.

  1. Clone the GitHub repository:
    cd ~/environment
    git clone https://github.com/black-mirror-1/karpenter-for-emr-on-eks.git
    cd ~/environment/karpenter-for-emr-on-eks

  2. Set up the required environment variables. Feel free to adjust the following code according to your needs:
    # Install envsubst (from GNU gettext utilities) and bash-completion
    sudo yum -y install jq gettext bash-completion moreutils
    
    # Set up the required env variables
    export EKSCLUSTER_NAME=aws-blog
    export EKS_VERSION="1.23"
    # get the link to the same version as EKS from here https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html
    export KUBECTL_URL="https://s3.us-west-2.amazonaws.com/amazon-eks/1.23.7/2022-06-29/bin/linux/amd64/kubectl"
    export HELM_VERSION="v3.9.4"
    export KARPENTER_VERSION="v0.18.1"
    # get the latest matching version of the Cluster Autoscaler from here https://github.com/kubernetes/autoscaler/releases
    export CAS_VERSION="v1.23.1"

  3. Install the CLI tools on the AWS Cloud9 IDE:
    cd ~/environment/karpenter-for-emr-on-eks
    ./setup/c9-install-tools.sh

Provision the infrastructure

We set up the following resources using the infrastructure provisioning script:

  1. Create the EMR on EKS and Karpenter infrastructure:
    cd ~/environment/karpenter-for-emr-on-eks
    ./setup/create-eks-emr-infra.sh

  2. Validate the setup:
    # Should have results that are running
    kubectl get nodes
    kubectl get pods -n karpenter
    kubectl get po -n kube-system -l app.kubernetes.io/instance=cluster-autoscaler
    kubectl get po -n prometheus

Understanding Karpenter configurations

Because the sample workload has driver and executor specs of different sizes, we have identified instances from the c5, c5a, c5d, c5ad, c6a, m4, m5, m5a, m5d, m5ad, and m6a families of sizes 2xlarge, 4xlarge, 8xlarge, and 9xlarge for our workload using the amazon-ec2-instance-selector CLI. With CAS, we need to create a total of 12 node groups, as shown in eksctl-config.yaml, but can define the same constraints in Karpenter with a single provisioner, as shown in the following code:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  provider:
    launchTemplate: {EKSCLUSTER_NAME}-karpenter-launchtemplate
    subnetSelector:
      karpenter.sh/discovery: {EKSCLUSTER_NAME}
  labels:
    app: kspark
  requirements:
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["on-demand","spot"]
    - key: "kubernetes.io/arch"
      operator: In
      values: ["amd64"]
    - key: karpenter.k8s.aws/instance-family
      operator: In
      values: [c5, c5a, c5d, c5ad, m5, c6a]
    - key: karpenter.k8s.aws/instance-size
      operator: In
      values: [2xlarge, 4xlarge, 8xlarge, 9xlarge]
    - key: "topology.kubernetes.io/zone"
      operator: In
      values: ["{AWS_REGION}a"]
  limits:
    resources:
      cpu: "2000"
  ttlSecondsAfterEmpty: 30
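The setup scripts apply this spec after filling in the placeholders. If you apply it manually, something like the following would work; the file name is hypothetical and sed is just one way to do the substitution:

# substitute the placeholders, then apply the provisioner spec
sed -e "s/{EKSCLUSTER_NAME}/${EKSCLUSTER_NAME}/g" \
    -e "s/{AWS_REGION}/${AWS_REGION}/g" karpenter-provisioner.yaml | kubectl apply -f -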

We have set up both auto scalers to scale down nodes that are empty for 30 seconds, using ttlSecondsAfterEmpty in Karpenter and --scale-down-unneeded-time in CAS.

Karpenter by design will try to achieve the most efficient packing of the pods on a node based on the CPU, memory, and GPUs required.
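You can watch these provisioning and packing decisions as they happen in the Karpenter controller logs; assuming the Helm chart's default labels and container name, a command like the following tails them:

kubectl logs -f -n karpenter -l app.kubernetes.io/name=karpenter -c controller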

Run a sample workload

To run a sample workload, complete the following steps:

  1. Let's review the AWS Command Line Interface (AWS CLI) command to submit a sample job:
    aws emr-containers start-job-run \
      --virtual-cluster-id $VIRTUAL_CLUSTER_ID \
      --name karpenter-benchmark-${CORES}vcpu-${MEMORY}gb \
      --execution-role-arn $EMR_ROLE_ARN \
      --release-label emr-6.5.0-latest \
      --job-driver '{
      "sparkSubmitJobDriver": {
          "entryPoint": "local:///usr/lib/spark/examples/jars/eks-spark-benchmark-assembly-1.0.jar",
          "entryPointArguments":["s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned","s3://'$S3BUCKET'/EMRONEKS_TPCDS-TEST-3T-RESULT-KA","/opt/tpcds-kit/tools","parquet","3000","1","false","q70-v2.4,q82-v2.4,q64-v2.4","true"],
          "sparkSubmitParameters": "--class com.amazonaws.eks.tpcds.BenchmarkSQL --conf spark.executor.instances=50 --conf spark.driver.cores='$CORES' --conf spark.driver.memory='$EXEC_MEMORY'g --conf spark.executor.cores='$CORES' --conf spark.executor.memory='$EXEC_MEMORY'g"}}' \
      --configuration-overrides '{
        "applicationConfiguration": [
          {
            "classification": "spark-defaults", 
            "properties": {
              "spark.kubernetes.node.selector.app": "kspark",
              "spark.kubernetes.node.selector.topology.kubernetes.io/zone": "'${AWS_REGION}'a",
    
              "spark.kubernetes.container.image": "'$ECR_URL'/eks-spark-benchmark:emr6.5",
              "spark.kubernetes.driver.podTemplateFile": "s3://'$S3BUCKET'/pod-template/karpenter-driver-pod-template.yaml",
              "spark.kubernetes.executor.podTemplateFile": "s3://'$S3BUCKET'/pod-template/karpenter-executor-pod-template.yaml",
              "spark.network.timeout": "2000s",
              "spark.executor.heartbeatInterval": "300s",
              "spark.kubernetes.executor.limit.cores": "'$CORES'",
              "spark.executor.memoryOverhead": "'$MEMORY_OVERHEAD'G",
              "spark.driver.memoryOverhead": "'$MEMORY_OVERHEAD'G",
              "spark.kubernetes.executor.podNamePrefix": "karpenter-'$CORES'vcpu-'$MEMORY'gb",
              "spark.executor.defaultJavaOptions": "-verbose:gc -XX:+UseG1GC",
              "spark.driver.defaultJavaOptions": "-verbose:gc -XX:+UseG1GC",
    
              "spark.ui.prometheus.enabled":"true",
              "spark.executor.processTreeMetrics.enabled":"true",
              "spark.kubernetes.driver.annotation.prometheus.io/scrape":"true",
              "spark.kubernetes.driver.annotation.prometheus.io/path":"/metrics/executors/prometheus/",
              "spark.kubernetes.driver.annotation.prometheus.io/port":"4040",
              "spark.kubernetes.driver.service.annotation.prometheus.io/scrape":"true",
              "spark.kubernetes.driver.service.annotation.prometheus.io/path":"/metrics/driver/prometheus/",
              "spark.kubernetes.driver.service.annotation.prometheus.io/port":"4040",
              "spark.metrics.conf.*.sink.prometheusServlet.class":"org.apache.spark.metrics.sink.PrometheusServlet",
              "spark.metrics.conf.*.sink.prometheusServlet.path":"/metrics/driver/prometheus/",
              "spark.metrics.conf.master.sink.prometheusServlet.path":"/metrics/master/prometheus/",
              "spark.metrics.conf.applications.sink.prometheusServlet.path":"/metrics/applications/prometheus/"
             }}
        ]}'

  2. Submit four jobs with different driver and executor vCPU and memory sizes on Karpenter:
    # the arguments are vcpus and memory
    export EMRCLUSTER_NAME=${EKSCLUSTER_NAME}-emr
    ./sample-workloads/emr6.5-tpcds-karpenter.sh 4 7
    ./sample-workloads/emr6.5-tpcds-karpenter.sh 8 15
    ./sample-workloads/emr6.5-tpcds-karpenter.sh 4 15
    ./sample-workloads/emr6.5-tpcds-karpenter.sh 8 31 

  3. To monitor the pods' auto scaling status in real time, open a new terminal in the Cloud9 IDE and run the following command (nothing is returned at first):
    watch -n1 "kubectl get pod -n emr-karpenter"

  4. Monitor the EC2 instance and node auto scaling status in a second terminal tab by running the following command (by design, Karpenter schedules in Availability Zone a):
    watch -n1 "kubectl get node --label-columns=node.kubernetes.io/instance-type,karpenter.sh/capacity-type,topology.kubernetes.io/zone,app -l app=kspark"

Compare with Cluster Autoscaler (Optional)

We have set up Cluster Autoscaler during the infrastructure setup step with the following configuration:

  • Launch EC2 nodes in Availability Zone b
  • Contain 12 node groups (6 each for On-Demand and Spot)
  • Scale down unneeded nodes after 30 seconds with --scale-down-unneeded-time
  • Use the least-waste expander on CAS, which selects the node group that will have the least idle CPU for binpacking efficiency (see the flag sketch after this list)
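For reference, the two behaviors called out above map to Cluster Autoscaler flags. The following is a hedged excerpt of the deployment's container args, not the full configuration used by the setup script:

# cluster-autoscaler container args (excerpt)
- --expander=least-waste
- --scale-down-unneeded-time=30s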
  1. Submit four jobs with different driver and executor vCPU and memory sizes on CAS:
    # the arguments are vcpus and memory
    ./sample-workloads/emr6.5-tpcds-ca.sh 4 7
    ./sample-workloads/emr6.5-tpcds-ca.sh 8 15
    ./sample-workloads/emr6.5-tpcds-ca.sh 4 15
    ./sample-workloads/emr6.5-tpcds-ca.sh 8 31

  2. To monitor the pods' auto scaling status in real time, open a new terminal in the Cloud9 IDE and run the following command (nothing is returned at first):
    watch -n1 "kubectl get pod -n emr-ca"

  3. Monitor the EC2 instance and node auto scaling status in a second terminal tab by running the following command (by design, CAS schedules in Availability Zone b):
    watch -n1 "kubectl get node --label-columns=node.kubernetes.io/instance-type,eks.amazonaws.com/capacityType,topology.kubernetes.io/zone,app -l app=caspark"

Observations

The time from pod creation to being scheduled is, on average, lower with Karpenter than with CAS, as shown in the following figure; you can see a noticeable difference when you run large-scale workloads.

As shown in the following figures, as the jobs completed, Karpenter was able to scale down the nodes that aren't needed within seconds. In contrast, CAS takes minutes, because it sends a signal to the node groups, adding additional latency. This in turn helps reduce overall costs by reducing the number of seconds unneeded EC2 instances are running.

Clean up

To clean up your environment, delete all the resources created, in reverse order, by running the cleanup script:

export EKSCLUSTER_NAME=aws-blog
cd ~/environment/karpenter-for-emr-on-eks
./setup/cleanup.sh

Conclusion

In this post, we showed you how to use Karpenter to simplify EKS node provisioning and speed up auto scaling of EMR on EKS workloads. We encourage you to try Karpenter and provide any feedback by creating a GitHub issue.

About the Authors

Changbin Gong is a Principal Solutions Architect at Amazon Web Services. He engages with customers to create innovative solutions that address customer business problems and accelerate the adoption of AWS services. In his spare time, Changbin enjoys reading, running, and traveling.

Sandeep Palavalasa is a Sr. Specialist Containers SA at Amazon Web Services. He is a software technology leader with over 12 years of experience in building large-scale, distributed software systems. His professional career started with a focus on monitoring and observability, and he has a strong cloud architecture background. He likes working on distributed systems and is excited to talk about microservice architecture design. His current interests are in the areas of container services and serverless technologies.
