EMR on EKS via CDK

This project demonstrates EMR on EKS, using the AWS CDK for automation.

It deploys the following:

  • An EKS cluster in a new VPC across 3 subnets
    • The cluster has a default managed node group that scales between 3 and 10 nodes. This node group can use x86 or Graviton2 (arm64) instance types.
    • It also has a Fargate profile set to use the sparkfg namespace
  • Two EMR virtual clusters on the EKS cluster
    • The first virtual cluster uses the sparkns namespace on a managed node group
    • The second virtual cluster uses the sparkfg namespace on a Fargate profile
    • All EMR on EKS configuration is handled, including a cluster role bound to an IAM role
  • A default job execution role that has full CloudWatch and S3 access
  • A CloudWatch log group for use with Spark job runs
  • Optional stacks for Apache Airflow (MWAA) and EMR Studio

Deployment

We tested this deployment using CDK version 1.99.0.

After cloning the repo, download the Airflow plugin zip file:

wget https://elasticmapreduce.s3.amazonaws.com/emr-containers/airflow/plugins/emr_containers_airflow_plugin.zip
mv emr_containers_airflow_plugin.zip emr_eks_cdk/mwaa_plugins/

Now run:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cdk bootstrap aws://<account>/<region> --context prefix=<prefix list for inbound access to MWAA> --context instance=m5.xlarge --context username=<SSO user name>
cdk synth --context prefix=<prefix list for inbound access to MWAA> --context instance=m5.xlarge --context username=<SSO user name>
cdk ls --context prefix=<prefix list for inbound access to MWAA> --context instance=m5.xlarge --context username=<SSO user name> # list stacks
cdk deploy <stack name> --context prefix=<prefix list for inbound access to MWAA> --context instance=m5.xlarge  --context username=<SSO user name>

If deployment fails due to an error creating the EKS cluster, just redeploy. This is a known issue in the CDK.

Available stacks include:

  • emr-eks-cdk: The base stack
  • mwaa-cdk: Adds Airflow
  • studio-cdk: Adds EMR Studio prerequisites (requires SSO enabled in your account)
  • studio-cdk-live: Adds EMR Studio (requires SSO enabled in your account)
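
For example, a deploy of the base stack might look like the command below. The CIDR prefix and user name are illustrative placeholders only; substitute your own values:

# Deploy the base EKS + EMR virtual cluster stack (example values):
cdk deploy emr-eks-cdk --context prefix=203.0.113.0/24 --context instance=m5.xlarge --context username=jdoe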

Note that EMR Studio doesn't yet support Graviton2 nodes. Do not choose a Graviton2 instance type if you want to use Studio.

Register EKS in kube config

The CDK output includes the command for updating your kube config so you can run commands against the cluster. It'll look like this:

emr-eks-cdk.EksForSparkConfigCommandB4B8E93B = aws eks update-kubeconfig --name EksForSparkCF45D836-7d1075f7618943f2b56095bcbdd13709 --region us-west-2 --role-arn arn:aws:iam::<account>:role/emr-eks-cdk-EksForSparkMastersRole19842074-1QPZ722TZ38S

If you run multiple k8s clusters, you can list and switch cluster contexts like this:

kubectl config get-contexts
kubectl config use-context "arn:aws:eks:us-west-2:<account>:cluster/EksForSparkCF45D836-7d1075f7618943f2b56095bcbdd13709"

The argument to use-context is the context name as reported by get-contexts.
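
As a quick sanity check, assuming the base stack created both Spark namespaces, you can confirm connectivity:

# Nodes should be Ready; both namespaces from the base stack should exist:
kubectl get nodes
kubectl get namespace sparkns sparkfg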

Test a job run

First, identify your virtual cluster ID:

aws emr-containers list-virtual-clusters

The virtual cluster IDs are also in the CDK stack output.
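
If you run several virtual clusters, you can narrow the listing to running ones with the CLI's JMESPath support; the query below is just one way to slice the output:

# Show only the id and name of clusters in the RUNNING state:
aws emr-containers list-virtual-clusters \
    --query "virtualClusters[?state=='RUNNING'].{id: id, name: name}"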

This command uses the local:// scheme to refer to the pi.py file as the job's entryPoint. That file is pre-built into the Docker image; the local:// scheme can only reference files baked into the image, not files on your local filesystem.

Run a test application with this command:

aws emr-containers start-job-run \
    --virtual-cluster-id <virtual cluster ID> \
    --name sample-job-name \
    --execution-role-arn <job role ARN from CDK output> \
    --release-label emr-6.2.0-latest \
    --job-driver '{"sparkSubmitJobDriver": {"entryPoint": "local:///usr/lib/spark/examples/src/main/python/pi.py","sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"}}' \
    --configuration-overrides '{"monitoringConfiguration": {"cloudWatchMonitoringConfiguration": {"logGroupName": "<log group from CDK output>", "logStreamNamePrefix": "SparkEMREKS"}}}'

You can track job completion in the EMR console.
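
If you prefer the CLI, start-job-run returns a job run ID that you can poll with describe-job-run (jq is optional here, as elsewhere in this README):

# Poll the state of a submitted job run:
aws emr-containers describe-job-run \
    --virtual-cluster-id <virtual cluster ID> \
    --id <job run ID> | jq '.jobRun.state'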

Testing with Airflow

Go to the MWAA console and open the Airflow UI. Activate the DAG by moving the slider to On. Add two Airflow variables:

  • cluster_id = your virtual cluster ID
  • role_arn = your job role ARN

Then trigger the DAG.

Running on Graviton2 node group

Deploy the stack with m6g.xlarge as the instance type rather than m5.xlarge. Then include a node selector when you start the job.

As in the earlier example, the entryPoint uses the local:// scheme to refer to the pi.py file pre-built into the Docker image. The only difference is the added spark.kubernetes.node.selector.kubernetes.io/arch=arm64 setting, which pins the driver and executors to arm64 (Graviton2) nodes.

aws emr-containers start-job-run \
    --virtual-cluster-id <virtual cluster ID> \
    --name sample-job-name \
    --execution-role-arn <job role ARN from CDK output> \
    --release-label emr-6.2.0-latest \
    --job-driver '{"sparkSubmitJobDriver": {"entryPoint": "local:///usr/lib/spark/examples/src/main/python/pi.py","sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1 --conf spark.kubernetes.node.selector.kubernetes.io/arch=arm64"}}' \
    --configuration-overrides '{"monitoringConfiguration": {"cloudWatchMonitoringConfiguration": {"logGroupName": "<log group from CDK output>", "logStreamNamePrefix": "SparkEMREKS"}}}'
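
To confirm that the driver and executor pods landed on Graviton2 nodes, check pod placement against the nodes' architecture labels (the sparkns namespace comes from the base stack):

# Show which node each pod was scheduled on, and each node's CPU architecture:
kubectl get pods -n sparkns -o wide
kubectl get nodes -L kubernetes.io/arch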

EMR Studio

Deploy the studio-cdk stack. Wait for the deployment to finish, then verify that the managed endpoint is active:

aws emr-containers list-managed-endpoints --virtual-cluster-id <cluster ID> | jq '.endpoints[].state'
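
If you want to block until the endpoint is ready, a simple polling loop works; this sketch assumes a single endpoint on the cluster:

# Wait until the first endpoint reports ACTIVE, checking every 30 seconds:
while [ "$(aws emr-containers list-managed-endpoints --virtual-cluster-id <cluster ID> | jq -r '.endpoints[0].state')" != "ACTIVE" ]; do
    sleep 30
done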

If the endpoint fails to create, try creating it again manually:

aws emr-containers create-managed-endpoint \
    --type JUPYTER_ENTERPRISE_GATEWAY \
    --virtual-cluster-id <virtual cluster ID> \
    --name <endpoint name> \
    --execution-role-arn <job role ARN> \
    --release-label emr-6.2.0-latest \
    --certificate-arn <certificate ARN>
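
The create call returns an endpoint ID, which you can then poll directly instead of listing every endpoint:

# Check the state of a specific managed endpoint:
aws emr-containers describe-managed-endpoint \
    --id <endpoint ID> \
    --virtual-cluster-id <virtual cluster ID> | jq '.endpoint.state'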

Now deploy the studio-cdk-live stack. It outputs the URL for your Studio environment.

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.
