mohade09 / amazon-emr-vscode-toolkit

This project forked from awslabs/amazon-emr-vscode-toolkit

A VS Code Extension to make it easier to manage and develop Spark jobs on EMR

Home Page: https://marketplace.visualstudio.com/items?itemName=AmazonEMR.emr-tools

License: Apache License 2.0

Shell 0.11% Python 0.55% TypeScript 95.14% CSS 1.97% Dockerfile 2.24%

amazon-emr-vscode-toolkit's Introduction

Amazon EMR Toolkit for VS Code (Developer Preview)

EMR Toolkit is a VS Code Extension to make it easier to develop Spark jobs on EMR.

Requirements

  • A local AWS profile
  • Access to the AWS API to list EMR and Glue resources
  • Docker (if you want to use the devcontainer)

Features

Amazon EMR Explorer

The Amazon EMR Explorer allows you to browse job runs and steps across EMR on EC2, EMR on EKS, and EMR Serverless. To see the Explorer, choose the EMR icon in the Activity bar.

Glue Catalog Explorer

The Glue Catalog Explorer displays the databases and tables in the Glue Data Catalog. Right-click a table and select View Glue Table to display its columns.
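
To get the same column listing outside the extension, the sketch below reads a table definition with boto3's Glue client. It illustrates the underlying Glue `GetTable` API and is not the extension's own code; the helper names `columns_from_table` and `describe_table` are hypothetical, and `describe_table` assumes boto3 is installed and AWS credentials are available.

```python
from typing import Any, Dict, List, Tuple


def columns_from_table(response: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Extract (name, type) pairs from a Glue GetTable response."""
    table = response["Table"]
    columns = table.get("StorageDescriptor", {}).get("Columns", [])
    # Glue reports partition keys separately from the regular columns.
    partitions = table.get("PartitionKeys", [])
    return [(c["Name"], c.get("Type", "")) for c in columns + partitions]


def describe_table(database: str, table: str) -> List[Tuple[str, str]]:
    """Fetch a table from the Glue Data Catalog (needs AWS credentials)."""
    import boto3  # imported lazily so the parsing helper works without boto3

    glue = boto3.client("glue")
    return columns_from_table(glue.get_table(DatabaseName=database, Name=table))
```

For example, `describe_table("my_database", "my_table")` (placeholder names) would return a list such as `[("id", "bigint"), ...]`, depending on the table's schema.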

PySpark EMR Development Container

The toolkit provides an EMR: Create local Spark environment command that creates a development container based on an EMR on EKS image for the EMR version you choose. You can use this container to develop Spark and PySpark code locally that is fully compatible with your remote EMR environment.

After you choose a region and an EMR version, the extension creates the relevant Dockerfile and devcontainer.json.

Once these files are created, follow the instructions in the emr-local.md file to authenticate to ECR, then use the Dev Containers: Reopen in Container command to build and open your local Spark environment.

You can configure AWS authentication in the container in one of three ways:

  • Use existing ~/.aws config - This mounts your ~/.aws directory to the container.
  • Environment variables - If you already have AWS environment variables configured in your shell, the container will reference those variables.
  • .env file - Creates a .devcontainer/aws.env file that you can populate with AWS credentials.
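
If you pick the .env file option, a quick check that the file has been filled in can save a failed first run. The sketch below assumes the file uses plain `KEY=VALUE` lines and that it should hold the standard AWS SDK variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; the exact keys the extension writes are not listed here, so treat `REQUIRED_KEYS` and the helper names as assumptions.

```python
from pathlib import Path
from typing import Dict, List

# Assumption: these standard AWS SDK variable names are what the file should hold.
REQUIRED_KEYS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(text: str) -> List[str]:
    """Return the required keys that are absent or left empty."""
    values = parse_env(text)
    return [key for key in REQUIRED_KEYS if not values.get(key)]


if __name__ == "__main__":
    env_file = Path(".devcontainer/aws.env")
    if env_file.exists():
        print("Missing:", missing_keys(env_file.read_text()) or "none")
    else:
        print(f"{env_file} not found")
```

Run it from the workspace root before reopening the folder in the container.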

Spark Shell Support

The EMR Development Container is configured to run Spark in local mode. You can use it like any Spark-enabled environment. Inside the VS Code Terminal, you can use the pyspark or spark-shell commands to start a local Spark session.

Jupyter Notebook Support

By default, the EMR Development Container also supports Jupyter. Use the Create: New Jupyter Notebook command to create a new Jupyter notebook. The container environment is also configured to use the Glue Data Catalog, so you can run spark.sql commands against Glue tables. The following code snippet shows how to initialize a SparkSession inside the notebook.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("EMRLocal")
    .getOrCreate()
)
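
Because the container is configured for the Glue Data Catalog, the session can query Glue tables directly. The following is a minimal sketch, assuming `spark` is the SparkSession created in the notebook; the helper names and the database and table names in the usage note are placeholders, not resources that ship with the toolkit.

```python
def preview_query(database: str, table: str, limit: int = 10) -> str:
    """Build a SELECT that previews the first rows of a Glue table."""
    # Backticks keep names with dashes or reserved words valid in Spark SQL.
    return f"SELECT * FROM `{database}`.`{table}` LIMIT {int(limit)}"


def preview_table(spark, database: str, table: str, limit: int = 10):
    """Run the preview query on an existing SparkSession and return a DataFrame."""
    return spark.sql(preview_query(database, table, limit))
```

In a notebook cell, `spark.sql("SHOW DATABASES").show()` lists the catalog's databases, and `preview_table(spark, "my_database", "my_table").show()` prints the first rows of a table.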

EMR Serverless Deployment

You can deploy and run a single PySpark file on EMR Serverless with the EMR Serverless: Deploy and run PySpark job command. You'll be prompted for the following information:

  • S3 URI - Your PySpark file will be copied here
  • IAM Role - A job runtime role that can be used to run your EMR Serverless job
  • EMR Serverless Application ID - The ID of an existing EMR Serverless Spark application
  • Filename - The name of the local PySpark file you want to run on EMR Serverless
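
For reference, the four prompts map naturally onto two AWS API calls: an S3 upload and an EMR Serverless `StartJobRun`. The sketch below shows that mapping with boto3. It is not the extension's implementation (the extension is written in TypeScript); the function names are hypothetical, and `deploy_and_run` assumes boto3 is installed and credentials are configured.

```python
import os
from typing import Any, Dict, Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split s3://bucket/prefix into (bucket, prefix without surrounding slashes)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.strip("/")


def build_job_request(
    s3_uri: str, role_arn: str, application_id: str, filename: str
) -> Dict[str, Any]:
    """Build StartJobRun arguments that point at the uploaded script."""
    bucket, prefix = split_s3_uri(s3_uri)
    key = "/".join(part for part in (prefix, os.path.basename(filename)) if part)
    return {
        "applicationId": application_id,
        "executionRoleArn": role_arn,
        "jobDriver": {"sparkSubmit": {"entryPoint": f"s3://{bucket}/{key}"}},
    }


def deploy_and_run(
    s3_uri: str, role_arn: str, application_id: str, filename: str
) -> str:
    """Upload the local script, start the job, and return the job run ID."""
    import boto3  # imported lazily so the helpers above work without boto3

    request = build_job_request(s3_uri, role_arn, application_id, filename)
    entry_point = request["jobDriver"]["sparkSubmit"]["entryPoint"]
    bucket, key = split_s3_uri(entry_point)
    boto3.client("s3").upload_file(filename, bucket, key)
    response = boto3.client("emr-serverless").start_job_run(**request)
    return response["jobRunId"]
```

Keeping the request builder separate from the AWS calls makes it easy to inspect the entry point that will be submitted before anything is uploaded.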

Future Considerations

  • Allow for the ability to select different profiles
  • Persist state (region selection)
  • Create a Java environment
  • Automate deployments to EMR
    • Create virtualenv and upload to S3
    • Pack pom into jar file
  • Link to open logs in S3 or CloudWatch
  • Testing :) https://vscode.rocks/testing/

Feedback Notes

I'm looking for feedback in a few different areas:

  • How do you use Spark on EMR today?
    • EMR on EC2, EMR on EKS, or EMR Serverless?
    • PySpark, Scala Spark, or SparkSQL?
  • Does the tool work as expected for browsing your EMR resources?
  • Do you find the devcontainer useful for local development?
  • What functionality is missing that you would like to see?

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.
