Data Engineering Coding Exercise

This exercise implements a Python/AWS application that reads hit-level data files as input and helps us understand revenue sources for products with respect to search engines and search keywords. The application provisions its infrastructure with AWS CloudFormation and uses an event-driven, serverless Lambda function together with an AWS Glue job to process the input file and produce an output showing the revenue source for each product.

Prerequisites

  1. AWS CLI - required to perform the deployment operations from a local machine.
  2. AWS S3 bucket - needed to store the ZIP-packaged code that is loaded into the Lambda function and the Glue job.

Application Component Details

deployment_template.yml - AWS CloudFormation template that provisions the infrastructure as code required to run this application (IAM roles, input and output S3 buckets, the Glue ETL job, the event-based Lambda function, and the permissions needed to invoke the Lambda function and the Glue job).

src/app.py - Python code executed by the AWS Glue job to process the input data (a minimal sketch is shown below).

src/requirements.txt - Dependencies required to run the ETL / AWS Glue job.
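
A minimal sketch of the kind of aggregation src/app.py performs is shown below, assuming illustrative column names (search_engine, search_keyword, revenue), job argument names, and output prefix; the actual logic and the schema of the hit-level TSV file are defined in the repository.

```python
import sys

from awsglue.utils import getResolvedOptions
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Job arguments passed in by the Lambda trigger; the names are illustrative.
args = getResolvedOptions(sys.argv, ["input_bucket", "input_key"])

spark = SparkSession.builder.appName("revenue-by-search-keyword").getOrCreate()

# Read the tab-separated hit-level file from the input bucket.
hits = (
    spark.read.option("sep", "\t")
    .option("header", "true")
    .csv("s3://{}/{}".format(args["input_bucket"], args["input_key"]))
)

# Sum revenue per search engine and search keyword; these column names are
# placeholders for whatever the real script derives from the hit-level data.
revenue = (
    hits.groupBy("search_engine", "search_keyword")
    .agg(F.sum(F.col("revenue").cast("double")).alias("total_revenue"))
    .orderBy(F.desc("total_revenue"))
)

# Write a single tab-separated result file to the output bucket.
(
    revenue.coalesce(1)
    .write.mode("overwrite")
    .option("sep", "\t")
    .option("header", "true")
    .csv("s3://outputs3bucket-adobe/revenue-by-search-keyword/")
)
```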

lambda_function.py - Python code (Lambda function) that triggers the ETL (AWS Glue job) when a file is uploaded to the input bucket.

lambda.zip - Compressed ZIP archive containing the Lambda function code.
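
A minimal sketch of what lambda_function.py does is shown below; the Glue job name (adobe-etl-job) and the argument names are placeholders for the values actually defined in deployment_template.yml.

```python
import boto3

glue = boto3.client("glue")


def lambda_handler(event, context):
    # The S3 put event carries the bucket and key of the uploaded input file.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # Start the Glue ETL job and hand it the input location. "adobe-etl-job"
    # and the argument names are placeholders; the real values come from
    # deployment_template.yml and are read by src/app.py.
    response = glue.start_job_run(
        JobName="adobe-etl-job",
        Arguments={
            "--input_bucket": bucket,
            "--input_key": key,
        },
    )
    return {"JobRunId": response["JobRunId"]}
```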

Conceptual Process Flow

Conceptual process flow stages (diagram omitted).

Serverless deployment process flow, created with the AWS CloudFormation designer (diagram omitted).

Deployment Steps

  1. Clone the repository:
git clone https://github.com/SourabhShrivas/Data_Engineering_Coding_Exercise
  2. Upload the Lambda function package to the application repository S3 bucket:
aws s3 cp lambda.zip s3://s3-adobe-repository/
  3. Upload the ETL (Glue job) script app.py to the application repository S3 bucket (s3-adobe-repository):
aws s3 cp src/app.py s3://s3-adobe-repository/etl/app.py
  4. Create the CloudFormation stack from the deployment_template.yml template file:
aws cloudformation deploy --template-file deployment_template.yml --stack-name infrastructure --capabilities CAPABILITY_NAMED_IAM
  5. Run the process by uploading the input file to the input S3 bucket (inputs3bucket-adobe):
aws s3 cp /Users/soura/Downloads/data.tsv s3://inputs3bucket-adobe/
  6. Monitor the jobs (a boto3 alternative is sketched after this list):
    6.1 - AWS Lambda > CloudWatch > Log Groups
    6.2 - AWS Glue job > AWS Glue Studio > Monitoring
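
Glue job runs can also be checked from the local machine with boto3; a small sketch, again assuming the placeholder job name adobe-etl-job:

```python
import boto3

glue = boto3.client("glue")

# "adobe-etl-job" is a placeholder for the Glue job name created by the stack.
runs = glue.get_job_runs(JobName="adobe-etl-job", MaxResults=5)

# Print the most recent run IDs with their state and completion time.
for run in runs["JobRuns"]:
    print(run["Id"], run["JobRunState"], run.get("CompletedOn"))
```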

Deleting the CloudFormation stack

Download the output from the output S3 bucket (outputs3bucket-adobe), then empty the input and output buckets, and then delete the CloudFormation stack.

  1. Empty all the S3 buckets created by the CloudFormation infrastructure as code:
aws s3 rm s3://inputs3bucket-adobe --recursive
aws s3 rm s3://outputs3bucket-adobe --recursive
  2. Delete the stack:
aws cloudformation delete-stack --stack-name infrastructure
