
Project: Data Lakes with Spark

Project Description: A music streaming company has grown its user base and song database and wants to move its data from a data warehouse to a data lake. The data resides in S3, in a directory of JSON logs of user activity on the app, as well as a directory of JSON metadata on the songs in the app. As their data engineer, I was tasked with building an ETL pipeline that extracts the data from S3, processes it using Spark, and loads it back into S3 as a set of dimensional tables. This will allow the analytics team to continue finding insights into what songs their users are listening to.

Choice of a Data Lake on AWS:

There are several options when it comes to creating Data Lakes on AWS. Essentially, it boils down to choosing the type of storage we want, the processing engine we want, and whether we use an AWS-managed solution or a vendor-managed solution.

The dotted red line is the option we will be using for this project: Spark for processing and AWS EMR as our managed solution, and S3 for retrieving and storing our data. A pictorial representation is shown below.

(Figure: data lake options on AWS)

  • Lake Storage: S3
  • Lake Processing: Spark

The focus of this project is on the dotted blue line. The key thing to note is that even though I am using AWS EMR, I am not using HDFS for lake storage; instead, I am using S3. Data is loaded into EMR for processing, and the query/processing results are stored back into S3. Data wrangling is done with PySpark, and the processed data is written back to S3 as Parquet files.

(Figure: project focus: S3 storage with Spark processing on EMR)
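To make this concrete, here is a minimal, hypothetical sketch of creating a SparkSession for this kind of job. The s3a:// scheme and the hadoop-aws package version are assumptions for running outside EMR; on EMR itself, Spark and S3 access come preconfigured.

    # Minimal sketch: a SparkSession that can read from and write to S3.
    # On EMR this is largely preconfigured; the hadoop-aws package below is
    # only needed when running outside EMR, and its version is an assumption
    # (match it to your Spark/Hadoop build).
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("data-lake-etl")
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.4")
        .getOrCreate()
    )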

Data description

There are two main datasets that reside in S3.

  • Song data: s3://<bucket>/song-data/
  • Log data: s3://<bucket>/log-data/

The log format is captured in s3://<bucket>/log_json_path.json

Song dataset

The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset.

song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json

And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like.

{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}

Log dataset

The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These files simulate activity logs from an imaginary music streaming app based on configuration settings.

The log files in the dataset you'll be working with are partitioned by year and month. For example, here are filepaths to two files in this dataset.

log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json

An example of the data in a log file, 2018-11-12-events.json, is shown in the figure below.

(Figure: sample records from 2018-11-12-events.json)
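Not part of the repo: a minimal sketch of reading the log dataset with PySpark, assuming the SparkSession from the earlier sketch and a placeholder bucket. The two-level wildcard mirrors the year/month partitioning shown above.

    # Read all log files; the path follows the log_data/<year>/<month>/ layout above.
    log_df = spark.read.json("s3a://<bucket>/log-data/*/*/*.json")
    log_df.printSchema()   # inspect the event fields before transforming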

Project Structure

  • etl.py: Reads song_data and log_data from S3, transforms them into five different tables, and writes the tables to partitioned Parquet files in table directories on S3.
  • schema.py: Captures the column names for each table as per the star-schema design (an illustrative sketch follows this list).
  • dl.cfg: Contains credentials for accessing S3.
  • Local.ipynb: Notebook for trying things out locally before putting them into a script.
  • data: A sample of song_data and log_data saved locally for testing before going to S3.
  • spark-warehouse: Output tables written to parquet files. Each table has its own directory.
    • artist table files
    • users table files
    • songs table files are partitioned by year and artist.
    • time table files are partitioned by year and month.
    • songplays table is partitioned by year and month.
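Not taken from the repo: an illustrative sketch of what schema.py might contain. The column names are assumptions based on the conventional star schema for this dataset (the songs and artists columns follow the song JSON fields shown earlier); the actual file may differ.

    # schema.py (illustrative sketch only; the actual column lists may differ)
    SONGS_COLUMNS     = ["song_id", "title", "artist_id", "year", "duration"]
    ARTISTS_COLUMNS   = ["artist_id", "artist_name", "artist_location",
                         "artist_latitude", "artist_longitude"]
    USERS_COLUMNS     = ["user_id", "first_name", "last_name", "gender", "level"]
    TIME_COLUMNS      = ["start_time", "hour", "day", "week", "month", "year", "weekday"]
    SONGPLAYS_COLUMNS = ["songplay_id", "start_time", "user_id", "level", "song_id",
                         "artist_id", "session_id", "location", "user_agent"]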

Methodology

The etl.py script uses PySpark to wrangle the data. It starts by reading song_data and log_data from S3, transforms them into five different tables, and writes the tables to partitioned Parquet files in table directories on S3.

A simple star schema was employed for designing the tables.
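As an illustration of the output step, here is a hedged sketch of writing two of the tables back to S3 with the partitioning described in the project structure. songs_table and songplays_table stand for DataFrames built earlier in etl.py, artist_id is assumed to be the concrete artist partition column, and the output prefix is a placeholder.

    # Sketch: write one dimension table and the fact table as partitioned Parquet.
    songs_table.write.mode("overwrite") \
        .partitionBy("year", "artist_id") \
        .parquet("s3a://<bucket>/output/songs/")

    songplays_table.write.mode("overwrite") \
        .partitionBy("year", "month") \
        .parquet("s3a://<bucket>/output/songplays/")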

Technology

How to run this project on AWS EMR?

Create a Data Lake with Spark and AWS EMR.

To create an Elastic Map Reduce data lake on AWS, use the following steps:

  1. Create an SSH key pair to connect securely to the EMR cluster we are about to create. Go to your EC2 dashboard and click on Key Pairs. Create a new one; you will receive a .pem file that you will need in order to connect securely to the cluster.

  2. Next, we will create an EMR cluster with the following configuration. Here EMR runs Spark in YARN cluster mode and comes with the Hadoop Distributed File System. We will use a 4-node cluster with 1 master and 3 worker nodes.

(Figure: EMR cluster setup)

  3. We will use the default security group and the default EMR role, EMR_DefaultRole, for our purposes. Ensure that the default security group associated with the master node has an inbound SSH rule open; otherwise you won't be able to SSH in.
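The console steps above can also be expressed with the AWS CLI. This is a sketch rather than the exact cluster configuration used here; the release label and instance type in particular are assumptions you should adjust.

    aws emr create-cluster \
        --name "data-lake-cluster" \
        --release-label emr-5.28.0 \
        --applications Name=Spark \
        --instance-type m5.xlarge \
        --instance-count 4 \
        --ec2-attributes KeyName=<your-key-pair-name> \
        --use-default-roles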

Submitting Spark Jobs to the cluster

  1. Once the cluster is ready, you can SSH into the master node from your terminal session; this connects you to the EMR cluster. (Figure: SSH connection to the EMR master node)

  2. Now you can transfer your files to /home/hadoop and run the spark-submit command to submit your Spark jobs. Make sure you update dl.cfg with your S3 credentials (see the sketch below).

(Figure: transferring the scripts and running spark-submit)

Note: Since the master node does not come with every Python library preinstalled, you may have to run sudo pip install <python-library>.
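A sketch of that workflow is shown below. The dl.cfg layout is an assumed format (the actual key names in the repo may differ), and the .pem path and <master-public-dns> are placeholders from your own cluster.

    # dl.cfg (assumed layout; keep this file out of version control)
    [AWS]
    AWS_ACCESS_KEY_ID=<your access key id>
    AWS_SECRET_ACCESS_KEY=<your secret access key>

    # Copy the scripts and config to the master node, connect, and submit the job.
    scp -i ~/my-key.pem etl.py schema.py dl.cfg hadoop@<master-public-dns>:/home/hadoop/
    ssh -i ~/my-key.pem hadoop@<master-public-dns>
    spark-submit etl.py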

How to run this project locally?

As the data located on S3 is quite large, I have also provided an option to run this project locally. That is, instead of retrieving the data from S3, you can read it from the data directory, process it with a local Spark session, and store the results locally in spark-warehouse. This is especially useful for iterating on your processing logic without waiting a long time for data to load from S3.

To do so, run the etl-local.py script.
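For example, assuming a local PySpark installation:

    # Processes the local ./data sample; output Parquet lands in ./spark-warehouse.
    python etl-local.py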
