
Pinterest Data Pipeline


Description

Cloud Pinterest Pipeline Architecture

This project implements a Pinterest-style data pipeline built on AWS data engineering services.

The data include:

  • Pin (the Pinterest post).
  • Geo (the geolocation).
  • User (the user details).
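As a rough sketch, the three record types handled by the emulation scripts resemble the following. The field names here are illustrative assumptions, not the project's exact schema, which is defined by the source database:

```python
from datetime import datetime

# Illustrative (assumed) shapes of the three record types; the real
# schemas come from the database read by the emulation scripts.
pin_record = {
    "index": 1,
    "title": "Example pin",
    "description": "A sample Pinterest post",
    "tag_list": "diy,craft",
    "category": "diy-and-crafts",
}
geo_record = {
    "ind": 1,
    "country": "United Kingdom",
    "latitude": 51.5,
    "longitude": -0.12,
    "timestamp": datetime(2024, 1, 1, 12, 0, 0),
}
user_record = {
    "ind": 1,
    "first_name": "Ada",
    "last_name": "Lovelace",
    "age": 36,
    "date_joined": datetime(2023, 6, 1),
}
```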

AWS data engineering services include:

  • Confluent.io Amazon S3 Connector.
  • Kafka for batch processing.
  • Airflow DAG for scheduling batch processing.
  • Kinesis for stream processing.

Learnings

  • git update-ref -d HEAD to revert an initial git commit, stackoverflow
  • git reset --hard <last_known_good_commit> to undo pushed commits, stackoverflow
  • git rm --cached <file> to unstage.
  • TypeError: Object of type datetime is not JSON serializable, stackoverflow
  • For Cron scheduling, https://crontab.guru/
  • Python Match Case, enterprisedna
  • The engine connection object is a facade that uses a DBAPI connection internally in order to
    communicate with the database.
  • PySpark - fillna, projectpro
  • PySpark - loop through rows, sparkbyexamples
  • PySpark - drop columns, stackoverflow
  • PySpark - orderBy and sort, sparkbyexamples
  • PySpark - retrieve top n, sparkbyexamples
  • Stream data from Kinesis to Databricks with PySpark, medium
  • Learn how to process Streaming Data with Databricks and Amazon Kinesis [hands-on Demo], youtube
  • Apache Spark’s Structured Streaming with Amazon Kinesis on Databricks, databricks
  • PySpark partitionBy() – Write to Disk Example, sparkbyexamples
  • Advancing Spark - Databricks Delta Streaming, youtube
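For the "Object of type datetime is not JSON serializable" error noted above, a minimal fix is to pass default=str to json.dumps, which stringifies any object the encoder cannot handle natively:

```python
import json
from datetime import datetime

record = {"ind": 1, "timestamp": datetime(2024, 1, 1, 12, 0, 0)}

# json.dumps(record) alone raises:
#   TypeError: Object of type datetime is not JSON serializable
# default=str converts unsupported objects (such as datetime) to
# strings before encoding.
payload = json.dumps(record, default=str)
```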

Milestones

  • Milestone 1: Set up the environment.
  • Milestone 2: Get Started.
  • Milestone 3: Batch Processing: Configure the EC2 Kafka client.
  • Milestone 4: Batch Processing: Connect an Amazon Managed Streaming for Apache Kafka (MSK) cluster to an S3 bucket.
  • Milestone 5: Batch Processing: Configure an API in API Gateway.
  • Milestone 6: Batch Processing: Databricks.
  • Milestone 7: Batch Processing: Spark on Databricks.
  • Milestone 8: Batch Processing: Amazon Managed Workflows for Apache Airflow (MWAA).
  • Milestone 9: Stream Processing: AWS Kinesis.

Installation

  • Python3
  • Pandas
    • Run pip install pandas
  • requests
    • Run pip install requests
  • sqlalchemy
    • Run pip install SQLAlchemy
  • bson
    • Run pip install bson
  • yaml
    • Run pip install PyYAML
  • Create credentials.yaml and config.yaml in the parent directory containing the credentials expected by the AWSDBConnector class and the DAG file.
  • Create <IAM-user-name>-key-pair.pem containing the RSA private key used to connect to the EC2 instance over SSH.
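A plausible credentials.yaml layout is sketched below. The actual keys are whatever the AWSDBConnector class reads in database_utils.py, so treat every field name here as an assumption:

```yaml
# credentials.yaml — illustrative structure only; match the keys that
# AWSDBConnector actually reads in database_utils.py.
HOST: your-database-host.rds.amazonaws.com
USER: your_username
PASSWORD: your_password
DATABASE: pinterest_data
PORT: 3306
```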

Usage

  • cd to the pinterest-data-pipeline294 directory.
  • To emulate sending data to Kafka topics:
    • Connect to the EC2 by SSH client:
      • Log into the AWS console.
      • Enter the account ID, IAM user name and password.
      • Select the relevant EC2.
      • Click connect and under the SSH Client tab, copy the SSH command to the terminal.
      • Replace root with ec2-user in the copied command.
      • For example, run ssh -i "<IAM-user-name>-key-pair.pem" ec2-user@ec2-<IP-address>.compute-1.amazonaws.com
    • Start the REST proxy:
      • Run confluent-7.2.0/bin/kafka-rest-start /home/ec2-user/confluent-7.2.0/etc/kafka-rest/kafka-rest.properties
    • In a new terminal:
      • cd to the pinterest-data-pipeline294 directory.
      • Run python3 user_posting_emulation.py
    • To extract and transform data from Kafka:
      • Log into Databricks.
      • Copy the commands from databricks.ipynb to Databricks.
  • To emulate sending streaming data to Kinesis:
    • In a new terminal:
      • cd to the pinterest-data-pipeline294 directory.
      • Run python3 user_posting_emulation_streaming.py
    • To extract and transform data from Kinesis and load to Delta tables:
      • Log into Databricks.
      • Copy the commands from kinesis_databricks.ipynb to Databricks.
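The Kafka emulation steps above boil down to POSTing each record to the Confluent REST proxy through an API Gateway invoke URL. A minimal sketch, assuming a REST proxy v2 endpoint; the invoke URL and topic name in the commented example are placeholders, not real endpoints:

```python
import json
from datetime import datetime

import requests


def send_to_kafka(invoke_url: str, topic: str, record: dict) -> requests.Response:
    """POST one record to a Kafka topic via the Confluent REST proxy
    (fronted here by an API Gateway invoke URL)."""
    payload = json.dumps(
        {"records": [{"value": record}]},
        default=str,  # datetime values are not JSON serializable by default
    )
    headers = {"Content-Type": "application/vnd.kafka.json.v2+json"}
    return requests.post(f"{invoke_url}/topics/{topic}", data=payload, headers=headers)


# Example call (placeholders, do not run as-is):
# send_to_kafka(
#     "https://<api-id>.execute-api.us-east-1.amazonaws.com/prod",
#     "<IAM-user-name>.pin",
#     {"index": 1, "title": "Example pin", "timestamp": datetime(2024, 1, 1)},
# )
```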

File structure of the project

pinterest-data-pipeline294
├── 0a966c04ad33_dag.py
├── cloud_pinterest_pipeline_architecture.png
├── config.yaml
├── credentials.yaml
├── database_utils.py
├── databricks.ipynb
├── kinesis_databricks.ipynb
├── LICENSE
├── README.md
├── user_posting_emulation.py
└── user_posting_emulation_streaming.py
