Table of contents generated with markdown-toc
This project implements a Pinterest data pipeline built on AWS data engineering services.
The data include:
- Pin (the Pinterest post).
- Geo (the geolocation).
- User (the user details).
AWS data engineering services include:
- Confluent.io Amazon S3 Connector.
- Kafka for batch processing.
- Airflow DAG for scheduling batch processing.
- Kinesis for stream processing.
- `git update-ref -d HEAD` to revert an initial git commit, stackoverflow
- `git reset --hard <last_known_good_commit>` to undo pushed commits, stackoverflow
- `git rm --cached <file>` to unstage
- TypeError: Object of type datetime is not JSON serializable, stackoverflow
- For Cron scheduling, https://crontab.guru/
- Python Match Case, enterprisedna
- The engine connection object is a facade that uses a DBAPI connection internally in order to communicate with the database (see the SQLAlchemy sketch after this list).
- PySpark - fillna, projectpro
- PySpark - loop through rows, sparkbyexamples
- PySpark - drop columns, stackoverflow
- PySpark - orderBy and sort, sparkbyexamples
- PySpark - retrieve top n, sparkbyexamples
- Stream data from Kinesis to Databricks with PySpark, medium
- Learn how to process Streaming Data with DataBricks and Amazon Kinesis [ hands on Demo ], youtube
- Apache Spark’s Structured Streaming with Amazon Kinesis on Databricks, databricks
- PySpark partitionBy() – Write to Disk Example, sparkbyexamples
- Advancing Spark - Databricks Delta Streaming, youtube
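
To make the note above about the engine connection object concrete, here is a minimal SQLAlchemy sketch; the connection URL is a placeholder, not this project's database, and the real credentials would come from `credentials.yaml`.

```python
# Minimal sketch of the SQLAlchemy engine/connection facade.
# create_engine() builds an Engine that pools DBAPI connections;
# engine.connect() returns a Connection facade that borrows one of
# those DBAPI connections for the duration of the with-block.
from sqlalchemy import create_engine, text

# Placeholder URL; substitute real credentials (e.g. from credentials.yaml).
engine = create_engine("mysql+pymysql://user:password@host:3306/database")

with engine.connect() as connection:
    result = connection.execute(text("SELECT 1"))
    print(result.scalar())
```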
- Milestone 1: Set up the environment.
- Milestone 2: Get Started.
- Milestone 3: Batch Processing: Configure the EC2 Kafka client.
- Milestone 4: Batch Processing: Connect a Managed Streaming for Apache Kafka (MSK) cluster to an S3 bucket.
- Milestone 5: Batch Processing: Configure an API in API Gateway.
- Milestone 6: Batch Processing: Databricks.
- Milestone 7: Batch Processing: Spark on Databricks.
- Milestone 8: Batch Processing: AWS Managed Workflows for Apache Airflow (MWAA).
- Milestone 9: Stream Processing: AWS Kinesis.
- Python3
- Pandas
  - Run `pip install pandas`
- requests
  - Run `pip install requests`
- sqlalchemy
  - Run `pip install SQLAlchemy`
- bson
  - Run `pip install bson`
- yaml
  - Run `pip install PyYAML`
- Create `credentials.yaml` and `config.yaml` in the parent directory to hold the credentials required by the `AWSDBConnector` class and the DAG file (a hedged loading sketch appears below this list).
- Create `<IAM-user-name>-key-pair.pem` with the RSA private key for connecting to the EC2 instance by SSH.
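
As an illustration only, a sketch of loading `credentials.yaml`; the key names shown are hypothetical placeholders, since the fields actually required are defined by the `AWSDBConnector` class.

```python
# Hedged sketch: load credentials.yaml from the parent directory.
# The key names (HOST, USER, PASSWORD, DATABASE, PORT) are hypothetical
# placeholders; use whatever keys the AWSDBConnector class expects.
import yaml

with open("../credentials.yaml", "r") as stream:
    creds = yaml.safe_load(stream)

print(creds.keys())  # e.g. HOST, USER, PASSWORD, DATABASE, PORT
```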
- `cd` to the pinterest-data-pipeline294 directory.
- To emulate sending data to Kafka topics:
- Connect to the EC2 instance with an SSH client:
- Log into the AWS console.
- Enter the account ID, IAM user name and password.
- Select the relevant EC2.
- Click connect and under the SSH Client tab, copy the SSH command to the terminal.
- Replace `root` with `ec2-user`.
- For example, run `ssh -i "<IAM-user-name>-key-pair.pem" ec2-user@ec2-<IP-address>.compute-1.amazonaws.com`
- Start the REST proxy:
- Run `confluent-7.2.0/bin/kafka-rest-start /home/ec2-user/confluent-7.2.0/etc/kafka-rest/kafka-rest.properties`
- In a new terminal, `cd` to the pinterest-data-pipeline294 directory.
- Run `python3 user_posting_emulation.py` (a hedged sketch of the underlying request follows these steps).
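
As a rough, hypothetical illustration of the kind of request such an emulation makes, the sketch below posts one record to a Kafka topic through the Confluent REST proxy behind API Gateway; the invoke URL, topic name, and payload fields are placeholders, and the project's actual logic lives in `user_posting_emulation.py`.

```python
# Hedged sketch: send one record to a Kafka topic via the Confluent REST proxy.
# The invoke URL, topic name, and payload fields are placeholders.
import json
import requests

INVOKE_URL = "https://<API-Gateway-ID>.execute-api.us-east-1.amazonaws.com/<stage>/topics/<topic-name>"

payload = json.dumps({
    "records": [
        {"value": {"index": 1, "unique_id": "example-id", "title": "example pin"}}
    ]
})

# Content type required by the Confluent REST proxy v2 JSON API.
headers = {"Content-Type": "application/vnd.kafka.json.v2+json"}

response = requests.post(INVOKE_URL, data=payload, headers=headers)
print(response.status_code)
```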
- To extract and transform data from Kafka:
- Log into Databricks.
- Copy the commands from `databricks.ipynb` to Databricks (a hedged batch-read sketch follows these steps).
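
In a Databricks notebook (where `spark` and `display` are predefined), a minimal sketch of reading one batch topic from an S3 mount; the mount name and topic path are assumptions, and the actual transformations live in `databricks.ipynb`.

```python
# Hedged sketch: read a batch Kafka topic that the S3 connector has written
# out as JSON under an S3 mount in Databricks. The mount name and topic path
# are hypothetical; the real ones are defined in databricks.ipynb.
df_pin = spark.read.format("json") \
    .load("/mnt/<mount-name>/topics/<user-id>.pin/partition=0/*.json")

# Basic inspection before transformation.
df_pin.printSchema()
display(df_pin.limit(5))
```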
- Connect to the EC2 instance with an SSH client, as described above.
- To emulate sending streaming data to Kinesis:
- In a new terminal, `cd` to the pinterest-data-pipeline294 directory.
- Run `python3 user_posting_emulation_streaming.py`
- To extract and transform data from Kinesis and load to Delta tables:
- Log into Databricks.
- Copy the commands from `kinesis_databricks.ipynb` to Databricks (a hedged streaming-read sketch follows these steps).
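
In a Databricks notebook, a minimal Structured Streaming sketch using the Databricks Kinesis source; the stream name, region, schema, checkpoint location, and table name are assumptions, and the real logic lives in `kinesis_databricks.ipynb`.

```python
# Hedged sketch: read a Kinesis stream with the Databricks Kinesis source,
# parse the JSON payload, and append it to a Delta table.
# Stream name, region, schema, checkpoint path, and table name are hypothetical.
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

pin_schema = StructType([
    StructField("index", StringType()),
    StructField("unique_id", StringType()),
    StructField("title", StringType()),
])

raw_df = (spark.readStream
    .format("kinesis")
    .option("streamName", "streaming-<user-id>-pin")
    .option("region", "us-east-1")
    .option("initialPosition", "earliest")
    .load())

# The Kinesis source delivers the payload in the binary `data` column.
pin_df = (raw_df
    .selectExpr("CAST(data AS STRING) AS json_data")
    .select(from_json(col("json_data"), pin_schema).alias("data"))
    .select("data.*"))

(pin_df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/kinesis/_checkpoints/pin")
    .table("<user-id>_pin_table"))
```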
pinterest-data-pipeline294
├── 0a966c04ad33_dag.py
├── cloud_pinterest_pipeline_architecture.png
├── config.yaml
├── credentials.yaml
├── database_utils.py
├── databricks.ipynb
├── kinesis_databricks.ipynb
├── LICENSE
├── README.md
├── user_posting_emulation.py
└── user_posting_emulation_streaming.py