Table of contents generated with markdown-toc
This project implements a Pinterest data pipeline built on AWS data engineering services.
The data include:
- Pin (the Pinterest post).
- Geo (the geolocation).
- User (the user details).
AWS data engineering services include:
- Confluent.io Amazon S3 Connector.
- Kafka for batch processing.
- Airflow DAG for scheduling batch processing.
- Kinesis for stream processing.
- `git update-ref -d HEAD` to revert an initial git commit, stackoverflow
- `git reset --hard <last_known_good_commit>` to undo pushed commits, stackoverflow
- `git rm --cached <file>` to unstage
- TypeError: Object of type datetime is not JSON serializable, stackoverflow
- For Cron scheduling, https://crontab.guru/
- Python Match Case, enterprisedna
- The engine connection object is a facade that uses a DBAPI connection internally in order to communicate with the database (see the SQLAlchemy sketch after this list).
- PySpark - fillna, projectpro
- PySpark - loop through rows, sparkbyexamples
- PySpark - drop columns, stackoverflow
- PySpark - orderBy and sort, sparkbyexamples
- PySpark - retrieve top n, sparkbyexamples
- Stream data from Kinesis to Databricks with PySpark, medium
- Learn how to process Streaming Data with DataBricks and Amazon Kinesis [ hands on Demo ], youtube
- Apache Spark’s Structured Streaming with Amazon Kinesis on Databricks, databricks
- PySpark partitionBy() – Write to Disk Example, sparkbyexamples
- Advancing Spark - Databricks Delta Streaming, youtube
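
To make the note above about the engine connection object concrete, here is a minimal SQLAlchemy sketch; the connection URL is a placeholder, not this project's database, and the real credentials would come from `credentials.yaml`.

```python
# Minimal sketch of the SQLAlchemy engine/connection facade.
# create_engine() builds an Engine that pools DBAPI connections;
# engine.connect() returns a Connection facade that borrows one of
# those DBAPI connections for the duration of the with-block.
from sqlalchemy import create_engine, text

# Placeholder URL; substitute real credentials (e.g. from credentials.yaml).
engine = create_engine("mysql+pymysql://user:password@host:3306/database")

with engine.connect() as connection:
    result = connection.execute(text("SELECT 1"))
    print(result.scalar())
```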
- Milestone 1: Set up the environment.
- Milestone 2: Get Started.
- Milestone 3: Batch Processing: Configure the EC2 Kafka client.
- Milestone 4: Batch Processing: Connect a Managed Streaming for Apache Kafka (MSK) cluster to an S3 bucket.
- Milestone 5: Batch Processing: Configure an API in API Gateway.
- Milestone 6: Batch Processing: Databricks.
- Milestone 7: Batch Processing: Spark on Databricks.
- Milestone 8: Batch Processing: AWS Managed Workflows for Apache Airflow (MWAA).
- Milestone 9: Stream Processing: AWS Kinesis.
- Python3
- Pandas
  - Run `pip install pandas`
- requests
  - Run `pip install requests`
- sqlalchemy
  - Run `pip install SQLAlchemy`
- bson
  - Run `pip install bson`
- yaml
  - Run `pip install PyYAML`
- Create `credentials.yaml` and `config.yaml` in the parent directory to hold the credentials required by the `AWSDBConnector` class and the DAG file (a hedged loading sketch appears below this list).
- Create `<IAM-user-name>-key-pair.pem` with the RSA private key for connecting to the EC2 instance by SSH.
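
As an illustration only, a sketch of loading `credentials.yaml`; the key names shown are hypothetical placeholders, since the fields actually required are defined by the `AWSDBConnector` class.

```python
# Hedged sketch: load credentials.yaml from the parent directory.
# The key names (HOST, USER, PASSWORD, DATABASE, PORT) are hypothetical
# placeholders; use whatever keys the AWSDBConnector class expects.
import yaml

with open("../credentials.yaml", "r") as stream:
    creds = yaml.safe_load(stream)

print(creds.keys())  # e.g. HOST, USER, PASSWORD, DATABASE, PORT
```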
- `cd` to the pinterest-data-pipeline294 directory.
- To emulate sending data to Kafka topics:
- Connect to the EC2 instance with an SSH client:
- Log into the AWS console.
- Enter the account ID, IAM user name and password.
- Select the relevant EC2.
- Click connect and under the SSH Client tab, copy the SSH command to the terminal.
- Replace `root` with `ec2-user`.
- For example, run `ssh -i "<IAM-user-name>-key-pair.pem" ec2-user@ec2-<IP-address>.compute-1.amazonaws.com`
- Start the REST proxy:
- Run `confluent-7.2.0/bin/kafka-rest-start /home/ec2-user/confluent-7.2.0/etc/kafka-rest/kafka-rest.properties`
- In a new terminal, `cd` to the pinterest-data-pipeline294 directory.
- Run `python3 user_posting_emulation.py` (a hedged sketch of the underlying request follows these steps).
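
As a rough, hypothetical illustration of the kind of request such an emulation makes, the sketch below posts one record to a Kafka topic through the Confluent REST proxy behind API Gateway; the invoke URL, topic name, and payload fields are placeholders, and the project's actual logic lives in `user_posting_emulation.py`.

```python
# Hedged sketch: send one record to a Kafka topic via the Confluent REST proxy.
# The invoke URL, topic name, and payload fields are placeholders.
import json
import requests

INVOKE_URL = "https://<API-Gateway-ID>.execute-api.us-east-1.amazonaws.com/<stage>/topics/<topic-name>"

payload = json.dumps({
    "records": [
        {"value": {"index": 1, "unique_id": "example-id", "title": "example pin"}}
    ]
})

# Content type required by the Confluent REST proxy v2 JSON API.
headers = {"Content-Type": "application/vnd.kafka.json.v2+json"}

response = requests.post(INVOKE_URL, data=payload, headers=headers)
print(response.status_code)
```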
- To extract and transform data from Kafka:
- Log into Databricks.
- Copy the commands from `databricks.ipynb` to Databricks (a hedged batch-read sketch follows these steps).
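
In a Databricks notebook (where `spark` and `display` are predefined), a minimal sketch of reading one batch topic from an S3 mount; the mount name and topic path are assumptions, and the actual transformations live in `databricks.ipynb`.

```python
# Hedged sketch: read a batch Kafka topic that the S3 connector has written
# out as JSON under an S3 mount in Databricks. The mount name and topic path
# are hypothetical; the real ones are defined in databricks.ipynb.
df_pin = spark.read.format("json") \
    .load("/mnt/<mount-name>/topics/<user-id>.pin/partition=0/*.json")

# Basic inspection before transformation.
df_pin.printSchema()
display(df_pin.limit(5))
```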
- Connect to the EC2 instance with an SSH client, as described above.
- To emulate sending streaming data to Kinesis:
- In a new terminal, `cd` to the pinterest-data-pipeline294 directory.
- Run `python3 user_posting_emulation_streaming.py`
- To extract and transform data from Kinesis and load to Delta tables:
- Log into Databricks.
- Copy the commands from `kinesis_databricks.ipynb` to Databricks (a hedged streaming-read sketch follows these steps).
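
In a Databricks notebook, a minimal Structured Streaming sketch using the Databricks Kinesis source; the stream name, region, schema, checkpoint location, and table name are assumptions, and the real logic lives in `kinesis_databricks.ipynb`.

```python
# Hedged sketch: read a Kinesis stream with the Databricks Kinesis source,
# parse the JSON payload, and append it to a Delta table.
# Stream name, region, schema, checkpoint path, and table name are hypothetical.
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

pin_schema = StructType([
    StructField("index", StringType()),
    StructField("unique_id", StringType()),
    StructField("title", StringType()),
])

raw_df = (spark.readStream
    .format("kinesis")
    .option("streamName", "streaming-<user-id>-pin")
    .option("region", "us-east-1")
    .option("initialPosition", "earliest")
    .load())

# The Kinesis source delivers the payload in the binary `data` column.
pin_df = (raw_df
    .selectExpr("CAST(data AS STRING) AS json_data")
    .select(from_json(col("json_data"), pin_schema).alias("data"))
    .select("data.*"))

(pin_df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/kinesis/_checkpoints/pin")
    .table("<user-id>_pin_table"))
```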
pinterest-data-pipeline294
├── 0a966c04ad33_dag.py
├── cloud_pinterest_pipeline_architecture.png
├── config.yaml
├── credentials.yaml
├── database_utils.py
├── databricks.ipynb
├── kinesis_databricks.ipynb
├── LICENSE
├── README.md
├── user_posting_emulation.py
└── user_posting_emulation_streaming.py