This project implements a simple data lakehouse architecture that uses the medallion pattern to incrementally process traffic accident data and serve it to a dashboard.
Here’s a summary of the process:
- The data is extracted from the source with a Python script
- A data pipeline built with PySpark populates the medallion architecture layers:
- 🥉Bronze (local): The raw CSV files are joined into a staging Delta Lake table to normalize the columns.
- 🥈Silver (local): Standardizes number and date formats and normalizes text fields (see the sketch after this list).
- 🥇Gold (GCP Bucket): Aggregates the data to reduce size, complexity, and granularity.
- A dashboard built with Streamlit reads the Gold layer data and displays the charts. Access the dashboard here
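For illustration, here is a minimal PySpark sketch of the Bronze → Silver step. The column names (crash_date, vehicles, city), formats, and table paths are assumptions made for this example only; the actual transformations live in the 📁/src scripts.

```python
# Minimal sketch of a Bronze -> Silver transform.
# Column names, formats, and paths are hypothetical; see src/raw_to_silver.py for the real logic.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("raw_to_silver_sketch").getOrCreate()

# Read the staging (Bronze) Delta table
bronze = spark.read.format("delta").load("/data/bronze/accidents")

silver = (
    bronze
    # standardize the date format
    .withColumn("crash_date", F.to_date("crash_date", "dd/MM/yyyy"))
    # standardize numbers (e.g. decimal comma -> decimal point)
    .withColumn("vehicles", F.regexp_replace("vehicles", ",", ".").cast("double"))
    # normalize text fields
    .withColumn("city", F.upper(F.trim("city")))
)

silver.write.format("delta").mode("overwrite").save("/data/silver/accidents")
```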
All the code was developed to run inside Docker containers.
The 📁/src folder contains all the PySpark scripts that execute the data pipeline.
The 📁/dashboard folder contains the Streamlit app's Python scripts.
The dashboard and the pipeline run completely independently of each other.
(1) Execute the prepare_environment.sh script
./prepare_environment.sh
It creates the required folders with the proper permissions to run the scripts.
(2) Install the Python dependencies from requirements.txt
pip install -r requirements.txt
(3) Put the GCP JSON credentials (used to access the bucket) in the /src/credentials folder, and write the bucket name in the bucket_name.txt file that was created.
(4) Start the Spark containers.
docker-compose up
Two Spark containers (master and worker) will be started.
(5) Download the data: python download_files.py <year>
(currently, only the years 2007 to 2022 are available)
(6) Open a terminal inside one of the Spark containers and cd into the /src folder
(7) Each Python script is a Spark job. To execute any of them, just run spark-submit --packages io.delta:delta-core_2.12:2.1.0 <path_to_job_file.py>
Run the scripts in the following order: merge_raw_files_in_stage.py, raw_to_silver.py, silver_to_gold.py (a skeleton of such a job is sketched below).
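For reference, below is a rough skeleton of what one of these jobs could look like when writing to the GCP bucket. The Delta configuration keys match the --packages flag above, but the table paths, the aggregation, and the credentials file name are assumptions, and the job would also need the GCS connector on its classpath; the real implementations are in 📁/src.

```python
# Sketch of a Delta-enabled Spark job (e.g. silver_to_gold.py).
# Paths, column names, and credential file name are assumptions based on the setup steps above.
from pyspark.sql import SparkSession

# Bucket name written to bucket_name.txt in step (3)
with open("/src/bucket_name.txt") as f:
    bucket = f.read().strip()

spark = (
    SparkSession.builder
    .appName("silver_to_gold_sketch")
    # Delta Lake support (matches the --packages io.delta:delta-core_2.12:2.1.0 flag)
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # GCS service-account credentials placed in /src/credentials in step (3)
    # (file name is hypothetical; requires the GCS connector on the classpath)
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
            "/src/credentials/gcp_credentials.json")
    .getOrCreate()
)

silver = spark.read.format("delta").load("/data/silver/accidents")

# Hypothetical aggregation to reduce size and granularity for the Gold layer
gold = silver.groupBy("year", "state").count()

gold.write.format("delta").mode("overwrite").save(f"gs://{bucket}/gold/accidents")
```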
Just start the containers with docker-compose up
and open localhost:8501 in your browser.
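As a rough idea of how the app consumes the pipeline output, the sketch below loads Gold layer data and renders a chart. The path, column names, and the use of pandas with gcsfs are assumptions; the actual implementation is in the 📁/dashboard folder.

```python
# Minimal Streamlit sketch (hypothetical paths and columns; the real app is in /dashboard).
# Assumes the Gold aggregates are readable as Parquet and that the gcsfs package is installed.
import pandas as pd
import streamlit as st

BUCKET = "my-gold-bucket"  # hypothetical; the real name comes from bucket_name.txt

st.title("Traffic Accidents Dashboard")

# Load the aggregated Gold data from the GCP bucket
gold = pd.read_parquet(f"gs://{BUCKET}/gold/accidents")

# Simple filter + chart
year = st.selectbox("Year", sorted(gold["year"].unique()))
st.bar_chart(gold[gold["year"] == year].set_index("state")["count"])
```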