jaumpedro214 / traffic-accidents-br-data-project

This is a project that implements a data lakehouse architecture using the medallion pattern to accommodate traffic accident data.

Shell 0.52% Python 99.28% Dockerfile 0.21%

Traffic Accidents on BRs

This project implements a simple data lakehouse architecture using the medallion pattern to incrementally process traffic accident data and then serve it to a dashboard.

Here’s a summary of the process:

  • The data is extracted from the source with a Python script.
  • A data pipeline built with PySpark populates the Medallion Architecture layers:
    • 🥉Bronze (local): The raw CSV files are merged into a staging Delta Lake table to normalize the columns.
    • 🥈Silver (local): Numbers and date formats are standardized, and text fields are normalized.
    • 🥇Gold (GCP Bucket): The data is aggregated to reduce size, complexity, and granularity.
  • A Dashboard built with Streamlit reads the Gold Layer data and displays the charts. Access the Dashboard here
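
As a rough illustration of what the Silver and Gold steps above do, here is a minimal sketch in plain Python rather than PySpark. The column names (date, km, cause), the accepted date layouts, and every function name are assumptions made for this example; they are not taken from the project's scripts.

```python
import unicodedata
from collections import Counter
from datetime import datetime

# Hypothetical date layouts; the real source files may use others.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def normalize_text(value: str) -> str:
    """Strip accents, collapse whitespace and upper-case a text field."""
    decomposed = unicodedata.normalize("NFKD", value)
    without_accents = "".join(
        c for c in decomposed if not unicodedata.combining(c)
    )
    return " ".join(without_accents.split()).upper()


def parse_date(value: str) -> str:
    """Return the date as YYYY-MM-DD, trying each known layout in turn."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value!r}")


def parse_number(value: str) -> float:
    """Accept a decimal comma ("12,5") as well as a decimal point ("12.5")."""
    return float(value.strip().replace(",", "."))


def to_silver(row: dict) -> dict:
    """Silver-style step: one cleaned record per raw record."""
    return {
        "date": parse_date(row["date"]),
        "km": parse_number(row["km"]),
        "cause": normalize_text(row["cause"]),
    }


def to_gold(silver_rows: list) -> dict:
    """Gold-style step: count accidents per (year, cause)."""
    return dict(Counter((r["date"][:4], r["cause"]) for r in silver_rows))
```

Two raw rows that spell the same cause differently end up in a single Gold bucket once the Silver step has normalized them, which is the reason normalization runs before aggregation.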

Stack & Tools:

Repository organization

All the code was developed to run inside Docker containers.

The 📁/src folder contains all the PySpark scripts that execute the data pipeline.

The 📁/dashboard folder contains the Streamlit app’s Python scripts.

The dashboard and the pipeline run completely independently of each other.

How to run this project

Running the Pipeline

(1) Execute the prepare_environment.sh script

./prepare_environment.sh

It will create the needed folders with the proper permissions to run the scripts.

(2) Install the dependencies listed in requirements.txt

pip install -r requirements.txt

(3) Put the GCP JSON credentials (to access the bucket) in the /src/credentials folder and the bucket name in the bucket_name.txt file that was created.
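
Before starting the containers, it can help to confirm that both pieces are in place. The helper below is a hypothetical sketch, not part of the repository: check_gcp_setup is an invented name, and both paths are passed in as arguments because this README does not state where bucket_name.txt is created.

```python
import json
from pathlib import Path


def check_gcp_setup(credentials_dir: str, bucket_file: str) -> tuple:
    """Return (credentials_path, bucket_name) or raise if something is missing."""
    json_files = sorted(Path(credentials_dir).glob("*.json"))
    if not json_files:
        raise FileNotFoundError(f"no JSON credentials in {credentials_dir}")
    # Fail early on a truncated or corrupt key file.
    json.loads(json_files[0].read_text(encoding="utf-8"))

    bucket_name = Path(bucket_file).read_text(encoding="utf-8").strip()
    if not bucket_name:
        raise ValueError(f"{bucket_file} is empty")
    return json_files[0], bucket_name
```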

(4) Start the Spark containers.

docker-compose up

Two Spark containers (master and worker) will be started.

(5) Download the data

python download_files.py <year>

Currently, only the years 2007 to 2022 are available.
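
Since the script takes one year per call, fetching several years means calling it repeatedly. The wrapper below is a small sketch under assumptions: download_command and download_years are hypothetical names, and it assumes download_files.py sits in the current working directory. The 2007 to 2022 bounds come from this README.

```python
import subprocess
import sys

FIRST_YEAR, LAST_YEAR = 2007, 2022  # range stated in this README


def download_command(year: int) -> list:
    """Build the command line for one year, rejecting unavailable years."""
    if not FIRST_YEAR <= year <= LAST_YEAR:
        raise ValueError(
            f"year {year} is outside {FIRST_YEAR}-{LAST_YEAR}"
        )
    return [sys.executable, "download_files.py", str(year)]


def download_years(start: int, end: int, dry_run: bool = True) -> list:
    """Return the commands for start..end; run them only if dry_run is False."""
    commands = [download_command(y) for y in range(start, end + 1)]
    if not dry_run:
        for command in commands:
            # check=True stops the loop at the first failed download.
            subprocess.run(command, check=True)
    return commands
```

Calling download_years(2007, 2009) only returns the three commands; pass dry_run=False to actually execute them.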

(6) Access the internal terminal of a Spark container and cd into the /src folder

(7) Each Python script is a Spark job. To execute any of them, run

spark-submit --packages io.delta:delta-core_2.12:2.1.0 <path_to_job_file.py>

Run the scripts in the following order: merge_raw_files_in_stage.py, raw_to_silver.py, silver_to_gold.py
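
The three submissions can also be chained from one small driver script, so that a failure in an early layer stops the later ones. This is a sketch, not something the repository ships: spark_submit_command and run_pipeline are hypothetical names, and it assumes the script is started from the /src folder inside a Spark container where spark-submit is on the PATH. The package coordinate and the job order are the ones given above.

```python
import subprocess

DELTA_PACKAGE = "io.delta:delta-core_2.12:2.1.0"

# Order matters: each job reads what the previous one wrote.
JOBS = [
    "merge_raw_files_in_stage.py",
    "raw_to_silver.py",
    "silver_to_gold.py",
]


def spark_submit_command(job: str) -> list:
    """Build the spark-submit command line for a single job file."""
    return ["spark-submit", "--packages", DELTA_PACKAGE, job]


def run_pipeline(dry_run: bool = True) -> list:
    """Return the commands in order; execute them only if dry_run is False."""
    commands = [spark_submit_command(job) for job in JOBS]
    if not dry_run:
        for command in commands:
            # check=True raises on a non-zero exit code, so a failed
            # layer prevents the following layers from running.
            subprocess.run(command, check=True)
    return commands
```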

Running the Dashboard

Start the containers with docker-compose up and open localhost:8501 in the browser.
