Analysis of Transborder freight transport between the USA, Canada and Mexico

The last few years have witnessed significant developments in North American trade, including the renegotiation of NAFTA into the USMCA, trade tensions and tariff disputes due to changing governments, and the impact of the COVID-19 pandemic. I was interested in analyzing freight transport data to assess the possible effects of these events on trade trends over a decade (2014-2023) between the three biggest countries in North America: the USA, Canada, and Mexico.

Objective

My objective was to build a robust data ingestion and processing pipeline that would deliver reliable, reproducible and clean data for analysis, insight generation and potential decision-making regarding North American Transborder trade.

Technologies

I used the following tools to build the pipeline:

  • Terraform for cloud infrastructure provisioning
  • Google Cloud Platform (GCP) for:
    • data lake storage (Google Cloud Storage, GCS)
    • data warehouse storage (BigQuery)
    • Spark job processing (Dataproc)
    • data visualization + analysis (Looker Studio)
  • Docker/Docker Compose for containerization
  • Airflow for workflow orchestration
  • Apache Spark for distributed batch data processing
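
The DAG that drives this flow lives in the Airflow directory of the repo. As a rough, hypothetical sketch (not the project's actual DAG; all task names and callables are placeholders), the monthly extract-stage-transform-load sequence could be wired together like this:

    # Minimal sketch of a monthly ETL DAG; task names and callables are
    # hypothetical placeholders, not the project's actual DAG definition.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def download_monthly_csvs():
        """Fetch the month's TransBorder Freight CSVs from the BTS site (placeholder)."""


    def upload_raw_to_gcs():
        """Stage the raw CSVs in the GCS data-lake bucket (placeholder)."""


    def run_dataproc_spark_job():
        """Submit the PySpark cleaning/transformation job to Dataproc (placeholder)."""


    def load_to_bigquery():
        """Load the processed output into the BigQuery warehouse (placeholder)."""


    with DAG(
        dag_id="transborder_freight_etl",
        start_date=datetime(2014, 1, 1),
        schedule_interval="@monthly",
        catchup=False,
    ) as dag:
        extract = PythonOperator(task_id="extract", python_callable=download_monthly_csvs)
        stage = PythonOperator(task_id="stage", python_callable=upload_raw_to_gcs)
        transform = PythonOperator(task_id="transform", python_callable=run_dataproc_spark_job)
        load = PythonOperator(task_id="load", python_callable=load_to_bigquery)

        extract >> stage >> transform >> load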

Data description

The raw data is made available by the Bureau of Transportation Statistics (BTS) of the US Department of Transportation (DOT) and can be found here: Bureau of Transportation Statistics Transborder Freight Raw Data. It is collected and processed by multiple public and private agencies and organizations, through multiple approaches including administrative data required by regulation as well as surveys of both carriers and shippers. A more thorough description of the data can be found here: North American Transborder Freight Data FAQs.

The data is in .csv format and is updated on the BTS website on a monthly basis. A metadata file accompanies the raw data and details the codes used in the raw data files. The metadata is available in Excel and PDF format, with the PDF file containing additional information beyond what is available in the Excel file. Therefore, before using the metadata for this project, I manually curated and reconciled the Excel and PDF files. I've used this cleaned-up metadata Excel file as my input file, available here: TransBorderCodes.xls
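
To give a concrete sense of the processing step (the bucket names, file paths, lookup file, and column names below are hypothetical illustrations, not the project's actual code), decoding the coded raw columns against the curated metadata in PySpark might look like this:

    # Sketch only: bucket names, file paths and the lookup CSV are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("transborder-freight").getOrCreate()

    # Read all monthly raw extracts staged in the data lake.
    freight = spark.read.csv("gs://<raw-bucket>/raw/*.csv", header=True, inferSchema=True)

    # Code table exported from the curated TransBorderCodes.xls metadata
    # (assumed here to have columns DISAGMOT and MODE_DESC).
    modes = spark.read.csv("gs://<raw-bucket>/metadata/mode_codes.csv", header=True)

    # Replace the numeric mode-of-transport code with a readable description.
    decoded = freight.join(modes, on="DISAGMOT", how="left")
    decoded.show(5)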

Pipeline architecture

Overview of the Transborder Freight data processing pipeline (architecture diagram)

Pipeline execution

  1. Prerequisites: You need the following accounts and software installed on your system before you begin:

    • GCP account, GCP project and a service account that has the following permissions to the project:
      • Viewer
      • Storage Admin
      • Storage Object Admin
      • Storage Object Viewer
      • BigQuery Admin
      • Dataproc Administrator
      • Dataproc Service Agent
      • Editor
      Download the auth-key credentials JSON file for this service account and save it to ~/.google/credentials/google_credentials.json.
    • Google Cloud SDK on your local machine
    • Docker Desktop and Docker Compose
    • Terraform
  2. Clone this GitHub repo

  3. Create a bucket on GCP and copy the file "dataproc-pip-installation.sh" into it using gsutil:

gsutil cp <path-to-dataproc-pip-installation.sh> gs://accessory-bucket-name/scripts/dataproc-pip-installation.sh

For example, I've called this bucket "tbf-analysis-accessory", so my code would read:

gsutil cp /Users/ahegde/Job_misc/Portfolio/TransBorderFreightAnalysis/VM_project_files/TransBorderFreight-pySpark-BigQuery-Looker/Terraform/dataproc-pip-installation.sh gs://tbf-analysis-accessory/scripts/dataproc-pip-installation.sh

The variable "pip_initialization_script" in variables.tf file in the Terraform folder needs to be set to this GCP script path.

  4. Navigate to the Terraform directory, replace the GCP details (project name, bucket name, etc.) in the variables.tf file with your own, and set up the cloud infrastructure using the following Terraform commands:

    terraform init
    terraform plan
    terraform apply
    
  5. Navigate to the Airflow directory, make the necessary changes to the GCP project name, etc. in the docker-compose.yaml and Dockerfile, and run Docker Compose:

    docker compose build
    docker compose up airflow-init
    docker compose up
    
  6. To check whether Docker Compose has created all the necessary containers, list them and open a shell in the worker container by running:

    docker ps
    docker exec -it <worker-container-ID> bash
    
  7. Open localhost:8080 in a browser and log in with the username admin and password admin. You can now trigger your DAG and check its progress as well as the logs.

  8. After you're done running the entire project, destroy your cloud resources by running terraform destroy inside the Terraform directory. (You will need to add the force_destroy = true option to the relevant resource blocks in your main.tf file if you want Terraform to forcefully destroy those resources even when they still contain data. I have switched this off on purpose.)

Analysis overview

Overview of the Transborder Freight data analysis
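
The dashboards are built in Looker Studio on top of the BigQuery warehouse tables. For readers who prefer to query the warehouse directly, a hedged example using the BigQuery Python client is shown below; the dataset, table, and column names are hypothetical placeholders:

    # Hedged example: project, dataset, table and column names are hypothetical.
    from google.cloud import bigquery

    client = bigquery.Client()  # picks up GOOGLE_APPLICATION_CREDENTIALS

    query = """
        SELECT YEAR, ROUND(SUM(VALUE) / 1e9, 2) AS total_value_billions_usd
        FROM `<your-project>.transborder_freight.processed_freight`
        GROUP BY YEAR
        ORDER BY YEAR
    """

    # Print total freight value per year from the processed table.
    for row in client.query(query).result():
        print(row.YEAR, row.total_value_billions_usd)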
