khoinguyen19k8 / tfl-cycling


ETL pipeline that processes raw CSV data and loads it into Snowflake using Airflow and Databricks/PySpark. Infrastructure is provisioned with Terraform.

Dockerfile 0.72% Python 46.47% Jupyter Notebook 32.36% Shell 0.74% HCL 19.71%

tfl-cycling's Introduction

Overview

The project processes public data provided by Transport for London (TfL), the local government body responsible for most of the transport network in London, United Kingdom. In particular, it focuses on rental bicycle usage data from 2015 to 2022. The data can be accessed here.

Dataset Schema

Feature	Data type	Description
Rental ID	Integer	Unique rental ID
Duration	Double	Rental duration in seconds
Bike Id	Integer	Unique bike ID
End Date	Timestamp	Date and time when the bike was returned
EndStation Id	Integer	ID of the end station
EndStation Name	String	Name of the station where the bike was returned
Start Date	Timestamp	Date and time when the bike was rented
StartStation Id	Integer	ID of the start station
StartStation Name	String	Name of the station where the bike was rented

Raw dataset size: ~352 CSV files, 10.4 GB.
Raw dataset record count: ~84.4 million rows.
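To make the schema concrete, here is a minimal sketch of parsing one raw CSV row into typed fields using only the Python standard library. It is not the project's PySpark code. The `Rental` class, the `parse_rental` and `read_rentals` helpers, and the `TIMESTAMP_FORMAT` value are illustrative assumptions; in particular, the timestamp layout should be checked against the actual files.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime

# Assumption: timestamps look like "10/01/2015 18:43" (day/month/year).
# Adjust this format if the raw files use a different layout.
TIMESTAMP_FORMAT = "%d/%m/%Y %H:%M"


@dataclass(frozen=True)
class Rental:
    """One rental record, typed according to the dataset schema table."""
    rental_id: int
    duration: float  # seconds
    bike_id: int
    end_date: datetime
    end_station_id: int
    end_station_name: str
    start_date: datetime
    start_station_id: int
    start_station_name: str


def parse_rental(row: dict) -> Rental:
    """Convert a CSV row keyed by the schema's column names into a Rental."""
    return Rental(
        rental_id=int(row["Rental ID"]),
        duration=float(row["Duration"]),
        bike_id=int(row["Bike Id"]),
        end_date=datetime.strptime(row["End Date"], TIMESTAMP_FORMAT),
        end_station_id=int(row["EndStation Id"]),
        end_station_name=row["EndStation Name"],
        start_date=datetime.strptime(row["Start Date"], TIMESTAMP_FORMAT),
        start_station_id=int(row["StartStation Id"]),
        start_station_name=row["StartStation Name"],
    )


def read_rentals(csv_text: str) -> list:
    """Parse CSV text with a header row into a list of Rental records."""
    return [parse_rental(row) for row in csv.DictReader(io.StringIO(csv_text))]


# Made-up sample row for illustration only; it is not taken from the dataset.
SAMPLE_CSV = (
    "Rental ID,Duration,Bike Id,End Date,EndStation Id,EndStation Name,"
    "Start Date,StartStation Id,StartStation Name\n"
    "1,600,42,10/01/2015 18:53,7,Station B,10/01/2015 18:43,3,Station A\n"
)

if __name__ == "__main__":
    for rental in read_rentals(SAMPLE_CSV):
        print(rental)
```

Typing the rows this early makes malformed values fail loudly at parse time instead of surfacing later in the warehouse.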

Architectures

Below is a high-level description of the architecture. Every month, the pipeline checks the cycling.data.tfl.gov.uk bucket for new data and loads any new files into an S3 bucket. The new data is then processed with Databricks and PySpark and loaded into Snowflake. The project uses Infrastructure as Code (IaC) with Terraform to provision the cloud infrastructure.
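The monthly check for new data comes down to comparing the source listing against what has already been loaded. The sketch below shows that comparison as a pure function. The name `find_new_keys` and the example file names are hypothetical, and the project's actual Airflow tasks may work differently.

```python
from typing import Iterable, List


def find_new_keys(source_keys: Iterable[str], loaded_keys: Iterable[str]) -> List[str]:
    """Return CSV keys that appear in the source listing but are not yet loaded.

    Non-CSV objects are ignored, and the result is sorted so that repeated
    runs over the same listings give the same order.
    """
    already_loaded = set(loaded_keys)
    return sorted(
        key
        for key in source_keys
        if key.lower().endswith(".csv") and key not in already_loaded
    )


if __name__ == "__main__":
    # Hypothetical object keys, for illustration only.
    source = ["usage/file-01.csv", "usage/file-02.csv", "usage/notes.txt"]
    loaded = ["usage/file-01.csv"]
    print(find_new_keys(source, loaded))
```

In a real run, the two listings would come from listing the source bucket and the destination S3 bucket. Keeping the comparison separate from the listing code makes it easy to test without cloud access.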

Snowflake dashboard

The dashboard is built in Snowflake. It provides filters for date range, date bucket, and start station name.
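As a rough illustration of how the three dashboard filters combine, the sketch below counts rentals per date bucket in plain Python. It is not the Snowflake query behind the dashboard. The function names and the supported bucket values ("day", "week", "month") are assumptions.

```python
from collections import Counter
from datetime import date, datetime, timedelta
from typing import Dict, Iterable, Optional, Tuple


def bucket_start(ts: datetime, bucket: str) -> date:
    """Return the first day of the bucket (day, week, or month) containing ts."""
    day = ts.date()
    if bucket == "day":
        return day
    if bucket == "week":
        # Weeks are assumed to start on Monday.
        return day - timedelta(days=day.weekday())
    if bucket == "month":
        return day.replace(day=1)
    raise ValueError(f"unsupported bucket: {bucket!r}")


def rentals_per_bucket(
    rentals: Iterable[Tuple[datetime, str]],
    range_start: date,
    range_end: date,
    bucket: str,
    start_station: Optional[str] = None,
) -> Dict[date, int]:
    """Count rentals per bucket for (start date, start station name) pairs.

    Only rentals that start within [range_start, range_end] are counted.
    If start_station is given, only rentals from that station are counted.
    """
    counts: Counter = Counter()
    for started_at, station_name in rentals:
        if not (range_start <= started_at.date() <= range_end):
            continue
        if start_station is not None and station_name != start_station:
            continue
        counts[bucket_start(started_at, bucket)] += 1
    return dict(sorted(counts.items()))
```

For example, calling `rentals_per_bucket(rides, date(2020, 1, 1), date(2020, 1, 31), "week", start_station="A")` returns weekly counts for station "A" only, keyed by the Monday that starts each week.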
