
traffic-data-pipeline

Data warehouse tech stack with MySQL, dbt, and Airflow

Introduction

We aim to build an AI company that installs sensors in client enterprises and gathers data from across the whole facility, such as people's activities, household sensors in the building, the surroundings, and other important information. Our company is responsible for installing all of the required sensors, collecting the continuous stream of data they produce, and analyzing that data to deliver critical business insights.

Objective

Our objective is to reduce the cost of running the client facility and to increase the livability and productivity of its workers by building a scalable data warehouse tech stack that supports delivering the AI service to the client.

Overview of Dataset

The data contains all of the 30-second raw sensor data from January to October 2016 for the I-80 corridor near Davis, CA.

- Summary statistics for each station can be found here.
- The median of total flow for each weekday, which could be used to make an animation, can be found here.
- The 30-second time series for a single station (Richards Ave, near downtown Davis) can be found here.
- The medians for each observation grouped by (station, weekday, hour, half-minute) can be found here; a sketch of this aggregation follows the list.
- The metadata is small (just 53 stations) and can be found here.
- The main data, around 35 million observations, can be found here.
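
As a rough illustration of the grouped-median aggregation mentioned above, here is a minimal pandas sketch. The file path and column names (`station`, `timestamp`, `total_flow`) are assumptions for illustration, not the dataset's actual schema.

```python
import pandas as pd

# NOTE: the file path and column names are illustrative assumptions.
df = pd.read_csv("data/traffic_30sec.csv", parse_dates=["timestamp"])

# Derive the grouping keys: weekday, hour, and the 30-second slot within the hour.
df["weekday"] = df["timestamp"].dt.dayofweek  # 0 = Monday
df["hour"] = df["timestamp"].dt.hour
df["half_minute"] = (df["timestamp"].dt.minute * 60
                     + df["timestamp"].dt.second) // 30  # slots 0..119

# Median total flow for each (station, weekday, hour, half-minute) group.
medians = (
    df.groupby(["station", "weekday", "hour", "half_minute"])["total_flow"]
      .median()
      .reset_index()
)
print(medians.head())
```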

Techniques Used

The main goal of this project is to build a data warehouse tech stack. The technologies used are Airflow (DAG orchestration), dbt, MySQL (the data warehouse), Redash, and Power BI (data visualization). The process I followed: first, I imported the data files into the database. To do that, I created a DAG in Airflow that uses the PythonOperator, along with the ShortCircuitOperator, which allows a workflow to continue only if a condition is met; otherwise the workflow "short-circuits" and downstream tasks are skipped. A sketch of such a DAG is shown below. I then connected dbt to the data warehouse and wrote the data transformation code, and created documentation for the data models so they can be browsed in the dbt docs UI. Finally, I connected the reporting environment and used the data to build a dashboard.
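
Below is a minimal sketch of what such a loading DAG could look like. The file path, connection string, table name, and schedule are assumptions for illustration, not the project's actual code.

```python
from datetime import datetime
import os

import pandas as pd
from sqlalchemy import create_engine
from airflow import DAG
from airflow.operators.python import PythonOperator, ShortCircuitOperator

# NOTE: the path, connection string, and table name are illustrative assumptions.
DATA_FILE = "/opt/airflow/data/traffic_raw.csv"
MYSQL_URI = "mysql+pymysql://user:password@mysql:3306/traffic_dwh"

def data_file_exists() -> bool:
    # ShortCircuitOperator skips all downstream tasks when this returns False.
    return os.path.exists(DATA_FILE)

def load_to_mysql() -> None:
    # Bulk-load the raw sensor CSV into a staging table in the warehouse.
    engine = create_engine(MYSQL_URI)
    df = pd.read_csv(DATA_FILE)
    df.to_sql("raw_traffic", engine, if_exists="append", index=False)

with DAG(
    dag_id="traffic_data_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    check_file = ShortCircuitOperator(
        task_id="check_data_file",
        python_callable=data_file_exists,
    )
    load_data = PythonOperator(
        task_id="load_to_mysql",
        python_callable=load_to_mysql,
    )
    check_file >> load_data
```

Here the ShortCircuitOperator guards the load: if the expected file is missing, the load task is skipped rather than failed. The dbt transformations themselves live in SQL model files; after the load, the DAG could trigger them, for example with a task that runs `dbt run`.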

A link to the deployed dbt documentation can be found here.


Challenges

One of the difficulties I encountered while working on this project was understanding the dataset. The datasets don't come with much description, and that lack of information made data manipulation, transformation, and deeper analysis hard. In addition, limited processing capability was an issue: when I ran Airflow, it created multiple containers that slowed down my PC and eventually froze it several times.

Future Improvements

Given more time, I would explore other dbt packages that help with data quality monitoring and better data processing. The dataset was complex; with more time, much more exploration and better transformations would have been possible, and dbt could have been put to fuller use.

