Google Flights ETL Project

Overview

This repository contains the source code and documentation for the Google Flights ETL project, an end-to-end data engineering pipeline for tracking daily changes in flight prices. The pipeline crawls flight data daily with Selenium, stores the raw data in MySQL, processes it with Apache Spark, lands the processed data in HDFS (the data lake), and warehouses it in Hive. The entire environment is containerized and deployed with Docker.
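
The daily crawl step of this flow might look roughly like the sketch below. This is a minimal illustration only, not the repository's flight_selenium.py: the search URL, the CSS selector, the MySQL connection settings, and the raw_flights table are assumptions.

    # Minimal sketch of a Selenium crawl that stores raw prices in MySQL.
    # Selectors, credentials, and table/column names are illustrative assumptions.
    from datetime import date

    import mysql.connector
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a display
    driver = webdriver.Chrome(options=options)

    try:
        # Hypothetical search URL for a fixed route; a real crawler would build this per route and date.
        driver.get("https://www.google.com/travel/flights?q=flights%20from%20SGN%20to%20HAN")
        # Hypothetical selector for price elements; wait until the results have rendered.
        price_elements = WebDriverWait(driver, 20).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, "span[aria-label*='US dollars']"))
        )
        prices = [el.text for el in price_elements]
    finally:
        driver.quit()

    # Store one row per observed price, keyed by the crawl date.
    conn = mysql.connector.connect(host="localhost", user="etl", password="etl", database="flights_db")
    cursor = conn.cursor()
    cursor.executemany(
        "INSERT INTO raw_flights (crawl_date, price_text) VALUES (%s, %s)",
        [(date.today().isoformat(), p) for p in prices],
    )
    conn.commit()
    conn.close()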

System Architecture

[System architecture diagram]

Project Structure

Components

  • Selenium: Web scraping tool used for extracting flight data from Google Flights.
  • MySQL: Relational database used for storing raw flight data.
  • Apache Spark: Distributed data processing engine for data transformation.
  • HDFS: Distributed file system used as a data lake for storing processed data.
  • Hive: Data warehousing tool for querying and analyzing structured data.
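
As a rough illustration of how these components fit together, an ingestion job in this kind of stack reads the raw MySQL table over JDBC and lands it in HDFS partitioned by execution date. The sketch below is an assumption-based outline, not the repository's ingestion.py: the JDBC URL, credentials, table name, HDFS address, and output path are placeholders.

    # Sketch of a MySQL -> HDFS ingestion with PySpark.
    # Requires the MySQL JDBC driver on the Spark classpath (e.g. via --jars or --packages).
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("flights-ingestion").getOrCreate()

    execution_date = "2024-01-01"  # normally supplied via the --executionDate argument

    # Read the raw table that the Selenium crawler populates (names are placeholders).
    raw = (
        spark.read.format("jdbc")
        .option("url", "jdbc:mysql://mysql:3306/flights_db")
        .option("dbtable", "flights")
        .option("user", "etl")
        .option("password", "etl")
        .option("driver", "com.mysql.cj.jdbc.Driver")
        .load()
    )

    # Land the daily snapshot in the data lake, one partition per execution date.
    (
        raw.withColumn("execution_date", F.lit(execution_date))
        .write.mode("append")
        .partitionBy("execution_date")
        .parquet("hdfs://namenode:9000/datalake/flights")
    )

    spark.stop()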

Prerequisites

  • Docker and Docker Compose installed on your machine.
  • Python and the libraries required for the Selenium scraper.

Getting Started

  1. Clone the repository:

    git clone https://github.com/your-username/google_flight_etl.git
    cd google_flight_etl
  2. Build and run the Docker containers:

    cd docker-configuration
    docker-compose up -d
  3. Execute the data crawling process:

    python flight_selenium.py
  4. Execute the Hadoop ingestion job; remember to set the --executionDate parameter to your current date (a rough sketch of the general shape of these Spark jobs appears after this list):

    docker exec -it namenode bash
    spark-submit --master spark://spark-master:7077 pyspark-jobs/ingestion.py --tblName "flights" --executionDate "YYYY-MM-DD"
  5. Execute the Hive transformation job:

    docker exec -it namenode bash
    spark-submit --master spark://spark-master:7077 pyspark-jobs/transformation.py --executionDate "YYYY-MM-DD"
  6. Deploy Superset, connect Hive to it as a data source, and design your own dashboards (a note on the Hive connection URI appears after this list):

    export SUPERSET_VERSION=<latest_version>
    
    docker pull apache/superset:$SUPERSET_VERSION
    
    docker run -d -p 8088:8088 \
             -e "SUPERSET_SECRET_KEY=$(openssl rand -base64 42)" \
             -e "TALISMAN_ENABLED=False" \
             --name superset apache/superset:$SUPERSET_VERSION
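
For orientation, the ingestion and transformation jobs submitted in steps 4 and 5 follow a common PySpark shape: parse the --tblName/--executionDate arguments, read one layer, and write the next. The sketch below shows what a transformation from the HDFS data lake into a Hive table could look like; the paths, the warehouse database, and the aggregation columns are assumptions, not the repository's transformation.py.

    # Sketch of a transformation job: HDFS Parquet (data lake) -> Hive warehouse table.
    # Paths, database/table names, and the aggregation are illustrative assumptions.
    import argparse

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    parser = argparse.ArgumentParser()
    parser.add_argument("--tblName", default="flights")
    parser.add_argument("--executionDate", required=True)  # e.g. "2024-01-01"
    args = parser.parse_args()

    spark = (
        SparkSession.builder.appName("flights-transformation")
        .enableHiveSupport()  # lets Spark write tables into the Hive metastore
        .getOrCreate()
    )

    # Read only the partition for the requested execution date.
    lake = spark.read.parquet(f"hdfs://namenode:9000/datalake/{args.tblName}")
    daily = lake.filter(F.col("execution_date") == args.executionDate)

    # Example curation: average price per route per day (column names are assumed).
    curated = daily.groupBy("origin", "destination", "execution_date").agg(
        F.avg("price").alias("avg_price")
    )

    # Assumes a "warehouse" database already exists in the Hive metastore.
    curated.write.mode("append").saveAsTable("warehouse.daily_flight_prices")

    spark.stop()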
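
When connecting Hive to Superset in step 6, the database is registered with a SQLAlchemy URI. With the PyHive driver installed in the Superset container, that URI typically has the form below; the hostname, port, and database name are assumptions about the Docker setup, not values taken from this repository.

    hive://hive@hive-server:10000/default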

Superset Dashboard Example

[Superset dashboard example]

Acknowledgments

This project is inspired by the Data Lake & Warehousing demo by Mr. Canh Tran (Data Guy Story). The architecture design and the Spark (Scala) ingestion and transformation scripts were outlined in a video available here; I adapted them to PySpark for this implementation.

Contact

For questions or support, please contact [[email protected]].
