zerotwodatarw / de-stream-project-random-generated-user-data

An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.

Realtime Data Streaming | End-to-End Data Engineering Project

Introduction

This project serves as a comprehensive guide to building an end-to-end data engineering pipeline. It covers each stage from data ingestion to processing and finally to storage, utilizing a robust tech stack that includes Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. Everything is containerized using Docker for ease of deployment and scalability.

System Architecture

The project is designed with the following components:

  • Data Source: The randomuser.me API generates the random user data that feeds the pipeline.
  • Apache Airflow: Orchestrates the pipeline and stores the fetched data in a PostgreSQL database.
  • Apache Kafka and Zookeeper: Stream the data from PostgreSQL to the processing engine.
  • Control Center and Schema Registry: Provide monitoring and schema management for the Kafka streams.
  • Apache Spark: Processes the data using its master and worker nodes.
  • Cassandra: Stores the processed data.
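
As a rough illustration of the first stage, the sketch below fetches one record from the randomuser.me API and flattens it into a single-level dictionary before it moves downstream. The function names and the choice of output fields are assumptions made for this example, not necessarily what the project's DAG uses.

```python
import json
import urllib.request

API_URL = "https://randomuser.me/api/"  # public endpoint of the data source


def fetch_random_user(url: str = API_URL) -> dict:
    """Fetch one random user record (requires network access)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        payload = json.load(response)
    # The API wraps its records in a "results" list.
    return payload["results"][0]


def format_user(raw: dict) -> dict:
    """Flatten the nested API record into a single-level dict.

    The selection of fields here is illustrative only.
    """
    location = raw["location"]
    street = location["street"]
    return {
        "first_name": raw["name"]["first"],
        "last_name": raw["name"]["last"],
        "gender": raw["gender"],
        "address": (
            f"{street['number']} {street['name']}, "
            f"{location['city']}, {location['state']}, {location['country']}"
        ),
        "email": raw["email"],
        "username": raw["login"]["username"],
    }
```

Keeping the formatting step separate from the network call makes it easy to check the flattened output against a saved sample record without hitting the API.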

What You'll Learn

  • Setting up a data pipeline with Apache Airflow
  • Real-time data streaming with Apache Kafka
  • Distributed synchronization with Apache Zookeeper
  • Data processing techniques with Apache Spark
  • Data storage solutions with Cassandra and PostgreSQL
  • Containerizing your entire data engineering setup with Docker
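
To make the Kafka streaming item more concrete, here is a minimal sketch of publishing one record as a JSON message. It assumes the kafka-python package and a broker reachable at localhost:9092; the topic name `users_created` is hypothetical, so substitute whatever the project's DAG and Compose file actually define.

```python
import json

TOPIC = "users_created"        # hypothetical topic name
BOOTSTRAP = "localhost:9092"   # assumed broker address


def serialize(record: dict) -> bytes:
    """Encode a record as UTF-8 JSON bytes for use as a Kafka message value."""
    return json.dumps(record).encode("utf-8")


def publish(record: dict, topic: str = TOPIC, servers: str = BOOTSTRAP) -> None:
    """Send one record to Kafka (needs kafka-python and a running broker)."""
    # Imported lazily so serialize() stays usable without the package installed.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers=servers, value_serializer=serialize)
    producer.send(topic, record)
    producer.flush()  # wait until buffered messages have been sent
    producer.close()
```

Passing the serializer to the producer keeps the calling code working with plain dictionaries, while consumers such as Spark can parse the JSON on the other side.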

Technologies

  • Apache Airflow
  • Python
  • Apache Kafka
  • Apache Zookeeper
  • Apache Spark
  • Cassandra
  • PostgreSQL
  • Docker

Getting Started

  1. Clone the repository:
git clone https://github.com/ZeroTwoDataRW/DE-Stream-Project-Random-Generated-User-Data.git
  2. Navigate to the project directory:
cd DE-Stream-Project-Random-Generated-User-Data
  3. Run Docker Compose to start the services in the background:
docker-compose up -d
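
Once the containers have started, it can help to confirm that the main services accept connections before triggering the pipeline. The sketch below polls TCP ports from the host; the service names and ports (8080 for the Airflow webserver, 9092 for Kafka, 9042 for Cassandra) are common defaults assumed for illustration, so check the port mappings in the project's Compose file.

```python
import socket
import time


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_services(services: dict, retries: int = 30, delay: float = 2.0) -> dict:
    """Poll each (host, port) until all are reachable or retries run out."""
    status = {name: False for name in services}
    for _ in range(retries):
        for name, (host, port) in services.items():
            if not status[name]:
                status[name] = is_port_open(host, port)
        if all(status.values()):
            break
        time.sleep(delay)
    return status


# Assumed default ports; the real mappings live in the Compose file.
SERVICES = {
    "airflow-webserver": ("localhost", 8080),
    "kafka-broker": ("localhost", 9092),
    "cassandra": ("localhost", 9042),
}
```

Calling `wait_for_services(SERVICES)` returns a dictionary that maps each service name to `True` or `False`, which shows at a glance which container is still starting up.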

Screenshots of Project Steps to Design

  • Installing the Python 3 virtual environment
  • Checking the formatted data results
  • Kafka and Zookeeper connected
  • Kafka Python library installed
  • Docker images installed
  • Sending data to the Kafka broker
  • Kafka topic created
  • Running images
  • Checking that Airflow is working
  • user_automation DAG
  • All Docker images running
  • DAG is running
  • Adding Git to the project
  • Keyspace and table created successfully
  • Describing the Spark streams
