
Big Data Distributed System

Build a robust big data system with JupyterLab, Airflow, Spark, Trino, Superset, MinIO, Kafka, Debezium, and Delta Lake. The project covers the storage, computing, and visualization layers of the platform.

Objective

The goal of this project is to create a distributed data system capable of processing and analyzing large datasets from multiple sources and providing comprehensive reporting and data visualization for end-users.

Table of Contents

  1. Objective
  2. Data Pipeline Architecture
  3. Developing Components for the Big Data System
  4. Storage Layer
  5. Computing Layer
  6. Visualization Layer

Data Pipeline Architecture

The data platform comprises three main layers: Storage, Computing, and Visualization. A comprehensive diagram of the Data Pipeline is presented below.

[Figure: Data Architecture diagram]
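As a rough illustration of the three-layer split, the layers can be written down as a small lookup structure. This is only a sketch: the `Layer` class and `layer_of` helper are hypothetical names, and the mapping includes only the tools that this README explicitly places in a layer.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Layer:
    """One layer of the platform and the tools this README assigns to it."""
    name: str
    components: Tuple[str, ...]


# Only tools that the README explicitly places in a layer are listed here.
LAYERS = (
    Layer("Storage", ("Kafka", "MinIO")),
    Layer("Computing", ("Spark", "Trino")),
    Layer("Visualization", ("Superset",)),
)


def layer_of(component: str) -> str:
    """Return the name of the layer that contains the given component."""
    for layer in LAYERS:
        if component in layer.components:
            return layer.name
    raise KeyError(component)


trino_layer = layer_of("Trino")
```

Tools such as Airflow, JupyterLab, Debezium, and Delta Lake are left out of the mapping because the text does not say which layer they belong to.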

Storage Layer

The Storage Layer integrates Kafka and MinIO (object storage) to store raw data originating from user events, backend logs, third-party sources, and more. This layer serves as the primary repository for several kinds of data, including raw data, warehouse data, and data marts. For details, refer to the MinIO Operator Documentation.
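One way to keep raw, warehouse, and data mart objects apart in a single object store is a zone-prefixed, date-partitioned key scheme. The sketch below is an assumption for illustration, not the project's actual layout: the zone names, the `object_key` helper, and the example source and file names are all hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical zone names mirroring the data types listed above.
ZONES = ("raw", "warehouse", "mart")


def object_key(zone: str, source: str, event_time: datetime, filename: str) -> str:
    """Build a date-partitioned object key of the form zone/source/YYYY/MM/DD/filename."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    # Partition by UTC date so that all writers agree on the same folder.
    day = event_time.astimezone(timezone.utc).strftime("%Y/%m/%d")
    return f"{zone}/{source}/{day}/{filename}"


key = object_key(
    "raw",
    "user_events",
    datetime(2024, 1, 15, 8, 30, tzinfo=timezone.utc),
    "part-0000.json",
)
```

Partitioning keys by date lets downstream jobs read only the days they need instead of listing the whole bucket.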

Computing Layer

The Computing Layer comprises the following components: a Spark cluster (Executor Engine), a High-Performance Query Engine, an Analytic Engine, CDC, and an ETL system.

  • Spark Operator (Executor Engine): Executes data processing and distributed computing tasks. It covers both batch and streaming workloads, which enables live streaming and real-time data analytics. For details, refer to the Spark Operator (Executor Engine) Documentation.

  • High-Performance Query Engine: Uses tools such as Trino, Presto, and similar software designed for efficient access to and processing of data held in databases or storage systems. This component optimizes speed and resource utilization for analytics tools in batch data processing. For details, refer to the High-Performance Query Engine documentation.

  • Analytic Engine: (In development)

  • CDC: (In development)

  • ETL: (In development)
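The batch and streaming modes mentioned for the executor engine can be contrasted with a small, framework-free sketch. It uses plain Python rather than the Spark API, and the event records and function names are hypothetical. The point is only that a batch job scans the full dataset once, while a streaming job keeps running state and emits updated results as each micro-batch arrives.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical event records: (user_id, event_type). Not taken from the project.
Event = Tuple[str, str]


def batch_count(events: Iterable[Event]) -> Dict[str, int]:
    """Batch mode: scan the full dataset once and count events per type."""
    return dict(Counter(event_type for _, event_type in events))


def streaming_count(micro_batches: Iterable[List[Event]]) -> Iterator[Dict[str, int]]:
    """Streaming mode: keep running state and emit updated counts per micro-batch."""
    state = Counter()
    for batch in micro_batches:
        state.update(event_type for _, event_type in batch)
        yield dict(state)


events = [("u1", "click"), ("u2", "view"), ("u1", "click"), ("u3", "purchase")]
micro_batches = [events[:2], events[2:]]

batch_result = batch_count(events)
stream_results = list(streaming_count(micro_batches))
```

Once every micro-batch has been consumed, the last streaming result matches the batch result. Intermediate results are available earlier, which is what makes real-time analytics possible.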

Visualization Layer

The Visualization Layer consists of tools that present data from the Storage Layer as tables, charts, and other visual forms so that users can understand it easily. These tools include Superset, Power BI, and others.

Team Members

  • dnguyenngoc (contributor)
