NYC Taxi Azure Data Engineering Project

Introduction

In this Udemy course project created by Ramesh Retnasamy, I implemented a data engineering solution using all services available on Azure Synapse Analytics to analyze and report data on Taxi Green trips in New York City from 2020 to 2021.

What is Azure Synapse Analytics

Azure Synapse Analytics is a powerful cloud-based analytics service provided by Microsoft Azure. It integrates and combines big data and data warehousing capabilities into a unified platform. Synapse Analytics allows you to analyze large volumes of structured and unstructured data from various sources, such as databases, data lakes, and streaming data.

With Synapse Analytics, you can perform data ingestion, data preparation, data warehousing, big data processing, and data exploration all in one place. It provides tools and services for data integration, data wrangling, and data orchestration, making it easier to manage complex data workflows.

About the Dataset

This dataset contains information about NYC Green trips from 2020 and 2021. - NYC Green Trip Data

The data was transformed to different file types, and I worked with them in the Synapse Workspace to ingest (Extract), Tranform, and Load data using Serverless Pool:

Dimension Table

Taxi Zone
- CSV file with header
- CSV file without header
Calendar
- CSV file
Vendor
- CSV file
- CSV file with escaped characters( \ )
- CSV file with unquote (" ")
Rate Code
- TSV, Tab-Separated
Trip Type
- JSON file
Payment Type
- JSON file
- JSON file with array

Fact Table

Trip Data Green
- CSV file with Partitioned by Year and Month
- Parquet file with Partitioned by Year and Month
- Delta file with Partitioned by Year and Month

Services Used

Serverless SQL Pool: The Serverless SQL Pool in Azure Synapse Analytics is a powerful feature that enables you to query large volumes of data without the need for provisioning or managing dedicated resources. It offers an on-demand, serverless approach to executing SQL queries against data stored in various formats within your Azure Data Lake Storage Gen2.

Solution Architecture

Apache Spark Pool: The Apache Spark Pool in Azure Synapse Analytics is a powerful computing resource designed for large-scale data processing and analytics. It provides a highly scalable and distributed processing engine, allowing users to execute Spark jobs on massive datasets. The Spark Pool leverages the Apache Spark framework, enabling users to perform complex data transformations, machine learning tasks, and advanced analytics. With seamless integration into Azure Synapse Analytics, it offers a unified environment for data ingestion, exploration, and visualization, empowering organizations to gain valuable insights from their data quickly and efficiently.

Solution Architecture

Dedicated SQL Pool: Azure Synapse Analytics's Dedicated SQL Pool is a powerful and scalable service designed for running large-scale data warehousing and analytics workloads. It provides a dedicated set of compute and storage resources, allowing organizations to efficiently process and analyze vast amounts of data. With its massively parallel processing architecture, it can handle complex queries and deliver high-performance results. The Dedicated SQL Pool offers advanced data integration capabilities, including data loading, transformation, and querying using familiar SQL syntax. It empowers businesses to gain valuable insights from their data while maintaining robust security and data governance measures.

Solution Architecture

Synapse Link: Synapse Link is a feature within Azure Synapse Analytics that enables seamless and real-time integration between the cloud-based analytical capabilities of Azure Synapse and operational data stored in Azure Cosmos DB. It eliminates the need for data movement or duplication, allowing organizations to directly query and analyze live data from Azure Cosmos DB using familiar SQL-based tools and techniques. Synapse Link provides a unified and efficient way to bridge the gap between analytical and operational systems, enabling organizations to gain valuable insights from their real-time data without compromising on performance or scalability.

Solution Architecture

Power BI Integration: Power BI for Azure Synapse Analytics is a powerful analytics tool that combines the capabilities of Power BI and Azure Synapse Analytics. It enables organizations to extract valuable insights from their data by seamlessly integrating data ingestion, data preparation, data warehousing, and visualization. With Power BI for Azure Synapse Analytics, users can create visually appealing and interactive dashboards, reports, and data visualizations, allowing them to analyze and explore data from various sources within a unified and efficient environment. This integration empowers businesses to make data-driven decisions and gain a deeper understanding of their data at scale.

Power BI Report

Project Execution Flow

Main Steps

Discovery (Exploratory Data Analsys with T-SQL)
Create External Table (Bronze Schema) to ingest (Extract) data and create External Table/View.
Create External Table (Silver Schema) to transform data in the appropriate format.
Create External Table (Gold Schema) to join table for using for Power BI and for keeping the data in a Data Warehouse (Dedicated SQL Pool).
Create ETL Pipeline and schedule trigger runs (from Bronze to Gold).

Additional Steps

Transform data with Apache Spark Pool (Notebooks).
Query data from Azure Cosmos DB (for real-time data and saving as JSON files)
Provision Dedicated SQL Pool to keep final data (Gold).
Create Power BI report, connecting it to Azure Synapse Analytics, and publish to Power BI Service.

claudedjale / nyc-taxi-azure-data-engineering-project Goto Github PK