IDS-706-Data-Engineering
Mini Project 10
Requirements
Use PySpark to perform data processing on a large dataset.
Include at least one Spark SQL query and one data transformation.
To-do List
Data processing: Process the dataset using PySpark's data processing functionality.
Spark SQL and transformations: Query the data with Spark SQL and transform the dataset by adding columns or rows as required.
Dataset
penguins.csv
The data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network. The dataset covers three penguin species observed in the Palmer Archipelago, Antarctica.
Among its variables, the less familiar 'bill_length_mm', 'bill_depth_mm', and 'flipper_length_mm' are illustrated in the figures accompanying this README.
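For orientation, here is a minimal sketch of loading the dataset into Spark (it assumes penguins.csv sits in the repository root and that a local session suffices; the project's actual entry point may differ):

```python
from pyspark.sql import SparkSession

# Start a local Spark session and read the CSV, letting Spark infer column types
spark = SparkSession.builder.appName("penguins").getOrCreate()
df = spark.read.csv("penguins.csv", header=True, inferSchema=True)
df.printSchema()
```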
Main Progress
Section 1: Spark SQL and Transformation
The data are processed in two complementary ways: through a Spark SQL query and through a DataFrame transformation.
lib.py
Spark SQL: This query displays the maximum and minimum values of the bill length by species.
```python
from pyspark.sql import SparkSession, DataFrame

def spark_sql_query(spark: SparkSession, data: DataFrame) -> DataFrame:
    # Create a temporary view so the DataFrame can be queried with SQL
    data.createOrReplaceTempView("penguins")
    # Execute the query using Spark SQL
    result = spark.sql("""
        SELECT species,
               MAX(bill_length_mm) AS max_bill_length,
               MIN(bill_length_mm) AS min_bill_length
        FROM penguins
        GROUP BY species
    """)
    result.show()
    return result
```
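Called with a session and DataFrame like those in the loading sketch above, the query runs as:

```python
# `spark` and `df` come from the loading sketch in the Dataset section
summary = spark_sql_query(spark, df)  # prints and returns per-species max/min bill length
```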
Transform: The transform function categorizes bill length: lengths below 40mm are classified as 'Short', those between 40mm and 50mm as 'Medium', and those above 50mm as 'Long'.
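The transform itself is not shown here, so the following is only a sketch of how that categorization could be written with PySpark's `when`/`otherwise`; the function name `transform_data` and the exact boundary handling at 40mm and 50mm are assumptions:

```python
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, when

def transform_data(data: DataFrame) -> DataFrame:
    # Hypothetical helper: adds a bill_length_category column.
    # Boundary handling is an assumption: <40 Short, 40-50 Medium, >50 Long.
    # Rows with a null bill_length_mm fall through to 'Long' unless filtered first.
    return data.withColumn(
        "bill_length_category",
        when(col("bill_length_mm") < 40, "Short")
        .when(col("bill_length_mm") <= 50, "Medium")
        .otherwise("Long"),
    )
```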