mini-project-10's Introduction

PySpark Data Processing

IDS-706-Data-Engineering 💻

Mini Project 10 📄

☑️ Requirements

  • Use PySpark to perform data processing on a large dataset.
  • Include at least one Spark SQL query and one data transformation.

☑️ To-do List

  • Data processing functionality: Learn data processing functionality using PySpark.
  • Use of Spark SQL and transformations: Use Spark SQL and transform the dataset by adding columns or rows as required.

☑️ Dataset

penguins.csv

  • Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network. It shows three different species of penguins observed in the Palmer Archipelago, Antarctica.
  • penguins.csv
  • Description of variables

    • The dataset contains several key variables. The less familiar ones, 'bill_length_mm', 'bill_depth_mm', and 'flipper_length_mm', are illustrated in the following figures.

☑️ Main Progress

Section 1 Use Spark SQL and transform

The data is processed in two ways: with a Spark SQL query and with a DataFrame transformation.
  • lib.py
  1. Spark SQL: This query displays the maximum and minimum values of the bill length by species.
from pyspark.sql import DataFrame, SparkSession


def spark_sql_query(spark: SparkSession, data: DataFrame) -> DataFrame:
    # Register the DataFrame as a temporary view so it can be queried by name
    data.createOrReplaceTempView("penguins")

    # Aggregate the longest and shortest bill length for each species
    result = spark.sql("""
        SELECT species,
               MAX(bill_length_mm) AS max_bill_length,
               MIN(bill_length_mm) AS min_bill_length
        FROM penguins
        GROUP BY species
    """)
    result.show()
    return result
  2. Transform: The transform function adds a 'bill_length_category' column based on bill length: lengths below 40 mm are 'Short', lengths from 40 mm up to (but not including) 50 mm are 'Medium', and lengths of 50 mm or more are 'Long'. Because 'Long' is the catch-all branch, rows with a missing bill length are also labeled 'Long'.
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F


def transform(spark: SparkSession, data: DataFrame) -> DataFrame:
    # Add a 'bill_length_category' column derived from 'bill_length_mm'
    bill_length = F.col("bill_length_mm")
    category = (
        F.when(bill_length < 40, "Short")
        .when((bill_length >= 40) & (bill_length < 50), "Medium")
        # Everything else, including NULL bill lengths, falls through to "Long"
        .otherwise("Long")
    )
    return data.withColumn("bill_length_category", category)

Section 2 See the pipeline of CI/CD

Observe the CI/CD pipeline in action, which includes steps for installation, formatting, linting, and testing.
  1. make format
  2. make lint
  3. make test

Section 3 PySpark Script

Capture the output of the PySpark command and save it to a report.
  1. Read the CSV file
Original Data:
+-------+---------+--------------+-------------+-----------------+-----------+------+
|species|   island|bill_length_mm|bill_depth_mm|flipper_length_mm|body_mass_g|   sex|
+-------+---------+--------------+-------------+-----------------+-----------+------+
| Adelie|Torgersen|          39.1|         18.7|              181|       3750|  MALE|
| Adelie|Torgersen|          39.5|         17.4|              186|       3800|FEMALE|
| Adelie|Torgersen|          40.3|         18.0|              195|       3250|FEMALE|
| Adelie|Torgersen|          NULL|         NULL|             NULL|       NULL|  NULL|
| Adelie|Torgersen|          36.7|         19.3|              193|       3450|FEMALE|
| Adelie|Torgersen|          39.3|         20.6|              190|       3650|  MALE|
| Adelie|Torgersen|          38.9|         17.8|              181|       3625|FEMALE|
| Adelie|Torgersen|          39.2|         19.6|              195|       4675|  MALE|
| Adelie|Torgersen|          34.1|         18.1|              193|       3475|  NULL|
| Adelie|Torgersen|          42.0|         20.2|              190|       4250|  NULL|
| Adelie|Torgersen|          37.8|         17.1|              186|       3300|  NULL|
| Adelie|Torgersen|          37.8|         17.3|              180|       3700|  NULL|
| Adelie|Torgersen|          41.1|         17.6|              182|       3200|FEMALE|
| Adelie|Torgersen|          38.6|         21.2|              191|       3800|  MALE|
| Adelie|Torgersen|          34.6|         21.1|              198|       4400|  MALE|
| Adelie|Torgersen|          36.6|         17.8|              185|       3700|FEMALE|
| Adelie|Torgersen|          38.7|         19.0|              195|       3450|FEMALE|
| Adelie|Torgersen|          42.5|         20.7|              197|       4500|  MALE|
| Adelie|Torgersen|          34.4|         18.4|              184|       3325|FEMALE|
| Adelie|Torgersen|          46.0|         21.5|              194|       4200|  MALE|
+-------+---------+--------------+-------------+-----------------+-----------+------+
only showing top 20 rows
  2. Spark SQL Query
Data After Spark SQL Query:
+---------+---------------+---------------+
|  species|max_bill_length|min_bill_length|
+---------+---------------+---------------+
|   Gentoo|           59.6|           40.9|
|   Adelie|           46.0|           32.1|
|Chinstrap|           58.0|           40.9|
+---------+---------------+---------------+
  3. Transform
Data After Adding Bill Length Category:
+-------+---------+--------------+-------------+-----------------+-----------+------+--------------------+
|species|   island|bill_length_mm|bill_depth_mm|flipper_length_mm|body_mass_g|   sex|bill_length_category|
+-------+---------+--------------+-------------+-----------------+-----------+------+--------------------+
| Adelie|Torgersen|          39.1|         18.7|              181|       3750|  MALE|               Short|
| Adelie|Torgersen|          39.5|         17.4|              186|       3800|FEMALE|               Short|
| Adelie|Torgersen|          40.3|         18.0|              195|       3250|FEMALE|              Medium|
| Adelie|Torgersen|          NULL|         NULL|             NULL|       NULL|  NULL|                Long|
| Adelie|Torgersen|          36.7|         19.3|              193|       3450|FEMALE|               Short|
| Adelie|Torgersen|          39.3|         20.6|              190|       3650|  MALE|               Short|
| Adelie|Torgersen|          38.9|         17.8|              181|       3625|FEMALE|               Short|
| Adelie|Torgersen|          39.2|         19.6|              195|       4675|  MALE|               Short|
| Adelie|Torgersen|          34.1|         18.1|              193|       3475|  NULL|               Short|
| Adelie|Torgersen|          42.0|         20.2|              190|       4250|  NULL|              Medium|
| Adelie|Torgersen|          37.8|         17.1|              186|       3300|  NULL|               Short|
| Adelie|Torgersen|          37.8|         17.3|              180|       3700|  NULL|               Short|
| Adelie|Torgersen|          41.1|         17.6|              182|       3200|FEMALE|              Medium|
| Adelie|Torgersen|          38.6|         21.2|              191|       3800|  MALE|               Short|
| Adelie|Torgersen|          34.6|         21.1|              198|       4400|  MALE|               Short|
| Adelie|Torgersen|          36.6|         17.8|              185|       3700|FEMALE|               Short|
| Adelie|Torgersen|          38.7|         19.0|              195|       3450|FEMALE|               Short|
| Adelie|Torgersen|          42.5|         20.7|              197|       4500|  MALE|              Medium|
| Adelie|Torgersen|          34.4|         18.4|              184|       3325|FEMALE|               Short|
| Adelie|Torgersen|          46.0|         21.5|              194|       4200|  MALE|              Medium|
+-------+---------+--------------+-------------+-----------------+-----------+------+--------------------+
only showing top 20 rows
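
The capture step described in this section could be sketched as follows. This is a minimal illustration and not the project's actual script: it assumes the tables are produced by `DataFrame.show()`, which prints to standard output, and the helper name `append_report_section` and the file name `report.md` are hypothetical.

```python
import contextlib
import io
from pathlib import Path
from typing import Callable


def append_report_section(
    report_path: str, title: str, show_fn: Callable[[], None]
) -> str:
    """Run show_fn, capture what it prints, and append it to the report file."""
    buffer = io.StringIO()
    # Anything printed while this block is active goes to the buffer
    with contextlib.redirect_stdout(buffer):
        show_fn()
    captured = buffer.getvalue()

    # Append a titled section so each step adds to the same report
    with Path(report_path).open("a", encoding="utf-8") as report:
        report.write(f"{title}:\n{captured}\n")
    return captured
```

With a loaded DataFrame named `data`, a call such as `append_report_section("report.md", "Original Data", data.show)` would add a section in the same layout as the output above.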
