dend-capstone-project

Project Summary

This project aims to be able to answers questions on US immigration such as what are the most popular cities for immigration, what is the gender distribution of the immigrants, what is the visa type distribution of the immigrants, what is the average age per immigrant and what is the average temperature per month per city. We extract data from 3 different sources, the I94 immigration dataset of 2016, city temperature data from Kaggle and US city demographic data from OpenSoft. We design 4 dimension tables: Cities, immigrants, monthly average city temperature and time, and 1 fact table: Immigration. We use Spark for ETL jobs and store the results in parquet for downstream analysis.

Data

I94 Immigration Data: comes from the U.S. National Tourism and Trade Office and contains various statistics on international visitor arrival in USA and comes from the US National Tourism and Trade Office. The dataset contains data from 2016. link
World Temperature Data: comes from Kaggle and contains average weather temperatures by city. link
U.S. City Demographic Data: comes from OpenSoft and contains information about the demographics of all US cities such as average age, male and female population. link

Data model

The star schema is chosen as the data model because it is simple and yet effective. users can write simple queries by joing fact and dimension tables to analyze the data.

Here are the tables of the schema:

Staging Tables

staging_i94_df
    id
    date
    city_code
    state_code
    age
    gender
    visa_type
    count
    
staging_temp_df
    year
    month
    city_code
    city_name
    avg_temperature
    lat
    long
    
staging_demo_df
    city_code
    state_code
    city_name
    median_age
    pct_male_pop
    pct_female_pop
    pct_veterans
    pct_foreign_born
    pct_native_american
    pct_asian
    pct_black
    pct_hispanic_or_latino
    pct_white
    total_pop

Dimension Tables

immigrant_df
    id
    gender
    age
    visa_type
    
city_df
    city_code
    state_code
    city_name
    median_age
    pct_male_pop
    pct_female_pop
    pct_veterans
    pct_foreign_born
    pct_native_american
    pct_asian
    pct_black
    pct_hispanic_or_latino
    pct_white
    total_pop
    lat
    long
    
monthly_city_temp_df
    city_code
    year
    month
    avg_temperature
    
time_df
    date
    dayofweek
    weekofyear
    month

Fact Table

immigration_df
    id
    state_code
    city_code
    date
    count

Steps necessary to pipeline the data into the chosen data model

Clean the data on nulls, data types, duplicates, etc
Load staging tables for staging_i94_df, staging_temp_df and staging_demo_df
Create dimension tables for immigrant_df, city_df, monthly_city_temp_df and time_df
Create fact table immigration_df with information on immigration count, mapping id in immigrant_df, city_code in city_df and monthly_city_temp_df and date in time_df ensuring referential integrity
Save processed dimension and fact tables in parquet for downstream query

hamadalaqeel / dend-capstone-project Goto Github PK