Giter Site home page Giter Site logo

movies-etl's Introduction

Movies-ETL

Overview of the Analysis:

Amazing Prime loves the dataset and wants to keep it updated on a daily basis. Britta needs your help to create an automated pipeline that takes in new data, performs the appropriate transformations, and loads the data into existing tables. You’ll need to refactor the code from this module to create one function that takes in the three files—Wikipedia data, Kaggle metadata, and the MovieLens rating data—and performs the ETL process by adding the data to a PostgreSQL database.

Purpose:

This new assignment consists of four technical analysis deliverables. You will submit the following:

-Deliverable 1: Write an ETL Function to Read Three Data Files

-Deliverable 2: Extract and Transform the Wikipedia Data

-Deliverable 3: Extract and Transform the Kaggle data

-Deliverable 4: Create the Movie Database

Resources

Data Source: CSV, JSON.

Software: Jupyter, SQL, pgAdmin 4.

Results:

Deliverable 1:

-An ETL function is written to read in the three data files.

-The function converts the Wikipedia JSON file to a Pandas DataFrame, and the DataFrame is displayed in the ETL_function_test.ipynb file.

-The function converts the Kaggle metadata file to a Pandas DataFrame, and the DataFrame is displayed in the ETL_function_test.ipynb file.

-The function converts the MovieLens ratings data file to a Pandas DataFrame, and the DataFrame is displayed in the ETL_function_test.ipynb file.

Deliverable 2:

-The TV shows are filtered out, and the wiki_movies_df DataFrame is created.

-A try-except block is used to catch errors while extracting the IMDb IDs with a regular expression and dropping duplicate IDs.

-The extraction and transformation of the Wikipedia data in the ETL function does the following:

  • A list comprehension is used to keep columns with non-null values.

  • The non-null box office data is converted to string values using the lambda and join functions.

  • A regular expression is used to match the six elements of "form_one" of the box office data.

  • A regular expression is used to match the three elements of "form_two" of the box office data.

-The following columns are cleaned in the Wikipedia DataFrame:

  • The box office column
  • The budget column
  • The release date column
  • The running time column

-The cleaned Wikipedia data is converted to a Pandas DataFrame, and the DataFrame is displayed in the ETL_clean_wiki_movies.ipynb file.

Deliverable 3:

-The extraction and transformation of the Kaggle metadata using the ETL function does the following:

  • The Kaggle metadata is cleaned.

  • The Wikipedia and Kaggle DataFrames are merged.

  • The following is performed on the merged Wikipedia and Kaggle DataFrames to create the movies_df:

-Unnecessary columns are dropped.

-A function is used to fill in the missing Kaggle data.

-The movies_df DataFrame is filtered to keep specific columns.

-The movies_df DataFrame columns are renamed.

  • The extraction and transformation of the MovieLens ratings data using the ETL function does the following:

-The ratings counts are cleaned.

-The movies_df DataFrame is merged with the cleaned ratings DataFrame to create the movies_with_ratings_df DataFrame.

-The empty values in the movies_with_ratings_df DataFrame are filled with “0”.

-The movies_with_ratings_df and the movies_df DataFrames are displayed in the ETL_clean_kaggle_data.ipynb file.

Deliverable 4:

-The data from the movies_df DataFrame replaces the current data in the movies table in the SQL database, as determined by the movies_query.png.

-The data from the MovieLens rating CSV file is added to the ratings table in the SQL database, as determined by the ratings_query.png.

-The elapsed time to add the data to the database is displayed in the ETL_create_database.ipynb file.

Summary:

movies_query

ratings_query

movies-etl's People

Contributors

cgfs93 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.