Scrubbing and Cleaning Data - Lab

Introduction

In the previous labs, you joined the data from our separate files into a single DataFrame. In this lab, you'll scrub the data to get it ready for exploration and modeling!

Objectives

You will be able to:

  • Perform the full data cleaning process for a dataset
  • Identify and deal with null values appropriately
  • Remove unnecessary columns

Getting Started

You'll find the resulting dataset from your work in the Obtaining Data Lab stored within the file 'Lego_data_merged.csv'.

In the cells below:

  • Import pandas and set the standard alias.
  • Import numpy and set the standard alias.
  • Import matplotlib.pyplot and set the standard alias.
  • Import seaborn and set the alias sns (this is the standard alias for seaborn).
  • Use the %matplotlib inline IPython magic command so that all matplotlib visualizations display inline in the notebook.
  • Load the dataset stored in the 'Lego_data_merged.csv' file into a DataFrame, df.
  • Inspect the head of the DataFrame to ensure everything loaded correctly.
# Import statements go here
# Now, load in the dataset and inspect the head to make sure everything loaded correctly
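
A minimal sketch of these steps might look like the following (assuming 'Lego_data_merged.csv' sits in the same directory as the notebook):

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Load the merged dataset and preview the first few rows
df = pd.read_csv('Lego_data_merged.csv')
df.head()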

Starting our Data Cleaning

To start, you'll deal with the most obvious issue: features stored with the wrong data type.

Checking Data Types

In the cell below, use the appropriate method to check the data type of each column.

# Your code here
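
One way to do this is with the .info() method, which lists each column's dtype alongside its non-null count:

# Display each column's dtype and non-null count
df.info()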

Now, investigate some of the unique values inside the list_price column.

# Your code here
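
For example, limiting the output to the first handful of unique values:

# Peek at a few of the raw values stored in list_price
df['list_price'].unique()[:5]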

Numerical Data Stored as Strings

A common issue to check for at this stage is numeric columns that have accidentally been encoded as strings. For example, you should notice that the list_price column above is currently formatted as a string and contains a leading '$'. Remove this character and convert the remaining number to a float so that you can model this value later. After all, your primary task is to generate a model to predict price.

Note: While the data spans a multitude of countries, assume for now that all prices have been standardized to USD.

# Your code here
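
One way to handle this, assuming every list_price entry carries a leading '$':

# Strip the leading '$' and cast the remainder to float
df['list_price'] = df['list_price'].str.lstrip('$').astype(float)
df['list_price'].dtype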

Detecting and Dealing With Null Values

Next, it's time to check for null values. How you deal with them will depend on which columns contain them and how many null values exist in each.

In the cell below, get a count of how many null values exist in each column in the DataFrame.

# Your code here
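
For example:

# Count the null values in each column
df.isna().sum()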

Now, get some descriptive statistics for each of the columns. You want to see where the minimum and maximum values lie.

# Your code here
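
A one-liner suffices here:

# Summary statistics (count, mean, std, min, quartiles, max) for the numeric columns
df.describe()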

Now that you have a better understanding of each of these features, you can make an informed decision about the best strategy for dealing with the various null values.

Some common strategies for filling null values include:

  • Using the mean of the feature
  • Using the median of the feature
  • Inserting a random value from a normal distribution with the mean and std of the feature
  • Binning

Given that most of the features with null values concern user reviews of the LEGO set, it is reasonable to wonder whether there is a strong correlation between these features in the first place. Before proceeding, take a minute to investigate this hypothesis.

# Investigate whether multicollinearity exists between the review features 
# (num_reviews, play_star_rating, star_rating, val_star_rating)
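
One quick way to check, sketched below, is to compute the pairwise correlations of the four review features and visualize them as a seaborn heatmap:

review_cols = ['num_reviews', 'play_star_rating', 'star_rating', 'val_star_rating']
# Pairwise Pearson correlations between the review features
sns.heatmap(df[review_cols].corr(), annot=True, cmap='coolwarm')
plt.show()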

Note that there is substantial correlation between play_star_rating, star_rating, and val_star_rating. While this could lead to multicollinearity in your eventual regression model, it is too early to determine this clearly at this point. Remember that multicollinearity is a relationship among three or more variables, while correlation simply describes the relationship between two variables.

Additionally, these relationships suggest an alternative method for imputing missing values: since the features appear to be correlated, you could use one to help impute missing values in the others. For example, if a row is missing star_rating but has val_star_rating, it seems reasonable to use val_star_rating as an estimate for the missing star_rating value, since the two are highly correlated. That said, doing so comes with risks: you would be further increasing the correlation between these features, which could aggravate multicollinearity in the final model.

Investigate whether you could use one of the other star rating features when one is missing. How many rows have at least one of play_star_rating, star_rating, and val_star_rating missing, but not all three?

# Your code here
# Number missing all three: 1421
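
A possible approach is to count, per row, how many of the three ratings are null:

ratings = ['play_star_rating', 'star_rating', 'val_star_rating']
# Per-row count of how many of the three ratings are null
n_missing = df[ratings].isna().sum(axis=1)
print('Missing some but not all:', ((n_missing > 0) & (n_missing < 3)).sum())
print('Missing all three:', (n_missing == 3).sum())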

Well, it seems that when one rating is missing, the other two tend to be missing as well. While this has been a bit of an extended investigation, go ahead and simply fill each feature's missing values with that feature's median. Fill the missing values of the review_difficulty feature with the string 'unknown'.

# Your code here
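
A sketch of one approach, filling every numeric column's nulls with that column's median:

# Fill nulls in each numeric column with that column's median
for col in df.select_dtypes(include=np.number).columns:
    df[col] = df[col].fillna(df[col].median())

# Fill the remaining categorical nulls with a placeholder string
df['review_difficulty'] = df['review_difficulty'].fillna('unknown')
df.isna().sum()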

Normalizing the Data

Now, you'll need to bring all of the numeric columns onto the same scale by normalizing the dataset. Recall that you normalize a dataset by converting each numeric value to its corresponding z-score for the column, which is obtained by subtracting the column's mean and then dividing by the column's standard deviation.

In the cell below:

  • Normalize the numeric X features by subtracting the column mean and dividing by the column standard deviation. (Don't bother to normalize the list_price as this is the feature you will be predicting.)
# Your code here
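
A minimal sketch, assuming every remaining numeric column other than list_price should be scaled:

# Z-score each numeric predictor: subtract the column mean, divide by the column std
num_cols = df.select_dtypes(include=np.number).columns.drop('list_price')
df[num_cols] = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()
df[num_cols].describe()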

Saving Your Results

While you'll once again practice one-hot encoding, as you would when preprocessing data before fitting a model, saving such a representation of the data would eat up additional disk space. After all, a categorical variable with 10 bins is transformed into 10 separate features when passed through pd.get_dummies(). As such, while the practice is worthwhile, save your DataFrame as-is for now.

# Your code here
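
For example (the output filename here is just a suggestion):

# Write the cleaned data to disk without the index column
df.to_csv('Lego_data_cleaned.csv', index=False)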

One-Hot Encoding Categorical Columns

As a final step, you'll need to deal with the categorical columns by one-hot encoding them into binary variables via the pd.get_dummies() function.

When doing this, you may also need to subset to the appropriate features to avoid encoding the wrong data. By default, the get_dummies() function converts all columns with object or category dtype. However, you should always check the result of calling get_dummies() to ensure that only the categorical variables have been transformed. Consult the documentation for more details. If you are ever unsure of the data types, call the .info() method.

In the cell below, subset to the appropriate predictive features and then use pd.get_dummies() to one-hot encode the dataset properly.

# Your code here
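
One possible approach is sketched below; the identifier and free-text column names dropped here are assumptions about this dataset, so adjust them to match yours (errors='ignore' makes the drop a no-op for any that don't exist):

# Drop identifier and free-text columns that shouldn't be one-hot encoded
# (these column names are assumptions about the dataset)
features = df.drop(columns=['prod_id', 'prod_desc', 'prod_long_desc', 'set_name'],
                   errors='ignore')
# One-hot encode the remaining object/category columns
features = pd.get_dummies(features)
features.head()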

That's it! You've successfully scrubbed your dataset -- you're now ready for data exploration and modeling!

Summary

In this lesson, you gained practice with scrubbing and cleaning data. Specifically, you addressed an incorrect data type, detected and dealt with null values, checked for multicollinearity, and transformed data. Congrats on performing the full data cleaning process for a dataset!
