Scrubbing and Cleaning Data - Lab

Introduction

In the previous labs, you joined the data from our separate files into a single DataFrame. In this lab, you'll scrub the data to get it ready for exploration and modeling!

Objectives

You will be able to:

  • Perform the full data cleaning process for a dataset
  • Identify and deal with null values appropriately
  • Remove unnecessary columns

Getting Started

You'll find the resulting dataset from your work in the Obtaining Data Lab stored within the file 'Lego_data_merged.csv'.

In the cells below:

  • Import pandas and set the standard alias.
  • Import numpy and set the standard alias.
  • Import matplotlib.pyplot and set the standard alias.
  • Import seaborn and set the alias sns (this is the standard alias for seaborn).
  • Use the %matplotlib inline IPython magic command so that all matplotlib visualizations display inline in the notebook.
  • Load the dataset stored in the 'Lego_data_merged.csv' file into a DataFrame, df.
  • Inspect the head of the DataFrame to ensure everything loaded correctly.
# Import statements go here
# Now, load in the dataset and inspect the head to make sure everything loaded correctly
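
A minimal sketch of these steps might look like the following (assuming 'Lego_data_merged.csv' sits in the same directory as the notebook):

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Load the merged dataset and preview the first few rows
df = pd.read_csv('Lego_data_merged.csv')
df.head()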

Starting our Data Cleaning

To start, you'll deal with the most obvious issue: features stored with the wrong data type.

Checking Data Types

In the cell below, use the appropriate method to check the data type of each column.

# Your code here
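
One way to do this is with the .info() method, which lists each column's dtype alongside its non-null count:

# Display each column's dtype and non-null count
df.info()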

Now, investigate some of the unique values inside the list_price column.

# Your code here
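
For example, limiting the output to the first handful of unique values:

# Peek at a few of the raw values stored in list_price
df['list_price'].unique()[:5]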

Numerical Data Stored as Strings

A common issue to check for at this stage is numeric columns that have accidentally been encoded as strings. For example, you should notice that the list_price column above is currently formatted as a string and contains a leading '$'. Remove this character and convert the remaining number to a float so that you can model this value later. After all, your primary task is to generate a model to predict price.

Note: While the data spans a multitude of countries, assume for now that all prices have been standardized to USD.

# Your code here
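
One way to handle this, assuming every list_price entry carries a leading '$':

# Strip the leading '$' and cast the remainder to float
df['list_price'] = df['list_price'].str.lstrip('$').astype(float)
df['list_price'].dtype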

Detecting and Dealing With Null Values

Next, it's time to check for null values. How you deal with them will depend on which columns contain them and how many null values exist in each.

In the cell below, get a count of how many null values exist in each column in the DataFrame.

# Your code here
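
For example:

# Count the null values in each column
df.isna().sum()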

Now, get some descriptive statistics for each of the columns. You want to see where the minimum and maximum values lie.

# Your code here
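
A one-liner suffices here:

# Summary statistics (count, mean, std, min, quartiles, max) for the numeric columns
df.describe()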

Now that you have a better understanding of each of these features, you can make an informed decision about the best strategy for dealing with the various null values.

Some common strategies for filling null values include:

  • Using the mean of the feature
  • Using the median of the feature
  • Inserting a random value from a normal distribution with the mean and std of the feature
  • Binning

Given that most of the features with null values concern user reviews of the LEGO set, it is reasonable to wonder whether there is a strong correlation between these features in the first place. Before proceeding, take a minute to investigate this hypothesis.

# Investigate whether multicollinearity exists between the review features 
# (num_reviews, play_star_rating, star_rating, val_star_rating)
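
One quick way to check, sketched below, is to compute the pairwise correlations of the four review features and visualize them as a seaborn heatmap:

review_cols = ['num_reviews', 'play_star_rating', 'star_rating', 'val_star_rating']
# Pairwise Pearson correlations between the review features
sns.heatmap(df[review_cols].corr(), annot=True, cmap='coolwarm')
plt.show()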

Note that there is substantial correlation between play_star_rating, star_rating, and val_star_rating. While this could lead to multicollinearity in your eventual regression model, it is too early to determine this clearly at this point. Remember that multicollinearity is a relationship among three or more variables, while correlation simply describes the relationship between two variables.

Additionally, these relationships suggest an alternative method for imputing missing values: since the features appear to be correlated, you could use one to help impute missing values in the others. For example, if a row is missing star_rating but has val_star_rating, it seems reasonable to use val_star_rating as an estimate for the missing star_rating value, since the two are highly correlated. That said, doing so comes with risks: you would be further increasing the correlation between these features, which could aggravate multicollinearity in the final model.

Investigate whether you could use one of the other star rating features when one is missing. How many rows have at least one of play_star_rating, star_rating, and val_star_rating missing, but not all three?

# Your code here
# Number missing all three: 1421
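
A possible approach is to count, per row, how many of the three ratings are null:

ratings = ['play_star_rating', 'star_rating', 'val_star_rating']
# Per-row count of how many of the three ratings are null
n_missing = df[ratings].isna().sum(axis=1)
print('Missing some but not all:', ((n_missing > 0) & (n_missing < 3)).sum())
print('Missing all three:', (n_missing == 3).sum())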

Well, it seems that when one rating is missing, the other two tend to be missing as well. While this has been a bit of an extended investigation, go ahead and simply fill each feature's missing values with that feature's median. Fill the missing values of the review_difficulty feature with the string 'unknown'.

# Your code here
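
A sketch of one approach, filling every numeric column's nulls with that column's median:

# Fill nulls in each numeric column with that column's median
for col in df.select_dtypes(include=np.number).columns:
    df[col] = df[col].fillna(df[col].median())

# Fill the remaining categorical nulls with a placeholder string
df['review_difficulty'] = df['review_difficulty'].fillna('unknown')
df.isna().sum()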

Normalizing the Data

Now, you'll need to bring all of the numeric columns onto the same scale by normalizing the dataset. Recall that you normalize a dataset by converting each numeric value to its corresponding z-score for the column, which is obtained by subtracting the column's mean and then dividing by the column's standard deviation.

In the cell below:

  • Normalize the numeric X features by subtracting the column mean and dividing by the column standard deviation. (Don't bother to normalize the list_price as this is the feature you will be predicting.)
# Your code here
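
A minimal sketch, assuming every remaining numeric column other than list_price should be scaled:

# Z-score each numeric predictor: subtract the column mean, divide by the column std
num_cols = df.select_dtypes(include=np.number).columns.drop('list_price')
df[num_cols] = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()
df[num_cols].describe()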

Saving Your Results

While you'll once again practice one-hot encoding, as you would when preprocessing data before fitting a model, saving such a representation of the data would eat up additional disk space. After all, a categorical variable with 10 bins is transformed into 10 separate features when passed through pd.get_dummies(). As such, while the practice is worthwhile, save your DataFrame as-is for now.

# Your code here
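
For example (the output filename here is just a suggestion):

# Write the cleaned data to disk without the index column
df.to_csv('Lego_data_cleaned.csv', index=False)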

One-Hot Encoding Categorical Columns

As a final step, you'll need to deal with the categorical columns by one-hot encoding them into binary variables via the pd.get_dummies() function.

When doing this, you may also need to subset to the appropriate features to avoid encoding the wrong data. By default, the get_dummies() function converts all columns with object or category dtype. However, you should always check the result of calling get_dummies() to ensure that only the categorical variables have been transformed. Consult the documentation for more details. If you are ever unsure of the data types, call the .info() method.

In the cell below, subset to the appropriate predictive features and then use pd.get_dummies() to one-hot encode the dataset properly.

# Your code here
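
One possible approach is sketched below; the identifier and free-text column names dropped here are assumptions about this dataset, so adjust them to match yours (errors='ignore' makes the drop a no-op for any that don't exist):

# Drop identifier and free-text columns that shouldn't be one-hot encoded
# (these column names are assumptions about the dataset)
features = df.drop(columns=['prod_id', 'prod_desc', 'prod_long_desc', 'set_name'],
                   errors='ignore')
# One-hot encode the remaining object/category columns
features = pd.get_dummies(features)
features.head()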

That's it! You've successfully scrubbed your dataset -- you're now ready for data exploration and modeling!

Summary

In this lesson, you gained practice with scrubbing and cleaning data. Specifically, you addressed an incorrect data type, detected and dealt with null values, checked for multicollinearity, and transformed data. Congrats on performing the full data cleaning process for a dataset!
