Giter Site home page Giter Site logo

dsc-dealing-missing-data-lab-houston-ds-042219's Introduction

Dealing with Missing Data - Lab

Introduction

In this lab, we'll work through strategies for data cleaning and dealing with missing values (NaNs).

Objectives

In this lab you will:

  • Identify missing values in a dataframe using built-in methods
  • Explain why missing values are a problem in data science

Dataset

In this lab, we'll continue working with the Titanic Survivors dataset, which can be found in 'titanic.csv'.

Before we can get going, we'll need to import the usual libraries. In the cell below, import:

  • pandas as pd
  • numpy as np
  • matplotlib.pyplot as plt
  • set %matplotlib inline
# Import necessary libraries below

Now, let's get started by reading in the data from the 'titanic.csv' file and storing it the DataFrame df. Subsequently, be sure to preview the data.

# Use pandas to load the csv file
df = None

Find missing values in a DataFrame

Before we can deal with missing values, we first need to find them. There are several easy ways to detect them. We will start by answering very general questions, such as "does this DataFrame contain any null values?", and then narrowing our focus each time the answer to a question is "yes".

We'll start by checking to see if the DataFrame contains any missing values (NaNs) at all.

Hint: If you do this correctly, it will require method chaining, and will return a boolean value for each column.

# Your code here

Now we know which columns contain missing values, but not how many.

In the cell below, chain a different method with isna() to check how many total missing values are in each column.

Expected Output:

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
# Your code here

Now that we know how many missing values exist in each column, we can make some decisions about how to deal with them.

We'll deal with each column individually, and employ a different strategy for each.

Dropping the column

The first column we'll deal with is the Cabin column. We'll begin by examining this column more closely.

In the cell below:

  • Determine what percentage of rows in this column contain missing values
  • Print out the number of unique values in this column
# Your code here

With this many missing values, it's probably best for us to just drop this column completely.

In the cell below:

  • Drop the Cabin column in place from the df DataFrame
  • Then, check the remaining number of null values in the dataset by using the code you wrote previously
# Your code here

Computing placeholder values

Recall that another common strategy for dealing with missing values is to replace them with the mean or median for that column. We'll begin by investigating the current version of the 'Age' column.

In the cell below:

  • Plot a histogram of values in the 'Age' column with 80 bins (1 for each year)
  • Print out the mean and median for the column
# Your code here

From the visualization above, we can see the data has a slightly positive skew.

In the cell below, replace all missing values in the 'Age' column with the median of the column. Do not hard code this value -- use the methods from pandas or numpy to make this easier! Do this replacement in place on the DataFrame.

# Your code here

Now that we've replaced the values in the 'Age' column, let's confirm that they've been replaced.

In the cell below, check how many null values remain in the dataset.

# Your code here

Great! Now we need to deal with the two pesky missing values in the 'Embarked' column.

Dropping rows that contain missing values

Perhaps the most common solution to dealing with missing values is to simply drop any rows that contain them. Of course, this is only a good idea if the number dropped does not constitute a significant portion of our dataset. Often, you'll need to make the overall determination to see if dropping the values is an acceptable loss, or if it is a better idea to just drop an offending column (e.g. the 'Cabin' column) or to impute placeholder values instead.

In the cell below, use the appropriate built-in DataFrame method to drop the rows containing missing values. Do this in place on the DataFrame.

# Your code here

Great! We've dealt with all the obvious missing values, but we should also take some time to make sure that there aren't symbols or numbers included that are meant to denote a missing value.

Missing values with placeholders

A common thing to see when working with datasets is missing values denoted with a preassigned code or symbol. Let's check to ensure that each categorical column contains only what we expect.

In the cell below, return the unique values in the 'Embarked', 'Sex', 'Pclass', and 'Survived' columns to ensure that there are no values in there that we don't understand or can't account for.

# Your code here

It looks like the 'Pclass' column contains some missing values denoted by a placeholder!

In the cell below, investigate how many placeholder values this column contains. Then, deal with these missing values using whichever strategy you believe is most appropriate in this case.

# Your code here
# Your code here

Question: What is the benefit of treating missing values as a separate valid category? What is the benefit of removing or replacing them? What are the drawbacks of each? Finally, which strategy did you choose? Explain your choice below.

Write your answer below this line:


Now, let's do a final check to ensure that there are no more missing values remaining in this dataset.

In the cell below, reuse the code you wrote at the beginning of the notebook to check how many null values our dataset now contains.

# Your code here

Great! Those all seem in line with our expectations. We can confidently say that this dataset contains no pesky missing values that will mess up our analysis later on!

Summary

In this lab, we learned:

  • How to detect missing values in our dataset
  • How to deal with missing values by dropping rows
  • How to deal with missing values by imputing mean/median values
  • Strategies for detecting missing values encoded with a placeholder

dsc-dealing-missing-data-lab-houston-ds-042219's People

Contributors

loredirick avatar mathymitchell avatar mike-kane avatar peterbell avatar sumedh10 avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.