
nlp_code_along's Introduction

In the following code-along, we will combine our new NLP knowledge with our knowledge of pipelines. We will apply this combination of skills to a common task: effectively separating spam from ham in a set of SMS messages.

# Import not necessary for students
import sys
sys.path.append('../..')

from new_caller.random_student_engager.student_caller import CohortCaller
from new_caller.random_student_engager.student_list import avocoder_toasters

caller = CohortCaller(avocoder_toasters)

The dataset comes from the UCI Machine Learning Repository.

# Run cell with no changes to import Ham vs. Spam SMS dataset
import pandas as pd

with open('data/SMSSpamCollection') as read_file:
    texts = read_file.readlines()
    
# Each line is "<label>\t<text>"; split on the tab to separate label from message
text = [line.split('\t')[1] for line in texts]
label = [line.split('\t')[0] for line in texts]
df = pd.DataFrame(text, columns=['text'])
df['label'] = label
df.head()
   text                                                label
0  Go until jurong point, crazy.. Available only ...  ham
1  Ok lar... Joking wif u oni...\n                    ham
2  Free entry in 2 a wkly comp to win FA Cup fina...  spam
3  U dun say so early hor... U c already then say...  ham
4  Nah I don't think he goes to usf, he lives aro...  ham

As the head method shows, our data is labeled either ham or spam.

Check the distribution of the target in the cell below.

# Use pandas to find the distribution of Spam to Ham in the dataset
caller.call_n_students(1)
array(['Seth'], dtype='<U7')
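
One possible way to check the balance (a sketch; value_counts is just one option):

# Possible solution sketch: raw counts and proportions of each class
df['label'].value_counts()
df['label'].value_counts(normalize=True)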

Certain metrics require that our target be in the form of 0s and 1s. Use sklearn's LabelEncoder to transform the target.

# f1 metric requires 0,1 labels
# Which should be 0 and which should be 1
from sklearn.preprocessing import LabelEncoder
caller.call_n_students(1)
array(['Rashid'], dtype='<U7')
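
A minimal sketch of the encoding step, assuming we overwrite the label column in place. LabelEncoder assigns integers in alphabetical order, so 'ham' maps to 0 and 'spam' (our positive class for f1) maps to 1:

# Possible solution sketch: encode ham/spam as 0/1
le = LabelEncoder()
df['label'] = le.fit_transform(df['label'])
le.classes_  # array(['ham', 'spam'], dtype=object) -> ham=0, spam=1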

Target Distribution and Train-Test Split

The model-building workflow is similar to the one we performed in Phase 3.

To begin, train-test split the data set. Preserve the class balance in the test set.

# train-test split the dataset while preserving the class balance shown above
# Pass random_state=42 as an argument as well
from sklearn.model_selection import train_test_split
caller.call_n_students(1)
array(['Meaghan'], dtype='<U7')
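
One way this split could look (a sketch, assuming the encoded df from above); stratify=y preserves the class balance in both splits:

# Possible solution sketch: stratified train-test split
X = df['text']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y)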

Count Vectorizer and TF-IDF Vectorizer

In a small group, take 10 minutes to move through one model-building iteration. What can that look like? Through steps your group decides on, fit a vectorizer and a model on the training set, transform the "test" set, and score on it.

Three points to take into careful consideration:

1. What metric is appropriate in this case? Or, to put it another way, is one error more costly when creating a spam detector?
2. Will you use cross-validation/pipelines?
3. What vectorizer and model will you use? 

Whatever you decide, start with a simple document-term matrix with max_features set to 50. Pass arguments to the vectorizer to remove stop words, and use default parameters for the rest.

After you are finished, generate a confusion matrix of your "test" predictions. If you are using cross-validation, use cross_val_predict along with sklearn's confusion_matrix to create it.

# your code here
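
One way a first iteration could look (a sketch under one set of choices: CountVectorizer plus Multinomial Naive Bayes in a pipeline, scored with f1 on the spam class; your group may reasonably choose differently):

# Possible solution sketch: 50-feature document-term matrix with English
# stop words removed, fed to a Multinomial Naive Bayes classifier.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix, f1_score

pipe = Pipeline([
    ('cv', CountVectorizer(max_features=50, stop_words='english')),
    ('nb', MultinomialNB())
])

# Cross-validated predictions on the training set stand in for "test" predictions
y_hat = cross_val_predict(pipe, X_train, y_train, cv=5)
print(confusion_matrix(y_train, y_hat))
print(f1_score(y_train, y_hat))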

Iterate

For the next 15 minutes, improve your model.

Discuss with your group steps you can take to improve your "test" score.

What you should consider:

1. What hyperparameters can you tune on your vectorizer?
2. How should you tune those hyperparameters? 
3. What other preprocessing steps, transformers, and estimators should you try?
4. Once you achieve a satisfying score, can you simplify the term matrix and achieve similar performance?
# Your code here
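
One possible direction for this iteration (a sketch, assuming the train split from earlier): swap in a TfidfVectorizer and grid-search a few vectorizer and model hyperparameters through the pipeline.

# Possible iteration sketch: tune hyperparameters with a grid search
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('nb', MultinomialNB())
])

param_grid = {
    'tfidf__max_features': [50, 500, 5000],
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'nb__alpha': [0.1, 1.0]
}

grid = GridSearchCV(pipe, param_grid, scoring='f1', cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)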

