In the following codealong, we will combine our new NLP knowledge with our knowledge of pipelines. We will apply this combination of skills to a common task: effectively separate spam from ham in a set of messages.

# Import not necessary for students
import sys
sys.path.append('../..')

from new_caller.random_student_engager.student_caller import CohortCaller
from new_caller.random_student_engager.student_list import avocoder_toasters

caller = CohortCaller(avocoder_toasters)

hello

The dataset comes from the UCI Machine Learning Repository.

# Run cell with no changes to import Ham vs. Spam SMS dataset
import pandas as pd

with open('data/SMSSpamCollection') as read_file:
    texts = read_file.readlines()
    
text = [text.split('\t')[1] for text in texts]
label = [text.split('\t')[0] for text in texts]
df = pd.DataFrame(text, columns=['text'])
df['label'] = label
df['label'] = df['label']

df.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	text	label
0	Go until jurong point, crazy.. Available only ...	ham
1	Ok lar... Joking wif u oni...\n	ham
2	Free entry in 2 a wkly comp to win FA Cup fina...	spam
3	U dun say so early hor... U c already then say...	ham
4	Nah I don't think he goes to usf, he lives aro...	ham

As the head method shows, our data is labeled either ham or spam.

Check the distribution of the target in the cell below.

# Use pandas to find the distribution of Spam to Ham in the dataset

caller.call_n_students(1)

array(['Seth'], dtype='<U7')

Certain metrics require that our target be in the form of 0's and 1's. Use the LabelEncoder method to transform the target.

# f1 metric requires 0,1 labels
# Which should be 0 and which should be 1
from sklearn.preprocessing import LabelEncoder

caller.call_n_students(1)

array(['Rashid'], dtype='<U7')

Target Distribution and Train-Test Split

The model building workflow is similar to what we have performed in Phase 3.

To begin, train-test split the data set. Preserve the class balance in the test set.

# train-test split the dataset while preserving the class balance show above
# Pass random_state=42 as an argument as well
from sklearn.model_selection import train_test_split

caller.call_n_students(1)

array(['Meaghan'], dtype='<U7')

Count Vectorizor and TFIDF Vectorizer

In a small group, take 10 minutes to move through one model building iteration. What can that look like? Through some steps you decide on as a group, fit a vectorizer and a model on a training set(s), transform the "test" set, and score on it.

Two points to take into careful consideration:

1. What metric is appropriate in this case? Or, to put it another way, is one error more costly when creating a spam detector?
2. Will you use cross-validation/pipelines?
3. What vectorizer and model will you use?

Whatever you decide, start with a simple document-term matrix. Start with a max_features of 50. Go ahead and feed arguments to the vectorizer to take out stopwords. Use default params for the rest.

After you are finished, generate a confusion matrix of your "test" predictions. If you are using cross_validate, use cross_validate_predict along with sklearn's confusion_matrix to create it.

# your code here

Iterate

For the next 15 minutes, improve you model.

Discuss with your group steps you can take to improve your "test" score.

What you should consider:

1. What hyperparameters can you tune on your vectorizer?
2. How should you tune those hyperparameters? 
3. What other preprocessing steps, transformers, and estimators should you try?
4. Once you achieve a satisfying score, can you simplify the term matrix and achieve similar performance?

# Your code here

learn-co-students / nlp_code_along Goto Github PK

nlp_code_along's Introduction

Target Distribution and Train-Test Split

Count Vectorizor and TFIDF Vectorizer

Iterate

nlp_code_along's People

Contributors

Watchers

Forkers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent