# This is always a good idea
%load_ext autoreload
%autoreload 2
from src.student_caller import one_random_student, three_random_students
from src.student_list import student_first_names
"In a standard normal curve, what z-score is associated with the 97.5th percentile?"
one_random_student(student_first_names)
Reuben
Before returning to our Satire/No Satire example, let's consider an example with a smaller but similar scope.
Suppose we are using an API to gather articles from a news website and grabbing phrases from two different types of articles: music and politics.
We have a problem though! Only some of our articles are labeled with a category (music or politics). Is there a way we can use Machine Learning to help us label our data quickly?
- 'the song was popular'
- 'band leaders disagreed on sound'
- 'played for a sold out arena stadium'
- 'world leaders met lask week'
- 'the election was close'
- 'the officials agreed on a compromise'
Let's try and predict one example phrase:
- "world leaders agreed to fund the stadium"
How can we make a model that labels this for us rather than having to go through by hand?
from collections import defaultdict
import numpy as np
music = ['the song was popular',
'band leaders disagreed on sound',
'played for a sold out arena stadium']
politics = ['world leaders met lask week',
'the election was close',
'the officials agreed on a compromise']
test_statement = 'world leaders agreed to fund the stadium'
Let's revisit Bayes' Theorem. Remember, Bayes looks to calculate the probability of a class (c) given the data (x). To do so, we calculate the likelihood (the distribution of our data within a given class) and the prior probability of each class (the probability of seeing the class in the population). We are going to ignore the denominator of the right side of the equation in this instance because, as we will see, we will be finding the ratio of posterior probabilities, which cancels out the denominator.
$ P(politics | phrase) = \frac{P(phrase | politics) \times P(politics)}{P(phrase)} $
where phrase is our test statement.
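To see why the denominator can be ignored, here is a short derivation in the same notation (a sketch, not a formula from the original notebook). Writing out the ratio of the two posteriors, $P(phrase)$ appears in both and cancels:

```latex
\frac{P(politics \mid phrase)}{P(music \mid phrase)}
= \frac{P(phrase \mid politics)\,P(politics) \,/\, P(phrase)}
       {P(phrase \mid music)\,P(music) \,/\, P(phrase)}
= \frac{P(phrase \mid politics)\,P(politics)}
       {P(phrase \mid music)\,P(music)}
```

If this ratio is greater than 1, we label the phrase politics; otherwise we label it music.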
The prior is essentially the distribution of the probability of either type of article. We have three of each type of article; therefore, we assume that there is an equal probability of either article.
p_politics = len(politics)/(len(politics) + len(music))
p_music = len(music)/(len(politics) + len(music))
one_random_student(student_first_names)
Andrew
The likelihood of the phrase given a class label is the joint probability of the individual words given that class, or in other words the product of their individual probabilities of appearing in the class.
We need to make a naive assumption. Naive in this context means that we assume that the probability of each word appearing is independent of the other words in the phrase. In reality, for example, the probability of the word 'rock' would increase if we found the word 'classic' in the text. Naive Bayes does not take this conditional probability into account.
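Under the naive assumption, the likelihood factorizes word by word. Written out for our test statement and the politics class (an illustrative expansion, using the same notation as the rest of the notebook):

```latex
P(phrase \mid politics)
= P(world \mid politics) \times P(leaders \mid politics) \times P(agreed \mid politics)
  \times P(to \mid politics) \times P(fund \mid politics)
  \times P(the \mid politics) \times P(stadium \mid politics)
```

The same expansion applies to the music class, with each factor conditioned on music instead.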
one_random_student(student_first_names)
Sam
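If a word in the test phrase never appears in one of the classes, its probability in that class is zero, which would zero out the entire product. To avoid this, we add a small constant to every word count. This correction process is called Laplace smoothing:
$ P(word_{i} | class) = \frac{count(word_{i}, class) + \alpha}{N_{class} + \alpha d} $
- $count(word_{i}, class)$ : number of times the word appears in the class
- $N_{class}$ : word count of the class
- d : number of features (in this instance total number of vocabulary words)
- $\alpha$ can be any number greater than 0 (it is usually 1)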
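As a minimal sketch of the smoothing step, the helper below (`smoothed_word_probability` is a hypothetical name, not part of the notebook) computes a smoothed word probability. The counts of 16 and 25 are illustrative values chosen for this example.

```python
def smoothed_word_probability(word_count, class_word_count, vocab_size, alpha=1.0):
    """Laplace-smoothed estimate of P(word | class)."""
    return (word_count + alpha) / (class_word_count + alpha * vocab_size)


# Illustrative numbers (assumed for this sketch): a class with a word count
# of 16 and a shared vocabulary of 25 words.
p_unseen = smoothed_word_probability(0, 16, 25)  # word never seen in the class
p_seen = smoothed_word_probability(1, 16, 25)    # word seen once in the class

print(p_unseen, p_seen)
```

Even the unseen word gets a small non-zero probability, so a single missing word can no longer zero out the whole product.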
def vocab_maker(category):
    """
    parameters: category is a list containing all the articles
    of a given category.
    returns the vocabulary for a given type of article
    """
    vocab_category = set()  # will filter down to only unique words
    for art in category:
        words = art.split()
        for word in words:
            vocab_category.add(word)
    return vocab_category
voc_music = vocab_maker(music)
voc_pol = vocab_maker(politics)
# These are all the unique words in the music category
voc_music
{'a',
'arena',
'band',
'disagreed',
'for',
'leaders',
'on',
'out',
'played',
'popular',
'sold',
'song',
'sound',
'stadium',
'the',
'was'}
# These are all the unique words in the politics category
voc_pol
{'a',
'agreed',
'close',
'compromise',
'election',
'lask',
'leaders',
'met',
'officials',
'on',
'the',
'was',
'week',
'world'}
# The union of the two sets gives us the unique words across both article groups
voc_all = voc_music.union(voc_pol)
voc_all
{'a',
'agreed',
'arena',
'band',
'close',
'compromise',
'disagreed',
'election',
'for',
'lask',
'leaders',
'met',
'officials',
'on',
'out',
'played',
'popular',
'sold',
'song',
'sound',
'stadium',
'the',
'was',
'week',
'world'}
total_vocab_count = len(voc_all)
total_music_count = len(voc_music)
total_politics_count = len(voc_pol)
Let's remind ourselves of the goal: to find the posterior probability of the class politics given our phrase.
P(politics | world leaders agreed to fund the stadium)
music
['the song was popular',
'band leaders disagreed on sound',
'played for a sold out arena stadium']
def find_number_words_in_category(phrase, category):
    statement = phrase.split()
    # category is a list of the raw documents of each category
    str_category = ' '.join(category)
    cat_word_list = str_category.split()
    word_count = defaultdict(int)
    # loop through each word in the phrase
    for word in statement:
        # loop through each word in the category
        for art_word in cat_word_list:
            if word == art_word:
                # count the number of times the phrase word occurs in the category
                word_count[word] += 1
            else:
                # touching the key registers unseen words with a count of 0
                word_count[word]
    return word_count
test_music_word_count = find_number_words_in_category(test_statement,music)
test_music_word_count
defaultdict(int,
{'world': 0,
'leaders': 1,
'agreed': 0,
'to': 0,
'fund': 0,
'the': 1,
'stadium': 1})
test_politic_word_count = find_number_words_in_category(test_statement,politics)
test_politic_word_count
defaultdict(int,
{'world': 1,
'leaders': 1,
'agreed': 1,
'to': 0,
'fund': 0,
'the': 2,
'stadium': 0})
def find_likelihood(category_count, test_category_count, alpha):
    # The numerator will be the product of all the counts
    # with the smoothing factor (alpha) to make sure the probability is not zeroed out
    # (np.prod replaces np.product, which was removed in NumPy 2.0)
    num = np.prod(np.array(list(test_category_count.values())) + alpha)
    # The denominator will be the same for each word (total category count + total vocab * alpha)
    # so we raise it to the power of the length of the test category
    denom = (category_count + total_vocab_count*alpha)**(len(test_category_count))
    return num/denom
likelihood_m = find_likelihood(total_music_count,test_music_word_count,1)
likelihood_p = find_likelihood(total_politics_count,test_politic_word_count,1)
print(likelihood_m)
print(likelihood_p)
4.107740405680756e-11
1.748875897714495e-10
$ P(politics | article) \propto P(politics) \times \prod_{i=1}^{d} P(word_{i} | politics) $
p_politics = .5
p_music = .5
# p(politics|article) > p(music|article)
likelihood_p * p_politics > likelihood_m * p_music
True
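The comparison above only needs the unnormalized scores. As a sketch, dividing each score by their sum (which plays the role of the denominator we ignored) recovers actual posterior probabilities. The likelihood values are copied from the printed output above; the variable names are introduced here for illustration.

```python
# Likelihoods copied from the printed output above; priors are 0.5 each.
likelihood_music = 4.107740405680756e-11
likelihood_politics = 1.748875897714495e-10
prior_music = 0.5
prior_politics = 0.5

# Unnormalized posterior scores (likelihood times prior)
score_music = likelihood_music * prior_music
score_politics = likelihood_politics * prior_politics

# With only two classes, the sum of the scores stands in for P(phrase)
evidence = score_music + score_politics

posterior_music = score_music / evidence
posterior_politics = score_politics / evidence

print(posterior_politics, posterior_music)
```

The two posteriors sum to 1, and the class with the larger unnormalized score is still the class with the larger posterior.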
Many times, the probabilities we end up with are exceedingly small, so we can transform them using logs. This avoids numerical underflow and turns the product of probabilities into a sum of log probabilities.
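A minimal sketch of the log transform, reusing the likelihoods printed earlier. Because the log is monotonic, comparing sums of logs gives the same answer as comparing products. The second half uses made-up numbers (100 words, each with probability 1e-5) to show a product underflowing to zero while the log sum stays finite.

```python
import math

# Likelihoods copied from the printed output earlier; equal priors of 0.5.
likelihood_music = 4.107740405680756e-11
likelihood_politics = 1.748875897714495e-10
prior = 0.5

# log(prior * likelihood) = log(prior) + log(likelihood)
log_score_music = math.log(prior) + math.log(likelihood_music)
log_score_politics = math.log(prior) + math.log(likelihood_politics)
print(log_score_politics > log_score_music)

# Illustrative underflow: multiply 100 made-up word probabilities of 1e-5
tiny = 1e-5
n_words = 100
product = 1.0
for _ in range(n_words):
    product *= tiny  # eventually collapses to 0.0 in floating point

log_sum = n_words * math.log(tiny)  # the same quantity in log space
print(product, log_sum)
```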
Good Resource: https://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html
import pandas as pd
import numpy as np
corpus = pd.read_csv('data/satire_nosatire.csv')
corpus.head()
| | body | target |
---|---|---|
0 | Noting that the resignation of James Mattis as... | 1 |
1 | Desperate to unwind after months of nonstop wo... | 1 |
2 | Nearly halfway through his presidential term, ... | 1 |
3 | Attempting to make amends for gross abuses of ... | 1 |
4 | Decrying the Senate’s resolution blaming the c... | 1 |
Like always, we will perform a train test split...
X=corpus.body
y=corpus.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42, test_size=.25)
...and preprocess the training set like we learned.
import nltk
from nltk.tokenize import regexp_tokenize, word_tokenize, RegexpTokenizer
from nltk.corpus import stopwords
# Load the English stopwords and add a few corpus-specific ones
sw = stopwords.words('english')
sw.extend(['would', 'one', 'say'])
from nltk.corpus import wordnet
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer
def get_wordnet_pos(treebank_tag):
    '''
    Translate nltk POS to wordnet tags
    '''
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
def doc_preparer(doc, stop_words=sw):
    '''
    :param doc: a document from the satire corpus
    :return: a document string with words which have been
            lemmatized,
            parsed for stopwords,
            made lowercase,
            and stripped of punctuation and numbers.
    '''
    regex_token = RegexpTokenizer(r"([a-zA-Z]+(?:’[a-z]+)?)")
    doc = regex_token.tokenize(doc)
    doc = [word.lower() for word in doc]
    doc = [word for word in doc if word not in stop_words]
    doc = pos_tag(doc)
    doc = [(word[0], get_wordnet_pos(word[1])) for word in doc]
    lemmatizer = WordNetLemmatizer()
    doc = [lemmatizer.lemmatize(word[0], word[1]) for word in doc]
    return ' '.join(doc)
token_docs = [doc_preparer(doc, sw) for doc in X_train]
from sklearn.feature_extraction.text import CountVectorizer
For demonstration purposes, we will limit our count vectorizer to 5 words (the top 5 words by frequency).
# Secondary train-test split to build our best model
X_t, X_val, y_t, y_val = train_test_split(token_docs, y_train, test_size=.25, random_state=42)
cv = CountVectorizer(max_features=5)
# Just like with our scaler, we fit our Count Vectorizer on the training set
X_t_vec = cv.fit_transform(X_t)
X_t_vec = pd.DataFrame.sparse.from_spmatrix(X_t_vec)
X_t_vec.columns = sorted(cv.vocabulary_)
X_t_vec.set_index(y_t.index, inplace=True)
X_t_vec
| | people | say | state | trump | year |
---|---|---|---|---|---|
159 | 3 | 0 | 0 | 0 | 0 |
246 | 1 | 0 | 0 | 7 | 1 |
640 | 0 | 4 | 1 | 0 | 4 |
809 | 2 | 10 | 2 | 0 | 7 |
130 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... |
148 | 1 | 0 | 0 | 0 | 1 |
300 | 0 | 1 | 0 | 0 | 0 |
356 | 1 | 3 | 0 | 0 | 0 |
36 | 1 | 4 | 0 | 0 | 3 |
895 | 1 | 7 | 0 | 0 | 6 |
562 rows × 5 columns
The word say shows up in our count vectorizer, even though we added it to our stopwords. What is going on?
# We then transform the validation set. Do not refit the vectorizer
X_val_vec = cv.transform(X_val)
X_val_vec = pd.DataFrame.sparse.from_spmatrix(X_val_vec)
X_val_vec.columns = sorted(cv.vocabulary_)
X_val_vec.set_index(y_val.index, inplace=True)
Now let's fit the Multinomial Naive Bayes Classifier on our training data
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
mnb.fit(X_t_vec, y_t)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
#What should our priors for each class be?
one_random_student(student_first_names)
Karim
mnb.class_log_prior_
array([-0.72203005, -0.66507516])
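With `fit_prior=True` and `class_prior=None` (as shown in the model repr above), the priors are learned from the class proportions in the training data. As a small sketch, exponentiating the log priors printed above turns them back into probabilities; the values are copied from that output.

```python
import math

# Log priors copied from the mnb.class_log_prior_ output above
class_log_prior = [-0.72203005, -0.66507516]

# exp() undoes the log and recovers the class proportions
class_prior = [math.exp(value) for value in class_log_prior]
print(class_prior)
```

The two recovered priors sum to 1 (up to rounding), with class 1 slightly more common than class 0 in the training split.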
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix
y_hat = mnb.predict(X_val_vec)
accuracy_score(y_val, y_hat)
0.8297872340425532
Let's consider the scenario that we would like to isolate satirical news on Facebook so we can flag it. We do not want to flag real news by mistake. In other words, we want to minimize false positives, so precision is the metric to watch.
confusion_matrix(y_val, y_hat)
array([[83, 16],
[16, 73]])
precision_score(y_val, y_hat)
0.8202247191011236
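As a quick check, precision and accuracy can be recomputed by hand from the confusion matrix printed above. This sketch assumes scikit-learn's layout (rows are true labels, columns are predicted labels) and treats label 1 as the positive class, which is the default for `precision_score`.

```python
# Confusion matrix copied from the output above:
# rows = true class (0, 1), columns = predicted class (0, 1)
cm = [[83, 16],
      [16, 73]]

tn, fp = cm[0]  # true negatives, false positives
fn, tp = cm[1]  # false negatives, true positives

# Precision: of everything flagged as positive, how much really was positive
precision = tp / (tp + fp)

# Accuracy: share of all predictions that were correct
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(precision, accuracy)
```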
That's pretty good for a five word vocabulary.
Let's see what happens when we don't restrict our vocabulary
cv = CountVectorizer()
X_t_vec = cv.fit_transform(X_t)
X_t_vec = pd.DataFrame.sparse.from_spmatrix(X_t_vec)
X_t_vec.columns = sorted(cv.vocabulary_)
X_t_vec.set_index(y_t.index, inplace=True)
X_val_vec = cv.transform(X_val)
X_val_vec = pd.DataFrame.sparse.from_spmatrix(X_val_vec)
X_val_vec.columns = sorted(cv.vocabulary_)
X_val_vec.set_index(y_val.index, inplace=True)
mnb = MultinomialNB()
mnb.fit(X_t_vec, y_t)
y_hat = mnb.predict(X_val_vec)
confusion_matrix(y_val, y_hat)
array([[96, 3],
[ 4, 85]])
Wow! Look how well that performed.
precision_score(y_val, y_hat)
0.9659090909090909
len(cv.vocabulary_)
14819
Let's see whether or not we can maintain that level of precision with fewer words.
cv = CountVectorizer(min_df=.05, max_df=.95)
X_t_vec = cv.fit_transform(X_t)
X_t_vec = pd.DataFrame.sparse.from_spmatrix(X_t_vec)
X_t_vec.columns = sorted(cv.vocabulary_)
X_t_vec.set_index(y_t.index, inplace=True)
X_val_vec = cv.transform(X_val)
X_val_vec = pd.DataFrame.sparse.from_spmatrix(X_val_vec)
X_val_vec.columns = sorted(cv.vocabulary_)
X_val_vec.set_index(y_val.index, inplace=True)
mnb = MultinomialNB()
mnb.fit(X_t_vec, y_t)
y_hat = mnb.predict(X_val_vec)
precision_score(y_val, y_hat)
0.9431818181818182
len(cv.vocabulary_)
650
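To make the `min_df` and `max_df` arguments concrete: when given as floats, they are proportions of documents, and words whose document frequency falls outside that range are dropped. The sketch below mimics that filter in plain Python. The function `filter_by_document_frequency` and the toy documents are hypothetical, and the whitespace tokenization is a simplification of what `CountVectorizer` actually does.

```python
def filter_by_document_frequency(docs, min_df=0.05, max_df=0.95):
    """Keep words whose share of documents lies between min_df and max_df."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        # set() so that a word is counted at most once per document
        for word in set(doc.split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return sorted(
        word for word, df in doc_freq.items()
        if min_df <= df / n_docs <= max_df
    )


# Toy corpus (made up for this sketch)
toy_docs = [
    'the election was close',
    'the song was popular',
    'the band played',
    'the officials agreed',
]

# 'the' appears in every document (too common), most words in only one (too rare)
kept = filter_by_document_frequency(toy_docs, min_df=0.5, max_df=0.95)
print(kept)
```

Only words that are neither too rare nor too common survive, which is how the vocabulary above shrinks from 14819 words to 650.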
# Now let's see what happens with TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
X_t_vec = tfidf.fit_transform(X_t)
X_t_vec = pd.DataFrame.sparse.from_spmatrix(X_t_vec)
X_t_vec.columns = sorted(tfidf.vocabulary_)
X_t_vec.set_index(y_t.index, inplace=True)
X_val_vec = tfidf.transform(X_val)
X_val_vec = pd.DataFrame.sparse.from_spmatrix(X_val_vec)
X_val_vec.columns = sorted(tfidf.vocabulary_)
X_val_vec.set_index(y_val.index, inplace=True)
mnb = MultinomialNB()
mnb.fit(X_t_vec, y_t)
y_hat = mnb.predict(X_val_vec)
precision_score(y_val, y_hat)
0.9444444444444444
TF-IDF does not necessarily perform better than the count vectorizer. It is just another tool in our toolbelt which we can try out to compare performance.
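For intuition about what TF-IDF changes, here is a minimal sketch of the weighting for a single document, assuming scikit-learn's documented defaults (`smooth_idf=True`, `norm='l2'`). The function `tfidf_vector` and the toy counts are hypothetical and only illustrate the formula.

```python
import math


def tfidf_vector(doc_counts, doc_freq, n_docs):
    """TF-IDF weights for one document, L2-normalized."""
    weights = {}
    for word, tf in doc_counts.items():
        # Smoothed inverse document frequency: rarer words get a larger idf
        idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1
        weights[word] = tf * idf
    # Scale the vector to unit length
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {word: w / norm for word, w in weights.items()}


# Toy setup (made up): 4 documents; 'the' is in all 4, 'election' in only 1.
# Both words appear once in the document being weighted.
weights = tfidf_vector(
    doc_counts={'the': 1, 'election': 1},
    doc_freq={'the': 4, 'election': 1},
    n_docs=4,
)
print(weights)
```

With equal raw counts, the rarer word 'election' ends up with the larger weight, whereas a count vectorizer would treat both words the same.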
len(tfidf.vocabulary_)
14819
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(min_df=.05, max_df=.95)
X_t_vec = tfidf.fit_transform(X_t)
X_t_vec = pd.DataFrame.sparse.from_spmatrix(X_t_vec)
X_t_vec.columns = sorted(tfidf.vocabulary_)
X_t_vec.set_index(y_t.index, inplace=True)
X_val_vec = tfidf.transform(X_val)
X_val_vec = pd.DataFrame.sparse.from_spmatrix(X_val_vec)
X_val_vec.columns = sorted(tfidf.vocabulary_)
X_val_vec.set_index(y_val.index, inplace=True)
mnb = MultinomialNB()
mnb.fit(X_t_vec, y_t)
y_hat = mnb.predict(X_val_vec)
precision_score(y_val, y_hat)
0.9651162790697675
len(tfidf.vocabulary_)
650
Let's compare MNB to one of our classifiers that has a track record of high performance, Random Forest.
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=1000, max_features=5, max_depth=5)
rf.fit(X_t_vec, y_t)
y_hat = rf.predict(X_val_vec)
precision_score(y_val, y_hat)
Random forest and MNB perform comparably; however, MNB is lightweight in terms of computational power and speed. For real-time predictions, we may choose MNB over random forest because the classifications can be performed quickly.