Naive Bayes and NLP Modeling

# This is always a good idea
%load_ext autoreload
%autoreload 2

from src.student_caller import one_random_student, three_random_students
from src.student_list import student_first_names

"In a standard normal curve, what z-score is associated with the 97.5th percentile?"
one_random_student(student_first_names)

Reuben

Before returning to our Satire/No Satire example, let's consider an example with a smaller but similar scope.

Suppose we are using an API to gather articles from a news website and grabbing phrases from two different types of articles: music and politics.

We have a problem though! Only some of our articles are labeled with a category (music or politics). Is there a way we can use Machine Learning to help us label our data quickly?

Here are our articles

Music Articles:

'the song was popular'
'band leaders disagreed on sound'
'played for a sold out arena stadium'

Politics Articles

'world leaders met lask week'
'the election was close'
'the officials agreed on a compromise'

Let's try and predict one example phrase:

"world leaders agreed to fund the stadium"

How can we make a model that labels this for us rather than having to go through by hand?

from collections import defaultdict
import numpy as np
music = ['the song was popular',
         'band leaders disagreed on sound',
         'played for a sold out arena stadium']

politics = ['world leaders met lask week',
            'the election was close',
            'the officials agreed on a compromise']

test_statement = 'world leaders agreed to fund the stadium'

Let's revisit Bayes' theorem. Remember, Bayes' theorem lets us calculate the probability of a class (c) given the data (x). To do so, we calculate the likelihood (the distribution of our data within a given class) and the prior probability of each class (the probability of seeing the class in the population). We are going to ignore the denominator on the right side of the equation in this instance because, as we will see, we will be finding the ratio of posterior probabilities, in which the shared denominator cancels out.
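To see why the denominator can be dropped, here is the theorem written out, followed by the ratio of posteriors for two classes. The symbols $c_1$ and $c_2$ are stand-ins introduced here for any two classes; they are not part of the original notation.

$\large P(c | x) = \frac{P(x | c)P(c)}{P(x)}$

$\large \frac{P(c_1 | x)}{P(c_2 | x)} = \frac{P(x | c_1)P(c_1)/P(x)}{P(x | c_2)P(c_2)/P(x)} = \frac{P(x | c_1)P(c_1)}{P(x | c_2)P(c_2)}$

Since $P(x)$ appears in both the numerator and the denominator of the ratio, it has no effect on which class comes out ahead.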

Another way of looking at it

So, in the context of our problem......

$\large P(politics | phrase) = \frac{P(phrase|politics)P(politics)}{P(phrase)}$

$\large P(politics) = \frac{\#\ politics}{\#\ all\ articles} $

where phrase is our test statement

How should we calculate P(politics)?

This is the prior: the proportion of all articles that belong to each category. We have three articles of each type, so we assume that each category is equally likely.

p_politics = len(politics)/(len(politics) + len(music))
p_music = len(music)/(len(politics) + len(music))

How do you think we should calculate: $ P(phrase | politics) $ ?

one_random_student(student_first_names)

Andrew

$\large P(phrase | politics) = \prod_{i=1}^{d} P(word_{i} | politics) $

The likelihood of the phrase given a class label is the joint probability of its individual words, or in other words, the product of each word's probability of appearing in that class.

To write it as a product, we need to make a naive assumption. Naive in this context means that we assume the probability of each word appearing is independent of the other words in the phrase. In reality, for example, the probability of the word 'rock' would increase if we found the word 'classic' in the text. Naive Bayes does not take this conditional probability into account.

$\large P(word_{i} | politics) = \frac{\#\ of\ word_{i}\ in\ politics\ art.} {\#\ of\ total\ words\ in\ politics\ art.} $

Can you foresee any issues with this?

one_random_student(student_first_names)

Sam

Laplace Smoothing

$\large P(word_{i} | politics) = \frac{\#\ of\ word_{i}\ in\ politics\ art. + \alpha} {\#\ of\ total\ words\ in\ politics\ art. + \alpha d} $

$\large P(word_{i} | music) = \frac{\#\ of\ word_{i}\ in\ music\ art. + \alpha} {\#\ of\ total\ words\ in\ music\ art. + \alpha d} $

This correction process is called Laplace smoothing:

d : number of features (in this instance total number of vocabulary words)
$\alpha$ can be any number greater than 0 (it is usually 1)
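To make the effect of smoothing concrete, here is a minimal sketch that applies the formula above to a single word. It reuses the six toy articles from this lesson and counts every word occurrence in the category for the denominator. The helper name `word_probability` is made up for this illustration.

```python
from collections import Counter

# Toy articles copied from the example in this lesson
music = ['the song was popular',
         'band leaders disagreed on sound',
         'played for a sold out arena stadium']

politics = ['world leaders met lask week',
            'the election was close',
            'the officials agreed on a compromise']

politics_words = ' '.join(politics).split()
politics_counts = Counter(politics_words)

n_politics_words = len(politics_words)                    # total words in politics articles
d = len(set(' '.join(music + politics).split()))          # vocabulary size across both categories

def word_probability(word, alpha):
    """P(word | politics); alpha=0 switches smoothing off (hypothetical helper)."""
    return (politics_counts[word] + alpha) / (n_politics_words + alpha * d)

# 'stadium' never appears in a politics article
unsmoothed = word_probability('stadium', alpha=0)
smoothed = word_probability('stadium', alpha=1)

print(unsmoothed)  # 0.0, which would zero out the whole product
print(smoothed)    # 1 / (15 + 25) = 0.025
```

Without smoothing, one unseen word is enough to force the likelihood of the entire phrase to zero, no matter how strongly the other words point to the class.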

Now let's carry out this calculation

def vocab_maker(category):
    """
    parameters: category is a list containing all the articles
    of a given category.
    
    returns the vocabulary for a given type of article
    
    """
    
    vocab_category = set() # will filter down to only unique words
    
    for art in category:
        words = art.split()
        for word in words:
            vocab_category.add(word)
    return vocab_category
        
voc_music = vocab_maker(music)
voc_pol = vocab_maker(politics)

# These are all the unique words in the music category
voc_music

{'a',
 'arena',
 'band',
 'disagreed',
 'for',
 'leaders',
 'on',
 'out',
 'played',
 'popular',
 'sold',
 'song',
 'sound',
 'stadium',
 'the',
 'was'}

# These are all the unique words in the politics category
voc_pol

{'a',
 'agreed',
 'close',
 'compromise',
 'election',
 'lask',
 'leaders',
 'met',
 'officials',
 'on',
 'the',
 'was',
 'week',
 'world'}

# The union of the two sets gives us the unique words across both article groups
voc_all = voc_music.union(voc_pol)
voc_all

{'a',
 'agreed',
 'arena',
 'band',
 'close',
 'compromise',
 'disagreed',
 'election',
 'for',
 'lask',
 'leaders',
 'met',
 'officials',
 'on',
 'out',
 'played',
 'popular',
 'sold',
 'song',
 'sound',
 'stadium',
 'the',
 'was',
 'week',
 'world'}

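total_vocab_count = len(voc_all)
# Note: these are counts of unique words per category. For music this equals the
# total number of words (16); for politics it is 14 unique words versus 15 total,
# because 'the' appears twice.
total_music_count = len(voc_music)
total_politics_count = len(voc_pol)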

Let's remind ourselves of the goal: to find the posterior probability of the class politics given our phrase.

P(politics | world leaders agreed to fund the stadium)

music

['the song was popular',
 'band leaders disagreed on sound',
 'played for a sold out arena stadium']

def find_number_words_in_category(phrase,category):
    statement = phrase.split()
    
    # category is a list of the raw documents of each category
    str_category=' '.join(category)
    cat_word_list = str_category.split()
    word_count = defaultdict(int)
    
    # loop through each word in the phrase
    for word in statement:
        # count the number of times the phrase word occurs in the category;
        # a word that never occurs is still recorded, with a count of 0
        word_count[word] += cat_word_list.count(word)
    return word_count

test_music_word_count = find_number_words_in_category(test_statement,music)

test_music_word_count

defaultdict(int,
            {'world': 0,
             'leaders': 1,
             'agreed': 0,
             'to': 0,
             'fund': 0,
             'the': 1,
             'stadium': 1})

test_politic_word_count = find_number_words_in_category(test_statement,politics)

test_politic_word_count

defaultdict(int,
            {'world': 1,
             'leaders': 1,
             'agreed': 1,
             'to': 0,
             'fund': 0,
             'the': 2,
             'stadium': 0})

def find_likelihood(category_count,test_category_count,alpha):
    # The numerator will be the product of all the counts
    # with the smoothing factor (alpha) added to make sure the probability is not zeroed out
    num = np.prod(np.array(list(test_category_count.values())) + alpha)
    
    # The denominator will be the same for each word (total category count + total vocab * alpha)
    # so we raise it to the power of the number of words in the test phrase
    denom = (category_count + total_vocab_count*alpha)**(len(test_category_count))
    
    return num/denom

likelihood_m = find_likelihood(total_music_count,test_music_word_count,1)

likelihood_p = find_likelihood(total_politics_count,test_politic_word_count,1)

print(likelihood_m)
print(likelihood_p)

4.107740405680756e-11
1.748875897714495e-10

$\large P(politics | article) \propto P(politics) \times \prod_{i=1}^{d} P(word_{i} | politics) $

Determining the winner of our model:

p_politics = .5
p_music = .5

# p(politics|article)  > p(music|article)
likelihood_p * p_politics  > likelihood_m * p_music

True
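The comparison only tells us which class wins. If we also want actual posterior probabilities, we can normalize the two scores so they sum to one, which is the job the ignored denominator P(phrase) would have done. A small sketch, with the likelihoods copied from the printed output above (the `score_*` and `posterior_*` names are introduced here for illustration):

```python
# Likelihoods and priors copied from the output above
likelihood_m = 4.107740405680756e-11
likelihood_p = 1.748875897714495e-10
p_music = 0.5
p_politics = 0.5

# Unnormalized posteriors: likelihood * prior
score_m = likelihood_m * p_music
score_p = likelihood_p * p_politics

# Dividing by the sum of the scores plays the role of P(phrase)
evidence = score_m + score_p
posterior_m = score_m / evidence
posterior_p = score_p / evidence

print(round(posterior_p, 2), round(posterior_m, 2))  # 0.81 0.19
```

So under this toy model, the phrase is roughly four times more likely to be politics than music.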

Many times, the probabilities we end up with are exceedingly small, and multiplying many of them together can underflow to zero in floating point. Taking logs turns the product into a sum, which avoids that problem.

$\large log(P(politics | article)) = log(P(politics)) + \sum_{i=1}^{d}log( P(word_{i} | politics)) $
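One practical reason to work in log space is floating-point underflow. The sketch below uses 1,000 made-up word probabilities (purely hypothetical values) to show the product collapsing to zero while the sum of logs stays usable. It then checks that the politics/music comparison from above is unchanged after taking logs.

```python
import math

# 1,000 hypothetical word probabilities, each 1e-4 (made-up values for illustration)
word_probs = [1e-4] * 1000

# Multiplying directly underflows to exactly 0.0 in floating point
product = 1.0
for p in word_probs:
    product *= p

# Summing logs keeps the same information in a representable range
log_sum = sum(math.log(p) for p in word_probs)

print(product)   # 0.0
print(log_sum)   # about -9210.34

# The class comparison is unchanged, because log is monotonically increasing
likelihood_m = 4.107740405680756e-11  # copied from the output above
likelihood_p = 1.748875897714495e-10
log_post_m = math.log(0.5) + math.log(likelihood_m)
log_post_p = math.log(0.5) + math.log(likelihood_p)
print(log_post_p > log_post_m)  # True
```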

Good Resource: https://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html

Back to Satire

import pandas as pd
import numpy as np
corpus = pd.read_csv('data/satire_nosatire.csv')
corpus.head()

	body	target
0	Noting that the resignation of James Mattis as...	1
1	Desperate to unwind after months of nonstop wo...	1
2	Nearly halfway through his presidential term, ...	1
3	Attempting to make amends for gross abuses of ...	1
4	Decrying the Senate’s resolution blaming the c...	1

As always, we will perform a train test split...

X=corpus.body
y=corpus.target

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42, test_size=.25)

...and preprocess the training set like we learned.

import nltk
from nltk.tokenize import regexp_tokenize, word_tokenize, RegexpTokenizer
from nltk.corpus import stopwords

# Build our stopword list, adding a few words specific to this corpus

sw = stopwords.words('english')
sw.extend(['would', 'one', 'say'])

from nltk.corpus import wordnet
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer 
  

def get_wordnet_pos(treebank_tag):
    '''
    Translate nltk POS to wordnet tags
    '''
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

def doc_preparer(doc, stop_words=sw):
    '''
    
    :param doc: a document from the satire corpus 
    :return: a document string with words which have been 
            lemmatized, 
            parsed for stopwords, 
            made lowercase,
            and stripped of punctuation and numbers.
    '''
    
    regex_token = RegexpTokenizer(r"([a-zA-Z]+(?:’[a-z]+)?)")
    doc = regex_token.tokenize(doc)
    doc = [word.lower() for word in doc]
    doc = [word for word in doc if word not in stop_words]
    doc = pos_tag(doc)
    doc = [(word[0], get_wordnet_pos(word[1])) for word in doc]
    lemmatizer = WordNetLemmatizer() 
    doc = [lemmatizer.lemmatize(word[0], word[1]) for word in doc]
    return ' '.join(doc)

token_docs = [doc_preparer(doc, sw) for doc in X_train]

from sklearn.feature_extraction.text import CountVectorizer

For demonstration purposes, we will limit our count vectorizer to 5 words (the top 5 words by frequency).

# Secondary train-test split to build our best model
X_t, X_val, y_t, y_val = train_test_split(token_docs, y_train, test_size=.25, random_state=42)

cv = CountVectorizer(max_features=5)

# Just like with our scaler, we fit our Count Vectorizer on the training set
X_t_vec = cv.fit_transform(X_t)
X_t_vec  = pd.DataFrame.sparse.from_spmatrix(X_t_vec)
X_t_vec.columns = sorted(cv.vocabulary_)
X_t_vec.set_index(y_t.index, inplace=True)

X_t_vec

	people	say	state	trump	year
159	3	0	0	0	0
246	1	0	0	7	1
640	0	4	1	0	4
809	2	10	2	0	7
130	0	0	0	0	0
...	...	...	...	...	...
148	1	0	0	0	1
300	0	1	0	0	0
356	1	3	0	0	0
36	1	4	0	0	3
895	1	7	0	0	6

562 rows × 5 columns
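To demystify what the table above contains, here is a rough standard-library sketch of a count vectorizer limited to the most frequent words. The three documents are made up for illustration, and this is a simplification rather than scikit-learn's implementation (its tokenization and tie-breaking differ).

```python
from collections import Counter

# Three tiny made-up documents (hypothetical, not from the satire corpus)
docs = ['people say trump say',
        'state people year say',
        'trump trump people trump']

max_features = 3

# 1. Count every word across the whole corpus
corpus_counts = Counter(word for doc in docs for word in doc.split())

# 2. Keep the max_features most frequent words, then sort them alphabetically
top_words = sorted(word for word, _ in corpus_counts.most_common(max_features))

# 3. Each document becomes a row of counts, one column per kept word
rows = [[doc.split().count(word) for word in top_words] for doc in docs]

print(top_words)  # ['people', 'say', 'trump']
for row in rows:
    print(row)    # [1, 2, 1], then [1, 1, 0], then [1, 0, 3]
```

The words 'state' and 'year' are dropped here because they are not among the three most frequent, just as every word outside the top five was dropped from our real matrix.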

Knowledge Check

The word say shows up in our count vectorizer, even though we added it to our stopword list. What is going on?

# We then transform the validation set.  Do not refit the vectorizer
X_val_vec = cv.transform(X_val)
X_val_vec  = pd.DataFrame.sparse.from_spmatrix(X_val_vec)
X_val_vec.columns = sorted(cv.vocabulary_)
X_val_vec.set_index(y_val.index, inplace=True)

Multinomial Naive Bayes

Now let's fit the Multinomial Naive Bayes Classifier on our training data

from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB()

mnb.fit(X_t_vec, y_t)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

#What should our priors for each class be?

one_random_student(student_first_names)

Karim

mnb.class_log_prior_

array([-0.72203005, -0.66507516])
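These are log priors. With `fit_prior=True` and `class_prior=None`, as shown in the model's printout above, scikit-learn estimates them from the class frequencies in the training data. Exponentiating recovers the proportions, as this small sketch shows (values copied from the output above):

```python
import math

# Log priors copied from the output above
class_log_prior = [-0.72203005, -0.66507516]

# Exponentiating undoes the log and recovers the class proportions
priors = [math.exp(lp) for lp in class_log_prior]

print([round(p, 3) for p in priors])  # [0.486, 0.514]
```

So about 48.6% of our training documents are in class 0 and 51.4% are in class 1, which is close to the even split we assumed in the toy example.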

from sklearn.metrics import accuracy_score, precision_score, confusion_matrix

y_hat = mnb.predict(X_val_vec)
accuracy_score(y_val, y_hat)

0.8297872340425532

Let's consider the scenario that we would like to isolate satirical news on Facebook so we can flag it. We do not want to flag real news by mistake. In other words, we want to minimize false positives.

confusion_matrix(y_val, y_hat)

array([[83, 16],
       [16, 73]])
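Minimizing false positives means we care about precision: of everything we flagged as satire, how much really was satire? Here is a quick sketch of how precision and accuracy fall out of the matrix above, assuming scikit-learn's layout where rows are true classes and columns are predicted classes.

```python
# Confusion matrix copied from the output above.
# scikit-learn's layout is [[TN, FP], [FN, TP]] (rows: true class, columns: predicted class)
tn, fp = 83, 16
fn, tp = 16, 73

# Precision: share of predicted positives that are truly positive
precision = tp / (tp + fp)

# Accuracy: share of all predictions that are correct
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(precision)  # 73 / 89
print(accuracy)   # 156 / 188
```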

precision_score(y_val, y_hat)

0.8202247191011236

That's pretty good for a five-word vocabulary.

Let's see what happens when we don't restrict our vocabulary

cv = CountVectorizer()
X_t_vec = cv.fit_transform(X_t)
X_t_vec  = pd.DataFrame.sparse.from_spmatrix(X_t_vec)
X_t_vec.columns = sorted(cv.vocabulary_)
X_t_vec.set_index(y_t.index, inplace=True)


X_val_vec = cv.transform(X_val)
X_val_vec  = pd.DataFrame.sparse.from_spmatrix(X_val_vec)
X_val_vec.columns = sorted(cv.vocabulary_)
X_val_vec.set_index(y_val.index, inplace=True)

mnb = MultinomialNB()

mnb.fit(X_t_vec, y_t)
y_hat = mnb.predict(X_val_vec)
confusion_matrix(y_val, y_hat)

array([[96,  3],
       [ 4, 85]])

Wow! Look how well that performed.

precision_score(y_val, y_hat)

0.9659090909090909

len(cv.vocabulary_)

Let's see whether or not we can maintain that level of precision with fewer words.

cv = CountVectorizer(min_df=.05, max_df=.95)
X_t_vec = cv.fit_transform(X_t)
X_t_vec  = pd.DataFrame.sparse.from_spmatrix(X_t_vec)
X_t_vec.columns = sorted(cv.vocabulary_)
X_t_vec.set_index(y_t.index, inplace=True)

X_val_vec = cv.transform(X_val)
X_val_vec  = pd.DataFrame.sparse.from_spmatrix(X_val_vec)
X_val_vec.columns = sorted(cv.vocabulary_)
X_val_vec.set_index(y_val.index, inplace=True)

mnb = MultinomialNB()

mnb.fit(X_t_vec, y_t)
y_hat = mnb.predict(X_val_vec)

precision_score(y_val, y_hat)

0.9431818181818182

len(cv.vocabulary_)
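When `min_df` and `max_df` are floats, scikit-learn reads them as proportions of documents: a word is kept only if the share of documents containing it falls between the two. The sketch below imitates that filter on four made-up documents, with a larger `min_df` than we used above so the effect is visible on such a small corpus.

```python
# Four tiny made-up documents (hypothetical, for illustration only)
docs = ['the senate vote',
        'the band played',
        'the senate debate',
        'the vote count']

min_df = 0.5   # keep words found in at least 50% of documents
max_df = 0.95  # drop words found in more than 95% of documents

n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
doc_freq = {}
for doc in docs:
    for word in set(doc.split()):
        doc_freq[word] = doc_freq.get(word, 0) + 1

# Keep only the words whose document frequency lies inside the allowed band
kept = sorted(word for word, df in doc_freq.items()
              if min_df * n_docs <= df <= max_df * n_docs)

print(kept)  # ['senate', 'vote']
```

Here 'the' is dropped for being in every document, and words like 'band' are dropped for being too rare. The same logic trims both ends of our satire vocabulary.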

# Now let's see what happens with TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X_t_vec = tfidf.fit_transform(X_t)
X_t_vec  = pd.DataFrame.sparse.from_spmatrix(X_t_vec)
X_t_vec.columns = sorted(tfidf.vocabulary_)
X_t_vec.set_index(y_t.index, inplace=True)

X_val_vec = tfidf.transform(X_val)
X_val_vec  = pd.DataFrame.sparse.from_spmatrix(X_val_vec)
X_val_vec.columns = sorted(tfidf.vocabulary_)
X_val_vec.set_index(y_val.index, inplace=True)

mnb = MultinomialNB()

mnb.fit(X_t_vec, y_t)
y_hat = mnb.predict(X_val_vec)

precision_score(y_val, y_hat)

0.9444444444444444

TF-IDF does not necessarily perform better than a count vectorizer. It is just another tool in our toolbelt, which we can try out and compare on performance.
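For intuition about what changes when we switch vectorizers, here is a standard-library sketch of TF-IDF weighting. It assumes the smoothed IDF formula and unit-length row scaling that scikit-learn documents as its defaults (`smooth_idf=True`, `norm='l2'`). The documents and the `tfidf_row` helper are made up for illustration.

```python
import math

# Three tiny made-up documents (hypothetical, for illustration only)
docs = ['trump say trump',
        'people say year',
        'people state']

n_docs = len(docs)
vocab = sorted({word for doc in docs for word in doc.split()})

# Document frequency of each word
doc_freq = {word: sum(word in doc.split() for doc in docs) for word in vocab}

# Smoothed inverse document frequency: rarer words get larger weights
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocab}

def tfidf_row(doc):
    """Raw count * idf for each vocabulary word, scaled to unit length."""
    words = doc.split()
    weights = [words.count(word) * idf[word] for word in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

row = tfidf_row(docs[0])
print(dict(zip(vocab, (round(w, 3) for w in row))))
```

In the first document, 'trump' ends up with a much larger weight than 'say', both because it occurs twice and because it appears in fewer documents.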

len(tfidf.vocabulary_)

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(min_df=.05, max_df=.95)
X_t_vec = tfidf.fit_transform(X_t)
X_t_vec  = pd.DataFrame.sparse.from_spmatrix(X_t_vec)
X_t_vec.columns = sorted(tfidf.vocabulary_)
X_t_vec.set_index(y_t.index, inplace=True)

X_val_vec = tfidf.transform(X_val)
X_val_vec  = pd.DataFrame.sparse.from_spmatrix(X_val_vec)
X_val_vec.columns = sorted(tfidf.vocabulary_)
X_val_vec.set_index(y_val.index, inplace=True)

mnb = MultinomialNB()

mnb.fit(X_t_vec, y_t)
y_hat = mnb.predict(X_val_vec)

precision_score(y_val, y_hat)

0.9651162790697675

len(tfidf.vocabulary_)

Let's compare MNB to one of our classifiers that has a track record of high performance, Random Forest.

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=1000, max_features=5, max_depth=5)
rf.fit(X_t_vec, y_t)
y_hat = rf.predict(X_val_vec)
precision_score(y_val, y_hat)

Random forest and MNB perform comparably; however, MNB is lightweight in terms of computational power and speed. For real-time predictions, we may choose MNB over random forest because the classifications can be performed quickly.
