Naive Bayes and NLP Modeling

# This is always a good idea
%load_ext autoreload
%autoreload 2

from src.student_caller import one_random_student, three_random_students
from src.student_list import student_first_names
"In a standard normal curve, what z-score is associated with the 97.5th percentile?"
one_random_student(student_first_names)
Reuben

Before returning to our Satire/No Satire example, let's consider an example with a smaller but similar scope.

Suppose we are using an API to gather articles from a news website and grabbing phrases from two different types of articles: music and politics.

We have a problem though! Only some of our articles are labeled with a category (music or politics). Is there a way we can use Machine Learning to help us label our data quickly?


Here are our articles

Music Articles:

  • 'the song was popular'
  • 'band leaders disagreed on sound'
  • 'played for a sold out arena stadium'

Politics Articles

  • 'world leaders met lask week'
  • 'the election was close'
  • 'the officials agreed on a compromise'

Let's try and predict one example phrase:

  • "world leaders agreed to fund the stadium"

How can we make a model that labels this for us rather than having to go through by hand?

from collections import defaultdict
import numpy as np
music = ['the song was popular',
         'band leaders disagreed on sound',
         'played for a sold out arena stadium']

politics = ['world leaders met lask week',
            'the election was close',
            'the officials agreed on a compromise']

test_statement = 'world leaders agreed to fund the stadium'

Let's revisit Bayes' Theorem. Remember, Bayes' Theorem lets us calculate the probability of a class (c) given the data (x). To do so, we calculate the likelihood (the distribution of our data within a given class) and the prior probability of each class (the probability of seeing the class in the population). We are going to ignore the denominator on the right side of the equation in this instance because, as we will see, we will be comparing the posterior probabilities of the two classes, and the shared denominator cancels out of that comparison.

Another way of looking at it

So, in the context of our problem......

$\large P(politics | phrase) = \frac{P(phrase|politics)P(politics)}{P(phrase)}$

$\large P(politics) = \frac{\#\ politics\ articles}{\#\ all\ articles} $

where phrase is our test statement
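
To see why the denominator can be ignored, the ratio of the two posteriors can be written out. This is only a restatement of the formula above for our two classes; $P(phrase)$ appears in both the numerator and the denominator and cancels:

```latex
\frac{P(politics \mid phrase)}{P(music \mid phrase)}
  = \frac{P(phrase \mid politics)\,P(politics)\,/\,P(phrase)}
         {P(phrase \mid music)\,P(music)\,/\,P(phrase)}
  = \frac{P(phrase \mid politics)\,P(politics)}
         {P(phrase \mid music)\,P(music)}
```

If this ratio is greater than 1, politics is the more probable label; otherwise music is.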

How should we calculate P(politics)?

This is the prior: the share of all articles that belong to each class. We have three articles of each type, so we assume each class is equally likely.

p_politics = len(politics)/(len(politics) + len(music))
p_music = len(music)/(len(politics) + len(music))

How do you think we should calculate: $ P(phrase | politics) $ ?

one_random_student(student_first_names)
Andrew

$\large P(phrase | politics) = \prod_{i=1}^{d} P(word_{i} | politics) $

The likelihood of the phrase given a class label is the joint probability of its individual words, which we compute as the product of each word's probability of appearing in that class.

We need to make a naive assumption. Naive in this context means that we assume the probability of each word appearing is independent of the other words in the phrase. In reality, for example, the probability of the word 'rock' would increase if we found the word 'classic' in the text. Naive Bayes does not take this dependence into account.

$\large P(word_{i} | politics) = \frac{\#\ of\ word_{i}\ in\ politics\ art.} {\#\ of\ total\ words\ in\ politics\ art.} $

Can you foresee any issues with this?

one_random_student(student_first_names)
Sam
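
One issue can be seen by plugging our toy data into the unsmoothed formula. The sketch below is illustrative only: it rebuilds the politics word counts with `collections.Counter` (the names `politics_docs`, `politics_counts` and `unsmoothed` are made up for this example), and shows that any word missing from a class drives the whole product to zero.

```python
from collections import Counter

# Same toy politics articles as above (the 'lask' typo is kept to match the data)
politics_docs = ['world leaders met lask week',
                 'the election was close',
                 'the officials agreed on a compromise']
phrase = 'world leaders agreed to fund the stadium'

# Count every token across all politics articles
politics_counts = Counter(' '.join(politics_docs).split())
total_politics_tokens = sum(politics_counts.values())

# Unsmoothed estimate: count of the word / total tokens in the class
unsmoothed = {w: politics_counts[w] / total_politics_tokens for w in phrase.split()}

likelihood = 1.0
for p in unsmoothed.values():
    likelihood *= p

print(unsmoothed)
print(likelihood)  # 0.0, because 'to', 'fund' and 'stadium' never occur in politics
```

A single unseen word is enough to rule out a class entirely, no matter how strongly the other words point to it.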

Laplace Smoothing

$\large P(word_{i} | politics) = \frac{\#\ of\ word_{i}\ in\ politics\ art. + \alpha} {\#\ of\ total\ words\ in\ politics\ art. + \alpha d} $

$\large P(word_{i} | music) = \frac{\#\ of\ word_{i}\ in\ music\ art. + \alpha} {\#\ of\ total\ words\ in\ music\ art. + \alpha d} $

This correction process is called Laplace smoothing:

  • d : number of features (in this instance total number of vocabulary words)
  • $\alpha$ can be any number greater than 0 (it is usually 1)
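
As a small worked example of the smoothed formula, the sketch below uses exact fractions. The helper name `smoothed_probability` is hypothetical; the numbers come from the toy data: the music articles contain 16 words in total, and the combined vocabulary has 25 unique words.

```python
from fractions import Fraction

def smoothed_probability(word_count, total_words_in_class, vocab_size, alpha=1):
    """Laplace-smoothed estimate of P(word | class)."""
    return Fraction(word_count + alpha, total_words_in_class + alpha * vocab_size)

# 'stadium' appears once among the 16 music words; d = 25 unique words overall
p_stadium_music = smoothed_probability(1, 16, 25)
# 'fund' never appears in the music articles, yet its probability is no longer zero
p_fund_music = smoothed_probability(0, 16, 25)

print(p_stadium_music)  # 2/41
print(p_fund_music)     # 1/41
```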

Now let's carry out this calculation

def vocab_maker(category):
    """
    parameters: category is a list containing all the articles
    of a given category.
    
    returns the vocabulary for a given type of article
    
    """
    
    vocab_category = set() # will filter down to only unique words
    
    for art in category:
        words = art.split()
        for word in words:
            vocab_category.add(word)
    return vocab_category
        
voc_music = vocab_maker(music)
voc_pol = vocab_maker(politics)
# These are all the unique words in the music category
voc_music
{'a',
 'arena',
 'band',
 'disagreed',
 'for',
 'leaders',
 'on',
 'out',
 'played',
 'popular',
 'sold',
 'song',
 'sound',
 'stadium',
 'the',
 'was'}
# These are all the unique words in the politics category
voc_pol
{'a',
 'agreed',
 'close',
 'compromise',
 'election',
 'lask',
 'leaders',
 'met',
 'officials',
 'on',
 'the',
 'was',
 'week',
 'world'}
# The union of the two sets gives us the unique words across both article groups
voc_all = voc_music.union(voc_pol)
voc_all
{'a',
 'agreed',
 'arena',
 'band',
 'close',
 'compromise',
 'disagreed',
 'election',
 'for',
 'lask',
 'leaders',
 'met',
 'officials',
 'on',
 'out',
 'played',
 'popular',
 'sold',
 'song',
 'sound',
 'stadium',
 'the',
 'was',
 'week',
 'world'}
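total_vocab_count = len(voc_all)
# Note: these are counts of *unique* words per category, not total word counts.
# For music the two coincide (16); for politics the total is 15 because 'the'
# appears twice, so this departs slightly from the formula above.
total_music_count = len(voc_music)
total_politics_count = len(voc_pol)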

Let's remind ourselves of the goal: to find the posterior probability of the class politics given our phrase.

P(politics | world leaders agreed to fund the stadium)

music
['the song was popular',
 'band leaders disagreed on sound',
 'played for a sold out arena stadium']
def find_number_words_in_category(phrase, category):
    statement = phrase.split()

    # category is a list of the raw documents of each category;
    # flatten it into a single list of words
    cat_word_list = ' '.join(category).split()
    word_count = defaultdict(int)

    # for each word in the phrase, count how many times it occurs in the category
    # (words that never occur are still recorded, with a count of 0)
    for word in statement:
        word_count[word] += cat_word_list.count(word)
    return word_count
                
            
test_music_word_count = find_number_words_in_category(test_statement,music)
test_music_word_count
defaultdict(int,
            {'world': 0,
             'leaders': 1,
             'agreed': 0,
             'to': 0,
             'fund': 0,
             'the': 1,
             'stadium': 1})
test_politic_word_count = find_number_words_in_category(test_statement,politics)
test_politic_word_count
defaultdict(int,
            {'world': 1,
             'leaders': 1,
             'agreed': 1,
             'to': 0,
             'fund': 0,
             'the': 2,
             'stadium': 0})
def find_likelihood(category_count,test_category_count,alpha):
    # The numerator is the product of the smoothed counts (count + alpha),
    # so that a word unseen in the category does not zero out the probability
    num = np.prod(np.array(list(test_category_count.values())) + alpha)
    
    # The denominator is the same for each word (category word count + alpha * total vocab size),
    # so we raise it to the power of the number of words in the test phrase
    denom = (category_count + total_vocab_count*alpha)**(len(test_category_count))
    
    return num/denom
likelihood_m = find_likelihood(total_music_count,test_music_word_count,1)
likelihood_p = find_likelihood(total_politics_count,test_politic_word_count,1)
print(likelihood_m)
print(likelihood_p)
4.107740405680756e-11
1.748875897714495e-10

$ P(politics | article) \propto P(politics) \times \prod_{i=1}^{d} P(word_{i} | politics) $

Determining the winner of our model:

p_politics = .5
p_music = .5
# p(politics|article)  > p(music|article)
likelihood_p * p_politics  > likelihood_m * p_music
True
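
The comparison above only tells us which class wins. As a rough sketch, and assuming music and politics are the only two possible classes, the two scores can be normalized to recover actual posterior probabilities. The variable names below are chosen for this example so they do not clash with the notebook's; the likelihood values are the ones printed above.

```python
# Likelihoods printed above, with equal priors
likelihood_music = 4.107740405680756e-11
likelihood_politics = 1.748875897714495e-10
prior_music = prior_politics = 0.5

score_music = likelihood_music * prior_music
score_politics = likelihood_politics * prior_politics

# Dividing by the sum of the scores plays the role of the ignored denominator P(phrase)
posterior_politics = score_politics / (score_politics + score_music)
posterior_music = score_music / (score_politics + score_music)

print(round(posterior_politics, 3), round(posterior_music, 3))
```

With these numbers the model puts roughly 81% of the probability on politics and 19% on music.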

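The probabilities we end up with are often exceedingly small, and multiplying many of them together can underflow to zero in floating point. Taking logs turns the product into a sum, which is numerically stable and preserves the ordering of the classes.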

$\large log(P(politics | article)) = log(P(politics)) + \sum_{i=1}^{d}log( P(word_{i} | politics)) $
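
A minimal sketch of the log form on our toy example follows. The per-word probabilities are rebuilt from the smoothed counts and denominators used in `find_likelihood` above (41 for music, 39 for politics); `log_score` is a hypothetical helper name. The last few lines illustrate the underflow that a long product of small probabilities can cause.

```python
import math

# Smoothed per-word probabilities for the test phrase: (count + 1) / denominator
music_probs = [c / 41 for c in (1, 2, 1, 1, 1, 2, 2)]
politics_probs = [c / 39 for c in (2, 2, 2, 1, 1, 3, 1)]

def log_score(prior, word_probs):
    """log(prior) plus the sum of the log word probabilities."""
    return math.log(prior) + sum(math.log(p) for p in word_probs)

log_music = log_score(0.5, music_probs)
log_politics = log_score(0.5, politics_probs)
print(log_politics > log_music)  # same winner as the product form

# A long product of small probabilities underflows to zero...
tiny_product = 1.0
for _ in range(400):
    tiny_product *= 1e-3
print(tiny_product)

# ...while the equivalent sum of logs remains an ordinary negative number
print(400 * math.log(1e-3))
```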

Good Resource: https://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html

Back to Satire

import pandas as pd
import numpy as np
corpus = pd.read_csv('data/satire_nosatire.csv')
corpus.head()
   body                                                target
0  Noting that the resignation of James Mattis as...        1
1  Desperate to unwind after months of nonstop wo...        1
2  Nearly halfway through his presidential term, ...        1
3  Attempting to make amends for gross abuses of ...        1
4  Decrying the Senate’s resolution blaming the c...        1

Like always, we will perform a train test split...

X=corpus.body
y=corpus.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42, test_size=.25)

...and preprocess the training set like we learned.

import nltk
# RegexpTokenizer lets us keep only alphabetic tokens, dropping numbers and punctuation
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

sw = stopwords.words('english')
sw.extend(['would', 'one', 'say'])
from nltk.corpus import wordnet
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer 
  

def get_wordnet_pos(treebank_tag):
    '''
    Translate nltk POS to wordnet tags
    '''
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
def doc_preparer(doc, stop_words=sw):
    '''
    
    :param doc: a document from the satire corpus 
    :return: a document string with words which have been 
            lemmatized, 
            parsed for stopwords, 
            made lowercase,
            and stripped of punctuation and numbers.
    '''
    
    regex_token = RegexpTokenizer(r"([a-zA-Z]+(?:’[a-z]+)?)")
    doc = regex_token.tokenize(doc)
    doc = [word.lower() for word in doc]
    doc = [word for word in doc if word not in stop_words]
    doc = pos_tag(doc)
    doc = [(word[0], get_wordnet_pos(word[1])) for word in doc]
    lemmatizer = WordNetLemmatizer() 
    doc = [lemmatizer.lemmatize(word[0], word[1]) for word in doc]
    return ' '.join(doc)
token_docs = [doc_preparer(doc, sw) for doc in X_train]
from sklearn.feature_extraction.text import CountVectorizer

For demonstration purposes, we will limit our count vectorizer to 5 words (the top 5 words by frequency).

# Secondary train-test split to build our best model
X_t, X_val, y_t, y_val = train_test_split(token_docs, y_train, test_size=.25, random_state=42)
cv = CountVectorizer(max_features=5)

# Just like with our scaler, we fit our Count Vectorizer on the training set
X_t_vec = cv.fit_transform(X_t)
X_t_vec  = pd.DataFrame.sparse.from_spmatrix(X_t_vec)
X_t_vec.columns = sorted(cv.vocabulary_)
X_t_vec.set_index(y_t.index, inplace=True)
X_t_vec
     people  say  state  trump  year
159       3    0      0      0     0
246       1    0      0      7     1
640       0    4      1      0     4
809       2   10      2      0     7
130       0    0      0      0     0
...     ...  ...    ...    ...   ...
148       1    0      0      0     1
300       0    1      0      0     0
356       1    3      0      0     0
36        1    4      0      0     3
895       1    7      0      0     6

562 rows × 5 columns

Knowledge Check

The word 'say' shows up in our count vectorizer, even though we added it to our stopword list. What is going on?
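
One possible explanation lies in the order of the steps in `doc_preparer`: stopwords are removed before lemmatization. The toy sketch below illustrates that ordering effect only. `toy_lemmas` is a hypothetical stand-in for `WordNetLemmatizer` (it assumes forms such as 'said' lemmatize to 'say'), and `toy_stopwords` is a tiny made-up stopword list.

```python
# Hypothetical stand-in for the lemmatizer: maps a few inflected forms to their lemma
toy_lemmas = {'said': 'say', 'says': 'say', 'saying': 'say'}
# Tiny made-up stopword list that, like ours, includes 'say'
toy_stopwords = {'the', 'would', 'one', 'say'}

def toy_preparer(doc, lemmatize_first=False):
    words = doc.lower().split()
    if lemmatize_first:
        # lemmatize, then filter: 'said' becomes 'say' and is removed
        words = [toy_lemmas.get(w, w) for w in words]
        return [w for w in words if w not in toy_stopwords]
    # filter, then lemmatize (the order used in doc_preparer):
    # 'said' survives the filter and only afterwards becomes 'say'
    words = [w for w in words if w not in toy_stopwords]
    return [toy_lemmas.get(w, w) for w in words]

print(toy_preparer('The senator said one thing'))
print(toy_preparer('The senator said one thing', lemmatize_first=True))
```

The first call returns `['senator', 'say', 'thing']`, while the second returns `['senator', 'thing']`.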

# We then transform the validation set.  Do not refit the vectorizer
X_val_vec = cv.transform(X_val)
X_val_vec  = pd.DataFrame.sparse.from_spmatrix(X_val_vec)
X_val_vec.columns = sorted(cv.vocabulary_)
X_val_vec.set_index(y_val.index, inplace=True)

Multinomial Naive Bayes

Now let's fit the Multinomial Naive Bayes classifier on our training data

from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB()

mnb.fit(X_t_vec, y_t)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
#What should our priors for each class be?

one_random_student(student_first_names)
Karim
mnb.class_log_prior_
array([-0.72203005, -0.66507516])
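
These values are natural logs of the class priors. A quick sketch of how to read them: exponentiating recovers the priors, which with `fit_prior=True` are the class frequencies in the training data. The counts 273 and 289 below are inferred from the printed values and the 562 training rows; they are not shown anywhere in the notebook output.

```python
import math

class_log_prior = [-0.72203005, -0.66507516]  # values printed above

# Exponentiate to get back the prior probability of each class
priors = [math.exp(lp) for lp in class_log_prior]
print(priors)
print(sum(priors))  # the two priors sum to (approximately) 1

# Inferred class counts out of the 562 training rows
print([round(p * 562) for p in priors])
```

So the priors are close to, but not exactly, 50/50, reflecting a slight class imbalance in this training split.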
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix

y_hat = mnb.predict(X_val_vec)
accuracy_score(y_val, y_hat)
0.8297872340425532

Let's consider a scenario in which we would like to isolate satirical news on Facebook so we can flag it. We do not want to flag real news by mistake. In other words, we want to minimize false positives.

confusion_matrix(y_val, y_hat)
array([[83, 16],
       [16, 73]])
precision_score(y_val, y_hat)
0.8202247191011236
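
To connect the scores to the confusion matrix, the sketch below recomputes them by hand. It assumes the usual scikit-learn layout (rows are true labels, columns are predicted labels, in label order 0 then 1) and that label 1 is the positive, satirical class.

```python
# Confusion matrix printed above for the five-word model
cm = [[83, 16],
      [16, 73]]
(tn, fp), (fn, tp) = cm

# Precision: of everything we flagged as satire, how much really was satire?
precision = tp / (tp + fp)
# Accuracy: share of all validation articles labeled correctly
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(precision)  # about 0.820
print(accuracy)   # about 0.830
```

The 16 false positives in the top-right cell are the real articles we would have flagged by mistake, which is exactly what precision penalizes.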

That's pretty good for a five word vocabulary.

Let's see what happens when we don't restrict our vocabulary

cv = CountVectorizer()
X_t_vec = cv.fit_transform(X_t)
X_t_vec  = pd.DataFrame.sparse.from_spmatrix(X_t_vec)
X_t_vec.columns = sorted(cv.vocabulary_)
X_t_vec.set_index(y_t.index, inplace=True)


X_val_vec = cv.transform(X_val)
X_val_vec  = pd.DataFrame.sparse.from_spmatrix(X_val_vec)
X_val_vec.columns = sorted(cv.vocabulary_)
X_val_vec.set_index(y_val.index, inplace=True)
mnb = MultinomialNB()

mnb.fit(X_t_vec, y_t)
y_hat = mnb.predict(X_val_vec)
confusion_matrix(y_val, y_hat)
array([[96,  3],
       [ 4, 85]])

Wow! Look how well that performed.

precision_score(y_val, y_hat)
0.9659090909090909
len(cv.vocabulary_)
14819

Let's see whether or not we can maintain that level of precision with fewer words.

cv = CountVectorizer(min_df=.05, max_df=.95)
X_t_vec = cv.fit_transform(X_t)
X_t_vec  = pd.DataFrame.sparse.from_spmatrix(X_t_vec)
X_t_vec.columns = sorted(cv.vocabulary_)
X_t_vec.set_index(y_t.index, inplace=True)

X_val_vec = cv.transform(X_val)
X_val_vec  = pd.DataFrame.sparse.from_spmatrix(X_val_vec)
X_val_vec.columns = sorted(cv.vocabulary_)
X_val_vec.set_index(y_val.index, inplace=True)

mnb = MultinomialNB()

mnb.fit(X_t_vec, y_t)
y_hat = mnb.predict(X_val_vec)

precision_score(y_val, y_hat)
0.9431818181818182
len(cv.vocabulary_)
650
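
To make the effect of `min_df` and `max_df` concrete, here is a rough pure-Python sketch of document-frequency filtering. It illustrates the idea rather than scikit-learn's implementation; the function name `filter_by_document_frequency` and the toy documents are made up for this example.

```python
def filter_by_document_frequency(docs, min_df=0.05, max_df=0.95):
    """Keep words whose share of documents lies between min_df and max_df."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        # set() so that a word is counted at most once per document
        for word in set(doc.split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return sorted(w for w, df in doc_freq.items()
                  if min_df <= df / n_docs <= max_df)

toy_docs = ['the song was popular',
            'the band played',
            'the election was close',
            'the officials agreed']

# 'the' is in every document (too common); most other words are in only one (too rare)
kept = filter_by_document_frequency(toy_docs, min_df=0.5, max_df=0.75)
print(kept)  # ['was']
```

Dropping words at both extremes is how the vocabulary above shrinks from 14819 words to 650.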
# Now let's see what happens with TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X_t_vec = tfidf.fit_transform(X_t)
X_t_vec  = pd.DataFrame.sparse.from_spmatrix(X_t_vec)
X_t_vec.columns = sorted(tfidf.vocabulary_)
X_t_vec.set_index(y_t.index, inplace=True)

X_val_vec = tfidf.transform(X_val)
X_val_vec  = pd.DataFrame.sparse.from_spmatrix(X_val_vec)
X_val_vec.columns = sorted(tfidf.vocabulary_)
X_val_vec.set_index(y_val.index, inplace=True)

mnb = MultinomialNB()

mnb.fit(X_t_vec, y_t)
y_hat = mnb.predict(X_val_vec)

precision_score(y_val, y_hat)
0.9444444444444444

TF-IDF does not necessarily perform better than a count vectorizer. It is just another tool in our toolbelt that we can try out and compare on performance.
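
For intuition about what changes when we switch vectorizers, the sketch below computes TF-IDF weights by hand. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, which is my understanding of `TfidfVectorizer`'s default settings. The function `tfidf_rows` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Smoothed TF-IDF weights with each document row scaled to unit length."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each word
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    idf = {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        weights = {w: tf * idf[w] for w, tf in Counter(tokens).items()}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        rows.append({w: v / norm for w, v in weights.items()})
    return rows

toy_docs = ['trump say say', 'people say']
rows = tfidf_rows(toy_docs)
for row in rows:
    print(row)
```

In the second document both words occur once, but 'people' gets a larger weight than 'say' because 'say' appears in every document. A count vectorizer would give them the same value.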

len(tfidf.vocabulary_)
14819
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(min_df=.05, max_df=.95)
X_t_vec = tfidf.fit_transform(X_t)
X_t_vec  = pd.DataFrame.sparse.from_spmatrix(X_t_vec)
X_t_vec.columns = sorted(tfidf.vocabulary_)
X_t_vec.set_index(y_t.index, inplace=True)

X_val_vec = tfidf.transform(X_val)
X_val_vec  = pd.DataFrame.sparse.from_spmatrix(X_val_vec)
X_val_vec.columns = sorted(tfidf.vocabulary_)
X_val_vec.set_index(y_val.index, inplace=True)

mnb = MultinomialNB()

mnb.fit(X_t_vec, y_t)
y_hat = mnb.predict(X_val_vec)

precision_score(y_val, y_hat)
0.9651162790697675
len(tfidf.vocabulary_)
650

Let's compare MNB to one of our classifiers that has a track record of high performance, Random Forest.

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=1000, max_features=5, max_depth=5)
rf.fit(X_t_vec, y_t)
y_hat = rf.predict(X_val_vec)
precision_score(y_val, y_hat)

Random forest and MNB perform comparably; however, MNB is lightweight in terms of computational power and speed. For real-time predictions, we may choose MNB over random forest because classifications can be performed quickly.
