justinm0rgan / dsc-document-classification-with-naive-bayes-lab

This project forked from learn-co-curriculum/dsc-document-classification-with-naive-bayes-lab

License: Other

Document Classification with Naive Bayes - Lab

Introduction

In this lab, you'll practice implementing the Naive Bayes algorithm on your own.

Objectives

In this lab you will:

  • Implement document classification using Naive Bayes

Import the dataset

To start, import the dataset stored in the text file 'SMSSpamCollection'.

# Your code here
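One possible way to load the file is sketched below. It assumes the file is tab-separated, has no header row, and holds two columns (the label, then the message text). The column names `label` and `text` and the helper `load_sms` are illustrative choices, not something the lab prescribes.

```python
import csv
import io

import pandas as pd


def load_sms(path_or_buffer):
    """Read tab-separated (label, text) rows that have no header line."""
    return pd.read_csv(
        path_or_buffer,
        sep="\t",
        header=None,
        names=["label", "text"],
        quoting=csv.QUOTE_NONE,  # treat quote characters in messages as plain text
    )


# Tiny in-memory stand-in for the real file, so the sketch runs on its own.
sample = io.StringIO('ham\tSee you at 5\nspam\tWIN a "free" prize now\nham\tOk\n')
df = load_sms(sample)
print(df["label"].value_counts())
```

With the real data, the call would be `df = load_sms('SMSSpamCollection')`, assuming the file sits in the working directory.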

Account for class imbalance

To help your algorithm perform more accurately, subset the dataset so that the two classes are of equal size. To do this, keep every instance of the minority class (spam) and randomly sample an equal number of instances from the majority class (ham).

# Your code here
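A minimal sketch of downsampling the majority class with pandas follows. The toy DataFrame, its column names (`label`, `text`), and the random seed are assumptions made so the example runs on its own.

```python
import pandas as pd

# Toy data standing in for the full dataset (assumed columns: label, text).
df = pd.DataFrame({
    "label": ["ham"] * 6 + ["spam"] * 2,
    "text": [f"message {i}" for i in range(8)],
})

# Keep every spam row.
spam = df[df["label"] == "spam"]

# Randomly sample as many ham rows as there are spam rows (seed is arbitrary).
ham = df[df["label"] == "ham"].sample(n=len(spam), random_state=0)

# Recombine and shuffle so the two classes are interleaved.
balanced = (
    pd.concat([spam, ham])
    .sample(frac=1, random_state=0)
    .reset_index(drop=True)
)
print(balanced["label"].value_counts())
```

Downsampling discards data from the majority class. That is an acceptable trade-off for this exercise, but it is not the only way to handle imbalance.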

Train-test split

Now implement a train-test split on the dataset:

# Your code here
from sklearn.model_selection import train_test_split
X = None
y = None
X_train, X_test, y_train, y_test = None, None, None, None
train_df = None
test_df = None
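The scaffold above could be filled in along these lines. This is a sketch on toy data: the column names, `test_size`, `random_state`, and the use of `stratify` are all assumptions rather than requirements of the lab.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy balanced data standing in for the real messages (assumed column names).
df = pd.DataFrame({
    "label": ["ham", "spam"] * 4,
    "text": [f"message {i}" for i in range(8)],
})

X = df["text"]
y = df["label"]

# stratify=y keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=17, stratify=y
)

# Rejoin features and labels so each split is a single DataFrame.
train_df = pd.concat([X_train, y_train], axis=1)
test_df = pd.concat([X_test, y_test], axis=1)
print(train_df.shape, test_df.shape)
```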

Create the word frequency dictionary for each class

Create a word frequency dictionary for each class:

# Your code here
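As a sketch of the idea, the per-class counts can be built with `collections.Counter`. Here the training data is assumed to be a list of `(label, text)` pairs and tokens are produced by plain whitespace splitting. With a DataFrame, the pairs could come from something like `zip(train_df["label"], train_df["text"])`, if those are your column names.

```python
from collections import Counter, defaultdict

# Toy training data: (label, text) pairs.
train_docs = [
    ("ham", "are you coming home"),
    ("spam", "win a free prize"),
    ("spam", "free prize inside"),
    ("ham", "call me when you are home"),
]

# Maps each class label to a Counter of word -> occurrences within that class.
class_word_freq = defaultdict(Counter)
for label, text in train_docs:
    class_word_freq[label].update(text.split())

print(class_word_freq["spam"].most_common(2))
```

Whitespace splitting keeps punctuation and case as they are. A real pipeline would likely normalize tokens first.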

Count the total corpus words

Calculate V, the number of unique words in the corpus (the vocabulary size):

# Your code here
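The sketch below assumes that V denotes the vocabulary size, meaning the number of distinct tokens, which is the quantity that add-one (Laplace) smoothing uses in its denominator. The toy texts and the whitespace tokenization are illustrative.

```python
# Toy training texts standing in for the training split.
train_texts = [
    "are you coming home",
    "win a free prize",
    "free prize inside",
    "call me when you are home",
]

# Collect every distinct token seen in the training texts.
vocabulary = set()
for text in train_texts:
    vocabulary.update(text.split())

V = len(vocabulary)
print(V)
```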

Create a bag of words function

Before implementing the entire Naive Bayes algorithm, create a helper function bag_it() to create a bag of words representation from a document's text.

# Your code here
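One minimal way to write such a helper is sketched here. It assumes tokens are separated by whitespace and does no lowercasing or punctuation stripping.

```python
from collections import Counter


def bag_it(doc):
    """Return a dict mapping each whitespace-separated token in doc to its count."""
    return dict(Counter(doc.split()))


print(bag_it("free free prize"))
```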

Implementing Naive Bayes

Now implement a master function that classifies a document with Naive Bayes. Be sure to sum log probabilities rather than multiplying raw probabilities, to avoid numerical underflow.

# Your code here
def classify_doc(doc, class_word_freq, p_classes, V, return_posteriors=False):
    pass
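A hedged sketch of how the function might be filled in is given below. It assumes add-one (Laplace) smoothing, whitespace tokenization, `class_word_freq` as a mapping from class to word counts, and `p_classes` as a mapping from class to prior probability. The values returned when `return_posteriors=True` are unnormalized log posteriors, not probabilities.

```python
import math
from collections import Counter


def classify_doc(doc, class_word_freq, p_classes, V, return_posteriors=False):
    bag = Counter(doc.split())
    classes = list(class_word_freq)
    log_posteriors = []
    for c in classes:
        # Total number of word occurrences observed in class c.
        total = sum(class_word_freq[c].values())
        # Start from the log prior, then add one log-likelihood per token occurrence.
        score = math.log(p_classes[c])
        for word, count in bag.items():
            smoothed = (class_word_freq[c].get(word, 0) + 1) / (total + V)
            score += count * math.log(smoothed)
        log_posteriors.append(score)
    if return_posteriors:
        return log_posteriors
    return classes[log_posteriors.index(max(log_posteriors))]


# Toy inputs (illustrative values only).
class_word_freq = {
    "ham": Counter("are you coming home call me when you are home".split()),
    "spam": Counter("win a free prize free prize inside".split()),
}
p_classes = {"ham": 0.5, "spam": 0.5}
V = 12

print(classify_doc("free prize", class_word_freq, p_classes, V))
```

Adding logs is equivalent to multiplying the probabilities, but it keeps the scores in a range that floating-point numbers can represent even for long documents.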

Test your classifier

Finally, test your classifier and measure its accuracy. Don't be perturbed if your results are sub-par; industry use cases would require substantial additional preprocessing before implementing the algorithm in practice.

# Your code here
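Accuracy is the fraction of test documents whose predicted label matches the true label. In the sketch below, `toy_classifier` is a hypothetical keyword rule used only as a stand-in for your own classifier, and the test documents are made up.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def toy_classifier(doc):
    # Hypothetical placeholder: swap in a call to your Naive Bayes classifier.
    return "spam" if "free" in doc.split() else "ham"


# Made-up held-out examples: (true label, text).
test_docs = [
    ("ham", "see you soon"),
    ("spam", "free prize"),
    ("spam", "win cash now"),
    ("ham", "free tonight?"),
]

y_true = [label for label, _ in test_docs]
y_pred = [toy_classifier(text) for _, text in test_docs]
print(accuracy(y_true, y_pred))
```

On a balanced test set, accuracy is a reasonable first metric. On an imbalanced one, it can look good even when the minority class is mostly missed.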

Level up (Optional)

Rework your code into an appropriate class structure so that you could easily implement the algorithm on any given dataset.
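One possible shape for such a class is sketched below, loosely following the `fit` / `predict` convention. The class name, method names, add-one smoothing, and whitespace tokenization are all assumptions, not part of the lab.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over whitespace tokens with add-one smoothing."""

    def fit(self, docs, labels):
        # Per-class word counts.
        self.class_word_freq = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.class_word_freq[label].update(doc.split())
        # Log priors from the share of documents in each class.
        n_docs = len(labels)
        self.log_priors = {
            c: math.log(n / n_docs) for c, n in Counter(labels).items()
        }
        # Total word occurrences per class and overall vocabulary size.
        self.class_totals = {
            c: sum(freq.values()) for c, freq in self.class_word_freq.items()
        }
        self.V = len({w for freq in self.class_word_freq.values() for w in freq})
        return self

    def predict_one(self, doc):
        scores = {}
        for c, log_prior in self.log_priors.items():
            denom = self.class_totals[c] + self.V
            freq = self.class_word_freq[c]
            scores[c] = log_prior + sum(
                math.log((freq[w] + 1) / denom) for w in doc.split()
            )
        return max(scores, key=scores.get)

    def predict(self, docs):
        return [self.predict_one(d) for d in docs]


# Toy data to exercise the class.
docs = [
    "are you coming home",
    "win a free prize",
    "free prize inside",
    "call me when you are home",
]
labels = ["ham", "spam", "spam", "ham"]

clf = NaiveBayesTextClassifier().fit(docs, labels)
print(clf.predict(["free prize", "are you home"]))
```

Keeping the learned counts on the instance means the same object can be fitted on any labeled collection of texts without relying on global variables.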

Summary

Well done! In this lab, you practiced implementing Naive Bayes for document classification!

Contributors

loredirick, matthew-mitchell, sumedh10, mathymitchell
