Giter Site home page Giter Site logo

spam-detection-machine-learning's Introduction

Spam detection using ML

Hao Meng 2020.7

Rewrite in the notebook in 2021.10

Introduction

In my previous notebook, I analyzed how to detect malware using Supervised learning of machine learning. In this notebook, I will discuss how to detect spam emails using machine learning.

Detection of email is different from malware as email only has some words.

In research from Graham (2002), he got a high accuracy of spam email detection by Naive Bayesian filter. I will briefly explain the principle of the Naive Bayesian filter.

1. Bayesian Inference

In general, it is conditional probability. For example. In total, the probability of rain in tomorrow is 0.3. However, if many clouds appear in today's evening, the probability of rain in tomorrow will increase to 0.5. The condition is that colouds appear in the evening of previous day.

In spam email detection, we assume the probability of the spam email is 0.5. However, if "sex" appears in email, the probability of the spam email will increase to 0.8. We can statstic the probability of some key words similar to "sex" from all samples emails.

2. Naive Bayesian

It can't just judge whether it is a spam email by one word. There are more keywords like "Gamble," "smoke," and so on. If every word's conditional probability is independent, the Bayesian Inference method will be transferred to the Naive Bayesian Inference method. And the function of the combine probabilities is:

Combine Probabilities

The P with a subscript represents one word's condition probability, and in the function, there are 15 words to calc the spam probability together. Later more words can be added to the function.

Dataset

This notebook will use Ling-Spam Dataset provided by Ion Androutsopoulos(https://www.kaggle.com/mandygu/lingspam-dataset?rvi=1).

Reference Materials

In this notebook, I will use the feature engineering method mainly from the notebook(https://www.kaggle.com/surekharamireddy/spam-detection-with-99-accuracy).

Flow

  1. Feature engineering
  2. Vectorlize
  3. Analysing by some models
  4. Analysing by Naive Bayesian model

Reference Graham, P. (2002, August). A Plan for Spam. Paul Graham. http://www.paulgraham.com/spam.html

spam-detection-machine-learning's People

Contributors

brycemeng avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.