triakshunn / phishytics-machine-learning-for-phishing

This project forked from faizann24/phishytics-machine-learning-for-phishing

Machine Learning for Phishing Website Detection

Home Page: https://faizanahmad.tech/blog/2020/02/phishytics-machine-learning-for-phishing-websites-detection/

License: MIT License

Shell 0.03% Python 1.32% HTML 98.60% Dockerfile 0.06%

phishytics-machine-learning-for-phishing's Introduction

Phishytics - Machine Learning for Detecting Phishing Websites

Machine Learning and Random Forests with Byte Pair Encoding and TFIDF scores for Phishing Website Detection.

Python 3.6

This repository contains code for the blog post Phishytics – Machine Learning for Detecting Phishing Websites. Using Random Forests on top of Byte Pair Encoding and TFIDF scores, we can obtain a highly accurate model for predicting phishing websites. Pre-trained models are also provided for users and companies to use.

Files Description

Path Description
phishytics-machine-learning-for-phishing Main folder.
└  tokenizer Folder to store the tokenizer output files.
└  saved_models Folder to save trained models.
└  pretrained_models Folder containing required model files to run the pre-trained phishing detection model.
└  labeled_data Folder containing data for phishing and legitimate websites.
   ├  phishing_htmls HTML files of phishing web pages. Please do not change the folder names.
   ├  legitimate_htmls HTML files of legitimate web pages. Please do not change the folder names.
├  create_data_for_tokenization.py Create data for tokenization and apply byte pair encodings to get tokens.
├  train_phishing_detection_model.py Train a phishing website detection model.
├  test_model.py Test a website for phishing using a random forest model you trained yourself.
├  test_pretrained_model.py Test a fully trained Random Forest model with 99% test accuracy on any given website.

Usage

Packages

You will need to install the following package to train and test the models.

Training your own Model

1. Data Tokenization

To train a phishing website detection model, you first need to split all the HTML files into tokens using Byte Pair Encoding (BPE). We will use the tokenizer library for this. Once the HTML files are in their respective folders, run the following command.

python3 create_data_for_tokenization.py --labeled_data_folder labeled_data --vocab_size 300 --min_frequency 3

The script takes three parameters as inputs:

  • labeled_data_folder: Folder containing data for phishing and legitimate websites.
  • vocab_size: Maximum number of tokens to keep in the vocabulary.
  • min_frequency: Tokens with a frequency lower than this value are ignored.
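
To make the roles of vocab_size and min_frequency concrete, here is a toy character-level BPE trainer in plain Python. It is only an illustrative sketch, not the code used by create_data_for_tokenization.py or by the tokenizer library; the function name train_bpe and the exact stopping rules are assumptions made for this example.

```python
from collections import Counter


def train_bpe(texts, vocab_size, min_frequency):
    """Learn BPE merges over characters (toy version, not the library's code)."""
    # Start with each document as a list of single-character symbols.
    sequences = [list(text) for text in texts]
    vocab = {symbol for seq in sequences for symbol in seq}
    merges = []

    while len(vocab) < vocab_size:
        # Count every adjacent pair of symbols across all documents.
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best_pair, count = pair_counts.most_common(1)[0]
        # Assumption: pairs rarer than min_frequency are never merged.
        if count < min_frequency:
            break
        merged = best_pair[0] + best_pair[1]
        merges.append(best_pair)
        vocab.add(merged)
        # Replace each occurrence of the best pair with the merged symbol.
        for i, seq in enumerate(sequences):
            out, j = [], 0
            while j < len(seq):
                if j + 1 < len(seq) and (seq[j], seq[j + 1]) == best_pair:
                    out.append(merged)
                    j += 2
                else:
                    out.append(seq[j])
                    j += 1
            sequences[i] = out
    return vocab, merges
```

Calling train_bpe(["<div><div>", "<div>"], vocab_size=300, min_frequency=3) merges the characters of the repeated tag step by step until "<div>" is a single token, then stops because no remaining pair occurs at least three times. A smaller vocab_size stops the merging earlier.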

2. Model Training

Once we have created a Byte Pair Encoding tokenizer, we can use it to tokenize HTML files and extract features for machine learning. On top of the BPE tokens, we apply TFIDF scores to get a feature representation of each HTML file. Run the following command to train your own model.

python3 train_phishing_detection_model.py --tokenizer_folder tokenizer/ --labeled_data_folder labeled_data/ --ignore_other_languages 1 --apply_different_thresholds 1 --save_model_dir saved_models

The script takes five parameters as inputs:

  • tokenizer_folder: Folder containing tokenizer files. The default folder is 'tokenizers'
  • labeled_data_folder: Folder containing data for phishing and legitimate websites.
  • ignore_other_languages: Whether to ignore languages other than English. Set it to 0 if you want to include all languages.
  • apply_different_thresholds: Whether to apply different confidence thresholds during model evaluation.
  • save_model_dir: Directory in which to save the model files.
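
The TFIDF step mentioned above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: it uses the textbook weighting (term frequency times log of inverse document frequency) with no smoothing or normalization, which may differ from what train_phishing_detection_model.py actually computes. The function name tfidf_vectors is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return the sorted vocabulary and one TF-IDF vector per document."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each token.
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))
    vocab = sorted(doc_freq)

    vectors = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append([
            (counts[tok] / total) * math.log(n_docs / doc_freq[tok])
            for tok in vocab
        ])
    return vocab, vectors
```

Each HTML file, once tokenized with BPE, becomes one fixed-length vector. Tokens that appear in every document get a weight of zero, so the classifier sees mostly the tokens that distinguish one page from another. Vectors of this kind can then be passed to a random forest classifier together with the phishing/legitimate labels.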

3. Model Testing

Once we have a trained model, we can simply test it live on any website using the following command.

python3 test_model.py --tokenizer_folder tokenizer --threshold 0.5 --model_dir saved_models --website_to_test https://www.google.com

The script takes four parameters as inputs:

  • tokenizer_folder: Folder containing tokenizer files. The default folder is 'tokenizers'
  • threshold: Threshold to use for making final predictions. By default, the value is 0.5.
  • model_dir: Directory where saved model files exist.
  • website_to_test: Website you want to test. Please add "http://" or "https://" before the website to make everything work. Otherwise, you will face an error.
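
The threshold and the URL requirement described above can be illustrated with a short sketch. Both helper names are hypothetical, and the comparison rule (a probability at or above the threshold counts as phishing) is an assumption; test_model.py may handle the boundary differently.

```python
from urllib.parse import urlparse


def has_http_scheme(url):
    """True if the URL has an http or https scheme and a host part."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def label_from_probability(phishing_probability, threshold=0.5):
    """Map a phishing probability to a label using the decision threshold."""
    # Assumption: probabilities at or above the threshold are flagged.
    return "phishing" if phishing_probability >= threshold else "legitimate"
```

For example, has_http_scheme("www.google.com") is false because the scheme is missing, which is the case the script warns about. Raising the threshold makes the model flag fewer pages as phishing, trading missed detections for fewer false alarms.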

Using Pre-trained Model

To use the pre-trained model, please go to the 'pretrained_models' directory and unzip the 'document-frequency-dictionary.zip' file. Do not unzip it in a new directory, keep it in the same directory. Once that is done, you can run the following command to use the pre-trained model.

python3 test_pretrained_model.py --tokenizer_folder pretrained_models --threshold 0.5 --model_dir pretrained_models --website_to_test https://www.google.com

The script takes four parameters as inputs:

  • tokenizer_folder: Folder containing tokenizer files. The default folder is 'tokenizers' but here we will use 'pretrained_models'.
  • threshold: Threshold to use for making final predictions. By default, the value is 0.5.
  • model_dir: Directory where saved model files exist. The pre-trained model files exist in 'pretrained_models'.
  • website_to_test: Website you want to test. Please add "http://" or "https://" before the website to make everything work. Otherwise, you will face an error.
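
The unzip step at the start of this section can also be done from Python. The sketch below extracts an archive into the directory that already contains it, so the extracted file ends up next to the zip file as required. The helper name unzip_in_place is hypothetical and not part of the repository.

```python
import zipfile
from pathlib import Path


def unzip_in_place(zip_path):
    """Extract an archive into the directory that already contains it."""
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        # Extract next to the archive rather than into a new subdirectory.
        archive.extractall(zip_path.parent)
        return archive.namelist()
```

For the pre-trained model this would be called as unzip_in_place("pretrained_models/document-frequency-dictionary.zip") from the main folder.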

License

MIT

Copyright (c) 2020-present, Faizan Ahmad
