
qwqoro / ml-corrupted

📊 [ML] Classification Problem Solution: Guessing the type of a corrupted file

License: GNU General Public License v3.0


ml-corrupted's Introduction


[ Have a look at the related Jupyter Notebook: ML-Corrupted.ipynb ]
[!] This README is only a brief overview of the notebook contents!

Problem | Results | Approach | Resources


// Disclaimer:
  I created this out of interest in the summer of 2019 at the age of 15
  Resources, links and libraries were updated in the spring of 2022
→ Machine Learning level: Script Kiddie

📉 Problem

Solving CTF tasks eventually led me to corrupted files. When they lack distinctive features such as signatures, most tools identify them as "Text files" or "Data", which is not useful at all, to be honest.

To proceed, it would be great to know at least their types. Some files might be too broken to be fixed unless you know the exact file format, which parts are missing and which part you have in front of you. Still, I believe that guessing types may be useful in some cases, e.g. criminal investigations, where it might be important to find the remains of what used to be evidence and analyze them thoroughly. Thus, my aim was to build models that guess the types of broken files.

📊 Results


Test set of files + knnRecG8 (./guess)

🔍 Approach

Data

> Resources   [ Every piece of data and all copyrights belong to the original owners ]

Features

A total of 7 DataFrames were made:

  1. S = Top 10 bytes in each part of a file (file is split into 3 parts)
  2. G4 = Top 20 byte 4-Grams (stride=1) in a file
  3. G6 = Top 20 byte 6-Grams (stride=2) in a file
  4. G8 = Top 20 byte 8-Grams (stride=4) in a file
  5. G4S = Top 10 byte 4-Grams (stride=1) in each part of a file (file is split into 3 parts)
  6. G6S = Top 10 byte 6-Grams (stride=2) in each part of a file (file is split into 3 parts)
  7. G8S = Top 10 byte 8-Grams (stride=4) in each part of a file (file is split into 3 parts)

S → Split, GN → N-Grams
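The feature definitions above can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names `top_ngrams` and `split_parts` are hypothetical, the sketch counts raw byte slices with `collections.Counter`, and the notebook itself may work differently (for example on hex strings via `binascii`).

```python
from collections import Counter


def top_ngrams(data: bytes, n: int, stride: int, k: int) -> list[bytes]:
    """Return the k most frequent byte n-grams, sampled every `stride` bytes.

    Hypothetical helper: ties are returned in first-seen order.
    """
    counts = Counter(data[i:i + n] for i in range(0, len(data) - n + 1, stride))
    return [gram for gram, _ in counts.most_common(k)]


def split_parts(data: bytes, parts: int = 3) -> list[bytes]:
    """Split data into `parts` contiguous chunks of near-equal length."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks get one additional byte each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


sample = b"abababcd"
print(top_ngrams(sample, n=2, stride=1, k=1))
print(split_parts(b"abcdefgh"))
```

Under these assumptions, a G4 row would correspond to `top_ngrams(data, n=4, stride=1, k=20)`, and a G4S row to `top_ngrams(part, n=4, stride=1, k=10)` applied to each of the three chunks from `split_parts(data)`. The S features would be the same call with `n=1`.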

Algorithms & boosts

Scikit-Learn + XGBoost + LightGBM + CatBoost:

from sklearn.ensemble import RandomForestClassifier      # Random Forest Classifier
from sklearn.neighbors import KNeighborsClassifier       # KNN Classifier
from xgboost import XGBClassifier                        # XGBoost Classifier
from lightgbm import LGBMClassifier                      # LightGBM Classifier
from catboost import CatBoostClassifier                  # CatBoost Classifier

from sklearn.model_selection import RandomizedSearchCV   # Randomized search on hyperparameters

Models were built with every algorithm listed above and trained on each of the DataFrames specified. For each combination there are two models: one with default settings and one with the hyperparameters recommended by RandomizedSearchCV.
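A minimal sketch of such a default-versus-recommended pair, using KNN as the example. The synthetic data from `make_classification` stands in for one feature DataFrame, and the search space in `param_distributions` is a hypothetical choice; neither is taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for one feature DataFrame (assumption: not the real data).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=4, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Model 1: default settings.
knn_default = KNeighborsClassifier().fit(X_train, y_train)

# Model 2: hyperparameters recommended by RandomizedSearchCV.
# The search space below is hypothetical.
param_distributions = {
    "n_neighbors": list(range(1, 16)),
    "weights": ["uniform", "distance"],
    "p": [1, 2],
}
search = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions=param_distributions,
    n_iter=10,
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)
knn_rec = search.best_estimator_  # refit on the whole training split

acc_default = accuracy_score(y_val, knn_default.predict(X_val))
acc_rec = accuracy_score(y_val, knn_rec.predict(X_val))
print(f"default: {acc_default:.3f}, recommended: {acc_rec:.3f}")
print("recommended hyperparameters:", search.best_params_)
```

The recommended model is not guaranteed to beat the default one on the validation split, which is why both variants are kept and compared.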

The total number of models is 70 (7 DataFrames * 5 Algorithms * 2 Sets of hyperparameters):

rfcS, rfcRecS, rfcG4, rfcRecG4, rfcG6, rfcRecG6, rfcG8, rfcRecG8, rfcG4S, rfcRecG4S, rfcG6S, rfcRecG6S, rfcG8S, rfcRecG8S,
knnS, knnRecS, knnG4, knnRecG4, knnG6, knnRecG6, knnG8, knnRecG8, knnG4S, knnRecG4S, knnG6S, knnRecG6S, knnG8S, knnRecG8S,
xgbS, xgbRecS, xgbG4, xgbRecG4, xgbG6, xgbRecG6, xgbG8, xgbRecG8, xgbG4S, xgbRecG4S, xgbG6S, xgbRecG6S, xgbG8S, xgbRecG8S,
lgbmS, lgbmRecS, lgbmG4, lgbmRecG4, lgbmG6, lgbmRecG6, lgbmG8, lgbmRecG8, lgbmG4S, lgbmRecG4S, lgbmG6S, lgbmRecG6S, lgbmG8S, lgbmRecG8S,
cbcS, cbcRecS, cbcG4, cbcRecG4, cbcG6, cbcRecG6, cbcG8, cbcRecG8, cbcG4S, cbcRecG4S, cbcG6S, cbcRecG6S, cbcG8S, cbcRecG8S
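The naming scheme follows the pattern algorithm prefix + optional `Rec` + DataFrame name, so the list can be reproduced with a short sketch (the variable names here are illustrative only):

```python
from itertools import product

algorithms = ["rfc", "knn", "xgb", "lgbm", "cbc"]
frames = ["S", "G4", "G6", "G8", "G4S", "G6S", "G8S"]
variants = ["", "Rec"]  # "" = default settings, "Rec" = RandomizedSearchCV

# product() varies the last iterable fastest, matching the order of the list.
model_names = [
    f"{algo}{variant}{frame}"
    for algo, frame, variant in product(algorithms, frames, variants)
]

print(len(model_names))
print(model_names[:4])
```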

Accuracy calculation & comparison

from sklearn.metrics import accuracy_score               # Accuracy score

# [ Accuracy Score = (True Positives + True Negatives) / (True Positives + False Positives + True Negatives + False Negatives) ]
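The formula above is written in binary terms. With more than two file types, accuracy reduces to the fraction of predictions that match the true labels, which is what `accuracy_score` computes by default. A minimal pure-Python sketch of that calculation; the `accuracy` helper and the file-type labels are hypothetical:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty label list")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels: 3 of the 4 predictions match.
y_true = ["png", "mp4", "wav", "png"]
y_pred = ["png", "mp4", "png", "png"]
print(accuracy(y_true, y_pred))  # 0.75
```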

Heatmaps were also used to visually compare the accuracy scores of different models' predictions on the validation sets. For example, the heatmap below depicts the difference between the K-Nearest Neighbors model trained on G8 with default hyperparameters (knnG8) and the one with hyperparameters recommended by RandomizedSearchCV (knnRecG8):

Libraries

| Data processing | Algorithms & boosting | Visualization | Misc |
| --- | --- | --- | --- |
| numpy 1.21.6 | scikit-learn 1.0.2 | matplotlib 3.5.2 | binascii |
| pandas 1.3.5 | xgboost 1.6.1 | seaborn 0.11.2 | collections |
| | lightgbm 3.3.2 | | tqdm 4.35.0 |
| | catboost 1.0.6 | | dill 0.3.5.1 |

Hardware

The notebook was run on the hardware listed below. The time spent on each crucial step can be seen in the notebook, next to each tqdm progress bar.

RAM: 12 GB
GPU: NVIDIA GeForce GTX 1060
Processor: Intel(R) Core(TM) i5-8300H

📚 Resources

Datasets:

[1] Chao Dong, Chen Change Loy, Xiaoou Tang. Accelerating the Super-Resolution Convolutional Neural Network, in Proceedings of European Conference on Computer Vision (ECCV), 2016 arXiv:1608.00367

[2] Zhang Zhifei, Song Yang, and Qi Hairong. "Age Progression/Regression by Conditional Adversarial Autoencoder". IEEE Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:1702.08423, 2017

[3] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A Large Video Database for Human Motion Recognition. ICCV, 2011.

[4] Eduardo Fonseca, Manoj Plakal, Daniel P. W. Ellis, Frederic Font, Xavier Favory, and Xavier Serra, "Learning Sound Event Classifiers from Web Audio with Noisy Labels", arXiv preprint arXiv:1901.01189, 2019

[5] Royi Ronen, Marian Radu, Corina Feuerstein, Elad Yom-Tov, Mansour Ahmadi. "Microsoft Malware Classification Challenge". arXiv:1802.10135, 2018
