This project forked from futurecomputing4ai/kilograms

KiloGram algorithm for finding the top-k most frequent n-grams for large values of n quickly with fixed memory.

Java 100.00%

KiloGrams

This is the Java code implementing the KiloGrams algorithm from our paper, KiloGrams: Very Large N-Grams for Malware Classification. Using it, you can extract the top-k most frequent n-grams from a corpus using a fixed amount of memory, for large values of k and n. In our original paper, we tested with n up to 8192, which took the same time or less than processing 6-grams.
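To make the idea of counting n-grams within a bounded table concrete, here is a minimal two-pass sketch. It is a simplification written for illustration, not the code in this repository: the class name HashedTopK, the polynomial hash, and the buckets parameter are all assumptions. Pass 1 counts hashed byte n-grams in a fixed-size array, and pass 2 keeps exact counts only for n-grams that fall into the k fullest buckets. Unlike the real algorithm, pass 2 here does not bound its memory, and collisions can cause a true top-k n-gram to be missed.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class HashedTopK {
    /** Bucket for the n bytes starting at offset (simple polynomial hash, an illustrative choice). */
    static int bucket(byte[] data, int offset, int n, int buckets) {
        int h = 17;
        for (int i = 0; i < n; i++) h = 31 * h + data[offset + i];
        return Math.floorMod(h, buckets);
    }

    /** Returns up to k n-grams, most frequent first, using a table of `buckets` counters in pass 1. */
    static List<String> topK(byte[] data, int n, int k, int buckets) {
        // Pass 1: count hashed n-grams in a fixed-size table.
        int[] counts = new int[buckets];
        for (int i = 0; i + n <= data.length; i++) counts[bucket(data, i, n, buckets)]++;

        // Keep the k fullest buckets as candidates.
        Integer[] order = new Integer[buckets];
        for (int b = 0; b < buckets; b++) order[b] = b;
        Arrays.sort(order, (x, y) -> Integer.compare(counts[y], counts[x]));
        Set<Integer> keep = new HashSet<>(Arrays.asList(order).subList(0, Math.min(k, buckets)));

        // Pass 2: exact counts, but only for n-grams that land in a candidate bucket.
        // ISO_8859_1 maps each byte to exactly one char, so the String is a lossless key.
        Map<String, Integer> exact = new HashMap<>();
        for (int i = 0; i + n <= data.length; i++) {
            if (keep.contains(bucket(data, i, n, buckets))) {
                exact.merge(new String(data, i, n, StandardCharsets.ISO_8859_1), 1, Integer::sum);
            }
        }

        // Rank by count (descending), breaking ties by the n-gram itself.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(exact.entrySet());
        entries.sort((x, y) -> {
            int c = Integer.compare(y.getValue(), x.getValue());
            return c != 0 ? c : x.getKey().compareTo(y.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) result.add(entries.get(i).getKey());
        return result;
    }

    public static void main(String[] args) {
        byte[] data = "abababcd".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(topK(data, 2, 2, 64));
    }
}
```

The pass-1 table size is set by buckets alone, independent of n and of how many distinct n-grams the corpus contains. For the actual algorithm and its guarantees, refer to the paper and the repository source.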

This is research code, and comes with no warranty or support.

Quick Start

You can use this code to create a dataset based on the top-k n-grams. To do so, after building the KiloGrams code, you can run a command like this:

java -Xmx10G -jar Kilograms-1.0-jar-with-dependencies.jar NGram -n 8 -k 1000 -g <path to goodware> -b <path to malware> -o grams.dat

The top-k n-grams are saved in grams.dat, a binary-formatted file. To recover the n-grams themselves, see the NGram.java or Featurizer.java source code for the layout of the binary format and how to parse it. If you use a value of n > 8, we recommend adding the hashing-stride option, -hs. For example, for n=1024 grams, we would use -hs 256.

To create a dataset from the above code, you can use the following command:

java -Xmx10G -jar Kilograms-1.0-jar-with-dependencies.jar DATASET -g <path to goodware> -b <path to malware> -h grams.dat -o data.libsvm

By default, this will produce a file in the libsvm format, which scikit-learn can read (for example, with sklearn.datasets.load_svmlight_file).
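If you would rather consume the output from Java, the libsvm text format is simple to parse: each line is a label followed by index:value pairs. The sketch below is a minimal reader for one such line. The class name LibsvmLine and the sample line are hypothetical, and nothing here reflects what feature values the DATASET step actually writes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Minimal reader for one line of libsvm text: "<label> <index>:<value> ...". */
final class LibsvmLine {
    final double label;
    final Map<Integer, Double> features = new LinkedHashMap<>();

    private LibsvmLine(double label) {
        this.label = label;
    }

    static LibsvmLine parse(String line) {
        // Drop an optional trailing "# comment", then split on whitespace.
        int hash = line.indexOf('#');
        if (hash >= 0) line = line.substring(0, hash);
        String[] tokens = line.trim().split("\\s+");

        // The first token is the label; every later token is index:value.
        LibsvmLine row = new LibsvmLine(Double.parseDouble(tokens[0]));
        for (int i = 1; i < tokens.length; i++) {
            int colon = tokens[i].indexOf(':');
            int index = Integer.parseInt(tokens[i].substring(0, colon));
            double value = Double.parseDouble(tokens[i].substring(colon + 1));
            row.features.put(index, value);
        }
        return row;
    }

    public static void main(String[] args) {
        // Hypothetical sample line, not taken from real output.
        LibsvmLine row = parse("1 3:2 17:1 42:5");
        System.out.println(row.label + " " + row.features);
    }
}
```

Features that do not appear on a line are implicitly zero, which is why the format suits sparse n-gram features.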

If you have a machine with a very large number of cores, or very large files, you may want to increase the maximum memory for Java (the -Xmx value in the commands above), depending on the JVM you use.

The folders given as input do not have to be executables, or even benign/malicious. They can be any kind of files, and the code will process byte n-grams. The DATASET creation step also supports multi-class problems by using the -mc <path to class 0> <path to class 1> ... <path to class C> flag instead of -b and -g.
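As a rough illustration of how byte n-grams and class labels can turn into a libsvm line, the sketch below slides a window over a byte array and counts hits against a list of selected n-grams. It is not the repository's Featurizer: the class name NGramFeaturizer, the 1-based feature indices, and the use of occurrence counts as feature values are all assumptions made for this example. It also assumes a non-empty list of n-grams that all have the same length.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class NGramFeaturizer {
    /** Builds one libsvm line: the class label, then index:count for each selected n-gram that occurs. */
    static String toLibsvm(int label, byte[] data, List<String> grams) {
        // Map each selected n-gram to a 1-based feature index (an assumed convention).
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < grams.size(); i++) index.put(grams.get(i), i + 1);
        int n = grams.get(0).length(); // all n-grams are assumed to share this length

        // A sorted map keeps feature indices in ascending order in the output.
        TreeMap<Integer, Integer> counts = new TreeMap<>();
        for (int i = 0; i + n <= data.length; i++) {
            // ISO_8859_1 maps each byte to exactly one char, so any byte content works.
            Integer idx = index.get(new String(data, i, n, StandardCharsets.ISO_8859_1));
            if (idx != null) counts.merge(idx, 1, Integer::sum);
        }

        StringBuilder sb = new StringBuilder().append(label);
        counts.forEach((idx, c) -> sb.append(' ').append(idx).append(':').append(c));
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] data = "abababcd".getBytes(StandardCharsets.ISO_8859_1);
        // The label 2 stands in for "class 2" of a multi-class problem.
        System.out.println(toLibsvm(2, data, Arrays.asList("ab", "ba", "zz")));
    }
}
```

Because the window operates on raw bytes, the same loop applies to any file type, and the label can be any class index rather than only benign or malicious.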

Citations

If you use the KiloGrams algorithm or code, please cite our work!

@inproceedings{Kilograms_2019,
author = {Raff, Edward and Fleming, William and Zak, Richard and Anderson, Hyrum and Finlayson, Bill and Nicholas, Charles K. and Mclean, Mark},
booktitle = {Proceedings of KDD 2019 Workshop on Learning and Mining for Cybersecurity (LEMINCS'19)},
title = {{KiloGrams: Very Large N-Grams for Malware Classification}},
url = {https://arxiv.org/abs/1908.00200},
year = {2019}
}

Contact

If you have questions, please contact:

Mark Mclean [email protected]
Edward Raff [email protected]
Richard Zak [email protected]
