mandiant / stringsifter

A machine learning tool that ranks strings based on their relevance for malware analysis.

License: Apache License 2.0

Languages: Python 99.60%, Dockerfile 0.40%
Topics: machine-learning, fireeye-flare, strings, malware-analysis, learning-to-rank, reverse-engineering, fireeye-data-science

stringsifter's Introduction


StringSifter is a machine learning tool that automatically ranks strings based on their relevance for malware analysis.

Quick Links

Usage

StringSifter requires Python version 3.9 or newer. Run the following commands to get the code, run unit tests, and use the tool:

Installation

pip install stringsifter

For development, use poetry:

git clone https://github.com/mandiant/stringsifter.git
cd stringsifter
poetry install --with dev

Running Unit Tests

To run unit tests from the StringSifter installation directory:

poetry run tests -v

Running from the Command Line

The pip install command installs two runnable scripts, flarestrings and rank_strings, into your Python environment. When developing from source, use poetry run flarestrings and poetry run rank_strings.

flarestrings mimics features of GNU binutils' strings, and rank_strings accepts piped input, for example:

flarestrings <my_sample> | rank_strings

rank_strings supports a number of command line arguments. The positional argument input_strings specifies a file of strings to rank. The optional arguments are:

Option             Meaning
--scores (-s)      Include the rank scores in the output
--limit (-l)       Limit output to the top limit ranked strings
--min-score (-m)   Limit output to strings with score >= min-score
--batch (-b)       Specify a folder of strings outputs for batch processing

Ranked strings are written to standard output unless the --batch option is specified, causing ranked outputs to be written to files named <input_file>.ranked_strings.
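
To make the interaction of --limit and --min-score concrete, here is a minimal Python sketch of the selection logic the option table describes. It is not taken from rank_strings itself: the function name select_ranked, the (score, string) tuple layout, and the sample scores are all assumptions made for illustration.

```python
from typing import Iterable, List, Optional, Tuple

Scored = Tuple[float, str]  # hypothetical (score, string) pair


def select_ranked(scored: Iterable[Scored],
                  limit: Optional[int] = None,
                  min_score: Optional[float] = None) -> List[Scored]:
    """Sort by score, highest first, then apply the two output filters."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    if min_score is not None:
        # --min-score: keep strings with score >= min-score
        ranked = [pair for pair in ranked if pair[0] >= min_score]
    if limit is not None:
        # --limit: keep only the top `limit` strings
        ranked = ranked[:limit]
    return ranked


# Made-up scores, for illustration only.
sample = [
    (1.3, "GetProcAddress"),
    (7.2, "http://malware.example/payload"),
    (-2.0, "!This program cannot be run in DOS mode."),
]
top_two = select_ranked(sample, limit=2)
non_negative = select_ranked(sample, min_score=0.0)
```

Because the list is sorted before filtering, applying the two filters in either order selects the same strings.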

flarestrings supports an option -n (or --min-len) to print sequences of characters that are at least min-len characters long, instead of the default 4. For example:

flarestrings -n 8 <my_sample> | rank_strings

will print and rank only strings of length 8 or greater.
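
The minimum-length behavior can be sketched in a few lines of Python. This is only an illustration of the general technique (regular expressions over printable byte runs), not the flarestrings source; the function name extract_strings and the choice of printable range 0x20 to 0x7e are assumptions.

```python
import re
from typing import List


def extract_strings(data: bytes, min_len: int = 4) -> List[str]:
    """Return printable runs of at least min_len characters.

    Narrow (one byte per character) matches come first, then wide
    (UTF-16LE, ASCII range only) matches, so the result is not ordered
    by file offset.
    """
    printable = rb"[\x20-\x7e]"
    narrow = re.compile(printable + rb"{%d,}" % min_len)
    wide = re.compile(rb"(?:" + printable + rb"\x00){%d,}" % min_len)
    found = [m.group().decode("ascii") for m in narrow.finditer(data)]
    found += [m.group().decode("utf-16-le") for m in wide.finditer(data)]
    return found


blob = (b"\x01\x02hello world\x00abc\xff"
        + "wide text".encode("utf-16-le") + b"\xff\xff")
default_result = extract_strings(blob)          # min_len defaults to 4
longer_result = extract_strings(blob, min_len=10)
```

With the default of 4, the three-character run "abc" is dropped; raising min_len to 10 also drops the nine-character wide string.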

Running from a Docker container

  • After cloning the repo, build the container. From the package's top-level directory:

    docker build -t stringsifter -f docker/Dockerfile .

  • Run the container with the flarestrings or rank_strings argument to use the respective command. The containerized commands can be used in pipelines:

    cat <my_sample> | docker run -i stringsifter flarestrings | docker run -i stringsifter rank_strings

  • Or, run the container without arguments to get a shell prompt, using the -v flag to expose a host directory to the container:

    docker run -v <my_malware>:/samples -it stringsifter

    where <my_malware> contains samples for analysis, for example:

    docker run -v $HOME/malware/binaries:/samples -it stringsifter

  • At the container prompt:

    flarestrings /samples/<my_sample> | rank_strings <options>

All command line arguments are supported in the containerized scripts.

Running on FLOSS Output

StringSifter can be applied to arbitrary lists of strings, making it useful for practitioners looking to glean insights from alternative intelligence-gathering sources such as live memory dumps, sandbox runs, or binaries that contain obfuscated strings. For example, FireEye Labs Obfuscated Strings Solver (FLOSS) extracts printable strings just as Strings does, but additionally reveals obfuscated strings that have been encoded, packed, or manually constructed on the stack. It can be used as an in-line replacement for Strings, meaning that StringSifter can be similarly invoked on FLOSS output using the following command:

$PY2_VENV/bin/floss -q <options> <my_sample> | rank_strings <options>

Notes:

  1. The -q argument suppresses headers and formatting to show only extracted strings. To learn more about additional FLOSS options, please see its Usage Docs.
  2. FLOSS requires Python 2, while StringSifter requires Python 3. In the example command, at least one of floss or rank_strings must include a path referencing a Python virtual environment.
  3. FLOSS can be downloaded as a standalone executable. In this case it is not necessary to specify a Python environment, because the executable does not rely on a Python interpreter.

Notes on running strings

This distribution includes the flarestrings program to ensure predictable output across platforms. If you choose to run your system's installed strings instead, note that its options are not consistent across versions and platforms:

Linux

Most Linux distributions include the strings program from GNU Binutils. To extract both "wide" and "narrow" strings the program must be run twice, piping to an output file:

strings <my_sample>       > strs.txt   # narrow strings
strings -el <my_sample>  >> strs.txt   # wide strings.  note the ">>"

MacOS

Some versions of BSD strings packaged with MacOS do not support wide strings. Also note that the -a option to strings to scan the whole file may be disabled in the default configuration. Without -a informative strings may be lost. We recommend installing GNU Binutils via Homebrew or MacPorts to get a version of strings that supports wide characters. Use care to invoke the correct version of strings.

Windows

strings is not installed by default on Windows. We recommend installing Windows Sysinternals, Cygwin, or Malcode Analyst Pack to get a working strings.

Discussion

This version of StringSifter was trained using Strings outputs from sampled malware binaries associated with the first EMBER dataset. Ordinal labels were generated using weak supervision procedures, and supervised learning is performed by Gradient Boosted Decision Trees with a learning-to-rank objective function. See Quick Links for further technical details. Please note that neither labeled data nor training code is currently available, though we may reconsider this approach in future releases.

Issues

We use GitHub Issues for posting bugs and feature requests.

Acknowledgements

  • Thanks to the FireEye Data Science (FDS) and FireEye Labs Reverse Engineering (FLARE) teams for review and feedback.
  • StringSifter was designed and developed by Philip Tully (FDS), Matthew Haigh (FLARE), Jay Gibble (FLARE), and Michael Sikorski (FLARE).
  • The StringSifter logo was designed by Josh Langner (FLARE).
  • flarestrings is derived from the excellent tool FLOSS.

stringsifter's People

Contributors

ana06, digitalsleuth, drstrng, ewalshmndt, noraj


stringsifter's Issues

Python 3.8 not supported

StringSifter depends on numpy==1.17.1 and scipy==1.3.1. These versions do not support Python 3.8, yet setup.py states that StringSifter supports python>=3.6.

lightgbm 3.3.1 brings LGBMNotFittedError

  • system
    ubuntu-20.04 + stringsifter-2.20201202

  • issue
    rank_strings raises LGBMNotFittedError when lightgbm >= 3.3.1; the last working version is lightgbm == 3.3.0

test@test:/dist# flarestrings -n 8 ./main | rank_strings -l 5
Traceback (most recent call last):
  File "/usr/local/bin/rank_strings", line 8, in <module>
    sys.exit(argmain())
  File "/usr/local/lib/python3.8/site-packages/stringsifter/rank_strings.py", line 140, in argmain
    main(args.input_strings, args.limit, args.min_score,
  File "/usr/local/lib/python3.8/site-packages/stringsifter/rank_strings.py", line 39, in main
    y_scores = ranker.predict(X_test)
  File "/usr/local/lib/python3.8/site-packages/lightgbm/sklearn.py", line 795, in predict
    raise LGBMNotFittedError("Estimator not fitted, call fit before exploiting the model.")
sklearn.exceptions.NotFittedError: Estimator not fitted, call fit before exploiting the model.
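
A small version check can flag an environment that falls in the range this report describes. The sketch below is based only on the two version numbers given in the report; the helper names parse_version and predates_reported_breakage are hypothetical, and the parser is a simplified stand-in for a full version-parsing library.

```python
import re
from typing import Tuple

# First lightgbm release reported to fail in this issue.
FIRST_FAILING = (3, 3, 1)


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '3.3.1' into (3, 3, 1); stop at the first non-numeric part."""
    numbers = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def predates_reported_breakage(lightgbm_version: str) -> bool:
    """True if the version is older than the first release reported to fail."""
    return parse_version(lightgbm_version) < FIRST_FAILING
```

For example, predates_reported_breakage("3.3.0") is True and predates_reported_breakage("3.3.1") is False.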

What is the Maximum Score of StringSifter?

I have used StringSifter to find the malicious strings in a binary. When I presented the sample scores StringSifter gave, my supervisor asked what the maximum possible string score is. I searched for it but could not find a source that answers the question. Would you be kind enough to tell me the maximum possible string score StringSifter can produce?

Fix for macOS pip3 install fails

Installation on macOS can fail at the lightgbm step. If the install fails, try the following, in this order:

pip3 install cmake
# requires Homebrew
brew install libomp
pip3 install lightgbm
# then install stringsifter
pip3 install stringsifter

[rank_strings 2.20201202 error]: AttributeError: module 'numpy' has no attribute 'typeDict'

rank_strings version 2.20201202; the error also occurs with version 3.20230711.
Traceback when I try to run rank_strings (with or without arguments):

Traceback (most recent call last):
  File "/usr/local/bin/rank_strings", line 8, in <module>
    sys.exit(argmain())
  File "/usr/local/lib/python3.8/dist-packages/stringsifter/rank_strings.py", line 140, in argmain
    main(args.input_strings, args.limit, args.min_score,
  File "/usr/local/lib/python3.8/dist-packages/stringsifter/rank_strings.py", line 28, in main
    featurizer = joblib.load(os.path.join(modeldir, "featurizer.pkl"))
  File "/usr/local/lib/python3.8/dist-packages/joblib/numpy_pickle.py", line 585, in load
    obj = _unpickle(fobj, filename, mmap_mode)
  File "/usr/local/lib/python3.8/dist-packages/joblib/numpy_pickle.py", line 504, in _unpickle
    obj = unpickler.load()
  File "/usr/lib/python3.8/pickle.py", line 1212, in load
    dispatch[key[0]](self)
  File "/usr/lib/python3.8/pickle.py", line 1537, in load_stack_global
    self.append(self.find_class(module, name))
  File "/usr/lib/python3.8/pickle.py", line 1579, in find_class
    __import__(module, level=0)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/__init__.py", line 80, in <module>
    from .base import clone
  File "/usr/local/lib/python3.8/dist-packages/sklearn/base.py", line 21, in <module>
    from .utils import _IS_32BIT
  File "/usr/local/lib/python3.8/dist-packages/sklearn/utils/__init__.py", line 20, in <module>
    from scipy.sparse import issparse
  File "/usr/lib/python3/dist-packages/scipy/sparse/__init__.py", line 229, in <module>
    from .base import *
  File "/usr/lib/python3/dist-packages/scipy/sparse/base.py", line 8, in <module>
    from .sputils import (isdense, isscalarlike, isintlike,
  File "/usr/lib/python3/dist-packages/scipy/sparse/sputils.py", line 16, in <module>
    supported_dtypes = [np.typeDict[x] for x in supported_dtypes]
  File "/usr/lib/python3/dist-packages/scipy/sparse/sputils.py", line 16, in <listcomp>
    supported_dtypes = [np.typeDict[x] for x in supported_dtypes]
  File "/usr/local/lib/python3.8/dist-packages/numpy/__init__.py", line 320, in __getattr__
    raise AttributeError("module {!r} has no attribute "
AttributeError: module 'numpy' has no attribute 'typeDict'
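
The paths in this traceback show scipy loaded from /usr/lib/python3/dist-packages while numpy comes from /usr/local/lib/python3.8/dist-packages, which suggests an older system scipy paired with a newer pip-installed numpy that no longer provides the deprecated typeDict alias. A quick way to check for such a mismatch is to list the installed versions. The sketch below uses only the standard library (Python 3.8 or newer); the helper name installed_versions and the package list are illustrative choices, not part of StringSifter.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    report: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


# Packages that appear in the traceback, plus lightgbm, which rank_strings loads.
for name, version in installed_versions(
        ["numpy", "scipy", "scikit-learn", "joblib", "lightgbm"]).items():
    print(f"{name}: {version or 'not installed'}")
```

If the reported scipy is much older than numpy, upgrading scipy or installing everything into a fresh virtual environment is a reasonable first thing to try.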

Use type hinting

Most functions currently lack type hints for their parameters and return values. Adding them would make the code easier for developers and intelligent code editors to analyze.
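
As an illustration of the request, here is the same small function written without and with annotations. The function is hypothetical and does not come from the StringSifter code base.

```python
from typing import Iterable, List


# Without hints: a reader has to infer what goes in and what comes out.
def keep_long_untyped(strings, min_len=4):
    return [s for s in strings if len(s) >= min_len]


# With hints: the signature documents itself, and tools such as mypy or an
# IDE can flag a call like keep_long(["abc"], min_len="4").
def keep_long(strings: Iterable[str], min_len: int = 4) -> List[str]:
    return [s for s in strings if len(s) >= min_len]


kept = keep_long(["ab", "abcd", "kernel32.dll"])
```

Both functions behave identically at runtime; the annotations only add information for readers and static analysis.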

Release of training code

Hey,
Since the project no longer seems to be actively supported, I was wondering whether you would reconsider releasing the training code, as you said you might under Discussion in your README:
"Please note that neither labeled data nor training code is currently available, though we may reconsider this approach in future releases."
It would also be very helpful if, in addition to the training code, you released the labeled data used to train the model.

rank_strings BrokenPipeError: [Errno 32] Broken pipe on macOS

macOS 10.14.6
python 3.7.6 homebrew

Traceback (most recent call last):
  File "/usr/local/bin/rank_strings", line 8, in <module>
    sys.exit(argmain())
  File "/usr/local/lib/python3.7/site-packages/stringsifter/rank_strings.py", line 138, in argmain
    args.scores, args.batch)
  File "/usr/local/lib/python3.7/site-packages/stringsifter/rank_strings.py", line 27, in main
    ranker = joblib.load(os.path.join(modeldir, "ranker.pkl"))
  File "/usr/local/lib/python3.7/site-packages/joblib/numpy_pickle.py", line 598, in load
    obj = _unpickle(fobj, filename, mmap_mode)
  File "/usr/local/lib/python3.7/site-packages/joblib/numpy_pickle.py", line 526, in _unpickle
    obj = unpickler.load()
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pickle.py", line 1088, in load
    dispatch[key[0]](self)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pickle.py", line 1376, in load_global
    klass = self.find_class(module, name)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pickle.py", line 1426, in find_class
    __import__(module, level=0)
  File "/usr/local/lib/python3.7/site-packages/lightgbm/__init__.py", line 8, in <module>
    from .basic import Booster, Dataset
  File "/usr/local/lib/python3.7/site-packages/lightgbm/basic.py", line 33, in <module>
    _LIB = _load_lib()
  File "/usr/local/lib/python3.7/site-packages/lightgbm/basic.py", line 28, in _load_lib
    lib = ctypes.cdll.LoadLibrary(lib_path[0])
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ctypes/__init__.py", line 442, in LoadLibrary
    return self._dlltype(name)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ctypes/__init__.py", line 364, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: dlopen(/usr/local/lib/python3.7/site-packages/lightgbm/lib_lightgbm.so, 6): Library not loaded: /usr/local/opt/gcc/lib/gcc/8/libgomp.1.dylib
  Referenced from: /usr/local/lib/python3.7/site-packages/lightgbm/lib_lightgbm.so
  Reason: image not found
Traceback (most recent call last):
  File "/usr/local/bin/flarestrings", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/site-packages/stringsifter/flarestrings.py", line 29, in main
    print(match.group().decode('ascii'))
BrokenPipeError: [Errno 32] Broken pipe
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
BrokenPipeError: [Errno 32] Broken pipe
Requirement already satisfied, skipping upgrade: lightgbm==2.1.2 in /usr/local/lib/python3.7/site-packages (from stringsifter) (2.1.2)
Requirement already satisfied, skipping upgrade: numpy==1.17.1 in /usr/local/lib/python3.7/site-packages (from stringsifter) (1.17.1)
Requirement already satisfied, skipping upgrade: scikit-learn==0.21.3 in /usr/local/lib/python3.7/site-packages (from stringsifter) (0.21.3)
Requirement already satisfied, skipping upgrade: joblib==0.13.2 in /usr/local/lib/python3.7/site-packages (from stringsifter) (0.13.2)
Requirement already satisfied, skipping upgrade: pytest==3.10.1 in /usr/local/lib/python3.7/site-packages (from stringsifter) (3.10.1)
Requirement already satisfied, skipping upgrade: fasttext==0.9.1 in /usr/local/lib/python3.7/site-packages (from stringsifter) (0.9.1)
Requirement already satisfied, skipping upgrade: scipy in /usr/local/lib/python3.7/site-packages (from lightgbm==2.1.2->stringsifter) (1.3.1)
Requirement already satisfied, skipping upgrade: py>=1.5.0 in /usr/local/lib/python3.7/site-packages (from pytest==3.10.1->stringsifter) (1.8.0)
Requirement already satisfied, skipping upgrade: more-itertools>=4.0.0 in /usr/local/lib/python3.7/site-packages (from pytest==3.10.1->stringsifter) (7.2.0)
Requirement already satisfied, skipping upgrade: six>=1.10.0 in /usr/local/lib/python3.7/site-packages (from pytest==3.10.1->stringsifter) (1.12.0)
Requirement already satisfied, skipping upgrade: setuptools in /usr/local/lib/python3.7/site-packages (from pytest==3.10.1->stringsifter) (42.0.2)
Requirement already satisfied, skipping upgrade: atomicwrites>=1.0 in /usr/local/lib/python3.7/site-packages (from pytest==3.10.1->stringsifter) (1.3.0)
Requirement already satisfied, skipping upgrade: pluggy>=0.7 in /usr/local/lib/python3.7/site-packages (from pytest==3.10.1->stringsifter) (0.12.0)
Requirement already satisfied, skipping upgrade: attrs>=17.4.0 in /usr/local/lib/python3.7/site-packages (from pytest==3.10.1->stringsifter) (18.2.0)
Requirement already satisfied, skipping upgrade: pybind11>=2.2 in /usr/local/lib/python3.7/site-packages (from fasttext==0.9.1->stringsifter) (2.3.0)
Requirement already satisfied, skipping upgrade: importlib-metadata>=0.12 in /usr/local/lib/python3.7/site-packages (from pluggy>=0.7->pytest==3.10.1->stringsifter) (0.20)
Requirement already satisfied, skipping upgrade: zipp>=0.5 in /usr/local/lib/python3.7/site-packages (from importlib-metadata>=0.12->pluggy>=0.7->pytest==3.10.1->stringsifter) (0.6.0)
Building wheels for collected packages: stringsifter
  Building wheel for stringsifter (setup.py) ... done
  Created wheel for stringsifter: filename=stringsifter-0.20191202-py3-none-any.whl size=1932401 sha256=9e0f0c617e547b96f9ef88995e6d5006ae9aa54a8bc16cc16b077f1284f7e832
  Stored in directory: /Users/admin/Library/Caches/pip/wheels/3c/89/aa/c3bc7a6c171c52f5d4bea07724f5f2fd9fc4e7ad23b5946a8c
Successfully built stringsifter
Installing collected packages: stringsifter
  Attempting uninstall: stringsifter
    Found existing installation: stringsifter 0.20190907
    Uninstalling stringsifter-0.20190907:
      Successfully uninstalled stringsifter-0.20190907
Successfully installed stringsifter-0.20191202

This is probably the same problem that issue #10 reports, but I can't tell for sure because that issue contains limited information.
@phtully ?
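
In the output above, the BrokenPipeError in flarestrings appears to be a secondary symptom: rank_strings exits on the OSError about libgomp, which closes the reading end of the pipe while flarestrings is still printing. A producer script can handle that case quietly. The following is a minimal sketch of the usual pattern, not the flarestrings code; write_lines and main are hypothetical names.

```python
import os
import sys
from typing import Iterable, TextIO


def write_lines(lines: Iterable[str], out: TextIO) -> bool:
    """Write lines to out; return False if the reader closed the pipe."""
    try:
        for line in lines:
            out.write(line + "\n")
        out.flush()
    except BrokenPipeError:
        return False
    return True


def main() -> int:
    if write_lines(["example string"], sys.stdout):
        return 0
    # Point stdout at os.devnull so the interpreter's flush at shutdown
    # does not raise BrokenPipeError a second time.
    devnull = os.open(os.devnull, os.O_WRONLY)
    os.dup2(devnull, sys.stdout.fileno())
    return 1
```

A console entry point would call sys.exit(main()). This hides only the noisy second traceback; the underlying failure to load lib_lightgbm.so still has to be fixed separately.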

Notebooks and data used for training the model

Hi,

Would it be possible to make the training data and the notebooks available? That way we could generate different model versions, tweak them a little, and perhaps help with improvements.

Thanks in advance

Setup.py also installs pytest

Because setup.py installs every listed requirement, it also installs pytest. pytest is a development requirement and should not be propagated into production builds.
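
One common way to address this is to keep test tools out of install_requires and expose them as an optional extra. The sketch below only shows the splitting step. The helper split_requirements and the "dev" extra name are hypothetical, and the pins are copied from the pip log in the macOS issue above rather than from the current setup.py.

```python
import re
from typing import Dict, Iterable, List, Set, Tuple


def split_requirements(requirements: Iterable[str],
                       dev_only: Set[str]
                       ) -> Tuple[List[str], Dict[str, List[str]]]:
    """Separate runtime requirements from development-only ones."""
    runtime: List[str] = []
    dev: List[str] = []
    for req in requirements:
        # Take the distribution name before any version specifier.
        name = re.split(r"[=<>!~ ]", req, maxsplit=1)[0].lower()
        (dev if name in dev_only else runtime).append(req)
    return runtime, {"dev": dev}


pinned = [
    "lightgbm==2.1.2", "numpy==1.17.1", "scikit-learn==0.21.3",
    "joblib==0.13.2", "pytest==3.10.1", "fasttext==0.9.1",
]
install_requires, extras_require = split_requirements(pinned, {"pytest"})
# These two values could then be passed to setuptools.setup() as
# install_requires=... and extras_require=...
```

With that layout, pip install stringsifter would skip pytest, while pip install "stringsifter[dev]" would include it.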

Memory issue when processing large strings

StringSifter uses extreme amounts of memory when run on a sample containing a very long string. The immediate problem is the rectangular array used to store the input strings before processing even begins, but there may be further memory consumption issues during processing.

Ideally, StringSifter memory use should be bounded to some multiple of the input size in bytes.

Python 3.10 compatibility

Any chance of updating this for Python 3.10 support? The models are not compatible with scikit-learn 1.0.2 (which is compatible with Python 3.10).

Can the models be reserialized with the latest scikit-learn?

pip3 install failing on installing numpy on Santoku distro

I'm trying to install this on my Santoku distro (based on Ubuntu).
It fails when it gets to running setup.py for the dependency "numpy",
which prevents pip from installing the stringsifter package.

$ pip3 install numpy
Downloading/unpacking numpy
Downloading numpy-1.17.2.zip (6.5MB): 6.5MB downloaded
Running setup.py (path:/tmp/pip_build_santoku-bcheeves/numpy/setup.py) egg_info for package numpy
Traceback (most recent call last):
File "<string>", line 17, in <module>
File "/tmp/pip_build_santoku-bcheeves/numpy/setup.py", line 31, in <module>
raise RuntimeError("Python version >= 3.5 required.")
RuntimeError: Python version >= 3.5 required.
Complete output from command python setup.py egg_info:
Traceback (most recent call last):

File "<string>", line 17, in <module>

File "/tmp/pip_build_santoku-bcheeves/numpy/setup.py", line 31, in <module>

raise RuntimeError("Python version >= 3.5 required.")

RuntimeError: Python version >= 3.5 required.


Cleaning up...
Command python setup.py egg_info failed with error code 1 in /tmp/pip_build_santoku-bcheeves/numpy
Storing debug log for failure in /home/santoku-bcheeves/.pip/pip.log

I plan to look up who maintains numpy and contact them as well, because I have Python 3.7 installed and the error clearly states that Python version >= 3.5 is required :)

numpy.core._exceptions.MemoryError

Running

flarestrings <big 2 GB file> | rank_strings

crashes with an out-of-memory exception:

(python37) daubsi@bigigloo:/tmp$ flarestrings <bigfile> | rank_strings
Traceback (most recent call last):
  File "/home/daubsi/.conda/envs/python37/bin/rank_strings", line 11, in <module>
    load_entry_point('stringsifter', 'console_scripts', 'rank_strings')()
  File "/tmp/stringsifter/stringsifter/rank_strings.py", line 138, in argmain
    args.scores, args.batch)
  File "/tmp/stringsifter/stringsifter/rank_strings.py", line 31, in main
    input_strings.readlines()])
numpy.core._exceptions.MemoryError: Unable to allocate array with shape (19412352,) and data type <U45056

There is more than 10GB free memory available.
Running on Ubuntu 14.04 with Python 3.7.4

Is the tool supposed to work on smaller files only? The standard "strings" utility had no issue getting the strings from the binary.
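
The size of the failed allocation can be worked out from the shape and dtype in the traceback. NumPy's fixed-width unicode dtype (<U45056 here) reserves 4 bytes per character and pads every element to the length of the longest string, so a single very long string inflates the whole array. The helper name below is hypothetical; the two numbers are taken from the error message.

```python
BYTES_PER_CHAR = 4  # NumPy "<U" dtypes store each character as 4 bytes (UCS-4)


def unicode_array_bytes(n_strings: int, max_len: int) -> int:
    """Bytes needed for a 1-D NumPy array of dtype <U{max_len}."""
    return n_strings * max_len * BYTES_PER_CHAR


# Shape (19412352,) and dtype <U45056 from the MemoryError above.
needed = unicode_array_bytes(19_412_352, 45_056)
print(f"{needed / 1e12:.1f} TB")  # prints "3.5 TB"
```

That is far more than the 10 GB reported as free, which is consistent with the rectangular-array problem described in the "Memory issue when processing large strings" issue.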
