adobe / stringlifier

Stringlifier is an open-source ML library for detecting random strings in raw text. It can be used for sanitizing logs, detecting accidentally exposed credentials, and as a pre-processing step in unsupervised ML-based analysis of application text data.

License: Apache License 2.0

Python 100.00%
machine-learning python3 api analysis unsupervised-machine-learning clustering tf-idf raw-text pytorch convolutional-networks long-short-term-memory classification

stringlifier's Introduction

stringlifier

String-classifier is a Python module for detecting random strings and hashes in text/code.

Typical usage scenarios include:

  • Sanitizing application or security logs
  • Detecting accidentally exposed credentials (complex passwords or api keys)

Interactive notebook

You can see Stringlifier in action by checking out this interactive notebook hosted on Colaboratory.

Quick start guide

You can install stringlifier with pip:

$ pip install stringlifier

If you are using the pip3 installation that comes with Python 3, use pip3 instead of pip:

$ pip3 install stringlifier

API example:

from stringlifier.api import Stringlifier

stringlifier=Stringlifier()

s = stringlifier("com.docker.hyperkit -A -u -F vms/0/hyperkit.pid -c 8 -m 8192M -b 127.0.0.1 --pass=\"NlcXVpYWRvcg\" -s 0:0,hostbridge -s 31,lpc -s 1:0,virtio-vpnkit,path=vpnkit.eth.sock,uuid=45172425-08d1-41ec-9d13-437481803412 -U c6fb5010-a83e-4f74-9a5a-50d9086b9")

After this, s should be:

'com.docker.hyperkit -A -u -F vms/0/hyperkit.pid -c 8 -m 8192M -b <IP_ADDR> --pass="<RANDOM_STRING>" -s 0:0,hostbridge -s 31,lpc -s 1:0,virtio-vpnkit,path=vpnkit.eth.sock,uuid=<UUID> -U <UUID>'

You can also choose to see the full tokenization and classification output:

s, tokens = stringlifier("com.docker.hyperkit -A -u -F vms/0/hyperkit.pid -c 8 -m 8192M -b 127.0.0.1 --pass=\"NlcXVpYWRvcg\" -s 0:0,hostbridge -s 31,lpc -s 1:0,virtio-vpnkit,path=vpnkit.eth.sock,uuid=45172425-08d1-41ec-9d13-437481803412 -U c6fb5010-a83e-4f74-9a5a-50d9086b9", return_tokens=True)

s will be the same as before and tokens will contain the following data:

[[('0', 33, 34, '<NUMERIC>'),
   ('8', 51, 52, '<NUMERIC>'),
   ('8192', 56, 60, '<NUMERIC>'),
   ('127.0.0.1', 65, 74, '<IP_ADDR>'),
   ('NlcXVpYWRvcg', 83, 95, '<RANDOM_STRING>'),
   ('0', 100, 101, '<NUMERIC>'),
   ('0', 102, 103, '<NUMERIC>'),
   ('31', 118, 120, '<NUMERIC>'),
   ('1', 128, 129, '<NUMERIC>'),
   ('0', 130, 131, '<NUMERIC>'),
   ('45172425-08d1-41ec-9d13-437481803412', 172, 208, '<UUID>'),
   ('c6fb5010-a83e-4f74-9a5a-50d9086b9', 212, 244, '<UUID>')]]
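
Each entry in tokens appears to be a (text, start, end, label) tuple whose offsets index into the original string as a half-open range; for instance, characters 33 to 34 are the 0 in vms/0. The outer list seems to hold one inner list per input string. Below is a minimal sketch, assuming that layout, of rebuilding a masked string while leaving some labels unmasked. The helper mask_spans and its keep parameter are hypothetical and not part of stringlifier, and the sample data is hand-written.

```python
def mask_spans(text, tokens, keep=()):
    """Replace each (value, start, end, label) span in text with its label.

    Spans whose label is listed in `keep` are left untouched.
    Assumes spans do not overlap and offsets are half-open [start, end).
    """
    out = []
    cursor = 0
    for value, start, end, label in sorted(tokens, key=lambda t: t[1]):
        out.append(text[cursor:start])  # copy the text before this span
        out.append(value if label in keep else label)
        cursor = end
    out.append(text[cursor:])  # copy whatever follows the last span
    return "".join(out)


# Hand-written sample in the same shape as the output above (an assumption).
text = "-b 127.0.0.1 -c 8"
tokens = [("127.0.0.1", 3, 12, "<IP_ADDR>"), ("8", 16, 17, "<NUMERIC>")]

print(mask_spans(text, tokens))                      # mask every span
print(mask_spans(text, tokens, keep={"<NUMERIC>"}))  # leave numbers visible
```

Working from the offsets rather than re-searching for the token text avoids masking the wrong occurrence when the same value appears more than once.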

Building your own classifier

You can also train your own model if you want to detect different types of strings. For this you can use the Command Line Interface for the string classifier:

$ python3 stringlifier/modules/stringc.py --help

Usage: stringc.py [options]

Options:
  -h, --help            show this help message and exit
  --interactive
  --train
  --resume
  --train-file=TRAIN_FILE
  --dev-file=DEV_FILE
  --store=OUTPUT_BASE
  --patience=PATIENCE   (default=20)
  --batch-size=BATCH_SIZE
                        (default=32)
  --device=DEVICE

For instructions on how to generate your training data, use this link.

Important note: This model might not scale if detecting a type of string depends on the surrounding tokens. In that case, consider a more advanced sequence-processing tool such as NLP-Cube.

stringlifier's People

Contributors

acotaie, atreyamaj, ninoseki, rscctest, tiberiu44

stringlifier's Issues

Numbers - Detection

A number in the log is only partially recognized as a number.
As a result, the output does not mask the data completely.

To Reproduce
Steps to reproduce the behavior:

  1. Add a log line containing a 16-digit number
  2. Example: 4929193454463111
  3. Run Stringlifier
  4. The output shows 49291
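
One possible workaround, sketched here under assumptions, is to mask long digit runs with a regular expression before the text reaches the model. The function premask_long_numbers and the MIN_DIGITS threshold are hypothetical and not part of stringlifier. The placeholder reuses the <NUMERIC> label shown in the README output.

```python
import re

# Hypothetical threshold: digit runs at least this long are masked up front.
MIN_DIGITS = 6

# Lookarounds ensure the whole run of digits is matched, never a fragment.
LONG_NUMBER = re.compile(r"(?<!\d)\d{%d,}(?!\d)" % MIN_DIGITS)


def premask_long_numbers(text, placeholder="<NUMERIC>"):
    """Replace every long run of digits in text with the placeholder."""
    return LONG_NUMBER.sub(placeholder, text)


print(premask_long_numbers("card 4929193454463111 accepted"))
```

Shorter numbers are left alone so that the classifier can still decide how to treat them.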

Support for other data types?

If my requirement is to support new data types such as regular expressions, URLs, cell phone numbers, etc., is there support for custom training?

Bug passing a list instead of a str

Describe the bug

When passing a list to stringlifier, it crashes with:

Traceback (most recent call last):
  File "test.py", line 17, in <module>
    x = stringlifier(s.split("/"))
  File "/root/p3.8/lib/python3.8/site-packages/stringlifier/api.py", line 61, in __call__
    new_str, toks = self._extract_tokens(tokens[iBatch], p_ts[iBatch], cutoff=cutoff)
  File "/root/p3.8/lib/python3.8/site-packages/stringlifier/api.py", line 118, in _extract_tokens
    if cls == 'C' and string[ii] in numbers:

Is it possible to release a new PyPI package with upgraded torch and numpy versions to fix a vulnerability?

Hi adobe team,

The latest version of stringlifier on PyPI is v0.1.1.4, which still uses torch==1.6.0 and numpy==1.19.2. The last commit loosened the torch version requirement, but that change has not been packaged for PyPI.

The library works without problems for us, but there is a vulnerability in torch==1.6.0 (CVE-2022-45907). To fix it, we need to upgrade torch to 1.13.1 with a corresponding numpy version.

I have tried cloning the repo, changing requirements.txt to torch==1.13.1 and numpy==1.22.0, and building the package ourselves to fix the vulnerability, but I would like to ask two questions:

  1. Is it possible to release a new version on PyPI with upgraded torch and numpy, so that we do not need to build it ourselves?
  2. Are there any issues with upgrading both libraries?

Thanks!

BR,
Shandi

Crash when passing an empty str

Describe the bug

Crash when passing None or an empty str (""):

Traceback (most recent call last):
  File "test.py", line 20, in <module>
    print(stringlifier(elem)[0])
  File "/root/p3.8/lib/python3.8/site-packages/stringlifier/api.py", line 54, in __call__
    p_ts = self.classifier(tokens)
  File "/root/p3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/p3.8/lib/python3.8/site-packages/stringlifier/modules/stringc2.py", line 142, in forward
    output, _ = self._rnn(hidden)
  File "/root/p3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/p3.8/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 576, in forward
    result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,
RuntimeError: stack expects a non-empty TensorList

Expected behaviour: if it's not parsable, return it as-is (even if it is "" or None or an int or an object). Should be as fault-tolerant as possible.

I would only perform processing on elements in the input list if isinstance(element, str).
🎸
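
The fault-tolerant behaviour requested above could be sketched as a thin wrapper around any callable classifier. This is a minimal illustration, not the library's implementation: apply_safely is a hypothetical helper, and fake_classifier is a stand-in so the example runs without stringlifier installed.

```python
def apply_safely(classify, items):
    """Run classify on non-empty strings only; return everything else as-is."""
    results = []
    for item in items:
        if isinstance(item, str) and item.strip():
            try:
                results.append(classify(item))
            except Exception:
                # Deliberately broad: if classification fails, keep the input.
                results.append(item)
        else:
            # None, "", ints and other objects pass through unchanged.
            results.append(item)
    return results


def fake_classifier(s):
    """Stand-in for a real classifier (assumption for this example)."""
    return s.upper()


print(apply_safely(fake_classifier, ["abc", "", None, 42]))
```

In real use, the Stringlifier instance would take the place of fake_classifier, assuming it is called with one string at a time.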

Add type hints

Is your feature request related to a problem? Please describe.

Adding type hints improves the developer experience, especially if you are using an IDE like VS Code.

Describe the solution you'd like

Add type annotations.

Describe alternatives you've considered

N/A.

Additional context

I will work on this if this suggestion makes sense.

Add "/" as a delimiter

I would like to use stringlifier to parse HTML console and network logs; for this I'd need to add / as a separator.
Currently, when running on such strings, the / sometimes gets swallowed up in a UUID or RANDOM_STRING token.

If possible, a rule-based cleaning step before applying ML might be beneficial, for example matching a standard UUID (32 hex digits in 5 fixed groups separated by -) 🤔
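
A rule-based pre-pass of this kind could look roughly like the sketch below, which masks canonical 8-4-4-4-12 UUIDs before the text is handed to the model. The function premask_uuids is hypothetical and not part of stringlifier. The <UUID> placeholder reuses the label shown in the README output.

```python
import re

# Canonical textual UUID: 32 hex digits in groups of 8-4-4-4-12.
# The lookarounds stop the pattern from matching inside a longer hex run.
UUID_RE = re.compile(
    r"(?<![0-9A-Fa-f])"
    r"[0-9A-Fa-f]{8}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{12}"
    r"(?![0-9A-Fa-f])"
)


def premask_uuids(text):
    """Replace every canonical UUID in text with the <UUID> placeholder."""
    return UUID_RE.sub("<UUID>", text)


print(premask_uuids("GET /items/45172425-08d1-41ec-9d13-437481803412/detail"))
```

Because the pattern only matches hex digits and hyphens, a neighbouring / is never absorbed into the masked span.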

Get the Classifier to Identify JWT Tokens

Is your feature request related to a problem? Please describe.
Currently, the classifier identifies a JWT token as three random strings, like this: ['<RANDOM_STRING>.<RANDOM_STRING>.<RANDOM_STRING>_<RANDOM_STRING>'].

Describe the solution you'd like
It would be great if the classifier could output this as a single JWT token instead of identifying it as three random strings.
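
Until the classifier supports this, JWT-shaped strings could be masked with a regular expression before classification. The sketch below is an assumption-laden workaround, not a stringlifier feature: premask_jwts and the <JWT> label are hypothetical. It relies on the fact that a base64url-encoded JSON object starting with {" begins with eyJ, and it does not verify that the token actually decodes.

```python
import re

# Three base64url segments separated by dots, the first starting with "eyJ".
JWT_RE = re.compile(r"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+")


def premask_jwts(text):
    """Replace every JWT-shaped substring with the hypothetical <JWT> label."""
    return JWT_RE.sub("<JWT>", text)


sample = "Authorization: Bearer eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiIxMjMifQ.c2lnbmF0dXJl"
print(premask_jwts(sample))
```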

[Bug report] UnboundLocalError is raised

Describe the bug

from stringlifier.api import Stringlifier

stringlifier=Stringlifier()

s = stringlifier('device."\\n3.')
print(s)

This script raises the following error.

Traceback (most recent call last):
  File "test.py", line 5, in <module>
    s = stringlifier('device."\\n3.')
  File "/Users/foo/dev/stringlifier/stringlifier/api.py", line 137, in __call__
    new_str, toks = self._extract_tokens(tokens[iBatch], p_ts[iBatch])
  File "/Users/foo/dev/stringlifier/stringlifier/api.py", line 211, in _extract_tokens
    tokens.append((c_tok, start, ii, type_))
UnboundLocalError: local variable 'type_' referenced before assignment

To Reproduce

  1. Execute the script.

Expected behavior

Do not raise the UnboundLocalError.

Screenshots

N/A.

Desktop (please complete the following information):

  • OS: macOS
  • Browser: N/A
  • Version: Python 3.8.2

Additional context

N/A.

IP Address - Detection

The version number in a package name is classified as an IP address.
This causes many false positives in the data set we are trying to use.

To Reproduce
Steps to reproduce the behavior:

  1. Add a log line containing a version number
  2. Example: MongoDB v4.4.2
  3. Run Stringlifier
  4. The output shows MongoDB v<IP_ADDR>
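
One way to reduce these false positives, sketched here under assumptions, is to validate each <IP_ADDR> candidate with Python's standard ipaddress module and discard those that do not parse. The helper names are hypothetical, and the token tuples are hand-written in the (text, start, end, label) shape shown in the README; the entry for 4.4.2 is an assumed example of what the classifier might return.

```python
import ipaddress


def is_valid_ip(candidate):
    """Return True if candidate parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(candidate)
        return True
    except ValueError:
        return False


def drop_false_ips(tokens):
    """Keep a token unless it is labelled <IP_ADDR> but fails to parse."""
    return [t for t in tokens if t[3] != "<IP_ADDR>" or is_valid_ip(t[0])]


# Hand-written sample tokens (assumed shape and values).
tokens = [("127.0.0.1", 3, 12, "<IP_ADDR>"), ("4.4.2", 9, 14, "<IP_ADDR>")]
print(drop_false_ips(tokens))
```

A three-part version such as 4.4.2 is rejected because an IPv4 address needs four octets. A four-part version like 1.2.3.4 would still pass, so this check narrows the problem rather than eliminating it.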
