adobe / stringlifier

Stringlifier is an open-source ML library for detecting random strings in raw text. It can be used for sanitising logs, detecting accidentally exposed credentials, and as a pre-processing step in unsupervised ML-based analysis of application text data.

License: Apache License 2.0

Python 100.00%

machine-learning python3 api analysis unsupervised-machine-learning clustering tf-idf raw-text pytorch convolutional-networks long-short-term-memory classification

stringlifier's Introduction

stringlifier

String-classifier is a Python module for detecting random strings and hashes in text/code.

Typical usage scenarios include:

Sanitizing application or security logs
Detecting accidentally exposed credentials (complex passwords or API keys)

Interactive notebook

You can see Stringlifier in action by checking out this interactive notebook hosted on Colaboratory.

Quick start guide

You can install stringlifier with pip:

$ pip install stringlifier

In case you are using the pip3 installation that comes with Python3, use pip3 instead of pip in the above command.

$ pip3 install stringlifier

API example:

from stringlifier.api import Stringlifier

stringlifier=Stringlifier()

s = stringlifier("com.docker.hyperkit -A -u -F vms/0/hyperkit.pid -c 8 -m 8192M -b 127.0.0.1 --pass=\"NlcXVpYWRvcg\" -s 0:0,hostbridge -s 31,lpc -s 1:0,virtio-vpnkit,path=vpnkit.eth.sock,uuid=45172425-08d1-41ec-9d13-437481803412 -U c6fb5010-a83e-4f74-9a5a-50d9086b9")

After this, s should be:

'com.docker.hyperkit -A -u -F vms/0/hyperkit.pid -c 8 -m 8192M -b <IP_ADDR> --pass="<RANDOM_STRING>" -s 0:0,hostbridge -s 31,lpc -s 1:0,virtio-vpnkit,path=vpnkit.eth.sock,uuid=<UUID> -U <UUID>'

You can also choose to see the full tokenization and classification output:

s, tokens = stringlifier("com.docker.hyperkit -A -u -F vms/0/hyperkit.pid -c 8 -m 8192M -b 127.0.0.1 --pass=\"NlcXVpYWRvcg\" -s 0:0,hostbridge -s 31,lpc -s 1:0,virtio-vpnkit,path=vpnkit.eth.sock,uuid=45172425-08d1-41ec-9d13-437481803412 -U c6fb5010-a83e-4f74-9a5a-50d9086b9", return_tokens=True)

s will be the same as before and tokens will contain the following data:

[[('0', 33, 34, '<NUMERIC>'),
   ('8', 51, 52, '<NUMERIC>'),
   ('8192', 56, 60, '<NUMERIC>'),
   ('127.0.0.1', 65, 74, '<IP_ADDR>'),
   ('NlcXVpYWRvcg', 83, 95, '<RANDOM_STRING>'),
   ('0', 100, 101, '<NUMERIC>'),
   ('0', 102, 103, '<NUMERIC>'),
   ('31', 118, 120, '<NUMERIC>'),
   ('1', 128, 129, '<NUMERIC>'),
   ('0', 130, 131, '<NUMERIC>'),
   ('45172425-08d1-41ec-9d13-437481803412', 172, 208, '<UUID>'),
   ('c6fb5010-a83e-4f74-9a5a-50d9086b9', 212, 244, '<UUID>')]]
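
Each tuple above appears to have the layout (text, start, end, label), where start and end are character offsets into the original input and end is exclusive; for example, the '0' in vms/0 sits at offset 33. The following is a minimal sketch of consuming such tuples, assuming that layout holds. The helper name mask_spans and the choice to leave <NUMERIC> spans untouched (mirroring the sample output, which keeps the numbers) are illustrative assumptions and not part of the stringlifier API.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, int, int, str]  # (text, start, end, label); assumed layout


def mask_spans(text: str, tokens: Iterable[Token],
               keep_labels: Tuple[str, ...] = ("<NUMERIC>",)) -> str:
    """Rebuild a masked string from character-offset spans (hypothetical helper)."""
    pieces: List[str] = []
    cursor = 0
    for _tok, start, end, label in sorted(tokens, key=lambda t: t[1]):
        if label in keep_labels:
            continue  # leave these spans as plain text
        pieces.append(text[cursor:start])  # unchanged text before the span
        pieces.append(label)               # the label replaces the span
        cursor = end
    pieces.append(text[cursor:])           # trailing text after the last span
    return "".join(pieces)


# First four tuples copied from the output above; the input is the matching prefix.
text = "com.docker.hyperkit -A -u -F vms/0/hyperkit.pid -c 8 -m 8192M -b 127.0.0.1"
tokens = [
    ("0", 33, 34, "<NUMERIC>"),
    ("8", 51, 52, "<NUMERIC>"),
    ("8192", 56, 60, "<NUMERIC>"),
    ("127.0.0.1", 65, 74, "<IP_ADDR>"),
]
masked = mask_spans(text, tokens)
print(masked)
```

Passing keep_labels=() masks the numeric spans as well, which may be preferable when numbers themselves are sensitive.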

Building your own classifier

You can also train your own model if you want to detect different types of strings. For this you can use the Command Line Interface for the string classifier:

$ python3 stringlifier/modules/stringc.py --help

Usage: stringc.py [options]

Options:
  -h, --help            show this help message and exit
  --interactive
  --train
  --resume
  --train-file=TRAIN_FILE
  --dev-file=DEV_FILE
  --store=OUTPUT_BASE
  --patience=PATIENCE   (default=20)
  --batch-size=BATCH_SIZE
                        (default=32)
  --device=DEVICE

For instructions on how to generate your training data, use this link.

Important note: This model might not scale if detecting a type of string depends on the surrounding tokens. In that case, consider a more advanced sequence-processing tool such as NLP-Cube.

stringlifier's Issues

Numbers - Detection

Only part of a number in the log is detected as a number, so the results do not mask the data completely.

To Reproduce
Steps to reproduce the behavior:

Add a log entry containing a 16-digit number
Example - 4929193454463111
Run Stringlifier
In the output, it shows 49291

Support for other data types?

If my requirement is to support new data types such as regular expressions, URLs, cell phone numbers, etc., is there support for custom training?

Bug passing a list instead of a str

Describe the bug

When passing a list to stringlifier, it crashes with:

Traceback (most recent call last):
  File "test.py", line 17, in <module>
    x = stringlifier(s.split("/"))
  File "/root/p3.8/lib/python3.8/site-packages/stringlifier/api.py", line 61, in __call__
    new_str, toks = self._extract_tokens(tokens[iBatch], p_ts[iBatch], cutoff=cutoff)
  File "/root/p3.8/lib/python3.8/site-packages/stringlifier/api.py", line 118, in _extract_tokens
    if cls == 'C' and string[ii] in numbers:

Is it possible to release a new PyPI package with upgraded torch and numpy versions to fix a vulnerability?

Hi adobe team,

The latest version of stringlifier on PyPI is v0.1.1.4, which still pins torch==1.6.0 and numpy==1.19.2. The last commit unpinned the torch version, but that change has not been packaged to PyPI.

We have no problem using the library, but torch==1.6.0 has a vulnerability (CVE-2022-45907). To fix that, we need to upgrade torch to 1.13.1 with a corresponding numpy version.

I have tried cloning the repo, changing requirements.txt to torch==1.13.1 and numpy==1.22.0, and building the package ourselves to fix the vulnerability, but I would like to ask two questions:

Is it possible to release a new version to PyPI with upgraded torch and numpy, so that we do not need to build it ourselves?
Are there any issues with upgrading both libraries?

Thanks!

BR,
Shandi

Output score instead of type

Is there a way to output the score of the prediction results instead of only the output type?

Crash when passing an empty str

Describe the bug

Crash when passing None or an empty str "":

Traceback (most recent call last):
  File "test.py", line 20, in <module>
    print(stringlifier(elem)[0])
  File "/root/p3.8/lib/python3.8/site-packages/stringlifier/api.py", line 54, in __call__
    p_ts = self.classifier(tokens)
  File "/root/p3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/p3.8/lib/python3.8/site-packages/stringlifier/modules/stringc2.py", line 142, in forward
    output, _ = self._rnn(hidden)
  File "/root/p3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/p3.8/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 576, in forward
    result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,
RuntimeError: stack expects a non-empty TensorList

Expected behaviour: if it's not parsable, return it as-is (even if it is "" or None or an int or an object). Should be as fault-tolerant as possible.

I would only perform processing on elements in the input list if isinstance(element, str).
🎸
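
The fault-tolerant behaviour requested above can be approximated on the caller's side. The following is a minimal sketch, assuming the sanitizer is any callable that accepts a single str; the wrapper name sanitize_safely and the stand-in fake_sanitizer are hypothetical and only keep the example runnable without the model. Whatever the wrapped callable returns is passed through unchanged, so no particular return shape is assumed.

```python
from typing import Any, Callable


def sanitize_safely(sanitizer: Callable[[str], Any], value: Any) -> Any:
    """Run `sanitizer` only on non-empty strings; return everything else as-is."""
    if isinstance(value, (list, tuple)):
        # Handle each element independently so one bad item cannot crash the batch.
        return [sanitize_safely(sanitizer, item) for item in value]
    if not isinstance(value, str) or value == "":
        return value  # None, "", ints, arbitrary objects: pass through unchanged
    return sanitizer(value)


# Stand-in for a Stringlifier instance (assumption, not the real model).
def fake_sanitizer(text: str) -> str:
    return text.replace("secret", "<RANDOM_STRING>")


result = sanitize_safely(fake_sanitizer, ["token=secret", "", None, 42])
print(result)
```

In real use, the Stringlifier instance would be passed in place of fake_sanitizer.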

Add type hints

Is your feature request related to a problem? Please describe.

Adding type hints improves the developer experience (DX), especially if you are using an IDE like VS Code.

Describe the solution you'd like

Add type annotations.

Describe alternatives you've considered

N/A.

Additional context

I will work on this if this suggestion makes sense.

Add "/" as a delimiter

I would like to use stringlifier to parse HTML console and network logs; for this I'd need to add / as a separator.
Currently, when running on such strings, the / sometimes gets swallowed up in a UUID or RANDOM token.

If possible, maybe a rule-based cleaning step before applying ML would be beneficial, such as matching a standard UUID (32 hex digits in 5 fixed groups separated by -) 🤔
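
The rule-based pre-pass suggested here could look like the sketch below. It assumes the canonical textual UUID form (8-4-4-4-12 hexadecimal digits) and replaces matches before any ML step runs. The function name premask_uuids and the sample request line are hypothetical; this is not how stringlifier itself detects UUIDs.

```python
import re

# Canonical textual UUID: 8-4-4-4-12 hexadecimal digits, not embedded in a longer hex run.
UUID_RE = re.compile(
    r"(?<![0-9a-fA-F])"
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
    r"(?![0-9a-fA-F])"
)


def premask_uuids(text: str) -> str:
    """Replace well-formed UUIDs with a placeholder, leaving separators such as / intact."""
    return UUID_RE.sub("<UUID>", text)


line = "GET /api/items/45172425-08d1-41ec-9d13-437481803412/details"
print(premask_uuids(line))
```

Because the pattern only matches the hex groups, a surrounding / is never absorbed into the placeholder. Strings that merely resemble a UUID, such as one with a shortened last group, are left for the classifier to handle.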

Get the Classifier to Identify JWT Tokens

Is your feature request related to a problem? Please describe.
Currently, the classifier identifies a JWT token as three random strings like this: ['<RANDOM_STRING>.<RANDOM_STRING>.<RANDOM_STRING>_<RANDOM_STRING>'].

Describe the solution you'd like
It would be great if the classifier could output this as a single JWT token instead of identifying it as three random strings.
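
Until the classifier supports this, JWT-shaped strings can be caught with a rule before classification. The sketch below relies on two facts about compact JWTs: they consist of three base64url segments joined by dots, and the header and payload are JSON objects, so the token starts with "eyJ" (the base64url encoding of '{"'). The function name mask_jwts, the <JWT> label, and the sample token built here are assumptions for illustration; the sample is not a valid signed token.

```python
import base64
import json
import re

# Three base64url segments separated by dots, starting with "eyJ".
JWT_RE = re.compile(
    r"(?<![A-Za-z0-9_-])eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"
)


def mask_jwts(text: str) -> str:
    """Replace JWT-shaped substrings with a single placeholder (hypothetical label)."""
    return JWT_RE.sub("<JWT>", text)


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWT segments are encoded."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


# Build a JWT-shaped sample; the signature is a dummy value.
header = _b64url(json.dumps({"alg": "HS256"}).encode())
payload = _b64url(json.dumps({"sub": "demo"}).encode())
signature = _b64url(b"not-a-real-signature")
sample = f"{header}.{payload}.{signature}"

print(mask_jwts(f"Authorization: Bearer {sample}"))
```

This only checks the shape of the string; it does not decode or verify the token.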

[Bug report] UnboundLocalError is raised

Describe the bug

from stringlifier.api import Stringlifier

stringlifier=Stringlifier()

s = stringlifier('device."\\n3.')
print(s)

This script raises the following error.

Traceback (most recent call last):
  File "test.py", line 5, in <module>
    s = stringlifier('device."\\n3.')
  File "/Users/foo/dev/stringlifier/stringlifier/api.py", line 137, in __call__
    new_str, toks = self._extract_tokens(tokens[iBatch], p_ts[iBatch])
  File "/Users/foo/dev/stringlifier/stringlifier/api.py", line 211, in _extract_tokens
    tokens.append((c_tok, start, ii, type_))
UnboundLocalError: local variable 'type_' referenced before assignment

To Reproduce

Execute the script.

Expected behavior

Do not raise the UnboundLocalError.

Screenshots

N/A.

Desktop (please complete the following information):

OS: macOS
Browser: N/A
Version: Python 3.8.2

Additional context

N/A.
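
For readers unfamiliar with this error class: an UnboundLocalError like the one in the traceback above typically arises when a local variable is assigned only inside a conditional branch and then read on a path where no branch ran. The sketch below reproduces that pattern in isolation and shows the usual fix of initialising a default. It is a generic illustration with hypothetical function names, not the actual code in stringlifier/api.py.

```python
def label_buggy(chars):
    """Assigns type_ only for digits, so other inputs leave it unbound."""
    for ch in chars:
        if ch.isdigit():
            type_ = "<NUMERIC>"
    return type_  # raises UnboundLocalError if no digit was seen


def label_fixed(chars, default=None):
    """Same logic, but type_ always has a value before it is read."""
    type_ = default
    for ch in chars:
        if ch.isdigit():
            type_ = "<NUMERIC>"
    return type_


try:
    label_buggy(".")
except UnboundLocalError as exc:
    print("buggy version failed:", type(exc).__name__)

print(label_fixed("."))   # falls back to the default
print(label_fixed("3"))
```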

IP Address - Detection

The version number in a package name is classified as an IP address.
This causes a lot of false positives with the data set we are trying to use.

To Reproduce
Steps to reproduce the behavior:

Add a log entry containing a version number
Example - MongoDB v4.4.2
Run Stringlifier
In the output, it shows MongoDB v<IP_ADDR>
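
One possible caller-side mitigation is to validate <IP_ADDR> tokens with Python's standard ipaddress module and discard those that do not parse. The sketch below assumes tokens shaped like the (text, start, end, label) tuples returned with return_tokens=True; the helper names and the first sample tuple are hypothetical, written by hand to match the example in this report.

```python
import ipaddress


def is_real_ip(candidate: str) -> bool:
    """True only if the text parses as a complete IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(candidate)
    except ValueError:
        return False
    return True


def drop_false_ips(tokens):
    """Keep <IP_ADDR> tokens only when their text is a valid IP address."""
    return [t for t in tokens if t[3] != "<IP_ADDR>" or is_real_ip(t[0])]


tokens = [
    ("4.4.2", 9, 14, "<IP_ADDR>"),       # hypothetical false positive from "MongoDB v4.4.2"
    ("127.0.0.1", 65, 74, "<IP_ADDR>"),  # genuine address
]
print(drop_false_ips(tokens))
```

A three-part version such as 4.4.2 is rejected because an IPv4 address needs four octets. A four-part version such as 1.2.3.4 would still pass, so this check reduces false positives without eliminating them.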

Is there any paper or article describing the algorithms used here?

Hi Adobe team,

Thank you so much for publishing this library. I would like to learn more about the underlying algorithms, so could you please help give me some direction?

Thank you again,
Alex
