logpai / drain3

A robust streaming log template miner based on the Drain algorithm

License: Other

Python 99.96% Shell 0.04%
template-mining drain log aiops anomaly-detection clustering machine-learning observability log-clustering

drain3's Issues

module-wise config

Dear All,

Thanks for this nice implementation of the algorithm. I have one question regarding how configuration is handled in the package.

The config object is created at the root of every module separately, which is not a problem when the code is run from the command line or as a once-executed script. However, it becomes a nuisance when I use this library in a Jupyter notebook: the objects are created in the cells where I import the drain modules, and the config file is read at that time. Later in the notebook I try to adjust the config values, but they are not picked up, because the module is already loaded. The only way to use an updated configuration is to reset the kernel and re-run everything, which defeats the exploratory purpose of the notebook.

I'd like to propose moving the reading of the config file into __init__, so that it can at least be updated at object-creation time.
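For reference, a minimal sketch of the per-object pattern this proposal enables, using the TemplateMinerConfig API that appears in other issues on this page (exact attribute names assumed):

from drain3 import TemplateMiner
from drain3.template_miner_config import TemplateMinerConfig

config = TemplateMinerConfig()
config.load("drain3.ini")        # re-read the config file at object-creation time
config.profiling_enabled = True  # tweak values from a later notebook cell
template_miner = TemplateMiner(config=config)  # picks up the fresh values without a kernel restart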

Thanks for your help,
Andrey.

Monitor the use of frequently used cluster

How can I monitor which clusters are being used frequently? By monitoring I mean saving the cluster ids of the recent clusters indicated by the following three change_type values.

change_type - indicates whether a new template was identified, an existing template was changed, or a message was added to an existing cluster.
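A minimal sketch of one way to track this with the result dict returned by add_log_message (the Counter bookkeeping is my addition, not a drain3 API):

from collections import Counter

from drain3 import TemplateMiner

template_miner = TemplateMiner()
cluster_usage = Counter()

for line in ["connected to 10.0.0.1", "connected to 192.168.0.1", "user eranr logged in"]:
    result = template_miner.add_log_message(line)
    cluster_usage[result["cluster_id"]] += 1  # every message is routed to some cluster
    if result["change_type"] != "none":       # a template was created or changed
        print("template event:", result["change_type"], "for cluster", result["cluster_id"])

print(cluster_usage.most_common(2))  # most frequently used clusters first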

Modify templates and reload clusters into TemplateMiner?

Would there be a way for me to save a TemplateMiner object, edit the templates in each cluster externally, and return the clusters to TemplateMiner in another script?

A friend of mine suggested pickling the object. That works for saving and reloading, of course, but I also need to be able to modify the templates manually.
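A minimal sketch of that workflow, assuming (as other issues on this page suggest) that clusters live in template_miner.drain.id_to_cluster and that a cluster's tokens are stored in log_template_tokens:

import pickle

# script 1: save the trained miner
with open("miner.pkl", "wb") as f:
    pickle.dump(template_miner, f)

# script 2: reload it, edit a template by hand, and keep using it
with open("miner.pkl", "rb") as f:
    template_miner = pickle.load(f)

cluster = template_miner.drain.id_to_cluster[1]           # pick a cluster by its id
cluster.log_template_tokens = ["connected", "to", "<*>"]  # assumed editable attribute
print(cluster.get_template())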

Is there a way to add a wildcard template?

I've used Drain to create clusters by parsing one log file.
I would like to append to those clusters a 'wildcard' template (something like '<*>') that prevents Drain from creating new templates, so that Drain classifies unknown messages as '<*>'.

Could you give me a tip on how to do that?
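One possible workaround, sketched under the assumption that match() returns None for messages that fit no existing cluster (see the match-related issues further down): map those misses to a catch-all '<*>' label yourself instead of training new clusters.

from drain3 import TemplateMiner

template_miner = TemplateMiner()
template_miner.add_log_message("connected to 10.0.0.1")  # training phase

log_line = "something never seen before"
cluster = template_miner.match(log_line)  # None when no existing cluster fits
template = cluster.get_template() if cluster is not None else "<*>"  # unknown => wildcard
print(template)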

Duplicate clusters after loading saved state

Dear all,

When processing the same batch of logs twice, everything works as expected; but when loading the state file and processing the same batch of logs again, some new, duplicate clusters are created.

Here you can see the second run on the same batch of logs after reloading the state file:
[screenshot: JupyterLab, 2021-03-02 14:26]

And here are all clusters:
[screenshot: JupyterLab, 2021-03-02 14:31]

Shouldn't the behavior after loading the state file and then processing the logs a second time be the same as when processing them twice without reloading the state file?

Thank you in advance!

Only mask_name * is used

Hi,

I tried to run the example file (drain_stdin_demo.py) with the example input log, and I see that mask_name is always * instead of IP, HEX, etc. I'd like to confirm whether anyone else sees the same issue.

I plan to debug and contribute a fix if this is really a bug. Otherwise, could you please tell me if I'm missing anything?

/opt/ndfm/src # python3 log_parser2.py 
Starting Drain3 template miner
Checking for saved state
Restored 4 clusters built from 10 messages
Drain3 started with 'FILE' persistence
Starting training mode. Reading from std-in ('q' to finish)
> connected to 10.0.0.1
{"change_type": "none", "cluster_id": 1, "cluster_size": 5, "template_mined": "connected to <:*:>", "cluster_count": 4}
Parameters: [ExtractedParameter(value='10.0.0.1', mask_name='*')]
> connected to 192.168.0.1
{"change_type": "none", "cluster_id": 1, "cluster_size": 6, "template_mined": "connected to <:*:>", "cluster_count": 4}
Parameters: [ExtractedParameter(value='192.168.0.1', mask_name='*')]
> Hex number 0xDEADBEAF
{"change_type": "none", "cluster_id": 2, "cluster_size": 3, "template_mined": "Hex number <:*:>", "cluster_count": 4}
Parameters: [ExtractedParameter(value='0xDEADBEAF', mask_name='*')]
> user davidoh logged in
{"change_type": "none", "cluster_id": 3, "cluster_size": 4, "template_mined": "user <:*:> logged in", "cluster_count": 4}
Parameters: [ExtractedParameter(value='davidoh', mask_name='*')]
> user eranr logged in
{"change_type": "none", "cluster_id": 3, "cluster_size": 5, "template_mined": "user <:*:> logged in", "cluster_count": 4}
Parameters: [ExtractedParameter(value='eranr', mask_name='*')]

TIA

Drain3 does not extract text patterns?

Thanks for putting this together, team. I have been trying to use Drain and came across this issue:

'user=mike ip=unknown-ip-addr cmd=Metastore shutdown complete',
'user=mike ip=unknown-ip-addr cmd=Shutting down the object store',
'user=smith ip=unknown-ip-addr cmd=Metastore shutdown complete',
'user=smith ip=unknown-ip-addr cmd=Shutting down the object store',
'user=jackson ip=unknown-ip-addr cmd=Metastore shutdown complete',
'user=jackson ip=unknown-ip-addr cmd=Shutting down the object store',
'user=bob ip=unknown-ip-addr cmd=Metastore shutdown complete',
'user=bob ip=unknown-ip-addr cmd=Shutting down the object store'

So, ideally, the patterns look similar, i.e. of the form

user=<*> ip=<*> cmd=<*>

But the Drain algorithm does not pick this up. I have tried several values of sim_th, depth, and max_children.

Am I missing something? Can someone help?
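One observation, with a sketch: Drain first partitions lines by token count, and "cmd=Metastore shutdown complete" (5 tokens) versus "cmd=Shutting down the object store" (7 tokens) can therefore never share a cluster. Collapsing the cmd=... tail into a single masked token makes the lines comparable (MaskingInstruction usage as in the get_parameter_list issue below; the CMD pattern is my own):

from drain3 import TemplateMiner
from drain3.masking import MaskingInstruction
from drain3.template_miner_config import TemplateMinerConfig

config = TemplateMinerConfig()
config.masking_instructions.append(MaskingInstruction(r"cmd=.*$", "CMD"))  # whole tail becomes one token
template_miner = TemplateMiner(config=config)

for line in [
    "user=mike ip=unknown-ip-addr cmd=Metastore shutdown complete",
    "user=smith ip=unknown-ip-addr cmd=Shutting down the object store",
]:
    print(template_miner.add_log_message(line)["template_mined"])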

Some questions about drain_bigfile_demo

Hi David,
Thanks for the great work on updating Drain to Python 3.

I have some questions about drain_bigfile_demo:

1. Why partition by ':'? => line = line.partition(": ")[2]
Not all log records contain ':', so I receive a very large number of blank templates.

2. I'm getting many templates that begin with a mask. Since Drain's tree first partitions clusters by length and then by the words at the beginning, such messages end up in the same cluster. Do you have a suggestion for solving this?

3. Writing results to a file: for anomaly detection I need to create a new file whose records contain the timestamp and the resulting template from Drain. How can I write this to a file? (How do I map the old log message record to the resulting template?)

Thanks!

Drain3 deprecation warning with pip install command.

When I install drain3 with pip, I get this deprecation warning:

DEPRECATION: drain3 is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at https://github.com/pypa/pip/issues/8559

Cache always updates clusters even if not needed anymore

During fast_match, drain always iterates over all possible clusters and updates their access time in the cache. This causes two problems:

  • The updates slow down performance
  • Clusters that will never match again are never removed from the cache

Expected behavior:

Clusters should only be updated/touched in the cache after they were actually used/chosen. There is already a comment for this in the source code:

Try to retrieve cluster from cache with bypassing eviction algorithm as we are only testing candidates for a match.
https://github.com/IBM/Drain3/blob/15470e391caed9a9ef5038cdd1dbd373bd2386a8/drain3/drain.py#L217
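A minimal sketch of the bypass idea, assuming the cluster cache is a cachetools LRUCache as in the drain3 code: invoking the base class's __getitem__ reads an entry without refreshing its LRU position, so merely testing a candidate does not keep it alive in the cache.

from cachetools import Cache, LRUCache

cache = LRUCache(maxsize=2)
cache[1] = "cluster A"
cache[2] = "cluster B"

peeked = Cache.__getitem__(cache, 1)  # peek: does not mark key 1 as recently used
cache[3] = "cluster C"                # evicts key 1, still the least recently used
print(list(cache.keys()))             # [2, 3]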

I cannot get value vector from match

Hi,

I would like to get the value vector from a log, not only the cluster it matches. How can I get it?

match returns only a LogCluster:

def match(self, log_message: str) -> LogCluster:
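A sketch of getting the value vector via get_parameter_list (the same API used in the "API document for Drain3" issue below):

from drain3 import TemplateMiner

template_miner = TemplateMiner()
template_miner.add_log_message("connected to 10.0.0.1")
template_miner.add_log_message("connected to 192.168.0.1")

log_line = "connected to 127.0.0.1"
cluster = template_miner.match(log_line)
if cluster is not None:
    # the value vector for the wildcard positions of the matched template
    print(template_miner.get_parameter_list(cluster.get_template(), log_line))  # ['127.0.0.1']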

Saving log template/cluster and ID for each log

Hi!

I am familiar with the old package and am getting accustomed to Drain3.

I have a log file example.log and I have used Drain3 to parse each log with:

import json
import logging
import time

from drain3 import TemplateMiner
from drain3.template_miner_config import TemplateMinerConfig

logging.basicConfig(filename="output_example.log", filemode='a', level=logging.DEBUG)
logger = logging.getLogger(__name__)

config = TemplateMinerConfig()
config.load("drain3.ini")
config.profiling_enabled = True
template_miner = TemplateMiner(config=config)

line_count = 0
with open("example.log") as f:
    lines = f.readlines()

batch_size = 10
start_time = time.time()

for line in lines:
    line = line.rstrip()
    line = line.partition(": ")[2]
    result = template_miner.add_log_message(line)
    line_count += 1
    if line_count % batch_size == 0:
        rate = line_count / (time.time() - start_time)  # lines processed per second
        logger.info(f"Processing line: {line_count}, rate {rate:.1f} lines/sec, "
                    f"{len(template_miner.drain.clusters)} clusters so far.")

    if result["change_type"] != "none":
        result_json = json.dumps(result)
        logger.info(f"Input ({line_count}): " + line)
        logger.info("Result: " + result_json)

sorted_clusters = sorted(template_miner.drain.clusters, key=lambda it: it.size, reverse=True)

for cluster in sorted_clusters:
    logger.info(cluster)

I am able to load the sorted clusters/templates by specifying

with open('output_example.log', 'r') as f:
  lines = f.readlines()

But it is a bit tedious to keep track of the different log clusters/templates this way, and I have not found a way to label each original log with its new cluster/template ID.

Do you have any suggestions for a better way to do this? For example, how could I save a CSV with columns "original log row number", "new parsed log", "parsed log ID"? (A sketch follows below.)
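For reference, a sketch of one way to produce such a CSV, reusing the template_miner from the script above (the column names mirror the question; none of this is an official drain3 API):

import csv

rows = []
with open("example.log") as f:
    for row_number, raw_line in enumerate(f, start=1):
        line = raw_line.rstrip().partition(": ")[2]
        result = template_miner.add_log_message(line)
        rows.append((row_number, result["template_mined"], result["cluster_id"]))

with open("parsed_logs.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["original log row number", "new parsed log", "parsed log ID"])
    writer.writerows(rows)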

Thanks in advance for your help!

Annabelle

About parameter `full_search_strategy` in drain match method

As the comments say:
(3) "always" is the slowest. It will select the best match among all known clusters, by always evaluating all clusters with the same token count, and selecting the cluster with a perfect all-token match and the least count of wildcard matches.

I have two clusters as below; one has a wildcard and the other does not:
[screenshot 1: the two clusters]

And I have a log to be matched
IPPROTO_TCP fd: 100, errno: 100, option: 100, value: 100

After masking it becomes
IPPROTO_TCP fd <NUM> errno <NUM> option <NUM> value <NUM>
Using the "always" strategy I should, per the comments, get the cluster (id=1) with the fewest wildcards, but instead I get:

ID=2 : size=1 : IPPROTO_TCP fd <NUM> errno <NUM> option <NUM> value <*>

So I read the code in fast_match, and I found that this code segment always returns the cluster with the highest param_count. Is this wrong?
[screenshot 2: the fast_match code segment]

Should I modify it like this?
[screenshot 3: the proposed modification]

specify a log file

thanks for all this effort

Sorry about this simple question: I need to feed a file from my system into the miner. How can I do that?
I also need to know how to set options like the input file, sim_th, and max_clusters.

thanks a lot

Error parsing logs: "ZeroDivisionError: float division by zero"

Hi,
I'm using Drain3 to parse some logs, and sometimes I get the following error:

Traceback (most recent call last):
File "C:+\envs+\lib\site-packages\IPython\core\interactiveshell.py", line 3369, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 13, in <cell line: 11>
result = template_miner.add_log_message(msg)
File "C:+\envs+\lib\site-packages\drain3\template_miner.py", line 146, in add_log_message
self.profiler.report(self.config.profiling_report_sec)
File "C:+\envs+\lib\site-packages\drain3\simple_profiler.py", line 112, in report
text = os.linesep.join(lines)
File "C:+\envs+\lib\site-packages\drain3\simple_profiler.py", line 111, in
lines = map(lambda it: it.to_string(enclosing_time_sec, include_batch_rates), sorted_sections)
File "C:+\envs+\lib\site-packages\drain3\simple_profiler.py", line 135, in to_string
samples_per_sec = f"{self.sample_count / self.total_time_sec: 15,.2f}"
ZeroDivisionError: float division by zero

any help would be appreciated.

Issue with match method in Drain class

There seems to be an issue with match method in the Drain class.
When configuring a template miner with a config file with masking (e.g. masking = [{"regex_pattern":"((?<=[^A-Za-z0-9])|^)([\\-\\+]?\\d+)((?=[^A-Za-z0-9])|$)", "mask_with": "NUM"}]) and mining all templates, it becomes impossible to match lines containing tokens that fit the regex patterns defined in the config file.

Example: with the same config file, imagine I get the templates
ID=1 : size=100 : RAS node <:NUM:>
ID=2 : size=10 : RAS error <:NUM:>
Then template_miner.drain.match("RAS node 12334") will return None instead of the first template.
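A guess, with a sketch: drain.match compares raw tokens, so the unmasked "12334" never equals the template's <:NUM:> token, whereas TemplateMiner.match appears to apply the masking instructions before matching. Matching at the TemplateMiner level should therefore find the cluster:

cluster = template_miner.match("RAS node 12334")  # masked to "RAS node <:NUM:>" first
print(cluster.get_template() if cluster is not None else None)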

TensorFlow variant

Good afternoon!

Our ML team uses drain3 to transform system logs as part of a larger classification pipeline. In this pipeline, we use a pre-trained template miner to transform all of the batched logs being passed into the classifier for training. We are currently investigating how this could be done using tf.data.Dataset.map API to keep the pipeline efficient.

To this end, we were curious if any other drain3 users could benefit from a TensorFlow variant of the TemplateMiner. We have experience with TensorFlow and drain3 and would be willing to begin work on such a project.

comparison of type int with type str in function add_seq_to_prefix_tree

Hi,
due to a type mismatch between current_depth and token_count, the boolean flag is_last_token is always False.
It should read int(token_count).

https://github.com/IBM/Drain3/blob/f004cb235f92646b3cfdb4ed6680765e9f944d06/drain3/drain.py#L136

This is a test which fails:

import unittest
from drain3.drain import Drain

class DrainTest(unittest.TestCase):
    def test_one_token_message(self):
        model = Drain()
        cluster, change_type = model.add_log_message("oneTokenMessage")
        self.assertEqual("cluster_created", change_type, "1st check")
        cluster, change_type = model.add_log_message("oneTokenMessage")
        self.assertEqual("none", change_type, "2nd check")

PS: Thanks for fixing the other two issues so quickly

Log Matching on new data

Hi, I've been trying to use Drain to preprocess logs for a DeepLog model.

The match function in the TemplateMiner class works perfectly on the data it was trained on, but for some reason it always returns None when matching unseen log data, even when some of the logs are identical to ones in the training data.

For example, for the following log in the training data:

'WMPLTFMLOG523523\t1676462824978\t2023-02-15 12:07:04.978\t11.16.135.252\t-\t-\t-\t-\tprod\treceiving-api\tunknown\tPROD\t0.0.3118\t4f5f5b1a-205-18654f89e12000\tINFO\tINFO\t-\t-\t-\t-\tapplog.cls=com.expat.move.nim.secure.core.advice.AOPLogger,applog.mthd=beforeControllerMethod,applog.line=34,applog.msg=Entering into [methodName=getHeartBeat] with [requests=[]]\t[]\t[]\t[-]\t[]\t[http-apr-8080-exec-14]\n'

I get the matching cluster_id

but for the exact same log in unseen data:

'WMPLTFMLOG523523\t1681105826004\t2023-04-10 05:50:26.004\t11.16.146.8\t-\t-\t-\t-\tprod\treceiving-api\tunknown\tPROD\t0.0.3393\tffffffffd51aa0d2-192-18769b730d4000\tINFO\tINFO\t-\t-\t-\t-\tapplog.cls=com.expat.move.nim.secure.core.advice.AOPLogger,applog.mthd=beforeControllerMethod,applog.line=34,applog.msg=Entering into [methodName=getHeartBeat] with [requests=[]]\t[]\t[]\t[-]\t[]\t[http-apr-8080-exec-6]\n'

I get None in return.

I passed some extra delimiters in the config file when training; could that be causing the issue?

How to purge no-use existing cluster after running for a while

Dear all,
In production, data drifts under some conditions, and some old clusters will never occur again. New logs never land in those clusters, so the results are not affected; the performance, however, is. As the number of clusters grows, runtime degrades because more and more time is spent comparing against existing clusters.

So here's the question: is there any interface in the kernel, or any suggestion, for automatically detecting unused clusters and purging them? (A sketch of one existing knob follows below.)

Thanks to all the developers; drain3 is an amazing kernel.
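A minimal sketch of one existing knob, assuming TemplateMinerConfig exposes the max_clusters option from drain3.ini as drain_max_clusters: capping the cluster count makes Drain keep clusters in an LRU cache and evict the least recently used ones automatically.

from drain3 import TemplateMiner
from drain3.template_miner_config import TemplateMinerConfig

config = TemplateMinerConfig()
config.drain_max_clusters = 1024  # assumed attribute name; LRU-evicts clusters beyond this bound
template_miner = TemplateMiner(config=config)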

Skip Masking/cluster particular tokens

Example:

sentence 1: "review 1 - [INFO] The syntax is right."
sentence 2: "review 2 - [ERROR] The syntax is faulty."


Cluster formed:

"review <:NUM:> - <> The syntax is <>"


Expected cluster:

"review <:NUM:> - [INFO] The syntax is <>",
"review <:NUM:> - [ERROR]The syntax is <
>".

How can I prevent masking/clustering of some particular tokens? In this case I want to keep INFO/ERROR as-is, in two different clusters (not masked into <*>).
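I am not aware of a built-in "keep this token" option, but here is a sketch of one workaround: raising the similarity threshold (assumed attribute drain_sim_th, backing the sim_th option from drain3.ini; the default is 0.4) forces lines to share more identical tokens before they merge, so the [INFO] and [ERROR] variants stay in separate clusters.

from drain3 import TemplateMiner
from drain3.template_miner_config import TemplateMinerConfig

config = TemplateMinerConfig()
config.drain_sim_th = 0.9  # stricter than the default: fewer tokens collapse into <*>
template_miner = TemplateMiner(config=config)

print(template_miner.add_log_message("review 1 - [INFO] The syntax is right.")["template_mined"])
print(template_miner.add_log_message("review 2 - [ERROR] The syntax is faulty.")["template_mined"])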

API document for Drain3

Is there any documentation on how to use Drain3 for inference and how to utilize the parsed result?

Is there any API documentation for the functions in Drain3? Thanks.

Currently I am using the following code to parse a log file with a trained miner:

with open(log_file) as f:
    for line in f:
        cluster = template_miner.match(line)
        if cluster is None:  # match() returns None when no existing cluster fits
            continue
        params = template_miner.get_parameter_list(cluster.get_template(), line)
        print(cluster.get_template())
        print(params)

Is it correct?

HDFS/BGL

Hi, thanks for the py3 implementation! I'm wondering if Drain3 supports the popular Syslog datasets, e.g., HDFS or BG/L?

Match a text to a template

Great tool! Thanks for making it available.

Is it possible to match a text to a "drained" template, i.e., to get the cluster id for a specific message?
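A short sketch using the match API discussed in other issues on this page:

from drain3 import TemplateMiner

template_miner = TemplateMiner()
template_miner.add_log_message("user alice logged in")
template_miner.add_log_message("user bob logged in")

cluster = template_miner.match("user carol logged in")  # returns None if nothing fits
if cluster is not None:
    print(cluster.cluster_id, cluster.get_template())   # e.g. 1 user <*> logged in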

Questions

Hello,

Thanks for this good repo and blog post regarding log parsing.
I would like to use this library for some work, and I have a few remaining questions before jumping in:

  • What do you do with the templates you find? Do you transform them into regexes?
  • I don't get the usefulness of masking. What is the difference between masking and preprocessing (as described in your blog)?
  • Do you plan to support other masking formats, such as grok?
  • Could you share some information about the analytics pipeline? Do you foresee an upcoming blog post about it?

Thanks for your time and response.

Kind regards

Restrictions on matching mode

In the example of SSH.log, I noticed this clustering result:
<L=5> ID=2 : size=14551 : Invalid user <:*:> from <:IP:>
<L=6> ID=27 : size=30 : Invalid user <:*:> <:*:> from <:IP:>
but I think these two template clusters should belong to one type.
One possible reason for this split is that the third word is composed of letters and digits, which is processed into a mask token during masking; as a result the log becomes one token longer. For example:
Invalid user test9 from 52.80.34.196
Or the string in the variable position of the log template is made up of two words. For example:
Invalid user boris zhang from 52.80.34.196 (this log is not in the ssh.log file, but the example may appear in other datasets)
All of these log messages should be grouped into a single template cluster, but they are not.
So I wondered whether I should devise a new matching pattern. The starting points of the design:
1. Drop the log message length as the first tree node.
2. Design the text-similarity formula to be computed from the text content itself.

Search very slow when logline has only one token (e.g. from masking)

I have a lot of lines that are masked completely because they contain a lot of rubbish, e.g. resulting in one token (= the mask), which then becomes the template.
For some reason, search on these lines is extremely slow (given we already have a bigger search tree), while it should actually be super fast since they have only one token.

I cannot give a good example of the log due to confidentiality but perhaps this issue/limitation is generally known already?

Windows regular expression

Hello. I am not an expert in regular expressions, so could I get some help preprocessing the Windows event log with regular expressions? Thank you very much. (A sketch follows after the two examples.)

  • For registry 2016-09-29 00:03:19, Info CBS Unloading offline registry hive: {bf1a281b-ad7b-4476-ac95-f47682990ce7}GLOBALROOT/Device/HarddiskVolumeShadowCopy2/Windows/System32/config/SOFTWARE
  • For Windows path 2016-09-28 04:30:30, Info CBS Loaded Servicing Stack v6.1.7601.23505 with Core: C:\Windows\winsxs\amd64_microsoft-windows-servicingstack_31bf3856ad364e35_6.1.7601.23505_none_681aa442f6fed7f0\cbscore.dll
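A sketch of masking instructions for these two cases (the GUID and Windows-path patterns are my own, so verify them against your logs; MaskingInstruction usage as in other issues on this page):

from drain3 import TemplateMiner
from drain3.masking import MaskingInstruction
from drain3.template_miner_config import TemplateMinerConfig

config = TemplateMinerConfig()
config.masking_instructions.extend([
    MaskingInstruction(r"\{[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\}", "GUID"),
    MaskingInstruction(r"[A-Za-z]:\\(?:[^\\\s]+\\)*[^\\\s]+", "WINPATH"),
])
template_miner = TemplateMiner(config=config)

# hypothetical shortened line for illustration
print(template_miner.add_log_message(r"Info CBS Loaded Core: C:\Windows\System32\cbscore.dll")["template_mined"])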

RuntimeError: dictionary keys changed during iteration

Hi,
after upgrading to version 0.9.1 the application fails with the following exception (see below).
In fact, it seems that since Python 3.8, code like the following:

        for key in keys:
            drain.id_to_cluster[int(key)] = drain.id_to_cluster.pop(key)

is no longer valid and leads to the observed exception (when starting the parser for the second time and loading the pickled state):

...
  File "/home/wollny/.cache/pypoetry/virtualenvs/src-k3pL4lLu-py3.8/lib/python3.8/site-packages/drain3/template_miner.py", line 56, in __init__
    self.load_state()
  File "/home/wollny/.cache/pypoetry/virtualenvs/src-k3pL4lLu-py3.8/lib/python3.8/site-packages/drain3/template_miner.py", line 74, in load_state
    for key in keys:
RuntimeError: dictionary keys changed during iteration
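A minimal sketch of one possible fix: materialize the keys first, so the dict is not mutated while iterating over its live key view.

for key in list(drain.id_to_cluster.keys()):
    drain.id_to_cluster[int(key)] = drain.id_to_cluster.pop(key)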

It would be nice if you could fix this, as I like your project very much :-)

PS: It would also be nice if you could upgrade the jsonpickle package, as version 1.4.1 has a known security issue, which was fixed in version 1.4.2:

nox > safety check --file=requirements.txt --full-report
+==============================================================================+
| REPORT                                                                       |
| checked 78 packages, using default DB                                        |
+============================+===========+==========================+==========+
| package                    | installed | affected                 | ID       |
+============================+===========+==========================+==========+
| jsonpickle                 | 1.4.1     | <=1.4.1                  | 39319    |
+==============================================================================+
| Jsonpickle through 1.4.1 allows remote code execution during deserialization |
| of a malicious payload through the decode() function. See CVE-2020-22083.    |
+==============================================================================+

Error when running the example.

Hi,

I'm trying to run the example drain_bigfile_demo.py, but I'm getting this error:

ImportError: cannot import name 'KeysView' from 'collections' (C:\Miniconda3\...\lib\collections\__init__.py)

The code fails on this line:
from drain3 import TemplateMiner

For a bit of context, I installed Drain3 via pip3 install drain3, and my Python version is 3.10.0.
Thanks for your help.
Regards.

parallel log ingestions

Dear Drain3 project,

We are very glad to have found drain3. We tested it on a logfile with about 200k lines, and it finished processing in about 10 seconds on my MacBook.

We would like to use drain3 to process logs of the same type from hundreds of sources, keeping one tree and one state. Could you advise on the best way to parallelize log ingestion?

Looking at the code, it seems the log-processing function add_log_message should be run single-threaded; a sketch of one approach follows below.
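A minimal sketch of working within that constraint (the queue/threading scaffolding is my suggestion, not a drain3 API): any number of producers feed one queue, and a single consumer thread owns the TemplateMiner.

import queue
import threading

from drain3 import TemplateMiner

template_miner = TemplateMiner()
log_queue = queue.Queue(maxsize=10000)

def consumer():
    while True:
        line = log_queue.get()
        if line is None:  # sentinel: stop consuming
            break
        template_miner.add_log_message(line)  # only this thread touches the miner
        log_queue.task_done()

threading.Thread(target=consumer, daemon=True).start()
# Producers from all sources can now call log_queue.put(line) safely.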

best regards

Delete cluster from drain dict id_to_cluster | Impact | procedure

I observed that when I create a lot of clusters (10,000+), the drain3 kernel consumes more processing time, so the solution I thought of was to manually delete old clusters that are no longer used.
Assuming I have a list of cluster ids that I want to remove from the drain3 kernel, what is the safest procedure? Please give a detailed explanation: do I need to modify the parse tree, or is deleting from the template_miner.drain.id_to_cluster dict sufficient? If not, what else needs to be done?
If deleting is not a good idea, how else can I improve the running time?

Missing config object in sample files

Hey,
in both sample files no config object is passed to TemplateMiner. This causes TemplateMiner to fall back to the default .ini file. However, if you use the call from the readme, python -m examples.drain_stdin_demo, there is no drain3.ini in the current working directory (the repository root), and thus no config file is used. As a result the examples cannot be reproduced, since no masking templates are applied.

I think the cleanest way would be to create config objects in the examples.
Output of the current version:

python -m examples.drain_stdin_demo
Starting Drain3 template miner
Loading configuration from drain3.ini
config file not found: drain3.ini
Checking for saved state
Saved state not found
Drain3 started with 'FILE' persistence, reading from std-in (input 'q' to finish)
IP is 12.12.12.12
Saving state of 1 clusters with 1 messages, 352 bytes, reason: cluster_created (1)
{"change_type": "cluster_created", "cluster_id": 1, "cluster_size": 1, "template_mined": "IP is 12.12.12.12", "cluster_count": 1}

Proposed fix (e.g. drain_stdin_demo.py Line 45):

import os

config = TemplateMinerConfig()
if os.path.exists(os.path.join("examples", "drain3.ini")):
    config.load(os.path.join("examples", "drain3.ini"))
else:
    logger.error("drain3.ini file not found")

template_miner = TemplateMiner(persistence, config)

Fixed Output:

python -m examples.drain_stdin_demo
Starting Drain3 template miner
Checking for saved state
Saved state not found
Drain3 started with 'FILE' persistence, reading from std-in (input 'q' to finish)
IP is 12.12.12.12
Saving state of 1 clusters with 1 messages, 940 bytes, reason: cluster_created (1)
{"change_type": "cluster_created", "cluster_id": 1, "cluster_size": 1, "template_mined": "IP is <IP>", "cluster_count": 1}

Matching float numbers

Hi,

I'm trying to match and mask a float number ("+12", "-12", "-3.14", ".314e1", etc.) in a sentence, and I've tried several regexes.

Although the regex works when I run re.findall("[^a-zA-Z:]([-+]?\d+[\.]?\d*)", 'Hi, -1.25 is a float') in Python, if I add it to the masking instructions as

{"regex_pattern":"(?<![a-zA-Z:])[-+]?\d*\.?\d+", "mask_with": "FLO"},

the masking doesn't occur; I get a <:*:> in the mined template.

What am I doing wrong?
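A guess: the masking example quoted in the "Issue with match method" report above doubles its backslashes, which suggests the masking option in drain3.ini is parsed as JSON, where "\d" is not a valid escape. Doubling the backslashes should let the FLO mask apply:

{"regex_pattern":"(?<![a-zA-Z:])[-+]?\\d*\\.?\\d+", "mask_with": "FLO"}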

Extra delimiters in config

Hi,

Please let me know: what is the use of specifying extra_delimiters = ["_"] in the config .ini file?
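My understanding, sketched below (treat the attribute name as an assumption): each extra delimiter is treated like whitespace during tokenization, so fields joined by "_" become separate tokens that Drain can generalize.

from drain3 import TemplateMiner
from drain3.template_miner_config import TemplateMinerConfig

config = TemplateMinerConfig()
config.drain_extra_delimiters = ["_"]  # assumed attribute backing the ini option
template_miner = TemplateMiner(config=config)

print(template_miner.add_log_message("job_42 started")["template_mined"])
print(template_miner.add_log_message("job_57 started")["template_mined"])  # e.g. "job <*> started"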

`get_parameter_list` can return incorrect parameters

I'm back again with another inconsistency.
Observe the following example:

>>> from drain3 import *
>>> parser = TemplateMiner()
>>> template = "<hdfs_uri>:<number>+<number>"
>>> content = "hdfs://msra-sa-41:9000/pageinput2.txt:671088640+134217728"
>>> parser.get_parameter_list(template, content)
['hdfs', '//msra-sa-41:9000/pageinput2.txt:671088640', '134217728']

Now of course this arises in a context where I use some custom masking patterns.
The expected parameter-list according to those masking patterns would be:
['hdfs//msra-sa-41:9000/pageinput2.txt', '671088640', '134217728']
but get_parameter_list does not take that into account.

I'll give another more concise example, to demonstrate why this fails:

>>> parser.config.masking_instructions = [masking.MaskingInstruction(r"\d+\.\d+", "float")]
>>> parser.get_parameter_list("<float>.<*>", "0.15.Test")
['0', '15.Test']

Therefore the problem is that the delimiter between these two parameters ('.') is also part of the desired first parameter '0.15'.
I gave it a thought and I think that implies this problem can only occur with custom masking patterns:
Under normal circumstances Drain would not produce a template where two parameters are separated by a delimiter other than a space. And since a parameter can only be a single token, they do not contain spaces and therefore the problem above does not occur.
(This might be a different story for extra_delimiters, but for the simple examples I can think of there shouldn't be any problems with that either.)


One solution would be to use the masking patterns to extract any parameters first and then apply the regular parameter extraction.
I'm working on a solution using this idea, but it's not ready yet, as it is a bit challenging to preserve the correct order.

Alternatively one could include the masking-pattern in the mask, e.g. <float|\d+\.\d+>.
Then one could use these patterns instead of (.*?):
https://github.com/IBM/Drain3/blob/6fd6117859f45560f0e576ffcbcc63863d65bdde/drain3/template_miner.py#L181
But this would mean that regexes need to be present in log templates, which is obviously less readable.
If the mask_with attributes were unique across all MaskingInstruction objects, one could simply use the mask to determine the required pattern, but at the moment users are free to assign multiple MaskingInstructions the same mask_with value.

Now, since the masking patterns would need to be evaluated twice if you wanted both the template of a log message and the corresponding parameters, one could consider (optionally) including the parameters directly in the return values of add_log_message(...) and match(...). But that would also require changing multiple methods in drain.py, so it would be more cumbersome.

Previously trained messages cannot always be matched

Observe the following MWE (using the current version 0.9.7):

>>> from drain3 import *
>>> parser = TemplateMiner()
>>> parser.add_log_message("training4Model start")
{'change_type': 'cluster_created', 'cluster_id': 1, 'cluster_size': 1, 'template_mined': 'training4Model start', 'cluster_count': 1}
>>> parser.add_log_message("loadModel start")
{'change_type': 'cluster_template_changed', 'cluster_id': 1, 'cluster_size': 2, 'template_mined': '<*> start', 'cluster_count': 1}
>>> parser.add_log_message("loadModel stop")
{'change_type': 'cluster_created', 'cluster_id': 2, 'cluster_size': 1, 'template_mined': 'loadModel stop', 'cluster_count': 2}
>>> parser.match("loadModel start") is not None # This message was seen during training, so it should match.
False

Since loadModel start was previously passed to parser.add_log_message, it should be possible to match it later in "inference" mode.
I would expect previously trained messages to always match later on.
As illustrated above, this cannot be guaranteed for arbitrary messages.


I've analyzed what is happening by looking through the source code and came to the following explanation:

  • When parser.match("loadModel start") is called, there are 2 clusters present with the following templates:
    1. <*> start: This is the cluster that includes the first occurrence of loadModel start.
    2. loadModel stop
  • Thus, when tree_search is called, loadModel matches the prefix of the 2nd template, and the 1st template is completely disregarded. (line 131)
  • Obviously loadModel start does NOT exactly match loadModel stop, and as such no match is found. (line 134)
    https://github.com/IBM/Drain3/blob/6b724e7651bca2ac418dcbad01602a523a70b678/drain3/drain.py#L121-L135

The reason the first template was modified to include a wildcard (which later prevents the log message from matching) is that training4Model includes a number and is thus replaced by a wildcard if no other prefix matches.
See: https://github.com/IBM/Drain3/blob/6b724e7651bca2ac418dcbad01602a523a70b678/drain3/drain.py#L196-L199
The following example therefore works perfectly fine:

>>> from drain3 import *
>>> parser = TemplateMiner()
>>> parser.add_log_message("trainingModel start")
{'change_type': 'cluster_created', 'cluster_id': 1, 'cluster_size': 1, 'template_mined': 'trainingModel start', 'cluster_count': 1}
>>> parser.add_log_message("loadModel start")
{'change_type': 'cluster_created', 'cluster_id': 2, 'cluster_size': 1, 'template_mined': 'loadModel start', 'cluster_count': 2}
>>> parser.add_log_message("loadModel stop")
{'change_type': 'cluster_template_changed', 'cluster_id': 2, 'cluster_size': 2, 'template_mined': 'loadModel <*>', 'cluster_count': 2}
>>> parser.match("loadModel start") is not None # This message was seen during training, so it should match.
True

I'm not sure how this should be optimally handled.
