logpai / drain3
A robust streaming log template miner based on the Drain algorithm
License: Other
Dear All,
Thanks for this nice implementation of the algorithm. I have one question regarding config handling in the package.
The config object is created at the root of every module separately, which is not a problem when the code is run from the command line or a once-executed script. However, it creates a nuisance when I use this library in a Jupyter notebook. The objects are created in the cells where I import drain modules, and the config file is read at that time. Later in the notebook I try to adjust the config values, but they don't get updated, because the module is already loaded. The only way I can use an updated configuration is to reset the kernel and rerun everything, which defeats the exploratory purpose of a notebook.
I'd like to propose moving the reading of the config file into __init__, so it can at least be updated at object-creation time.
Thanks for your help,
Andrey.
Hi,
please add cachetools to setup.py, so that it gets installed automatically when installing drain3.
Thanks a lot!
https://github.com/IBM/Drain3/blob/06d6ca44217271086c8b499aeb08090c9788ce9b/setup.py#L9
How can I monitor which clusters are being used frequently? By monitoring I mean saving the cluster IDs of the recent clusters indicated by the following 3 change_type values.
change_type - indicates either if a new template was identified, an existing template was changed or a message was added to an existing cluster.
Would there be a way for me to save a TemplateMiner object, edit the templates in each cluster externally, and return the clusters to TemplateMiner in another script?
A friend of mine suggested pickling the object. Of course, that works for saving and reloading, but I need to be able to modify the templates manually.
I've used Drain to create clusters by parsing one log file.
I would like to append to those clusters a 'wildcard' template (something like '<*>') that will prevent Drain from creating new templates, so that Drain classifies unknown templates as '<*>'.
Could you give me a tip on how to do that?
Dear all,
when processing the same batch of logs twice everything works as expected, but when loading the state file and then processing the same batch of logs again, some new, duplicate clusters are created.
Here you can see the second run on the same batch of logs after reloading the state file:
Shouldn't the behavior after loading the state file before processing the logs the second time be the same as when processing them twice without loading the state file?
Thank you in advance!
Hi,
as far as I can see from the README, this is the tree-based implementation. Do you also plan to support the DAG-based version (https://arxiv.org/pdf/1806.04356.pdf)? I think this is the code from the logpai repo: https://github.com/logpai/logparser/blob/dev/logparser/Drainjournal.py
I'm not exactly sure how it compares to the original algorithm regarding performance, but the automated parameter tuning and log-group merging features sound interesting.
How can I get the original log lines replaced with Drain-parsed log lines? Is that functionality still there? What is the reason behind removing the text preprocessing step using regex?
Hi,
I tried to run the example file (drain_stdin_demo.py) with the example input log and I see that the mask_name is always * instead of IP, HEX, etc. I'd like to confirm whether any of you have the same issue.
I plan to debug and add a fix if this is really an issue. Otherwise, could you please tell me if I missed anything?
/opt/ndfm/src # python3 log_parser2.py
Starting Drain3 template miner
Checking for saved state
Restored 4 clusters built from 10 messages
Drain3 started with 'FILE' persistence
Starting training mode. Reading from std-in ('q' to finish)
> connected to 10.0.0.1
{"change_type": "none", "cluster_id": 1, "cluster_size": 5, "template_mined": "connected to <:*:>", "cluster_count": 4}
Parameters: [ExtractedParameter(value='10.0.0.1', mask_name='*')]
> connected to 192.168.0.1
{"change_type": "none", "cluster_id": 1, "cluster_size": 6, "template_mined": "connected to <:*:>", "cluster_count": 4}
Parameters: [ExtractedParameter(value='192.168.0.1', mask_name='*')]
> Hex number 0xDEADBEAF
{"change_type": "none", "cluster_id": 2, "cluster_size": 3, "template_mined": "Hex number <:*:>", "cluster_count": 4}
Parameters: [ExtractedParameter(value='0xDEADBEAF', mask_name='*')]
> user davidoh logged in
{"change_type": "none", "cluster_id": 3, "cluster_size": 4, "template_mined": "user <:*:> logged in", "cluster_count": 4}
Parameters: [ExtractedParameter(value='davidoh', mask_name='*')]
> user eranr logged in
{"change_type": "none", "cluster_id": 3, "cluster_size": 5, "template_mined": "user <:*:> logged in", "cluster_count": 4}
Parameters: [ExtractedParameter(value='eranr', mask_name='*')]
TIA
Is there anything out of the box, or can I use other Python libraries, to visualize the Drain parse tree?
Thanks for putting this together team. I have been trying to use Drain and came across this issue.
'user=mike ip=unknown-ip-addr cmd=Metastore shutdown complete',
'user=mike ip=unknown-ip-addr cmd=Shutting down the object store',
'user=smith ip=unknown-ip-addr cmd=Metastore shutdown complete',
'user=smith ip=unknown-ip-addr cmd=Shutting down the object store',
'user=jackson ip=unknown-ip-addr cmd=Metastore shutdown complete',
'user=jackson ip=unknown-ip-addr cmd=Shutting down the object store',
'user=bob ip=unknown-ip-addr cmd=Metastore shutdown complete',
'user=bob ip=unknown-ip-addr cmd=Shutting down the object store'
So, ideally, the patterns look similar i.e. of the form
user=<*> ip=<*> cmd=<*>
But the drain algorithm does not pick this up. I have tried several values of sim_th, depth, and max_children.
Am I missing something? Can someone help?
Hi David,
Thanks for the great work on updating Drain to Python 3.
I have some questions about drain_bigfile_demo:
1- Why partition by ':'? => line = line.partition(": ")[2]
Not all log records have ':', and therefore I receive a very large number of blank templates.
2- I'm getting many templates that begin with a mask; since drain's levels divide clusters by size and then by the words from the beginning, this causes them to end up in the same cluster.
Do you have a suggestion to solve this issue?
3- Writing results to a file: for anomaly detection I need to create a new file with records containing the timestamp and the resulting template from drain. How can I write this to a file? (How do I map each original log message record to its resulting template?)
Thanks!
When I install drain3 with pip, I get this deprecation warning:
DEPRECATION: drain3 is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at https://github.com/pypa/pip/issues/8559
During fast_match, drain always iterates over all possible clusters and updates their access time in the cache. This leads to two problems:
Expected behavior:
Clusters should only be updated/touched in the cache after they were actually used/chosen. There is already a comment for this in the source code:
Try to retrieve cluster from cache with bypassing eviction algorithm as we are only testing candidates for a match.
https://github.com/IBM/Drain3/blob/15470e391caed9a9ef5038cdd1dbd373bd2386a8/drain3/drain.py#L217
Hi,
I would like to get the value vector from a log, not only the cluster it matches. How can I get it?
match returns only a LogCluster:
def match(self, log_message: str) -> LogCluster:
Hi!
I am familiar with the old package and starting to get accustomed with Drain3.
I have a log file example.log and I have used Drain3 to parse each log with
import json
import logging
import time

from drain3 import TemplateMiner
from drain3.template_miner_config import TemplateMinerConfig

logging.basicConfig(filename="output_example.log", filemode='a', level=logging.DEBUG)
logger = logging.getLogger(__name__)

config = TemplateMinerConfig()
config.load("drain3.ini")
config.profiling_enabled = True
template_miner = TemplateMiner(config=config)

line_count = 0
batch_size = 10
start_time = time.time()
with open("example.log") as f:
    lines = f.readlines()
for line in lines:
    line = line.rstrip()
    line = line.partition(": ")[2]
    result = template_miner.add_log_message(line)
    line_count += 1
    if line_count % batch_size == 0:
        rate = line_count / (time.time() - start_time)  # rate was undefined in the original snippet
        logger.info(f"Processing line: {line_count}, rate {rate:.1f} lines/sec, "
                    f"{len(template_miner.drain.clusters)} clusters so far.")
    if result["change_type"] != "none":
        result_json = json.dumps(result)
        logger.info(f"Input ({line_count}): " + line)
        logger.info("Result: " + result_json)

sorted_clusters = sorted(template_miner.drain.clusters, key=lambda it: it.size, reverse=True)
for cluster in sorted_clusters:
    logger.info(cluster)
I am able to load the sorted clusters/templates by specifying
with open('output_example.log', 'r') as f:
lines = f.readlines()
But it is a bit tedious to keep track of the different log clusters/templates this way, and I have not found a way to label each original log with its new log cluster/template ID.
Do you have any suggestions for a better way to do this? For example, how would I save a CSV with columns "original log row number", "new parsed log", "parsed log ID"?
Thanks in advance for your help!
Annabelle
As the comments said
(3) "always" is the slowest. It will select the best match among all known clusters, by always evaluating all clusters with the same token count, and selecting the cluster with perfect all token match and least count of wildcard matches.
I have two clusters as below; one has a wildcard and the other does not:
And I have a log to be matched
IPPROTO_TCP fd: 100, errno: 100, option: 100, value: 100
After masked it become
IPPROTO_TCP fd <NUM> errno <NUM> option <NUM> value <NUM>
And I use the "always" strategy. As the comments say, I should get the cluster (id=1) with the fewest wildcards, but instead I get this result:
ID=2 : size=1 : IPPROTO_TCP fd <NUM> errno <NUM> option <NUM> value <*>
So I read the code in fast_match, and I found that this code segment will always return the cluster with the highest param_count. Is that wrong?
Should I modify it like this
thanks for all this effort
Sorry about this simple question: I need to feed a file from my system in to be mined. How can I do that?
I also need to know how to set options like the file, sim_th, and max_clusters.
thanks a lot
Hi!
Great repository, thanks for putting it out there!
I have a quick question; do you have built-in logic for updating the previous results whenever the cluster template changes? Just want to verify this before I start making my own logic on it.
Best regards,
stianvale
Hi,
I'm using Drain3 to parse some logs and sometimes I get the following error:
Traceback (most recent call last):
File "C:+\envs+\lib\site-packages\IPython\core\interactiveshell.py", line 3369, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 13, in <cell line: 11>
result = template_miner.add_log_message(msg)
File "C:+\envs+\lib\site-packages\drain3\template_miner.py", line 146, in add_log_message
self.profiler.report(self.config.profiling_report_sec)
File "C:+\envs+\lib\site-packages\drain3\simple_profiler.py", line 112, in report
text = os.linesep.join(lines)
File "C:+\envs+\lib\site-packages\drain3\simple_profiler.py", line 111, in
lines = map(lambda it: it.to_string(enclosing_time_sec, include_batch_rates), sorted_sections)
File "C:+\envs+\lib\site-packages\drain3\simple_profiler.py", line 135, in to_string
samples_per_sec = f"{self.sample_count / self.total_time_sec: 15,.2f}"
ZeroDivisionError: float division by zero
any help would be appreciated.
There seems to be an issue with match method in the Drain class.
When configuring a template miner with a config file with masking (e.g. masking = [{"regex_pattern":"((?<=[^A-Za-z0-9])|^)([\\-\\+]?\\d+)((?=[^A-Za-z0-9])|$)", "mask_with": "NUM"}]
) and mining all templates, it becomes impossible to match lines containing tokens that fit the regex patterns defined in the config file.
Example: with the same config file, imagine I get templates
ID=1 : size=100 : RAS node <:NUM:>
ID=2 : size=10 : RAS error <:NUM:>
Then template_miner.drain.match("RAS node 12334") will return None and not the first template.
Good afternoon!
Our ML team uses drain3 to transform system logs as part of a larger classification pipeline. In this pipeline, we use a pre-trained template miner to transform all of the batched logs being passed into the classifier for training. We are currently investigating how this could be done using tf.data.Dataset.map API to keep the pipeline efficient.
To this end, we were curious if any other drain3 users could benefit from a TensorFlow variant of the TemplateMiner. We have experience with TensorFlow and drain3 and would be willing to begin work on such a project.
Hi,
due to a type mismatch between current_depth and token_count, the boolean flag is_last_token is always false. It should read int(token_count).
https://github.com/IBM/Drain3/blob/f004cb235f92646b3cfdb4ed6680765e9f944d06/drain3/drain.py#L136
This is a test which fails:
def test_one_token_message(self):
    model = Drain()
    cluster, change_type = model.add_log_message("oneTokenMessage")
    self.assertEqual("cluster_created", change_type, "1st check")
    cluster, change_type = model.add_log_message("oneTokenMessage")
    self.assertEqual("none", change_type, "2nd check")
PS: Thanks for fixing the other two issues so quickly
Hi, I've been trying to use drain for preprocessing logs for deeplog model
The matching function in the TemplateMiner class works perfectly for data it has been trained on, but for some reason it always returns NoneType when trying to match unseen log data, even if some of the logs are identical to the ones in the training data.
For example for the following log in training data:-
'WMPLTFMLOG523523\t1676462824978\t2023-02-15 12:07:04.978\t11.16.135.252\t-\t-\t-\t-\tprod\treceiving-api\tunknown\tPROD\t0.0.3118\t4f5f5b1a-205-18654f89e12000\tINFO\tINFO\t-\t-\t-\t-\tapplog.cls=com.expat.move.nim.secure.core.advice.AOPLogger,applog.mthd=beforeControllerMethod,applog.line=34,applog.msg=Entering into [methodName=getHeartBeat] with [requests=[]]\t[]\t[]\t[-]\t[]\t[http-apr-8080-exec-14]\n'
I get the matching cluster_id
but for the exact log in unseen data:-
'WMPLTFMLOG523523\t1681105826004\t2023-04-10 05:50:26.004\t11.16.146.8\t-\t-\t-\t-\tprod\treceiving-api\tunknown\tPROD\t0.0.3393\tffffffffd51aa0d2-192-18769b730d4000\tINFO\tINFO\t-\t-\t-\t-\tapplog.cls=com.expat.move.nim.secure.core.advice.AOPLogger,applog.mthd=beforeControllerMethod,applog.line=34,applog.msg=Entering into [methodName=getHeartBeat] with [requests=[]]\t[]\t[]\t[-]\t[]\t[http-apr-8080-exec-6]\n'
I get a None type return
I passed some extra delimiters as part of the config file when training, could that be causing the issue?
Dear all,
In production, data drifts under some conditions and some old clusters will never occur again. New logs will never fall into those clusters, so the results are not affected, but performance is: as the number of clusters increases, runtime degrades because more time is spent comparing against existing clusters.
So here's the question: is there any interface in the kernel, or any suggestion, for automatically detecting unused clusters and purging them?
Thanks to all the developers; drain3 is an amazing kernel.
Example:
sentence 1: "review 1 - [INFO] The syntax is right."
sentence 2: "review 2 - [ERROR] The syntax is faulty."
Cluster formed:
"review <:NUM:> - <> The syntax is <>"
Expected cluster:
"review <:NUM:> - [INFO] The syntax is <>",
"review <:NUM:> - [ERROR] The syntax is <>".
How can I prevent masking/clustering of particular tokens? In this case, I want to keep INFO and ERROR as they are, in two different clusters (not masked into <*>).
Is there any documentation about how to use Drain3 for inference and how to utilize the parsed result?
Is there any API document about how to use the functions in Drain3? Thanks.
Currently I am using the following code to parse log file with trained miner.
with open(log_file) as f:
    lines = f.readlines()
for line in lines:
    cluster = template_miner.match(line)
    if cluster is not None:  # match returns None when no cluster fits
        params = template_miner.get_parameter_list(cluster.get_template(), line)
        print(cluster.get_template())
        print(params)
Is it correct?
Hi, thanks for the py3 implementation! I'm wondering if Drain3 supports the popular Syslog datasets, e.g., HDFS or BG/L?
Such as Spark Streaming, Heron, Flink, and so on.
What should I do to use Drain on these systems?
Thanks!(^_^)
Great tool! Thanks for making it available.
Is it possible to match a text to a "drained" template, i.e, get the cluster id for a specific message?
Hello,
Thanks for this good repo and blog post regarding log parsing.
I would like to use this library to do some works and I have some remaining questions before jumping in.
What do you do with the templates that you find? Do you transform them into regexes?
I don't get the usefulness of masking. What is the difference between masking and preprocessing (as you called it in your blog)?
Do you plan to support other masking formats such as grok?
Could you share some information about the analytics pipeline? Do you foresee an upcoming blog post about that?
Thanks for your time and response.
Kind regards
In the example of the SSH.log, I noticed this clustering result :
<L=5> ID=2 : size=14551 : Invalid user <:*:> from <:IP:>
<L=6> ID=27 : size=30 : Invalid user <:*:> <:*:> from <:IP:>
but i think these two template clusters should belong to one type.
The possible reason for this clustering is that the third word is composed of letters and numbers, which is processed into a wildcard during masking. For example,
Invalid user test9 from 52.80.34.196
Or the length of the entire log becomes longer because the string in the variable position of the log template is made up of two words. For example,
Invalid user boris zhang from 52.80.34.196 (This log is not in the ssh.log file, but this example may appear in other datasets)
All of these log messages should be grouped into a single log template cluster, but they are not.
So I wondered if I should devise a new matching pattern. The starting points of the design:
1. Drop the log message length as the first tree node.
2. Design the text-similarity formula so that it is computed from the text content itself.
I have a lot of lines that are masked completely because they contain a lot of rubbish, e.g. resulting in one token (= the mask), which then becomes the template.
For some reason, searching these lines is extremely slow (given that we already have a bigger search tree), while it should actually be super fast since they have only one token.
I cannot give a good example of the log due to confidentiality but perhaps this issue/limitation is generally known already?
No matter which persistence method I use, the first log after a restart is cluster_created, and the same results don't merge. Is this a bug?
Hello. I am not an expert in regular expressions, so could I get some help preprocessing the Windows event log below using regular expressions? Thank you very much.
2016-09-29 00:03:19, Info CBS Unloading offline registry hive: {bf1a281b-ad7b-4476-ac95-f47682990ce7}GLOBALROOT/Device/HarddiskVolumeShadowCopy2/Windows/System32/config/SOFTWARE
2016-09-28 04:30:30, Info CBS Loaded Servicing Stack v6.1.7601.23505 with Core: C:\Windows\winsxs\amd64_microsoft-windows-servicingstack_31bf3856ad364e35_6.1.7601.23505_none_681aa442f6fed7f0\cbscore.dll
Hi,
after upgrading to version 0.9.1 the application fails with the following exception (see below).
In fact, it seems that since Python 3.8, code like the following:
for key in keys:
    drain.id_to_cluster[int(key)] = drain.id_to_cluster.pop(key)
is no longer valid and leads to the observed exception (when starting the parser for the second time and loading the pickled state):
...
File "/home/wollny/.cache/pypoetry/virtualenvs/src-k3pL4lLu-py3.8/lib/python3.8/site-packages/drain3/template_miner.py", line 56, in __init__
self.load_state()
File "/home/wollny/.cache/pypoetry/virtualenvs/src-k3pL4lLu-py3.8/lib/python3.8/site-packages/drain3/template_miner.py", line 74, in load_state
for key in keys:
RuntimeError: dictionary keys changed during iteration
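The standard fix, shown here on a stand-in dict rather than the actual drain3 code, is to snapshot the keys before mutating:

```python
# stand-in for drain.id_to_cluster, whose string keys are converted to ints
id_to_cluster = {"1": "cluster-a", "2": "cluster-b"}

# list() snapshots the keys, so popping entries no longer mutates the
# view being iterated (a RuntimeError on Python 3.8+)
for key in list(id_to_cluster.keys()):
    id_to_cluster[int(key)] = id_to_cluster.pop(key)

print(sorted(id_to_cluster))  # prints [1, 2]
```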
Would be nice if you could fix this, as I like your project very much :-)
PS: It would be also nice if you could upgrade the jsonpickle package, as version 1.4.1 has a known safety issue, which got fixed in version 1.4.2
nox > safety check --file=requirements.txt --full-report
+==============================================================================+
| REPORT |
| checked 78 packages, using default DB |
+============================+===========+==========================+==========+
| package | installed | affected | ID |
+============================+===========+==========================+==========+
| jsonpickle | 1.4.1 | <=1.4.1 | 39319 |
+==============================================================================+
| Jsonpickle through 1.4.1 allows remote code execution during deserialization |
| of a malicious payload through the decode() function. See CVE-2020-22083. |
+==============================================================================+
Hi,
I'm trying to run the example: drain_bigfile_demo.py
, but I'm getting the error:
ImportError: cannot import name 'KeysView' from 'collections' (C:\Miniconda3\...\lib\collections\__init__.py)
The code is failing on this line of code:
from drain3 import TemplateMiner
To provide a bit of context, I installed Drain3 via: pip3 install drain3
, my python version is 3.10.0.
Thanks for your help.
Regards.
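For background: Python 3.10 removed the aliases of the collections.abc classes from the collections module (they had been deprecated since 3.3), so the failing import inside the library needs to target collections.abc, which works across versions:

```python
# fails on Python 3.10+:
#   from collections import KeysView
# works on all supported versions:
from collections.abc import KeysView

# dict key views implement the KeysView ABC
print(issubclass(type({}.keys()), KeysView))
```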
Regexes present a problem when double quotes are used, producing a JSON decoder error message.
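A likely cause (my assumption, since no example was given): the masking value in drain3.ini is parsed as JSON, where a literal double quote inside a regex must be escaped with a backslash:

```ini
[MASKING]
; a hypothetical rule that masks double-quoted strings; note the \" escapes,
; without them the JSON decoder fails while reading this value
masking = [{"regex_pattern": "\"[^\"]*\"", "mask_with": "QUOTED"}]
```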
Dear Drain3 project,
We are very glad to have found drain3. We tested it with a logfile of about 200k lines, and it finishes processing in about 10 seconds on my MacBook.
We would like to use drain3 to process logs of the same type from hundreds of sources, keeping one tree and one state. Could you advise on the best way to parallelize log ingestion?
I looked at the code, and it seems the log-processing function add_log_message should be run single-threaded.
best regards
I observed that when I create a lot of clusters (10000+), the drain3 kernel consumes more processing time. So the solution I thought of was to manually delete old clusters that are no longer used.
Assuming I have a list of cluster IDs that I want to remove from the drain3 kernel, what is the safest possible procedure? Please give a detailed explanation (do I need to modify the parse tree, or is deleting from the template_miner.drain.id_to_cluster dict sufficient? If not, what else needs to be done?)
If deleting is not a good idea, then how to improve the running time?
Hey,
in both sample files no config object is passed to TemplateMiner. This causes TemplateMiner to fall back to the default .ini file. However, if you use the call from the README, python -m examples.drain_stdin_demo,
to run the examples, there is no drain3.ini in the current working directory (the repository root), so no config file is used. As a result, the examples cannot be reproduced, since no masking templates are applied.
I think the cleanest way would be to create config objects in the examples.
Output of the current version:
python -m examples.drain_stdin_demo
Starting Drain3 template miner
Loading configuration from drain3.ini
config file not found: drain3.ini
Checking for saved state
Saved state not found
Drain3 started with 'FILE' persistence, reading from std-in (input 'q' to finish)
IP is 12.12.12.12
Saving state of 1 clusters with 1 messages, 352 bytes, reason: cluster_created (1)
{"change_type": "cluster_created", "cluster_id": 1, "cluster_size": 1, "template_mined": "IP is 12.12.12.12", "cluster_count": 1}
Proposed fix (e.g. drain_stdin_demo.py Line 45):
config = TemplateMinerConfig()
if os.path.exists(os.path.join("examples", "drain3.ini")):
    config.load(os.path.join("examples", "drain3.ini"))
else:
    logger.error("drain3.ini file not found")
template_miner = TemplateMiner(persistence, config)
Fixed Output:
python -m examples.drain_stdin_demo
Starting Drain3 template miner
Checking for saved state
Saved state not found
Drain3 started with 'FILE' persistence, reading from std-in (input 'q' to finish)
IP is 12.12.12.12
Saving state of 1 clusters with 1 messages, 940 bytes, reason: cluster_created (1)
{"change_type": "cluster_created", "cluster_id": 1, "cluster_size": 1, "template_mined": "IP is <IP>", "cluster_count": 1}
Hi,
I'm trying to match and mask a float number ("+12", "-12", "-3.14", ".314e1", etc.) in a sentence. I've tried several regexes, like "[^a-zA-Z:]".
Although this regex works when I run re.findall("[^a-zA-Z:]([-+]?\d+[\.]?\d*)", 'Hi, -1.25 is a float') in Python,
if I add it to the masking instructions as
{"regex_pattern":"(?<![a-zA-Z:])[-+]?\d*\.?\d+", "mask_with": "FLO"},
the masking doesn't occur; I get a <:*:> in the mined template.
What am I doing wrong?
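One likely cause (an assumption, since the ini file isn't shown): the masking list in drain3.ini is decoded as JSON, and in JSON a single backslash before d is an invalid escape, so \d must be written \\d. A quick check:

```python
import json
import re

# as written in the ini, \d is an invalid JSON escape and fails to decode
bad = r'{"regex_pattern": "(?<![a-zA-Z:])[-+]?\d*\.?\d+", "mask_with": "FLO"}'
# doubling the backslashes decodes to the intended regex
good = r'{"regex_pattern": "(?<![a-zA-Z:])[-+]?\\d*\\.?\\d+", "mask_with": "FLO"}'

rule = json.loads(good)
print(re.search(rule["regex_pattern"], "Hi, -1.25 is a float").group(0))  # prints -1.25
```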
Hi,
Please let me know: what is the use of specifying extra_delimiters = ["_"] in the config .ini file?
Use extras_require so there is no need to install unneeded dependencies.
I am happy to PR if you feel this is useful.
I'm back again with another inconsistency.
Observe the following example:
>>> from drain3 import *
>>> parser = TemplateMiner()
>>> template = "<hdfs_uri>:<number>+<number>"
>>> content = "hdfs://msra-sa-41:9000/pageinput2.txt:671088640+134217728"
>>> parser.get_parameter_list(template, content)
['hdfs', '//msra-sa-41:9000/pageinput2.txt:671088640', '134217728']
Now of course this arises in a context where I use some custom masking patterns.
The expected parameter-list according to those masking patterns would be:
['hdfs//msra-sa-41:9000/pageinput2.txt', '671088640', '134217728']
but get_parameter_list does not take that into account.
I'll give another more concise example, to demonstrate why this fails:
>>> parser.config.masking_instructions = [masking.MaskingInstruction(r"\d+\.\d+", "float")]
>>> parser.get_parameter_list("<float>.<*>", "0.15.Test")
['0', '15.Test']
Therefore the problem is that the delimiter between these two parameters (the '.') is also part of the desired first parameter ('0.15').
I gave it a thought and I think that implies this problem can only occur with custom masking patterns:
Under normal circumstances Drain would not produce a template where two parameters are separated by a delimiter other than a space. And since a parameter can only be a single token, they do not contain spaces and therefore the problem above does not occur.
(This might be a different story for extra_delimiters, but for the simple examples I can think of there shouldn't be any problems with that either.)
One solution would be to use the masking patterns to extract any parameters first and then apply the regular parameter extraction.
I'm working on a solution using this idea, but it's not ready yet, as it is a bit challenging to preserve the correct order.
Alternatively one could include the masking pattern in the mask, e.g. <float|\d+\.\d+>.
Then one could use these patterns instead of (.*?):
https://github.com/IBM/Drain3/blob/6fd6117859f45560f0e576ffcbcc63863d65bdde/drain3/template_miner.py#L181
But this would mean that regexes need to be present in log-templates which is obviously less readable.
If the mask_with attributes were unique across all MaskingInstruction objects, one could simply use the mask to determine the required pattern, but at the moment users are free to assign multiple MaskingInstructions with the same mask_with value.
Now since the masking patterns would need to be evaluated twice if you wanted to get the template of a log message and also the corresponding parameters, one could think about (optionally) including the parameters in the return values of add_log_message(...) and match(...) directly. But that would also require changing multiple methods in drain.py, so that would be more cumbersome.
Observe the following MWE (using the current version 0.9.7):
>>> from drain3 import *
>>> parser = TemplateMiner()
>>> parser.add_log_message("training4Model start")
{'change_type': 'cluster_created', 'cluster_id': 1, 'cluster_size': 1, 'template_mined': 'training4Model start', 'cluster_count': 1}
>>> parser.add_log_message("loadModel start")
{'change_type': 'cluster_template_changed', 'cluster_id': 1, 'cluster_size': 2, 'template_mined': '<*> start', 'cluster_count': 1}
>>> parser.add_log_message("loadModel stop")
{'change_type': 'cluster_created', 'cluster_id': 2, 'cluster_size': 1, 'template_mined': 'loadModel stop', 'cluster_count': 2}
>>> parser.match("loadModel start") is not None # This message was seen during training, so it should match.
False
Since loadModel start was previously passed in to parser.add_log_message, it should be possible to match it later in "inference" mode.
I would expect previously trained messages to always match later on.
As illustrated above, this cannot be guaranteed for arbitrary messages.
I've analyzed what is happening by looking through the source code and came to the following explanation:
When parser.match("loadModel start") is called, there are 2 clusters present with the following templates:
1. <*> start : This is the cluster that includes the first occurrence of loadModel start.
2. loadModel stop
When tree_search is called, loadModel matches the prefix of the 2nd template and the 1st template is completely disregarded (line 131). loadModel start does NOT exactly match loadModel stop, and as such no match is found (line 134).
and as such no match is found. (line 134)The reason the first template was modified to include a wildcard (which later prevents the log message from matching), is that training4Model
includes a number and thus will replaced by a wildcard if no other prefix matches.
See: https://github.com/IBM/Drain3/blob/6b724e7651bca2ac418dcbad01602a523a70b678/drain3/drain.py#L196-L199
The following example works perfectly fine therefore:
>>> from drain3 import *
>>> parser = TemplateMiner()
>>> parser.add_log_message("trainingModel start")
{'change_type': 'cluster_created', 'cluster_id': 1, 'cluster_size': 1, 'template_mined': 'trainingModel start', 'cluster_count': 1}
>>> parser.add_log_message("loadModel start")
{'change_type': 'cluster_created', 'cluster_id': 2, 'cluster_size': 1, 'template_mined': 'loadModel start', 'cluster_count': 2}
>>> parser.add_log_message("loadModel stop")
{'change_type': 'cluster_template_changed', 'cluster_id': 2, 'cluster_size': 2, 'template_mined': 'loadModel <*>', 'cluster_count': 2}
>>> parser.match("loadModel start") is not None # This message was seen during training, so it should match.
True
I'm not sure how this should be optimally handled.
Follow instructions from https://medium.com/@joel.barmettler/how-to-upload-your-python-package-to-pypi-65edc5fe9c56