Drain3

A robust streaming log template miner based on the Drain algorithm.

Important Update

Drain3 has moved to the logpai GitHub organization, which is also the home of the original Drain implementation. We welcome more contributors and maintainers to join us and push the project forward, as well as contributions and implementation variants if you find practical enhancements to the algorithm in production scenarios.

Introduction

Drain3 is an online log template miner that can extract templates (clusters) from a stream of log messages in a timely manner. It employs a parse tree with fixed depth to guide the log group search process, which effectively avoids constructing a very deep and unbalanced tree.

Drain3 continuously learns on-the-fly and extracts log templates from raw log entries.

Example:

For the input:

connected to 10.0.0.1
connected to 192.168.0.1
Hex number 0xDEADBEAF
user davidoh logged in
user eranr logged in

Drain3 extracts the following templates:

ID=1     : size=2         : connected to <:IP:>
ID=2     : size=1         : Hex number <:HEX:>
ID=3     : size=2         : user <:*:> logged in

Full sample program output:

Starting Drain3 template miner
Checking for saved state
Saved state not found
Drain3 started with 'FILE' persistence
Starting training mode. Reading from std-in ('q' to finish)
> connected to 10.0.0.1
Saving state of 1 clusters with 1 messages, 528 bytes, reason: cluster_created (1)
{"change_type": "cluster_created", "cluster_id": 1, "cluster_size": 1, "template_mined": "connected to <:IP:>", "cluster_count": 1}
Parameters: [ExtractedParameter(value='10.0.0.1', mask_name='IP')]
> connected to 192.168.0.1
{"change_type": "none", "cluster_id": 1, "cluster_size": 2, "template_mined": "connected to <:IP:>", "cluster_count": 1}
Parameters: [ExtractedParameter(value='192.168.0.1', mask_name='IP')]
> Hex number 0xDEADBEAF
Saving state of 2 clusters with 3 messages, 584 bytes, reason: cluster_created (2)
{"change_type": "cluster_created", "cluster_id": 2, "cluster_size": 1, "template_mined": "Hex number <:HEX:>", "cluster_count": 2}
Parameters: [ExtractedParameter(value='0xDEADBEAF', mask_name='HEX')]
> user davidoh logged in
Saving state of 3 clusters with 4 messages, 648 bytes, reason: cluster_created (3)
{"change_type": "cluster_created", "cluster_id": 3, "cluster_size": 1, "template_mined": "user davidoh logged in", "cluster_count": 3}
Parameters: []
> user eranr logged in
Saving state of 3 clusters with 5 messages, 644 bytes, reason: cluster_template_changed (3)
{"change_type": "cluster_template_changed", "cluster_id": 3, "cluster_size": 2, "template_mined": "user <:*:> logged in", "cluster_count": 3}
Parameters: [ExtractedParameter(value='eranr', mask_name='*')]
> q
Training done. Mined clusters:
ID=1     : size=2         : connected to <:IP:>
ID=2     : size=1         : Hex number <:HEX:>
ID=3     : size=2         : user <:*:> logged in

This project is an upgrade of the original Drain project by LogPAI from Python 2.7 to Python 3.6 or later, with additional features and bug fixes.

Read more about Drain in the following paper: Pinjia He, Jieming Zhu, Zibin Zheng, Michael R. Lyu. "Drain: An Online Log Parsing Approach with Fixed Depth Tree", IEEE International Conference on Web Services (ICWS), 2017.

A Drain3 use case is presented in this blog post: Use open source Drain3 log-template mining project to monitor for network outages.

New features

  • Persistence. Save and load Drain state into an Apache Kafka topic, Redis or a file.
  • Streaming. Support feeding Drain with messages one-by-one.
  • Masking. Replace some message parts (e.g. numbers, IPs, emails) with wildcards. This improves the accuracy of template mining.
  • Packaging. As a pip package.
  • Configuration. Support for configuring Drain3 using an .ini file or a configuration object.
  • Memory efficiency. Decrease the memory footprint of internal data structures and introduce a cache to control max memory consumed (thanks to @StanislawSwierc).
  • Inference mode. In case you want to separate training and inference phases, Drain3 provides a function for fast matching against already-learned clusters (templates) only, without using regular expressions.
  • Parameter extraction. Accurate extraction of the variable parts of a log message as an ordered list, based on its mined template and the defined masking instructions (thanks to @Impelon).

Expected Input and Output

Although Drain3 can ingest full raw log messages, template mining accuracy improves if you feed it only the unstructured free-text portion of each log message, by first removing structured parts like timestamp, hostname, severity, etc.
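
For example, a minimal preprocessing sketch (the raw log format and the regex here are illustrative assumptions, not something Drain3 prescribes):

import re

from drain3 import TemplateMiner

template_miner = TemplateMiner()

# Assumed raw format: "<date> <time> <host> <severity> <free-text message>"
raw = "2023-02-15 12:07:04 host1 INFO connected to 10.0.0.1"
message = re.sub(r"^\S+ \S+ \S+ \S+ ", "", raw)  # keep only "connected to 10.0.0.1"
result = template_miner.add_log_message(message)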

The output is a dictionary with the following fields:

  • change_type - indicates whether a new template was identified, an existing template was changed, or the message was added to an existing cluster.
  • cluster_id - sequential ID of the cluster that the log belongs to.
  • cluster_size - the size (message count) of the cluster that the log belongs to.
  • cluster_count - count of clusters seen so far.
  • template_mined - the latest template of the above cluster_id.
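
A minimal sketch that prints these fields (default TemplateMiner, no masking; the log lines are illustrative):

from drain3 import TemplateMiner

template_miner = TemplateMiner()

for line in ["connected to 10.0.0.1", "connected to 192.168.0.1"]:
    result = template_miner.add_log_message(line)
    print(result["change_type"], result["cluster_id"],
          result["cluster_size"], result["cluster_count"], result["template_mined"])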

Configuration

Drain3 is configured using configparser. By default, the config filename is drain3.ini in the working directory. It can also be configured by passing a TemplateMinerConfig object to the TemplateMiner constructor.

Primary configuration parameters:

  • [DRAIN]/sim_th - similarity threshold. If the percentage of similar tokens for a log message is below this number, a new log cluster will be created (default 0.4).
  • [DRAIN]/depth - max depth levels of log clusters. Minimum is 3 (default 4).
  • [DRAIN]/max_children - max number of children of an internal node (default 100).
  • [DRAIN]/max_clusters - max number of tracked clusters (unlimited by default). When this number is reached, the model starts replacing old clusters with new ones according to the LRU cache eviction policy.
  • [DRAIN]/extra_delimiters - delimiters to apply when splitting a log message into words, in addition to whitespace (default none). Format is a Python list, e.g. ['_', ':'].
  • [MASKING]/masking - parameter masking instructions, in JSON format (default "").
  • [MASKING]/mask_prefix & [MASKING]/mask_suffix - the wrapping of identified parameters in templates. Defaults are < and > respectively.
  • [SNAPSHOT]/snapshot_interval_minutes - time interval for new snapshots (default 1).
  • [SNAPSHOT]/compress_state - whether to compress the state before saving it. This can be useful when using Kafka persistence.
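
For example, a configuration sketch using a TemplateMinerConfig object (the drain_* attribute names mirror the [DRAIN] parameters above; treat them as assumptions if they differ in your version):

from drain3 import TemplateMiner
from drain3.template_miner_config import TemplateMinerConfig

config = TemplateMinerConfig()
config.load("drain3.ini")         # optional: start from an .ini file
config.drain_sim_th = 0.4         # similarity threshold
config.drain_depth = 4            # parse tree depth
config.drain_max_clusters = 1024  # cap tracked clusters (LRU eviction)

template_miner = TemplateMiner(config=config)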

Masking

This feature allows masking specific variable parts of a log message with keywords, prior to passing it to Drain. Well-defined masking can improve template mining accuracy.

Template parameters that do not match any custom mask in the preliminary masking phase are replaced with <*> by the Drain core.

To set custom masking, use a list of regular expressions in the configuration file, where each entry has the keys "regex_pattern" and "mask_with".

For example, the following masking instructions in drain3.ini will mask IP addresses and integers:

[MASKING]
masking = [
          {"regex_pattern":"((?<=[^A-Za-z0-9])|^)(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})((?=[^A-Za-z0-9])|$)", "mask_with": "IP"},
          {"regex_pattern":"((?<=[^A-Za-z0-9])|^)([\\-\\+]?\\d+)((?=[^A-Za-z0-9])|$)", "mask_with": "NUM"},
          ]
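
The same masking can also be defined in code through the configuration object (a sketch; MaskingInstruction takes a regex pattern and a mask name, as used elsewhere in this project):

from drain3 import TemplateMiner
from drain3.masking import MaskingInstruction
from drain3.template_miner_config import TemplateMinerConfig

config = TemplateMinerConfig()
config.masking_instructions.append(MaskingInstruction(
    r"((?<=[^A-Za-z0-9])|^)(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})((?=[^A-Za-z0-9])|$)", "IP"))
config.masking_instructions.append(MaskingInstruction(
    r"((?<=[^A-Za-z0-9])|^)([\-\+]?\d+)((?=[^A-Za-z0-9])|$)", "NUM"))

template_miner = TemplateMiner(config=config)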

Persistence

The persistence feature saves and loads a snapshot of Drain3 state in a (compressed) JSON format. This adds restart resiliency to Drain, allowing it to resume activity and retain learned knowledge across restarts.

Drain3 state includes the search tree and all the clusters that were identified up until snapshot time.

The snapshot also persists the number of log messages matched by each cluster, and its cluster_id.

An example of a snapshot:

{
  "clusters": [
    {
      "cluster_id": 1,
      "log_template_tokens": [
        "aa",
        "aa",
        "<*>"
      ],
      "py/object": "drain3_core.LogCluster",
      "size": 2
    },
    {
      "cluster_id": 2,
      "log_template_tokens": [
        "My",
        "IP",
        "is",
        "<IP>"
      ],
      "py/object": "drain3_core.LogCluster",
      "size": 1
    }
  ]
}

This example snapshot persists two clusters with the templates:

["aa", "aa", "<*>"] - occurs twice

["My", "IP", "is", "<IP>"] - occurs once

Snapshots are created in the following events:

  • cluster_created - when a new template is identified
  • cluster_template_changed - when the template of an existing cluster is updated
  • periodic - n minutes after the last snapshot. This is intended to save cluster sizes even if no new template was identified.

Drain3 currently supports the following persistence modes:

  • Kafka - The snapshot is saved in a dedicated topic used only for snapshots - the last message in this topic is the last snapshot that will be loaded after restart. For Kafka persistence, you need to provide: topic_name. You may also provide other kwargs that are supported by kafka.KafkaConsumer and kafka.KafkaProducer, e.g. bootstrap_servers to change the Kafka endpoint (default is localhost:9092).

  • Redis - The snapshot is saved to a key in Redis database (contributed by @matabares).

  • File - The snapshot is saved to a file.

  • Memory - The snapshot is saved to an in-memory object.

  • None - No persistence.

Drain3 persistence modes can be easily extended to another medium / database by inheriting the PersistenceHandler class.
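
For example, a sketch of file persistence (FilePersistence is the handler used by the bundled stdin demo; the state file path is illustrative):

from drain3 import TemplateMiner
from drain3.file_persistence import FilePersistence

persistence = FilePersistence("drain3_state.bin")
template_miner = TemplateMiner(persistence)  # loads the saved state if the file exists

template_miner.add_log_message("connected to 10.0.0.1")
# snapshots are written on cluster_created / cluster_template_changed / periodic events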

Training vs. Inference modes

In some use-cases, it is required to separate training and inference phases.

In the training phase you should call template_miner.add_log_message(log_line). This will match the log line against an existing cluster (if similarity is above the threshold) or create a new cluster. It may also change the template of an existing cluster.

In inference mode you should call template_miner.match(log_line). This will match the log line against previously learned clusters only. No new clusters are created and templates of existing clusters are not changed. A match to an existing cluster has to be perfect, otherwise None is returned. You can use the persistence option to load previously trained clusters before inference.
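
A short sketch of the two phases (the training lines and the matched message are illustrative):

from drain3 import TemplateMiner

template_miner = TemplateMiner()

# Training: clusters may be created or their templates updated.
for line in ["connected to 10.0.0.1", "connected to 192.168.0.1"]:
    template_miner.add_log_message(line)

# Inference: match against already-learned clusters only.
cluster = template_miner.match("connected to 127.0.0.1")
if cluster is None:
    print("no matching cluster")
else:
    print(cluster.cluster_id, cluster.get_template())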

Memory efficiency

This feature limits the max memory used by the model. It is particularly important for large and possibly unbounded log streams. It is controlled by the max_clusters parameter, which sets the max number of clusters/templates tracked by the model. When the limit is reached, new templates start to replace old ones according to the Least Recently Used (LRU) eviction policy. This makes the model adapt quickly to the most recent templates in the log stream.

Parameter Extraction

Drain3 supports retrieving an ordered list of variables in a log message, after its template was mined. Each parameter is accompanied by the name of the mask that was matched, or * for the catch-all mask.

Parameter extraction is performed by generating a regular expression that matches the template and then applying it to the log message. When exact_matching is enabled (the default), the generated regex includes the regular expressions defined in the relevant masking instructions. If there are multiple masking instructions with the same name, either match can satisfy the regex. It is possible to disable exact matching, so that every variable is matched against a non-whitespace character sequence. This may improve performance at the expense of accuracy.

Parameter extraction regexes generated per template are cached by default, to improve performance. You can control the cache size with the MASKING/parameter_extraction_cache_capacity configuration parameter.

Sample usage:

result = template_miner.add_log_message(log_line)
params = template_miner.extract_parameters(
    result["template_mined"], log_line, exact_matching=True)

For the input "user johndoe logged in 11 minutes ago", the template would be:

"user <:*:> logged in <:NUM:> minutes ago"

... and the extracted parameters:

[
  ExtractedParameter(value='johndoe', mask_name='*'), 
  ExtractedParameter(value='11', mask_name='NUM')
]

Installation

Drain3 is available from PyPI. To install use pip:

pip3 install drain3

Note: If you decide to use Kafka or Redis persistence, you should install the relevant client library explicitly, since it is declared as an extra (optional) dependency, by either:

pip3 install kafka-python

-- or --

pip3 install redis

Examples

In order to run the examples directly from the repository, you need to install dependencies. You can do that using pipenv by executing the following command (assuming pipenv is already installed):

python3 -m pipenv sync

Example 1 - drain_stdin_demo

Run examples/drain_stdin_demo.py from the root folder of the repository by:

python3 -m pipenv run python -m examples.drain_stdin_demo

This example uses Drain3 on input from stdin and persists state to either Kafka, a file, or no persistence at all.

Change the persistence_type variable in the example to change the persistence mode.

Enter several log lines using the command line. Press q to end online learn-and-match mode.

Next, the demo switches to match (inference)-only mode, in which no new clusters are trained and input is matched against previously trained clusters only. Press q again to finish execution.

Example 2 - drain_bigfile_demo

Run examples/drain_bigfile_demo.py from the root folder of the repository by:

python3 -m pipenv run python -m examples.drain_bigfile_demo

This example downloads a real-world log file (of an SSH server), processes all lines, and then prints the resulting clusters, the prefix tree and performance statistics.

Sample config file

An example drain3.ini file with masking instructions can be found in the examples folder as well.

Contributing

Our project welcomes external contributions. Please refer to CONTRIBUTING.md for further details.

Change Log

v0.9.11
  • Fixed possible DivideByZero error when the profiler is enabled - Issue #65.
v0.9.10
  • Fixed compatibility issue with Python 3.10 caused by removal of KeysView.
v0.9.9
  • Added support for accurate log message parameter extraction in a new function - extract_parameters(). The function get_parameter_list() is deprecated (Thanks to @Impelon).
  • Refactored AbstractMaskingInstruction as a base class for RegexMaskingInstruction, allowing to introduce other types of masking mechanisms.
v0.9.8
  • Added a full_search_strategy option to TemplateMiner.match() and Drain.match(). See more info at Issue #48.
  • Added an option to disable parameterization of tokens that contain digits in configuration: TemplateMinerConfig.parametrize_numeric_tokens
  • Loading Drain snapshot now only restores clusters state and not configuration parameters. This improves backwards compatibility when introducing new Drain configuration parameters.
v0.9.7
  • Fixed bug in original Drain: log clusters were created multiple times for log messages with fewer tokens than max_node_depth.
  • Changed the depth property name to a more descriptive max_node_depth, as Drain always subtracts 2 from the depth argument value. Also added a log_cluster_depth property to reflect the original value of the depth argument (Breaking Change).
  • Restricted the depth param to a minimum sensible value of 3.
  • Added log cluster count to nodes in Drain.print_tree()
  • Added optional log cluster details to Drain.print_tree()
v0.9.6
  • Fix issue #38: Unnecessary update of LRU cache in case max_clusters is used (thanks @StanislawSwierc).
v0.9.5
  • Added: TemplateMiner.match() function for fast matching against existing clusters only.
v0.9.4
  • Added: TemplateMiner.get_parameter_list() function to extract template parameters for a raw log message (thanks to @cwyalpha)
  • Added option to customize the mask wrapper - instead of the default <*>, <NUM> etc., you can select any wrapper prefix or suffix by overriding TemplateMinerConfig.mask_prefix and TemplateMinerConfig.mask_suffix
  • Fixed: config .ini file is always read from the same folder as the source file in demos and tests (thanks @RobinMaas95)
v0.9.3
  • Fixed: comparison of type int with type str in function add_seq_to_prefix_tree #28 (bug introduced at v0.9.1)
v0.9.2
  • Updated jsonpickle version
  • Keys of the id_to_cluster dict are now persisted by jsonpickle as int instead of str, to avoid key type conversion on snapshot load, which caused some issues.
  • Added cachetools dependency to setup.py.
v0.9.1
  • Added option to configure TemplateMiner using a configuration object (without .ini file).
  • Support for print_tree() to a file/stream.
  • Added MemoryBufferPersistence
  • Added unit tests for state save/load.
  • Bug fix: missing type-conversion in state loading, introduced in v0.9.0
  • Refactor: Drain prefix tree keys are now of type str also for 1st level (was int before), for type consistency.
v0.9.0
  • Decrease memory footprint of the main data structures.
  • Added max_clusters option to limit the number of tracked clusters.
  • Changed cluster identifier type from str to int
  • Added more unit tests and CI
v0.8.6
  • Added extra_delimiters configuration option to Drain
v0.8.5
  • Profiler improvements
v0.8.4
  • Masking speed improvement
v0.8.3
  • Fix: profiler state after load from snapshot
v0.8.2
  • Fixed snapshot backward compatibility to v0.7.9
v0.8.1
  • Bugfix in profiling configuration read
v0.8.0
  • Added time profiling support (disabled by default)
  • Added cluster ID to snapshot reason log (credit: @boernd)
  • Minor Readability and documentation improvements in Drain
v0.7.9
  • Fix: KafkaPersistence now also accepts bootstrap_servers as kwargs.
v0.7.8
  • Using kafka-python package instead of kafka (newer).
  • Added support for specifying additional configuration as kwargs in Kafka persistence handler.
v0.7.7
  • Corrected default Drain config values.
v0.7.6
  • Improvement in config file handling (Note: new sections were added instead of DEFAULT section)
v0.7.5
  • Made Kafka and Redis optional requirements

Contributors

boris-2021, davidohana, diveshlunker, eranra, impelon, jxtrbtk, matabares, moshik1, nikolai-kummer, no-preserve-root, stanislawswierc, stevemar, superskyyy

drain3's Issues

Missing config object in sample files

Hey,
in both sample files no config object is passed to TemplateMiner. This causes TemplateMiner to fall back to the default .ini file. However, if you use the call from the readme, python -m examples.drain_stdin_demo, to run the examples, there is no drain3.ini in the current working directory (the "main" directory) and thus no config file is used. As a result, the examples cannot be reproduced as shown, since no masking templates are applied.

I think the cleanest way would be to create config objects in the examples.
Output of the current version:

python -m examples.drain_stdin_demo
Starting Drain3 template miner
Loading configuration from drain3.ini
config file not found: drain3.ini
Checking for saved state
Saved state not found
Drain3 started with 'FILE' persistence, reading from std-in (input 'q' to finish)
IP is 12.12.12.12
Saving state of 1 clusters with 1 messages, 352 bytes, reason: cluster_created (1)
{"change_type": "cluster_created", "cluster_id": 1, "cluster_size": 1, "template_mined": "IP is 12.12.12.12", "cluster_count": 1}

Proposed fix (e.g. drain_stdin_demo.py Line 45):

config = TemplateMinerConfig()
if os.path.exists(os.path.join("examples","drain3.ini")):
    config.load(os.path.join("examples","drain3.ini"))
else:
    logger.error("Drain3.ini file not found")

template_miner = TemplateMiner(persistence, config)

Fixed Output:

python -m examples.drain_stdin_demo
Starting Drain3 template miner
Checking for saved state
Saved state not found
Drain3 started with 'FILE' persistence, reading from std-in (input 'q' to finish)
IP is 12.12.12.12
Saving state of 1 clusters with 1 messages, 940 bytes, reason: cluster_created (1)
{"change_type": "cluster_created", "cluster_id": 1, "cluster_size": 1, "template_mined": "IP is <IP>", "cluster_count": 1}

I cannot get value vector from match

Hi,

I would like to get the value vector from a log, not only the cluster it matches. How can I get it?

Match returns only a LogCluster.

def match(self, log_message: str) -> LogCluster:

Drain3 does not extract text patterns?

Thanks for putting this together team. I have been trying to use Drain and came across this issue.

'user=mike ip=unknown-ip-addr cmd=Metastore shutdown complete',
'user=mike ip=unknown-ip-addr cmd=Shutting down the object store',
'user=smith ip=unknown-ip-addr cmd=Metastore shutdown complete',
'user=smith ip=unknown-ip-addr cmd=Shutting down the object store',
'user=jackson ip=unknown-ip-addr cmd=Metastore shutdown complete',
'user=jackson ip=unknown-ip-addr cmd=Shutting down the object store',
'user=bob ip=unknown-ip-addr cmd=Metastore shutdown complete',
'user=bob ip=unknown-ip-addr cmd=Shutting down the object store'

So, ideally, the patterns look similar i.e. of the form

user=<*> ip=<*> cmd=<*>

But, the drain algorithm does not pick this up. I have tried with several params of sim_th, depth, and max_children.

Am I missing something? Can someone help?

Monitor the use of frequently used cluster

How can I monitor which clusters are being used frequently? By monitoring I mean saving the cluster ids of the recent clusters indicated by the following 3 change_type values.

change_type - indicates either if a new template was identified, an existing template was changed or a message was added to an existing cluster.

RuntimeError: dictionary keys changed during iteration

Hi,
after upgrading to version 0.9.1 the application fails with the following exception (see below).
In fact, it seems that since Python 3.8, code like the following:

        for key in keys:
            drain.id_to_cluster[int(key)] = drain.id_to_cluster.pop(key)

is no longer valid and leads to the observed exception (when starting the parser for the second time and loading the pickled state)

...
  File "/home/wollny/.cache/pypoetry/virtualenvs/src-k3pL4lLu-py3.8/lib/python3.8/site-packages/drain3/template_miner.py", line 56, in __init__
    self.load_state()
  File "/home/wollny/.cache/pypoetry/virtualenvs/src-k3pL4lLu-py3.8/lib/python3.8/site-packages/drain3/template_miner.py", line 74, in load_state
    for key in keys:
RuntimeError: dictionary keys changed during iteration

Would be nice if you could fix this, as I like your project very much :-)

PS: It would also be nice if you could upgrade the jsonpickle package, as version 1.4.1 has a known security issue, which was fixed in version 1.4.2

nox > safety check --file=requirements.txt --full-report
+==============================================================================+
|                                                                              |
|                               /$$$$$$            /$$                         |
|                              /$$__  $$          | $$                         |
|           /$$$$$$$  /$$$$$$ | $$  \__//$$$$$$  /$$$$$$   /$$   /$$           |
|          /$$_____/ |____  $$| $$$$   /$$__  $$|_  $$_/  | $$  | $$           |
|         |  $$$$$$   /$$$$$$$| $$_/  | $$$$$$$$  | $$    | $$  | $$           |
|          \____  $$ /$$__  $$| $$    | $$_____/  | $$ /$$| $$  | $$           |
|          /$$$$$$$/|  $$$$$$$| $$    |  $$$$$$$  |  $$$$/|  $$$$$$$           |
|         |_______/  \_______/|__/     \_______/   \___/   \____  $$           |
|                                                          /$$  | $$           |
|                                                         |  $$$$$$/           |
|  by pyup.io                                              \______/            |
|                                                                              |
+==============================================================================+
| REPORT                                                                       |
| checked 78 packages, using default DB                                        |
+============================+===========+==========================+==========+
| package                    | installed | affected                 | ID       |
+============================+===========+==========================+==========+
| jsonpickle                 | 1.4.1     | <=1.4.1                  | 39319    |
+==============================================================================+
| Jsonpickle through 1.4.1 allows remote code execution during deserialization |
| of a malicious payload through the decode() function. See CVE-2020-22083.    |
+==============================================================================+

`get_parameters_list` can return incorrect parameters

I'm back again with another inconsistency.
Observe the following example:

>>> from drain3 import *
>>> parser = TemplateMiner()
>>> template = "<hdfs_uri>:<number>+<number>"
>>> content = "hdfs://msra-sa-41:9000/pageinput2.txt:671088640+134217728"
>>> parser.get_parameter_list(template, content)
['hdfs', '//msra-sa-41:9000/pageinput2.txt:671088640', '134217728']

Now of course this arises in a context where I use some custom masking patterns.
The expected parameter-list according to those masking patterns would be:
['hdfs//msra-sa-41:9000/pageinput2.txt', '671088640', '134217728']
but get_parameter_list does not take that into account.

I'll give another more concise example, to demonstrate why this fails:

>>> parser.config.masking_instructions = [masking.MaskingInstruction(r"\d+\.\d+", "float")]
>>> parser.get_parameter_list("<float>.<*>", "0.15.Test")
['0', '15.Test']

Therefore the problem is that the delimiter between these two parameters (.) is also part of the desired first parameter (0.15).
I gave it a thought and I think that implies this problem can only occur with custom masking patterns:
Under normal circumstances Drain would not produce a template where two parameters are separated by a delimiter other than a space. And since a parameter can only be a single token, they do not contain spaces and therefore the problem above does not occur.
(This might be a different story for extra_delimiters, but for the simple examples I can think of there shouldn't be any problems with that either.)


One solution would be to use the masking patterns to extract any parameters first and then apply the regular parameter extraction.
I'm working on a solution using this idea, but it's not ready yet, as it is a bit challenging to preserve the correct order.

Alternatively one could include the masking-pattern in the mask, e.g. <float|\d+\.\d+>.
Then one could use these patterns instead of (.*?):
https://github.com/IBM/Drain3/blob/6fd6117859f45560f0e576ffcbcc63863d65bdde/drain3/template_miner.py#L181
But this would mean that regexes need to be present in log-templates which is obviously less readable.
If the mask_with-attributes were unique across all MaskingInstruction-objects, one could simply use the mask to determine the required pattern, but at the moment users are free to assign multiple MaskingInstructions with the same mask_with-value.

Now since the masking patterns would need to be evaluated twice if you'd want to get the template of a log message and also the corresponding parameters, one could think about (optionally) including the parameters in the return-values of add_log_message(...) and match(...) directly. But that would also require changing multiple methods in drain.py so that would be more cumbersome.

Cache always updates clusters even if not needed anymore

During fast_match, drain always iterates over all possible clusters and updates their access time in the cache. This leads to two problems:

  • The update slows down the performance
  • Even clusters that will never match anymore will never be removed from cache

Expected behavior:

Clusters will only be updated/touched in the cache after they were actually used/chosen. There is already a comment hinting at this in the source code:

Try to retrieve cluster from cache with bypassing eviction algorithm as we are only testing candidates for a match.
https://github.com/IBM/Drain3/blob/15470e391caed9a9ef5038cdd1dbd373bd2386a8/drain3/drain.py#L217

Log Matching on new data

Hi, I've been trying to use drain for preprocessing logs for deeplog model

The matching function in the TemplateMiner class works perfectly for data it has been trained on, but for some reason it always returns None when trying to match unseen log data, even if some of the logs are identical to the ones in the training data.

For example for the following log in training data:-

'WMPLTFMLOG523523\t1676462824978\t2023-02-15 12:07:04.978\t11.16.135.252\t-\t-\t-\t-\tprod\treceiving-api\tunknown\tPROD\t0.0.3118\t4f5f5b1a-205-18654f89e12000\tINFO\tINFO\t-\t-\t-\t-\tapplog.cls=com.expat.move.nim.secure.core.advice.AOPLogger,applog.mthd=beforeControllerMethod,applog.line=34,applog.msg=Entering into [methodName=getHeartBeat] with [requests=[]]\t[]\t[]\t[-]\t[]\t[http-apr-8080-exec-14]\n'

I get the matching cluster_id

but for the exact log in unseen data:-

'WMPLTFMLOG523523\t1681105826004\t2023-04-10 05:50:26.004\t11.16.146.8\t-\t-\t-\t-\tprod\treceiving-api\tunknown\tPROD\t0.0.3393\tffffffffd51aa0d2-192-18769b730d4000\tINFO\tINFO\t-\t-\t-\t-\tapplog.cls=com.expat.move.nim.secure.core.advice.AOPLogger,applog.mthd=beforeControllerMethod,applog.line=34,applog.msg=Entering into [methodName=getHeartBeat] with [requests=[]]\t[]\t[]\t[-]\t[]\t[http-apr-8080-exec-6]\n'

I get a None type return

I passed some extra delimiters as part of the config file when training, could that be causing the issue?

Duplicate clusters after loading saved state

Dear all,

when processing the same batch of logs twice everything works as expected, but when loading the state file again and processing the same batch of logs, some new, duplicate clusters are created.

Here you can see the second run on the same batch of logs after reloading the state file:
[screenshot: 2021-03-02 JupyterLab]

And here are all clusters:
[screenshot: 2021-03-02 JupyterLab]

Shouldn't the behavior after loading the state file and then processing the logs a second time be the same as processing them twice without loading the state file?

Thank you in advance!

About parameter `full_search_strategy` in drain match method

As the comments said
(3) "always" is the slowest. It will select the best match among all known clusters, by always evaluating all clusters with the same token count, and selecting the cluster with perfect all token match and least count of wildcard matches.

I have two clusters as below; one has a wildcard and the other does not:
[screenshot of the two clusters]

And I have a log to be matched
IPPROTO_TCP fd: 100, errno: 100, option: 100, value: 100

After masking it becomes
IPPROTO_TCP fd <NUM> errno <NUM> option <NUM> value <NUM>
Using the "always" strategy, as the comments say, I should get the cluster (id=1) with the fewest wildcards, but instead I get:

ID=2 : size=1 : IPPROTO_TCP fd <NUM> errno <NUM> option <NUM> value <*>

So I read the code in fast_match, and I found that this code segment always returns the cluster with the highest param_count. Is this wrong?

[code screenshot]

Should I modify it like this?

[code screenshot]

TensorFlow variant

Good afternoon!

Our ML team uses drain3 to transform system logs as part of a larger classification pipeline. In this pipeline, we use a pre-trained template miner to transform all of the batched logs being passed into the classifier for training. We are currently investigating how this could be done using tf.data.Dataset.map API to keep the pipeline efficient.

To this end, we were curious if any other drain3 users could benefit from a TensorFlow variant of the TemplateMiner. We have experience with TensorFlow and drain3 and would be willing to begin work on such a project.

Restrictions on matching mode

In the example of the SSH.log, I noticed this clustering result :
<L=5> ID=2 : size=14551 : Invalid user <:*:> from <:IP:>
<L=6> ID=27 : size=30 : Invalid user <:*:> <:*:> from <:IP:>
But I think these two template clusters should belong to one type.
One possible reason for this clustering is that the third word is composed of letters and numbers, which is processed into <*> during masking. As a result, the length of the entire log becomes longer. For example,
Invalid user test9 from 52.80.34.196
Or the string that is actually in the variable position of the log template is made up of two words. For example,
Invalid user boris zhang from 52.80.34.196 (This log is not in the ssh.log file, but this example may appear in other datasets)
All of these log messages should be grouped into a single log template cluster, but they are not.
So I wondered if I should devise a new matching scheme. The starting points of the design:
1. Do not use the log message length as the first tree node.
2. Design the text similarity formula to be calculated according to the text content.

Some questions about drain_bigfile_demo

Hi David,
Thanks for the great work on updating Drain to Python 3.

I have some questions about drain_bigfile_demo:
1- Why use partition by ':' ? => line = line.partition(": ")[2]
Not all log records have ':' and therefore I receive a very large number of blank templates.

2- I'm getting many templates that begin with masking.
Since drain layers are divided into clusters by size and then by words from the beginning,
this causes them to be in the same cluster.
Do you have a suggestion to solve this issue?

3- Writing results to a file: for anomaly detection I'm required to create a new file with records containing the timestamp and the resulting template from drain. How can I write this into a file? (How do I map an old log message record to the resulting template?)

Thanks!

Error when running the example.

Hi,

I'm trying to run the example: drain_bigfile_demo.py, but I'm getting the error:

ImportError: cannot import name 'KeysView' from 'collections' (C:\Miniconda3\...\lib\collections\__init__.py)
The code is failing on this line of code:
from drain3 import TemplateMiner

To provide a bit of context, I installed Drain3 via: pip3 install drain3, my python version is 3.10.0.
Thanks for your help.
Regards.

Issue with match method in Drain class

There seems to be an issue with match method in the Drain class.
When configuring a template miner with a config file with masking (e.g. masking = [{"regex_pattern":"((?<=[^A-Za-z0-9])|^)([\\-\\+]?\\d+)((?=[^A-Za-z0-9])|$)", "mask_with": "NUM"}]) and mining all templates, it becomes impossible to match lines containing tokens that fit the regex patterns defined in the config file.

Example: with the same config file, imagine I get templates
ID=1 : size=100 : RAS node <:NUM:>
ID=2 : size=10 : RAS error <:NUM:>.
Then template_miner.drain.match("RAS node 12334") will return None and not the first template.

Previously trained messages cannot always be matched

Observe the following MWE (using the current version 0.9.7):

>>> from drain3 import *
>>> parser = TemplateMiner()
>>> parser.add_log_message("training4Model start")
{'change_type': 'cluster_created', 'cluster_id': 1, 'cluster_size': 1, 'template_mined': 'training4Model start', 'cluster_count': 1}
>>> parser.add_log_message("loadModel start")
{'change_type': 'cluster_template_changed', 'cluster_id': 1, 'cluster_size': 2, 'template_mined': '<*> start', 'cluster_count': 1}
>>> parser.add_log_message("loadModel stop")
{'change_type': 'cluster_created', 'cluster_id': 2, 'cluster_size': 1, 'template_mined': 'loadModel stop', 'cluster_count': 2}
>>> parser.match("loadModel start") is not None # This message was seen during training, so it should match.
False

Since loadModel start was previously passed to parser.add_log_message, it should be possible to match it later in "inference" mode.
I would expect previously trained messages to always match later on.
As illustrated above, this cannot be guaranteed for arbitrary messages.


I've analyzed what is happening by looking through the source code and came to the following explanation:

  • When parser.match("loadModel start") is called, there are 2 clusters present with the following templates:
    1. <*> start: This is the cluster that includes the first occurrence of loadModel start.
    2. loadModel stop
  • Thus when tree_search is called, loadModel matches the prefix of 2nd template and 1st template is completely disregarded. (line 131)
  • Obviously loadModel start does NOT exactly match loadModel stop and as such no match is found. (line 134)
    https://github.com/IBM/Drain3/blob/6b724e7651bca2ac418dcbad01602a523a70b678/drain3/drain.py#L121-L135

The reason the first template was modified to include a wildcard (which later prevents the log message from matching) is that training4Model includes a number and thus will be replaced by a wildcard if no other prefix matches.
See: https://github.com/IBM/Drain3/blob/6b724e7651bca2ac418dcbad01602a523a70b678/drain3/drain.py#L196-L199
The following example works perfectly fine therefore:

>>> from drain3 import *
>>> parser = TemplateMiner()
>>> parser.add_log_message("trainingModel start")
{'change_type': 'cluster_created', 'cluster_id': 1, 'cluster_size': 1, 'template_mined': 'trainingModel start', 'cluster_count': 1}
>>> parser.add_log_message("loadModel start")
{'change_type': 'cluster_created', 'cluster_id': 2, 'cluster_size': 1, 'template_mined': 'loadModel start', 'cluster_count': 2}
>>> parser.add_log_message("loadModel stop")
{'change_type': 'cluster_template_changed', 'cluster_id': 2, 'cluster_size': 2, 'template_mined': 'loadModel <*>', 'cluster_count': 2}
>>> parser.match("loadModel start") is not None # This message was seen during training, so it should match.
True

I'm not sure how this should be optimally handled.

parallel log ingestions

Dear Drain3 project,

We are very glad to have found drain3. We tested it with a logfile of about 200k lines and it finishes processing in about 10 seconds on my MacBook.

We would like to use drain3 to process logs from logfiles of the same type from hundreds of sources, keeping one tree and state. Could you advise on the best way to parallelize log ingestion?

I looked at the code; it seems the log processing function add_log_message should be run single-threaded.

best regards

Modify templates and reload clusters into TemplateMiner?

Would there be a way for me to save a TemplateMiner object, edit the templates in each cluster externally, and return the clusters to TemplateMiner in another script?

A friend of mine suggested pickling the object. Of course, that works for saving and reloading, but I need to be able to modify the templates manually.

Extra delimiters in config

Hi,

Please let me know what is the use of specifying extra_delimiters = ["_"] in the config.ini file?

Error parsing logs: "ZeroDivisionError: float division by zero"

Hi,
I'm using Drain3 to parse some logs and sometimes I get the following error:

Traceback (most recent call last):
File "C:+\envs+\lib\site-packages\IPython\core\interactiveshell.py", line 3369, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 13, in <cell line: 11>
result = template_miner.add_log_message(msg)
File "C:+\envs+\lib\site-packages\drain3\template_miner.py", line 146, in add_log_message
self.profiler.report(self.config.profiling_report_sec)
File "C:+\envs+\lib\site-packages\drain3\simple_profiler.py", line 112, in report
text = os.linesep.join(lines)
File "C:+\envs+\lib\site-packages\drain3\simple_profiler.py", line 111, in
lines = map(lambda it: it.to_string(enclosing_time_sec, include_batch_rates), sorted_sections)
File "C:+\envs+\lib\site-packages\drain3\simple_profiler.py", line 135, in to_string
samples_per_sec = f"{self.sample_count / self.total_time_sec: 15,.2f}"
ZeroDivisionError: float division by zero

any help would be appreciated.

HDFS/BGL

Hi, thanks for the py3 implementation! I'm wondering if Drain3 supports the popular Syslog datasets, e.g., HDFS or BG/L?

API document for Drain3

Is there any documentation about how to use Drain3 for inference and how to utilize the parsed results?

Is there any API documentation about how to use the functions in Drain3? Thanks.

Currently I am using the following code to parse a log file with a trained miner.

with open(log_file) as f:
    lines = f.readlines()
    for line in lines:
        cluster = template_miner.match(line)
        parms = template_miner.get_parameter_list(cluster.get_template(), line)
        print(cluster.get_template())
        print(parms)

Is it correct?

specify a log file

thanks for all this effort

Sorry about this simple question. I need to point it at a file from my system to be mined. How can I do that?
I also need to know how I can set options like file, sim_th and max_clusters.

thanks a lot

Search very slow when logline has only one token (e.g. from masking)

I have a lot of lines that are masked completely as they contain a lot of rubbish - e.g. resulting in one token (the mask) which then becomes the template.
For some reason search on these lines is extremely slow (given we already have a bigger search tree), while it should actually be super fast as they have only one token.

I cannot give a good example of the log due to confidentiality but perhaps this issue/limitation is generally known already?

Drain3 deprecation warning with pip install command.

When I install drain3 with pip, I get this deprecation warning:

DEPRECATION: drain3 is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at https://github.com/pypa/pip/issues/8559

comparison of type int with type str in function add_seq_to_prefix_tree

Hi,
due to a type mismatch between current_depth and token_count, the boolean flag is_last_token is always false.
It should read int(token_count).

https://github.com/IBM/Drain3/blob/f004cb235f92646b3cfdb4ed6680765e9f944d06/drain3/drain.py#L136

This a test which fails:

    def test_one_token_message(self):
        model = Drain()
        cluster, change_type = model.add_log_message("oneTokenMessage")
        self.assertEqual("cluster_created", change_type, "1st check")
        cluster, change_type = model.add_log_message("oneTokenMessage")
        self.assertEqual("none", change_type, "2nd check")

PS: Thanks for fixing the other two issues so quickly

Only mask_name * is used

Hi,

I tried to run the example file (drain_stdin_demo.py) with the example input log and I see that the mask_name is always * instead of IP, HEX, etc. I'd like to confirm whether any of you see the same issue.

I plan to debug and add a fix if this is really an issue. Otherwise, could you please tell me if I missed anything?

/opt/ndfm/src # python3 log_parser2.py 
Starting Drain3 template miner
Checking for saved state
Restored 4 clusters built from 10 messages
Drain3 started with 'FILE' persistence
Starting training mode. Reading from std-in ('q' to finish)
> connected to 10.0.0.1
{"change_type": "none", "cluster_id": 1, "cluster_size": 5, "template_mined": "connected to <:*:>", "cluster_count": 4}
Parameters: [ExtractedParameter(value='10.0.0.1', mask_name='*')]
> connected to 192.168.0.1
{"change_type": "none", "cluster_id": 1, "cluster_size": 6, "template_mined": "connected to <:*:>", "cluster_count": 4}
Parameters: [ExtractedParameter(value='192.168.0.1', mask_name='*')]
> Hex number 0xDEADBEAF
{"change_type": "none", "cluster_id": 2, "cluster_size": 3, "template_mined": "Hex number <:*:>", "cluster_count": 4}
Parameters: [ExtractedParameter(value='0xDEADBEAF', mask_name='*')]
> user davidoh logged in
{"change_type": "none", "cluster_id": 3, "cluster_size": 4, "template_mined": "user <:*:> logged in", "cluster_count": 4}
Parameters: [ExtractedParameter(value='davidoh', mask_name='*')]
> user eranr logged in
{"change_type": "none", "cluster_id": 3, "cluster_size": 5, "template_mined": "user <:*:> logged in", "cluster_count": 4}
Parameters: [ExtractedParameter(value='eranr', mask_name='*')]

TIA

Match a text to a template

Great tool! Thanks for making it available.

Is it possible to match a text to a "drained" template, i.e, get the cluster id for a specific message?

module-wise config

Dear All,

Thanks for this nice implementation of the algo. I have one question regarding config treatment in the package.

The config object is created at the root of every module separately, which is not a problem when the code is run from the command line or as a once-executed script. However, this creates a nuisance when I use this library in a Jupyter notebook. The objects are created in the cells where I import the drain modules, and the config file is read at that time. Later in the notebook I try to adjust the config values, but they don't get updated, because the module is already loaded. The only way I can use an updated configuration is to reset the kernel and rerun everything, which makes the try-and-research concept of the notebook useless.

I'd like to propose moving the reading of the config file into __init__, so it can at least be updated at object creation time.

Thanks for your help,
Andrey.

Delete cluster from drain dict id_to_cluster | Impact | procedure

I observed that when I create a lot of clusters (10000+), the drain3 kernel consumes more processing time. So the solution I thought of was to manually delete old clusters which are no longer of use.
Assuming I have a list of cluster IDs which I want to remove from the drain3 kernel, what is the safest possible procedure? Please give a detailed explanation (do I need to modify the parse tree, or is deleting from the template_miner.drain.id_to_cluster dict sufficient? If not, what else needs to be done?)
If deleting is not a good idea, then how can I improve the running time?

Windows regular expression

Hello. I am not an expert in regular expressions, so can I get some help from you to preprocess the Windows event log using regular expressions? Thank you very much.

  • For registry 2016-09-29 00:03:19, Info CBS Unloading offline registry hive: {bf1a281b-ad7b-4476-ac95-f47682990ce7}GLOBALROOT/Device/HarddiskVolumeShadowCopy2/Windows/System32/config/SOFTWARE
  • For Windows path 2016-09-28 04:30:30, Info CBS Loaded Servicing Stack v6.1.7601.23505 with Core: C:\Windows\winsxs\amd64_microsoft-windows-servicingstack_31bf3856ad364e35_6.1.7601.23505_none_681aa442f6fed7f0\cbscore.dll

How to purge no-use existing cluster after running for a while

Dear all,
In production, data will drift under some conditions and some old clusters will never occur again. New logs will never fall into those clusters, so the results are not impacted. However, the performance is: as the number of clusters increases, runtime performance degrades because more time is spent comparing against existing clusters.

So, here's the question: is there any interface in the kernel, or any suggestion, to automatically detect unused clusters and purge them?

Thanks to all the developers, drain3 is an amazing kernel.

Saving log template/cluster and ID for each log

Hi!

I am familiar with the old package and am starting to get accustomed to Drain3.

I have a log file example.log and I have used Drain3 to parse each log with:

import json
import logging
import time

from drain3 import TemplateMiner
from drain3.template_miner_config import TemplateMinerConfig

logging.basicConfig(filename="output_example.log", filemode='a', level=logging.DEBUG)
logger = logging.getLogger(__name__)

config = TemplateMinerConfig()
config.load("drain3.ini")
config.profiling_enabled = True
template_miner = TemplateMiner(config=config)

line_count = 0
start_time = time.time()  # used to compute the processing rate below
with open("example.log") as f:
    lines = f.readlines()

batch_size = 10

for line in lines:
    line = line.rstrip()
    line = line.partition(": ")[2]
    result = template_miner.add_log_message(line)
    line_count += 1
    if line_count % batch_size == 0:
        rate = line_count / (time.time() - start_time)  # lines/sec so far
        logger.info(f"Processing line: {line_count}, rate {rate:.1f} lines/sec, "
                    f"{len(template_miner.drain.clusters)} clusters so far.")
        
    if result["change_type"] != "none":
        result_json = json.dumps(result)
        logger.info(f"Input ({line_count}): " + line)
        logger.info("Result: " + result_json)

sorted_clusters = sorted(template_miner.drain.clusters, key=lambda it: it.size, reverse=True)

for cluster in sorted_clusters:
    logger.info(cluster)

I am able to load the sorted clusters/templates by specifying

with open('output_example.log', 'r') as f:
  lines = f.readlines()

But it is a bit tedious to keep track of the different log clusters/templates this way, and I have not found a way to label each original log with its new log cluster/template ID.

Do you have any suggestions on how to do this in a better way? For example, how to save a CSV with the columns "original log row number", "new parsed log", "parsed log ID"?

Thanks in advance for your help!

Annabelle

Is there a way to add a wildcard template?

I've used Drain to create clusters by parsing one log file.
I would like to append to those clusters a 'wildcard' template (something like '<*>') that will prevent Drain from creating new templates, so that Drain classifies unknown templates as '<*>'.

Could you give me a tip how to do that?

Skip Masking/cluster particular tokens

Example:

sentence 1: "review 1 - [INFO] The syntax is right."
sentence 2: "review 2 - [ERROR] The syntax is faulty."


Cluster formed:

"review <:NUM:> - <*> The syntax is <*>"

Expected clusters:

"review <:NUM:> - [INFO] The syntax is <*>",
"review <:NUM:> - [ERROR] The syntax is <*>".

How can I prevent masking/clustering of some particular tokens? In this case, I want to keep INFO/ERROR as they are, in two different clusters (not masked into <*>).

Questions

Hello,

Thanks for this good repo and blog post regarding log parsing.
I would like to use this library for some work and I have some remaining questions before jumping in.
What do you do with the templates that you find? Do you transform them into regexes?
I don't get the usefulness of masking. What is the difference between masking and preprocessing (as described in your blog)?
Do you plan to support other masking formats such as grok?
Could you share some information about the analytics pipeline? Do you foresee an upcoming blog post about that?

Thanks for your time and response.

Kind regards

Matching float numbers

Hi,

I'm trying to match and mask a float number ("+12", "-12", "-3.14", ".314e1", etc.) in a sentence. I've tried several regexes, like the one below.

Although this regex works when I run re.findall("[^a-zA-Z:]([-+]?\d+[\.]?\d*)", 'Hi, -1.25 is a float') in Python,
if I add it to the masking instructions as

{"regex_pattern":"(?<![a-zA-Z:])[-+]?\d*\.?\d+", "mask_with": "FLO"},

the masking doesn't occur, I get a <:*:> in the mined template.

What am I doing wrong?
