
compscifutures / delog


JSON/XML logfile deserializer - converts unstructured CSV data to tabular

License: Other

Java 100.00%
csv csv-files excel powerbi siem tableau tabular json tabular-data tabular-data-formatter

delog's People

Contributors

compscifutures, madcaffeinebuzz


delog's Issues

Add ability to consume output as grains in memory

Add option to receive the output as List<Map<String,String>> in memory rather than writing to disk so we can use the API directly as a kind of libdelog.jar

Example use case: pull AWS CloudTrail files from S3 programmatically and process them in memory into a data warehouse.
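A minimal sketch of the shape such an in-memory call could take. The class `InMemoryDelogSketch` and its `toRows` method are hypothetical names, not part of delog's existing API, and the parsing is deliberately naive (comma-separated with a header row, no quoting, no JSON/XML flattening):

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: illustrates returning rows in memory instead of writing to disk.
final class InMemoryDelogSketch {
    /** Reads header + data lines and returns one ordered map per data row. */
    static List<Map<String, String>> toRows(Reader input) {
        List<Map<String, String>> rows = new ArrayList<>();
        // BufferedReader.lines() reports I/O failures as UncheckedIOException.
        Iterator<String> lines = new BufferedReader(input).lines().iterator();
        if (!lines.hasNext()) {
            return rows;
        }
        String[] header = lines.next().split(",", -1);
        while (lines.hasNext()) {
            String[] cells = lines.next().split(",", -1);
            Map<String, String> row = new LinkedHashMap<>(); // keeps column order
            for (int i = 0; i < header.length; i++) {
                row.put(header[i], i < cells.length ? cells[i] : "");
            }
            rows.add(row);
        }
        return rows;
    }
}
```

A caller could pass any `Reader`, for example a `StringReader` over content already fetched from S3, and then load the returned rows into a warehouse.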

Keep track of duplicates

Keep Bloom filter state in a separate file so that duplicates within the last N rows are not repeated. This allows files to be re-processed on a real-time, batch-based basis.
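One way this could be sketched, assuming a plain `BitSet`-backed Bloom filter whose bits are saved to and loaded from a side file. The class name, the FNV-1a based double hashing, and the file layout are all assumptions for illustration. A plain Bloom filter never forgets entries, so the "last N rows" window would need something extra (for example rotating filters), and false positives mean a small fraction of genuinely new rows could be treated as duplicates:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.BitSet;

// Hypothetical sketch: a Bloom filter over row text with state persisted to a file.
final class RowBloomFilterSketch {
    private final int sizeInBits;
    private final int hashCount;
    private final BitSet bits;

    RowBloomFilterSketch(int sizeInBits, int hashCount) {
        this(sizeInBits, hashCount, new BitSet(sizeInBits));
    }

    private RowBloomFilterSketch(int sizeInBits, int hashCount, BitSet bits) {
        this.sizeInBits = sizeInBits;
        this.hashCount = hashCount;
        this.bits = bits;
    }

    /** Records the row; returns true if it was possibly seen before, false if definitely new. */
    boolean checkAndAdd(String row) {
        byte[] data = row.getBytes(StandardCharsets.UTF_8);
        int h1 = fnv1a(data, 0x811C9DC5);
        int h2 = fnv1a(data, 0x01000193) | 1; // odd step for double hashing
        boolean seen = true;
        for (int i = 0; i < hashCount; i++) {
            int index = Math.floorMod(h1 + i * h2, sizeInBits);
            if (!bits.get(index)) {
                seen = false;
                bits.set(index);
            }
        }
        return seen;
    }

    private static int fnv1a(byte[] data, int seed) {
        int hash = seed;
        for (byte b : data) {
            hash ^= (b & 0xFF);
            hash *= 0x01000193;
        }
        return hash;
    }

    /** Writes the raw bit array to the state file. */
    void save(Path file) {
        try {
            Files.write(file, bits.toByteArray());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Restores a filter; size and hash count must match the values used when saving. */
    static RowBloomFilterSketch load(Path file, int sizeInBits, int hashCount) {
        try {
            return new RowBloomFilterSketch(sizeInBits, hashCount, BitSet.valueOf(Files.readAllBytes(file)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```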

Logfile aggregation

Is there a need to add aggregation? There's the use case of taking high-frequency IoT logs and reducing them in size for long-term storage.

If so, support just sums & histograms for now, and accept lambdas later for more complex aggregates.

Also add an option to include a time_t or a TemporalInterval in the output, as a separate field, to serve as the aggregation grain key.
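A rough sketch of the two initial aggregates, assuming rows have already been reduced to an epoch-seconds timestamp and a numeric value. `AggregationSketch`, the bucket width, and the fixed-width bins are illustrative assumptions; the grain key here is the bucket start as a time_t:

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: summing and histogramming values, keyed for later output.
final class AggregationSketch {
    /** Sums values per time bucket; the key is the bucket start in epoch seconds. */
    static Map<Long, Double> sumByBucket(long[] epochSeconds, double[] values, long bucketSeconds) {
        Map<Long, Double> sums = new TreeMap<>();
        for (int i = 0; i < epochSeconds.length; i++) {
            long bucketStart = Math.floorDiv(epochSeconds[i], bucketSeconds) * bucketSeconds;
            sums.merge(bucketStart, values[i], Double::sum);
        }
        return sums;
    }

    /** Counts values per fixed-width bin; the key is the bin's lower bound. */
    static Map<Double, Integer> histogram(double[] values, double binWidth) {
        Map<Double, Integer> counts = new TreeMap<>();
        for (double value : values) {
            double binStart = Math.floor(value / binWidth) * binWidth;
            counts.merge(binStart, 1, Integer::sum);
        }
        return counts;
    }
}
```

Both methods use `Map.merge`, so swapping in a caller-supplied lambda later would mostly mean replacing `Double::sum` with a parameter.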

? What are the two types of aggregate according to Han & Kamber? I forget.

Support format string similar to HTTP logfile analysers

A lot of SIEM data flows around as syslog messages, which means any CSV/TDF structure and unstructured data is prefixed by space-separated syslog columns. We need to support consuming this type of data (space-separated prefix + CSV/TDF, or space-separated prefix + unstructured).
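As a starting point before a full format string exists, the prefix could be peeled off by splitting on runs of spaces a fixed number of times. `SyslogPrefixSketch` and the `prefixColumns` parameter are hypothetical; the example line assumes a BSD-style syslog prefix of five space-separated tokens (month, day, time, host, tag):

```java
// Hypothetical sketch: separate N space-delimited prefix columns from the remaining payload.
final class SyslogPrefixSketch {
    /**
     * Returns prefixColumns prefix tokens followed by one final element holding
     * the untouched remainder of the line (the CSV/TDF or unstructured payload).
     */
    static String[] splitPrefix(String line, int prefixColumns) {
        // " +" tolerates padded fields such as a single-digit day ("Oct  1").
        return line.split(" +", prefixColumns + 1);
    }
}
```

For `"Oct 11 22:14:15 myhost app: a,b c"` with `prefixColumns` of 5, the last element is the payload `a,b c`, which can then go through the normal CSV/TDF handling.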

MORE TEST DATASETS WANTED!!

It looks like over a hundred people have cloned the project in its first 2 weeks, fab.

Do you have TDF/CSV data containing unstructured XML/JSON data that delog didn't handle well? Email me at [email protected] and I'll update delog so it works with it automagically.

Particularly interested in anything that comes from a commonly used API, eg, AWS CloudTrail & Azure Audit logs. Anyone want to contribute some SIEM input logs?

Ability to add to output timestamp as time_t or time_id

time_t for point time
time_id to TemporalInterval
Use GMT for time_id time zone

? do we split weeks?

? what is default grain / aggregation level for temporal intervals? seconds? minutes? hours? Maybe minutes.

Make temporal interval default to whatever we choose.
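A small sketch of how the two keys might be derived with `java.time`, assuming minutes as the default grain (still an open question above). `TimeKeySketch` is a hypothetical name, and since `TemporalInterval` is not a JDK type, the interval is represented here only by its UTC start instant:

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Hypothetical sketch: point-in-time and interval keys for an output row.
final class TimeKeySketch {
    /** time_t: seconds since the Unix epoch. */
    static long timeT(Instant timestamp) {
        return timestamp.getEpochSecond();
    }

    /** time_id: ISO-8601 start of the enclosing interval, in UTC, at the given grain. */
    static String timeId(Instant timestamp, ChronoUnit grain) {
        return timestamp.truncatedTo(grain).toString();
    }
}
```

`Instant.truncatedTo` only accepts units up to `ChronoUnit.DAYS`, so week-level intervals (and the question of splitting weeks) would need separate handling.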

Add support for multiple input files

  • Accept multiple files on command line, combine into single output with one header row at top. Preserve order (ie, stable).
  • Update --help
  • update README.md w/example
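The combining step could be sketched as follows, assuming every input file carries the same header row and has already been read into a list of lines. `MultiFileSketch.combine` is a hypothetical helper, not existing delog code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: concatenate files in command-line order with a single header row.
final class MultiFileSketch {
    static List<String> combine(List<List<String>> files) {
        List<String> output = new ArrayList<>();
        for (List<String> lines : files) {
            if (lines.isEmpty()) {
                continue; // skip empty inputs
            }
            if (output.isEmpty()) {
                output.add(lines.get(0)); // header taken from the first non-empty file
            }
            // Data rows keep their original order, so the merge is stable.
            output.addAll(lines.subList(1, lines.size()));
        }
        return output;
    }
}
```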

Generate & ingest YAML transformation specification

Make it compatible with ChatGPT structured prompt YAML specification, so one can specify a prompt for processing individual columns.

The main features are to:

  • specify a non-CSV-separated prefix format string, like in HTTP logfile analysers (but mostly for syslogs). Use both to test.
  • specify include/omit fields,
  • tabulate (table-ize) vs. crosstab unstructured fields (default is to crosstab), and
  • specify a prompt for more advanced processing.

Add support for no header

Some CSV files don't have a header (eg, AWS CloudTrail). Need to be able to consume them in an obvious way that creates predictable column names & optionally generates a header row in the output.

Add --input-headers=no/yes and --output-headers=no/yes command-line options
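For headerless input, predictable names could simply be positional. The `column_N` naming below is an assumption for illustration, as is the `HeaderlessSketch` class; the issue does not fix a naming scheme:

```java
// Hypothetical sketch: generate positional column names when the input has no header row.
final class HeaderlessSketch {
    /** Returns column_1 .. column_N for an input with the given number of columns. */
    static String[] syntheticHeader(int columnCount) {
        String[] names = new String[columnCount];
        for (int i = 0; i < columnCount; i++) {
            names[i] = "column_" + (i + 1);
        }
        return names;
    }
}
```

With `--output-headers=yes`, these names would be written as the first output row; with `no`, they would only be used internally as field keys.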

XML attributes not being processed

Attributes in XML elements are not being converted into fields; they are currently ignored. They need to be included in the deserialisation as part of the XML document traversal.
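A sketch of a traversal that picks up attributes alongside element text, using the JDK's DOM parser. `XmlAttributeSketch` and the `path@attribute` / `parent.child` key naming are assumptions, not delog's actual field-naming scheme, and repeated sibling elements would overwrite each other in this simplified version:

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: flatten an XML fragment into fields, including attributes.
final class XmlAttributeSketch {
    static Map<String, String> flatten(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Log data is untrusted input, so refuse DOCTYPE declarations.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            Map<String, String> fields = new LinkedHashMap<>();
            walk(root, root.getTagName(), fields);
            return fields;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    private static void walk(Element element, String path, Map<String, String> fields) {
        // Attributes become fields keyed as path@name.
        NamedNodeMap attributes = element.getAttributes();
        for (int i = 0; i < attributes.getLength(); i++) {
            Node attribute = attributes.item(i);
            fields.put(path + "@" + attribute.getNodeName(), attribute.getNodeValue());
        }
        boolean hasChildElements = false;
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i) instanceof Element) {
                hasChildElements = true;
                Element child = (Element) children.item(i);
                walk(child, path + "." + child.getTagName(), fields);
            }
        }
        // Leaf elements contribute their text content as a field.
        if (!hasChildElements) {
            String text = element.getTextContent().trim();
            if (!text.isEmpty()) {
                fields.put(path, text);
            }
        }
    }
}
```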
