compscifutures / delog
JSON/XML logfile deserializer - converts unstructured CSV data to tabular
License: Other
Add an option to receive the output as List<Map<String,String>> in memory rather than writing to disk, so the API can be used directly as a kind of libdelog.jar.
Example use case: pull AWS CloudTrail files from S3 programmatically and process them in memory into a data warehouse.
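A rough sketch of what such an in-memory entry point could look like. The class name `InMemoryDelog` and the method `parseCsv` are hypothetical and not part of delog's current API. The parser here splits on plain commas only and does not handle quoted fields or nested JSON/XML.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory entry point; not delog's real API. */
final class InMemoryDelog {

    /** Parses simple comma-separated text (first line is the header) into rows held in memory. */
    static List<Map<String, String>> parseCsv(String text) {
        List<Map<String, String>> rows = new ArrayList<>();
        String[] lines = text.split("\\R");
        if (lines.length == 0 || lines[0].isEmpty()) {
            return rows; // nothing to parse
        }
        String[] header = lines[0].split(",", -1);
        for (int i = 1; i < lines.length; i++) {
            if (lines[i].isEmpty()) {
                continue; // skip blank lines
            }
            String[] cells = lines[i].split(",", -1);
            Map<String, String> row = new LinkedHashMap<>();
            for (int c = 0; c < header.length; c++) {
                // Missing trailing cells become empty strings.
                row.put(header[c], c < cells.length ? cells[c] : "");
            }
            rows.add(row);
        }
        return rows;
    }
}
```

A caller would then get rows back directly, e.g. `InMemoryDelog.parseCsv(text)`, and load them into the warehouse without touching disk.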
Need to downgrade JDK
Keep Bloom filter state in a separate file so that rows duplicated within the last N rows are not repeated in the output. This allows files to be re-processed on a real-time, batch-by-batch basis.
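A minimal sketch of the idea, assuming a plain `BitSet`-backed Bloom filter whose bits are written to a side file between runs. The class `RowBloomFilter` and its methods are hypothetical. Two caveats: a Bloom filter can report false positives, so a small fraction of genuinely new rows may be dropped, and a plain filter never forgets, so limiting it to the last N rows would need something extra (for example rotating filters) that this sketch does not cover.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.BitSet;

/** Hypothetical row de-duplicator backed by a Bloom filter persisted to a side file. */
final class RowBloomFilter {
    private final int bits;
    private final int hashes;
    private final BitSet set;

    RowBloomFilter(int bits, int hashes) {
        this(bits, hashes, new BitSet(bits));
    }

    private RowBloomFilter(int bits, int hashes, BitSet set) {
        this.bits = bits;
        this.hashes = hashes;
        this.set = set;
    }

    /** Returns true if the row was possibly seen before; otherwise records it and returns false. */
    boolean seenBefore(String row) {
        byte[] data = row.getBytes(StandardCharsets.UTF_8);
        int h1 = Arrays.hashCode(data);
        int h2 = fnv1a(data);
        boolean allSet = true;
        for (int i = 0; i < hashes; i++) {
            // Double hashing: derive the i-th index from two base hashes.
            int idx = Math.floorMod(h1 + i * h2, bits);
            if (!set.get(idx)) {
                allSet = false;
                set.set(idx);
            }
        }
        return allSet;
    }

    private static int fnv1a(byte[] data) {
        int h = 0x811c9dc5;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 16777619;
        }
        return h;
    }

    /** Writes the filter bits to the state file. */
    void save(Path file) {
        try {
            Files.write(file, set.toByteArray());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Restores a filter from the state file; bits and hashes must match the values used when saving. */
    static RowBloomFilter load(Path file, int bits, int hashes) {
        try {
            return new RowBloomFilter(bits, hashes, BitSet.valueOf(Files.readAllBytes(file)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```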
Is there a need to add aggregation? There's the use case of taking high-frequency IoT logs and reducing them in size for long-term storage.
If so, support just sums & histograms for now; accept lambdas later for more complex aggregates.
Also add an option to include a time_t or a TemporalInterval in the output, as a separate field, to act as the aggregation grain key.
? What are the two types of aggregate according to Han & Kamber? I forget.
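As a sketch of the summing case only, assuming rows carry a time_t (epoch seconds) and one numeric value. The class `GrainAggregator` and its method are hypothetical, and the bucket start is used as the grain key.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical aggregation helper; not part of delog. */
final class GrainAggregator {

    /** Sums values into buckets keyed by the start of each interval (epoch seconds). */
    static Map<Long, Double> sumByGrain(long[] epochSeconds, double[] values, long grainSeconds) {
        Map<Long, Double> sums = new TreeMap<>();
        for (int i = 0; i < epochSeconds.length; i++) {
            // floorDiv keeps buckets aligned for negative timestamps too.
            long bucket = Math.floorDiv(epochSeconds[i], grainSeconds) * grainSeconds;
            sums.merge(bucket, values[i], Double::sum);
        }
        return sums;
    }
}
```

With a 60-second grain, readings at 0 s and 30 s land in the same bucket and a reading at 61 s starts the next one. A histogram could follow the same shape, counting rows per bucket instead of summing.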
Need to investigate. The schema is quite dynamic; not sure how to make it work with the more rigid protobuf schemas. Maybe consider Parquet as well.
A lot of SIEM data flows around as syslogs, which means any CSV/TDF structured or unstructured data is prefixed by space-separated columns of syslog metadata. Need to be able to consume this type of data (space-separated + CSV/TDF, or space-separated + unstructured).
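One possible approach, sketched under the assumption that the number of syslog prefix columns is known up front (it varies between syslog formats). The class `SyslogPrefixSplitter` and the `prefixColumns` parameter are hypothetical. The prefix is split on runs of spaces and the remainder is left untouched for the usual CSV/TDF or unstructured handling.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that peels the syslog prefix off a line. */
final class SyslogPrefixSplitter {

    /**
     * Splits off the first prefixColumns space-separated tokens.
     * The rest of the line is returned unmodified as the last element.
     */
    static List<String> split(String line, int prefixColumns) {
        // The limit stops splitting once the prefix columns are consumed,
        // so spaces inside the payload are preserved.
        String[] parts = line.split(" +", prefixColumns + 1);
        return new ArrayList<>(Arrays.asList(parts));
    }
}
```

For a line such as `Jan 12 06:30:00 host1 app: a,b,"c d"` with five prefix columns, the last element is the payload `a,b,"c d"`, ready to be parsed as CSV.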
Delog should eat RSS feeds & Google sitemaps for breakfast. Also check out SDMX for other data.
It looks like over a hundred people have cloned the project in its first 2 weeks, fab.
Do you have TDF/CSV data containing unstructured XML/JSON data that delog didn't handle well? Email me at [email protected] and I'll update delog so it works with it automagically.
Particularly interested in anything that comes from a commonly used API, e.g. AWS CloudTrail & Azure Audit logs. Anyone want to contribute some SIEM input logs?
time_t for point time
time_id to TemporalInterval
Use GMT for the time_id time zone
? do we split weeks?
? what is default grain / aggregation level for temporal intervals? seconds? minutes? hours? Maybe minutes.
Make temporal interval default to whatever we choose.
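A small sketch of mapping a time_t to an interval key in GMT/UTC using `java.time`. The class `TimeGrain`, the method `timeId`, and the ISO-8601 string key are assumptions for illustration; delog's TemporalInterval type is not shown here.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;

/** Hypothetical helper for deriving a time_id from a time_t. */
final class TimeGrain {

    /** Truncates a time_t (epoch seconds) to the start of its interval in UTC, as an ISO-8601 key. */
    static String timeId(long timeT, ChronoUnit grain) {
        return Instant.ofEpochSecond(timeT)
                .atZone(ZoneOffset.UTC)
                .truncatedTo(grain)
                .toInstant()
                .toString();
    }
}
```

With `ChronoUnit.MINUTES`, a time_t of 90 maps to `1970-01-01T00:01:00Z`. Note that `truncatedTo` only accepts units up to `ChronoUnit.DAYS`, so a weekly grain would need separate handling, which bears on the open question about splitting weeks.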
Some fields need to be exploded into multiple rows, for example in AWS CloudTrail logs.
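A sketch of the explode step, assuming the multi-valued field has already been parsed into a list of values. The class `RowExploder` and its signature are hypothetical. Each value yields one output row that repeats the other columns.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that turns one row with a multi-valued field into several rows. */
final class RowExploder {

    /** Returns one copy of the row per value, with the given field set to that value. */
    static List<Map<String, String>> explode(Map<String, String> row, String field, List<String> values) {
        List<Map<String, String>> out = new ArrayList<>();
        if (values.isEmpty()) {
            out.add(new LinkedHashMap<>(row)); // nothing to explode, keep the row as-is
            return out;
        }
        for (String value : values) {
            Map<String, String> copy = new LinkedHashMap<>(row);
            copy.put(field, value);
            out.add(copy);
        }
        return out;
    }
}
```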
Also consider protobuf & ?others.
as a separate column.
Make it compatible with ChatGPT structured prompt YAML specification, so one can specify a prompt for processing individual columns.
Main features are to:
Some CSV files don't have a header (e.g. AWS CloudTrail). Need to be able to consume them in an obvious way that creates predictable column names & optionally generates a header row in the output.
Add --input-headers=no/yes and --output-headers=no/yes command-line options.
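A sketch of how the two options and the generated names might fit together. The `col1`, `col2`, ... naming scheme and the class `HeaderNames` are assumptions; only the flag names come from the note above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for headerless input. */
final class HeaderNames {

    /** Generates predictable column names: col1, col2, ... (naming scheme is an assumption). */
    static List<String> generate(int columnCount) {
        List<String> names = new ArrayList<>();
        for (int i = 1; i <= columnCount; i++) {
            names.add("col" + i);
        }
        return names;
    }

    /** Reads a --name=yes/no style option from the arguments, falling back to defaultValue. */
    static boolean flag(String[] args, String name, boolean defaultValue) {
        String prefix = "--" + name + "=";
        for (String arg : args) {
            if (arg.startsWith(prefix)) {
                return arg.substring(prefix.length()).equalsIgnoreCase("yes");
            }
        }
        return defaultValue;
    }
}
```

When `flag(args, "input-headers", true)` is false, the reader would call `generate` with the column count of the first row instead of consuming that row as a header.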
Attributes in XML elements are not being converted into fields - they are currently being ignored. Need to include them in the deserialisation as part of the xml document traversals.
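A sketch of a traversal that picks up attributes as well as element text, using the JDK's built-in DOM parser. The class `XmlFlattener` and the `path@attribute` key convention are assumptions, not delog's actual field naming. Repeated sibling elements with the same name overwrite each other in this simplified version.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical flattener that keeps XML attributes as fields. */
final class XmlFlattener {

    /** Flattens an XML document into path -> value pairs, including attributes. */
    static Map<String, String> flatten(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            Map<String, String> fields = new LinkedHashMap<>();
            walk(root, root.getTagName(), fields);
            return fields;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    private static void walk(Element el, String path, Map<String, String> fields) {
        // Attributes become fields keyed as path@name.
        NamedNodeMap attrs = el.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node attr = attrs.item(i);
            fields.put(path + "@" + attr.getNodeName(), attr.getNodeValue());
        }
        NodeList children = el.getChildNodes();
        boolean hasChildElement = false;
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Element) {
                hasChildElement = true;
                Element childEl = (Element) child;
                walk(childEl, path + "." + childEl.getTagName(), fields);
            }
        }
        // Leaf elements contribute their text content.
        if (!hasChildElement) {
            String text = el.getTextContent().trim();
            if (!text.isEmpty()) {
                fields.put(path, text);
            }
        }
    }
}
```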