
kirby's People

Contributors

byroot, indirect, killercup, rubenrua, tkaitchuck, wezm


kirby's Issues

Impressive

That's quite a speedup from the original!

Although this is not a complete solution, I was curious how plain old JavaScript would perform. I did a basic test and it performed reasonably well: it processed a one-million-line log file in 1.5 seconds on my Windows machine (Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz, 4 cores, 8 logical processors) using the SpiderMonkey JavaScript engine (https://archive.mozilla.org/pub/firefox/nightly/latest-mozilla-central/).

One can accomplish a lot in a few lines of JavaScript code :-)

C:\spidermonkey>cat sample_1000000.log.txt | js parseJSONFile.js
Processed 1000000 records in 1562ms

// parseJSONFile.js — reads one JSON log record per line from stdin (SpiderMonkey
// js shell), extracts user agent fields, and reports how long the run took.
var start = Date.now();
var records = [];
var x;

// Pull the value that follows "key/" in the user agent string, e.g.
// getVal("bundler", "bundler/1.16.2 rubygems/2.7.6") === "1.16.2".
function getVal(key, str) {
  return str.split(key + "/")[1] ? str.split(key + "/")[1].split(" ")[0] : "";
}

// readline() returns null at EOF, which JSON.parse turns into null and ends the loop.
while (x = JSON.parse(readline())) {
  var r = x.user_agent.toLowerCase();
  x._user_Agent = {
    "bundler": getVal("bundler", r),
    "rubygems": getVal("rubygems", r),
    "ruby": getVal("ruby", r),
    "platform": r.split("(")[1] ? r.split("(")[1].split(")")[0] : "",
    "command": getVal("command", r),
    "options": getVal("options", r),
    "jruby": getVal("jruby", r),
    "truffleruby": getVal("truffleruby", r),
    "ci": getVal("ci", r),
    "gemstash": getVal("gemstash", r)
  };
  records.push(x);
}
print("Processed " + records.length + " records in " + (Date.now() - start) + "ms");

Question about s3 logs

Hi!
Thanks for the great article. Just wondering how you kept this thing fed with S3 logs?

Thanks!
J

Performance suggestion: Use zstd compression with a dictionary for the logs

Hi,

I've noticed that you're open to more performance suggestions.
To make this one work you'll have to change the logger that dumps the log files to S3, so that it compresses the logs using zstd and a pretrained dictionary.

I can say that it works wonders for repetitive data, which is the case for logs. I went from a ~10 MB zip file (~50 MB uncompressed, with repetitive HTML files) to ~1 MB compressed with a 100 KB dictionary (if you try this, experiment with different dictionary sizes to find the sweet spot), all using the Rust zstd bindings. The sense of wonder never left me.

Obviously this may not work as well for your case, but I assume it will. Basically you train a dictionary on one of your 1 GB logs and use it for compression/decompression from then on. Save the dictionary in S3 as well, and associate each archive with its dictionary (maybe by file name: log_dict1_stamp.bin) in case you want to train more dictionaries later. A rough sketch of the idea is below.
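
A minimal sketch of the idea using the Rust zstd crate — the sample lines, the compression level, and the 100 KB dictionary size here are illustrative assumptions, not kirby's actual code or data:

// Illustrative only: train a zstd dictionary from sample log lines, then
// compress and decompress a line with it via the `zstd` crate.
use std::io;

fn main() -> io::Result<()> {
    // Stand-in for lines taken from one of the existing large log files.
    let samples: Vec<Vec<u8>> = (0..1_000)
        .map(|i| {
            format!(
                r#"{{"user_agent":"bundler/1.16.{} rubygems/2.7.{} ruby/2.5.1"}}"#,
                i % 7,
                i % 9
            )
            .into_bytes()
        })
        .collect();

    // Train a dictionary capped at ~100 KB; try other sizes to find the sweet spot.
    let dict = zstd::dict::from_samples(&samples, 100 * 1024)?;

    // Compress a new log line with the dictionary...
    let line: &[u8] = br#"{"user_agent":"bundler/1.17.0 rubygems/3.0.0 ruby/2.6.0"}"#;
    let compressed = zstd::bulk::Compressor::with_dictionary(3, &dict)?.compress(line)?;

    // ...and decompress it with the same dictionary (which would also live in S3).
    let restored =
        zstd::bulk::Decompressor::with_dictionary(&dict)?.decompress(&compressed, line.len())?;
    assert_eq!(restored, line);
    Ok(())
}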

Can't provide any size/perf comparisons since I don't have access to your data. Also, this would require more changes to the infrastructure.

Decompression speed should be well over 1 GB/s on modern hardware. Maybe it's even worth investigating decompressing as a stream and searching the data over the stream (though you'd have to copy the found strings); obviously the tool would change a fair amount with this architecture.

Anyway, please feel free to ignore this if you've already thought about it, or it's obvious, or it's not practical.

Kirby imports not resolving

Hi, I am building the Dockerfile, and after fixing some of the paths inside it the build reaches step 20, where it says:
---> 861b440a6508
Step 20/34 : RUN cargo build --target $BUILD_TARGET --release --bin kirby-s3
---> Running in 1253d5f404d1
Compiling kirby v0.1.0 (/build)
error[E0432]: unresolved import kirby::Options
--> src/bin/kirby-s3.rs:13:5
|
13 | use kirby::Options;
| ^^^^^^^^^^^^^^ no Options in the root

error[E0432]: unresolved import kirby::stream_stats
--> src/bin/kirby-s3.rs:20:5
|
20 | use kirby::stream_stats;
| ^^^^^^^^^^^^^^^^^^^ no stream_stats in the root

warning: trait objects without an explicit dyn are deprecated
--> src/bin/kirby-s3.rs:22:53
|
22 | fn read_object(bucket_name: &str, key: &str) -> Box<BufRead> {
| ^^^^^^^ help: use `dyn`: `dyn BufRead`
|
= note: #[warn(bare_trait_objects)] on by default

error: aborting due to 2 previous errors

For more information about this error, try rustc --explain E0432.
error: could not compile kirby.

To learn more, run the command again with --verbose.

Can you help with this?
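
For context, E0432 means the crate root (src/lib.rs) doesn't expose items named Options and stream_stats, which usually points to a mismatch between the checked-out library code and the kirby-s3 binary. A purely hypothetical sketch — not kirby's actual definitions — of the kind of public items src/lib.rs would need for those imports to resolve:

// src/lib.rs (hypothetical shape, for illustration only)
use std::io::BufRead;

// The binary does `use kirby::Options;`, so the crate root must export it.
pub struct Options {
    pub verbose: bool,
}

// The binary does `use kirby::stream_stats;`, so this must also be public here.
pub fn stream_stats(stream: Box<dyn BufRead>, _opts: &Options) -> usize {
    // e.g. count the lines in the stream
    stream.lines().count()
}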

Clarify I/O in benchmark numbers

Hi, I would like to test kirby with a different format: a 3 GB text file consisting of JSON objects (one per line).

I understand the impressive benchmark numbers were achieved by streaming the data in, and not reading files from disk (SSD / NVMe)?

If you did indeed achieve it with files read from disk, could you clarify how you managed to get around the bottleneck of read speeds?

Maybe this question doesn't make sense, but I feel the speed of serde would be wasted in our setup because we are dealing with files as input.

Performance improvements

Looking at your code there are a few areas where it could be made even faster.

  • file.rs is doing BufReader::new without specifying a size. This results in a buffer of just 8 KB. You might see some improvement by increasing this size (see the sketch after this list).
  • In lib.rs it's calling lines() on the stream. This parses the data from bytes into a UTF-8 string, which is then handed to serde to deserialize from that string. If you use split instead, it will hand you raw bytes without that parsing, and serde is capable of deserializing from bytes directly, so you can cut out the whole UTF-8 decoding step.
  • The counters are going into a HashMap, which uses the default DoS-resistant hash function, so it's not the fastest. You can plug in an alternative hasher such as https://crates.io/crates/fnv.
  • In the case of the inner map, even that is not needed, because all the keys are just string constants. It would be both more type-safe and faster to replace the strings with an enum and use an enum-map.
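
A rough sketch of the first three suggestions combined — written against std, serde_json, and fnv as an illustration, not kirby's actual source, and with a made-up Request struct:

// Illustrative only: larger read buffer, byte-wise line splitting,
// serde_json deserializing from byte slices, and an FNV-backed counter map.
use std::fs::File;
use std::io::{self, BufRead, BufReader};

use fnv::FnvHashMap; // https://crates.io/crates/fnv
use serde::Deserialize;

#[derive(Deserialize)]
struct Request {
    user_agent: String,
}

fn count_user_agents(path: &str) -> io::Result<FnvHashMap<String, u64>> {
    // 1 MiB buffer instead of BufReader::new's 8 KB default.
    let reader = BufReader::with_capacity(1024 * 1024, File::open(path)?);
    let mut counts: FnvHashMap<String, u64> = FnvHashMap::default();

    // split() yields raw byte lines, so the UTF-8 validation done by lines()
    // is skipped; serde_json can deserialize directly from the byte slice.
    for line in reader.split(b'\n') {
        let line = line?;
        if let Ok(req) = serde_json::from_slice::<Request>(&line) {
            *counts.entry(req.user_agent).or_insert(0) += 1;
        }
    }
    Ok(counts)
}

For the last point, the inner map's fixed string keys could similarly become an enum used with the enum-map crate, trading hashing for a plain array lookup.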
