rubytogether / kirby
License: MIT License
That's quite a speedup from the original!
Although this is not a complete solution, I was curious how plain old JavaScript would perform. In a basic test it did reasonably well: it processed a one-million-line log file in 1.5 seconds on my Windows machine (Intel Core i7-4770 @ 3.40 GHz, 4 cores, 8 logical processors) using the SpiderMonkey JavaScript engine (https://archive.mozilla.org/pub/firefox/nightly/latest-mozilla-central/).
One can accomplish a lot in a few lines of JavaScript code :-)
C:\spidermonkey>cat sample_1000000.log.txt | js parseJSONFile.js
Processed 1000000 records in 1562ms
//parseJSONFile.js
//Usage (SpiderMonkey shell): cat sample_1000000.log.txt | js parseJSONFile.js
var start = Date.now();
var records = [];
//extract the value following "key/" in the user agent string, e.g. "bundler/1.16.2"
function getVal(key, str) {
  return str.split(key + "/")[1] ? str.split(key + "/")[1].split(" ")[0] : "";
}
var line;
while ((line = readline()) !== null && line !== "") {
  var x = JSON.parse(line);
  var r = x.user_agent.toLowerCase();
  x._user_Agent = {
    "bundler": getVal("bundler", r),
    "rubygems": getVal("rubygems", r),
    "ruby": getVal("ruby", r),
    "platform": r.split("(")[1] ? r.split("(")[1].split(")")[0] : "",
    "command": getVal("command", r),
    "options": getVal("options", r),
    "jruby": getVal("jruby", r),
    "truffleruby": getVal("truffleruby", r),
    "ci": getVal("ci", r),
    "gemstash": getVal("gemstash", r)
  };
  records.push(x);
}
print("Processed " + records.length + " records in " + (Date.now() - start) + "ms");
Hi!
Thanks for the great article. Just wondering how you kept this thing fed with S3 logs?
Thanks!
J
Hi,
I've noticed that you're open to more performance suggestions.
To make this one work, you'll have to change the logger that dumps the log files to S3 so that it compresses them using zstd with a pretrained dictionary.
I can say that it works wonders for repetitive data, which is the case for logs. I went from a ~10 MB zip file (~50 MB uncompressed, with repetitive HTML files) to ~1 MB compressed with a 100 KB dictionary (if you try it, experiment with different dictionary sizes to find the sweet spot), all using the Rust zstd bindings. The sense of wonder never left me.
This may not work as well for your case, but I assume it will: basically, you train a dictionary on one of your 1 GB logs and use it for compression and decompression from then on. Save the dictionary in S3 as well, and associate each archive with its dictionary (maybe by file name: log_dict1_stamp.bin) in case you want to train more dictionaries later.
I can't provide any size/performance comparisons since I don't have access to your data, and these changes would require more work on the infrastructure.
Decompression speed should be well over 1 GB/s on modern hardware, so it may even be worth decompressing as a stream and searching the data over the stream (though you'd have to copy out the matched strings); obviously the tool would change a fair amount with that architecture.
Anyway, please feel free to ignore this if you've thought about it before, or it's obvious, or it's not practical.
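A rough sketch of that workflow with the zstd CLI (the file names here are illustrative; the real setup would use the Rust zstd bindings mentioned above):

```shell
# train a shared dictionary on a batch of existing log files;
# --maxdict sets the dictionary size -- try several values to find the sweet spot
zstd --train logs/*.log.txt -o log_dict1.bin --maxdict=102400

# compress each new log with the dictionary before uploading to S3
zstd -D log_dict1.bin access.log.txt -o access.log.txt.zst

# decompress with the same dictionary when analyzing
zstd -d -D log_dict1.bin access.log.txt.zst -o access.log.out
```

The dictionary must travel with the archives: data compressed with `-D log_dict1.bin` can only be decompressed with that same dictionary.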
Hi, I am building the Dockerfile, and after fixing some of the paths inside it, the build reached step 20, where it says:

---> 861b440a6508
Step 20/34 : RUN cargo build --target $BUILD_TARGET --release --bin kirby-s3
---> Running in 1253d5f404d1
Compiling kirby v0.1.0 (/build)
error[E0432]: unresolved import `kirby::Options`
 --> src/bin/kirby-s3.rs:13:5
  |
13 | use kirby::Options;
  |     ^^^^^^^^^^^^^^ no `Options` in the root
error[E0432]: unresolved import `kirby::stream_stats`
 --> src/bin/kirby-s3.rs:20:5
  |
20 | use kirby::stream_stats;
  |     ^^^^^^^^^^^^^^^^^^^ no `stream_stats` in the root
warning: trait objects without an explicit `dyn` are deprecated
 --> src/bin/kirby-s3.rs:22:53
  |
22 | fn read_object(bucket_name: &str, key: &str) -> Box<BufRead> {
  |                                                     ^^^^^^^ help: use `dyn`: `dyn BufRead`
  |
  = note: `#[warn(bare_trait_objects)]` on by default
error: aborting due to 2 previous errors
For more information about this error, try `rustc --explain E0432`.
error: could not compile `kirby`.
To learn more, run the command again with --verbose.

Can you help with this?
Hi, I would like to test kirby with a different format: a 3 GB text file consisting of JSON objects (one per line).
I understand the impressive benchmark numbers were achieved by streaming the data in, not by reading files from disk (SSD/NVMe)?
If you did achieve them with files read from disk, could you clarify how you got around the bottleneck of read speeds?
Maybe this question doesn't make sense, but I feel the speed of serde would be wasted in our setup because we are dealing with files as input.
Looking at your code, there are a few areas where it could be made even faster.

file.rs is calling BufReader::new without specifying a size, which results in a buffer of just 8 KB. You might see some improvement by increasing this size.

lib.rs is calling lines() on the stream. This parses the data from bytes into a UTF-8 string, which is then handed to serde to deserialize from the UTF-8 string. However, if you use split instead, it will give you raw bytes without parsing, which serde is capable of deserializing directly. So you can cut out the whole UTF-8 parsing step.
Please add an MIT license.
Hi.
Congrats, kirby is really fast. I used it to analyze Open edX JSON logs.
I am curious about the comparison with MongoDB. If kirby's performance is similar to MongoDB's, a generic tool/library to analyze JSON logs could be very useful. See: https://twitter.com/rubenrua/status/1111240276583071744
I am working on a tool to import JSON logs into MongoDB to compare kirby and MongoDB. I will update this issue with my results.
I think you should try changing the Request struct a little to use &str, because it seems you could avoid allocations: the request doesn't outlive the function.