
kirby's People

Contributors

byroot, indirect, killercup, rubenrua, tkaitchuck, wezm


kirby's Issues

Impressive

That's quite a speedup from the original!

Although this is not a complete solution, I was curious how plain old JavaScript would perform. I did a basic test and it performed reasonably well: it processed a one-million-line log file in 1.5 seconds on my Windows machine (Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz, 4 cores, 8 logical processors) using the SpiderMonkey JavaScript engine (https://archive.mozilla.org/pub/firefox/nightly/latest-mozilla-central/).

One can accomplish a lot in a few lines of JavaScript code :-)

C:\spidermonkey>cat sample_1000000.log.txt | js parseJSONFile.js
Processed 1000000 records in 1562ms

// parseJSONFile.js — reads one JSON log record per line from stdin (SpiderMonkey
// js shell), extracts user agent fields, and reports how long the run took.
var start = Date.now();
var records = [];
var x;

// Pull the value that follows "key/" in the user agent string, e.g.
// getVal("bundler", "bundler/1.16.2 rubygems/2.7.6") === "1.16.2".
function getVal(key, str) {
  return str.split(key + "/")[1] ? str.split(key + "/")[1].split(" ")[0] : "";
}

// readline() returns null at EOF, which JSON.parse turns into null and ends the loop.
while (x = JSON.parse(readline())) {
  var r = x.user_agent.toLowerCase();
  x._user_Agent = {
    "bundler": getVal("bundler", r),
    "rubygems": getVal("rubygems", r),
    "ruby": getVal("ruby", r),
    "platform": r.split("(")[1] ? r.split("(")[1].split(")")[0] : "",
    "command": getVal("command", r),
    "options": getVal("options", r),
    "jruby": getVal("jruby", r),
    "truffleruby": getVal("truffleruby", r),
    "ci": getVal("ci", r),
    "gemstash": getVal("gemstash", r)
  };
  records.push(x);
}
print("Processed " + records.length + " records in " + (Date.now() - start) + "ms");

Question about s3 logs

Hi!
Thanks for the great article. Just wondering how you kept this thing fed with S3 logs?

Thanks!
J

Performance suggestion: Use zstd compression with a dictionary for the logs

Hi,

I've noticed that you're open to more performance suggestions.
To make this one work you'll have to change the logger that dumps the log files to S3, so that it compresses the logs using zstd and a pretrained dictionary.

I can say that it works wonders for repetitive data, which is the case for logs. I went from a ~10 MB zip file (~50 MB uncompressed, with repetitive HTML files) to ~1 MB compressed with a 100 KB dictionary (if you try this, experiment with different dictionary sizes to find the sweet spot), all using the Rust zstd bindings. The sense of wonder never left me.

Obviously this may not work as well for your case, but I assume it will. Basically you train a dictionary on one of your 1 GB logs and use it for compression/decompression from then on. Save the dictionary in S3 as well, and associate each archive with its dictionary (maybe by file name: log_dict1_stamp.bin) in case you want to train more dictionaries later. A rough sketch of the idea is below.
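
A minimal sketch of the idea using the Rust zstd crate — the sample lines, the compression level, and the 100 KB dictionary size here are illustrative assumptions, not kirby's actual code or data:

// Illustrative only: train a zstd dictionary from sample log lines, then
// compress and decompress a line with it via the `zstd` crate.
use std::io;

fn main() -> io::Result<()> {
    // Stand-in for lines taken from one of the existing large log files.
    let samples: Vec<Vec<u8>> = (0..1_000)
        .map(|i| {
            format!(
                r#"{{"user_agent":"bundler/1.16.{} rubygems/2.7.{} ruby/2.5.1"}}"#,
                i % 7,
                i % 9
            )
            .into_bytes()
        })
        .collect();

    // Train a dictionary capped at ~100 KB; try other sizes to find the sweet spot.
    let dict = zstd::dict::from_samples(&samples, 100 * 1024)?;

    // Compress a new log line with the dictionary...
    let line: &[u8] = br#"{"user_agent":"bundler/1.17.0 rubygems/3.0.0 ruby/2.6.0"}"#;
    let compressed = zstd::bulk::Compressor::with_dictionary(3, &dict)?.compress(line)?;

    // ...and decompress it with the same dictionary (which would also live in S3).
    let restored =
        zstd::bulk::Decompressor::with_dictionary(&dict)?.decompress(&compressed, line.len())?;
    assert_eq!(restored, line);
    Ok(())
}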

Can't provide any size/perf comparisons since I don't have access to your data. Also, this would require more changes to the infrastructure.

Decompression speed should be well over 1 GB/s on modern hardware. Maybe it's even worth investigating decompressing as a stream and searching the data over the stream (though you'd have to copy the found strings); obviously the tool would change a fair amount with this architecture.

Anyway, please feel free to ignore this if you've already thought about it, or it's obvious, or it's not practical.

Kirby imports not resolving

Hi, I am building the Dockerfile, and after fixing some of the paths inside it the build reaches step 20, where it says:
---> 861b440a6508
Step 20/34 : RUN cargo build --target $BUILD_TARGET --release --bin kirby-s3
---> Running in 1253d5f404d1
Compiling kirby v0.1.0 (/build)
error[E0432]: unresolved import kirby::Options
--> src/bin/kirby-s3.rs:13:5
|
13 | use kirby::Options;
| ^^^^^^^^^^^^^^ no Options in the root

error[E0432]: unresolved import kirby::stream_stats
--> src/bin/kirby-s3.rs:20:5
|
20 | use kirby::stream_stats;
| ^^^^^^^^^^^^^^^^^^^ no stream_stats in the root

warning: trait objects without an explicit dyn are deprecated
--> src/bin/kirby-s3.rs:22:53
|
22 | fn read_object(bucket_name: &str, key: &str) -> Box<BufRead> {
| ^^^^^^^ help: use `dyn`: `dyn BufRead`
|
= note: #[warn(bare_trait_objects)] on by default

error: aborting due to 2 previous errors

For more information about this error, try rustc --explain E0432.
error: could not compile kirby.

To learn more, run the command again with --verbose.

Can you help with this?
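
For context, E0432 means the crate root (src/lib.rs) doesn't expose items named Options and stream_stats, which usually points to a mismatch between the checked-out library code and the kirby-s3 binary. A purely hypothetical sketch — not kirby's actual definitions — of the kind of public items src/lib.rs would need for those imports to resolve:

// src/lib.rs (hypothetical shape, for illustration only)
use std::io::BufRead;

// The binary does `use kirby::Options;`, so the crate root must export it.
pub struct Options {
    pub verbose: bool,
}

// The binary does `use kirby::stream_stats;`, so this must also be public here.
pub fn stream_stats(stream: Box<dyn BufRead>, _opts: &Options) -> usize {
    // e.g. count the lines in the stream
    stream.lines().count()
}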

Clarify I/O in benchmark numbers

Hi, I would like to test kirby with a different format: a 3 GB text file consisting of JSON objects (one per line).

I understand the impressive benchmark numbers were achieved by streaming the data in, and not reading files from disk (SSD / NVMe)?

If you did indeed achieve it with files read from disk, could you clarify how you managed to get around the bottleneck of read speeds?

Maybe this question doesn't make sense, but I feel the speed of serde would be wasted in our setup because we are dealing with files as input.

Performance improvements

Looking at your code there are a few areas where it could be made even faster.

  • file.rs is doing BufReader::new without specifying a size. This results in a buffer of just 8 KB. You might see some improvement by increasing this size (see the sketch after this list).
  • In lib.rs it's calling lines() on the stream. This parses the data from bytes into a UTF-8 string, which is then handed to serde to deserialize from that string. If you use split instead, it will hand you raw bytes without that parsing, and serde is capable of deserializing from bytes directly, so you can cut out the whole UTF-8 decoding step.
  • The counters are going into a HashMap, which uses the default DoS-resistant hash function, so it's not the fastest. You can plug in an alternative hasher such as https://crates.io/crates/fnv.
  • In the case of the inner map, even that is not needed, because all the keys are just string constants. It would be both more type-safe and faster to replace the strings with an enum and use an enum-map.
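
A rough sketch of the first three suggestions combined — written against std, serde_json, and fnv as an illustration, not kirby's actual source, and with a made-up Request struct:

// Illustrative only: larger read buffer, byte-wise line splitting,
// serde_json deserializing from byte slices, and an FNV-backed counter map.
use std::fs::File;
use std::io::{self, BufRead, BufReader};

use fnv::FnvHashMap; // https://crates.io/crates/fnv
use serde::Deserialize;

#[derive(Deserialize)]
struct Request {
    user_agent: String,
}

fn count_user_agents(path: &str) -> io::Result<FnvHashMap<String, u64>> {
    // 1 MiB buffer instead of BufReader::new's 8 KB default.
    let reader = BufReader::with_capacity(1024 * 1024, File::open(path)?);
    let mut counts: FnvHashMap<String, u64> = FnvHashMap::default();

    // split() yields raw byte lines, so the UTF-8 validation done by lines()
    // is skipped; serde_json can deserialize directly from the byte slice.
    for line in reader.split(b'\n') {
        let line = line?;
        if let Ok(req) = serde_json::from_slice::<Request>(&line) {
            *counts.entry(req.user_agent).or_insert(0) += 1;
        }
    }
    Ok(counts)
}

For the last point, the inner map's fixed string keys could similarly become an enum used with the enum-map crate, trading hashing for a plain array lookup.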
