Actson

A reactive (or non-blocking, or asynchronous) JSON parser.

Actson is a low-level JSON parser for reactive applications and non-blocking I/O. It is event-based and can be used in asynchronous code (for example in combination with Tokio).



Teaser Image


Why another JSON parser?

  • Non-blocking. Reactive applications should use non-blocking I/O so that no thread needs to wait indefinitely for a shared resource to become available (see the Reactive Manifesto). Actson supports this pattern.
  • Big Data. Actson can handle arbitrarily large JSON text without having to completely load it into memory. It is very fast and achieves constant parsing throughput (see the Performance section below).
  • Event-based. Actson produces events during parsing and can be used for streaming. For example, if you write an HTTP server, you can receive a file and parse it at the same time.

Actson was primarily developed for the GeoJSON support in GeoRocket, a high-performance reactive data store for geospatial files. For this application, we needed a way to parse very large JSON files with varying contents. The files are received through an HTTP server, parsed into JSON events while they are being read from the socket, and indexed into a database at the same time. The whole process runs asynchronously.

If this use case sounds familiar, then Actson might be a good solution for you. Read more about its performance and how it compares to Serde JSON below.

Usage

Push-based parsing

Push-based parsing is the most flexible way of using Actson. Push new bytes into a PushJsonFeeder and then let the parser consume them until it returns Some(JsonEvent::NeedMoreInput). Repeat this process until you receive None, which means the end of the JSON text has been reached. The parser returns Err if the JSON text is invalid or some other error has occurred.

This approach is very low-level but gives you the freedom to provide new bytes to the parser whenever they are available and to generate JSON events whenever you need them.

use actson::{JsonParser, JsonEvent};
use actson::feeder::{PushJsonFeeder, JsonFeeder};

let json = r#"{"name": "Elvis"}"#.as_bytes();

let feeder = PushJsonFeeder::new();
let mut parser = JsonParser::new(feeder);
let mut i = 0;
while let Some(event) = parser.next_event().unwrap() {
    match event {
        JsonEvent::NeedMoreInput => {
            // feed as many bytes as possible to the parser
            i += parser.feeder.push_bytes(&json[i..]);
            if i == json.len() {
                parser.feeder.done();
            }
        }

        JsonEvent::FieldName => assert!(matches!(parser.current_str(), Ok("name"))),
        JsonEvent::ValueString => assert!(matches!(parser.current_str(), Ok("Elvis"))),

        _ => {} // there are many other event types you may process here
    }
}

Asynchronous parsing with Tokio

Actson can be used with Tokio to parse JSON asynchronously.

The main idea here is to call JsonParser::next_event() in a loop to parse the JSON document and to produce events. Whenever you get JsonEvent::NeedMoreInput, call AsyncBufReaderJsonFeeder::fill_buf() to asynchronously read more bytes from the input and to provide them to the parser.

Note

The tokio feature has to be enabled for this. It is disabled by default.

use tokio::fs::File;
use tokio::io::BufReader;

use actson::{JsonParser, JsonEvent};
use actson::tokio::AsyncBufReaderJsonFeeder;

#[tokio::main]
async fn main() {
    let file = File::open("tests/fixtures/pass1.txt").await.unwrap();
    let reader = BufReader::new(file);

    let feeder = AsyncBufReaderJsonFeeder::new(reader);
    let mut parser = JsonParser::new(feeder);
    while let Some(event) = parser.next_event().unwrap() {
        match event {
            JsonEvent::NeedMoreInput => parser.feeder.fill_buf().await.unwrap(),
            _ => {} // do something useful with the event
        }
    }
}

Parsing from a BufReader

BufReaderJsonFeeder allows you to feed the parser from a std::io::BufReader.

Note

By following this synchronous and blocking approach, you are missing out on Actson's reactive properties. We recommend using Actson together with Tokio instead to parse JSON asynchronously (see above).

use actson::{JsonParser, JsonEvent};
use actson::feeder::BufReaderJsonFeeder;

use std::fs::File;
use std::io::BufReader;

let file = File::open("tests/fixtures/pass1.txt").unwrap();
let reader = BufReader::new(file);

let feeder = BufReaderJsonFeeder::new(reader);
let mut parser = JsonParser::new(feeder);
while let Some(event) = parser.next_event().unwrap() {
    match event {
        JsonEvent::NeedMoreInput => parser.feeder.fill_buf().unwrap(),
        _ => {} // do something useful with the event
    }
}

Parsing a slice of bytes

For convenience, SliceJsonFeeder allows you to feed the parser from a slice of bytes.

use actson::{JsonParser, JsonEvent};
use actson::feeder::SliceJsonFeeder;

let json = r#"{"name": "Elvis"}"#.as_bytes();

let feeder = SliceJsonFeeder::new(json);
let mut parser = JsonParser::new(feeder);
while let Some(event) = parser.next_event().unwrap() {
    match event {
        JsonEvent::FieldName => assert!(matches!(parser.current_str(), Ok("name"))),
        JsonEvent::ValueString => assert!(matches!(parser.current_str(), Ok("Elvis"))),
        _ => {}
    }
}

Parsing into a Serde JSON Value

For testing and compatibility reasons, Actson is able to parse a byte slice into a Serde JSON Value.

Note

You need to enable the serde_json feature for this.

use actson::serde_json::from_slice;

let json = r#"{"name": "Elvis"}"#.as_bytes();
let value = from_slice(json).unwrap();

assert!(value.is_object());
assert_eq!(value["name"], "Elvis");

However, if you find yourself doing this, you probably don't need Actson's reactive features, and your data apparently fits completely into memory. In that case, you're most likely better off using Serde JSON directly (see the comparison below).
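For comparison, parsing the same document directly with Serde JSON takes only two lines (a minimal sketch, assuming the serde_json crate; the whole value is built in memory, which is fine for small inputs):

let value: serde_json::Value = serde_json::from_slice(br#"{"name": "Elvis"}"#).unwrap();
assert_eq!(value["name"], "Elvis");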

Performance

Actson has been optimized to perform best with large files. Its runtime scales linearly with the size of the input JSON text, which means it achieves constant parsing throughput, and its memory consumption stays constant regardless of input size.

The figures below show the parser's throughput and runtime for different GeoJSON input files and in comparison to Serde JSON.

Actson with a BufReader performs best on every file tested (actson-bufreader benchmark). Its throughput stays constant and its runtime only grows linearly with the input size.

The same applies to the other Actson benchmarks using Tokio (actson-tokio and actson-tokio-twotasks). Asynchronous code has a slight overhead, which is mostly compensated for by using two concurrently running Tokio tasks (actson-tokio-twotasks).
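The benchmark's actual implementation is not reproduced here, but a hedged sketch of such a two-task split could look like this: one Tokio task reads chunks from the input while a second task pushes them into a PushJsonFeeder and parses. The channel-based plumbing and the buffer size below are illustrative assumptions, not the benchmark's code.

use actson::feeder::{JsonFeeder, PushJsonFeeder};
use actson::{JsonEvent, JsonParser};
use tokio::io::AsyncReadExt;
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<Vec<u8>>(16);

    // task 1: read chunks from the file and send them over a channel
    let read_task = tokio::spawn(async move {
        let mut file = tokio::fs::File::open("tests/fixtures/pass1.txt").await.unwrap();
        let mut buf = vec![0u8; 8192];
        loop {
            let n = file.read(&mut buf).await.unwrap();
            if n == 0 {
                break; // EOF; dropping `tx` closes the channel
            }
            tx.send(buf[..n].to_vec()).await.unwrap();
        }
    });

    // task 2: receive chunks and push them into the parser's feeder
    let parse_task = tokio::spawn(async move {
        let mut parser = JsonParser::new(PushJsonFeeder::new());
        let mut pending: Vec<u8> = Vec::new();
        while let Some(event) = parser.next_event().unwrap() {
            if let JsonEvent::NeedMoreInput = event {
                if pending.is_empty() {
                    match rx.recv().await {
                        Some(chunk) => pending = chunk,
                        None => {
                            // the reader is done; signal the end of the input
                            parser.feeder.done();
                            continue;
                        }
                    }
                }
                // push as many bytes as the feeder accepts; keep the rest
                let n = parser.feeder.push_bytes(&pending);
                pending.drain(..n);
            }
            // do something useful with the other events
        }
    });

    read_task.await.unwrap();
    parse_task.await.unwrap();
}

While the parser task processes one chunk, the reader task can already fetch the next one, so reading and parsing overlap.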

The serde-value benchmark shows that the parser's throughput collapses as the files become larger. This is because it has to load their entire contents into memory (into a Serde JSON Value). The serde-struct benchmark deserializes the file into a struct that replicates the GeoJSON format. It suffers from the same issue as the serde-value benchmark, namely that the whole file has to be loaded into memory. In this case, however, the impact on throughput is not visible in the figure, since the custom struct is smaller than Serde JSON's Value and the test system had 36 GB of RAM.

The serde-custom-deser benchmark is the only Serde benchmark whose performance is on par with the slowest asynchronous Actson benchmark, actson-tokio (which runs with only one Tokio task). This is because serde-custom-deser uses a custom deserializer, which avoids having to load the whole file into memory (see the example on the Serde website). This very specific implementation only works because the structure of the input files is known and the GeoJSON files used are not deeply nested. The solution is not generalizable.
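For reference, the custom-deserializer technique follows the streaming-array example on the Serde website. The sketch below is a minimal, hedged illustration of that pattern, assuming serde (with the derive feature) and serde_json as dependencies; ForEach and Feature are hypothetical names, not taken from the benchmarks. Each array element is deserialized and handed to a closure individually, so the array itself is never buffered in memory:

use std::fmt;
use std::marker::PhantomData;

use serde::de::{Deserialize, Deserializer, SeqAccess, Visitor};

// visits a JSON array and applies a closure to each element instead of
// collecting the elements into a Vec
struct ForEach<T, F>(F, PhantomData<T>);

impl<'de, T, F> Visitor<'de> for ForEach<T, F>
where
    T: Deserialize<'de>,
    F: FnMut(T),
{
    type Value = ();

    fn expecting(&self, f: &mut fmt::Formatter) -> fmt::Result {
        f.write_str("a JSON array")
    }

    fn visit_seq<A: SeqAccess<'de>>(mut self, mut seq: A) -> Result<(), A::Error> {
        // deserialize and process one element at a time
        while let Some(element) = seq.next_element::<T>()? {
            (self.0)(element);
        }
        Ok(())
    }
}

// hypothetical element type standing in for a GeoJSON feature
#[derive(serde::Deserialize)]
struct Feature {
    id: u64,
}

fn main() -> Result<(), serde_json::Error> {
    let json = br#"[{"id": 1}, {"id": 2}]"#;
    let mut de = serde_json::Deserializer::from_slice(json);
    de.deserialize_seq(ForEach::<Feature, _>(|f| println!("{}", f.id), PhantomData))?;
    Ok(())
}

As the paragraph above notes, this only works when the top-level structure is known in advance; the visitor is hard-wired to a JSON array of Feature-like elements.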

Read more about the individual benchmarks and the test files here.

Throughput (higher is better)

Throughput

Tested on a MacBook Pro 16" 2023 with an M3 Pro Chip and 36 GB of RAM.

Runtime (lower is better)

Runtime

Tested on a MacBook Pro 16" 2023 with an M3 Pro Chip and 36 GB of RAM.

Should I use Actson or Serde JSON?

As can be seen from the benchmarks above, Actson performs best with large files. However, if your JSON input files are small (a few KB or maybe 1 or 2 MB), you should probably stick to Serde JSON, which is a rock-solid, battle-tested parser and which will perform extremely fast in this case.

On the other hand, if you require scalability and your input files can be of arbitrary size, or if you want to parse JSON asynchronously, use Actson.

The aim of this section is not to make one parser appear better than the other. Actson and Serde JSON are two very distinct libraries that each have advantages and disadvantages. The following table may help you decide whether you require Actson or if you should prefer Serde JSON:

| Actson | Serde JSON |
| ------ | ---------- |
| The input files can be of arbitrary size (several GB) | The input files are just a few KB or MB in size |
| The JSON text is streamed, e.g. through a web server | The JSON text is stored on the file system or in memory |
| You want to concurrently read and parse the JSON text | Sequential parsing is sufficient |
| Parsing should not block other tasks in your application (reactive programming) | The JSON text is so small that parsing is quick enough, or your application is not reactive and does not run multiple tasks in parallel |
| You want to process individual JSON events | You prefer convenience and do not care about events |
| The structure of the JSON text can vary or is not known at all | The structure is very well known |
| You don't require deserialization (mapping the JSON text to a struct), or deserialization is impossible due to the varying or unknown structure of the JSON text | You want to and can deserialize the JSON text into a struct |

Compliance

We test Actson thoroughly to make sure it is compliant with RFC 8259, can parse valid JSON documents, and rejects invalid ones.

Besides its own unit tests, Actson passes the tests from JSON_checker.c and all 283 accept and reject tests from the very comprehensive JSON Parsing Test Suite.

Other languages

Besides this Rust implementation, there is also a Java implementation.

Acknowledgments

The event-based parser code and the JSON files used for testing are largely based on the file JSON_checker.c and the JSON test suite from JSON.org, originally released under this license (basically an MIT license).

The directory tests/json_test_suite is a Git submodule pointing to the JSON Parsing Test Suite curated by Nicolas Seriot and released under the MIT license.

License

Actson is released under the MIT license. See the LICENSE file for more information.

actson-rs's Issues

Profile-Guided Optimization (PGO) benchmark results

Hi!

I evaluate Profile-Guided Optimization (PGO) performance improvements for applications and libraries (including other JSON parsers) in different software domains - all current results can be found here. According to the tests, PGO helps to improve performance in various software domains. I decided to perform PGO benchmarks on the actson-rs library too, since some library users may be interested in improving the library's performance. I did some benchmarks, and here are the results.

Test environment

  • Fedora 39
  • Linux kernel 6.7.6
  • AMD Ryzen 9 5900x
  • 48 GiB RAM
  • SSD: Samsung 980 Pro 2 TiB
  • Compiler - Rustc 1.76
  • actson-rs version: main branch on commit 332b8922eb6b3960956902fe80cc031c13f09dd6
  • Turbo Boost disabled to improve consistency across runs

Benchmark

For benchmarking purposes, I use two benchmark suites: the cargo bench suite and the geojson benchmarks.

For the cargo bench set of benchmarks, the PGO training phase is done with cargo pgo bench, and the PGO optimization phase with cargo pgo optimize bench.

For the geojson set of benchmarks, I used the run-all.sh script. For the PGO training phase, I used the same script, but the instrumented binary was built with cargo pgo build, and the number of runs was set to 1, since the training phase does not need more runs in this case.

All PGO-related routines are done with cargo-pgo.

All benchmarks are done on the same machine, with the same hardware/software during runs, with the same background "noise" (as much as I can guarantee, of course).

Results

Firstly, here are the results for cargo bench:

Geojson results:

Raw files (results.json files) are also saved on my machine - I can share them as well if you are interested.

At least in the benchmarks provided by the project, there are measurable improvements in many cases. Also, I got similar results for other JSON benchmarks, which can be found in the awesome-pgo repo.

Maybe mentioning these results somewhere in the README (or other user-facing documentation) would be a good idea. Perhaps you will also be interested in building actson-rs with PGO for your applications - who knows :)

Please do not treat the issue as a bug or something like that - it's just a benchmark report.

Ownership of Feeder in the context of JsonParser

Issue

JsonParser taking a reference to the Feeder makes it difficult to write abstractions that contain both the parser and the feeder, as this would make the containing struct self-referential, which is difficult in Rust.

Proposals

Proposal 1: Own Feeder

JsonParser should take ownership of the Feeder, either exclusively or through shared ownership (Arc).

Proposal 2: Provide feeder at call site of next_event(..)

JsonParser should not own, or have a reference to, a Feeder. Instead, the dependency can be provided when JsonParser::next_event(...) is called. This has the additional benefit of making JsonParser non-generic; only the next_event(..) method needs to be generic over the Feeder.
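A hypothetical sketch of the API shape this proposal describes (the trait and types below are minimal stand-ins, not the actual Actson API):

// stand-in trait and types for illustration only
trait JsonFeeder {
    fn has_input(&self) -> bool;
}

enum JsonEvent {
    NeedMoreInput,
    // ...
}

struct JsonParser {
    // parser state only, no feeder
}

impl JsonParser {
    // only this method is generic over the feeder
    fn next_event<F: JsonFeeder>(&mut self, feeder: &mut F) -> Option<JsonEvent> {
        if !feeder.has_input() {
            return Some(JsonEvent::NeedMoreInput);
        }
        // ... actual parsing would go here ...
        None
    }
}

Since the parser no longer stores the feeder, a wrapper struct can own both values side by side without becoming self-referential.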

Pull Request

Proposal 2 has been implemented and a pull request will be made.

Related issues

BufReaderJsonFeeder and AsyncBufReaderJsonFeeder have a similar issue, in that they contain an exclusive reference to a BufReader instead of taking ownership.

Cargo bench fails to compile

When I run cargo bench in the root, I get the following error:

error[E0432]: unresolved import `actson::serde_json`
 --> benches/bench.rs:6:39
  |
6 | use actson::{feeder::SliceJsonFeeder, serde_json::from_slice, JsonEvent, JsonParser};
  |                                       ^^^^^^^^^^ could not find `serde_json` in `actson`

For more information about this error, try `rustc --explain E0432`.
error: could not compile `actson` (bench "bench") due to 1 previous error
warning: build failed, waiting for other jobs to finish...

actson-rs version: main branch, commit f33a94290b50ec8cacb73fdc1be7684af7b9fe38; Rustc version: 1.76
