rakaly / jomini

Low level, performance oriented parser for save and game files from EU4, CK3, HOI4, Vic3, Imperator, and other PDS titles.

Home Page: https://crates.io/crates/jomini

License: MIT License

Rust 99.89% R 0.11%
parser eu4 ck3 text binary imperator hoi4 paradox clausewitz

jomini's Introduction

Jomini

A low level, performance oriented parser for save and game files from Paradox Development Studio titles (eg: Europa Universalis (EU4), Hearts of Iron (HOI4), Crusader Kings (CK3), Imperator, Stellaris, and Victoria).

For an in-depth look at the Paradox Clausewitz format and the pitfalls that come with trying to support all variations, consult the write-up. In short, it's extremely difficult to write a robust and fast parser that abstracts over the format differences between games as well as differences between game patches. Jomini hits the sweet spot between flexibility and ergonomics.

Jomini is the cornerstone of the online EU4 save file analyzer. This library also powers the Paradox Game Converters and pdxu.

Features

  • ✔ Versatile: Handle both plaintext and binary encoded data
  • ✔ Fast: Parse data at over 1 GB/s
  • ✔ Small: Compile with zero dependencies
  • ✔ Safe: Extensively fuzzed against potential malicious input
  • ✔ Ergonomic: Use serde-like macros to have parsing logic automatically implemented
  • ✔ Embeddable: Cross platform native apps, statically compiled services, or in the browser via Wasm

Quick Start

Below is a demonstration of deserializing plaintext data using serde. Several additional serde-like attributes are used to reconcile the serde data model with the structure of these files.

use jomini::{
    text::{Operator, Property},
    JominiDeserialize,
};

#[derive(JominiDeserialize, PartialEq, Debug)]
pub struct Model {
    human: bool,
    first: Option<u16>,
    third: Property<u16>,
    #[jomini(alias = "forth")]
    fourth: u16,
    #[jomini(alias = "core", duplicated)]
    cores: Vec<String>,
    names: Vec<String>,
    #[jomini(take_last)]
    checksum: String,
}

let data = br#"
    human = yes
    third < 5
    forth = 10
    core = "HAB"
    names = { "Johan" "Frederick" }
    core = FRA
    checksum = "first"
    checksum = "second"
"#;

let expected = Model {
    human: true,
    first: None,
    third: Property::new(Operator::LessThan, 5),
    fourth: 10,
    cores: vec!["HAB".to_string(), "FRA".to_string()],
    names: vec!["Johan".to_string(), "Frederick".to_string()],
    checksum: "second".to_string(),
};

let actual: Model = jomini::text::de::from_windows1252_slice(data)?;
assert_eq!(actual, expected);

Binary Deserialization

Deserializing data encoded in the binary format is done in a similar fashion, but the caller must supply a few extra pieces:

  • How text should be decoded (typically Windows-1252 or UTF-8)
  • How rational (floating point) numbers are decoded
  • How tokens, which are 16 bit integers that uniquely identify strings, are resolved

Implementors be warned, not only does each Paradox game have a different binary format, but the binary format can vary between patches!

Below is an example that defines a sample binary format and uses a hashmap token lookup.

use jomini::{Encoding, JominiDeserialize, Windows1252Encoding, binary::BinaryFlavor};
use std::{borrow::Cow, collections::HashMap};

#[derive(JominiDeserialize, PartialEq, Debug)]
struct MyStruct {
    field1: String,
}

#[derive(Debug, Default)]
pub struct BinaryTestFlavor;

impl jomini::binary::BinaryFlavor for BinaryTestFlavor {
    fn visit_f32(&self, data: [u8; 4]) -> f32 {
        f32::from_le_bytes(data)
    }

    fn visit_f64(&self, data: [u8; 8]) -> f64 {
        f64::from_le_bytes(data)
    }
}

impl Encoding for BinaryTestFlavor {
    fn decode<'a>(&self, data: &'a [u8]) -> Cow<'a, str> {
        Windows1252Encoding::decode(data)
    }
}

let data = [ 0x82, 0x2d, 0x01, 0x00, 0x0f, 0x00, 0x03, 0x00, 0x45, 0x4e, 0x47 ];

let mut map = HashMap::new();
map.insert(0x2d82, "field1");

let actual: MyStruct = BinaryTestFlavor.deserialize_slice(&data[..], &map)?;
assert_eq!(actual, MyStruct { field1: "ENG".to_string() });

When done correctly, one can use the same structure to represent both the plaintext and binary data without any duplication.

One can configure the behavior when a token is unknown (ie: fail immediately or try to continue).
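The sample bytes in the example above can also be read by hand, which may help when debugging a token table. The following is a minimal std-only sketch, not jomini's implementation. It assumes, based only on the sample bytes shown here, that every token is a little-endian 16 bit integer, that 0x0001 stands for `=`, and that 0x000f introduces a string prefixed by a 16 bit length. The helper names `read_u16` and `decode_pair` are hypothetical.

```rust
use std::collections::HashMap;

/// Reads a little-endian u16 at `pos`, if two bytes are available.
fn read_u16(data: &[u8], pos: usize) -> Option<u16> {
    let bytes = data.get(pos..pos + 2)?;
    Some(u16::from_le_bytes([bytes[0], bytes[1]]))
}

/// Hypothetical helper: decodes a single `key = "string"` pair.
/// Token ids 0x0001 (equals) and 0x000f (string) are assumptions
/// inferred from the sample data, not a documented specification.
fn decode_pair(data: &[u8], tokens: &HashMap<u16, &str>) -> Option<(String, String)> {
    // The first token identifies the key and is resolved through the lookup.
    let key_id = read_u16(data, 0)?;
    let key = tokens.get(&key_id)?;
    if read_u16(data, 2)? != 0x0001 {
        return None;
    }
    if read_u16(data, 4)? != 0x000f {
        return None;
    }
    let len = usize::from(read_u16(data, 6)?);
    let text = data.get(8..8 + len)?;
    // ASCII text decodes identically under Windows-1252 and UTF-8,
    // so a lossy UTF-8 conversion is enough for this sketch.
    Some(((*key).to_string(), String::from_utf8_lossy(text).into_owned()))
}
```

With the `data` and `map` from the example, `decode_pair(&data, &map)` yields the pair ("field1", "ENG"): the bytes 0x82 0x2d form the token 0x2d82, which the map resolves to field1.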

Direct identifier deserialization with token attribute

There may be some performance loss during binary deserialization as tokens are resolved to strings via a TokenResolver and then matched against the string representations of a struct's fields.

We can fix this issue by directly encoding the expected token value into the struct:

#[derive(JominiDeserialize, PartialEq, Debug)]
struct MyStruct {
    #[jomini(token = 0x2d82)]
    field1: String,
}

// Empty token to string resolver
let map = HashMap::<u16, String>::new();

let actual: MyStruct = BinaryDeserializer::builder_flavor(BinaryTestFlavor)
    .deserialize_slice(&data[..], &map)?;
assert_eq!(actual, MyStruct { field1: "ENG".to_string() });

A couple of notes:

  • This does not obviate the need for the token to string resolver, as tokens may be used as values.
  • If the token attribute is specified on one field on a struct, it must be specified on all fields of that struct.

Caveats

Before calling any Jomini API, callers are expected to:

  • Determine the correct format (text or binary) ahead of time.
  • Strip off any header that may be present (eg: EU4txt / EU4bin)
  • Provide the token resolver for the binary format
  • Provide the conversion to reconcile how, for example, a date may be encoded as an integer in the binary format, but as a string when in plaintext.
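The first two caveats can be handled with a small amount of caller-side code. Below is a minimal sketch, assuming the file starts with one of the two EU4 headers mentioned above; other games use other headers, and compressed saves are out of scope. The `Format` enum and `split_header` function are hypothetical names, not part of jomini.

```rust
/// Which parser the remaining bytes should be handed to.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Format {
    Text,
    Binary,
}

/// Known headers for this sketch; only the EU4 pair from the text above.
const HEADERS: [(&[u8], Format); 2] = [
    (b"EU4txt", Format::Text),
    (b"EU4bin", Format::Binary),
];

/// Hypothetical helper: detects the format from the header and returns
/// the body with the header stripped, or None if no header matches.
fn split_header(data: &[u8]) -> Option<(Format, &[u8])> {
    for (header, format) in HEADERS.iter() {
        if data.starts_with(header) {
            return Some((*format, &data[header.len()..]));
        }
    }
    None
}
```

The returned body is what would be passed on to the text or binary deserializer, depending on the detected format.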

The Mid-level API

If the automatic deserialization via JominiDeserialize is too high level, there is a mid-level API where one can easily iterate through the parsed document and interrogate fields for their information.

use jomini::TextTape;

let data = b"name=aaa name=bbb core=123 name=ccc name=ddd";
let tape = TextTape::from_slice(data).unwrap();
let reader = tape.windows1252_reader();

for (key, _op, value) in reader.fields() {
    println!("{:?}={:?}", key.read_str(), value.read_str().unwrap());
}

For an even lower level of parsing, see the respective binary and text documentation.

The mid-level API also provides the excellent utility of converting the plaintext Clausewitz format to JSON when the json feature is enabled.

use jomini::TextTape;

let tape = TextTape::from_slice(b"foo=bar")?;
let reader = tape.windows1252_reader();
let actual = reader.json().to_string()?;
assert_eq!(actual, r#"{"foo":"bar"}"#);

Write API

There are two targeted use cases for the write API. One is when a text tape is on hand. This is useful when one needs to reformat a document (note that comments are not preserved):

use jomini::{TextTape, TextWriterBuilder};

let tape = TextTape::from_slice(b"hello   = world")?;
let mut out: Vec<u8> = Vec::new();
let mut writer = TextWriterBuilder::new().from_writer(&mut out);
writer.write_tape(&tape)?;
assert_eq!(&out, b"hello=world");

The writer normalizes any formatting issues. The writer is not able to losslessly write all parsed documents, but these cases are limited to truly esoteric situations that will hopefully be resolved in future releases.

The other use case is geared more towards incremental writing that can be found in melters or those crafting documents by hand. These use cases need to manually drive the writer:

use jomini::TextWriterBuilder;
let mut out: Vec<u8> = Vec::new();
let mut writer = TextWriterBuilder::new().from_writer(&mut out);
writer.write_unquoted(b"hello")?;
writer.write_unquoted(b"world")?;
writer.write_unquoted(b"foo")?;
writer.write_unquoted(b"bar")?;
assert_eq!(&out, b"hello=world\nfoo=bar");
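To see why four consecutive writes produce two key/value lines, it can help to model the writer as a small state machine that alternates between expecting a key and expecting a value. The sketch below is a std-only illustration of that idea under that assumption; `MiniWriter` is a hypothetical stand-in and does not reflect how jomini's writer is implemented (it ignores nesting, quoting, and indentation).

```rust
/// Hypothetical stand-in for an incremental key/value writer.
struct MiniWriter {
    out: Vec<u8>,
    expecting_key: bool,
}

impl MiniWriter {
    fn new() -> Self {
        MiniWriter { out: Vec::new(), expecting_key: true }
    }

    /// Writes a scalar, inserting the separator implied by the current state:
    /// a newline before every key except the first, and `=` before a value.
    fn write_unquoted(&mut self, data: &[u8]) {
        if self.expecting_key {
            if !self.out.is_empty() {
                self.out.push(b'\n');
            }
        } else {
            self.out.push(b'=');
        }
        self.out.extend_from_slice(data);
        self.expecting_key = !self.expecting_key;
    }
}
```

Writing hello, world, foo, and bar in that order through this sketch produces the same bytes as the assertion above.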

Unsupported Syntax

Because Clausewitz is closed source, this library can never guarantee compatibility with it. There is no specification of what valid input looks like, and we only have examples that have been collected in the wild. From what we do know, Clausewitz is recklessly flexible: each game object can potentially define its own unique syntax.

We can only do our best and add support for new syntax as it is encountered.

Benchmarks

Benchmarks are run with the following commands:

cargo clean
cargo bench -- parse
find ./target -wholename "*/new/raw.csv" -print0 | xargs -0 xsv cat rows > assets/jomini-benchmarks.csv

And can be analyzed with the R script found in the assets directory.

Below is a graph generated from benchmarking on an arbitrary computer.

[Figure: benchmark throughput graph, jomini-bench-throughput.png]

jomini's Issues

How to deserialize operators?

e.g. a modifier list:

modifier = {
	factor = 2
	has_policy_flag = economic_stance_market
}
modifier = {
	factor = 0
	num_communications < 2
}

I know TextTape can do this, but I don't know how to mix TextTape and struct together. I tried JominiDeserialize and serde's visitor, but neither Operator nor OperatorValue implements Deserialize.

Expose 64bit binary floating point values

There's a bit of an oddity with the binary tokens. There are two floating point values:

BinaryToken::F32_1(f32)
BinaryToken::F32_2(f32)

What's odd is that the second version, F32_2, actually consumes 8 bytes of data, so a better design for this would be:

BinaryToken::F64(f64)

This would not increase the size of BinaryToken as 8 bytes is less than the 16 bytes to store a slice reference.

All these years, I've assumed that the last 4 bytes are unused, so before making this change I'll need to run some tests to make sure this does not change how values are decoded.
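The size claim can be checked with std::mem::size_of. The sketch below uses two hypothetical, heavily reduced enums (the real BinaryToken has more variants) and assumes a typical 64-bit target, where a slice reference occupies 16 bytes.

```rust
use std::mem::size_of;

/// Reduced model of the current design: a slice variant plus an f32 variant.
#[allow(dead_code)]
enum TokenF32<'a> {
    Scalar(&'a [u8]),
    F32(f32),
}

/// Reduced model of the proposed design: the float variant holds an f64.
#[allow(dead_code)]
enum TokenF64<'a> {
    Scalar(&'a [u8]),
    F64(f64),
}

/// Returns the sizes of both models so they can be compared.
fn token_sizes() -> (usize, usize) {
    (
        size_of::<TokenF32<'static>>(),
        size_of::<TokenF64<'static>>(),
    )
}
```

On a 64-bit target both sizes should come out equal, since the slice variant already dominates the layout. On targets where a slice reference is 8 bytes, the comparison may differ.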

Support "list" text keyword (coat of arms)

Currently recorded as unsupported syntax, the "list" keyword references a previously defined property:

  simple_cross_flag = {
      pattern = list "christian_emblems_list"
      color1 = list "normal_colors"
  }

Where normal_colors is defined elsewhere like:

normal_colors = {
  30 = "red"
  12 = "blue"
  1 = "green"
  14 = "black"
  0 = "purple"
  # ....
}

With CK3 and Imperator both using this syntax for coat of arms, it is likely that this syntax will continue to be seen (if only in coat of arms).

I dislike the keyword approach, as it introduces new syntax unused elsewhere. It may be tricky to ensure that this doesn't regress on save games that happen to have a list value, all the while maintaining performance.

Include common date object

I've copied an implementation of non leap year date objects in three different implementations:

This library should include commonly used abstractions or data structures so that I don't have to try and keep them all in sync.

EDIT: while HOI4 has hours, that can remain in that repo (if it's ever created) as HourlyDate
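As a rough illustration of what such a shared abstraction could look like, here is a minimal std-only sketch of a date in a calendar without leap years, parsed from the Y.M.D text form. The `Date` type and its methods are hypothetical and are not the API that jomini ships.

```rust
/// Hypothetical date type for a calendar where February always has 28 days.
/// Field order makes the derived ordering chronological.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Date {
    year: i16,
    month: u8,
    day: u8,
}

const DAYS_PER_MONTH: [u8; 12] = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31];

impl Date {
    /// Validates the month and day; there is no leap year handling at all.
    fn new(year: i16, month: u8, day: u8) -> Option<Date> {
        let index = usize::from(month).checked_sub(1)?;
        let max_day = *DAYS_PER_MONTH.get(index)?;
        if day >= 1 && day <= max_day {
            Some(Date { year, month, day })
        } else {
            None
        }
    }

    /// Parses the plaintext `Y.M.D` form, rejecting extra components.
    fn parse(s: &str) -> Option<Date> {
        let mut parts = s.split('.');
        let year: i16 = parts.next()?.parse().ok()?;
        let month: u8 = parts.next()?.parse().ok()?;
        let day: u8 = parts.next()?.parse().ok()?;
        if parts.next().is_some() {
            return None;
        }
        Date::new(year, month, day)
    }
}
```

Under this sketch, "1444.11.11" parses successfully while "1444.2.29" is rejected, and a four-part value such as an hourly date is left for a separate type.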

Generically support token header

Found these across several PDS games:

color = hsv { 0.58 1.00 0.72 }
color = rgb { 169 242 27 }
color = hsv360{ 25 75 63 }
position = cylindrical{ 150 3 0 }
color5 = hex { aabbccdd }
mild_winter = LIST { 3700 3701
    # ....
}

A few options:

  1. Support them generically: Create a TextToken::Header(Scalar) followed by a TextToken::Array (though it seems equally plausible that it is followed by an object)
  2. Create a unique TextToken for each case (eg: TextToken::Hsv, TextToken::Rgb, TextToken::Hsv360, etc)
  3. Parse similar objects to the same internal structure (eg: hsv, rgb, hsv360, and hex should all be about the same)

Right now I'm leaning towards option 1, as that will allow this mechanism to cover unforeseen tokens that I'm sure will be introduced with each game. With option 1, we should still keep BinaryToken::Rgb, as that is the only color that appears in the binary format (thus far). The downstream breaking change will come with deserialization: for a client to distinguish values, the deserializer will need to expose the token header somehow. The only thing coming to mind is to emulate deserializing a tuple, with the first element being the header (rgb, hsv, etc) and the second element being the data.
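The tuple idea from option 1 can be sketched outside the parser. The following std-only example splits a headed value into its header and its items, treating every header generically. It is a simplified illustration: `parse_headed` is a hypothetical name, and the sketch assumes a single, non-nested pair of braces with no comments inside.

```rust
/// Hypothetical helper: splits `header { a b c }` into (header, items).
/// Returns None when there is no header or no brace pair.
fn parse_headed(value: &str) -> Option<(&str, Vec<&str>)> {
    let open = value.find('{')?;
    let close = value.rfind('}')?;
    if close < open {
        return None;
    }
    // The header is whatever precedes the brace, with or without a space.
    let header = value[..open].trim();
    if header.is_empty() {
        return None;
    }
    let items = value[open + 1..close].split_whitespace().collect();
    Some((header, items))
}
```

A client could then match on the first tuple element ("rgb", "hsv", "hsv360", and so on) and interpret the items accordingly, without the parser knowing each header in advance.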

Support Parsing Colors in Text Parser

The binary parser already supports parsing RGB color info found in Imperator saves; the text parser should support RGB at the very least (and potentially HSV).

color = hsv { 0.5 0.2 0.8 }
color = rgb { 100 200 150 }

RGB info can be found in Imperator saves.

jomini::Value

It would be useful to have a structure that represents every possible value we can deserialize into. Example:

// here's deserializer into a generic Value
let value: HashMap<String, jomini::Value> = jomini::text::de::from_utf8_slice(r#"
    unquoted = 1234
    quoted = "567"
    operator > 0.5
    color = hsv { 0.3 0.4 0.5 }
    sequence = { 1 2 3 4 }
    map = { a = 1 b = 2 }
"#.as_bytes());

// and we can deserialize it into concrete type later
let unquoted = value.get("unquoted").unwrap();
assert_eq!(f32::deserialize(unquoted.clone()).unwrap(), 1234.0);

This is a similar idea to the existing serde_value::Value, serde_json::Value, serde_yaml::Value and others.

  • it should losslessly keep any value from jomini deserializer (i.e. I can deserialize tape into it)
  • it should implement IntoDeserializer (i.e. I can deserialize it into any concrete value)
  • it should be a static type (owned)
  • it should implement Debug
  • it should give user some API to inspect the value (e.g. is it a token or a sequence)

There's ValueKind in jomini right now, but it's a borrowed type, it doesn't implement Debug, and it's not public API.

There's also generic serde_value::Value, but it is not lossless (loses quotes, operators, headers, etc.). And the code above fails with it in multiple ways (e.g. loses hsv header, loses > operator, and unquoted number is transformed into string and can't be parsed within assert statement later).

Why? I want to solve #138 in a more generic way (rather than adding yet another _internal_jomini_property hack) and make low level syntax more accessible in general. Plus some visual inspection of the file contents via Debug.

How to debug and handle ScalarError?

When I use the hoi4save parser to parse my HOI4 save file, Jomini returns an "AllDigits Error". This error is likely caused by the fact that the value in the "manpower_pool" field can be a string of numbers with dots, which makes it difficult for the parser to recognize it as valid input.

manpower_pool={
	available=95915
	locked=3379.3.22.15
	total=3782.1.21.20
}
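One caller-side way to tolerate values like these is to attempt a numeric parse and fall back to keeping the raw text. The sketch below shows that fallback with std only. It is an assumption about how a caller might model such a field, not a description of how jomini or hoi4save handle it; `Loose` and `parse_loose` are hypothetical names.

```rust
/// Hypothetical value that is either a plain number or opaque text.
#[derive(Debug, PartialEq)]
enum Loose<'a> {
    Number(f64),
    Text(&'a str),
}

/// Tries to read the scalar as a number and keeps the raw text otherwise.
/// Note that Rust's f64 parser also accepts inputs such as "inf" and "nan".
fn parse_loose(raw: &str) -> Loose<'_> {
    match raw.parse::<f64>() {
        Ok(n) => Loose::Number(n),
        Err(_) => Loose::Text(raw),
    }
}
```

With this approach 95915 becomes a number, while 3379.3.22.15 is preserved as text for later inspection instead of aborting the whole parse.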

Option::unwrap() on a None value

I've managed to trigger this unwrap:

let value = self.value.take().unwrap();

thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', C:\Users\user\.cargo\registry\src\index.crates.io-6f17d22bba15001f\jomini-0.24.0\src\text\de.rs:409:43

This is probably my misuse of serde (running next_value() on a map twice), but it should be at least an .expect() with a better error message or an Error::custom().

Code sample:

use serde::Deserialize;

#[derive(Deserialize, Debug)]
struct Color((u8, u8, u8));

#[derive(Deserialize, Debug)]
enum ColorName {
    Red,
    Green,
    Blue,
}

struct Container;

impl<'de> Deserialize<'de> for Container {
    fn deserialize<D: serde::Deserializer<'de>>(deserializer: D) -> Result<Self, D::Error> {
        struct TVisitor;
        impl<'de> serde::de::Visitor<'de> for TVisitor {
            type Value = Container;

            fn expecting(&self, formatter: &mut std::fmt::Formatter) -> std::fmt::Result {
                formatter.write_str("{r g b} or name")
            }

            fn visit_map<A: serde::de::MapAccess<'de>>(self, mut map: A) -> Result<Self::Value, A::Error> {
                while let Some(key) = map.next_key::<String>()? {
                    println!("key: {}", key);
                    if let Ok(color) = map.next_value::<Color>() {
                        println!("color: {:?}", color);
                    } else {
                        let name = map.next_value::<ColorName>()?;
                        println!("name: {:?}", name);
                    }
                }
                Ok(Container)
            }
        }
        deserializer.deserialize_map(TVisitor)
    }
}

fn main() {
    let data = r#"
        color1 = red
        color2 = { 255 0 0 }
    "#;
    let _: Container = jomini::text::de::from_utf8_slice(data.as_bytes()).unwrap();
}
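The suggestion to replace the unwrap with a proper error can be illustrated in isolation. This is a minimal std-only sketch of the pattern, assuming a value slot that may only be consumed once; `Slot` and its String error are hypothetical stand-ins for the deserializer's internals and its error type.

```rust
/// Hypothetical single-use value slot, modeled after a map access
/// that holds the pending value for the current key.
struct Slot<T> {
    value: Option<T>,
}

impl<T> Slot<T> {
    /// Hands out the pending value once; a second call reports an error
    /// instead of panicking on `Option::unwrap()`.
    fn next_value(&mut self) -> Result<T, String> {
        self.value
            .take()
            .ok_or_else(|| String::from("next_value called twice for the same key"))
    }
}
```

In a serde MapAccess implementation, the same shape would typically end in the crate's own error constructor (for example via serde::de::Error::custom) rather than a String.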

Add Heterogenous lists to BinaryTape

CK3 ironman save contains the following:

6f 34 01 00 03 00 0c 00 0a 00 00 00
0c 00 00 00 00 00 01 00 14 00 02 00 00 00 0c 00
01 00 00 00 01 00 14 00 02 00 00 00 04

which translates into

levels={ 10 0=2 1=2 }

While TextTape can force itself through, the BinaryTape fails. So the fix is to interpret it as follows:

levels={ 10 { 0=2 1=2 } }

so it becomes a heterogenous list with an integer first followed by an object.
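The reinterpretation can be sketched on an already tokenized container. The example below is a simplified std-only illustration, not the tape's actual algorithm: it handles one flat container whose first token is the opening brace, and wraps everything from the first key onward in its own object. `Tok`, `wrap_hidden`, and `render` are hypothetical names.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Tok {
    Open,
    Close,
    Equal,
    Scalar(String),
}

/// Wraps the trailing `key=value` run of a flat `{ ... }` in a nested object.
/// Assumes `tokens` starts with Open and ends with Close, with no nesting.
fn wrap_hidden(tokens: &[Tok]) -> Vec<Tok> {
    let first_eq = match tokens.iter().position(|t| *t == Tok::Equal) {
        Some(i) => i,
        None => return tokens.to_vec(), // plain array, nothing to wrap
    };
    // If the key directly follows the opening brace, this is an ordinary object.
    if first_eq <= 2 || tokens.last() != Some(&Tok::Close) {
        return tokens.to_vec();
    }
    let key = first_eq - 1;
    let mut out = Vec::with_capacity(tokens.len() + 2);
    out.extend_from_slice(&tokens[..key]);
    out.push(Tok::Open);
    out.extend_from_slice(&tokens[key..tokens.len() - 1]);
    out.push(Tok::Close); // closes the hidden object
    out.push(Tok::Close); // closes the original container
    out
}

/// Renders tokens separated by single spaces, for inspection.
fn render(tokens: &[Tok]) -> String {
    tokens
        .iter()
        .map(|t| match t {
            Tok::Open => "{".to_string(),
            Tok::Close => "}".to_string(),
            Tok::Equal => "=".to_string(),
            Tok::Scalar(s) => s.clone(),
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```

Applied to the tokens of { 10 0=2 1=2 }, the sketch yields a list with an integer followed by an object, matching the interpretation above.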

Differentiate quoted vs unquoted at high level

This PR added distinction between those in tapes: #55

Can I distinguish between those in custom serde Deserializer? Basically, I'm asking if the following code is possible:

#[derive(Debug, Deserialize)]
struct GameData {
    a: QuotedString,
    b: UnquotedString,
}

let data = r#"
    a = "@test"
    b = @test
"#;

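For reference, the distinction being asked about is easy to express once the raw text of a value is available. The std-only sketch below classifies a raw value by its surrounding quotes. It only illustrates the desired data model; `Text` and `classify` are hypothetical, and escaped quotes are not handled.

```rust
/// Hypothetical scalar that remembers whether it was quoted in the source.
#[derive(Debug, PartialEq)]
enum Text<'a> {
    Quoted(&'a str),
    Unquoted(&'a str),
}

/// Classifies a raw value by checking for a surrounding pair of quotes.
fn classify(raw: &str) -> Text<'_> {
    let raw = raw.trim();
    match raw.strip_prefix('"').and_then(|rest| rest.strip_suffix('"')) {
        Some(inner) => Text::Quoted(inner),
        None => Text::Unquoted(raw),
    }
}
```

With the data above, the value of a would classify as quoted and the value of b as unquoted, even though both carry the same text.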
Add Enhanced Text Parser

The current text and binary parsers are geared towards the performance of parsing large save files. From what I've gathered, save files tend to use a subset of possible tokens. For instance, Stellaris data files use operators other than =, like:

has_level > 2

So it may be beneficial to add another type of parser: an enhanced text parser. It can contain features that are too expensive to have in the base parser:

  • Stream parser
  • Support operators other than equals (and other syntax that is in data files but not save files)
  • Losslessly record tokens position (line and column) -- like an AST

Ref nickbabcock/jomini#9

Investigate array arguments instead of slices to binary float flavor

fn visit_f32(&self, data: [u8; 4]) -> f32;

is much more self-explanatory than

fn visit_f32(&self, data: &[u8]) -> f32;

This will need to be tested to ensure there is not a performance regression (potential matchup: unaligned read vs memcpy + shifts + adds).

The implementation would be similar to arrayref:

        let val = data
            .get(..4)
            .map(|x| {
                let arr: &[u8; 4] = unsafe { &*(x.as_ptr() as *const [u8; 4]) };
                self.flavor.visit_f32(*arr)
            })
            .ok_or_else(Error::eof)?;
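For comparison, the same slice-to-array conversion can be written without unsafe by using the standard library's TryFrom implementation for arrays. The sketch below is std-only and standalone; `read_f32` is a hypothetical helper, and whether this form performs as well as the pointer cast is not measured here.

```rust
use std::convert::TryFrom;

/// Reads a little-endian f32 from the first four bytes, if present.
fn read_f32(data: &[u8]) -> Option<f32> {
    let head = data.get(..4)?;
    // Copies the four bytes into a fixed-size array; fails only on a
    // length mismatch, which `get(..4)` has already ruled out.
    let arr = <[u8; 4]>::try_from(head).ok()?;
    Some(f32::from_le_bytes(arr))
}
```

The array produced here is what a flavor method with the signature fn visit_f32(&self, data: [u8; 4]) -> f32 would receive.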

Introduce API for discovering hidden objects

levels = { 10 0=1 1=2 }

Is encoded in the tapes as

levels = { 10 { 0=1 1=2 } }

I call these hidden objects (though the 10 may be more of an object header; only time will tell which is correct).

Right now it is impossible for the client to know if the object start or end token they are looking at denotes a hidden object.

Each tape should have a corresponding method for determining which tokens delimit a hidden object.

This will help the downstream melters create an equivalent plain text document more easily.

De-serializing enums

I'd like to de-serialize something like this

...
requirements = {
    country = ENG
    prestige = 10
}
...

My struct looks like this

#[derive(Clone, Debug, Deserialize, PartialEq)]
pub struct Event {
    requirements: Vec<Condition>,
}

#[derive(Clone, Debug, Deserialize, PartialEq)]
#[serde(untagged)]
pub enum Condition {
    Country { country: String },
    Prestige { prestige: u32 }
}

But I can't get it to work, with or without named fields in the enums, with or without untagged. I'd like to be able to write something like this

#[derive(Clone, Debug, Deserialize, PartialEq)]
#[serde(untagged)]
pub enum Condition {
    #[serde(rename="country")]
    Country(String),
    #[serde(rename="prestige")]
    Prestige(u32)
}

And have it interpret the 1-field enum alternatives correctly. I wasn't able to use JominiDeserialize with enums either. Is something like this supported?

How to ignore hidden object deserialization by using derived macros?

I got Error(InvalidSyntax { msg: "hidden object must start with a key", offset: 3560 }) while trying to deserialize a Stellaris mod technology definition, file contains

tech_xxx {
weight_modifier {
...
    any_owned_planet = {
					OR = {
						has_building = building_mote_harvesters
						building_mote_harvesting_traps_2 // <-- ERROR HERE
...

But I didn't define weight_modifier field in my technology struct

So my questions are:

  1. why should I deal with a hidden object while I simply want to ignore that field? If so, how can I deal with it?
  2. TextTape could be an alternative but it is too complicated to use, especially when dealing with a multi layer nested object. Any real examples using TextTape to deserialize a game file?
  3. Could the parser continue after a syntax error is found and return None instead of an error? It might not be clear, but it's like HTML parsers vs XHTML parsers.

Support Parsing Escaped Strings

While not possible (afaik) in EU4, Stellaris allows one to embed quotes in names, which require them to be escaped:

name = "Joe \"Captain\" Rogers"

Ref nickbabcock/jomini#11

Here is a failing test case:

    #[test]
    fn test_escaped_quotes() {
        let data = br#"name = "Joe \"Captain\" Rogers""#;

        assert_eq!(
            parse(&data[..]).unwrap().token_tape,
            vec![
                TextToken::Scalar(Scalar::new(b"name")),
                TextToken::Scalar(Scalar::new(br#"Joe "Captain" Rogers"#)),
            ]
        );
    }

Also consider only unescaping the quotes in to_utf8 so that Scalar can still be a pure slice reference instead of something like a Cow.

custom_name="THE !@#$%^&*( '\"LEGION\"')"
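The lazy approach suggested above, where the stored scalar stays a plain slice and unescaping happens only on conversion, can be sketched with std's Cow. This is a minimal illustration, assuming a backslash escapes the single character that follows it; `unescape` is a hypothetical function operating on already decoded text.

```rust
use std::borrow::Cow;

/// Removes backslash escapes, borrowing the input when nothing is escaped.
fn unescape(raw: &str) -> Cow<'_, str> {
    // Fast path: the common case has no escapes and needs no allocation.
    if !raw.contains('\\') {
        return Cow::Borrowed(raw);
    }
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            // Keep the escaped character; a trailing lone backslash is dropped.
            if let Some(next) = chars.next() {
                out.push(next);
            }
        } else {
            out.push(c);
        }
    }
    Cow::Owned(out)
}
```

Only strings that actually contain a backslash pay for an allocation, which keeps the cost off the hot path for typical save file values.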
