stranger6667 / jsonschema-rs

JSON Schema validation library

Home Page: https://docs.rs/jsonschema

License: MIT License

Rust 97.27% Shell 0.12% Python 2.61%
hacktoberfest jsonschema python rust

jsonschema-rs's Introduction

Hi, I am Dmitry 👋

Software engineer with more than 12 years of experience, specializing in Rust and Python with a focus on writing parsers and fuzzing.

  • ๐ŸŒ Based in Prague, Czech Republic ๐Ÿ‡จ๐Ÿ‡ฟ
  • ๐Ÿ’ก Interested in software testing & building reliable systems
  • ๐ŸŽ“ Studied information security
  • ๐Ÿšฒ Love traveling
  • ๐Ÿ‘‹ Reach me on LinkedIn, Twitter or Telegram

jsonschema-rs's People

Contributors

aaron-makowski, alexjg, blacha, dependabot[bot], derridda, djmitche, duckontheweb, ermakov-oleg, gavadinov, jacobmischka, jayvdb, jqnatividad, jrdngr, kgutwin, leoschwarz, macisamuele, matteopolak, orangetux, pdogr, qrayven, rafaelcaricio, samgqroberts, samwilsn, stranger6667, syheliel, tamasfe, thebearingedge, tobz, wolfgangwalther, zhiburt


jsonschema-rs's Issues

Refactor benchmarks

At the moment I see these disadvantages of the current implementation:

  • They test only the performance of is_valid; we should benchmark validate as well
  • Benchmark names are hardcoded and often duplicated; we should autogenerate them so they are not accidentally overwritten during a run
  • There are many duplicated schemas; they could be reorganized with a macro
  • There is a lot of duplicated code in the benches implementation
  • Commented-out code; it would be better to uncomment it and then select benchmarks by name

Generate validators without dispatching

Even though compiling validators gives pretty good results, it is not the fastest way to perform validation in all circumstances. If we know the schema at build time, we can generate code that is more efficient than the current approach.

For example, if we have this schema:

{"type": "string", "maxLength": 5}

then our current approach will basically iterate over a vector of trait objects and call their validate / is_valid methods.

The idea is to generate code like this:

fn is_valid(instance: &Value) -> bool {
    match instance {
        Value::String(value) => value.len() <= 5,
        _ => false
    }
}

https://github.com/horejsek/python-fastjsonschema does this.

Avoid copying to ValidationError

In most cases, we copy data into the ValidationError instance, as in this snippet taken from the implementation of the required keyword:

    fn validate<'a>(&self, _: &'a JSONSchema, instance: &'a Value) -> ErrorIterator<'a> {
        if let Value::Object(item) = instance {
            for property_name in &self.required {
                if !item.contains_key(property_name) {
                    return error(ValidationError::required(instance, property_name.clone()));
                }
            }
        }
        no_error()
    }

instance is later wrapped in Cow::Borrowed but property_name is cloned. Sometimes instance is cloned too via ValidationError::into_owned (e.g. in additional_properties keyword implementation) so it can be used in our error iterator.

I assume that it is possible to avoid cloning, but it will require some lifetime tweaks which I have failed to implement (a couple of times).

Optimize check_time

It might be faster with a single regex rather than with four calls to `parse_from_str`.
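As a rough sketch of a single-pass alternative (avoiding both a regex and `parse_from_str`; the function name and the restriction to the `HH:MM:SS` core are assumptions for illustration, not the library's real API):

```rust
// Simplified single-pass check for the `HH:MM:SS` core of a time string;
// a real implementation would also handle fractional seconds and offsets.
fn is_valid_time_core(s: &str) -> bool {
    let b = s.as_bytes();
    if b.len() != 8 || b[2] != b':' || b[5] != b':' {
        return false;
    }
    // parse the two-digit number starting at byte `i`
    let num = |i: usize| -> Option<u32> {
        let (hi, lo) = (b[i], b[i + 1]);
        if hi.is_ascii_digit() && lo.is_ascii_digit() {
            Some(u32::from(hi - b'0') * 10 + u32::from(lo - b'0'))
        } else {
            None
        }
    };
    matches!(
        (num(0), num(3), num(6)),
        (Some(h), Some(m), Some(sec)) if h < 24 && m < 60 && sec < 61 // 61 allows a leap second
    )
}
```

Whether this actually beats a compiled regex would need to be confirmed by a benchmark.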

Restructure project

  • Rename Validator -> Validate
  • Rename Schema -> Validator
  • Move Scope and related things to a separate module
  • Put types.rs to relevant places
  • Rename validate_sub -> descend
  • Rename validators -> keywords
  • Move compile to the root
  • Move validators/mod.rs to the root

Bug in AdditionalPropertiesFalseValidator

There is no test case for it yet, but instead of:

fn is_valid(&self, _: &JSONSchema, instance: &Value) -> bool {
    if let Value::Object(item) = instance {
        return item.iter().next().is_some();
    }
    true
}

it should be:

fn is_valid(&self, _: &JSONSchema, instance: &Value) -> bool {
    if let Value::Object(item) = instance {
        return item.iter().next().is_none();
    }
    true
}

I.e. the instance is only valid if it is an object without properties.

Restructure project

  • resolver & validators should be at the same level
  • errors should be grouped in the same file
  • move format checkers into a separate file
  • types separately

Setup CI

  • GitHub actions
  • Each commit - cargo fmt & cargo clippy
  • Test build

Update "Performance" section

It would be fairer to have two groups: compiled and non-compiled. Currently, the results from jsonschema_valid and valico are produced with compiled validators. So, basically we need to move the jsonschema (not compiled) column into a new table and compare it with the non-compiled versions of jsonschema_valid and valico.

The Rust compiler version & options will also be useful there. For the benchmarks it is probably better to compile with LTO and RUSTFLAGS="--emit=asm".

Handle errors instead of `unwrap`

In some cases, it might be better to return an error to the client instead; these cases are mostly in the resolver.

But for regexes & URLs that are known to be valid we can use `expect`.
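As a sketch of the direction (the `ResolveError` type and `fetch_document` helper below are hypothetical stand-ins, not the library's real API), errors can be propagated with `?` instead of unwrapped:

```rust
// Hypothetical stand-ins for the resolver's error type and lookup routine.
#[derive(Debug)]
struct ResolveError(String);

fn fetch_document(url: &str) -> Result<String, ResolveError> {
    // stand-in for a real network/filesystem lookup
    if url.starts_with("http") {
        Ok("{}".to_string())
    } else {
        Err(ResolveError(format!("cannot resolve `{}`", url)))
    }
}

fn resolve(url: &str) -> Result<String, ResolveError> {
    // `?` propagates the failure to the caller instead of panicking via `unwrap`
    let document = fetch_document(url)?;
    Ok(document)
}
```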

Improve compilation

Currently, all possible sub-schemas are built. Maybe only subschemas for existing refs should be built instead?

Avoid a mutable context - with one, it is harder to parallelize compilation; simple clones should work.

Macros to return validation error

Instead of

let message = format!("'{}' is too long", item);
return Err(ValidationError::ValidationError(message));

it can be

return validation_error!("`{}` is too long", item)
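A minimal sketch of such a macro, with a simplified stand-in for the real `ValidationError` type (and a hypothetical `check_length` caller for illustration):

```rust
// Simplified stand-in for the library's real error type.
#[derive(Debug)]
pub enum ValidationError {
    ValidationError(String),
}

// The macro expands to the `Err(...)` value, so call sites keep their `return`.
macro_rules! validation_error {
    ($($arg:tt)*) => {
        Err(ValidationError::ValidationError(format!($($arg)*)))
    };
}

fn check_length(item: &str) -> Result<(), ValidationError> {
    if item.len() > 5 {
        return validation_error!("`{}` is too long", item);
    }
    Ok(())
}
```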

Improve validators debug representation

I think it would be better if this representation were closer to the original schema, e.g.:

<unique_items> vs {"uniqueItems": true}

It might be less confusing since it would use the same keywords as the original schema.

Store & use meta-schemas

If we validate the input schemas for conformance to the respective specs, then:

  • We can probably skip a lot of our own checks during the compilation process
  • There will be an understandable error message in case the input schema is not valid

Regarding the implementation details: it can be done via lazy_static! so the meta-schema is not re-compiled. Ideally, I'd like to have it done via code generation (as described in #46).
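For illustration, the lazy compile-once idea can be sketched with std's `OnceLock` (an alternative to `lazy_static!`); the stored "compiled" meta-schema here is just a placeholder string:

```rust
use std::sync::OnceLock;

// Compiled at most once, then reused on every subsequent call.
static META_SCHEMA: OnceLock<String> = OnceLock::new();

fn meta_schema() -> &'static str {
    META_SCHEMA.get_or_init(|| {
        // in the real library this would parse and compile the bundled draft schema
        r#"{"$schema": "http://json-schema.org/draft-07/schema#"}"#.to_string()
    })
}
```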

Do not include empty nodes in the validation tree

When a schema is compiled it is possible to have empty nodes. Example

{
    "items": {"additionalProperties": true}
}

It compiles to items: {} because true is the default value for this keyword, which makes this sub-schema empty.
Such cases should be detected and removed from the tree.

Cache for loaded documents

Once a remote reference is resolved, it makes sense to cache it somewhere. I assume it might be done with RefCell - some kind of LRU cache with a small capacity (usually there are not many remote schemas under the same document).
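A minimal sketch of such a bounded cache behind `RefCell` (names are hypothetical, and eviction here is arbitrary rather than true LRU, to keep the example short):

```rust
use std::cell::RefCell;
use std::collections::HashMap;

// Hypothetical bounded cache for resolved remote documents.
struct DocumentCache {
    capacity: usize,
    docs: RefCell<HashMap<String, String>>,
}

impl DocumentCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, docs: RefCell::new(HashMap::new()) }
    }

    // Return the cached document, or load and cache it on a miss.
    fn get_or_insert_with(&self, url: &str, load: impl FnOnce() -> String) -> String {
        let mut docs = self.docs.borrow_mut();
        if let Some(doc) = docs.get(url) {
            return doc.clone();
        }
        if docs.len() >= self.capacity {
            // evict an arbitrary entry; a real LRU would track access order
            if let Some(key) = docs.keys().next().cloned() {
                docs.remove(&key);
            }
        }
        let doc = load();
        docs.insert(url.to_string(), doc.clone());
        doc
    }
}
```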

Canonicalise schemas during compilation

We can eliminate some inefficient constructs, e.g.:

{"anyOf": [{"type": "string"}, {"type": "number"}]}

can be simplified to:

{"type": ["string", "number"]}

And if both integer and number are present, the list can be reduced to {"type": "number"} since number includes integer.
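The integer/number collapse can be sketched as a small pass over the list of type names (a simplified illustration, not the library's API):

```rust
// If "number" is present, "integer" is redundant, since every integer
// already satisfies "number".
fn canonicalize_types(mut types: Vec<&'static str>) -> Vec<&'static str> {
    if types.contains(&"number") {
        types.retain(|t| *t != "integer");
    }
    types
}
```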

Additionally, empty nodes produced during compilation should be removed from the tree, as described in "Do not include empty nodes in the validation tree".

Possible truncation & panic

e.g. in min_properties:

let limit = limit.as_u64().unwrap() as usize;

If the schema contains a negative or float number for this keyword, this line will panic.

On a 32-bit platform, an integer that exceeds usize will be truncated, which may lead to wrong results during validation.

Affected validators:

  • max_items
  • max_length
  • max_properties
  • min_items
  • min_length
  • min_properties

As a result, we should enable the clippy::cast_possible_truncation lint + add test cases.
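A hedged sketch of the guarded conversion (`limit_to_usize` is a hypothetical helper; `raw` stands in for the number taken from the schema):

```rust
// Reject invalid limits instead of panicking, and fail instead of
// silently truncating the cast on 32-bit targets.
fn limit_to_usize(raw: f64) -> Option<usize> {
    // reject negative and fractional limits instead of panicking in `unwrap`
    if raw < 0.0 || raw.fract() != 0.0 {
        return None;
    }
    // `try_from` fails instead of truncating when u64 doesn't fit in usize
    usize::try_from(raw as u64).ok()
}
```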

Group validators by the input type

The idea is to store validators in groups by the input type, e.g. all validators that can be applied to a number, object, array, string, etc.

What we can get from it

Less pattern matching on the instance type

Consider this schema: {"minimum": 1, "maximum": 10}

Essentially we have 2 validators that together roughly do the following:

if let Value::Number(item) = instance {
    let item = item.as_f64().unwrap();
    if item < self.limit {
        return false;
    }
}
if let Value::Number(item) = instance {
    let item = item.as_f64().unwrap();
    if item > self.limit {
        return false;
    }
}

That is pattern matching twice, and item.as_f64().unwrap() twice. Instead, we can do this in the root validation method (and in nodes where it is appropriate):

... // some common validators for any type here
match instance {
    Value::Number(item) => {
        let item = item.as_f64().unwrap();
        // first validator inlined for illustration
        if item < self.limit {
            return false;
        };
        if item > self.limit {
            return false;
        }
        true
    }
    ...
}

In this arm, we can apply exclusiveMaximum, exclusiveMinimum, minimum, maximum, and multipleOf.

Much simpler validators

Instead of this:

    fn is_valid(&self, _: &JSONSchema, instance: &Value) -> bool {
        if let Value::Number(item) = instance {
            let item = item.as_f64().unwrap();
            if item < self.limit {
                return false;
            }
        }
        true
    }

we can do this:

    fn is_valid(&self, item: f64) -> bool {
        item < self.limit
    }

And there is no need to pass an unused reference to the JSONSchema instance. The same simplification can be applied to the validate method.

Faster execution for not-matching types

Currently, if we pass null to the validators above, we still call both of them in a loop, and they both return true. With this idea, there will be only one pattern match in the root + maybe some small checks, which I'll describe below.

More insights where to apply parallel execution

We know for sure that there is no point in applying parallel execution to numeric validators, since they are fast and there are only five of them. In other words, the surface of possibilities becomes smaller and more visible (only applicable to arrays and objects).

As a downside, there could be some extra logic to iterate over two vectors (common & type-specific validators), which may add overhead for small schemas with a single keyword.

Also, the implementation will require splitting into multiple traits.

But anyway, this option is worth exploring; maybe other optimizations will become visible along the way.

I think this idea can also be applied to the compilation phase.
