reubenbond / hagar

Fast, flexible, and version-tolerant serializer for .NET

License: MIT License

PowerShell 0.01% C# 99.64% Batchfile 0.35%
serializer dotnet-core dotnet serialization-library span pipelines

hagar's Introduction

This work has been integrated into dotnet/orleans and this repository is in maintenance mode. See dotnet/orleans#7070


Hagar

There are many existing serialization libraries and formats which are efficient, fast, and support schema evolution, so why create this?

Existing serialization libraries which support version tolerance tend to restrict how data is modelled, usually by providing a very restricted type system which supports few of the features found in common type systems, features such as:

  • Polymorphism
  • Generics (parametric types)
  • References, including cyclic references

Hagar is a new serialization library which does support these features, is fast and compact, supports schema evolution, and requires minimal input from the developer.

Encoding

  • Fields are tagged with compact field ids. Those field ids are provided by developers.
  • Fields are encoded into primitives which fall into 4 categories:
    • Fixed length - most numerics, unless specifically annotated.
    • Variable length - for variable-length integer encoding, useful for length, count, index type properties (relatively small and 0-based in nature).
    • Length-prefixed - strings, arrays of fixed-width primitives.
    • Tag-delimited - objects, collections of non-primitive types.
  • Type information is embedded, but not required for parsing.
    • Separation of wire type & runtime type.
    • A library of application-defined runtime types is available during encoding & decoding. These types can be given short ids to reduce the size of the serialized payload.
    • Types can be parameterized (support for generics).
    • Types which are not specified in the type library can be explicitly named.
      • These named types are runtime-specific (i.e., .NET-specific).
      • Note: may want to restrict this for security reasons.
  • Wire format based around 1-byte tags, which are in one of two forms:
    • [W W W] [S S] [F F F] where:
      • W is a wire type bit.
      • S is a schema type bit.
      • F is a field identifier bit.
    • [1 1 1] [E E] [X X X] where:
      • E is an extended wire type bit.
      • X is reserved for use in the context of the extended wire type.
  • The wire type, schema type, and extended wire type are detailed more below.
  • When a schema type requires extra data, that data is encoded after the initial tag.
  • When a field id cannot be encoded in 3 bits, it is encoded after the schema data.
  • Overall encoding takes the form: Tag Schema FieldId FieldData.
  • Every message can be parsed without prior knowledge of any schema type because all wire types have a fixed, well-known format for determining the length of the encoded data.
  • When serializing and deserializing data, there is no single, predetermined mapping between a .NET type and a wire encoding. For example, ProtoBufs dictates that an int64 is encoded as a Varint and that float32 is encoded as a fixed 32-bit field. Instead, the serializer can determine that a long is encoded as VarInt, Fixed32, or Fixed64 at runtime depending on which takes up the least space.
/// <summary>
/// Represents a 3-bit wire type, shifted into position
/// </summary>
public enum WireType : byte
{
    VarInt = 0b000 << 5, // Followed by a VarInt
    TagDelimited = 0b001 << 5, // Followed by field specifiers, then an Extended tag with EndTagDelimited as the extended wire type.
    LengthPrefixed = 0b010 << 5, // Followed by VarInt length representing the number of bytes which follow.
    Fixed32 = 0b011 << 5, // Followed by 4 bytes
    Fixed64 = 0b100 << 5, // Followed by 8 bytes
    Reference = 0b110 << 5, // Followed by a VarInt reference to a previously defined object. Note that the SchemaType and type specification must still be included.
    Extended = 0b111 << 5, // This is a control tag. The schema type and embedded field id are invalid. The remaining 5 bits are used for control information.
}

public enum SchemaType : byte
{
    Expected = 0b00 << 3, // This value has the type expected by the current schema.
    WellKnown = 0b01 << 3, // This value is an instance of a well-known type. Followed by a VarInt type id.
    Encoded = 0b10 << 3, // This value is of a named type. Followed by an encoded type name.
    Referenced = 0b11 << 3, // This value is of a type which was previously specified. Followed by a VarInt indicating which previous type is being reused.
}

public enum ExtendedWireType : byte
{
    EndTagDelimited = 0b00 << 3, // This tag marks the end of a tag-delimited object. Field id is invalid.
    EndBaseFields = 0b01 << 3, // This tag marks the end of a base object in a tag-delimited object.
}
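
The bit layout of a regular tag, [W W W] [S S] [F F F], can be illustrated with a few masks and shifts. This is only a sketch: TagSketch and its members are hypothetical names, not part of Hagar's API, and the way field ids larger than 3 bits are signalled is deliberately left out because the text above does not spell it out.

```csharp
using System;

// Hypothetical helper (not Hagar's API): splits a 1-byte tag into its bit groups.
public static class TagSketch
{
    private const byte WireTypeMask = 0b1110_0000;   // [W W W]
    private const byte SchemaTypeMask = 0b0001_1000; // [S S], or [E E] for extended tags
    private const byte FieldIdMask = 0b0000_0111;    // [F F F], or [X X X] for extended tags

    // Combines already-shifted wire type and schema type bits with a 3-bit field id.
    public static byte Compose(byte wireTypeBits, byte schemaTypeBits, int embeddedFieldId)
    {
        if (embeddedFieldId < 0 || embeddedFieldId > 7)
            throw new ArgumentOutOfRangeException(nameof(embeddedFieldId));
        return (byte)(wireTypeBits | schemaTypeBits | embeddedFieldId);
    }

    // Returns the three groups as small integers (shifted back down).
    public static (int WireType, int SchemaType, int FieldId) Split(byte tag) =>
        ((tag & WireTypeMask) >> 5, (tag & SchemaTypeMask) >> 3, tag & FieldIdMask);

    // A tag whose wire type bits are all set is a control (extended) tag.
    public static bool IsExtended(byte tag) => (tag & WireTypeMask) == WireTypeMask;
}
```

For example, composing LengthPrefixed (0b010 << 5) with WellKnown (0b01 << 3) and field id 5 gives the byte 0x4D, and Split(0x4D) returns (2, 1, 5).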
  • If a type has base types, the fields of the base types are serialized before the subtype fields. An EndBaseFields tag separates the base type fields from the subtype fields, which allows base types and subtypes to have overlapping field ids without ambiguity. Object encoding therefore follows this pattern: [StartTagDelimited] [Base Fields]* [EndBaseFields] [Sub Type fields]* [EndTagDelimited].

  • Third-party serializers such as ProtoBuf, Bond, .NET's BinaryFormatter, JSON.NET, etc, are supported by serializing using a serializer-specific type id and including the payload via the length-prefixed wire type. This has the advantage of supporting any number of well-known serializers and does not require double-encoding the concrete type, since the external serializer is responsible for that.
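
The runtime choice between VarInt, Fixed32, and Fixed64 for a long, mentioned earlier in this list, could look roughly like the following. This is a sketch under assumptions that the text does not confirm: it assumes a 7-bits-per-byte varint and ZigZag mapping for signed values (as in Protocol Buffers), and EncodingSketch and EncodingChoice are hypothetical names rather than Hagar types.

```csharp
using System;

public enum EncodingChoice { VarInt, Fixed32, Fixed64 }

// Hypothetical helper: picks the encoding with the smallest payload for a long.
public static class EncodingSketch
{
    // Bytes needed for an unsigned varint carrying 7 payload bits per byte (assumed format).
    public static int VarIntSize(ulong value)
    {
        int size = 1;
        while (value >= 0x80) { value >>= 7; size++; }
        return size;
    }

    // ZigZag maps small negative numbers to small unsigned numbers (assumed, not confirmed).
    public static ulong ZigZag(long value) => unchecked((ulong)((value << 1) ^ (value >> 63)));

    public static EncodingChoice Choose(long value)
    {
        int varIntSize = VarIntSize(ZigZag(value));
        if (varIntSize <= 4) return EncodingChoice.VarInt;   // at most as large as Fixed32
        if (value >= int.MinValue && value <= int.MaxValue)
            return EncodingChoice.Fixed32;                   // 4 bytes beats a 5-byte varint
        return varIntSize < 8 ? EncodingChoice.VarInt : EncodingChoice.Fixed64;
    }
}
```

Under these assumptions, Choose(300) yields VarInt, Choose(int.MaxValue) yields Fixed32, and Choose(long.MaxValue) yields Fixed64. The decoder does not need to know which choice was made, because the wire type in the tag says how many bytes follow.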

Security

Allowing arbitrary types to be specified in a serialized payload is a vector for security vulnerabilities. Because of this, all types should be checked against a whitelist.
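
A whitelist check of this kind can be sketched as follows. TypeAllowList is a hypothetical class, not Hagar's API; the idea is that a named type taken from a payload is resolved and then checked, including its generic arguments and array element type, before any instance is created.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical allow list: a type is accepted only if it and all of its parts are listed.
public sealed class TypeAllowList
{
    private readonly HashSet<Type> _allowed;

    public TypeAllowList(IEnumerable<Type> allowed) => _allowed = new HashSet<Type>(allowed);

    public bool IsAllowed(Type type)
    {
        // Arrays are allowed when their element type is allowed.
        if (type.IsArray) return IsAllowed(type.GetElementType());

        // Constructed generics need both the open definition and every argument to be allowed.
        if (type.IsConstructedGenericType)
        {
            if (!_allowed.Contains(type.GetGenericTypeDefinition())) return false;
            foreach (var argument in type.GetGenericArguments())
                if (!IsAllowed(argument)) return false;
            return true;
        }

        return _allowed.Contains(type);
    }

    public void ThrowIfNotAllowed(Type type)
    {
        if (!IsAllowed(type))
            throw new InvalidOperationException($"Type '{type}' is not on the allow list.");
    }
}
```

With an allow list containing typeof(int) and typeof(List<>), List<int> and int[] are accepted while List<string> is rejected.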

Rules

Version Tolerance is supported provided the developer follows a set of rules when modifying types. If the developer is familiar with systems such as ProtoBuf and Bond, then these rules will come as no surprise.

Composites (class & struct)

  • Inheritance is supported, but modifying the inheritance hierarchy of an object is not supported. The base class of a class cannot be added, changed to another class, or removed.
  • With the exception of some numeric types, described in the Numerics section, field types cannot be changed.
  • Fields can be added or removed at any point in an inheritance hierarchy.
  • Field ids cannot be changed.
  • Field ids must be unique for each level in a type hierarchy, but can be reused between base-classes and sub-classes. For example, Base class can declare a field with id 0 and a different field can be declared by Sub : Base with the same id, 0.
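
The last rule can be illustrated with a small type hierarchy. The IdAttribute below is a local stand-in declared only so that the sketch compiles on its own; in a real project the [Hagar.Id] attribute used elsewhere in this document would take its place.

```csharp
using System;

// Stand-in for Hagar's [Id] attribute so that this sketch compiles without the package.
[AttributeUsage(AttributeTargets.Property | AttributeTargets.Field)]
public sealed class IdAttribute : Attribute
{
    public IdAttribute(uint id) => Id = id;
    public uint Id { get; }
}

public class Base
{
    [Id(0)] public string Name { get; set; }
}

public class Sub : Base
{
    // Reusing id 0 is allowed: it belongs to a different level of the hierarchy.
    [Id(0)] public int Count { get; set; }

    // A field added in a later version takes a fresh id at this level.
    [Id(1)] public DateTime Created { get; set; }
}
```

Changing Count to id 2 later, or inserting a new class between Base and Sub, would break compatibility with previously serialized data according to the rules above.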

Numerics

  • The signedness of a numeric field cannot be changed.
    • Conversions between int & uint are invalid.
  • The width of a numeric field can be changed.
    • E.g., conversions from int to long or ulong to ushort are supported.
    • Conversions which narrow the width will throw if the runtime value of a field would cause an overflow.
      • Conversions from ulong to ushort are only supported if the value at runtime is no greater than ushort.MaxValue.
      • Conversions from double to float are only supported if the runtime value is between float.MinValue and float.MaxValue.
      • Similarly for decimal, which has a narrower range than both double and float.
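
The narrowing behavior can be sketched with plain .NET conversions. NarrowingSketch is a hypothetical helper, not Hagar's codec code. Note that a cast from double to float does not throw on its own in .NET, even in a checked context, so the range test is written out explicitly. How NaN and infinities are treated is not stated above; in this sketch NaN passes through and infinities are rejected.

```csharp
using System;

// Hypothetical helper showing range-checked narrowing conversions.
public static class NarrowingSketch
{
    // checked(...) makes the integer cast throw OverflowException when the value does not fit.
    public static ushort ToUInt16(ulong value) => checked((ushort)value);

    // Floating-point casts never throw, so the range is tested by hand.
    public static float ToSingle(double value)
    {
        if (value < float.MinValue || value > float.MaxValue)
            throw new OverflowException($"{value} is outside the range of float.");
        return (float)value;
    }
}
```

ToUInt16(65535) succeeds, while ToUInt16(65536) and ToSingle(1e39) both throw OverflowException.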

Types

  • Types can be added to the system and used as long as either:
    • They are only used for newly added fields.
    • They are never used on older versions which do not have access to that type.
  • Type names cannot be changed unless the type was always registered as a WellKnown type or a TypeCodec is used to translate between the old and new name.

Packages

Packages are published to nuget.org: https://www.nuget.org/packages?q=hagar

Running build.ps1 will build and locally publish packages for testing purposes.

hagar's People

Contributors

alexmg, dependabot[bot], mcm2020, reubenbond, ronbrogan, sepppenner, tornhoof

hagar's Issues

Orleans integration

How soon can we expect to get this on NuGet with a separate package to use it with Orleans? I've been putting off switching from the default one because the others are a bit too invasive for my liking.

Remove WireType.Fixed128 & leave as reserved

This WireType is unnecessary and rarely used. I would rather we keep it as reserved for the time being. This will mostly require rewriting the Guid codec to use a length prefix instead.

Support round-tripping unknown fields

Currently, Hagar will safely and predictably deserialize types where the payload contains unknown fields. It supports this by marking and otherwise ignoring the field when it encounters it in the bit stream. If it encounters a reference to that ignored field later, then it will look at that mark and deserialize the field (now knowing what type to deserialize the previously unknown field as).

However, Hagar does not yet support re-serializing that object with full fidelity: those ignored fields will not be serialized. In order to support scenarios where this is useful, Hagar should recognize objects which have a property/field marked with a [Hagar.ExtensionData] attribute. Initially, we can require that the member be declared as an object, and potentially add an interface later which allows some small degree of introspection, as needed.

Example:

public class MyData
{
  // This could be public
  [Hagar.ExtensionData]
  private object _extensionData;

  [Hagar.Id(0)]
  public int MyValue { get; set; }
}

We can optionally also define an interface which users can optionally implement instead of annotating a field themselves.

public interface IHasExtensionData
{
  [Hagar.ExtensionData]
  object ExtensionData { get; set; }
}

Code generation needs to be updated to support identifying extension data members, deserializing into them, and serializing from them. This is a substantial change, since it means that the existing, optimized routine for serialization cannot be used. Instead, the generated code will need to check for extension data before every known field which has a gap in field ids before it.

Given a type definition:

public class MyData
{
  [Id(1)] public int MyInt { get; set; }

  [Id(2)] public int MyInt2 { get; set; }

  [Id(44)] public int MyInt3 { get; set; }
}

The serialization order would be

  • Serialize any unknown field with id 0
  • Serialize known field with id 1
  • Serialize known field with id 2 (no gaps between 1 and 2)
  • Serialize any unknown fields with ids between 2 and 44
  • Serialize known field with id 44
  • Serialize any unknown fields with ids greater than 44
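
The ordering above amounts to merging two sorted sequences of field ids. The following sketch shows that merge in isolation. ExtensionDataOrderSketch is a hypothetical name, and the strings it returns merely stand in for the writes that generated code would perform.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: interleaves known field ids with ids preserved in extension data.
public static class ExtensionDataOrderSketch
{
    public static List<string> WriteOrder(IEnumerable<uint> knownIds, IEnumerable<uint> unknownIds)
    {
        var known = new List<uint>(knownIds);
        known.Sort();
        var unknown = new List<uint>(unknownIds);
        unknown.Sort();

        var order = new List<string>();
        int next = 0; // index of the next unknown field still to be written
        foreach (var id in known)
        {
            // Flush unknown fields that fall into the gap before this known field.
            while (next < unknown.Count && unknown[next] < id)
                order.Add($"unknown {unknown[next++]}");
            order.Add($"known {id}");
        }

        // Unknown fields with ids greater than the last known field come last.
        while (next < unknown.Count)
            order.Add($"unknown {unknown[next++]}");
        return order;
    }
}
```

For known ids 1, 2, 44 and preserved unknown ids 0, 7, 50, the result is: unknown 0, known 1, known 2, unknown 7, known 44, unknown 50.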

The performance hit will not be insignificant in some cases, and therefore the code generator should decide whether to use the existing, optimized routine, or this proposed routine based on whether the type (or a parent type) has an [ExtensionData] member.

Similarly, deserialization will need to change, but that change is likely not as involved: instead of ignoring unknown fields, it will need to place them into the extension data.

Generated code can call static helper methods to help with that serialization and deserialization.

DeepCopy/Clone support

Add support for generating/retrieving deep copiers for a given type.

Example interface:

public interface IDeepCopier<T>
{
  T DeepCopy(T input);
}
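
To make the intent concrete, here is a hand-written copier for a small graph type. The issue is about generating such copiers, so this is only an illustration of what one might do: Node and NodeCopier are hypothetical, and the interface is repeated so that the sketch compiles on its own. A dictionary of already-copied objects keeps shared and cyclic references intact.

```csharp
using System.Collections.Generic;

public interface IDeepCopier<T>
{
    T DeepCopy(T input);
}

// Hypothetical sample type whose object graph may contain cycles.
public sealed class Node
{
    public int Value;
    public List<Node> Children = new List<Node>();
}

public sealed class NodeCopier : IDeepCopier<Node>
{
    public Node DeepCopy(Node input) =>
        input is null ? null : Copy(input, new Dictionary<Node, Node>());

    private static Node Copy(Node source, Dictionary<Node, Node> copies)
    {
        // Node does not override Equals, so the dictionary compares by reference.
        if (copies.TryGetValue(source, out var existing)) return existing;

        var copy = new Node { Value = source.Value };
        copies[source] = copy; // register before recursing so that cycles terminate
        foreach (var child in source.Children)
            copy.Children.Add(Copy(child, copies));
        return copy;
    }
}
```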

future of Hagar

hello @ReubenBond is this project something you may consider developing further? I consult a US company whose serialization needs align very closely with your design philosophy here. They built a cumbersome set of extensions on top of protobuf which I would be happy to get rid of. Would you be interested in working together to iron out any remaining wrinkles in your library? Say I got permission for me to spend some serious time on this, would you be available to hand-hold?

Optionally intern copies of strings and other immutable reference types

We can serialize in a more compact format if we support interning of strings, so that subsequent copies of a string which are not represented by the same address in .NET (i.e., .NET hasn't already interned them) are serialized only once.

This may not be worthwhile - it likely depends on the workload.
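
One way to picture the idea is a per-message table keyed by string value rather than by object identity. StringInternTable is a hypothetical sketch, not an existing Hagar type; a writer would emit the full string the first time and only a reference id for later equal strings.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical per-message table which maps string values to reference ids.
public sealed class StringInternTable
{
    private readonly Dictionary<string, int> _ids =
        new Dictionary<string, int>(StringComparer.Ordinal);

    // Returns true when an equal string was already seen; id is its reference id.
    // Returns false when the string is new; id is the id just assigned to it.
    public bool TryGetOrAdd(string value, out int id)
    {
        if (_ids.TryGetValue(value, out id)) return true;
        id = _ids.Count;
        _ids.Add(value, id);
        return false;
    }
}
```

The extra dictionary lookup per string is the cost which has to be weighed against the bytes saved, which is why the benefit depends on the workload.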

Motivation for Hagar

This looks like a very cool project.
I'm curious about what motivated you to build a new serialization library and format.

Was it as a learning experience?
Does it solve specific problems for a particular use case?

Allow developer to specify serialized type name for a type

Support user specifying a (string) type name which can be used when serializing and deserializing a Type.

  • [TypeName("my_type")] on the type
  • Reflected by code generator into generated metadata
  • No initial support for generic types
  • Allow any matching attribute named "TypeNameAttribute"?

Generate serialization methods (ctor + method) for partial types

Generate serialization method and deserialization constructor for partial types, useful for types with readonly/initonly/private fields.

The generated ctor must accept dependencies (codecs) as well as the reader and other parameters via the constructor. Therefore, a fixed signature is not suitable.

If there are no existing constructors, also add a default constructor if it's a reference type.

Support surrogate types

There are cases where a type is not owned by the user and its members are not publicly accessible.

This issue is for tracking support for externally specifying a mapping between properties/fields in those types and their stable field ids.

The strategy for doing so is TBD, but it's worth looking at how some other serializers handle this.

Codegen serializer skips complex get/set properties

In attempting to change a property's type, we added a bridge-property that can take its value if an older client only sends the original property. This caused codegen to silently ignore the new property altogether (i.e. new client to new server case).

Repro:

    [Hagar.GenerateSerializer]
    public class Foo
    {
        ...
        private string? versHack;

        // To change property RuleEnvironmentId from Guid to string, we add an Id(13)
        // until world-wide rollout allows us to retire Id(12) and use just a normal auto-property on Id(13).
        [JsonIgnore]
        [Hagar.Id(12)]
        public Guid versHackRuleEnvironmentId { get; set; }

        [JsonPropertyName("ruleEnvironmentId")]
        [Hagar.Id(13)]
        public string RuleEnvironmentId
        {
            get => versHack ??= versHackRuleEnvironmentId.ToString();
            set
            {
                versHack = value;
                versHackRuleEnvironmentId = Guid.TryParse(value, out Guid g) ? g : default;
            }
        }
    }

Expected: Write includes both 12 and 13 fields.
Actual:

  • Write only has Id(12) field.
  • No Hagar.Analyzer warning that Id(13) would be dropped by code-gen.

Workaround: Write an IFieldCodec.
