avro-fastserde's Introduction

avro-fastserde

avro-fastserde is an alternative approach to Apache Avro serialization and deserialization. It generates dedicated code for serialization and deserialization, which performs better than the native Avro implementation.

Deprecation info

All users are encouraged to switch to linkedin/avro-util project which has incorporated and extended functionalities that have originated from this project. All fixes and improvements will happen now through the avro-util fork of this library.

Version

Current version is 1.0.7. This is the final official release of this project.

Requirements

You need Java 8 to use this library.

Installation

Releases are distributed on Maven central:

<dependency>
    <groupId>com.rtbhouse</groupId>
    <artifactId>avro-fastserde</artifactId>
    <version>1.0.7</version>
</dependency>

Usage

Use the avro-fastserde implementations of the DatumReader and DatumWriter interfaces:

import com.rtbhouse.utils.avro.FastGenericDatumReader;
import com.rtbhouse.utils.avro.FastGenericDatumWriter;
import com.rtbhouse.utils.avro.FastSpecificDatumReader;
import com.rtbhouse.utils.avro.FastSpecificDatumWriter;

...

FastGenericDatumReader<GenericData.Record> fastGenericDatumReader = new FastGenericDatumReader<>(writerSchema, readerSchema);
fastGenericDatumReader.read(null, binaryDecoder);

FastGenericDatumWriter<GenericData.Record> fastGenericDatumWriter = new FastGenericDatumWriter<>(schema);
fastGenericDatumWriter.write(data, binaryEncoder);

FastSpecificDatumReader<T> fastSpecificDatumReader = new FastSpecificDatumReader<>(writerSchema, readerSchema);
fastSpecificDatumReader.read(null, binaryDecoder);

FastSpecificDatumWriter<T> fastSpecificDatumWriter = new FastSpecificDatumWriter<>(schema);
fastSpecificDatumWriter.write(data, binaryEncoder);

You can alter class generation behaviour via system properties:

 // Set compilation classpath
 System.setProperty(FastSerdeCache.CLASSPATH, compileClasspath);
 
 // Set generated classes directory
 System.setProperty(FastSerdeCache.GENERATED_CLASSES_DIR, generatedClassesDir);

Or FastSerdeCache class:

import com.rtbhouse.utils.avro.FastGenericDatumReader;
import com.rtbhouse.utils.avro.FastSerdeCache;

...

FastSerdeCache cache = new FastSerdeCache(compileClassPath);
FastGenericDatumReader<GenericData.Record> fastGenericDatumReader = new FastGenericDatumReader<>(writerSchema, readerSchema, cache);

Limitations

  • no support for reuse parameter in DatumReader interface.
  • no support for SchemaConstructable marker interface for specific Avro records.
  • FastSpecificDatumReader will not read data into GenericRecord when the specific classes are not available. Instead, compilation of the generated code fails and the reader falls back to the default SpecificDatumReader implementation.
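
The fallback described in the last limitation can be pictured with a small, library-independent sketch. This is not avro-fastserde's code: the class FallbackReader and its method names are hypothetical, and the only thing taken from this page is that generation runs in the background (the stack traces reported in the issues show it executing inside a CompletableFuture) and that a failure leaves the default implementation in use. Under those assumptions, a reader wrapper might look roughly like this:

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

/** Hypothetical illustration of a "fast path with fallback" reader; not part of avro-fastserde. */
public class FallbackReader<I, O> {
    private final Function<I, O> fallback;                     // stands in for the default reader
    private final CompletableFuture<Function<I, O>> generated; // stands in for the generated reader

    public FallbackReader(Function<I, O> fallback,
                          CompletableFuture<Function<I, O>> generated) {
        this.fallback = fallback;
        this.generated = generated;
    }

    /** True only once generation has finished without an error. */
    public boolean isFastPathActive() {
        return generated.isDone() && !generated.isCompletedExceptionally();
    }

    /** Uses the generated reader when it is ready, otherwise the fallback. */
    public O read(I input) {
        if (isFastPathActive()) {
            return generated.join().apply(input);
        }
        return fallback.apply(input);
    }
}
```

With a wrapper of this shape, a failed generation is silent from the caller's point of view: reads keep succeeding through the fallback, and throughput simply stays at the baseline. Exposing something like isFastPathActive() is one way to make that state observable.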

avro-fastserde's People

Contributors

dervan, flowenol, michal-bielecki-rtbhouse, sirpiotrek, solid-design

avro-fastserde's Issues

Null elements in arrays are not supported

The test case is as follows:

@Test
public void testNullElementArray() {
    // given: an array schema whose items are a union of string and null
    Schema arrayRecordSchema = Schema.createArray(Schema.createUnion(
            Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.NULL)));

    List<Object> records = new ArrayList<Object>();
    records.add(null);
    records.add(null);
    records.add(null);

    // when
    List<Object> array = decodeRecord(arrayRecordSchema, arrayRecordSchema,
            specificDataAsDecoder(records, arrayRecordSchema));

    // then
    Assert.assertEquals(3, array.size());
    Assert.assertNull(array.get(0));
    Assert.assertNull(array.get(1));
    Assert.assertNull(array.get(2));
}
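
For context, the bytes that such a test feeds to the decoder are small enough to work out by hand. The sketch below is an illustration only and uses no Avro or avro-fastserde classes; the class NullArrayEncoding and its methods are hypothetical names. It follows the Avro binary encoding rules: an array is written as a block count, the items, and a terminating zero; a union item is written as its branch index followed by the value; null occupies no bytes; and counts and indexes are zig-zag variable-length integers.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical helper that hand-encodes an Avro array of null union items. */
public class NullArrayEncoding {

    /** Writes a long using zig-zag, variable-length encoding. */
    public static void writeLong(ByteArrayOutputStream out, long n) {
        long z = (n << 1) ^ (n >> 63);          // zig-zag: small magnitudes become small codes
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80)); // 7 payload bits, continuation bit set
            z >>>= 7;
        }
        out.write((int) z);                       // last byte, continuation bit clear
    }

    /**
     * Encodes `count` null items for an array whose item type is a union
     * in which "null" sits at position `nullBranch`.
     */
    public static byte[] encodeNullArray(int count, int nullBranch) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (count > 0) {
            writeLong(out, count);                // one block holding all items
            for (int i = 0; i < count; i++) {
                writeLong(out, nullBranch);       // branch index; null adds no bytes
            }
        }
        writeLong(out, 0);                        // zero-length block ends the array
        return out.toByteArray();
    }
}
```

For the schema in the test, where the union is [string, null] and null is therefore branch 1, encodeNullArray(3, 1) yields the five bytes 06 02 02 02 00.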

No Performance Gain over Base Avro

I am using Avro to parse messages generated by another vendor (a single fixed schema) and found that the parsing performance is not great. I was looking for ways to improve performance when I came across the avro-fastserde library.

My initial attempt at using the library was not successful: I was able to parse messages, but the performance was identical to the base Avro implementation. I was hoping you might be able to provide some additional insight into what I might be doing wrong, as the documentation does not provide a complete working example.

In my case I have used the Avro schema provided by the vendor to generate a SpecificDatumParser and all supporting classes. The code used to parse the messages looks something like this...

final SpecificDatumReader<HfReadData> avroHfReader = new SpecificDatumReader<HfReadData>(HfReadData.SCHEMA$);

DataFileReader<HfReadData> dataFileReader = new DataFileReader<HfReadData>(byteStream, avroHfReader);

while (dataFileReader.hasNext()) {
    HfReadData hfRead = hfReadQueue.take();
    hfRead = dataFileReader.next(hfRead);
    parseAvroMessage(hfRead);
}

where the parseAvroMessage() method parses the message into various objects before passing them on to the application for processing. Note: the JSON schema is moderately simple, consisting of a single record with an array of 1..n sub-records. The parser method combines sets of sub-records into single objects. This method consumes minimal CPU, as verified by taking Flight Recordings with Java Mission Control.

Here is the FastSerde implementation...

final FastSerdeCache serdeCache = new FastSerdeCache("./build/classes/main/com/serde");
final FastSpecificDatumReader<HfReadData> fastSpecificDatumReader = new FastSpecificDatumReader<HfReadData>(HfReadData.SCHEMA$, HfReadData.SCHEMA$, serdeCache);

DataFileReader<HfReadData> dataFileReader = new DataFileReader<HfReadData>(byteStream, fastSpecificDatumReader);

while (dataFileReader.hasNext()) {
    HfReadData hfRead = hfReadQueue.take();
    hfRead = dataFileReader.next(hfRead);
    parseAvroMessage(hfRead);
}

As noted above, I get the exact same throughput regardless of whether I use FastSerde or base Avro. On my test setup it takes about 5 minutes to process 100,000 messages, and flight recordings from each test run show slight differences but are otherwise more or less the same.

You note that FastSpecificDatumReader will fall back to SpecificDatumReader if the specific classes are not available. I suspect this is what is happening in my case (hence the identical performance). I feel like I have done something wrong, but I am not sure what.

FastDeserializerGenerator throws NPE for avro schemas with nested records

For avro schemas with nested records structure, FastDeserializerGenerator.generateDeserializer throws NPE if FastGenericDatumReader is used with a (reader) schema different from the writer schema. It deserializes correctly if both writer and reader schemas are the same.

Sample writer schema (a record with 3 nested records):
{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {
      "name": "address",
      "type": {
        "type": "record",
        "name": "address_data",
        "fields": [
          {
            "name": "streetName",
            "type": "string",
            "doc": "name of street"
          },
          {
            "name": "city",
            "type": "string",
            "doc": "name of city"
          }
        ]
      },
      "doc": "user addresses"
    },
    {
      "name": "segment",
      "type": {
        "type": "record",
        "name": "segment_data",
        "fields": [
          {
            "name": "segmentA",
            "type": "string",
            "doc": "name of segment A"
          },
          {
            "name": "segmentB",
            "type": "string",
            "doc": "name of segment B"
          }
        ]
      },
      "doc": "user segments"
    },
    {
      "name": "devices",
      "type": {
        "type": "record",
        "name": "device_data",
        "fields": [
          {
            "name": "deviceA",
            "type": "string",
            "doc": "name of device A"
          },
          {
            "name": "deviceB",
            "type": "string",
            "doc": "name of device B"
          }
        ]
      },
      "doc": "user devices"
    }
  ],
  "doc": "user schema"
}

Sample reader schema (contains one less record than the writer):
{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {
      "name": "address",
      "type": {
        "type": "record",
        "name": "address_data",
        "fields": [
          {
            "name": "streetName",
            "type": "string",
            "doc": "name of street"
          },
          {
            "name": "city",
            "type": "string",
            "doc": "name of city"
          }
        ]
      },
      "doc": "user addresses"
    },
    {
      "name": "devices",
      "type": {
        "type": "record",
        "name": "device_data",
        "fields": [
          {
            "name": "deviceA",
            "type": "string",
            "doc": "name of device A"
          },
          {
            "name": "deviceB",
            "type": "string",
            "doc": "name of device B"
          }
        ]
      },
      "doc": "user devices"
    }
  ],
  "doc": "user schema"
}

WARNING: deserializer generation exception
com.rtbhouse.utils.avro.FastDeserializerGeneratorException: java.lang.NullPointerException
at com.rtbhouse.utils.avro.FastDeserializerGenerator.generateDeserializer(FastDeserializerGenerator.java:169)
at com.rtbhouse.utils.avro.FastSerdeCache.buildGenericDeserializer(FastSerdeCache.java:322)
at com.rtbhouse.utils.avro.FastSerdeCache.lambda$getFastGenericDeserializer$4(FastSerdeCache.java:225)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at com.rtbhouse.utils.avro.FastDeserializerGenerator.processRecord(FastDeserializerGenerator.java:227)
at com.rtbhouse.utils.avro.FastDeserializerGenerator.generateDeserializer(FastDeserializerGenerator.java:97)
... 6 more

Clarify License

Hi

Could you possibly add a LICENSE file to the project, to clarify which license it can be used under (e.g. MIT, BSD-3-Clause, or Apache License 2.0)?

We noticed that some of your other projects have one (although it seems that sometimes MIT is used and other times ASLv2):
https://github.com/RTBHOUSE/graphite-gw/blob/master/LICENSE
https://github.com/RTBHOUSE/torch-utils/blob/master/LICENSE

thanks.
Mike

(I'm asking because I want to add this as a dependency, in place of the vanilla Avro serializer, in an Apache-licensed open source project. To do that, it is important that every dependency has a license and that the license is ASL-compatible; the ones I mentioned are.)

Support StringableClasses

Could this support StringableClasses, please? Currently a cast exception is thrown when such classes are defined in the Avro schema.

e.g.

{ "name": "id", "type": { "type": "string", "java-class": "java.math.BigInteger" } }

https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java

/** Read/write some common builtin classes as strings. Representing these as
 * strings isn't always best, as they aren't always ordered ideally, but at
 * least they're stored. Also note that, for compatibility, only classes
 * that wouldn't be otherwise correctly readable or writable should be added
 * here, e.g., those without a no-arg constructor or those whose fields are
 * all transient. */
protected Set stringableClasses = new HashSet<>();
{
    stringableClasses.add(java.math.BigDecimal.class);
    stringableClasses.add(java.math.BigInteger.class);
    stringableClasses.add(java.net.URI.class);
    stringableClasses.add(java.net.URL.class);
    stringableClasses.add(java.io.File.class);
}

Two failure cases of FastDeserializerGenerator

We have seen two cases in which FastDeserializer generation fails:

  1. A record with nested map fields
  2. A record that holds a union, one branch of which is the record itself

These two issues have been fixed in the following PR:
linkedin/avro-util#46

Can you take a look at this PR to see whether you need this fix?

NPE in processUnion

We're seeing the following NPE when running on Java 11 using fastserde-1.0.5.

com.rtbhouse.utils.avro.FastDeserializerGeneratorException: java.lang.NullPointerException
at com.rtbhouse.utils.avro.FastDeserializerGenerator.generateDeserializer(FastDeserializerGenerator.java:139) ~[avro-fastserde-1.0.5.jar:?]
at com.rtbhouse.utils.avro.FastSerdeCache.buildSpecificDeserializer(FastSerdeCache.java:319) ~[avro-fastserde-1.0.5.jar:?]
at com.rtbhouse.utils.avro.FastSerdeCache.lambda$getFastSpecificDeserializer$1(FastSerdeCache.java:207) ~[avro-fastserde-1.0.5.jar:?]
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: java.lang.NullPointerException
at com.rtbhouse.utils.avro.FastDeserializerGenerator.processUnion(FastDeserializerGenerator.java:473) ~[avro-fastserde-1.0.5.jar:?]
at com.rtbhouse.utils.avro.FastDeserializerGenerator.processComplexType(FastDeserializerGenerator.java:157) ~[avro-fastserde-1.0.5.jar:?]
at com.rtbhouse.utils.avro.FastDeserializerGenerator.processRecord(FastDeserializerGenerator.java:264) ~[avro-fastserde-1.0.5.jar:?]
at com.rtbhouse.utils.avro.FastDeserializerGenerator.generateDeserializer(FastDeserializerGenerator.java:102) ~[avro-fastserde-1.0.5.jar:?]
... 6 more

The error is at this line:

Symbol.UnionAdjustAction unionAdjustAction = (Symbol.UnionAdjustAction) alternative.symbols[i].production[0];

Here, production == null.

I think the relevant part of the schema is the "type" in the following:

{
"name": "name",
"type": ["null", "string"]
}

Example usage of the efficient deserializer

Very interesting project.
Would you be able to share an example of, or a pointer to, the code in Avro that performs the schema analysis and is slower, and explain how you hand-rolled a deserializer?

Regarding FastSerdeCache cache = new FastSerdeCache(compileClassPath); when and why do I need to set this?
Do I need to set it at all? From reading the blog, I thought class generation happens on demand. Is there a way to enable compilation ahead of time for cases where I already know the schema I am trying to deserialize?
How can I check that the class has been generated and compiled, and observe in a stack trace that the more efficient deserializer is being invoked?
