Giter Site home page Giter Site logo

Future-proof serialization about t-digest HOT 24 OPEN

tdunning avatar tdunning commented on May 25, 2024
Future-proof serialization

from t-digest.

Comments (24)

Nikko78 avatar Nikko78 commented on May 25, 2024

You can overload Java serialization process to save what you want, for example in my example, methods getBytes and fromBytes already implements in your Digest classes.

In this example, fields of object are ignored and only declared "digestEncodeWithBytes" field is declared :

public abstract class TDigest implements Serializable {

    // For serialisation, overload ObjectInput/OutputStream reflection
    private static final ObjectStreamField[] serialPersistentFields = {
        new ObjectStreamField("digestEncodeWithBytes", byte[].class)
    };

    // Or use getBytes(ByteBuffer) instead
    public abstract byte[] getBytes();

    // Serialization create the object which implements this abstract class, so call a method to init it
    protected abstract void initWithBytes(byte[] b);

    /**
     * Overload {@link ObjectOutputStream#defaultWriteObject()}
     *
     * @param oos object output stream use for write
     * @throws IOException error during write process
     */
    private void writeObject(ObjectOutputStream oos) throws IOException {
        oos.putFields().put("digestEncodeWithBytes", getBytes());
        oos.writeFields();
    }

    /**
     * Overload {@link ObjectInputStream#defaultReadObject()}
     *
     * @param ois object input stream use for read
     * @throws IOException error during read process
     * @throws ClassNotFoundException class use in serialised stream unknown
     */
    private void readObject(ObjectInputStream ois) throws IOException, ClassNotFoundException {
        Object o = ois.readFields().get("digestEncodeWithBytes", null);
        if (o != null) {
            byte[] b = (byte[]) o;
            initWithBytes(b);
        }
    }
}

You can also overwrite deserialisation with : link

class HackedObjectInputStream extends ObjectInputStream {

    public HackedObjectInputStream(InputStream in) throws IOException {
        super(in);
    }

    @Override
    protected ObjectStreamClass readClassDescriptor() throws IOException, ClassNotFoundException {
        ObjectStreamClass resultClassDescriptor = super.readClassDescriptor();

        if (resultClassDescriptor.getName().equals("oldpackage.Clazz"))
            resultClassDescriptor = ObjectStreamClass.lookup(newpackage.Clazz.class);

        return resultClassDescriptor;
    }
}

from t-digest.

anvinjain avatar anvinjain commented on May 25, 2024

Isn't it possible to add a Transcoder interface which has byte[] asBytes(SerializationModel) and SerializationModel fromBytes(byte[]) methods? SerializationModel here being a java object that is good enough to represent any digest. TDigest interface can now have a create method that takes SerializationModel as input which all digests implement. Basically taking out byte-level concerns out of the digest impls.

You can provide a default transcoder which does serialization similar to how it's done today (with bugfixes), while keeping the flexibility of accepting PRs from others that implement transcoders for other formats. If this line of thought makes sense to you and more importantly if we can come up with a common model to represent all digests, I will be happy to contribute a PR!

A little background: we have a project that does jvm profiling for clusters and all our components talk Protobuf. We deal with latencies, scheduler delays and other metrics and streaming quantiles make a lot of sense. Hence the interest in adopting tdigest but it does not have protobuf support today nor the design is friendly enough to submit a PR for the same

from t-digest.

tdunning avatar tdunning commented on May 25, 2024

from t-digest.

anvinjain avatar anvinjain commented on May 25, 2024

No. Basically, I wanted to enable bring your own serialization logic with some sane ones provided by the lib. By SerializationModel, I am implying one java class something on the lines of:

class SerializationModel {
    int encodingType; double min; double max; float compression; // ... and other members
}

class MergingDigest {
    // ...
    static MergingDigest fromSerializationModel(SerializationModel);
}

interface Transcoder {
    byte[] asBytes(SerializationModel);
    SerializationModel fromBytes(byte[]);
}

class ProtobufTranscoder implements Transcoder {
    // ...
}

This keeps the byte magic out of digest implementations and they are responsible for generating a uniform on-heap representation (which is SerializationModel, I know this name sucks)
Users can instantiate any out of the box transcoder or write their own. A way to maintain backward compatibility would be for MergingDigest, etc to optionally take a transcoder as constructor param and use that (or default one if not provided) to keep existing asBytes, fromBytes method working.

If you have any other ideas on how tdigest can support multiple serialization formats (and hopefully you consider that as a feature to be added in the first place), would love to hear about them.

from t-digest.

tdunning avatar tdunning commented on May 25, 2024

from t-digest.

anvinjain avatar anvinjain commented on May 25, 2024

Or DigestModel? I see this as a dto representing digests.
Sure, I have some free time next week so expect a PR then.

from t-digest.

anvinjain avatar anvinjain commented on May 25, 2024

You mention interchangeability of digest implementations. Presently, MergingDigest and AvlTreeDigest serialize different values so their serialized representation is not equivalent. Will need your inputs on structure of DigestModel and what fields of MergingDigest/AvlTreeDigest map to DigestModel fields.
The straightforward ones are min, max, compression, mean[]. But rest differ across impls.

from t-digest.

tdunning avatar tdunning commented on May 25, 2024

from t-digest.

anvinjain avatar anvinjain commented on May 25, 2024

@tdunning Need your inputs on what DigestModel looks like. As mentioned in previous comment, presently different fields are serialized across digest implementations

from t-digest.

tdunning avatar tdunning commented on May 25, 2024

from t-digest.

anvinjain avatar anvinjain commented on May 25, 2024

Apart from bufferSize, MergingDigest serializes lastUsedCell. For verbose encoding, this is deserialized and set as a property in digest. For small encoding, lastUsedCell additionally represents the values to read and assign in mean/weight array. Rest of the array seems to be zero-filled since n(array size) is part of serde as well.

Also, oddly MergingDigest#asBytes serializes tempMean.length but MergingDigest#fromBytes (verbose) does not expect it to be present in buffer. Is it a known issue or am I missing something there?

from t-digest.

tdunning avatar tdunning commented on May 25, 2024

from t-digest.

tdunning avatar tdunning commented on May 25, 2024

@anvinjain Do you have a PR coming?

We need to cut the new major release. If you can't get this in soon, it will have to wait.

from t-digest.

anvinjain avatar anvinjain commented on May 25, 2024

from t-digest.

tdunning avatar tdunning commented on May 25, 2024

from t-digest.

tdunning avatar tdunning commented on May 25, 2024

This is being pushed beyond the 3.2 release.

from t-digest.

anvinjain avatar anvinjain commented on May 25, 2024

@tdunning I have raised a PR around this.

from t-digest.

tdunning avatar tdunning commented on May 25, 2024

from t-digest.

anvinjain avatar anvinjain commented on May 25, 2024

@tdunning Did you get a chance to look at the PR?

from t-digest.

tdunning avatar tdunning commented on May 25, 2024

from t-digest.

anvinjain avatar anvinjain commented on May 25, 2024

@tdunning Ping. Is there a timeline for next release? If you don't see this PR #99 making the cut without major changes, let me know.

from t-digest.

scheler avatar scheler commented on May 25, 2024

@tdunning looks like the PR was never merged. This is of interest to us too, so if there are any review comments we can take a look at addressing them too. Thanks in advance.

from t-digest.

tdunning avatar tdunning commented on May 25, 2024

from t-digest.

amalic avatar amalic commented on May 25, 2024

Here's my proposal: #162

  • Serializations can be configured via serialization.properties file.
  • The serialization gets instatniated with a java.util.Properties object containing all properties form the serialization.properties file
  • It can load custom implementations via classForName as long as they have been added to the classpath and configured in the serialization.properties file. It's up to the implementation what to do with those properties.

Example for serialization.properties file

tdigest.serializerClass=org.example.CustomSerializer
someProperty=someValue
someOtherProperty=someOtherValue

This is the current implementation of TDigestDefaultSerializer

public class TDigestDefaultSerializer extends AbstractTDigestSerializer<TDigest, byte[]> {
    
    public TDigestDefaultSerializer(Properties properties) {
        super(properties);
    }

    @Override
    public byte[] serialize(TDigest tDigest) throws TDigestSerializerException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream(5120);
        try (ObjectOutputStream out = new ObjectOutputStream(baos)) {
            out.writeObject(tDigest);
            return baos.toByteArray();
        } catch (IOException e) {
            throw new TDigestSerializerException(e);
        } 
    }

    @Override
    public TDigest deserialize(byte[] object) throws TDigestSerializerException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(object))) {
            return (TDigest) in.readObject();
        } catch (ClassCastException | ClassNotFoundException | IOException e) {
            throw new TDigestSerializerException(e);
        }
    }

}

This is the code for AbstractTDigestSerializer

public abstract class AbstractTDigestSerializer<T extends TDigest, O extends Object> implements TDigestSerializer<T, O> {
    private Properties properties;
    
    public AbstractTDigestSerializer(Properties properties) {
        this.properties = properties;
    }    
    
    public Properties getProperties() {
        return properties;
    }

}

Hope this is being seen as future proof.

There's still some work to do:

  • code formatting (I use 🌑Eclipse) and review (especially how I do generics 😅)
  • proper JavaDoc documentation in source code
  • additional tests for custom serialization implementations and property file handling

from t-digest.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.