This project forked from linkedin/venice

Venice, Derived Data Platform for planet-scale workloads.

License: BSD 2-Clause "Simplified" License

Shell 0.05% Python 0.11% Java 99.75% TLA 0.06% Dockerfile 0.02%

Venice

Derived Data Platform for Planet-Scale Workloads

Venice is a derived data storage platform, providing the following characteristics:

  1. High throughput asynchronous ingestion from batch and streaming sources (e.g. Hadoop and Samza).
  2. Low latency online reads via remote queries or in-process caching.
  3. Active-active replication between regions with CRDT-based conflict resolution.
  4. Multi-cluster support within each region with operator-driven cluster assignment.
  5. Multi-tenancy, horizontal scalability and elasticity within each cluster.

The above makes Venice particularly suitable as the stateful component backing a Feature Store, such as Feathr. AI applications feed the output of their ML training jobs into Venice and then query the data for use during online inference workloads.

Write Path

The Venice write path can be broken down into three granularities: full dataset swap, insertion of many rows into an existing dataset, and updates of some columns of some rows. All three granularities are supported by Hadoop and Samza, thus leading to the below full matrix of supported operations:

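|                                                 | Hadoop                                   | Samza                            |
| ----------------------------------------------- | ---------------------------------------- | -------------------------------- |
| Full dataset swap                               | Full Push Job                            | Reprocessing Job                 |
| Insertion of some rows into an existing dataset | Incremental Push Job                     | Real-Time Job                    |
| Updates to some columns of some rows            | Incremental Push Job doing Write Compute | Real-Time Job doing Write Compute |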
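
As a rough illustration, the matrix above can be expressed as a lookup from (granularity, source) to job type. The enum and class names below are made up for this sketch and are not part of the Venice API; only the job names come from the matrix.

```java
import java.util.EnumMap;
import java.util.Map;

// Illustrative names only: these enums are NOT part of the Venice API.
enum Granularity { FULL_DATASET_SWAP, ROW_INSERTION, COLUMN_UPDATE }

enum Source { HADOOP, SAMZA }

final class WritePathMatrix {
    private static final Map<Granularity, Map<Source, String>> MATRIX =
            new EnumMap<>(Granularity.class);

    static {
        // One row per write granularity, one column per source.
        put(Granularity.FULL_DATASET_SWAP, "Full Push Job", "Reprocessing Job");
        put(Granularity.ROW_INSERTION, "Incremental Push Job", "Real-Time Job");
        put(Granularity.COLUMN_UPDATE,
                "Incremental Push Job doing Write Compute",
                "Real-Time Job doing Write Compute");
    }

    private static void put(Granularity g, String hadoopJob, String samzaJob) {
        Map<Source, String> row = new EnumMap<>(Source.class);
        row.put(Source.HADOOP, hadoopJob);
        row.put(Source.SAMZA, samzaJob);
        MATRIX.put(g, row);
    }

    // Returns the job type that handles the given granularity for the given source.
    static String jobFor(Granularity g, Source s) {
        return MATRIX.get(g).get(s);
    }
}
```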

Hybrid Stores

Moreover, the three granularities of write operations can all be mixed within a single dataset. A dataset which gets full dataset swaps in addition to row insertion or row updates is called hybrid.

As part of configuring a store to be hybrid, an important concept is the rewind time, which defines how far back recent real-time writes should be rewound and applied on top of the new generation of the dataset getting swapped in.

Leveraging this mechanism, it is possible to overlay the output of a stream processing job on top of that of a batch job. If using partial updates, then it is possible to have some of the columns be updated in real-time and some in batch, and these two sets of columns can either overlap or be disjoint, as desired.
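
The overlay behaviour can be sketched as a toy model. This is not Venice code: `RealTimeWrite` and `RewindSketch` are hypothetical names, values are plain strings, and the sketch assumes the rewind window is measured backwards from the moment of the swap and that the real-time log is ordered by time.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of one real-time write; not a Venice class.
final class RealTimeWrite {
    final String key;
    final String value;
    final Instant producedAt;

    RealTimeWrite(String key, String value, Instant producedAt) {
        this.key = key;
        this.value = value;
        this.producedAt = producedAt;
    }
}

final class RewindSketch {
    // Builds the dataset served after a swap: start from the new batch
    // generation, then replay the real-time writes that fall inside the
    // rewind window. Assumption: the window ends at swapTime.
    static Map<String, String> swapIn(Map<String, String> newBatchGeneration,
                                      List<RealTimeWrite> realTimeLog,
                                      Instant swapTime,
                                      Duration rewindTime) {
        Instant cutoff = swapTime.minus(rewindTime);
        Map<String, String> result = new HashMap<>(newBatchGeneration);
        for (RealTimeWrite w : realTimeLog) {
            // Writes older than the cutoff are not replayed, so the batch
            // value for that key wins.
            if (!w.producedAt.isBefore(cutoff)) {
                result.put(w.key, w.value);
            }
        }
        return result;
    }
}
```

In this model, a longer rewind time replays more of the real-time history on top of each new batch generation, at the cost of more work per swap.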

Write Compute

Write Compute includes two kinds of operations, which can be performed on the value associated with a given key:

  • Partial update: set the content of a field within the value.
  • Collection merging: add or remove entries in a set or map.

N.B.: Currently, write compute is only supported in conjunction with active-passive replication. Support for active-active replication is under development.
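
To make the two kinds of operation concrete, here is a minimal sketch that models a value as a `Map<String, Object>`. The class and method names are hypothetical and do not reflect the actual Venice write compute API; the ordering of additions before removals is also an assumption of this sketch.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: a stored value is a mutable map from field name to content.
final class WriteComputeSketch {
    // Partial update: overwrite one field and leave every other field untouched.
    static void setField(Map<String, Object> value, String field, Object newContent) {
        value.put(field, newContent);
    }

    // Collection merging on a set-typed field: add some entries, remove others.
    // Assumption of this sketch: additions are applied before removals, so an
    // entry present in both arguments ends up removed.
    @SuppressWarnings("unchecked")
    static void mergeSet(Map<String, Object> value, String field,
                         Set<String> toAdd, Set<String> toRemove) {
        Set<String> current =
                (Set<String>) value.computeIfAbsent(field, f -> new HashSet<String>());
        current.addAll(toAdd);
        current.removeAll(toRemove);
    }
}
```

Neither operation requires the writer to send the whole value: only the changed field, or the entries to add and remove, need to travel with the write.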

Read Path

Venice supports the following read APIs:

  • Single get: get the value associated with a single key.
  • Batch get: get the values associated with a set of keys.
  • Read compute: project some fields and/or compute some function on the fields of values associated with a set of keys.
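
The shape of these three read APIs can be illustrated with an in-memory stand-in. `InMemoryReadSketch` is a hypothetical class for this sketch, not the Venice client; it is synchronous, models values as field maps, and only shows the projection half of read compute.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in; the real Venice client is not shown here.
final class InMemoryReadSketch {
    private final Map<String, Map<String, Object>> rows = new HashMap<>();

    void put(String key, Map<String, Object> value) {
        rows.put(key, value);
    }

    // Single get: the value for one key, or null when the key is absent.
    Map<String, Object> get(String key) {
        return rows.get(key);
    }

    // Batch get: values for the requested keys; missing keys are omitted.
    Map<String, Map<String, Object>> batchGet(Set<String> keys) {
        Map<String, Map<String, Object>> out = new LinkedHashMap<>();
        for (String k : keys) {
            Map<String, Object> v = rows.get(k);
            if (v != null) {
                out.put(k, v);
            }
        }
        return out;
    }

    // Read compute (projection only): return just the requested fields
    // of each value, copying so the stored rows are not modified.
    Map<String, Map<String, Object>> project(Set<String> keys, Set<String> fields) {
        Map<String, Map<String, Object>> out = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, Object>> e : batchGet(keys).entrySet()) {
            Map<String, Object> projected = new HashMap<>(e.getValue());
            projected.keySet().retainAll(fields);
            out.put(e.getKey(), projected);
        }
        return out;
    }
}
```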

Read Compute

When using the read compute DSL, the following functions are currently supported:

  • Dot product: perform a dot product on the float vector stored in a given field, against another float vector provided as query param, and return the resulting scalar.
  • Cosine similarity: perform a cosine similarity on the float vector stored in a given field, against another float vector provided as query param, and return the resulting scalar.
  • Hadamard product: perform a Hadamard product on the float vector stored in a given field, against another float vector provided as query param, and return the resulting vector.
  • Collection count: return the number of items in the collection stored in a given field.
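
For reference, the four functions correspond to the following standard definitions. This is a standalone sketch of the math, not the Venice read compute DSL; the class name is made up, and returning 0 for the cosine similarity of a zero vector is a choice made here, not documented Venice behaviour.

```java
import java.util.Collection;

// Plain implementations of the math behind the four read compute functions.
final class ReadComputeMath {
    private static void requireSameLength(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
    }

    // Dot product: sum of element-wise products, a scalar.
    static float dotProduct(float[] a, float[] b) {
        requireSameLength(a, b);
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (double) a[i] * b[i];
        }
        return (float) sum;
    }

    // Cosine similarity: dot product divided by the product of the norms.
    // Assumption of this sketch: a zero-length vector yields 0.
    static float cosineSimilarity(float[] a, float[] b) {
        requireSameLength(a, b);
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0f;
        }
        return (float) (dot / (Math.sqrt(normA) * Math.sqrt(normB)));
    }

    // Hadamard product: element-wise product, a vector of the same length.
    static float[] hadamardProduct(float[] a, float[] b) {
        requireSameLength(a, b);
        float[] out = new float[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] * b[i];
        }
        return out;
    }

    // Collection count: number of items in the collection.
    static int count(Collection<?> items) {
        return items.size();
    }
}
```

For example, with the stored vector (1, 2, 3) and the query vector (4, 5, 6), the dot product is 32 and the Hadamard product is (4, 10, 18).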

Client Modes

There are two main client modes for accessing Venice data:

  • Classical Venice: perform remote queries against Venice's distributed backend service. In this mode, read compute queries are pushed down to the backend and only the computation results are returned to the client.
  • Da Vinci: eagerly load some or all partitions of the dataset and perform queries against the resulting local cache. Future updates to the data continue to be streamed in and applied to the local cache.
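
The Da Vinci mode can be pictured as a local map that is bootstrapped for a chosen subset of partitions and then kept current by a stream of updates. The sketch below is a toy model under stated assumptions: `LocalCacheSketch` is a hypothetical name, and the hash-modulo partitioning is an illustration only, not Venice's actual partitioner.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of an eagerly loaded local cache; not the Da Vinci client API.
final class LocalCacheSketch {
    private final int partitionCount;
    private final Set<Integer> subscribedPartitions;
    private final Map<String, String> localData = new HashMap<>();

    LocalCacheSketch(int partitionCount, Set<Integer> subscribedPartitions) {
        this.partitionCount = partitionCount;
        this.subscribedPartitions = new HashSet<>(subscribedPartitions);
    }

    // Assumption: keys map to partitions by hash modulo the partition count.
    static int partitionOf(String key, int partitionCount) {
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    // Used both for the initial eager load and for later streamed updates:
    // records belonging to partitions we did not subscribe to are ignored.
    void applyUpdate(String key, String value) {
        if (subscribedPartitions.contains(partitionOf(key, partitionCount))) {
            localData.put(key, value);
        }
    }

    // Reads are served from local memory with no remote call.
    String get(String key) {
        return localData.get(key);
    }
}
```

The trade-off between the two modes follows from this picture: the local cache avoids a network hop per read but costs memory or disk on the client, whereas remote queries keep the client light and leave the data in the backend.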

Getting Started

Refer to the Venice quickstart to create your own Venice cluster and play around with some features like creating a data store, batch push, incremental push, and single get.

Previously Published Content

The following blog posts have previously been published about Venice:

The following talks have been given about Venice:

Keep in mind that older content reflects an earlier phase of the project and may not be entirely correct anymore.

Community Resources

Feel free to engage with the community using our:
