
unipop's Introduction

This project is in active development. A stable version will be released soon.

Unipop

Analyze data from multiple sources using the power of graphs.

Unipop is a data federation and virtualization engine that models your data as a "virtual" graph and exposes a query API based on Gremlin, a graph query language (GQL). SQL and SPARQL are also available.

This means you get the benefits of a graph data model without migrating, replicating, or restructuring your data, whether it's stored in an RDBMS, a NoSQL store, or any other data source (see "Customize & Extend" below).

Why Graphs?

Graphs provide a very "natural" way to analyze your data. The simple Vertex/Edge structure makes it very easy to model complex and varied data, and then analyze it by exploring the connections/relationships in it.

This is especially relevant for a data Federation / Virtualization platform, which integrates a large variety of different data sources, structures, and schemas.

Our chosen GQL is Gremlin, which comes as part of the Apache Tinkerpop framework. Let's compare Gremlin to SQL, the industry standard:

|         | Schema | Relationships | Flexibility | Usability |
|---------|--------|---------------|-------------|-----------|
| SQL     | Structured - Tables and their fields need to be explicitly defined. | Joins require knowledge of all relationships (PK/FK), and can become quite complicated. | SQL's syntax requires very specific, rigid structures. | Queries are loosely-typed "free text", often requiring complicated ORMs. |
| Gremlin | Unstructured - Different structures can be created on the fly. | Connections (i.e. edges) are "first-class citizens", enabling easy exploration of your data. | Queries are written in a pipelined ("functional") syntax, providing considerable flexibility. | Host language embedding. Easier to read, write, find errors, and reuse queries. |

The Tinkerpop framework also provides us with other useful features "out of the box":

  • Traversal Strategies - an extensible query optimization mechanism. Unipop utilizes this to implement different performance optimizations.
  • Console & Server - production grade tooling.
  • Language Drivers - JavaScript, TypeScript, PHP, Python, Java, Scala, .Net, Go.
  • Extensible Query Languages - Gremlin, SQL, SPARQL
  • DSL support
  • Testing Framework

Getting started

TBD

Setup

  • Console - a local instance with an interactive Shell for issuing queries.
  • Server - a web server with WebSocket & HTTP APIs.
  • Embedded - run Unipop inside any JVM based application.

Configure

Add your data sources to Unipop's configuration. Configuring a source entails mapping its schema to a "property graph" model (i.e. vertices & edges). Unipop is built in an extensible way, enabling many different mapping options.

Query

Console, Server, Embedded, or Language drivers

Customize & Extend

TBD

How it works

TBD

Technical details.

Contributing

TBD

unipop's People

Contributors

babiy8, blakko, edeneliel, eliranmoyal, guiltyxsin, gurronen, lande24, okram, rmagen, romanmar1, seanbarzilay, spmallette

unipop's Issues

Bulk mutations

Solution suggestion

Optimize the Mutation steps to issue mutation UniQuerys with a batch of Elements, enabling the Controllers to issue bulk commands to the DB.
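As a rough illustration of the batching idea, the sketch below splits a list of elements into fixed-size batches that a Controller could send as one bulk command each. The class name `MutationBatcher` and the `partition` method are hypothetical and not part of Unipop; how a batch maps onto a UniQuery is left open.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Unipop: groups elements into batches
// so that each batch can be issued as a single bulk command.
final class MutationBatcher {
    private MutationBatcher() {}

    static <T> List<List<T>> partition(List<T> elements, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < elements.size(); start += batchSize) {
            int end = Math.min(start + batchSize, elements.size());
            // Copy the sub-list so each batch is independent of the source list.
            batches.add(new ArrayList<>(elements.subList(start, end)));
        }
        return batches;
    }
}
```

The batch size would probably need to be configurable per Controller, since different databases have different bulk limits.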

Questions

  1. Should we enable bulk mutations directly from the UniGraph? That would entail extending Tinkerpop's current Graph API.
  2. Should changes be automatically committed to the DB after every query, or should we add a commit() method to UniGraph/Traversal?
  3. Maybe using BulkLoaderVertexProgram, or a similar solution, would be a better choice?

Dependency Management - JSON

At the moment, we depend on 4 different JSON libraries (including transitive dependencies):

  • json-simple
  • json.org
  • Gson
  • Jackson

For the sake of fewer dependencies, we should consider removing most of them.

The top candidates to keep are Jackson and Gson: Jackson is heavily used by Tinkerpop, and Gson is heavily used by Jest and Hadoop. Further analysis is needed to determine which API is easier to use and which performs better.

  • Tinkerpop
  • Jest
  • Hadoop

Either way, this will require some refactoring to remove the other libraries.

"Customizing & Extending Unipop" guide

Customizing Schemas

  • index by time
  • multiple rows
  • multiple buckets

Extending Controllers

  • Elasticsearch
    • RoutedDocumentController
    • TemplateController
    • AggregationController
    • GeoIntersectController
  • Jdbc
    • StoredProcedureController

Dependency convergence error on build

This sounds cool, so I wanted to give unipop-elastic a try.
Is there a jar hosted on some public repository?

When I build it manually I get the following dependency convergence error, which I couldn't even fix when adding a direct dependency from unipop-elastic on snakeyaml:1.15

Failed while enforcing releasability the error(s) are [
Dependency convergence error for org.yaml:snakeyaml:1.15 paths to dependency are:
+-unipop:unipop-elastic:0.1
  +-unipop:unipop-core:0.1
    +-org.apache.tinkerpop:gremlin-core:3.0.2-incubating
      +-org.yaml:snakeyaml:1.15
and
+-unipop:unipop-elastic:0.1
  +-unipop:unipop-core:0.1
    +-org.yaml:snakeyaml:1.15
and
+-unipop:unipop-elastic:0.1
  +-org.elasticsearch:elasticsearch:1.7.3
    +-org.yaml:snakeyaml:1.12
]
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Unipop ............................................. SUCCESS [  0.627 s]
[INFO] Unipop :: Core ..................................... SUCCESS [  0.762 s]
[INFO] Unipop :: Elasticsearch Controllers ................ FAILURE [  0.709 s]
[INFO] Unipop :: JDBC Controllers ......................... SKIPPED
[INFO] Unipop :: Integration tests ........................ SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE

"inner JOIN" steps into single query

Most databases have a SQL-like "join" feature. Utilizing it across Traversal Steps can bring a big boost to performance. For example, g.V().hasLabel('foo').out('bar') could be queried as select ... from foo join bar on ...

This is how I thought to implement this:

Analyze Traversal

Use TraversalStrategys to analyze possible joins:

  • Adjacent steps
  • SelectStep
  • Inner traversal
  • Aggregations
  • MatchStep
  • ...

SearchQuery will gain a method, SearchVertexQuery[] getNextQueries(), that returns the possible joins found. A Controller can call it recursively to discover further join possibilities.

Validate Join

The Controller should use the Schemas to validate the 'legitimacy' of joining the query with any of the "next queries". Things to check:

  • Is the "next query" handled only by this Controller (i.e. is all the data in the same DB)?
  • If not, is a full copy available? (We can duplicate often-joined data across our different databases, and mark it as such in the schema configuration.)
  • Skip VirtualVertex instances and join with the next query.

Query

Each controller implements the join query in a different way:

The Controller should return each result and its corresponding "future steps" results as a set, with each result in the set associated with its relevant stepId.
The issuing Step will create a Traverser from its relevant result, and add the rest as "Traverser side effects", enabling the future steps to access the results when they are called.

Elastic Edges Example

I was digging through the code trying to find an example of what to do when my documents in Elasticsearch represent both nodes and edges. For instance, I have documents that look like the following. How would I represent that using Unipop? User A and User B are nodes, and a case could be made for the message also being a node, with the edge containing the timestamp.

{
  "userA": "000000001",
  "timestamp": "2015-01-01T05:14:22",
  "message": "Writing on your wall",
  "userB": "000000002"
}

From what I can tell, I would use an ElasticEdgeController, but it's not entirely clear how to use it to build my graph and run Gremlin queries across it. Can you provide an example of how to do that?

Dependency Management - Collections Library

Currently we use multiple collections libraries, including:

  • Apache Collections
  • Google Collections
  • Java Stream API

All of them lack features that exist in other languages, such as Scala and the .NET family.

One solution is to use the Seq API from jOOQ/jOOL. It:

  • is much easier to use than StreamSupport and the Stream API
  • has more features
  • is very readable
  • encapsulates the API we need from both the Apache/Google libraries and the Stream API.

From personal experience, the jOOL API fits our needs and gives us greater flexibility.

Usage

I wondered how I can actually use this library to run a simple gremlin graph traversal. ElasticGraphProvider looks promising, but that doesn't actually get shipped in your jar, as it is in test code...

JDBC - MultiRow schema

Implement a schema that can consume multiple rows and treat them as a single vertex, with each row representing an edge to that vertex.
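One way to picture the multi-row idea is a grouping pass over the result set: rows that share a vertex id collapse into one vertex, and each row contributes one edge. The sketch below works on plain maps rather than a real JDBC ResultSet, and the class name `MultiRowGrouping` and the column names passed in are hypothetical; it is not Unipop's schema implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: rows are plain maps standing in for JDBC result rows.
final class MultiRowGrouping {
    private MultiRowGrouping() {}

    /** Maps each vertex id to the edge targets found in its rows, in row order. */
    static Map<Object, List<Object>> edgesByVertex(List<Map<String, Object>> rows,
                                                   String vertexIdColumn,
                                                   String edgeTargetColumn) {
        Map<Object, List<Object>> grouped = new LinkedHashMap<>();
        for (Map<String, Object> row : rows) {
            // One vertex per distinct id; every row adds one edge to it.
            grouped.computeIfAbsent(row.get(vertexIdColumn), id -> new ArrayList<>())
                   .add(row.get(edgeTargetColumn));
        }
        return grouped;
    }
}
```

A real schema would also have to carry the vertex and edge properties from each row, which this sketch leaves out.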

Connection terminated

If the connection to the databases terminates for any reason, Unipop won't reconnect and thus won't work.

Parallelism

Add parallel execution when issuing UniQueries.

Possible Parallelism points:

  • Controller Parallelism
  • Bulk Parallelism
  • Schema Parallelism
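To make the "Controller Parallelism" point concrete, here is a minimal sketch that runs one query per controller concurrently and merges the results. Each controller is modelled as a plain `Supplier`; the class name `ParallelQueries` is hypothetical, and the use of the common ForkJoinPool (the default for `CompletableFuture.supplyAsync`) is an assumption rather than a recommendation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Hypothetical sketch: each Supplier stands in for one Controller's query.
final class ParallelQueries {
    private ParallelQueries() {}

    static <R> List<R> runAll(List<Supplier<List<R>>> controllerQueries) {
        // Start every query before waiting on any of them.
        List<CompletableFuture<List<R>>> futures = new ArrayList<>();
        for (Supplier<List<R>> query : controllerQueries) {
            futures.add(CompletableFuture.supplyAsync(query));
        }
        // Collect results in submission order; join() rethrows query failures.
        List<R> results = new ArrayList<>();
        for (CompletableFuture<List<R>> future : futures) {
            results.addAll(future.join());
        }
        return results;
    }
}
```

Because results are merged in submission order, the output stays deterministic even though the queries themselves may finish in any order.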

Elasticsearch Scroll API

Use the Elasticsearch Scroll API to iterate over large result sets in the StartStep.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html

According to the docs, "Scrolling is not intended for real time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of one index into a new index with a different configuration". Should we use Scroll in Unipop? If so, should we always use it, or only in the StartStep (assuming it usually iterates over a large number of results)?

The scroll functionality is already implemented in QueryIterator, but it's currently unused.

SourceProvider should receive standard `Configuration`

Currently SourceProviders, ElementSchemas, and PropertySchemas receive a JSONObject for initialization. The ControllerManager should instead provide them with a standard org.apache.commons.configuration Configuration, so that other sources of configuration can be used.

Add Integration Tests

Run JDBC and Elasticsearch together as a whole, and add tests that span both.

This is part of the unipop-test module.

How to deploy a gremlin server with Unipop

I am working on a project that requires graph queries on Elasticsearch, and I found Unipop, which fits our use case perfectly. Well done!
However, I found few clues about how to set up a Gremlin server with Unipop. unipop-elastic2 is only half developed, and I could not even package unipop-elastic successfully.
It would be very helpful if you could give any advice. I am really looking forward to the release of a stable version.

Optimize property fetching

Current querying behavior

  • Vertex - fetch all properties + any "inner" edges and their properties.
  • Edge - fetch all properties + both vertices and their properties.
    • If a vertex's schema is of type 'ref', its properties will only be fetched when it passes through a UniGraphVertexPropertiesSideEffectStep, a step that comes before any step that uses properties and issues a DeferredVertexQuery. This ensures that vertices are only queried if and when they are needed (i.e. lazy loading).

Problems

  1. When an Element is queried, all its properties are fetched, whether or not they are needed by this traversal.
  2. When a Vertex is queried, all its "inner" edges are fetched, whether or not they are needed by this traversal.

This issue tries to solve problem 1. We should probably create another ticket for solving problem 2 in the future.

Solution suggestion

SearchQuery/SearchVertexQuery/DeferredVertexQuery should pass a list of property keys needed from the queried element. UniGraphPropertiesStepStrategy should provide the property lists to the querying steps by analyzing the traversal. Scenarios:

  1. No step in the traversal needs any property - empty list.
  2. Step(s) in the traversal need specific property(s) - property list.
  3. Step(s) in the traversal iterate over all properties - null list.
  4. Unknown (the strategy couldn't identify which properties are needed) - null list. ???

Next, when a Controller receives these queries, it should fetch only the relevant properties, or not issue a query at all when possible.
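The scenarios above could be interpreted by a Controller roughly as follows. This is only a sketch under stated assumptions: the class name `PropertySelection` and both methods are hypothetical, a null key set stands for "all properties or unknown", and skipping the query on an empty set assumes the caller already knows the element's id and label.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch of how a Controller might read the requested property keys.
final class PropertySelection {
    private PropertySelection() {}

    /** null keys: fetch every schema field; otherwise fetch only requested fields the schema knows. */
    static Set<String> fieldsToFetch(Set<String> requestedKeys, Set<String> schemaFields) {
        if (requestedKeys == null) {
            return new LinkedHashSet<>(schemaFields);
        }
        Set<String> fields = new LinkedHashSet<>(requestedKeys);
        fields.retainAll(schemaFields);
        return fields;
    }

    /** An empty (non-null) key set means no step needs properties, so the fetch may be skipped. */
    static boolean canSkipPropertyFetch(Set<String> requestedKeys) {
        return requestedKeys != null && requestedKeys.isEmpty();
    }
}
```

Treating the "unknown" scenario the same as "all properties" is the safe default here, since fetching too much only costs performance while fetching too little would break the traversal.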

Json configuration-array support

Make sure that json fields that support arrays also support a single value without an array. E.g:
"foo": "bar" == "foo": ["bar"]
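A small normalization helper is one way to get this equivalence: whatever the parser returns for a field, it is turned into a list before use. The sketch below is library-agnostic and works on already-parsed values; the class name `ConfigValues` is hypothetical, and treating a missing field as an empty list is an assumption.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: normalizes a parsed JSON field value into a list.
final class ConfigValues {
    private ConfigValues() {}

    static List<Object> asList(Object value) {
        if (value == null) {
            return Collections.emptyList();          // missing field: no values (assumption)
        }
        if (value instanceof Collection) {
            return new ArrayList<>((Collection<?>) value); // already an array: copy as-is
        }
        return Collections.singletonList(value);     // single value: wrap it
    }
}
```

With this in place, "foo": "bar" and "foo": ["bar"] both yield a one-element list containing "bar".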

Use JEST library instead of ES native java client

JEST is an Elasticsearch Java client that communicates with ES through REST. This (hopefully) means that we can use one client for all versions, thus removing the separation (and code duplication) between unipop-elastic and unipop-elastic2.

Theoretically, it should also make Unipop consume less memory.

Grouping TraversalStrategy

Enable Controllers to implement optimized "group by" and "group count" functions.
This needs to be re-implemented following the changes in #44.

JDBC - smart table union filter.

To improve performance when querying SQL databases, we can filter tables out of the union by checking which tables conform to the PredicatesHolder and its 'must-have' fields. Any table that does not have those fields can be dropped from the query.
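The filter itself can be as simple as a column-set containment check. In the sketch below, the table metadata is a plain map from table name to column names and the must-have fields are a plain set; the class name `TableUnionFilter` is hypothetical, and extracting the must-have fields from a PredicatesHolder is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: keeps only the tables that contain every must-have field.
final class TableUnionFilter {
    private TableUnionFilter() {}

    static List<String> tablesToQuery(Map<String, Set<String>> columnsByTable,
                                      Set<String> mustHaveFields) {
        List<String> tables = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : columnsByTable.entrySet()) {
            // A table missing any must-have field can never match, so it is left out of the union.
            if (entry.getValue().containsAll(mustHaveFields)) {
                tables.add(entry.getKey());
            }
        }
        return tables;
    }
}
```

An empty must-have set keeps every table, which matches the unfiltered behaviour.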

Dynamic UniFeatures

UniFeatures are part of the graph object, which makes it impossible for Controllers to have different features.

We need a solution that reports features dynamically, based on what is being executed at the moment and which Controller is handling it.

Suggestions are welcome

Reducing TraversalStrategy

Enable Controllers to implement optimized reducing functions: count, sum, average, min, max, etc.
This needs to be re-implemented following the changes in #44.

Major Refactor

Adding this in retrospect for documentation's sake:

A much needed refactor to split Unipop's code to different components:

  • Structure - implementation of default Tinkerpop model classes that issue UniQuerys to Controllers
  • Process - implementation of Tinkerpop Strategies that issue UniQuerys to Controllers, and add Unipop-specific optimizations.
  • UniQuery - a set of APIs Controllers can implement.
  • Controller - a component responsible for executing the different UniQuerys.
  • Schema - a set of helper classes meant to ease schema management for Controllers, and standardize schema mappings.

Cardinality Strategy

Most database optimizers use statistics-based cardinality estimates to determine the optimal order in which to run a query's steps. Should we do something similar?

We could implement a TraversalStrategy that rearranges the Traversal's steps according to their estimated cardinality. Each Controller would provide the necessary information by utilizing its database's capabilities, e.g. Oracle Statistics.

Tinkerpop has something similar in its MatchStep, except it does a run-time statistics calculation for every traversal.

  1. Should we use MatchStep's MatchAlgorithm to implement this feature?
  2. Why is this only implemented for MatchStep? Can't the same logic be applied for all steps in the traversal?

OLAP GraphComputer

Theoretically, we should be able to run distributed Unipop queries on Spark or a similar engine. Some of Unipop's data sources even have Hadoop integration (e.g. Elasticsearch RDD, Jdbc RDD, etc.).
Utilizing a UniGraph's schema configuration, this feature should give Unipop's users a transparent, zero-configuration way to execute distributed queries over their data.

Questions

  1. Should we implement Tinkerpop's GraphComputer? What does that entail?
  2. Can we utilize Tinkerpop's HadoopGraph implementation? If so, how?
  3. Can we utilize Tinkerpop's SparkGraphComputer (is that the name)? If so, how?
