unipop-graph / unipop

Data Integration Graph

License: Apache License 2.0

Java 99.19% HTML 0.81%

graph gremlin federation virtualization tinkerpop

unipop's Introduction

This project is in active development. A stable version will be released soon.

Unipop

Analyze data from multiple sources using the power of graphs.

Unipop is a data Federation and Virtualization engine that models your data as a "virtual" graph and exposes a querying API based on the Gremlin graph query language (SQL and SPARQL are also available).

This means you get the benefits of a graph data model without migrating, replicating, or restructuring your data, whether it's stored in an RDBMS, a NoSQL store, or any other data source (see "Customize & Extend" below).

Why Graphs?

Graphs provide a very "natural" way to analyze your data. The simple Vertex/Edge structure makes it very easy to model complex and varied data, and then analyze it by exploring the connections/relationships in it.

This is especially relevant for a data Federation / Virtualization platform, which integrates a large variety of different data sources, structures, and schemas.

Our chosen graph query language is Gremlin, which is part of the Apache TinkerPop framework. Let's compare Gremlin to SQL, the industry standard:

	Schema	Relationships	Flexibility	Usability
SQL	Structured - Tables and their fields need to be explicitly defined.	Joins require knowledge of all relationships (PK/FK), and can become quite complicated.	SQL's syntax requires very specific, rigid structures.	Queries are loosely-typed "free text", often requiring complicated ORMs.
Gremlin	Unstructured - Different structures can be created on the fly.	Connections (i.e. edges) are "first-class citizens", enabling easy exploration of your data.	Queries are written in a pipelined ("functional") syntax, providing considerable flexibility.	Host-language embedding. Queries are easier to read, write, debug, and reuse.

The TinkerPop framework also provides us with other useful features "out of the box":

Traversal Strategies - an extensible query optimization mechanism. Unipop utilizes this to implement different performance optimizations.
Console & Server - production grade tooling.
Language Drivers - JavaScript, TypeScript, PHP, Python, Java, Scala, .NET, Go.
Extensible Query Languages - Gremlin, SQL, SPARQL
DSL support
Testing Framework

Getting started

TBD

Setup

Console - a local instance with an interactive Shell for issuing queries.
Server - a web server with WebSocket & HTTP APIs.
Embedded - run Unipop inside any JVM based application.

Configure

Add your data sources to Unipop's configuration. Configuring a source entails mapping its schema to a "property graph" model (i.e. vertices & edges). Unipop is built in an extensible way, enabling many different mapping options.
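To make the mapping idea concrete, here is a minimal sketch in plain Java of turning one relational row into a vertex and an edge. It is an illustration only, not Unipop's configuration format or API: `SchemaMappingSketch`, `Vertex`, `Edge`, `toVertex`, `toEdge`, and the `person` / `company_id` names are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: none of these types are Unipop classes.
final class SchemaMappingSketch {

    // A vertex in the property-graph model: id, label, and key/value properties.
    static final class Vertex {
        final Object id;
        final String label;
        final Map<String, Object> properties;

        Vertex(Object id, String label, Map<String, Object> properties) {
            this.id = id;
            this.label = label;
            this.properties = properties;
        }

        @Override
        public String toString() {
            return "v[" + id + "](" + label + ") " + properties;
        }
    }

    // A directed edge between two vertex ids.
    static final class Edge {
        final String label;
        final Object outVertexId;
        final Object inVertexId;

        Edge(String label, Object outVertexId, Object inVertexId) {
            this.label = label;
            this.outVertexId = outVertexId;
            this.inVertexId = inVertexId;
        }

        @Override
        public String toString() {
            return outVertexId + "-" + label + "->" + inVertexId;
        }
    }

    // The id column becomes the vertex id; every other column becomes a property.
    static Vertex toVertex(String label, String idColumn, Map<String, Object> row) {
        Map<String, Object> props = new LinkedHashMap<>(row);
        Object id = props.remove(idColumn);
        return new Vertex(id, label, props);
    }

    // A foreign-key column becomes an edge from this row's vertex to the referenced one.
    static Edge toEdge(String label, String idColumn, String fkColumn, Map<String, Object> row) {
        return new Edge(label, row.get(idColumn), row.get(fkColumn));
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", 1);
        row.put("name", "marko");
        row.put("company_id", 10);
        System.out.println(toVertex("person", "id", row));
        System.out.println(toEdge("worksAt", "id", "company_id", row));
    }
}
```

The source data stays where it is; only the view of it changes. The same row yields both a vertex (from its own columns) and an edge (from its foreign key).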

Query

Console, Server, Embedded, or Language drivers

Customize & Extend

TBD

How it works

TBD

Technical details.

Contributing

TBD

unipop's Issues

Bulk query Repeat/Union/Coalesce/Where Steps

Bulk mutations

Solution suggestion

Optimize the Mutation steps so that each mutation UniQuery carries a batch of Elements rather than a single one, enabling the Controllers to issue bulk commands to the DB.
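A minimal sketch of the batching idea, in plain Java. It is an illustration under assumptions, not Unipop code: `BulkMutationSketch`, `BulkController`, `addBulk`, and `mutateInBulk` are hypothetical names, and the batch size is an arbitrary parameter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: these are not Unipop's controller classes.
final class BulkMutationSketch {

    // Stand-in for a controller that can send many elements to the DB in one command.
    interface BulkController<E> {
        void addBulk(List<E> elements);
    }

    // Splits the elements into batches of at most batchSize and issues one
    // bulk call per batch. Returns the number of bulk calls made.
    static <E> int mutateInBulk(List<E> elements, int batchSize, BulkController<E> controller) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int calls = 0;
        for (int from = 0; from < elements.size(); from += batchSize) {
            int to = Math.min(from + batchSize, elements.size());
            // Copy the sub-list so the controller gets an independent batch.
            controller.addBulk(new ArrayList<>(elements.subList(from, to)));
            calls++;
        }
        return calls;
    }

    public static void main(String[] args) {
        List<String> elements = List.of("v1", "v2", "v3", "v4", "v5");
        int calls = mutateInBulk(elements, 2, batch -> System.out.println("bulk insert " + batch));
        System.out.println(calls + " bulk calls");
    }
}
```

With this shape, the number of round trips to the DB depends on the batch size rather than on the number of elements.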

Questions

Should we enable bulk mutations directly from the UniGraph? That would entail extending TinkerPop's current Graph API.
Should changes be automatically committed to the DB after every query, or should we add a commit() method to UniGraph/Traversal?
Maybe using BulkLoaderVertexProgram, or a similar solution, would be a better choice?

Virtual Controller

TBD

Dependency Management - JSON

At the moment, we depend on 4 different JSON libraries (including transitive dependencies).

json-simple
json.org
Gson
Jackson

To reduce the number of dependencies, we should consider removing most of them.

The top candidates to keep are Jackson and Gson: Jackson is heavily used by TinkerPop, and Gson is heavily used by Jest & Hadoop. Further analysis is required to determine which API is easier to use and which performs better.

Tinkerpop
Jest
Hadoop

Either way, this will require some refactoring to remove the other libraries.

"Customizing & Extending Unipop" guide

Customizing Schemas

index by time
multiple rows
multiple buckets

Extending Controllers

Elasticsearch
- RoutedDocumentController
- TemplateController
- AggregationController
- GeoIntersectController
Jdbc
- StoredProcedureController

elastic2 DocumentController refactor

jdbc - StoredProcedureController

Dependency convergence error on build

This sounds cool, so I wanted to give unipop-elastic a try.
Is there a jar hosted on some public repository?

When I build it manually I get the following dependency convergence error, which I couldn't fix even by adding a direct dependency on snakeyaml:1.15 to unipop-elastic:

Failed while enforcing releasability the error(s) are [
Dependency convergence error for org.yaml:snakeyaml:1.15 paths to dependency are:
+-unipop:unipop-elastic:0.1
  +-unipop:unipop-core:0.1
    +-org.apache.tinkerpop:gremlin-core:3.0.2-incubating
      +-org.yaml:snakeyaml:1.15
and
+-unipop:unipop-elastic:0.1
  +-unipop:unipop-core:0.1
    +-org.yaml:snakeyaml:1.15
and
+-unipop:unipop-elastic:0.1
  +-org.elasticsearch:elasticsearch:1.7.3
    +-org.yaml:snakeyaml:1.12
]
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Unipop ............................................. SUCCESS [  0.627 s]
[INFO] Unipop :: Core ..................................... SUCCESS [  0.762 s]
[INFO] Unipop :: Elasticsearch Controllers ................ FAILURE [  0.709 s]
[INFO] Unipop :: JDBC Controllers ......................... SKIPPED
[INFO] Unipop :: Integration tests ........................ SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE

"inner JOIN" steps into single query

Most databases have an SQL-like "join" feature. Utilizing it across Traversal Steps can give a big performance boost, e.g. g.V().hasLabel('foo').out('bar') could be queried as select ... from foo join bar on ...

This is how I thought to implement this:

Analyze Traversal

Use TraversalStrategy implementations to analyze possible joins:

Adjacent steps
SelectStep
Inner traversal
Aggregations
MatchStep
...

A method SearchVertexQuery[] getNextQueries() will be added to SearchQuery, returning the possible joins that were found. A controller can call it recursively to discover further join possibilities.
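A minimal sketch of how a controller might walk such follow-up queries recursively and fold them into one SQL statement. It is an illustration under assumptions, not the proposed implementation: `JoinFoldingSketch`, `Query`, `then`, and `toSql` are hypothetical stand-ins for the real query classes, the `id` and `foo_id` columns are invented, and a `List` replaces the `SearchVertexQuery[]` array for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: these are not Unipop's SearchQuery classes.
final class JoinFoldingSketch {

    // A simplified query node: the table it reads, plus the follow-up
    // queries that a strategy decided could be joined onto it.
    static final class Query {
        final String table;
        final String joinColumn; // assumed column referencing the parent's id; unused on the root
        private final List<Query> nextQueries = new ArrayList<>();

        Query(String table, String joinColumn) {
            this.table = table;
            this.joinColumn = joinColumn;
        }

        // Registers a follow-up query and returns this node for chaining.
        Query then(Query next) {
            nextQueries.add(next);
            return this;
        }

        List<Query> getNextQueries() {
            return nextQueries;
        }
    }

    // Builds a single statement for the root query and all of its follow-ups.
    static String toSql(Query root) {
        StringBuilder sql = new StringBuilder("SELECT * FROM ").append(root.table);
        appendJoins(root, sql);
        return sql.toString();
    }

    // Recursively emits one JOIN clause per follow-up query, depth first.
    private static void appendJoins(Query parent, StringBuilder sql) {
        for (Query next : parent.getNextQueries()) {
            sql.append(" JOIN ").append(next.table)
               .append(" ON ").append(parent.table).append(".id = ")
               .append(next.table).append('.').append(next.joinColumn);
            appendJoins(next, sql);
        }
    }

    public static void main(String[] args) {
        // Roughly the shape of g.V().hasLabel('foo').out('bar').
        Query foo = new Query("foo", null).then(new Query("bar", "foo_id"));
        System.out.println(toSql(foo));
    }
}
```

A controller that cannot execute a given join could simply ignore the follow-up queries and let the next step run as a separate query.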