google / badwolf
Temporal graph store abstraction layer.
License: Apache License 2.0
Build a test corpus to validate BQL behavior. This workbench also needs to support multiple backends.
Is there any Wikidata example available?
If not, what would be the rough steps to use BadWolf with Wikidata?
create graph ?world;
insert data into ?world {
/room<Hallway> "connects_to"@[] /room<Kitchen>.
/room<Kitchen> "connects_to"@[] /room<Hallway>.
/room<Kitchen> "connects_to"@[] /room<Bathroom>.
/room<Kitchen> "connects_to"@[] /room<Bedroom>.
/room<Bathroom> "connects_to"@[] /room<Kitchen>.
/room<Bedroom> "connects_to"@[] /room<Kitchen>.
/room<Bedroom> "connects_to"@[] /room<Fire Escape>.
/room<Fire Escape> "connects_to"@[] /room<Kitchen>.
/item/book<000> "in"@[2016-04-10T4:21:00.000000000Z] /room<Hallway>.
/item/book<000> "in"@[2016-04-10T4:23:00.000000000Z] /room<Kitchen>.
/item/book<000> "in"@[2016-04-10T4:25:00.000000000Z] /room<Bedroom>
};
select ?item, ?t from ?world where {
?item "in"@[?t] /room<Bedroom>
};
drop graph ?world;
results in an infinite loop at
Processing statement (3/4):
select ?item, ?t from ?world where { ?item "in"@[?t] /room<Bedroom> };
The Boolean expression evaluator is required to implement the HAVING clause described in issue #19.
The internal representation of time can lead to differences when comparing equal instants. Also, since the text version is used for serialization, it can affect the stability of GUIDs for triples.
Add a method to retrieve all available graph names from the store.
Implement binding projection for the resulting Table.
The idea of using an immutable graph store in conjunction with event sourcing makes a lot of sense for an upcoming project; however, it looks like BadWolf has been abandoned. Is this the case?
This is not the expected result. It seems like the merged table is the product of a bad merge.
Welcome to BadWolf vCli (0.4.2-dev)
Using driver "VOLATILE". Type quit; to exit
Session started at 2016-05-17 12:46:28.098374381 -0700 PDT
bql> create graph ?family;
[OK]
bql> load /tmp/family.txt ?family;
Successfully processed 6 lines from file "/tmp/family.txt".
Triples loaded into graphs:
- ?family
bql> select ?grandparent from ?family where {?s "parent of"@[] /person<Amy Schumer> . ?grandparent "parent of"@[] ?s};
?grandparent
/person<Gavin Belson>
/person<Gavin Belson>
/person<Mary Belson>
/person<Mary Belson>
[OK]
bql>
The data used to run this command, stored in /tmp/family.txt, is:
/person<Gavin Belson> "born in"@[] /city<Springfield>
/person<Gavin Belson> "parent of"@[] /person<Peter Belson>
/person<Gavin Belson> "parent of"@[] /person<Mary Belson>
/person<Mary Belson> "parent of"@[] /person<Amy Schumer>
/person<Mary Belson> "parent of"@[] /person<Joe Schumer>
According to the BQL overview document, the "as" keyword can be used to return a different name for a variable. However, the keyword causes an error when the program runs; it only works when used with an aggregation.
When running this program:
# Create a graph.
CREATE GRAPH ?family;
# Insert some data into the graph.
INSERT DATA INTO ?family {
/u<joe> "parent_of"@[] /u<mary> .
/u<joe> "parent_of"@[] /u<peter> .
/u<peter> "parent_of"@[] /u<john> .
/u<peter> "parent_of"@[] /u<eve>
};
# Find all Joe's offspring names.
# Works fine without "as" keyword.
SELECT ?name
FROM ?family
WHERE {
/u<joe> "parent_of"@[] ?offspring ID ?name
};
# Find all Joe's offspring names.
# Fails with "as" keyword.
SELECT ?name as ?n
FROM ?family
WHERE {
/u<joe> "parent_of"@[] ?offspring ID ?name
};
# Count offspring.
# Works with "as" keyword.
SELECT ?parent_name, count(?name) as ?n
FROM ?family
WHERE {
?parent ID ?parent_name "parent_of"@[] ?offspring ID ?name
}
GROUP BY ?parent_name;
# Drop the graph.
DROP GRAPH ?family;
The output is:
Processing file bug.bql
Processing statement (1/6):
CREATE GRAPH ?family;
Result:
OK
Processing statement (2/6):
INSERT DATA INTO ?family { /u<joe> "parent_of"@[] /u<mary> . /u<joe> "parent_of"@[] /u<peter> . /u<peter> "parent_of"@[] /u<john> . /u<peter> "parent_of"@[] /u<eve> };
Result:
OK
Processing statement (3/6):
SELECT ?name FROM ?family WHERE { /u<joe> "parent_of"@[] ?offspring ID ?name };
Result:
?name
mary
peter
OK
Processing statement (4/6):
SELECT ?name as ?n FROM ?family WHERE { /u<joe> "parent_of"@[] ?offspring ID ?name };
[FAIL] [ERROR] Failed to execute BQL statement with error cannot project against unknow binding ?n; known bindinds are [?offspring ?name]
Processing statement (5/6):
SELECT ?parent_name, count(?name) as ?n FROM ?family WHERE { ?parent ID ?parent_name "parent_of"@[] ?offspring ID ?name } GROUP BY ?parent_name;
Result:
?parent_name ?n
joe "2"^^type:int64
peter "2"^^type:int64
OK
Processing statement (6/6):
DROP GRAPH ?family;
Result:
OK
Implement the result returning LIMIT clause.
Revisit the covariant definition after the comment on https://news.ycombinator.com/reply?id=10432492.
$ bw --driver=VOLATILE bql
...
bql> CREATE GRAPH ?foo;
[OK]
bql> INSERT DATA INTO ?foo {
/u<joe> "parent_of"@[2016-12-12T15:00Z] /u<julia>
};
[ERROR] failed to parse BQL statement with error predicate.Parse failed to parse time anchor
2016-12-12T15:00Z in "parent_of"@[2016-12-12T15:00Z] with error parsing time
"2016-12-12T15:00Z" as "2006-01-02T15:04:05.999999999Z07:00": cannot parse "Z" as ":"
bql> INSERT DATA INTO ?foo { /u<joe> "parent_of"@[] /u<julia> };
[ERROR] failed to parse BQL statement with error hook.DataAccumulator requires a predicate to
create a predicate, got &{NODE /u<joe> } instead
# This second error is spurious, but sticky. Only quitting and restarting bw seems to allow data to be
# inserted. If I enter the same sequence but with an acceptable timestamp, the second error does not
# occur.
bql> INSERT DATA INTO ?foo { /u<joe> "parent_of"@[] /u<fred> };
[ERROR] failed to parse BQL statement with error hook.DataAccumulator requires a predicate to create a predicate, got &{NODE /u<joe> } instead
bql> quit;
Thanks for all those BQL queries!
$ bw --driver=VOLATILE bql
Welcome to BadWolf vCli (0.5.1-dev @141940248)
Using driver "VOLATILE". Type quit; to exit
Session started at 2016-12-21 13:50:07.001652809 -0500 EST
bql> CREATE GRAPH ?foo;
[OK]
bql> INSERT DATA INTO ?foo {
/u<joe> "parent_of"@[2016-12-12T15:00:00Z] /u<julia>
};
[OK]
bql> INSERT DATA INTO ?foo { /u<joe> "parent_of"@[] /u<julia> };
[OK]
bql> INSERT DATA INTO ?foo { /u<joe> "parent_of"@[] /u<fred> };
[OK]
Excuse the probably very naive question. I think I have a working BadWolf instance, which I obtained by
go get golang.org/x/net/context
go get github.com/peterh/liner
go get github.com/google/badwolf/...
(Is this the right way? How to install is not mentioned anywhere.)
In any case, I am able to use the bw tool and follow the examples, use bw bql to get a REPL, and so on.
The question is: how do I leave a long-running instance of BadWolf? Even assuming I want to keep the data in RAM (persistence is not a priority right now, even though I see there are persistent backends), each time I run bw an entirely new instance of BadWolf is created and apparently destroyed.
I assume there must be some way to leave BadWolf running in the background and keep querying the existing graphs (even using the bw tool, preferably with some kind of driver/network interface), but I could not find any information on this.
For instance, it is not clear to me how to use the bw export command: by the time I run a new bw process, everything from the previous runs is lost, hence there is nothing to export. Similarly, I can run bw load, but then the data is lost as soon as the command returns. I am sure I am missing something obvious and fundamental here.
Hey! I'm the maintainer of the other Google graph project, https://github.com/google/cayley
I know I've been out of the Google-sphere for a year now, but I've still been contributing to Cayley and it's been used in a number of projects and a few production instances. In short, it works pretty well and I'd like to grow it more.
I've read over your repo (you've written it in Go too, good choice ;) ). For storage, you've got 99% the same primitives as Cayley: triples, or rather quads (went down that road, believe me), a memory store, indices. Your methods for things like "TriplesForPredicateAndObject" are pretty standard Cayley iterators.
You're doing some nice things with regard to RDF literals I'd be excited to add to Cayley.
It seems like most of your novel work is in BQL. I'm just reading up on it now, so I haven't quite gotten the full notion of what makes this a new and interesting query language (would love to discuss), but even proposed as a black box, I'd be happy to add it as a query language in Cayley.
I've always been more of a storage guy, so if your interest is in inference and query languages, that works great. The advantages you'd get would be all sorts of backends, various optimizations on the iterators, while still being able to push forward with your temporal graph idea. Everybody wins. What do you think?
In bql/semantic/semantic_test.go, there is a typo in a test name that causes it to be ignored:
func TesIsEmptyClause(t *testing.T) {
	testTable := []struct {
		in  *GraphClause
		out bool
	}{
		{
			in:  &GraphClause{},
			out: true,
		},
		{
			in:  &GraphClause{SBinding: "?foo"},
			out: true,
		},
	}
	for _, entry := range testTable {
		if got, want := entry.in.IsEmpty(), entry.out; got != want {
			t.Errorf("IsEmpty for %v returned %v, but should have returned %v", entry.in, got, want)
		}
	}
}
After changing the name to the proper TestIsEmptyClause and running the tests, the test fails:
--- FAIL: TestIsEmptyClause (0.00s)
semantic_test.go:148: IsEmpty for &{<nil> ?foo <nil> <nil> <nil> false <nil> <nil> <nil> false} returned false, but should have returned true
FAIL
FAIL github.com/google/badwolf/bql/semantic 0.028s
Add the collection of bindings and directions to the Statement, enforce validation, and extend the query planner to use the table sort functionality.
Plumb context.Context (https://godoc.org/golang.org/x/net/context) to all methods on the storage interfaces and fix the volatile memory driver accordingly.
Extend the Table functionality to allow arbitrary row filtering with the provided filtering functions. This is required to solve issue #19.
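As a rough sketch of the requested extension (all names here are illustrative, not the actual BadWolf Table API), a HAVING clause such as ?gender == /gender&lt;male&gt; could be compiled into a predicate function and applied row by row:

```go
package main

import "fmt"

// Row maps a binding name to its value; Table holds the result rows.
// Both are hypothetical stand-ins for the real Table types.
type Row map[string]string

type Table struct {
	rows []Row
}

// Filter keeps only the rows for which keep returns true.
func (t *Table) Filter(keep func(Row) bool) {
	out := t.rows[:0]
	for _, r := range t.rows {
		if keep(r) {
			out = append(out, r)
		}
	}
	t.rows = out
}

func main() {
	tbl := &Table{rows: []Row{
		{"?gender": "/gender<male>"},
		{"?gender": "/gender<female>"},
	}}
	// A HAVING clause like `?gender == /gender<male>` becomes a predicate.
	tbl.Filter(func(r Row) bool { return r["?gender"] == "/gender<male>" })
	fmt.Println(len(tbl.rows)) // 1
}
```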
You can create bindings for TYPE and ID for nodes. For predicates you have ID for the id, and AT for extracting the id and time anchor. The documentation should reflect these.
Right now there is no insight into how SELECT statements behave. This feature would help gain it.
Implementing #45 will require extending the planner to be able to create and insert the new facts based on the retrieved data.
The test workbench built in issue #25 should be available to run via the command line tool.
In preparation for 2017, besides working on extending BQL (see issues #45, #46, #47, and #48), we are planning to start exploring support for graph structural query operations. Some examples we could focus on include:
At this point we are considering the list above more or less in the order we would approach them. Is there any other operation you would need to get added? Do you have a pressing operation that would simplify your usage?
Construct queries allow creating new facts to be added to graphs. The facts are defined based on the bindings provided in the WHERE clause. Basic filtering capabilities are provided by adding a HAVING clause.
A simple example adding new facts based on the current ones:
CONSTRUCT {
?p "grandmother of"@[] ?g .
?g "grandchild of"@[] ?p
}
INTO ?graph1, ?graph2
FROM ?graph3, ?graph4
WHERE {
?p "parent of"@[] ?parent .
?parent "parent of"@[] ?g .
?p "gender"@[] ?gender
}
HAVING ?gender == /gender<male>;
It is worth mentioning that the above query could be simplified as shown below. Nevertheless, the goal was to show the full structure of a CONSTRUCT query.
CONSTRUCT {
?p "grandmother of"@[] ?g .
?g "grandchild of"@[] ?p
}
INTO ?graph1, ?graph2
FROM ?graph3, ?graph4
WHERE {
?p "parent of"@[] ?parent .
?parent "parent of"@[] ?g .
?p "gender"@[] /gender<male>
};
Subjects are allowed to specify _ instead of a WHERE clause binding. This will inject a new blank node.
As a first step towards unifying efforts with http://github.com/google/cayley we are going to target creating a driver implementation against Cayley together with @barakmich.
Implementing #45 requires modifying the grammar to accept the new CONSTRUCT statement.
How does one use type covariance in a query? The documentation doesn't cover it.
The check needs validation that all bindings outside the GROUP BY ones have aggregation functions.
https://github.com/google/badwolf/blob/master/bql/grammar/grammar.go#L612
It does not contain any of the elements that form an object, making the comparison limited in flexibility.
Do one last pass over the initial conformance tests and, if everything checks out, cut the first release.
Tables generated via graph clauses need to be projected via the SELECT variables.
Héllo,
First and foremost thanks for sharing this project! This is very interesting!
I am a database modeling enthusiast. I created a database in Python called AjguDB, which is a graphdb on top of EAV (on top of wiredtiger, an ordered key-value store similar to boltdb). I did a similar project in Scheme which can be queried using miniKanren (a logic language embedded in Scheme). My inspiration is mostly the Datomic database, even if I skipped the immutable part.
I used to think that EAV was a triplestore; I am reconsidering that. It seems like the EAV model is less generic than the triplestore model. My understanding is that both are good at modeling sparse matrices / multidimensional data, but EAV is really good at representing documents whereas triplestores are good at representing triples (or facts). One might say that a document is a set of triples. But in the EAV model you don't have control over the entity; it's randomly generated. At the end of the day, I think a triplestore is just EAV where E is not a unique identifier. WDYT?
Is it possible to adapt Gremlin to work on quads?
I am surprised that there is no mention of geographical data in some way. Is it something you plan to add?
How do you cope with immutability during querying? Here is a practical example: say there is a triple that says «there are a hundred people in a town in 2017». Now it's 2018; do I need to create a new triple or update the old triple? Do triples have a history? It seems to me that a database in BadWolf must be kept clean: you cannot fix typos or it will kludge the results.
Can you recommend me stuff to read about BadWolf?
I will dive into boltdb drivers.
Use the Table grouping implemented in issue #17 to provide a functional GROUP BY clause.
create graph ?world;
insert data into ?world {
/room<000> "named"@[] "Hallway"^^type:text.
/room<000> "connects_to"@[] /room<001>
};
fails with:
[FAIL] [ERROR] Failed to parse BQL statement with error Parser.parse: Failed to consume symbol INSERT_OBJECT, with error Parser.consume: could not consume token &{ERROR "Hallway" [lexer:0:57] predicates require time anchor information; missing "@[} in production INSERT_OBJECT
To improve execution debugging and performance improvements, we should add a simple tracing mechanism to see detailed traces of query execution.
In commit 9845651, DeleteRow introduced non-deterministic behavior. This is a bit strange and should be investigated further.
Deconstruct queries allow removing derived facts from graphs. The facts are defined based on the bindings provided in the WHERE clause. Basic filtering capabilities are provided by adding a HAVING clause. This is the complementary statement for CONSTRUCT, introduced in issue #45.
DECONSTRUCT {
?p "grandmother of"@[] ?g .
?g "grandchild of"@[] ?p
}
AT ?graph1, ?graph2
FROM ?graph3, ?graph4
WHERE {
?p "parent of"@[] ?parent .
?parent "parent of"@[] ?g .
?p "gender"@[] ?gender
}
HAVING ?gender == /gender<male>;
It is worth mentioning that the above query could be simplified as shown below. Nevertheless, the goal was to show the full structure of a DECONSTRUCT query.
DECONSTRUCT {
?p "grandmother of"@[] ?g .
?g "grandchild of"@[] ?p
}
AT ?graph1, ?graph2
FROM ?graph3, ?graph4
WHERE {
?p "parent of"@[] ?parent .
?parent "parent of"@[] ?g .
?p "gender"@[] /gender<male>
};
_ is not allowed in DECONSTRUCT clauses.
Extend the Table functionality to allow arbitrary aggregation with the provided aggregation functions.
Implementation of #45 requires updating the Statement to collect the relevant information on how to construct the new triples.
Do another pass over the documentation and compliance stories. Once done, label the latest master commit as RC1 after updating the version number.
Right now I am manually running all tests before commits. We should set up continuous testing for the whole project to run at least all the available unit tests.
Add the collection of anchor bounds and properly compute the intervals, enforce validation, and extend the query planner to properly use the provided bounds.
Given the following data set
/_<c175b457-e6d6-4ce3-8312-674353815720> "_predicate"@[] "/some/immutable/id"@[]
/_<c175b457-e6d6-4ce3-8312-674353815720> "_owner"@[2017-05-23T16:41:12.187373-07:00] /gid<0x9>
/_<c175b457-e6d6-4ce3-8312-674353815720> "_subject"@[] /aid</some/subject/id>
/_<c175b457-e6d6-4ce3-8312-674353815720> "_object"@[] /aid</some/object/id>
/_<cd8bae87-be96-41af-b1a8-27df990c9825> "_object"@[2017-05-23T16:41:12.187373-07:00] /aid</some/object/id>
/_<cd8bae87-be96-41af-b1a8-27df990c9825> "_owner"@[2017-05-23T16:41:12.187373-07:00] /gid<0x6>
/_<cd8bae87-be96-41af-b1a8-27df990c9825> "_predicate"@[2017-05-23T16:41:12.187373-07:00] "/some/temporal/id"@[2017-05-23T16:41:12.187373-07:00]
/_<cd8bae87-be96-41af-b1a8-27df990c9825> "_subject"@[2017-05-23T16:41:12.187373-07:00] /aid</some/subject/id>
/aid</some/subject/id> "/some/temporal/id"@[2017-05-23T16:41:12.187373-07:00] /aid</some/object/id>
/aid</some/subject/id> "/some/immutable/id"@[] /aid</some/object/id>
/aid</some/subject/id> "/some/ownerless_temporal/id"@[2017-05-23T16:41:12.187373-07:00] /aid</some/object/id>
The following query succeeds as expected.
bql> SELECT ?bn,?p, ?o
FROM ?test
WHERE {
?bn "_subject"@[,] /aid</some/subject/id>.
?bn "_predicate"@[,] ?p .
?bn "_object"@[,] ?o
};
?bn ?p ?o
/_<cd8bae87-be96-41af-b1a8-27df990c9825> "/some/temporal/id"@[2017-05-23T16:41:12.187373-07:00] /aid</some/object/id>
[OK] Time spent: 578.963µs
However, when you specify the object for ?o directly, the query fails with a filtering error.
bql> SELECT ?bn,?p
FROM ?test
WHERE {
?bn "_subject"@[,] /aid</some/subject/id>.
?bn "_predicate"@[,] ?p .
?bn "_object"@[,] /aid</some/object/id>
};
[ERROR] planner.Execute: failed to execute insert plan with error failed to fully specify clause { ?bn "_object"@[,] /aid</some/object/id> } for row map[?bn:/_<cd8bae87-be96-41af-b1a8-27df990c9825>]
Time spent: 514.294µs
Given that this is just an update of the query above, it should not have failed and should have returned one row with the ?bn and ?p bindings.
A triple is asserted with an anchor time, but there is no mechanism for unanchoring the triple, that is, invalidating it. One approach I thought about was to have a nil type that denotes the triple has been retracted. Another method could be to implement this in the logic of the storage layer: when a triple is "deleted", it is stored in a retracted set, and when triples are requested, matched triples would be evaluated against the retracted set before being returned. Have you thought about this at all?
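The retracted-set variant described in this thread can be sketched in a few lines of Go (all names are illustrative, not the BadWolf storage API): deletions go into a tombstone set, and lookups filter against it before returning.

```go
package main

import "fmt"

// store keeps asserted triples plus a tombstone set of retracted ones.
type store struct {
	triples   map[string]bool
	retracted map[string]bool
}

func newStore() *store {
	return &store{triples: map[string]bool{}, retracted: map[string]bool{}}
}

func (s *store) Assert(t string)  { s.triples[t] = true }
func (s *store) Retract(t string) { s.retracted[t] = true }

// Triples returns the asserted triples that have not been retracted.
func (s *store) Triples() []string {
	var out []string
	for t := range s.triples {
		if !s.retracted[t] {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	s := newStore()
	s.Assert(`/u<joe> "parent_of"@[] /u<mary>`)
	s.Assert(`/u<joe> "parent_of"@[] /u<peter>`)
	s.Retract(`/u<joe> "parent_of"@[] /u<peter>`)
	fmt.Println(len(s.Triples())) // 1
}
```

The trade-off versus a nil-typed retraction triple is that the tombstone check stays in the storage layer and the data model is untouched, at the cost of every read paying for the extra lookup.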
As part of the effort to enable Cayley (http://github.com/google/cayley) as a backend for BadWolf, its efficient usage would require providing an iterator tree as an output of the BQL parsing for Cayley to pick up. This would enable Cayley to use its optimizer before executing the query. The iterator tree should be translatable to the final Cayley one, as discussed with @barakmich.
Should be the last pre release cut before the stable initial 0.1.0 release.
The CONTRIBUTING.md file has a cryptic sentence at the end:
This commit can be part of your first [Differential][] code review.
This appears to be missing a link and is out of context. What's "Differential"? Is it a component of Phabricator? But then how does it relate to GitHub PRs, which are presumably the way to contribute to this project?
Either there should be a link there to explain where Differential comes in, or this sentence should be removed entirely.
In W3C SPARQL, a collection of triples can be structured in terms of named graphs, and a query expression can refer to these (directly by identifier, or using variables). Have you considered applying this structure for your temporal query facilities? e.g. http://www.w3.org/TR/2013/REC-sparql11-query-20130321/#namedGraphs and nearby.
Add the collection of bindings, conditions, and directions to the Statement, enforce validation, and extend the query planner to use the table filter functionality.