d2rq's Introduction

D2RQ – A Database to RDF Mapper

D2RQ exposes the contents of relational databases as RDF. It consists of:

  • The D2RQ Mapping Language. Use it to write mappings between database tables and RDF vocabularies or OWL ontologies.
  • The D2RQ Engine, a SPARQL-to-SQL rewriter that can evaluate SPARQL queries over your mapped database. It extends ARQ, the query engine that is part of Apache Jena.
  • D2R Server, a web application that provides access to the database via the SPARQL Protocol, as Linked Data, and via a simple HTML interface.

Homepage and Documentation

Learn more about D2RQ at its homepage: http://d2rq.org/

License

Apache License, Version 2.0

http://www.apache.org/licenses/LICENSE-2.0.html

Contact, feedback, discussion

Please use the issue tracker here on GitHub for feature/bug discussion and support requests.

Building from source

Prerequisites

You need a few tools to build D2RQ. Depending on your operating system, they may already be installed.

  • git, for forking the source code repository from GitHub. Run git on the command line to see if it's there.
  • Java JDK v5 or later, for compiling Java sources. Run java -version and javac on the command line to see if they're there.
  • Apache Ant, for building D2RQ. Run ant on the command line to see if it's there.

Getting the source

Get the code by forking the GitHub repository and cloning your fork, or directly clone the main repository:

git clone git@github.com:d2rq/d2rq.git

Doing Ant builds

D2RQ uses Apache Ant as its build system. You can run ant -p from the project's main directory to get an overview of available targets.

To run the D2RQ tools, you need to run at least ant jar.

ant all             Generate distribution files in zip and tar.gz formats
ant clean           Deletes all generated artefacts
ant compile         Compile project classes
ant compile.tests   Compile test classes
ant jar             Generate project jar file
ant javadoc         Generate Javadoc API documentation
ant tar             Generate distribution file in tar.gz format
ant test            Run tests
ant vocab.config    Regenerate Config vocabulary files from Turtle source
ant vocab.d2rq      Regenerate D2RQ vocabulary files from Turtle source
ant war             Generate war archive for deployment in servlet container
ant zip             Generate distribution file in zip format

Running D2RQ

After building with ant jar, you can test-run the various components. Let's assume you have a MySQL database called mydb on your machine.

Generating a default mapping file

./generate-mapping -u root -o mydb.ttl jdbc:mysql:///mydb

This generates a mapping file mydb.ttl for your database.

Dumping the database

./dump-rdf -m mydb.ttl -o dump.nt

This creates dump.nt, a dump containing the mapped RDF in N-Triples format.

Running D2R Server

./d2r-server mydb.ttl

This starts up a server at http://localhost:2020/

Deploying D2R Server into a servlet container

Edit /webapp/WEB-INF/web.xml to point the configFile parameter to the location of your mapping file.

Build a war file with ant war.

Deploy the war file, e.g., by copying it into the servlet container's webapps directory.

Running the unit tests

The unit tests can be executed with ant test.

Some unit tests rely on MySQL being present, and require that two databases are created:

  1. A database called iswc that contains the data from /doc/example/iswc-mysql.sql:

    echo "CREATE DATABASE iswc" | mysql -u root
    mysql -u root iswc < doc/example/iswc-mysql.sql

  2. An empty database called D2RQ_TEST.

d2rq's People

Contributors

aftiqb, arekinath, copierrj, cygri, devwout, hannes, jgeluk, luiseufrasio, petervandenabeele, tayfunhalac, tslifset

d2rq's Issues

Exclude servlet-api.jar from war

The war target in build.xml must exclude servlet-api-*.jar. Currently it just excludes one specific version, which is not the one that's actually present. This causes warnings in Tomcat, and might have worse consequences in other servlet containers.

Joins don't work properly if translateWith is involved

Let's say we have two class maps that are supposed to produce the same URIs, but using slightly different transformations:

  • one uses translateWith, the other doesn't
  • one uses translateWith, the other uses a different translateWith
  • one uses urlify, the other doesn't

SPARQL queries that require joins over these node makers don't work properly. The generated SQL will have joins as if the translateWith or urlify wasn't there.

30-character limit for identifier names in Oracle

Oracle has a 30-character limit for identifiers. The T1_<schema name>_<table name> pattern used by D2RQ can easily result in identifiers that are too long. Just use A1, A2, … if the name is longer than 30 characters? Or even always?
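One option from the note above, sketched with hypothetical names (this is not D2RQ's actual code): keep the descriptive alias when it fits, and fall back to a short A<n> alias whenever it would exceed Oracle's limit.

```java
// Hypothetical sketch: fall back to short A<n> aliases when a generated
// identifier would exceed Oracle's 30-character limit.
public class AliasShortener {
    static final int ORACLE_MAX_IDENTIFIER_LENGTH = 30;
    private int counter = 0;

    /** Returns the alias unchanged if it fits, otherwise a fresh short alias. */
    String shorten(String alias) {
        if (alias.length() <= ORACLE_MAX_IDENTIFIER_LENGTH) {
            return alias;
        }
        return "A" + (++counter);
    }

    public static void main(String[] args) {
        AliasShortener s = new AliasShortener();
        System.out.println(s.shorten("T1_MYSCHEMA_A_VERY_LONG_TABLE_NAME")); // A1
        System.out.println(s.shorten("T1_SHORT"));                           // T1_SHORT
    }
}
```

The "or even always?" variant would simply mean dropping the length check.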

Handling of views in generate-mapping

Views generally don't have a primary key and therefore should be excluded in generate-mapping.

There should be an option --include-views.

This becomes particularly interesting if we implement some way of auto-detecting uniqueness constraints.

Support reasoning/inference

Limited inference can be done by applying the RDFS rules (and potentially other rules) to the D2RQ mapping. This would add additional triple relations to the mapping.

As a first step, the trivial rules that don't rely on schema knowledge could be implemented (?x a rdfs:Class; ?x a rdf:Property). This is already partially done?

As a next step, the engine could be provided with an RDFS/OWL file, and further rules are taken from that file. This would allow implementing subclass, subproperty, equivalent class/property, domain, range, and more. Most if not all of RDFS-Plus could be implemented that way, perhaps with the exception of transitive properties.

Even OWL-QL might be possible.

This implies adding command line and configuration options for enabling inference (should be off by default), and for pointing to the RDFS/OWL file(s). Loading them automatically by dereferencing term IRIs would also be a good option.

FixedNodeMaker cannot be cast to TypedNodeMaker

Query:

SELECT ?x ?y ?z
WHERE { {<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromRatingSite1/Review6712>  rev:reviewer ?x .
     ?x ?y ?z.}
UNION
    {<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromRatingSite1/Review6712>  rev:reviewer ?a.
     ?x ?y ?z.
        FILTER(?z = ?a).} }

This yields the following exception:

de.fuberlin.wiwiss.d2rq.nodes.FixedNodeMaker cannot be cast to de.fuberlin.wiwiss.d2rq.nodes.TypedNodeMaker

Make SPARQL Basic Federated Query extension work

This would be a first step towards addressing a frequent D2R use case.

ARQ implements this, it's a matter of working out and documenting how exactly to enable and invoke it, and see what needs to be done in the query engine.

This gives us opportunities towards making multiple DBs in a single D2R instance work. The default dataset could make no-join queries by running the entire query against one DB after the other. To make queries that involve joins across DBs, one would have to use SERVICE to address the individual DBs (using their d2rq:Database resource IRI?)

Issues with backslashes in bNodeIdColumns

Reportedly, if a column value listed in d2rq:bNodeIdColumns contains a backslash, then no triples will show up. In the specific case, the backslash-containing value was also the object of the generated triple.

Refactor to introduce a mapping API

The map package already contains a pretty decent representation of a D2RQ mapping as Java objects. This should be refactored:

  • should have setters that allow programmatic creation of these objects (currently they are always initialized from an RDF graph)
  • mapping generator should produce these objects rather than directly producing Turtle
  • code for producing the mapping objects from RDF should be separated
  • code for producing TripleRelations and so on from the objects (compile()) should be separated
  • code for validating the mapping objects (currently in validate() and in the RDF-to-object code) should be separated
  • there should be code that writes mapping objects to Turtle
  • should also make things like rdfs:label available – useful for HTML frontend

This would allow many good things:

  • simplify implementation of different generate-mapping algorithms, for example for the W3C Direct Mapping
  • simplify implementation of alternate mapping languages, in particular R2RML
  • allow third-party code to produce mappings using its own algorithms
  • simplify implementation of inference by expanding the mapping objects
  • easier testing of components in isolation

Unsupported database column types

This is a frequent class of interoperability problems in D2R.

The mapping should be made configurable and moved into an external file. Mapping authors should perhaps be able to add additional type mappings to their mapping files.

The error message for unsupported type codes could be slightly more helpful: it should also show the type name of the column.

Also, is it wise to die whenever we don't know the column type? Perhaps we should just cast all unknown types to string, and only complain if that fails.

Examples for things that have been reported not to work:

  • TIMESTAMP in Oracle produces “Unsupported database type code (1111) for column xxxx” (fixed)
  • NVARCHAR2 in Oracle (fixed)
  • ENUM in Postgres (type code 1111)
  • BLOB (partially fixed, see #115)
  • Any binary type should be treated as xsd:hexBinary (fixed)
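A lenient fallback along the lines suggested above could look like this sketch (the category names are hypothetical, not D2RQ's real type mapper): unknown JDBC type codes are treated as text instead of raising an error, and complaints are deferred until a cast actually fails.

```java
import java.sql.Types;

// Hypothetical sketch: map JDBC type codes to coarse categories,
// falling back to "text" for unknown codes instead of throwing.
public class ColumnTypeMapper {
    static String category(int jdbcTypeCode) {
        switch (jdbcTypeCode) {
            case Types.INTEGER:
            case Types.BIGINT:
            case Types.SMALLINT:
                return "numeric";
            case Types.BINARY:
            case Types.VARBINARY:
            case Types.BLOB:
                return "binary";   // would be mapped to xsd:hexBinary
            case Types.VARCHAR:
            case Types.CHAR:
            case Types.CLOB:
                return "text";
            default:
                return "text";     // lenient fallback instead of failing
        }
    }

    public static void main(String[] args) {
        // Types.OTHER is 1111, the code from the reports above
        System.out.println(category(Types.OTHER));
    }
}
```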

Exclude certain tables from generate-mapping

Add --exclude-table and --include-table options that take table names (incl. “*” wildcard) as argument. By default, all tables are included. Only tables that are included and not explicitly excluded will be taken into account.

Later we could use this infrastructure to exclude certain tables by default, such as system tables.

D2R Server deployment issues with Tomcat

Observed on Windows. Happens both when deploying a .war built with the usual instructions, and when deploying just by copying the D2R Server webapp directory into webapps. This is for a non-root webapp.

31-Mar-2010 12:03:08 org.apache.catalina.core.StandardContext addApplicationListener
INFO: The listener "de.fuberlin.wiwiss.d2rs.WebappInitListener" is already configured for this context. The duplicate definition has been ignored.
31-Mar-2010 12:03:09 org.apache.catalina.core.StandardContext start
SEVERE: Error listenerStart
31-Mar-2010 12:03:09 org.apache.catalina.core.StandardContext start
SEVERE: Context [/idac] startup failed due to previous errors

Allow generation of SQL statement without executing it

And allow turning a (virtual) DB result into SPARQL results/triples.

Use cases:

  • debugging
  • unit tests
  • Snorql and d2r-query could support “SQL query” as additional result format (again mostly for debugging)
  • use cases where third-party code must run the SQL query for some reason (no JDBC driver)

Issues with TEXT columns in MySQL

Reportedly, TEXT columns don't work in MySQL.

Christian Brenninkmeijer reports that he fixed it in de.fuberlin.wiwiss.d2rq.sql.ConnectedDB.columnType() like so:

default:
    if ("NVARCHAR2".equals(type.typeName()))
        return TEXT_COLUMN;
    else if ("text".equals(type.typeName()))
        return TEXT_COLUMN;
    else
        throw new D2RQException("Unsupported database type code ("
                + type + ") for column " + column.qualifiedName());

Relative path in metadata template doesn't work

Setting d2r:metadataTemplate to "metadata.n3" results in an error message:

com.hp.hpl.jena.shared.NotFoundException: Not found: /d2rqwebapp/WEB-INF/templates/metadata.n3

The path here seems to come from this property in velocity.properties:

file.resource.loader.path=webapp/WEB-INF/templates

A relative-directory reference should be interpreted as relative to WEB-INF!

Guess primary key in generate-mapping

This came up in discussion with Christian Brenninkmeijer, who has partially implemented this in his own code.

Add a --guess-pk switch. Analyze columns to see if they are unique. If one is, use it as the key. Perhaps, if none is found, move on to pairs of columns, then triplets, etc.
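The uniqueness check could be a single SQL probe per column. This sketch (hypothetical; not the partial implementation mentioned above) only builds the query string: a column qualifies as a key candidate when COUNT(*) equals COUNT(DISTINCT col) and also equals COUNT(col), since COUNT(col) skips NULLs and thereby rules them out too.

```java
// Hypothetical sketch: build the SQL probe that tests whether a single
// column is unique and NULL-free, making it a primary-key candidate.
public class UniquenessCheck {
    static String uniquenessQuery(String table, String column) {
        return "SELECT COUNT(*) = COUNT(DISTINCT " + column + ")"
                + " AND COUNT(*) = COUNT(" + column + ")"
                + " FROM " + table;
    }

    public static void main(String[] args) {
        System.out.println(uniquenessQuery("person", "ID"));
    }
}
```

The boolean-valued SELECT works on MySQL and Postgres; other dialects would need a CASE expression instead.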

SPARQL DESCRIBE not working in some cases where d2rq:join is used

Reported by Evangelos Theodoridis.

When adding a property to a table that uses information from another database table, the HTML view seems to have a problem, but the SPARQL endpoint works properly.

For example (schema used attached) when adding

a) a property that lists the interactions of a person using info from the interactions table

map:person_interactsWith a d2rq:PropertyBridge;
    d2rq:belongsToClassMap map:person;
    d2rq:property vocab:person_interactsWith;
    d2rq:join "person.ID = interactions.PersonA";
    d2rq:uriSqlExpression "CONCAT('http://192.168.2.4:8080/resource/person/', interactions.PersonB)";
    d2rq:propertyDefinitionLabel "Has Interacted With the following";
    .

or

b) a property that lists the Categories of persons that a person has interacted with

map:person_CategoryInteractions a d2rq:PropertyBridge;
    d2rq:belongsToClassMap map:person;
    d2rq:property vocab:person_CategoryInteractions;
    d2rq:join "person.ID <= interactions.PersonA";
    d2rq:sqlExpression "SELECT person.Category from person WHERE person.ID = interactions.PersonB";

we get (in the HTML view) the following error

HTTP ERROR: 500

Column not found in DESCRIBE result: @@interactions.PersonA@@ (E0)
RequestURI=/page/person/1

SPARQL view works just fine however.

Validate mapping against an RDFS/OWL file

Add validation that checks whether all classes and properties generated by the mapping are actually defined in an ontology (either by dereferencing, or provided in an RDFS/OWL file).

This is related to reasoning/inferencing, which would also benefit from access to an external RDFS/OWL file.

Dynamically-generated language tags

There are situations where the language tag of a property bridge is not static, but comes from a column in the DB. The BSBM schema is an example.

Currently this can only be handled by duplicating the bridge with different d2rq:conditions.

There could be a new property d2rq:langMap which points to a sub-resource that allows specification of a d2rq:column etc.

Serve RDFS/OWL file on the server

Make it easy to serve an RDFS/OWL file (which might come from the web or from the file system) alongside the data in D2R Server.

Ideally, use it for rdfs:label etc, and mix it into the SPARQL dataset.

This could be used to serve the auto-generated vocabulary.

Drop or update the Sesame support

Either drop or update the Sesame API support. It hasn't been touched since 2006 and most likely no longer works for recent Sesame versions. It has no tests.

Make subqueries work in d2rq:sqlExpression and d2rq:condition

Making subqueries work would be extremely handy. Currently the problem is that all columns mentioned there will be renamed with aliases. This should be done only for columns on tables that are not in the FROM clause of the subquery (?).

Maybe it works if one uses only aliases as table names in the FROM clause? If so, document that. If not, some other way has to be found.

SPARQLing for a DATE in SQL Server doesn't work

Reportedly, querying for a specific DATE value in SQL Server doesn't work.

The generated query contains: t.foo=DATE '1999-02-13'.

Reportedly, the correct syntax for SQL Server is foo='1999-02-13'.
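A dialect switch for date literals, sketched minimally with a hypothetical helper (D2RQ's SQL generator would do this where it currently emits the ANSI form):

```java
// Hypothetical sketch: render a date literal per SQL dialect.
// The ANSI form DATE '1999-02-13' breaks on SQL Server, which
// reportedly wants a plain quoted string.
public class DateLiteral {
    static String render(String isoDate, boolean sqlServer) {
        return sqlServer
                ? "'" + isoDate + "'"
                : "DATE '" + isoDate + "'";
    }

    public static void main(String[] args) {
        System.out.println(render("1999-02-13", true));  // '1999-02-13'
        System.out.println(render("1999-02-13", false)); // DATE '1999-02-13'
    }
}
```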

Supporting multiple databases in a single mapping

Doing this right is very difficult because it requires query planning.

A cheap way to address the use case without actually supporting multiple mappings is to support SPARQL basic federated query, but that's a separate issue.

If we really don't support multiple d2rq:Databases in a single mapping, then we should just consider that a validation failure and abort.

Otherwise, there are a number of things that can be done. Most importantly, NodeMakers over relations from different DBs should never be considered compatible. This should allow at least non-join federation of databases.

VoID support

Add VoID support:

  • void:inDataset on all RDF representations
  • assign a dataset IRI
  • serve a VoID description that includes all information we know, such as SPARQL endpoint and perhaps even stats
  • allow mixing a user-provided RDF file into the VoID description

Also, SPARQL Service Description should be served from the SPARQL endpoint URI – could likely be the same document as the VoID representation.

Issues with booleans in Postgres

Ensure that returning booleans works in Postgres.

Reportedly, they come out of the DB as “t” instead of “true”, and are then evaluated to false in TypedNodeMaker.
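One possible fix, sketched with hypothetical names: normalize the driver's single-letter values to canonical xsd:boolean lexical forms before typed-node creation.

```java
// Hypothetical sketch: map Postgres's single-letter boolean result
// values to the canonical xsd:boolean lexical forms.
public class BooleanNormalizer {
    static String normalize(String dbValue) {
        if ("t".equals(dbValue)) return "true";
        if ("f".equals(dbValue)) return "false";
        return dbValue; // already "true"/"false", or another driver's form
    }

    public static void main(String[] args) {
        System.out.println(normalize("t"));    // true
        System.out.println(normalize("false")); // false
    }
}
```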

SQL query run twice when containsDuplicates set to true

Reported by Phil Ashworth:

I've created a classmap on a table that contains duplicates, so I've used the d2rq:containsDuplicates "true"; statement (see below). However, I noticed that the results were slower to return with this statement. Running on a local host, I checked the debug statements for the SQL calls.

It runs the correct query (with the distinct clause) but it runs it twice before the results are displayed.

If I comment out the d2rq:containsDuplicates statement, the sql is only run once.

map:Entity a d2rq:ClassMap;
    d2rq:dataStorage map:database;
    d2rq:uriPattern "entity/@@NBE_READER.UCB_NBE_DETAIL_VW.COMPOUND@@";
    d2rq:class vocabClass:Entity;
    d2rq:classDefinitionLabel "Entity Class definition";
    d2rq:classDefinitionComment "Class defining Entities";
    d2rq:containsDuplicates "true";
    d2rq:condition "NBE_READER.UCB_NBE_DETAIL_VW.TEST_YN = 'N'";
    . 

Graceful handling of filenames where IRIs are expected

Users often accidentally specify a file name where an IRI is expected. This results in errors such as “unknown protocol C” or similar (Windows).

There should be more graceful handling of this common user error: either make filenames work by converting them to file: IRIs, or report “IRI expected, file name found as value of foo: xyz”.

See Issue #52.
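The file:-IRI conversion could be as simple as this sketch (hypothetical helper; the scheme test deliberately requires at least two characters, so a Windows drive letter like C: is treated as a path rather than producing the “unknown protocol C” error):

```java
import java.nio.file.Paths;

// Hypothetical sketch: accept either an IRI or a file name where an
// IRI is expected, converting file names to absolute file: IRIs.
public class FileOrIri {
    static String toIri(String value) {
        // A scheme of two or more letters/digits followed by ':' counts
        // as an IRI; a single letter (Windows drive) does not.
        if (value.matches("[A-Za-z][A-Za-z0-9+.-]+:.*")) {
            return value;
        }
        return Paths.get(value).toUri().toString();
    }

    public static void main(String[] args) {
        System.out.println(toIri("http://example.org/map.ttl")); // unchanged
        System.out.println(toIri("mapping.ttl")); // absolute file: IRI
    }
}
```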

ClassMap with d2rq:constantValue is rejected by map parser

This map:

map:Throot a d2rq:ClassMap;
    d2rq:dataStorage map:database;
    d2rq:class skos:ConceptScheme;
    d2rq:constantValue "http://example.com/thesaurus/0.1/";

yields this error:

d2rq:ClassMap map:Throot needs a column/pattern/bNodeID specification (E0)

Either it should complain that d2rq:constantValue is a literal, or it should just work.

Extract best-effort RDF Schema file from mapping

Add code that infers a best-effort RDF Schema file from a mapping.

I guess this could be done in an “open-world” or a “closed-world” fashion. In closed-world mode, if property foo is only used to relate A and B, then those are its domain and range. In open-world mode, it would just mean that the domain and range intersect A and B.

There should be a script for saving that schema to a file. This could be part of generate-mapping if the implementation only works for mappings generated by that script; or a separate script if its input is any mapping.

Separate issue: serve the extracted schema in D2R Server (#42).

Support arbitrary characters in table/column names in generate-mapping

With escaping, any character can occur in a table/column name. The generate-mapping script isn't prepared for that; it derives IRIs from those names in a way that fails quite often. A generic escaping mechanism that works for absolutely any character is needed.

Examples for characters known to cause problems: “.”, “@”

Also, the current design causes clashes: "XXX_YYY.ID" and "XXX.YYY_ID" result in the same IRI in the mapping file.
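A generic escaping mechanism might percent-encode every non-alphanumeric character, which also resolves the XXX_YYY.ID / XXX.YYY_ID clash. This is a hypothetical sketch: it encodes char values directly, so it is safe for ASCII identifiers only; a real implementation would need proper UTF-8 percent-encoding for arbitrary characters.

```java
// Hypothetical sketch: percent-encode everything outside [A-Za-z0-9]
// when deriving IRI local names from table/column names, so that
// "_" (0x5F) and "." (0x2E) can no longer be confused.
public class IdentifierEscaper {
    static String escape(String name) {
        StringBuilder sb = new StringBuilder();
        for (char c : name.toCharArray()) {
            if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9')) {
                sb.append(c);
            } else {
                sb.append('%').append(String.format("%02X", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("XXX_YYY.ID")); // XXX%5FYYY%2EID
        System.out.println(escape("XXX.YYY_ID")); // XXX%2EYYY%5FID
    }
}
```

Encoding even the underscore is unusual, but it is what makes the two example names distinguishable after escaping.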

DISTINCT with TEXT and BLOB in Oracle and MS SQL Server

D2RQ routinely generates queries that result in this error if run on MS SQL Server:

The text data type cannot be selected as DISTINCT because it is not comparable.

Similar things happen on Oracle for Text and BLOB.

Example SPARQL query:

SELECT ?title ?text WHERE {
    ?review bsbm:reviewFor <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer30/Product1358> .
    ?review dc:title ?title .
    ?review rev:text ?text .
}

Resulting SQL query:

SELECT DISTINCT "T1_review"."reviewID", "T2_review"."title", "T1_review"."publisher", "T3_review"."text" FROM "review" AS "T3_review", "review" AS "T1_review", "product" AS "T1_product", "review" AS "T2_review" WHERE ("T1_product"."producerID" = 30 AND "T1_product"."productID" = "T1_review"."productID" AND "T1_product"."productID" = 1358 AND "T1_review"."publisher" = "T2_review"."publisher" AND "T1_review"."publisher" = "T3_review"."publisher" AND "T1_review"."reviewID" = "T2_review"."reviewID" AND "T1_review"."reviewID" = "T3_review"."reviewID" AND ("T3_review"."language"='zh'))

This is part of the BSBM suite.

Load database dump into a SPARQL store

There should be a script that loads the mapped contents of a DB directly into a SPARQL store using SPARQL Update or the Graph Store Protocol.

This would make D2R a nicely complete ETL tool.

Duplicates in directory pages

Reportedly, a class map over a denormalized table will have duplicates in the directory pages even if d2rq:containsDuplicates is set to true.

Incorrect SQL query for property bridges with different d2rq:lang

Mapping (this comes from BSBM):

map:review_text1 a d2rq:PropertyBridge;
    d2rq:belongsToClassMap map:review;
    d2rq:property rev:text;
    d2rq:column "review.text";
    d2rq:lang "en";
    d2rq:condition "review.language='en'";
    .
# snip more languages
map:review_text8 a d2rq:PropertyBridge;
    d2rq:belongsToClassMap map:review;
    d2rq:property rev:text;
    d2rq:column "review.text";
    d2rq:lang "ko";
    d2rq:condition "review.language='ko'";

SPARQL query:

SELECT ?title ?text WHERE {
    ?review bsbm:reviewFor <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer30/Product1358> .
    ?review dc:title ?title .
    ?review rev:text ?text .
}

Resulting SQL:

SELECT DISTINCT "T1_review"."reviewID", "T2_review"."title", "T1_review"."publisher", "T3_review"."text" FROM "review" AS "T3_review", "review" AS "T1_review", "product" AS "T1_product", "review" AS "T2_review" WHERE ("T1_product"."producerID" = 30 AND "T1_product"."productID" = "T1_review"."productID" AND "T1_product"."productID" = 1358 AND "T1_review"."publisher" = "T2_review"."publisher" AND "T1_review"."publisher" = "T3_review"."publisher" AND "T1_review"."reviewID" = "T2_review"."reviewID" AND "T1_review"."reviewID" = "T3_review"."reviewID" AND ("T3_review"."language"='zh'))

So the SQL query explicitly asks only for zh, which isn't what was queried for.

Replace all N3 references with Turtle throughout

We say everywhere that D2RQ uses N3, but it actually uses only the Turtle subset. With Turtle on its way to becoming a W3C Recommendation, it's time to replace all references to N3 throughout the codebase and documentation with Turtle.

As part of this, make sure that owl:sameAs doesn't get serialized as = like it did in Pubby!

Issues with xsd:gYear comparison

Reportedly, FILTER (?y >= "1936"^^xsd:gYear) doesn't work even if there are newer years. Perhaps because it doesn't do numeric comparison?
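Numeric comparison of gYear lexical forms would behave as expected. A hypothetical sketch, which parses only the leading year digits and ignores an optional timezone suffix such as "Z":

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: compare xsd:gYear values numerically rather
// than lexically, tolerating a trailing timezone designator.
public class GYearCompare {
    private static final Pattern YEAR = Pattern.compile("^(-?\\d{4,})");

    static int yearOf(String gYear) {
        Matcher m = YEAR.matcher(gYear);
        if (!m.find()) {
            throw new IllegalArgumentException("Not a gYear: " + gYear);
        }
        return Integer.parseInt(m.group(1));
    }

    static boolean greaterOrEqual(String a, String b) {
        return yearOf(a) >= yearOf(b);
    }

    public static void main(String[] args) {
        System.out.println(greaterOrEqual("2005", "1936"));  // true
        System.out.println(greaterOrEqual("1910Z", "1936")); // false
    }
}
```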
