comunica / comunica

📬 A knowledge graph querying framework for JavaScript

License: Other

TypeScript 99.10% JavaScript 0.83% Dockerfile 0.02% Shell 0.06%

decentralization federation graphql hacktoberfest heterogeneity javascript query-engine rdf sparql triple-pattern-fragments

comunica's Introduction

A knowledge graph querying framework for JavaScript
Flexible SPARQL and GraphQL over decentralized RDF on the Web.

Learn more about Comunica on our website.

Comunica is an open-source project that is used by many other projects, and is being maintained by a group of volunteers. If you would like to support this project, you may consider:

Contributing directly by writing code or documentation; or
Contributing indirectly by funding this project via Open Collective.

Supported by

Comunica is a community-driven project, sustained by the Comunica Association. If you are using Comunica, becoming a sponsor or member is a way to make Comunica sustainable in the long-term.

Query with Comunica

Read one of our guides to get started with querying:

Or jump right into one of the available query engines:

Comunica SPARQL: SPARQL/GraphQL querying from JavaScript applications or the CLI (Browser-ready via a CDN)

Source Customisation
- Comunica SPARQL File: Engine to query over local RDF files
- Comunica SPARQL RDF/JS: Engine to query over in-memory RDF/JS-compliant sources.
- Comunica SPARQL RDF/JS Lite: Engine optimized for bundle size to query over in-memory RDF/JS-compliant sources.
- Comunica SPARQL HDT: Library to query over local HDT files
Solid Customisation
- Comunica SPARQL Solid: Engine to query over files behind Solid access control.
Link Traversal Research
- Comunica SPARQL Link Traversal: Engine to query over multiple files by following links between them.
- Comunica SPARQL Link Traversal Solid: Engine to query within Solid data vaults by following links between documents.
Reasoning Support
- Comunica SPARQL Reasoning: Engine that adds support for reasoning
- Comunica SPARQL Reasoning File: Engine to query over local RDF files with support for reasoning

Modifying or Extending Comunica

Read one of our guides to get started with modifying Comunica, or have a look at some examples:

Contribute

Interested in contributing? Have a look at our contribution guide.

Development Setup

(JSDoc: https://comunica.github.io/comunica/)

This repository should be used by Comunica module developers as it contains multiple Comunica modules that can be composed. This repository is managed as a monorepo using Lerna.

If you want to develop new features or use the (potentially unstable) in-development version, you can set up a development environment for Comunica.

Comunica requires Node.js 8.0 or higher and the Yarn package manager. Comunica is tested on OSX, Linux and Windows.

This project can be set up by cloning and installing it as follows:

$ git clone https://github.com/comunica/comunica.git
$ cd comunica
$ yarn install

Note: npm install is not supported at the moment, as this project makes use of Yarn's workspaces functionality.

This will install the dependencies of all modules, and bootstrap the Lerna monorepo. After that, all Comunica packages are available in the packages/ folder and can be used in a development environment, such as querying with Comunica SPARQL (@comunica/query-sparql).

Furthermore, this will add pre-commit hooks using husky to build, lint and test. These hooks can temporarily be disabled at your own risk by adding the -n flag to the commit command.

Benchmarking

If you want to do benchmarking with Comunica in Node.js, make sure to run Node.js in production mode as follows:

> NODE_ENV=production node packages/some-package/bin/some-bin.js

The reason for this is that Comunica extensively generates internal Error objects. In non-production mode, these also produce long stack traces, which may in some cases impact performance.

Cite

If you are using or extending Comunica as part of a scientific publication, we would appreciate a citation of our article.

@inproceedings{taelman_iswc_resources_comunica_2018,
  author    = {Taelman, Ruben and Van Herwegen, Joachim and Vander Sande, Miel and Verborgh, Ruben},
  title     = {Comunica: a Modular SPARQL Query Engine for the Web},
  booktitle = {Proceedings of the 17th International Semantic Web Conference},
  year      = {2018},
  month     = oct,
  url       = {https://comunica.github.io/Article-ISWC2018-Resource/}
}

License

This code is copyrighted by the Comunica Association and Ghent University – imec and released under the MIT license.

comunica's Issues

Offline unit testing

Make it so that all unit tests still work when not connected to the internet.

All HTTP-related unit tests currently perform actual HTTP requests, so we have to make sure to mock this.

Add identity-based DISTINCT actor

The current DISTINCT implementation works based on hashes. While the chance of clashes is quite small, we should add an identity-based implementation as well.

This could work by, for example, stringifying each object instead of hashing it.
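A minimal sketch of the stringify-based idea, assuming bindings are plain objects that map variable names to term strings (the Bindings type and both function names are hypothetical, not existing Comunica APIs). Keys are sorted before stringifying so that two bindings with the same content but different key order count as identical.

```typescript
// Hypothetical shape: a binding maps variable names to RDF term strings.
type Bindings = Record<string, string>;

// Build a canonical string for a binding. Keys are sorted so that
// insertion order does not influence identity.
function bindingsKey(bindings: Bindings): string {
  const sortedKeys = Object.keys(bindings).sort();
  return JSON.stringify(sortedKeys.map((key) => [key, bindings[key]]));
}

// Keep the first occurrence of each binding, comparing on the full
// stringified value instead of on a hash of it.
function distinctByIdentity(input: Bindings[]): Bindings[] {
  const seen = new Set<string>();
  const output: Bindings[] = [];
  for (const bindings of input) {
    const key = bindingsKey(bindings);
    if (!seen.has(key)) {
      seen.add(key);
      output.push(bindings);
    }
  }
  return output;
}
```

The trade-off is memory: the set holds every full string, where a hash-based variant only holds fixed-size digests.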

Feature-rdf-metadata

A way to extract metadata from an already-fetched and parsed RDF document, with a Hydra actor. (rdf-metadata)

After that, also a way to fetch paged Hydra RDF documents in a lazy manner, together with metadata. (hydra-paged)

Internally use asynciterator<RDF.Quad> to represent the quad stream, and metadata as a property.

This depends on #6.

Deduplicate webpacked dependencies

Webpack support was added in #64.

Webpack will by default deduplicate dependencies, but because we are using a (Lerna) monorepo, Webpack can sometimes still duplicate dependencies (webpack/webpack#5593).

There is a plugin that is supposed to fix this (https://github.com/RoboBurned/dedup-resolve-webpack-plugin), but it can cause issues in dependency resolution when packaged in a monorepo.
(Note: line 44 in the plugin should be replaced with if (fs.realpathSync(request.path) !== fs.realpathSync(cacheEntry.path)) {, and the failing modules should be blacklisted via the plugin.)

In practice, when creating a web bundle in a regular installation (non-monorepo), this issue shouldn't occur, so let's test this when we get there.

Rename 'entrypoint' in context

Currently, 'entrypoint' is used in the context to indicate a TPF entrypoint.
We should make this more specific, and rename this to 'tpf' or something similar,
to avoid confusion with other source types.

Scripts not working (on Windows of course)

Since all build scripts changed from tsc to ../../node_modules/.bin/tsc, Windows is having issues. Windows can interpret paths with slashes, but not when they are part of the path to the command being executed: '..' is not recognized as an internal or external command, etc. I'll have a look if I can find a workaround.

actor-http-native not fully tested

It looks like the tests in actor-http-native do not reach a full coverage.

(the block on line 34 in ActorQueryOperationLeftJoinNestedLoop should be commented/changed a bit so that coverage for that also becomes 100%)

SPARQL endpoint actor

Add an actor that can resolve full SPARQL queries against a SPARQL endpoint.

Metadata not always resolved

This is related to one of my comments in #30, but since I see the "problem" also occurs in another init actor I made a separate issue.

I'm also not sure if this is expected behaviour or not.

When running ActorInitRdfDereferencePaged with the pattern ?movie dbpedia-owl:starring dbpedia:Brad_Pitt. I get the following output:

Metadata: {
  "isFulfilled": false,
  "isRejected": false
}
{"subject":{"value":"http://dbpedia.org/resource/12_Monkeys"},"predicate":{"value":"http://dbpedia.org/ontology/starring"},"object":{"value":"http://dbpedia.org/resource/Brad_Pitt"},"graph":{"value":""}}
...

Meaning the metadata wasn't resolved (yet) but was printed. This is easily solved by adding an await on the metadata output line:

readable.push('Metadata: ' + JSON.stringify(await result.firstPageMetadata, null, '  ') + '\n');

Now the question is, is this the expected behaviour that the metadata can still be a promise at this point or is this a bug? (If it's expected behaviour the paged dereference init actor will need that change).

Stats writer

Implement it like in the current LDF client.

Browser support

It should be possible to use comunica using browserify and/or webpack.

Implement promise cancellations

As we use cancellable Bluebird promises, we should start adding support for the cancellation behaviour. This will be important for things such as HTTP requests.

For instance, when performing a query with a certain limit, execution should be able to stop immediately after reaching this limit.

Allow HTTP timeout configuration

We should allow users to pass an HTTP timeout value via the context (httpTimeout).

This could be implemented using our own setTimeout and the fetch AbortController: node-fetch/node-fetch#95

We should keep in mind here that we should clear our own timeout once the request completes (response object is available).

Additionally, we need an extra context option (boolean: httpTimeoutOnBody) to make it so that the timeout not only applies to the time until response starts coming in, but also to the time until the response body is fully available. The latter could take longer, or potentially be infinite for e.g. continuous data streams. This should also take into account that response bodies can be cancelled from within Comunica.
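A rough sketch of how the two options could interact, assuming a fetch-like request function that honours an AbortSignal. The helper name requestWithTimeout, the TimeoutOptions shape and the readBody callback are hypothetical and only serve the illustration. The timer is cleared as soon as the response is available, unless httpTimeoutOnBody asks for it to keep running while the body is read.

```typescript
// Hypothetical options, named after the context entries proposed above.
interface TimeoutOptions {
  httpTimeout: number;
  httpTimeoutOnBody?: boolean;
}

// `request` stands in for any fetch-like call that honours an AbortSignal;
// `readBody` stands in for consuming the response body.
async function requestWithTimeout<R, B>(
  request: (signal: AbortSignal) => Promise<R>,
  readBody: (response: R) => Promise<B>,
  options: TimeoutOptions,
): Promise<B> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), options.httpTimeout);
  try {
    const response = await request(controller.signal);
    if (!options.httpTimeoutOnBody) {
      // The response object is available: stop timing before reading the body.
      clearTimeout(timer);
    }
    return await readBody(response);
  } finally {
    // Always clear our own timeout, on success as well as on failure.
    clearTimeout(timer);
  }
}
```

With fetch, request could be (signal) => fetch(url, { signal }) and readBody could be (response) => response.text(). A continuous stream would then never time out unless httpTimeoutOnBody is set.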

Bounty

A bounty has been placed on this issue by:



€1088

Click here to learn more if you're interested in claiming this bounty by resolving this issue.

Support SPARQL operations

The following operations should be supported (assuming the SPARQL algebra types):

Dependent on Expression implementation:

Filter
Orderby
Expression

Dependent on Expression implementation, but not supported in the current client:

Aggregate
Extend
Group

Not supported in the current LDF client (so should not be supported right away):

Reduced
Values
Minus

Path-related operators. Not supported in the current LDF client (so should not be supported right away):

Abstract configurations

In the future, we could provide component sets which provide a certain specific functionality. These sets could provide importable config files (using owl:imports) to simplify config files in cases where knowledge of the deeper component levels is not needed.

Add convenience implementations for IHeaders and IBody

The HTTP bus provides the IHeaders and IBody interfaces which are based on the node-fetch types.

A default implementation should be provided for implementing HTTP actors that don't necessarily use the fetch API internally.

Ignored tracked file

https://github.com/rubensworks/comunica/blob/master/packages/actor-rdf-resolve-quad-pattern-hdt/test/__mocks__/hdt.js

This file should be ignored according to .npmignore. So either .npmignore should change or we should have a TypeScript version of this file.

Feature/rdf-dereference

A way to dereference a URI to a quad stream.

This depends on #2 and #4.

bus-rdf-dereference: Bus and abstract actor for dereferencing a URI to an RDF/JS stream. Actor in: URI, Actor out: RDF.Stream
actor-rdf-dereference-http-parse: Uses bus-rdf to get an overview of all available media types, uses bus-http to fetch the contents of the URI with an accept header based on these media types, and bus-rdf to parse these contents.

In the future, we should provide a way to give dynamic priorities to media types. Either statically at config-level.

Memento support

Just realized this, but we don't have Memento time conneg support planned yet.

Ideally, this should also be implemented before we release.

@mielvds Are you up for this?

Query local files

There should be a way to query files on the local filesystem.

One way of doing this would be to add an HTTP actor that proxies the file system. One problem with this would be that conneg would only be best-effort.

Note: For security, we need to ensure that remote resources cannot trigger local files to be queried. This could be done by adding an additional flag to the context when initializing, stating explicitly that local files may be queried.

Different output serializers

Just like the old LDF client, we should support different result writers.

Currently, this is hardcoded to print JSON bindings to the console.
This should be bus-ified, so that different writers can be easily added.

Also extend the HTTP interface to support this when done.
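As an illustration of what "bus-ified" writers could look like, here is a small sketch in which writers register themselves under a media type and the caller picks one by name. The registry, the function names and the two example writers are assumptions made for this sketch; they are not part of the existing code.

```typescript
// Hypothetical shapes for this sketch.
type Bindings = Record<string, string>;
type ResultWriter = (results: Bindings[]) => string;

// Registry of writers, keyed by media type.
const writers = new Map<string, ResultWriter>();

function registerWriter(mediaType: string, writer: ResultWriter): void {
  writers.set(mediaType, writer);
}

// The currently hardcoded behaviour becomes just one writer among others.
registerWriter('application/json', (results) => JSON.stringify(results));

// A second, deliberately simple writer: one line per binding, no escaping.
registerWriter('text/plain', (results) =>
  results
    .map((bindings) => Object.keys(bindings).map((key) => `${key}=${bindings[key]}`).join(' '))
    .join('\n'));

// Look up the writer for a media type and apply it.
function serialize(results: Bindings[], mediaType: string): string {
  const writer = writers.get(mediaType);
  if (!writer) {
    throw new Error(`No result writer registered for ${mediaType}`);
  }
  return writer(results);
}
```

The HTTP interface could then select the writer from the request's Accept header.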

Federation support

We'll have to support multiple entrypoints.

This could be done by creating a generic sources entry in the context, which can contain multiple sources ('sub-contexts') of different types (key: entrypoint, file, hdtFile, value: any). A certain actor could delegate these sources to a mediator.
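One possible shape for such a sources entry, sketched under the assumption that each source carries one of the types named above and a single value. The type names and the grouping helper are hypothetical; the grouping only illustrates how an actor could hand each group to a mediator for that source type.

```typescript
// Hypothetical source entry, using the keys mentioned above as types.
type SourceType = 'entrypoint' | 'file' | 'hdtFile';

interface Source {
  type: SourceType;
  value: string;
}

interface QueryContext {
  sources: Source[];
}

// Group source values per type, so that each group can be delegated
// to whatever resolves that kind of source.
function groupSourcesByType(context: QueryContext): Map<SourceType, string[]> {
  const groups = new Map<SourceType, string[]>();
  for (const source of context.sources) {
    const values = groups.get(source.type) ?? [];
    values.push(source.value);
    groups.set(source.type, values);
  }
  return groups;
}
```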

Add debug mode

Add profiler (at bus-level?) so that the execution time of each run/test per actor can be seen.
Allow query plan to be dumped. (Probably only after execution, as the left-deep-smallest actor results in dynamic plans)
Make debug mode not disable stack traces.
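A sketch of the profiling part, assuming a simplified synchronous actor with a name and a run method (real actors are asynchronous, and the Actor interface and withProfiling helper here are hypothetical). The wrapper records the elapsed time of every run per actor name.

```typescript
// Hypothetical, simplified actor: synchronous for brevity.
interface Actor<I, O> {
  name: string;
  run(input: I): O;
}

// Wrap an actor so that the duration of every run (in milliseconds)
// is appended to the samples of that actor's name.
function withProfiling<I, O>(actor: Actor<I, O>, timings: Map<string, number[]>): Actor<I, O> {
  return {
    name: actor.name,
    run(input: I): O {
      const start = Date.now();
      try {
        return actor.run(input);
      } finally {
        // Recorded even when the actor throws.
        const samples = timings.get(actor.name) ?? [];
        samples.push(Date.now() - start);
        timings.set(actor.name, samples);
      }
    },
  };
}
```

Doing this at bus level would mean wrapping every actor when it subscribes, so individual actors need no changes.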

Add support for prefixes

Allow prefixes to be defined externally from the query AND allow RDF serializations to use prefixes somehow.
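Both directions can be sketched with a plain prefix map: expanding prefixed names that come from outside the query, and compacting full IRIs for serializations. The PrefixMap type and both function names are hypothetical, and the sketch ignores edge cases such as escaped characters in local names.

```typescript
// Hypothetical prefix map: prefix label to namespace IRI.
type PrefixMap = Record<string, string>;

// Expand `prefix:local` into a full IRI. Input without a known prefix
// (including full IRIs such as http://...) is returned unchanged.
function expandPrefixedName(name: string, prefixes: PrefixMap): string {
  const separator = name.indexOf(':');
  if (separator < 0) {
    return name;
  }
  const prefix = name.slice(0, separator);
  if (!Object.prototype.hasOwnProperty.call(prefixes, prefix)) {
    return name;
  }
  return prefixes[prefix] + name.slice(separator + 1);
}

// Shorten a full IRI using the longest matching namespace, if any.
function compactIri(iri: string, prefixes: PrefixMap): string {
  let best: [string, string] | undefined;
  for (const [prefix, namespace] of Object.entries(prefixes)) {
    if (iri.startsWith(namespace) && (!best || namespace.length > best[1].length)) {
      best = [prefix, namespace];
    }
  }
  return best ? `${best[0]}:${iri.slice(best[1].length)}` : iri;
}
```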

Add query operation actor generator

We already have a code generator for actors. Make sure to add one for query operation actors as well, which all share certain properties.

Make webpack building faster

https://slack.engineering/keep-webpack-fast-a-field-guide-for-better-build-performance-f56a5995e8f1

Cache wired engine in query API

Currently, when calling query via the JS API, the engine will be rewired every time.

Change this so that first, the developer has to instantiate an engine, and only then can the query method be called.
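The intended calling pattern could look roughly like this, with wiring done once at instantiation and reused by every query. The interfaces and the newEngine name are placeholders for this sketch, and queries are shown as synchronous for brevity.

```typescript
// Hypothetical stand-in for the result of the (expensive) wiring step.
interface WiredActors {
  evaluate(query: string): string[];
}

interface QueryEngine {
  query(query: string): string[];
}

// Wiring happens exactly once, when the engine is instantiated.
// Every later call to `query` reuses the same wired actors.
function newEngine(wire: () => WiredActors): QueryEngine {
  const actors = wire();
  return {
    query: (query) => actors.evaluate(query),
  };
}
```

A developer would then call newEngine once and keep the returned engine around for all subsequent queries.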

Add query API

Currently, the init actor can only run from the command line and print to the console.

We should add an init actor (and a runner-?) that allows queries to be evaluated and results to be returned via JavaScript.

Filter

Support filter expressions by simply copy-pasting the impl of the current client.

Add SPARQL optimize bus

This bus should allow queries (in SPARQL algebra) to be rewritten by actors on its bus.

Actors could use this bus to optimize certain query types, or to modify certain operations so that certain specific actors can evaluate them.
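To make the idea concrete, here is a sketch in which each actor is a function from one algebra operation to another and the bus applies them in sequence. The Operation shape is a heavily simplified assumption (real SPARQL algebra nodes carry more fields), and flattenJoins is just one example rewrite.

```typescript
// Hypothetical, heavily simplified algebra node.
interface Operation {
  type: string;
  input: Operation[];
}

// An optimize actor rewrites one operation tree into another.
type OptimizeActor = (operation: Operation) => Operation;

// Apply all actors in order; each one sees the previous actor's output.
function optimize(operation: Operation, actors: OptimizeActor[]): Operation {
  return actors.reduce((current, actor) => actor(current), operation);
}

// Example actor: rewrite join(join(a, b), c) into join(a, b, c).
function flattenJoins(operation: Operation): Operation {
  const input = operation.input.map(flattenJoins);
  if (operation.type !== 'join') {
    return { ...operation, input };
  }
  const flattened: Operation[] = [];
  for (const child of input) {
    if (child.type === 'join') {
      flattened.push(...child.input);
    } else {
      flattened.push(child);
    }
  }
  return { type: 'join', input: flattened };
}
```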

Add SPARQL protocol interface

We should add an HTTP-based actor that accepts SPARQL queries.
This should do something similar to what ldf-client-http does.

Unable to build from clean install

Been fighting with this while trying to fix #1 (pretty sure this one is not Windows related!).

When running lerna bootstrap, the following steps get executed in order:

lerna info lifecycle preinstall
lerna info Symlinking packages and binaries
lerna info lifecycle postinstall
lerna info lifecycle prepublish
lerna info lifecycle prepare

The packages have been configured to build TypeScript to JavaScript in the prepare step.
In the Symlinking step, lerna tries to link all packages and binaries, which are defined in the bin field of package.json.
runner-cli has the following in its package.json:

"bin": {
    "comunica-run": "./bin/run.js"
  }

This file does not exist yet at this point since the build has not happened yet, only run.ts exists, causing the lerna bootstrap process to fail at this point.

Moving the build process to preinstall is not a solution since there would be missing dependencies due to the symlink of packages not having happened yet.

The only solution I see is to write the binaries in JavaScript instead of TypeScript?

Add better support for RDFJS sources

Add a basic actor in which an RDFJS source can be plugged in via the constructor.

ActorQueryOperationQuadpattern not returning all results

When changing config-example-quadpattern.json to use entrypoint http://fragments.dbpedia.org/2016-04/en and the pattern to ?movie dbpedia-owl:direct ?director. I no longer get all results, only the first few pages. The actual number of results differs every time I run this making me think this is a timing problem.

Make SPARQL init actor more user-friendly

Add things like support for URI prefixes, defining entrypoints as CLI argument, query files, ...

A default config file should be available in a separate package that contains all required actors for resolving SPARQL queries. This package should then become the main entrypoint of the query engine.

Make BGP resolving make use of join actors

Incorporate the join actors into BGP resolving as introduced by #30.
This could be done by making a new BGP actor that simply delegates joining.

This is probably only needed after version 1.0.0

Iterators returned by the paged dereference actor do not fire an 'end' event

I did not yet have time to test this one fully, but my small tests seem to indicate that these streams do not fire an 'end' event when they're finished. But the code does terminate, so something stops at least. The lack of an 'end' event is a problem though when combining iterators.

Write documentation

We need extensive documentation.

Documentation (with jsdoc?)
Examples

Feature-rdf-dereference-paged

A way to dereference a URI and get a stream of quads that lazily follows pages.

Make public

A couple of things that need to be done before we make the project public.

Indicate lack of fetch support in browser

Some old browsers will not support the fetch API (which we require).
In these browsers, the cryptic error "Expected a ReadableStream" will be shown.

We should indicate a more user-friendly error for browsers that do not support the fetch API: https://developer.mozilla.org/en-US/docs/Web/API/ReadableStream#Browser_compatibility
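A simple feature check at start-up could produce that friendlier message. The sketch below takes the global scope as a parameter so it can be exercised outside a browser; the function name and the error text are only suggestions.

```typescript
// Throw a readable error when the given global scope lacks the fetch API
// or ReadableStream, instead of failing later with a cryptic message.
function assertFetchSupport(scope: Record<string, unknown>): void {
  const missing: string[] = [];
  if (typeof scope.fetch !== 'function') {
    missing.push('fetch');
  }
  if (typeof scope.ReadableStream !== 'function') {
    missing.push('ReadableStream');
  }
  if (missing.length > 0) {
    throw new Error(
      `This browser does not support ${missing.join(' and ')}, which this query engine requires.`,
    );
  }
}
```

In a browser bundle this would be called once with the global object before the first request is made.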

Test helpers

Add test helpers in the root package, for things that are commonly used in tests, such as creating streams from triples and strings, converting from stream to array, or checking if all given elements exist in a stream.

Dynamic actor loading

Investigate if actors could dynamically be loaded. (using Components.js?)

For example, certain RDF parsers should only be loaded if they are actually needed. The JSON-LD parser for example takes a long time to load, so this should be avoided until a server only supports JSON-LD.

https://webpack.js.org/guides/code-splitting/
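Independent of how the loading itself is done, the core of the idea is a deferred, cached loader per media type. The sketch below uses synchronous loader functions for brevity; in practice a loader might wrap require or a dynamic import(), in which case it would return a promise. The Parser interface and all names are hypothetical.

```typescript
// Hypothetical minimal parser interface.
interface Parser {
  parse(input: string): string[];
}

// Defer a load until first use, then cache the result.
function lazy<T>(load: () => T): () => T {
  let loaded = false;
  let value: T | undefined;
  return () => {
    if (!loaded) {
      value = load();
      loaded = true;
    }
    return value as T;
  };
}

// Parsers are registered as loaders; nothing is loaded at registration time.
const parserLoaders = new Map<string, () => Parser>();

function registerParser(mediaType: string, load: () => Parser): void {
  parserLoaders.set(mediaType, lazy(load));
}

// Only the parser for the requested media type is ever loaded.
function getParser(mediaType: string): Parser {
  const loader = parserLoaders.get(mediaType);
  if (!loader) {
    throw new Error(`No parser registered for ${mediaType}`);
  }
  return loader();
}
```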

Abstract required constructor args

Make something so that this commonly occurring pattern can be abstracted: https://github.com/rubensworks/comunica/blob/master/packages/actor-init-query-operation/lib/ActorInitQueryOperation.ts#L18-L20

Possibly do this by adding something to Components.js, such as marking a parameter as required.
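Until Components.js can mark parameters as required, a small helper could cover the same ground. This sketch assumes the recurring pattern is a constructor that throws when an argument is missing; the requireArgs name and the error message are made up for the example.

```typescript
// Check that the listed keys are present (neither undefined nor null)
// and return the arguments unchanged, so the call can be used inline.
function requireArgs<T extends object>(args: T, required: (keyof T)[]): T {
  for (const key of required) {
    // `== null` matches both undefined and null.
    if (args[key] == null) {
      throw new Error(`A valid "${String(key)}" argument must be provided.`);
    }
  }
  return args;
}
```

A constructor could then start with a single requireArgs(args, [...]) call instead of one if-block per argument.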

Reorder subpatterns in BGP

The ReorderingGraphPatternIterator in the current LDF client seems to reorder triple patterns based on the number of free variables. Investigate what this does exactly, and where we can plug it in. (Either in a/the BGP actor, or in the new SPARQL optimize bus, #46)
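If the reordering is indeed based on the number of free variables, it could amount to something like the following. This is one reading of the description above, not a port of ReorderingGraphPatternIterator; the TriplePattern shape (terms as strings, variables starting with ?) is an assumption.

```typescript
// Hypothetical triple pattern: terms are strings, variables start with '?'.
interface TriplePattern {
  subject: string;
  predicate: string;
  object: string;
}

function isVariable(term: string): boolean {
  return term.startsWith('?');
}

function countVariables(pattern: TriplePattern): number {
  return [pattern.subject, pattern.predicate, pattern.object].filter(isVariable).length;
}

// Order patterns by ascending number of variables. The original index
// is used as a tie-breaker so that equal patterns keep their order.
function reorderPatterns(patterns: TriplePattern[]): TriplePattern[] {
  return patterns
    .map((pattern, index) => ({ pattern, index }))
    .sort((a, b) => countVariables(a.pattern) - countVariables(b.pattern) || a.index - b.index)
    .map((entry) => entry.pattern);
}
```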

Make client compilation more convenient

Make it possible for the componentsjs compilation to be done on the engine more easily, so that it can for example also be used for the command line script.

Federated query failure

The following query crashes:

PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX schema: <http://schema.org/>
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?person ?name ?book ?title {
  ?person dbpedia-owl:birthPlace [ rdfs:label "San Francisco"@en ].
  ?viafID schema:sameAs ?person;
               schema:name ?name.
  ?book dc:contributor [ foaf:name ?name ];
              dc:title ?title.
}

Sources: http://fragments.dbpedia.org/2016-04/en http://data.linkeddatafragments.org/viaf http://data.linkeddatafragments.org/harvard

Execution only works when in correct root folder

When testing the master branch, I tried to run the command from the root folder (i.e. packages\actor-init-hello-world\node_modules\.bin\comunica-run packages\actor-init-hello-world\config\config-example.json Desmond Hume ) which resulted in an error, while I did get the correct output when executing from the actor-init-hello-world folder.

The error was

Error: Invalid components file "packages\actor-init-hello-world\config\packages\actor-init-hello-world\config\config-example.json":
Error: No valid parser was found, both N3 and JSON-LD failed:
...
    name: 'jsonld.InvalidUrl',
    message: 'Dereferencing a URL did not result in a valid JSON-LD object. Possible causes are an inaccessible URL perhaps due to a same-origin policy (ensure the server uses CORS if you are using client-side JavaScript), too many redirects, a non-JSON response, or more than one HTTP Link Header was provided for a remote context.',
    details:
     { code: 'loading remote context failed',
       url: 'https://linkedsoftwaredependencies.org/contexts/comunica-actor-init-hello-world.jsonld',
...

So I assume this has something to do again with components.js having to find those jsonld files and linking them to the URL. (And in this case not finding them due to the path being different).

Feature-bgp

An actor should be created that listens on bus-query-operator and resolves BGPs.

This could be done based on a 'join' bus, for joining bindings streams.
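As an illustration of what a join actor would do with two bindings streams, here is a nested-loop join over in-memory arrays. Arrays stand in for the streams, and the Bindings type and function names are assumptions for this sketch. Two bindings are compatible when every shared variable has the same value.

```typescript
// Hypothetical immutable bindings: variable name to term string.
type Bindings = Readonly<Record<string, string>>;

// Merge two bindings if they agree on all shared variables;
// return undefined when they conflict.
function mergeBindings(left: Bindings, right: Bindings): Bindings | undefined {
  for (const key of Object.keys(left)) {
    const shared = Object.prototype.hasOwnProperty.call(right, key);
    if (shared && right[key] !== left[key]) {
      return undefined;
    }
  }
  return Object.freeze({ ...left, ...right });
}

// Nested-loop join: try every pair and keep the compatible merges.
function nestedLoopJoin(left: Bindings[], right: Bindings[]): Bindings[] {
  const joined: Bindings[] = [];
  for (const leftBindings of left) {
    for (const rightBindings of right) {
      const merged = mergeBindings(leftBindings, rightBindings);
      if (merged) {
        joined.push(merged);
      }
    }
  }
  return joined;
}
```

A BGP with several patterns could then be resolved by joining the bindings of its patterns pairwise.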

This depends on #8.

Feature-quad-pattern-query

A way to perform a QPF query against an entrypoint and get an s, p, o, g binding stream.
This stream must be an asynciterator of immutable binding objects.

Do this based on the RDFJS Source interface, so that any implementation can work with it.

bus-rdf-resolve-quad-pattern: Returns quad stream based on a quad pattern with options. An actor based on a RDFJS.Source factory with query options.
All available context and metadata entries must be documented on the wiki.
bus-query-operation: Based on SPARQL Algebra operator. Returns (immutable) bindings asynciterator. For now, just an actor that can handle 'quadpattern'. Streams also have 'metadata', for things such as order and estimated number of elements.

This depends on #7 and #16.