comunica / comunica
💬 A knowledge graph querying framework for JavaScript
Home Page: https://comunica.dev
License: Other
In the future, we could provide component sets which each provide a certain specific piece of functionality. These sets could provide importable config files (using `owl:imports`) to simplify config files in cases where knowledge of the deeper component levels is not needed.
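As a rough sketch, such a component set could ship a config file that other configs pull in via `owl:imports` (both IRIs below are illustrative, not existing packages):

```json
{
  "@context": "https://linkedsoftwaredependencies.org/contexts/components.jsonld",
  "@id": "urn:comunica:example-engine",
  "owl:imports": "https://example.org/comunica-set-query-resolvers/config.jsonld"
}
```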
Webpack support was added in #64.
Webpack will by default deduplicate dependencies, but because we use a (Lerna) monorepo, Webpack can sometimes still duplicate dependencies (webpack/webpack#5593).
There is a plugin that is supposed to fix this (https://github.com/RoboBurned/dedup-resolve-webpack-plugin), but it can cause issues in dependency resolution when packaged in a monorepo.
(Note: line 44 in the plugin should be replaced with `if (fs.realpathSync(request.path) !== fs.realpathSync(cacheEntry.path)) {`, and the failing modules should be blacklisted via the plugin.)
In practice, when creating a web bundle in a regular installation (non-monorepo), this issue shouldn't occur, so let's test this when we get there.
Just realized this, but we don't have Memento time conneg support planned yet.
Ideally, this should also be implemented before we release.
@mielvds Are you up for this?
Add test helpers in the root package for things that are commonly used in tests, such as creating streams from triples and strings, converting from stream to array, or checking if all given elements exist in a stream.
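A minimal sketch of two such helpers, built on Node's stream module (the names `streamifyArray` and `arrayifyStream` are placeholders, not existing comunica utilities):

```typescript
import { Readable } from 'stream';

// Turn an array of items (e.g. triples) into an object-mode readable stream.
export function streamifyArray<T>(items: T[]): Readable {
  return Readable.from(items, { objectMode: true });
}

// Collect all items of an object-mode stream into an array.
export function arrayifyStream<T>(stream: Readable): Promise<T[]> {
  return new Promise((resolve, reject) => {
    const items: T[] = [];
    stream.on('data', (item: T) => items.push(item));
    stream.on('error', reject);
    stream.on('end', () => resolve(items));
  });
}
```

Checking whether given elements exist in a stream could then just reuse `arrayifyStream` and an array lookup.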
Make something so that this commonly occurring pattern can be abstracted: https://github.com/rubensworks/comunica/blob/master/packages/actor-init-query-operation/lib/ActorInitQueryOperation.ts#L18-L20
Possibly do this by adding something to Components.js, such as marking a parameter as required.
I did not yet have time to test this one fully, but my small tests seem to indicate that these streams do not fire an 'end' event when they're finished. But the code does terminate, so something stops at least. The lack of an 'end' event is a problem though when combining iterators.
Currently, the init actor can only run from the command line and print to the console.
We should add an init actor (and a runner-?) that allows queries to be evaluated and results to be returned via JavaScript.
Add a basic actor in which an RDFJS source can be plugged in via the constructor.
When changing `config-example-quadpattern.json` to use entrypoint http://fragments.dbpedia.org/2016-04/en and the pattern to `?movie dbpedia-owl:direct ?director`, I no longer get all results, only the first few pages. The actual number of results differs every time I run this, making me think this is a timing problem.
An actor should be created that listens on `bus-query-operation` and resolves BGPs.
This could be done based on a 'join' bus, for joining bindings streams.
This depends on #8.
As we use cancellable Bluebird promises, we should start adding support for the cancellation behaviour. This will be important for things such as HTTP requests.
For instance, when performing a query with a certain limit, execution should be able to stop immediately after reaching this limit.
Add things like support for URI prefixes, defining entrypoints as CLI argument, query files, ...
A default config file should be available in a separate package that contains all required actors for resolving SPARQL queries. This package should then become the main entrypoint of the query engine.
This file should be ignored according to .npmignore. So either .npmignore should change, or we should have a TypeScript version of this file.
Since all build scripts have now changed from `tsc` to `../../node_modules/.bin/tsc`, Windows is having issues. Windows can interpret paths with slashes, but not if they're part of the path pointing to the command you're trying to execute: `'..' is not recognized as an internal or external command`, etc. I'll have a look to see if I can find a workaround.
Currently, when calling `query` via the JS API, the engine will be rewired every time. Change this so that the developer first has to instantiate an engine, and only then can the `query` method be called.
Support filter expressions by simply copy-pasting the implementation of the current client.
This is related to one of my comments in #30, but since I see the "problem" also occurs in another init actor, I made a separate issue.
I'm also not sure if this is expected behaviour or not.
When running `ActorInitRdfDereferencePaged` with the pattern `?movie dbpedia-owl:starring dbpedia:Brad_Pitt`, I get the following output:

```
Metadata: {
  "isFulfilled": false,
  "isRejected": false
}
{"subject":{"value":"http://dbpedia.org/resource/12_Monkeys"},"predicate":{"value":"http://dbpedia.org/ontology/starring"},"object":{"value":"http://dbpedia.org/resource/Brad_Pitt"},"graph":{"value":""}}
...
```
Meaning the metadata wasn't resolved (yet) but was printed anyway. This is easily solved by adding an await on the metadata output line:

```js
readable.push('Metadata: ' + JSON.stringify(await result.firstPageMetadata, null, ' ') + '\n');
```
Now the question is, is this the expected behaviour that the metadata can still be a promise at this point or is this a bug? (If it's expected behaviour the paged dereference init actor will need that change).
A way to perform a QPF query against an entrypoint and get an `s, p, o, g` binding stream. This stream must be an `asynciterator` of immutable binding objects. Do this based on the RDFJS Source interface, so that any implementation can work with it.
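As an illustration of the 'immutable binding objects' requirement, a quad from an RDF/JS-style source could be mapped to a frozen `s, p, o, g` binding like this (the `Quad` shape and function name are simplified placeholders, not the RDFJS interfaces themselves):

```typescript
// Simplified stand-in for an RDF/JS quad; real terms are objects, not strings.
interface Quad {
  subject: string;
  predicate: string;
  object: string;
  graph: string;
}

// Map a quad to an immutable s, p, o, g binding.
export function quadToBinding(quad: Quad): Readonly<Record<string, string>> {
  // Freeze so downstream actors cannot mutate bindings flowing through the stream.
  return Object.freeze({
    s: quad.subject,
    p: quad.predicate,
    o: quad.object,
    g: quad.graph,
  });
}
```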
`bus-rdf-resolve-quad-pattern`: Returns a quad stream based on a quad pattern with options. An actor based on an `RDFJS.Source` factory with query options. The `context` and `metadata` entries must be documented on the wiki.

`bus-query-operation`: Based on a SPARQL Algebra operator. Returns an (immutable) bindings asynciterator. For now, just an actor that can handle 'quadpattern'. Streams also have 'metadata', for things such as order and estimated number of elements.

Make it possible for the Components.js compilation to be done on the engine more easily, so that it can for example also be used for the command line script.
We'll have to support multiple entrypoints.
This could be done by creating a generic `sources` entry in the context, which can contain multiple sources ('sub-contexts') of different types (key: `entrypoint`, `file`, `hdtFile`; value: any). A certain actor could delegate these sources to a mediator.
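A sketch of what such a generic `sources` context entry could look like, using the key names from above (the exact JSON shape is hypothetical):

```json
{
  "sources": [
    { "type": "entrypoint", "value": "http://fragments.dbpedia.org/2016-04/en" },
    { "type": "file", "value": "path/to/data.ttl" },
    { "type": "hdtFile", "value": "path/to/data.hdt" }
  ]
}
```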
A way to dereference a URI and get a stream of quads that lazily follows pages.
Been fighting with this while trying to fix #1 (pretty sure this one is not Windows related!).
When running `lerna bootstrap`, the following steps get executed in order:
lerna info lifecycle preinstall
lerna info Symlinking packages and binaries
lerna info lifecycle postinstall
lerna info lifecycle prepublish
lerna info lifecycle prepare
The packages have been configured to build typescript to javascript in the prepare step.
In the Symlinking step, lerna tries to link all packages and binaries, which are defined in the `bin` field of package.json.
runner-cli has the following in its package.json:

```json
"bin": {
  "comunica-run": "./bin/run.js"
}
```
This file does not exist at this point since the build has not happened yet; only run.ts exists, causing the lerna bootstrap process to fail.
Moving the build process to preinstall is not a solution since there would be missing dependencies due to the symlink of packages not having happened yet.
The only solution I see is to write the binaries in JavaScript instead of TypeScript?
We need extensive documentation.
Implement it like in the current LDF client.
When testing the master branch, I tried to run the command from the root folder (i.e. `packages\actor-init-hello-world\node_modules\.bin\comunica-run packages\actor-init-hello-world\config\config-example.json Desmond Hume`), which resulted in an error, while I did get the correct output when executing from the `actor-init-hello-world` folder.
The error was:

```
Error: Invalid components file "packages\actor-init-hello-world\config\packages\actor-init-hello-world\config\config-example.json":
Error: No valid parser was found, both N3 and JSON-LD failed:
...
name: 'jsonld.InvalidUrl',
message: 'Dereferencing a URL did not result in a valid JSON-LD object. Possible causes are an inaccessible URL perhaps due to a same-origin policy (ensure the server uses CORS if you are using client-side JavaScript), too many redirects, a non-JSON response, or more than one HTTP Link Header was provided for a remote context.',
details:
  { code: 'loading remote context failed',
    url: 'https://linkedsoftwaredependencies.org/contexts/comunica-actor-init-hello-world.jsonld',
...
```
So I assume this again has something to do with Components.js having to find those JSON-LD files and link them to the URL (and in this case not finding them due to the path being different).
It looks like the tests in actor-http-native do not reach full coverage.
(The block on line 34 in `ActorQueryOperationLeftJoinNestedLoop` should be commented/changed a bit so that coverage for that also becomes 100%.)
We should add a HTTP-based actor that accepts SPARQL queries.
This should do something similar to what `ldf-client-http` does.
A way to extract metadata from an already-fetched and parsed RDF document, with a Hydra actor (`rdf-metadata`).
After that, also a way to fetch paged Hydra RDF documents in a lazy manner, together with metadata (`hydra-paged`).
Internally use `asynciterator<RDF.Quad>` to represent the quad stream, and `metadata` as a property.
This depends on #6.
It should be possible to use comunica using browserify and/or webpack.
This bus should allow queries (in SPARQL algebra) to be rewritten by actors on its bus.
Actors could use this bus to optimize certain query types, or to modify certain operations so that certain specific actors can evaluate them.
We should allow users to pass an HTTP timeout value via the context (`httpTimeout`).
This could be implemented using our own `setTimeout` and the fetch `AbortController`: node-fetch/node-fetch#95
We should keep in mind here that we should clear our own timeout once the request completes (response object is available).
Additionally, we need an extra context option (boolean: `httpTimeoutOnBody`) to make the timeout apply not only to the time until the response starts coming in, but also to the time until the response body is fully available. The latter could take longer, or potentially be infinite for e.g. continuous data streams. This should also take into account that response bodies can be cancelled from within Comunica.
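A minimal sketch of the timeout part, assuming an AbortController-capable environment; `fetchWithTimeout` and its parameters are illustrative, not the actual actor wiring:

```typescript
// Abort the request after `httpTimeout` milliseconds via an AbortController,
// and clear our own timer as soon as the request settles. The fetch function
// is injected so any fetch-compatible implementation (e.g. node-fetch) works.
export function fetchWithTimeout<T>(
  url: string,
  httpTimeout: number,
  fetchFn: (url: string, init: { signal: AbortSignal }) => Promise<T>,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), httpTimeout);
  const promise = fetchFn(url, { signal: controller.signal });
  // Clear the timeout once the request completes (success or failure).
  promise.then(() => clearTimeout(timer), () => clearTimeout(timer));
  return promise;
}
```

Supporting `httpTimeoutOnBody` would additionally keep the timer alive until the body stream ends, instead of clearing it when the response headers arrive.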
A bounty of €1088 has been placed on this issue.
Just like the old LDF client, we should support different result writers.
Currently, this is hardcoded to print JSON bindings to the console.
This should be bus-ified, so that different writers can be easily added.
Also extend the HTTP interface to support this when done.
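To illustrate the bus-ified idea, a sketch of a writer registry keyed on media types (the interface and class names are hypothetical, not comunica's API):

```typescript
// Each writer declares the media types it can produce.
interface IResultWriter {
  mediaTypes: string[];
  write(bindings: Record<string, string>[]): string;
}

export class ResultWriterRegistry {
  private writers: IResultWriter[] = [];

  register(writer: IResultWriter): void {
    this.writers.push(writer);
  }

  // Pick the first writer that can produce the requested media type.
  find(mediaType: string): IResultWriter | undefined {
    return this.writers.find((w) => w.mediaTypes.includes(mediaType));
  }
}
```

An HTTP interface could then pick a writer based on the request's Accept header.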
The current DISTINCT implementation works based on hashes. While the chance of clashes is quite small, we should add an identity-based implementation as well.
This could work by for example just stringifying each object, instead of hashing it.
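A sketch of such an identity-based DISTINCT over plain arrays (a real actor would operate on a bindings stream; the function name is illustrative):

```typescript
// Identity-based DISTINCT: use the full stringified object as the key instead
// of a hash, so equal objects can never clash. Trades memory for correctness.
export function distinctByIdentity<T>(items: T[]): T[] {
  const seen = new Set<string>();
  return items.filter((item) => {
    const key = JSON.stringify(item);
    if (seen.has(key)) {
      return false;
    }
    seen.add(key);
    return true;
  });
}
```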
A couple of things that need to be done before we make the project public.
The following query crashes:
```sparql
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX schema: <http://schema.org/>
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?person ?name ?book ?title {
  ?person dbpedia-owl:birthPlace [ rdfs:label "San Francisco"@en ].
  ?viafID schema:sameAs ?person;
          schema:name ?name.
  ?book dc:contributor [ foaf:name ?name ];
        dc:title ?title.
}
```
Sources: http://fragments.dbpedia.org/2016-04/en http://data.linkeddatafragments.org/viaf http://data.linkeddatafragments.org/harvard
Currently, 'entrypoint' is used in the context to indicate a TPF entrypoint. We should make this more specific and rename it to 'tpf' or something similar, to avoid confusion with other source types.
Make it so that all unit tests still work when not connected to the internet.
All HTTP-related unit tests currently perform actual HTTP requests, so we have to make sure to mock this.
The `ReorderingGraphPatternIterator` in the current LDF client seems to reorder triple patterns based on the number of free variables. Investigate what this does exactly, and where we can plug it in (either in a/the BGP actor, or in the new SPARQL optimize bus, #46).
There should be a way to query files on the local filesystem.
One way of doing this would be to add an HTTP actor that proxies the file system. One problem with this would be that conneg would only be best-effort.
Note: for security, we need to ensure somehow that remote resources cannot trigger local files to be queried. This could be done by adding an additional flag to the context at initialization, explicitly stating that local files may be queried.
The following operations should be supported (assuming the SPARQL algebra types):
Dependent on Expression implementation:
Dependent on Expression implementation, but not supported in the current client:
Not supported in the current LDF client (so need not be supported right away):
Path-related operators. Not supported in the current LDF client (so need not be supported right away):
The HTTP bus provides the `IHeaders` and `IBody` interfaces, which are based on the node-fetch types. A default implementation should be provided for implementing HTTP actors that don't necessarily use the fetch API internally.
Investigate whether actors could be loaded dynamically (using Components.js?).
For example, certain RDF parsers should only be loaded if they are actually needed. The JSON-LD parser for example takes a long time to load, so loading it should be avoided unless a server only supports JSON-LD.
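A sketch of what such lazy loading could look like: each media type gets a loader function that is only invoked, and cached, the first time that type is encountered (names are illustrative; in practice the loaders could be dynamic `import()` calls wired up by Components.js):

```typescript
// Lazily load heavy parser modules per media type, caching the result so
// each loader runs at most once.
export class LazyParserLoader {
  private cache = new Map<string, Promise<unknown>>();

  constructor(private loaders: Record<string, () => Promise<unknown>>) {}

  load(mediaType: string): Promise<unknown> {
    let parser = this.cache.get(mediaType);
    if (!parser) {
      parser = this.loaders[mediaType]();
      this.cache.set(mediaType, parser);
    }
    return parser;
  }
}
```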
A way to dereference a URI to a quad stream.

`bus-rdf-dereference`: Bus and abstract actor for dereferencing a URI to an RDF/JS stream. Actor in: URI; actor out: `RDF.Stream`.

`actor-rdf-dereference-http-parse`: Uses `bus-rdf` to get an overview of all available media types, uses `bus-http` to fetch the contents of the URI with an accept header based on these media types, and `bus-rdf` to parse these contents.

In the future, we should provide a way to give dynamic priorities to media types, either statically at config-level.
Incorporate the join actors into BGP resolving as introduced by #30.
This could be done by making a new BGP actors that simply delegates joining.
This is probably only needed after version 1.0.0
We already have a code generator for actors. Make sure to add one for query operation actors as well, which all share certain properties.
(the `left-deep-smallest` actor results in dynamic plans)

Add an actor that can resolve full SPARQL queries against a SPARQL endpoint.
Allow prefixes to be defined externally from the query AND allow RDF serializations to use prefixes somehow.
Some old browsers do not support the fetch API (which we require).
In these browsers, the cryptic error "Expected a ReadableStream" will be shown.
We should show a more user-friendly error for browsers that do not support the fetch API: https://developer.mozilla.org/en-US/docs/Web/API/ReadableStream#Browser_compatibility
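A sketch of an explicit feature check that could replace the cryptic error (the function name and the injectable global object are illustrative; in a browser one would pass `window`):

```typescript
// Throw an actionable error when the environment lacks the fetch API.
export function checkFetchSupport(globalObj: { fetch?: unknown }): void {
  if (typeof globalObj.fetch !== 'function') {
    throw new Error(
      'This browser does not support the fetch API, which Comunica requires. ' +
      'See https://developer.mozilla.org/en-US/docs/Web/API/ReadableStream#Browser_compatibility',
    );
  }
}
```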