discovery's Introduction

See my personal GitHub Pages

❤️ Repositories

KnoMa, Hierarchical data transformations, Open hypothesis, Data bot, Horizon

📘 Linked Data Repositories

Extract Transform Load DCAT-AP Viewer SPARQLess

📘 Bioinformatics & Cheminformatics Repositories

Autodock Vina

discovery's People

Contributors

skodapetr

discovery's Issues

Ordering of experiment-level output CSVs

Output CSVs should be ordered.
discovery.csv and application-discovery.csv should be ordered according to the ordering of discoveries in the input experiment rdf:List.
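A minimal sketch of the requested ordering, assuming the discovery IRIs have already been read from the experiment's rdf:List; the method and row layout are illustrative, not the project's actual API:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: sort CSV rows by the position of their discovery IRI
// in the experiment's rdf:List.
public class OrderRows {

    public static List<String[]> sortByListOrder(
            List<String[]> rows, List<String> discoveryOrder) {
        // Map each discovery IRI to its index in the rdf:List.
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < discoveryOrder.size(); i++) {
            index.put(discoveryOrder.get(i), i);
        }
        List<String[]> sorted = new ArrayList<>(rows);
        // Assume the first column of each row holds the discovery IRI;
        // rows with unknown IRIs sort last.
        sorted.sort(Comparator.comparingInt(
                row -> index.getOrDefault(row[0], Integer.MAX_VALUE)));
        return sorted;
    }
}
```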

NPE when running discovery

klimek@KLIMEK-MFF-NTB:/mnt/c/Users/Kuba/Documents/GitHub/discovery/deploy$ java -jar discovery.jar --Filter no-filter -e https://discovery.linkedpipes.com/resource/experiment/label/config -o labels-no-filter
07:54:23 [main] INFO  c.l.d.cli.RunExperiment - Collected 5 discoveries in experiment https://discovery.linkedpipes.com/resource/experiment/label/config
07:54:23 [main] INFO  c.l.d.c.f.RemoteDefinition - Collecting templates for: https://discovery.linkedpipes.com/resource/discovery/label-00/config
07:54:23 [main] INFO  c.l.d.c.f.RemoteDefinition - Loading templates ...
07:54:55 [main] INFO  c.l.d.c.f.RemoteDefinition - Loaded applications: 1 transformers: 0 datasets: 114
Exception in thread "main" java.lang.NullPointerException
        at com.linkedpipes.discovery.cli.factory.DiscoveriesFromUrl.createDiscoveryBuilder(DiscoveriesFromUrl.java:95)
        at com.linkedpipes.discovery.cli.factory.DiscoveriesFromUrl.create(DiscoveriesFromUrl.java:59)
        at com.linkedpipes.discovery.cli.RunDiscovery.runDiscoveriesFromUrl(RunDiscovery.java:72)
        at com.linkedpipes.discovery.cli.RunDiscovery.run(RunDiscovery.java:58)
        at com.linkedpipes.discovery.cli.RunExperiment.run(RunExperiment.java:57)
        at com.linkedpipes.discovery.cli.AppEntry.run(AppEntry.java:31)
        at com.linkedpipes.discovery.cli.AppEntry.main(AppEntry.java:21)

Discovery not generating pipelines.json correctly

With: java -jar discovery.jar -o out -d https://discovery.linkedpipes.com/resource/discovery/dbpedia-test-01/config, a pipeline is seemingly discovered in pipelines.json. However, it is missing a transformer that must have been used, because otherwise the pipeline would not work:

{
  "pipelines" : [ {
    "components" : [ {
      "node" : "node_00001",
      "iri" : "https://ldcp.opendata.cz/resource/dbpedia/datasource-templates/Category-Charter_77_signatories",
      "label" : "Data source"
    }, {
      "node" : "node_00001",
      "iri" : "https://discovery.linkedpipes.com/resource/application/map/template",
      "label" : "Map Application"
    } ]
  } ]
}

Add another "transformers used" column

Currently, the "transformers used" column counts transformers applicable to datasets, even from pipelines not ending with an application. We can keep that, but we need to add another number: transformers used in pipelines which end with an application.
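The distinction between the two counts can be sketched as follows; the Component/Kind model is a simplified stand-in for the project's classes, not its actual API:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch: count distinct transformers, but only from pipelines
// whose last component is an application.
public class TransformerStats {

    enum Kind { DATA_SOURCE, TRANSFORMER, APPLICATION }

    record Component(String iri, Kind kind) {}

    public static long countUsedInApplicationPipelines(
            List<List<Component>> pipelines) {
        Set<String> used = new HashSet<>();
        for (List<Component> pipeline : pipelines) {
            if (pipeline.isEmpty()) {
                continue;
            }
            // Keep only pipelines that end with an application.
            if (pipeline.get(pipeline.size() - 1).kind() != Kind.APPLICATION) {
                continue;
            }
            for (Component component : pipeline) {
                if (component.kind() == Kind.TRANSFORMER) {
                    used.add(component.iri());
                }
            }
        }
        return used.size();
    }
}
```

The existing column would use the same logic without the application check.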

Update proposal

Now each data sample can contain multiple resources. An alternative is to allow a single resource per data sample; as a result, components would need to have multiple inputs, one per resource / class type.

This should make it easier to scale as the data samples are smaller and are less likely to change. It may also help to deal with integration of different resources.

However, we need to specify exactly how the data samples should be split and how the output data samples should be produced.

Extend export

Export data sample for each node as an extra file in the output directory.
The vertices file should contain, besides the transformer name, also its IRI; for the first node (the data source), the IRI of the data source should be used.

The pipeline file should then be a JSON array, where each object represents a pipeline, and a pipeline consists of a series of nodes. For each node, we store the IRI of the transformer and a link to the vertices file (the node id).
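A possible shape for the proposed pipeline file, following the style of the existing pipelines.json; the field names and file naming scheme are only a suggestion:

```json
{
  "pipelines" : [ {
    "nodes" : [ {
      "node" : "node_00001",
      "iri" : "https://example.org/resource/data-source",
      "dataSampleFile" : "node_00001.ttl"
    }, {
      "node" : "node_00002",
      "iri" : "https://example.org/resource/transformer",
      "dataSampleFile" : "node_00002.ttl"
    } ]
  } ]
}
```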

Add option to ignore missing templates and report them instead

It would be useful to have a switch enabling the user to skip over missing components (and report them). The tool would report the number of missing templates and produce an extra file with their list.

Exception in thread "main" java.io.FileNotFoundException: https://discovery.linkedpipes.com/resource/lod/templates/http---202.45.139.84-10035-catalogs-fao-repositories-agrovoc
        at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1915)
        at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1515)
        at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:250)
        at java.base/java.net.URL.openStream(URL.java:1139)
        at com.linkedpipes.discovery.rdf.RdfAdapter.fromHttp(RdfAdapter.java:61)
        at com.linkedpipes.discovery.rdf.RdfAdapter.asStatements(RdfAdapter.java:52)
        at com.linkedpipes.discovery.cli.factory.FromExperiment.loadTemplates(FromExperiment.java:70)
        at com.linkedpipes.discovery.cli.factory.FromExperiment.create(FromExperiment.java:49)
        at com.linkedpipes.discovery.cli.AppEntry.runExperiment(AppEntry.java:141)
        at com.linkedpipes.discovery.cli.AppEntry.run(AppEntry.java:74)
        at com.linkedpipes.discovery.cli.AppEntry.main(AppEntry.java:40)
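The requested behaviour could look roughly like this: instead of failing on the first missing template, collect the missing IRIs and report them afterwards. The loader interface and method names below are hypothetical, not the project's actual code:

```java
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;

// Sketch of a "skip missing templates" mode.
public class SkipMissingTemplates {

    interface TemplateLoader {
        void load(String iri) throws FileNotFoundException;
    }

    /** Returns the list of template IRIs that could not be resolved. */
    public static List<String> loadIgnoringMissing(
            List<String> iris, TemplateLoader loader) {
        List<String> missing = new ArrayList<>();
        for (String iri : iris) {
            try {
                loader.load(iri);
            } catch (FileNotFoundException ex) {
                // Skip and remember; report later, e.g. in an extra output file.
                missing.add(iri);
            }
        }
        return missing;
    }
}
```

The tool would then log the size of the returned list and write its contents to the extra file.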

Add support for inclusion of components into discovery definition

Right now, many discovery definitions run on the same set of components. However, these lists of components are copied into each discovery definition. When such a list needs to be updated (like now), we would have to update all the discovery definitions; see example and another example.

This is highly impractical. Therefore, an "inclusion" feature is requested.

Given a discovery input such as this one:

<https://discovery.linkedpipes.com/resource/discovery/label-02/config> a <https://discovery.linkedpipes.com/vocabulary/discovery/Input> ;
    <https://discovery.linkedpipes.com/vocabulary/discovery/hasTemplate> 
      <https://discovery.linkedpipes.com/resource/application/dcterms/template>,

a list of components could be imported into the definition from another location, like this:

<https://discovery.linkedpipes.com/resource/discovery/label-02/config>
  <https://discovery.linkedpipes.com/vocabulary/discovery/import>
    <https://discovery.linkedpipes.com/resource/lod/list> .

where

<https://discovery.linkedpipes.com/resource/lod/list> a <https://discovery.linkedpipes.com/vocabulary/discovery/Input>;
  <https://discovery.linkedpipes.com/vocabulary/discovery/hasTemplate> <https://discovery.linkedpipes.com/resource/application/dcterms/template>,
    <https://discovery.linkedpipes.com/resource/application/personal-profiles/template>,
...

This way, the list of imported components can be regenerated by a LP-ETL pipeline without breaking everything.
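Resolving such imports could be sketched as a recursive walk with a visited set, so that cyclic imports terminate instead of recursing forever; the maps below stand in for dereferencing the IRIs, and all names are illustrative:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of resolving the proposed "import" links.
public class ResolveImports {

    public static Set<String> collectTemplates(
            String definitionIri,
            Map<String, List<String>> imports,      // definition -> imported definitions
            Map<String, List<String>> templates) {  // definition -> hasTemplate values
        Set<String> visited = new LinkedHashSet<>();
        Set<String> result = new LinkedHashSet<>();
        collect(definitionIri, imports, templates, visited, result);
        return result;
    }

    private static void collect(
            String iri,
            Map<String, List<String>> imports,
            Map<String, List<String>> templates,
            Set<String> visited, Set<String> result) {
        if (!visited.add(iri)) {
            return; // already processed: guards against import cycles
        }
        result.addAll(templates.getOrDefault(iri, List.of()));
        for (String imported : imports.getOrDefault(iri, List.of())) {
            collect(imported, imports, templates, visited, result);
        }
    }
}
```

The visited set is what keeps a cyclic import (A imports B, B imports A) from looping.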

Discovery not displaying node IDs correctly

With: java -jar discovery.jar -o out -d https://discovery.linkedpipes.com/resource/discovery/dbpedia-test-01/config, I get duplicate node IDs in pipelines.json:

{
  "pipelines" : [ {
    "components" : [ {
      "node" : "node_00001",
      "iri" : "https://ldcp.opendata.cz/resource/dbpedia/datasource-templates/Category-Charter_77_signatories",
      "label" : "Data source"
    }, {
      "node" : "node_00001",
      "iri" : "https://discovery.linkedpipes.com/resource/application/map/template",
      "label" : "Map Application"
    } ]
  } ]
}

Import definitions not working

Running java -jar discovery.jar -o label-01 -d https://discovery.linkedpipes.com/resource/discovery/label-01/config now gives me a StackOverflowError.
