RDFRules

RDFRules is a powerful analytical tool for rule mining from RDF knowledge graphs. It offers a comprehensive rule mining solution, including RDF data pre-processing, rule post-processing and prediction from rules. The core of RDFRules is written in the Scala language. Besides the Scala API, RDFRules also provides a REST web service with a graphical user interface accessible via a web browser. RDFRules uses the AMIE algorithm with several extensions as the basis for a complete rule mining solution.

LIVE DEMO: https://br-dev.lmcloud.vse.cz/rdfrules/

Getting started

Requirements: Java 11+

RDFRules is divided into several main modules:

  • Scala API: It is suitable for Scala programmers who want to use RDFRules as a framework and invoke mining processes from Scala code (a short sketch follows below).
  • Web Service or Batch Processing: It is suitable for modular web-based applications and remote access via HTTP. Individual tasks can also be started in batch-processing mode without any user interaction.
  • GUI: It is suitable for anyone who wants to use the tool quickly and easily without any need for further programming.
  • Experiments: This module contains examples using the Scala API, as well as a script for a comprehensive benchmark comparing RDFRules with the original AMIE implementation using several threads and mining modes. Results of the performed experiments are placed in the results folder.

Detailed information about these modules, including deployment instructions, is provided in their subfolders.
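
For a first impression of the Scala API, the following minimal sketch loads an RDF file, mines rules with default settings and exports them. The object and method names (Dataset, Amie, mine, export) are approximations based on the operations described in this README; consult the Scala API documentation in the scala submodule for the exact signatures.

// Sketch only: names and signatures are approximate, see the Scala API docs.
import com.github.propi.rdfrules.data._
import com.github.propi.rdfrules.algorithm.amie.Amie

object MiningExample extends App {
  Dataset("data/yago.tsv")  // load triples/quads into the RDFDataset abstraction (lazily)
    .mine(Amie())           // index the data and run the AMIE-based miner with default thresholds
    .export("rules.txt")    // export the discovered rules in a human-readable text format
}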

Quick and easy run of RDFRules

  1. Download the latest release in the .zip format (currently v1.7.2) and unpack it into a folder.
  2. Go to the unpacked RDFRules home folder (with /bin, /webapp and /lib folders) and run the RDFRules HTTP API (compiled under Java 11):
    • On Linux: sh bin/main
    • On Windows: .\bin\main.bat
  3. Open the GUI via http://localhost:8851/ or ./webapp/index.html in a modern Internet browser.

Batch processing

If you need to run an RDFRules task as a scheduled job, define a JSON task (read more about task definitions in the http submodule). We recommend using the GUI to construct the task. Save the task to a file and run the process with the following command:

  • On Linux: sh bin/main task.json
  • On Windows: .\bin\main.bat task.json

The result in JSON format is printed to stdout together with other logs. If you need to export the JSON result into a separate file without logs, pass a target file path as the second argument.

  • On Linux: sh bin/main task.json result.json
  • On Windows: .\bin\main.bat task.json result.json

Compile executable files from source code

RDFRules is written in Scala 2.13, therefore you first need to install SBT, the interactive build tool for Scala. Then clone this repository and go to the root ./rdfrules folder. If you want to pack the HTTP module into executable files, run the following sbt commands in the root folder:

sbt> project http
sbt> pack

If you want to compile the GUI source files, enter the following commands:

sbt> project gui
sbt> fullOptJS

The compiled JavaScript file needed for the GUI is placed in gui/target/scala-2.13/gui-opt/main.js. You need to edit gui/webapp/index.html and set the correct path to the compiled main JavaScript file as well as the HTTP API URL. Then you can open gui/webapp/index.html in your favorite Internet browser to start the RDFRules GUI.

Design and Architecture

RDFRules main processes

The architecture of the RDFRules core is composed of five main data abstractions: RDFGraph, RDFDataset, Index, Ruleset and Prediction. These objects are gradually created during the processing of RDF data and rule mining. Each object offers several operations which either transform the current object or perform some action to create an output. Hence, these operations are classified as transformations or actions.

RDFRules main processes

Transformations

Any transformation is a lazy operation that converts the current data object into another one. For example, a transformation of the RDFDataset object creates either a new RDFDataset or an Index object.

Actions

An action operation applies all pre-defined transformations on the current and previous objects, and processes (transformed) input data to create a desired output such as rules, histograms, triples, statistics etc. Compared to transformations, actions may load data into memory and perform time-consuming operations.

Caching

If we use several action operations, e.g. with various input parameters, over the same data and the same set of transformations, then all the defined transformations are performed repeatedly for each action. This is caused by the lazy behavior of the main data objects and by the streaming process, which has no memory of previous steps. These redundant and repeated calculations can be eliminated by caching of performed transformations. Each data object has a cache method that performs all defined transformations immediately and stores the result either in memory or on disk.
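
As an illustrative sketch of this behaviour (method names follow the operation tables below; the exact Scala signatures may differ), two actions executed on the same object would otherwise repeat all pending transformations, while cache materializes them once:

// Sketch only: names are illustrative, see the Scala API documentation.
import com.github.propi.rdfrules.data._

val graph  = Graph("data/yago.tsv")      // lazy: the file is not parsed yet
val cached = graph.cache("graph.cache")  // perform all pending transformations once and store the result

cached.histogram()       // first action reads the cached data
cached.export("out.nt")  // second action reuses the cache; the source file is not parsed again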

Main Abstractions

RDFGraph

The RDFGraph object is a container for RDF triples and is built once we load an RDF graph. The input can either be a file or a stream of triples or quads in a standard RDF format such as N-Triples, N-Quads, JSON-LD, TriG or TriX. If the input format contains a set of quads (with information about named graphs), all triples are merged into one graph. Alternatively, we can directly create the RDFDataset object (see below) from quads and preserve the distribution of triples in the individual graphs. This object defines the following main operations:

Transformations

Operation Description
map quads Return a new RDFGraph object with updated triples.
filter Return a new RDFGraph object with filtered triples.
shrink Return a new, shrunken RDFGraph object.
split Split the loaded KG into several parts with sampling.
discretize Return a new RDFGraph object with discretized numeric literals by a predefined task and filter.
merge Merge all loaded graphs into one RDFDataset.

Actions

Operation Description
get Get and show all triples.
histogram Return histogram by chosen aggregated triple items.
properties Return information and stats about all properties.
export Export this graph into a file in some familiar RDF format.

RDFDataset

The RDFDataset object is a container for RDF quads and is created from one or many RDFGraph instances. This data object has the same operations as the RDFGraph. The only difference is that operations do not work with triples but with quads.

Transformations (different from RDFGraph)

Operation Description
index Create a fact Index object from this RDFDataset object.

Index

The Index object can be created from the RDFDataset object or loaded from a cache. It contains prepared and indexed data in memory and has operations for rule mining with the RDFRules algorithm.

Transformations

Operation Description
mine Execute a rule mining task with thresholds, constraints and patterns, and return a Ruleset object.

Actions

Operation Description
properties cardinality Get cardinalities from selected properties (such as size, domain, range).
export Serialize and export the loaded index into a file for later use.

Ruleset

The Ruleset object is the output of the RDFRules workflow. It contains all discovered rules conforming to all input restrictions. This final object has multiple operations for rule analysis, computing additional measures of significance, rule filtering and sorting, rule clustering, prediction from rules, and finally rule exporting for use in other systems.

Transformations

Operation Description
filter Return a new Ruleset object with filtered rules by measures of significance or rule patterns.
shrink Return a new, shrunken Ruleset object.
sort Return a new Ruleset object with sorted rules by selected measures of significance.
compute confidence Return a new Ruleset object with the computed confidence measure (CWA or PCA) for each rule by a selected threshold.
make clusters Return a new Ruleset object with clusters computed by a clustering task.
prune Return a new Ruleset object reduced by a selected pruning strategy.
predict Use all rules in the Ruleset to predict new triples. This operation returns a Prediction object.

Actions

Operation Description
get and show Get and show all mined rules.
export Export this Ruleset object into a file in some selected output format.

Prediction

The Prediction object is a container of all triples predicted by a ruleset. It distinguishes positive, negative and PCA-positive types of prediction. Each predicted triple carries information about all rules which predict it.

Transformations

Operation Description
filter Return a new Prediction object with filtered predicted triples by measures of significance, rule patterns, triple filters and other options.
shrink Return a new, shrunken Prediction object.
sort Return a new Prediction object with sorted predicted triples by their rules and their measures of significance.
group Aggregate and score triples predicted by many rules.
to prediction tasks Generate prediction tasks by a user-defined strategy.
to dataset Transform all predicted triples into the RDFGraph object.

Actions

Operation Description
get and show Get and show predicted triples with bound rules.
export Export this Prediction object into a file in some selected output format.

PredictionTasks

The PredictionTasks object is a special container of all predicted triples divided into generated prediction tasks. Each prediction task (e.g. <Alice> <wasBornIn> ?) has a list of candidates sorted by their score. This structure allows selecting candidates by a chosen selection strategy and constructing a dataset from the predicted candidates.

Transformations

Operation Description
filter Return a new PredictionTasks object with filtered prediction tasks.
shrink Return a new, shrunken PredictionTasks object.
select candidates Select candidates from each prediction task by a selection strategy.
to prediction Convert this object back to the Prediction object.
to dataset Transform all predicted triples into the RDFGraph object.

Actions

Operation Description
get and show Get and show prediction tasks with candidates.
evaluate Evaluate all prediction tasks. It returns ranking metrics (such as hits@k, mean reciprocal rank), and completeness/quality metrics with confusion matrix (such as precision, recall).

Pre-processing

You can use the RDFGraph and RDFDataset abstractions to analyze and pre-process input RDF data before the mining phase. First you load RDF datasets into the RDFRules system, and then you can aggregate data, count occurrences or read types of individual triple items. Based on this analysis you can define transformations including triple/quad merging, filtering or replacing. Transformed data can either be saved to a file in some RDF format (or in a binary format for later use), or used for indexing and rule mining. Therefore, RDFRules is also suitable for RDF data transformations and is not intended only for rule mining.

RDFRules uses the EasyMiner-Discretization module, which provides several unsupervised discretization algorithms such as equal-frequency and equal-width. These algorithms can easily be used within the RDFRules tool, where they are adapted to work with RDF triple items.
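
As a small worked example, discretizing a numeric property with values between 0 and 100 into four equal-width bins produces intervals of the same width, whereas equal-frequency discretization chooses cut points so that each bin covers roughly the same number of values:

 equal-width,     values 0..100, 4 bins:        [0;25) [25;50) [50;75) [75;100]
 equal-frequency, values 1,2,2,3,50,95, 2 bins:  {1,2,2} | {3,50,95}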

Indexing

Before mining, the input dataset has to be indexed into memory for fast rule enumeration and measure counting. The RDFRules (enhanced AMIE+) algorithm uses fact indices that hold data in several hash tables. Hence, it is important to realize that the complete input data are replicated several times and stored in memory before the mining phase.

Data are actually stored in memory once the mining process is started. The system automatically resolves all triples with the owl:sameAs predicate and replaces all objects by their subjects in these kinds of triples. Thanks to this functionality, we can mine across several graphs whose statements are linked by the owl:sameAs predicate.
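
For example (with illustrative triples), given the statement <yago:Prague> owl:sameAs <dbr:Prague>, every occurrence of <dbr:Prague> is replaced by <yago:Prague> in the index, so rules can span both graphs:

 <yago:Prague> owl:sameAs <dbr:Prague> .
 <Alice> <livesIn> <dbr:Prague> .   is indexed as   <Alice> <livesIn> <yago:Prague> .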

Rule Mining

RDFRules uses the AMIE+ algorithm as the basis for rule mining. It mines logical rules in the form of Horn clauses with one atom at the head position and a conjunction of atoms at the body position. An atom is a statement (or triple) which is further indivisible and contains a fixed constant at the predicate position and variables or constants at the subject and/or object positions, e.g., the atom livesIn(a, b) contains variables a and b, whereas the atom livesIn(a, Prague) contains only one variable a and the fixed constant Prague at the object position.

Horn rules samples:
1: livesIn(a, b) => wasBornIn(a, b)
2: livesIn(a, Prague) => wasBornIn(a, Prague)
3: isMarriedTo(a, b) ^ directed(b, c) => actedIn(a, c)
4: hasChild(a,c) ^ hasChild(b,c) => isMarriedTo(a,b)

The output rule has to fulfill several conditions. First, rule atoms must be connected and mutually reachable; that means variables are shared among atoms to form a continuous path. Second, only closed rules are allowed: a rule is closed if each of its variables appears at least twice in the rule. Finally, atoms must not be reflexive, i.e., a single atom must not contain the same variable twice.
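
For illustration, the following rules would be rejected by these conditions:

 livesIn(a, b) => wasBornIn(a, c)                        not closed: variable c appears only once
 isMarriedTo(a, a) ^ livesIn(a, b) => wasBornIn(a, b)    rejected: the atom isMarriedTo(a, a) is reflexive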

Several parameters are passed to the rule mining process: indexed data, thresholds, rule patterns, constraints, and consumers. The relevance of rules is determined by their measures of significance. In RDFRules we use all measures defined in AMIE+ and some new measures such as lift or QPCA confidence.

Measures of Significance

Measure Description
HeadSize The head size of a rule is a measure which indicates a number of triples (or instances) for a head property.
Support Number of correctly predicted triples.
HeadCoverage This is the relative value of the support measure depending on the head size. HC = Support / HeadSize
BodySize Number of all predicted triples.
Confidence The standard confidence is a measure comparing the body size to the support value and is interpreted as a probability of the head occurrence given the specific body.
PcaBodySize Number of all predicted triples conforming PCA.
PcaConfidence This kind of confidence measure is more appropriate for the OWA, since a predicted missing fact is not assumed to be a negative example.
QpcaBodySize Number of all predicted triples conforming QPCA.
QpcaConfidence This kind of confidence measure improves the PCA confidence. It can reduce the number of generated negative examples by a computed property cardinality.
Lift The ratio between the standard confidence and the probability of the most frequent item of the given head. With this measure we are able to discover a dependency between the head and the body of the rule.
Cluster We can make rule clusters by their similarities. This measure only specifies the number of the cluster to which the rule belongs.
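
Using the definitions above, and consistently with the worked JSON example in the Post-processing section, the basic measures relate as follows (HeadConfidence denotes the probability of the most frequent head item; the QPCA variant is listed by analogy):

 HeadCoverage   = Support / HeadSize
 Confidence     = Support / BodySize
 PcaConfidence  = Support / PcaBodySize
 QpcaConfidence = Support / QpcaBodySize
 Lift           = Confidence / HeadConfidence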

Measures of significance in example

Thresholds

There are several main pruning thresholds which influence the speed of the rule enumeration process:

Threshold Description
MinHeadSize A minimum number of triples matching the rule head.
MinAtomSize A minimum number of triples matching each atom in the rules.
MinHeadCoverage A minimal head coverage.
MaxRuleLength A maximal length of a rule.
TopK A maximum number of returned rules sorted by head coverage.
GlobalTimeout A maximum mining time in minutes.
LocalTimeout A maximum rule refinement time in milliseconds with sampling.

Rule Patterns

All mined rules must match at least one pattern defined in the rule patterns list. If we have an idea of what atoms mined rules should contain, we can define one or several rule patterns. A rule pattern is either exact or partial. The number of atoms in any mined rule must be less than or equal to the length of the exact rule pattern. For a partial mode, if some rule matches the whole pattern then all its extensions also match the pattern.

AIP: Atom Item Pattern
 ?                       // Any item
 ?V                      // Any variable
 ?C,                     // Any constant
 ?a,?b,?c...             // A concrete variable
 <livesIn>               // A concrete constant
 [<livesIn>, <diedIn>]   // An item must match at least one of the pre-defined constants
 ![<livesIn>, <diedIn>]  // An item must not match any of the pre-defined constants

AP: Atom Pattern
 (AIP AIP AIP AIP?)  // A triple with three atom item patterns with optional graph atom item pattern as the fourth item

RP: Rule Pattern
 * ^ AP ^ AP => AP  // Partial rule pattern
     AP ^ AP => AP  // Exact rule pattern
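
For example, using the grammar above, the following partial rule pattern (assuming a body with a single atom pattern is allowed) matches any rule whose body contains a livesIn atom and whose head is wasBornIn over two variables; because the pattern is partial, all extensions of a matching rule match it as well:

 * ^ (?a <livesIn> ?b) => (?a <wasBornIn> ?b)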

Constraints

Here is a list of implemented constraints that can be used:

Constraint Description
OnlyPredicates(x) Rules must contain only predicates defined in the set x.
WithoutPredicates(x) Rules must not contain predicates defined in the set x.
WithConstants(position) Mining with constants at a specific atom position. Supported positions are: both, subject, object, lower cardinality side.
WithoutDuplicitPredicates Do not mine rules which contain the same predicate in more than one atom.

Post-processing

During the mining process, RDFRules calculates only the basic measures of significance: head size, support and head coverage. If you want to compute other measures (such as confidences and lift), you can do it explicitly in the post-processing phase. The RDFRules tool also supports rule clustering by the DBScan algorithm; it uses pre-defined similarity functions comparing rule contents and computed measures of significance. A large rule set can be reduced by pruning strategies (such as data coverage pruning, skyline pruning and quasi-binding pruning).

All mined rules can also be filtered or sorted by user-defined functions and finally exported either into a human-readable text format or into a machine-readable JSON format.

Example of the TEXT output format:

(?a <participatedIn> <Turbot_War>) ^ (?a <imports> ?b) -> (?a <exports> ?b) | support: 14, headCoverage: 0.037, confidence: 0.636, pcaConfidence: 0.636, lift: 100.41, headConfidence: 0.0063, headSize: 371, bodySize: 22, pcaBodySize: 22, cluster: 7

Example of the JSON output format:

[{
  "head": {
    "subject": {
      "type": "variable",
      "value": "?a"
    },
    "predicate": "<exports>",
    "object": {
      "type": "variable",
      "value": "?b"
    }
  },
  "body": [{
    "subject": {
      "type": "variable",
      "value": "?a"
    },
    "predicate": "<participatedIn>",
    "object": {
      "type": "constant",
      "value": "<Turbot_War>"
    }
  }, {
    "subject": {
      "type": "variable",
      "value": "?a"
    },
    "predicate": "<imports>",
    "object": {
      "type": "variable",
      "value": "?b"
    }
  }],
  "measures": [{
    "name": "headSize",
    "value": 371
  }, {
    "name": "confidence",
    "value": 0.6363636363636364
  }, {
    "name": "support",
    "value": 14
  }, {
    "name": "bodySize",
    "value": 22
  }, {
    "name": "headConfidence",
    "value": 0.006337397533925742
  }, {
    "name": "pcaConfidence",
    "value": 0.6363636363636364
  }, {
    "name": "lift",
    "value": 100.41403162055336
  }, {
    "name": "pcaBodySize",
    "value": 22
  }, {
    "name": "headCoverage",
    "value": 0.03773584905660377
  }, {
    "name": "cluster",
    "value": 7
  }]
}]

In RDFRules we can also attach information about the graph to every atom and then filter rules based on named graphs. This ability is useful for discovering new knowledge based on linking multiple graphs.

(?a <hasChild> ?c <yago>) ^ (?c <dbo:parent> ?b <dbpedia>) -> (?a <isMarriedTo> ?b <yago>)

Licence

RDFRules is licensed under GNU General Public License v3.0

Publications

Acknowledgments

Thanks to these organizations for supporting us:

VŠE

CIMPLE (TAČR TH74010002)


rdfrules's Issues

Empty cache file can cause pipeline to fail without log notices

This behaviour can be reproduced as follows:

  • load attached task.json.
  • create empty file "rulesPCA" in the workspace
  • run the pipeline

What will happen:
Indexing will not start. The log messages shown are:

2020-09-11 14:55:25:461 +0200 [rdfrules-http-akka.actor.default-dispatcher-6] INFO com.github.propi.rdfrules.http.InMemoryCache - Some value with key '025495e4-8e84-4f30-bfec-325a18dd3499x' was pushed into the memory cache. Number of items in the cache is: 1
2020-09-11 14:56:16:359 +0200 [Thread-1] INFO task-7216ee1d-9a9b-4286-bad1-0425e3c6b6fc - Predicates trimming.
2020-09-11 14:56:16:363 +0200 [rdfrules-http-akka.actor.default-dispatcher-9] INFO akka.actor.LocalActorRef - Message [com.github.propi.rdfrules.http.service.Task$TaskRequest$AddMsg] to Actor[akka://rdfrules-http/user/task-service/task-7216ee1d-9a9b-4286-bad1-0425e3c6b6fc#-1209497236] was not delivered. [1] dead letters encountered. If this is not an expected behavior then Actor[akka://rdfrules-http/user/task-service/task-7216ee1d-9a9b-4286-bad1-0425e3c6b6fc#-1209497236] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.

A workaround is either to delete the empty cache file "rulesPCA" or to set the last Cache node in the pipeline to "revalidate" the cache.

task.json.zip

Schema support

It should be possible to attach a schema at a dataset. Then we can do some extended operations:

  • generate triples with types from ontology (domain, range)
  • working with sub/super types; add a new triples with super types

When RDFRules runs out of memory, worker threads are not terminated

When RDFRules runs out of memory (GC overhead limit exceeded), worker threads are not terminated and the load of all CPU cores remains at 100%.

2020-09-16 11:34:59:780 +0200 [Thread-33] INFO task-3ba84b09-de58-4f40-bb60-3d41f2e4062a - Action Amie rules mining, steps: 16664 (0.06 per sec) -- processed rules, found closed rules: 25535936, queue size: 25577603, stage: 2, activeThreads: 6
Exception in thread "Thread-40" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at com.github.propi.rdfrules.algorithm.amie.RuleFilter$RulePatternFilter$$Lambda$1452/1315182476.get$Lambda(Unknown Source)
    at java.lang.invoke.LambdaForm$DMH/1023714065.invokeStatic_LL_L(LambdaForm$DMH)
    at java.lang.invoke.LambdaForm$MH/1802598046.linkToTargetMethod(LambdaForm$MH)
    at com.github.propi.rdfrules.algorithm.amie.RuleFilter$RulePatternFilter.matchAtom(RuleFilter.scala:83)
    at com.github.propi.rdfrules.algorithm.amie.RuleFilter$RulePatternFilter.apply(RuleFilter.scala:95)
    at com.github.propi.rdfrules.algorithm.amie.RuleFilter$And.apply(RuleFilter.scala:42)
    at com.github.propi.rdfrules.algorithm.amie.RuleRefinement.$anonfun$refine$13(RuleRefinement.scala:203)
    at com.github.propi.rdfrules.algorithm.amie.RuleRefinement$$Lambda$1592/1828223227.apply(Unknown Source)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:448)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:501)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:447)
    at scala.collection.Iterator.foreach(Iterator.scala:929)
    at scala.collection.Iterator.foreach$(Iterator.scala:929)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1417)
    at com.github.propi.rdfrules.algorithm.amie.Amie$AmieProcess$$anon$1.$anonfun$run$11(Amie.scala:209)
    at com.github.propi.rdfrules.algorithm.amie.Amie$AmieProcess$$anon$1.$anonfun$run$11$adapted(Amie.scala:202)
    at com.github.propi.rdfrules.algorithm.amie.Amie$AmieProcess$$anon$1$$Lambda$1422/153380730.apply(Unknown Source)
    at scala.collection.Iterator.foreach(Iterator.scala:929)
    at scala.collection.Iterator.foreach$(Iterator.scala:929)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1417)
    at com.github.propi.rdfrules.algorithm.amie.Amie$AmieProcess$$anon$1.run(Amie.scala:202)
    at java.lang.Thread.run(Thread.java:748)
2020-09-16 11:35:34:200 +0200 [Thread-33] INFO task-3ba84b09-de58-4f40-bb60-3d41f2e4062a - Action Amie rules mining, steps: 16665 (0.04 per sec) -- processed rules, found closed rules: 25538702, queue size: 25580374, stage: 2, activeThreads: 6
2020-09-16 11:36:24:157 +0200 [Thread-33] INFO task-3ba84b09-de58-4f40-bb60-3d41f2e4062a - Action Amie rules mining, steps: 16666 (0.06 per sec) -- processed rules, found closed rules: 25544096, queue size: 25585771, stage: 2, activeThreads: 6
Exception in thread "Thread-44" java.lang.OutOfMemoryError: GC overhead limit exceeded
2020-09-16 11:37:33:113 +0200 [Thread-33] INFO task-3ba84b09-de58-4f40-bb60-3d41f2e4062a - Action Amie rules mining, steps: 16667 (0.06 per sec) -- processed rules, found closed rules: 25546573, queue size: 25588249, stage: 2, activeThreads: 6
Exception in thread "Thread-43" java.lang.OutOfMemoryError: GC overhead limit exceeded
2020-09-16 11:38:00:114 +0200 [Thread-33] INFO task-3ba84b09-de58-4f40-bb60-3d41f2e4062a - Action Amie rules mining, steps: 16668 (0.06 per sec) -- processed rules, found closed rules: 25547331, queue size: 25589007, stage: 2, activeThreads: 6
Exception in thread "Thread-41" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Thread-45" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Thread-33" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Thread-53" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Thread-68" java.lang.OutOfMemoryError: GC overhead limit exceeded
Uncaught error from thread [rdfrules-http-scheduler-1]: GC overhead limit ex

Set limitations for workspace (during upload)

Set immutable and mutable folders + set temporary restrictions for uploaded files (e.g. max one week)... Set a memory limitation for the app, and restart the HTTP service if memory overflows. Show the current state of memory in the GUI...

Deserialization exception - Invalid type of measure

If json-serialized rules contain confidence or some other measure, they cannot be deserialized via Load ruleset due to Deserialization exception - Invalid type of measure.

rules.json
[ { "body": [ { "object": { "type": "variable", "value": "?a" }, "predicate": "<interacts_with>", "subject": { "type": "variable", "value": "?b" } } ], "head": { "object": { "type": "variable", "value": "?b" }, "predicate": "<interacts_with>", "subject": { "type": "variable", "value": "?a" } }, "measures": [ { "name": "BodySize", "value": 11702212 }, { "name": "HeadCoverage", "value": 0.9917442958647477 }, { "name": "Support", "value": 11605602 }, { "name": "HeadSize", "value": 11702212 }, { "name": "Confidence", "value": 0.9917442958647477 } ] } ]

add json task command line support

It would be convenient to be able to run RDFRules, e.g., as

java -jar RDFRulesLauncher.jar "task.json"
Here task.json would be generated in the GUI, or modified based on a task.json generated in the GUI.
This could supersede the Java API.

In GUI and REST add revalidate checkbox

By default, the revalidate checkbox will be unchecked; once the cache is used within the workflow, the next usage will be loaded from the cache and all preceding operations will be omitted. If revalidate is checked, all previous operations are performed and the cache is created again.

Add graph-based atoms/rules and constraints

p(a, b, Dbpedia) -> p(a, b, Yago)
p(a, b, [Dbpedia, Wikidata]) -> p(a, b, Yago)

  • add a constraint which enables this behaviour. By default, do not use graph-based rules.
  • rule patterns for graphs work only if the graph-based mode is turned on.
  • print rule: add a parameter for showing graphs in rules

State default values of parameters in GUI Mine node

Some thresholds in the Mine node are effective also when not present - defaults apply.
This, e.g., affects "Min head size", which has a default of 100. The default values in effect should be communicated to the user.

sync with new SBT version

It seems that SBT is not compatible with jdk 13 and 14.
sbt/sbt#5509 ("We don't test sbt on JDK 14, so that could also be the problem. Please run it on JDK 8 or 11.")
If this is true, the documentation should warn about this.
For me, it works with JDK 11.

Also, the run-main command on the RDFRules homepage does not seem to work with the current version of SBT - clulab/eidos#440.
It seems it was replaced by runMain.

Memory estimation by dataset size and create limits

Estimate the memory needed for storing a dataset into the index. Set limits, e.g., max 1GB = N quads...
There can be some upper limit in combination with System.gc. Once we are close to the limit, we stop loading the index.

Other restrictions in setting:

  • max quads
  • max triple item size
  • max memory during indexing (or at all)
  • max mined rules

Slow actor debugger

LinkedBlockingQueue is the bottleneck. Try to implement a "non-blocking" debugger: one message with a counter instead of a queue. The thread can sleep just 5 seconds and then read the current message.

Not to involve pruned head triples in the refining phase

Once some head triples are not mapped to the body (they are pruned), we need not involve them in the next refinement phase.

If the A_r set is empty, the current binding of the head s,p,o can be omitted within any other refinements of subsequent rules having the basis of the current rule.

Add indicative progress indicator

It would help if there was support for an approximate progress indicator for the mine task (number of rules processed, possibly with an estimate based on the time required to process rules so far).

Better mining debugging

Separate debugging into stages and offer a progress bar based on the queue size for each stage.

Constraints enhancements

By default, mining without constraints should mean mining with constants at the subject and object positions. Constraints should be:

  • constants at the subject position
  • constants at the object position
  • constants at the functional item, (C hasCitizen ?a) or (?a isCitizenOf C) - we instantiate the object for functions and the subject for inverse functions because these items should have greater support.
  • without constants

Mine will return maximum 10.000 rules

It seems that the output of Mine is capped at 10.000 rules (if Top-K is not used). If Top-K is used, any higher value seems to be automatically reduced to 10.000.

Instantiation does not work properly

Sometimes the results contain different predicates. Failing task:

[
  {
    "name": "LoadGraph",
    "parameters": {
      "path": "/dbpedia_yago/mappingbased_objects_sample.ttl",
      "graphName": "<dbpedia>"
    }
  },
  {
    "name": "LoadGraph",
    "parameters": {
      "path": "/dbpedia_yago/yagoFacts.tsv",
      "graphName": "<yago>"
    }
  },
  {
    "name": "LoadGraph",
    "parameters": {
      "path": "/dbpedia_yago/yagoDBpediaInstances.tsv",
      "graphName": "<dbpedia>"
    }
  },
  {
    "name": "MergeDatasets",
    "parameters": {}
  },
  {
    "name": "AddPrefixes",
    "parameters": {
      "prefixes": [
        {
          "prefix": "dbo",
          "nameSpace": "http://dbpedia.org/ontology/"
        },
        {
          "prefix": "dbr",
          "nameSpace": "http://dbpedia.org/resource/"
        }
      ]
    }
  },
  {
    "name": "Index",
    "parameters": {
      "prefixedUris": true
    }
  },
  {
    "name": "Mine",
    "parameters": {
      "thresholds": [
        {
          "name": "TopK",
          "value": 1000
        },
        {
          "name": "MinHeadCoverage",
          "value": 0.01
        }
      ],
      "patterns": [],
      "constraints": [
        {
          "name": "WithoutConstants"
        }
      ]
    }
  },
  {
    "name": "CacheRuleset",
    "parameters": {
      "inMemory": true,
      "path": "e4790ffb-d535-4e14-9478-867d3f4abe2a",
      "revalidate": false
    }
  },
  {
    "name": "ComputePcaConfidence",
    "parameters": {
      "min": 0.5,
      "topk": 50
    }
  },
  {
    "name": "Sorted",
    "parameters": {}
  },
  {
    "name": "GraphBasedRules",
    "parameters": {}
  },
  {
    "name": "Instantiate",
    "parameters": {
      "rule": {
        "body": [
          {
            "graphs": [
              "<dbpedia>"
            ],
            "object": {
              "type": "variable",
              "value": "?c"
            },
            "predicate": {
              "localName": "album",
              "nameSpace": "http://dbpedia.org/ontology/",
              "prefix": "dbo"
            },
            "subject": {
              "type": "variable",
              "value": "?a"
            }
          },
          {
            "graphs": [
              "<yago>"
            ],
            "object": {
              "type": "variable",
              "value": "?c"
            },
            "predicate": "<created>",
            "subject": {
              "type": "variable",
              "value": "?b"
            }
          }
        ],
        "head": {
          "graphs": [
            "<dbpedia>"
          ],
          "object": {
            "type": "variable",
            "value": "?b"
          },
          "predicate": {
            "localName": "musicalBand",
            "nameSpace": "http://dbpedia.org/ontology/",
            "prefix": "dbo"
          },
          "subject": {
            "type": "variable",
            "value": "?a"
          }
        },
        "measures": [
          {
            "name": "HeadCoverage",
            "value": 0.4664823773324119
          },
          {
            "name": "HeadSize",
            "value": 2894
          },
          {
            "name": "PcaBodySize",
            "value": 1368
          },
          {
            "name": "Support",
            "value": 1350
          },
          {
            "name": "PcaConfidence",
            "value": 0.9868421052631579
          }
        ]
      },
      "part": "Whole"
    }
  },
  {
    "name": "GetRules",
    "parameters": {}
  }
]

Comparison of the current method with a new proposal

  • we need not specify the dangling variable
  • first count the support for all dangling atoms without instances
  • if the support is greater than the threshold, count instances for danglings
  • instead of specifyAtom, create a function specifyVariable

Thresholds deleting

Remove thresholds which are not used during mining (confidence, pcaConfidence, etc.).

Better logging

Do not show the same message multiple times. Dataset-loading logging is wrong because it does not take into account multiple graphs being merged into one dataset. Resolve how to disable dataset-loading logging during indexing, since there are very annoying messages which look the same for dataset and index loading:

2020-09-08T16:11:28.606Z : Action Dataset indexing, steps: 0 -- started
2020-09-08T16:11:28.612Z : Action Dataset loading, steps: 0 -- started
2020-09-08T16:11:28.612Z : Action Dataset indexing, steps: 0 -- started
2020-09-08T16:11:30.862Z : Action Dataset loading, steps: 14465 -- ended
2020-09-08T16:11:30.862Z : Action Dataset indexing, steps: 14465
2020-09-08T16:11:30.862Z : Action Dataset loading, steps: 0 -- started
2020-09-08T16:11:30.862Z : Action Dataset indexing, steps: 14465
2020-09-08T16:11:32.434Z : Action Dataset loading, steps: 18845 -- ended
2020-09-08T16:11:32.434Z : Action Dataset indexing, steps: 33310
2020-09-08T16:11:32.435Z : Action Dataset loading, steps: 0 -- started
2020-09-08T16:11:32.435Z : Action Dataset indexing, steps: 33310
2020-09-08T16:11:37.436Z : Action Dataset loading, steps: 20205
2020-09-08T16:11:37.436Z : Action Dataset indexing, steps: 53516
2020-09-08T16:11:42.437Z : Action Dataset loading, steps: 52228
2020-09-08T16:11:42.437Z : Action Dataset indexing, steps: 85539
2020-09-08T16:11:47.438Z : Action Dataset loading, steps: 81463
2020-09-08T16:11:47.438Z : Action Dataset indexing, steps: 114774
2020-09-08T16:11:52.440Z : Action Dataset loading, steps: 112571
2020-09-08T16:11:52.443Z : Action Dataset indexing, steps: 145882
2020-09-08T16:11:53.711Z : Action Dataset loading, steps: 121437 -- ended
2020-09-08T16:11:53.712Z : Action Dataset indexing, steps: 154747
2020-09-08T16:11:53.765Z : Action Dataset indexing, steps: 154747 -- ended
2020-09-08T16:11:53.766Z : Action SameAs resolving, steps: 0 -- started
2020-09-08T16:11:54.195Z : Predicates trimming.
2020-09-08T16:11:54.195Z : Action SameAs resolving, steps: 0 -- ended
2020-09-08T16:11:54.318Z : Action Subjects indexing, steps: 0 -- started
2020-09-08T16:11:54.878Z : Subjects trimming.
2020-09-08T16:11:54.878Z : Action Subjects indexing, steps: 140281 -- ended
2020-09-08T16:11:54.948Z : Action Objects indexing, steps: 0 -- started
2020-09-08T16:11:55.341Z : Objects trimming.
2020-09-08T16:11:55.341Z : Action Objects indexing, steps: 140281 -- ended
2020-09-08T16:11:55.407Z : Action Amie rules mining, steps: 0 -- started
2020-09-08T16:11:55.407Z : Action Browsed projections large buckets, steps: 0 -- started
2020-09-08T16:11:55.407Z : Action Amie rules mining, steps: 0 -- started
2020-09-08T16:12:00.423Z : Action Browsed projections large buckets, steps: 0 -- started
2020-09-08T16:12:00.425Z : Action Amie rules mining, steps: 3500 -- processed rules, found closed rules: 1086, queue size: 7609
2020-09-08T16:12:05.436Z : Action Browsed projections large buckets, steps: 0 -- started
2020-09-08T16:12:05.436Z : Action Amie rules mining, steps: 9053 -- processed rules, found closed rules: 2032, queue size: 1631
2020-09-08T16:12:08.207Z : Action Browsed projections large buckets, steps: 0 -- ended
2020-09-08T16:12:08.207Z : Action Amie rules mining, steps: 10206 -- processed rules, found closed rules: 2242, queue size: 0
2020-09-08T16:12:08.208Z : Action Amie rules mining, steps: 10206 -- ended
2020-09-08T16:12:08.261Z : Action PCA Confidence computing, steps: 0 of 1000, progress: 0.0% -- started
2020-09-08T16:12:09.343Z : Action PCA Confidence computing, steps: 1000 of 1000, progress: 100.0% -- ended

In GUI cache into memory

We need to resolve the lifetime (or idle time) of an index in memory, or set a limit in settings: max idle time for an index.

  • save an index
  • remove the index
