The Semantic Motif Initialiser
Summary
Introduction
The KnetMiner Gene Explorer is based on knowledge graphs, that is, datasets that are created and
managed by means of our KnetBuilder framework, also known as Ondex. This framework has two major component types:
- a set of core components for dealing with knowledge graphs (details below). Namely, ONDEXGraph
keeps in RAM a graph of ONDEXConcept instances (ie, nodes), linked together by binary Relation
instances (ie, edges). Concepts and relations have a set of properties, such as their data
source or a list of key/value attributes.
- a plug-in system, which can be used to define and run data workflows that read various data sources (eg,
CSV tables, XML files, web APIs in JSON format) and translate these data into the graph
components above. The typical workflow starts by initialising an empty graph, then multiple data parser/loader
plug-ins populate the graph, and finally the graph is saved from memory to an OXL file (ie, an XML
format based on our own schema, which reflects the graph components above). OXL files are then loaded
into the KnetMiner web app (using an OXL parser), to serve the application functionality.
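For illustration, a workflow of this kind could be sketched as the fragment below. Note that the element names, plug-in identifiers and arguments here are purely illustrative placeholders, not the real ones from the KnetBuilder framework:

```xml
<!-- Hypothetical sketch of a KnetBuilder/Ondex Mini workflow:
     1) start from an empty graph, 2) populate it via parser plug-ins,
     3) save the in-memory graph as an OXL file. -->
<Workflow>
  <Graph name="default" /> <!-- empty in-memory graph -->
  <Parser name="tab-parser">
    <Arg name="InputFile">genes.tsv</Arg>
  </Parser>
  <Export name="oxl">
    <Arg name="ExportFile">dataset.oxl</Arg>
  </Export>
</Workflow>
```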
In addition to loading a (configured) OXL, the web app does much additional initialisation work, which mostly consists of:
- A Lucene-based index is created on-disk to ease keyword-based searches over the OXL graph elements
(eg, searching nodes by name or identifier)
- A graph traversal step, which employs semantic motifs, ie, graph
patterns, to navigate graph paths from genes to relevant entities. Details can be found here.
As said above, currently both the Lucene indexing and the traverser are invoked against a given dataset when the KnetMiner web application is started, ie, when its Docker container is started, which triggers the start of its Tomcat server, which in turn starts the API/WS .war
application (see the KnetMiner wiki for details).
After the traversal stage, the traverser output is saved to disk, and the web application avoids redoing the
whole operation at each restart if these output files are found in a configured location. This is
possible because the result of the traversal operation is always the same for a given dataset/graph, and the
web application uses these data in read-only mode. Similarly, the Lucene indexing is skipped if the
corresponding Lucene directory is found under a configured path.
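The "compute once, reuse at restart" pattern described above can be sketched as below. This is a minimal self-contained illustration with hypothetical names, not the actual KnetMiner code:

```java
import java.nio.file.*;

/**
 * Minimal sketch (hypothetical names) of the caching pattern described above:
 * the expensive initialisation runs only when its on-disk output is missing.
 */
public class CachedInitDemo
{
  static String initTraverserData ( Path outFile ) throws Exception
  {
    if ( Files.exists ( outFile ) )
      return "reused"; // output found at the configured location, skip the work

    // ... here the expensive traversal would run, then persist its result ...
    Files.writeString ( outFile, "traversal results" );
    return "computed";
  }

  public static void main ( String[] args ) throws Exception
  {
    Path out = Files.createTempDirectory ( "demo" ).resolve ( "traverser.out" );
    System.out.println ( initTraverserData ( out ) ); // first run: computed
    System.out.println ( initTraverserData ( out ) ); // restart: reused
  }
}
```

This works because, as noted above, the traversal result is deterministic for a given dataset and is used read-only.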
Despite the latter optimisations, the web application spends a lot of time in this initialisation stage, which
could be moved offline, so that traversal data are created once and for all, during the creation of
the dataset by means of the KnetBuilder workflow system (aka Ondex Mini).
So, the purpose of this document is to move the traversal-invoking code that currently sits inside the
KnetMiner web service into KnetBuilder, by developing proper wrappers to invoke it from the KnetBuilder
framework (details below).
Some details about the traversing
This section provides context for the task at hand; it is not strictly needed to carry it out.
Technically, we have a generic GraphTraverser interface and a (still) default implementation,
which is based on the state machine model. Recently, we have started migrating to a
Cypher-based implementation, which relies both on in-memory data encoded via Ondex
components and on the same data stored in a Neo4j database. Which traverser flavour to use is
decided via the KnetMiner configuration.
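Selecting an implementation by its fully-qualified class name (FQN) typically amounts to a small reflection-based factory. The sketch below is self-contained with stand-in types, not the real Ondex GraphTraverser:

```java
/**
 * Sketch of choosing a traverser implementation via its FQN, as done through
 * configuration. The interface and implementation here are hypothetical
 * stand-ins, not the real Ondex classes.
 */
public class TraverserFactoryDemo
{
  public interface GraphTraverser { String name (); }

  public static class DefaultTraverser implements GraphTraverser {
    public String name () { return "default"; }
  }

  /** Loads and instantiates the traverser class named by fqn. */
  static GraphTraverser createTraverser ( String fqn ) throws Exception
  {
    return (GraphTraverser) Class.forName ( fqn )
      .getDeclaredConstructor ()
      .newInstance ();
  }

  public static void main ( String[] args ) throws Exception
  {
    GraphTraverser t = createTraverser ( "TraverserFactoryDemo$DefaultTraverser" );
    System.out.println ( t.name () ); // prints "default"
  }
}
```

The same mechanism lets the new component pick either the default or the Cypher traverser without a compile-time dependency on the latter (see the requirements below).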
Requirements
We need a component based on the architecture of many Ondex plug-ins, that is:
- A core component, containing the functionality to invoke the traverser. This should contain the bare
minimum needed to execute this functionality; other components should possibly stay elsewhere.
- An Ondex plug-in wrapper, which defines the available options for the traverser (arguments,
in the jargon of the Ondex plug-ins), and invokes the core component above. This will be used in
an Ondex Mini workflow, presumably with a graph that was built by previous steps (ie, other plug-ins) in
the workflow.
- A command-line (CLI) interface, which is another wrapper to the core. This should load an OXL file
from CLI parameters and then pass it to the core traverser component above.
For the moment, do not create a direct dependency on the Cypher traverser artifact; the component should allow for choosing this specific traverser (or the default one) by means of a string
representing its FQN. We'll look at how to organise this dependency later (eg, an optional download in the
workflow binary).
A good (and recent) reference for the architecture outlined above is the graph descriptor component. In particular, note these details:
- the core component (tests and usage examples). The shape of the new traversal
component needs to be agreed upon, eg, a SemanticMotifInitializer class plus a method like init()
(note that I use American spelling when naming code units).
- the plug-in. The new plug-in should subclass ONDEXExport, since this is the one closest
to the meaning of the component at issue. Its implementation won't be much different from the
descriptor example anyway.
- the CLI interface
- Note that recently we started using the picocli library, which is one of
the best around.
- Also, note that the CLI component is a separate Maven project, since it has to include
many dependencies that aren't needed in the core package.
- Moreover, see the CLI POM and the Maven Assembly descriptor file for a
reference on how to organise the build of the final command-line tool binary
(this is a CLI package containing multiple tools/commands, not just the graph descriptor tool).
The build of these two components yields a .zip that, among other files, contains the /lib directory
with all the needed runtime .jar files, and one or more .sh wrappers to invoke the
corresponding CLI class (copy-paste from oxl-descriptor.sh for the new .sh).
See here for info on how these Ondex clients are arranged.
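Such a wrapper script is typically only a couple of lines. The fragment below is an illustrative sketch modelled on the description above; the main class name is hypothetical:

```sh
#!/usr/bin/env bash
# Illustrative launcher, in the style of oxl-descriptor.sh.
# NOTE: the main class name below is a placeholder, not the real one.
cd "$(dirname "$0")"
java -cp "lib/*" some.hypothetical.package.SemanticMotifInitCLI "$@"
```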
The KnetMiner code
What the new component has to do can be seen in (and largely copy-pasted from) the current KnetMiner
code. Namely:
- The current entry point is OndexServiceProvider.initData( <path> )
- This loads all application and traverser options from the (XML) property file it gets passed. This path will
be a parameter for the new component (see below).
- After the options loading, DataService.initGraph() is invoked. This loads the OXL dataset into
memory, into an ONDEXGraph field. In the new component, this graph will be a class field (see below).
- See the loadGraph() code to get an idea of how Parser.loadOXL() is used for this.
UIUtils.removeOldGraphAttributes ( graph ) is a KnetMiner-specific operation, not relevant here.
- The init of genomeGenesCount is instead needed (see below), so the new component will
need to do this, reusing the current code.
- Then, SearchService.indexOndexGraph() is invoked. This creates a Lucene index of many
parts of the OXL graph, which is then used by KnetMiner to perform fast keyword-based searches.
The new component needs to do the same. As you can see, this method gets info from the graph field
and the data path defined in the options file.
- After the indexing, we have semanticMotifDataService.initSemanticMotifData ().
This initialises the traverser, using the options above, and then invokes it. After that, results are saved into
files, in the form of serialised Java objects. We need to replicate/move all the code you see in this
method.
- We also need to support the doReset flag. The best choice I see for this is to replicate the same public
initSemanticMotifData( doReset ) method in the main class SemanticMotifInitializer
of the new component. This is because KnetMiner has a CypherDebugger component,
used for testing and debugging purposes, which needs to trigger the data reinitialisation
from this step only.
- Finally, we have exportService.exportGraphStats(), which saves an XML file of statistics,
obtained from both the OXL and the traversal results (which are used for visualisations
like this). This has to be replicated in the new component too.
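The sequence of steps above can be summarised as a class skeleton. This is a hypothetical sketch: the real Ondex/KnetMiner calls are stubbed out as recorded step names, and only the ordering and the public doReset entry point reflect the design described above:

```java
import java.util.*;

/**
 * Skeleton (hypothetical; real Ondex/KnetMiner calls stubbed out) mirroring
 * the KnetMiner initialisation steps listed above.
 */
public class SemanticMotifInitializerSketch
{
  private Object graph;          // the ONDEXGraph to work on (stubbed here)
  private long genomeGenesCount; // kept, as in the current DataService code
  final List<String> steps = new ArrayList<> ();

  public void init ( String optionsPath )
  {
    steps.add ( "loadOptions" );      // cf. OndexServiceProvider.initData()
    steps.add ( "initGraph" );        // cf. DataService.initGraph() / Parser.loadOXL()
    steps.add ( "countGenomeGenes" ); // init of genomeGenesCount
    steps.add ( "indexGraph" );       // cf. SearchService.indexOndexGraph()
    initSemanticMotifData ( false );  // traversal + result serialisation
    steps.add ( "exportGraphStats" ); // cf. exportService.exportGraphStats()
  }

  /** Public, so components like CypherDebugger can re-trigger this step only. */
  public void initSemanticMotifData ( boolean doReset )
  {
    steps.add ( "initSemanticMotifData(doReset=" + doReset + ")" );
  }

  public static void main ( String[] args )
  {
    SemanticMotifInitializerSketch sketch = new SemanticMotifInitializerSketch ();
    sketch.init ( "data-source-config.xml" );
    sketch.steps.forEach ( System.out::println );
  }
}
```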
The initialiser options
The new component needs the same options/parameters that KnetMiner uses in the code described in the previous section. So, our new SemanticMotifInitializer
will have the following:
- The ONDEXGraph to work with. The simplest solution is to make the component stateful and
maintain a graph field for this, together with other state (eg, the genomeGenesCount
mentioned above).
- The path to an options file. This contains several details needed to run the traverser (and KnetMiner);
some are generic, some are traverser-specific. See an example in this template. In the core
component, this will be a parameter of the init() method. In the plug-in, this will be an argument of
type FileArgument.
- In the core component, define the init() implementation as a private bare method, accepting the
parameter options of type OptionsMap. Then define the public wrapper, which takes the options file path.
- The input OXL to work with. In the core component, this is a parameter of type ONDEXGraph.
The plug-in already has the graph field for this, so no other addition is needed. In the case
of the CLI, this should be the -i/--oxl option. This should be an optional parameter, which, when
defined, should override the DataFile option found in the options file.
- The path of the output directory. This is where the traverser results have to be stored.
Similarly to the input path, this should override the DataPath option in the options file.
In the plug-in, this should be a FileArgument (with isDirectory == true) and in the CLI it should
correspond to the -o/--data-dir option.