scigraph / golr-loader


Convert SciGraph queries into json that can be loaded by Golr

License: Apache License 2.0

Java 100.00%

java golr-loader scigraph solr

golr-loader's Issues

Use de-reification in evidence graphs

The evidence graph currently has association objects as nodes, following the RDF reification pattern. This inflates the graph. It would be better to de-reify these by turning the properties of each association node into properties of the relevant edge.

For each:

?annId oban:association_has_subject ?s
?annId oban:association_has_object ?o
OPTIONAL ?annId oban:association_has_predicate ?p
?annId dc:source ?src
?annId RO_0002558 ?evtype
?annId ?p1 ?v1
...
?annId ?pn ?vn

(the predicate edge is often missing. Why?)

Find this:

{sub: ?s,
 obj: ?o,
 pred: ?p}

and replace with:

{sub: ?s,
 obj: ?o,
 pred: ?p,
 meta: {
  id: ?annId,
  xrefs: [?src],
  ...
 }
}

Note it may be better to do this further upstream in SciGraph itself, in the BBOPGraph mapping algorithm.

Example here:

[
{
"obj": "GO:0030500PHENOTYPE",
"pred": "http://purl.obolibrary.org/obo/RO_0002200",
"sub": ":_ZDB-GENE-050417-357-ZDB-MRPHLNO-111209-6"
},
{
"obj": "ZFIN:ZDB-GENE-050417-357",
"pred": "http://purl.obolibrary.org/obo/GENO_0000443",
"sub": ":_ZDB-GENE-050417-357-ZDB-MRPHLNO-111209-7"
},
{
"obj": ":_ZDB-GENE-050417-357-ZDB-MRPHLNO-111209-6",
"pred": "http://purl.org/oban/association_has_subject",
"sub": "MONARCH:7c16f8c23d210cb4f7f6e2929333789b"
},
{
"obj": "GO:0030500PHENOTYPE",
"pred": "http://purl.org/oban/association_has_object",
"sub": "MONARCH:7c16f8c23d210cb4f7f6e2929333789b"
},
{
"obj": "PMID:22087291",
"pred": "http://purl.org/dc/elements/1.1/source",
"sub": "MONARCH:dfd7c1916c111384ce0649139c5d4eb6"
},
{
"obj": "GO:0030500PHENOTYPE",
"pred": "http://purl.obolibrary.org/obo/RO_0002200",
"sub": ":_ZDB-GENE-050417-357-ZDB-MRPHLNO-111209-7"
},
{
"obj": "GO:0030500PHENOTYPE",
"pred": "http://purl.org/oban/association_has_object",
"sub": "MONARCH:dfd7c1916c111384ce0649139c5d4eb6"
},
{
"obj": ":_ZDB-GENE-050417-357-ZDB-MRPHLNO-111209-7",
"pred": "http://purl.org/oban/association_has_subject",
"sub": "MONARCH:dfd7c1916c111384ce0649139c5d4eb6"
},
{
"obj": "PMID:22087291",
"pred": "http://purl.org/dc/elements/1.1/source",
"sub": "MONARCH:dfd7c1916c111384ce0649139c5d4eb6"
},
{
"obj": "PMID:22087291",
"pred": "http://purl.org/dc/elements/1.1/source",
"sub": "MONARCH:7c16f8c23d210cb4f7f6e2929333789b"
},
{
"obj": "ECO:0000059",
"pred": "http://purl.obolibrary.org/obo/RO_0002558",
"sub": "MONARCH:dfd7c1916c111384ce0649139c5d4eb6"
},
{
"obj": "PMID:22087291",
"pred": "http://purl.org/dc/elements/1.1/source",
"sub": "MONARCH:7c16f8c23d210cb4f7f6e2929333789b"
},
{
"obj": "ECO:0000059",
"pred": "http://purl.obolibrary.org/obo/RO_0002558",
"sub": "MONARCH:7c16f8c23d210cb4f7f6e2929333789b"
},
{
"obj": "ZFIN:ZDB-GENE-050417-357",
"pred": "http://purl.obolibrary.org/obo/GENO_0000443",
"sub": ":_ZDB-GENE-050417-357-ZDB-MRPHLNO-111209-6"
}
]
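The find-and-replace described above could be sketched roughly as follows. This is a minimal in-memory illustration, not the loader's or SciGraph's actual code: the `Dereifier` and `Edge` types are hypothetical, the `meta` keys follow the proposal above, and an association whose predicate edge is missing simply gets a null `pred`. Direct edges that already exist between the subject and object are left untouched.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class Dereifier {
    static final String HAS_SUBJECT = "http://purl.org/oban/association_has_subject";
    static final String HAS_OBJECT = "http://purl.org/oban/association_has_object";
    static final String HAS_PREDICATE = "http://purl.org/oban/association_has_predicate";
    static final String DC_SOURCE = "http://purl.org/dc/elements/1.1/source";

    /** Hypothetical edge type: sub/pred/obj plus a meta map (empty for plain edges). */
    static final class Edge {
        final String sub;
        final String pred;
        final String obj;
        final Map<String, Object> meta = new LinkedHashMap<>();

        Edge(String sub, String pred, String obj) {
            this.sub = sub;
            this.pred = pred;
            this.obj = obj;
        }
    }

    /** Collapses every association node into one edge that carries the node's properties as meta. */
    static List<Edge> dereify(List<Edge> edges) {
        // Group outgoing edges by their subject node.
        Map<String, List<Edge>> bySub = new LinkedHashMap<>();
        for (Edge e : edges) {
            bySub.computeIfAbsent(e.sub, k -> new ArrayList<>()).add(e);
        }
        Set<String> associationIds = new LinkedHashSet<>();
        List<Edge> out = new ArrayList<>();
        for (Map.Entry<String, List<Edge>> entry : bySub.entrySet()) {
            String s = null;
            String o = null;
            String p = null;
            Set<String> xrefs = new LinkedHashSet<>();
            Map<String, Object> extra = new LinkedHashMap<>();
            for (Edge e : entry.getValue()) {
                if (HAS_SUBJECT.equals(e.pred)) s = e.obj;
                else if (HAS_OBJECT.equals(e.pred)) o = e.obj;
                else if (HAS_PREDICATE.equals(e.pred)) p = e.obj;
                else if (DC_SOURCE.equals(e.pred)) xrefs.add(e.obj);
                else extra.put(e.pred, e.obj); // last value wins in this sketch
            }
            if (s == null || o == null) continue; // not an association node
            associationIds.add(entry.getKey());
            Edge merged = new Edge(s, p, o); // p stays null when the predicate edge is missing
            merged.meta.put("id", entry.getKey());
            merged.meta.put("xrefs", new ArrayList<>(xrefs));
            merged.meta.putAll(extra);
            out.add(merged);
        }
        // Keep every edge that does not start at an association node.
        for (Edge e : edges) {
            if (!associationIds.contains(e.sub)) out.add(e);
        }
        return out;
    }
}
```

Collecting `dc:source` values into a set also removes the duplicated source edges that appear in the example.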

Add edge label to evidence graph

It would be useful to have edge labels available when creating visuals of an evidence graph.

Getting labels for edges is currently a two-step process:
Get the iri for an edge
Find the node with the same iri property and get its label

Two ideas:

Add label property to edges for whole graph
Add labels only to evidence graph (as json) in golr json
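Either idea boils down to doing the two-step lookup once at load time instead of in every client. A minimal sketch, assuming the node labels are already available as an IRI-to-label map; the `EdgeLabeler` class and the `lbl` field name are hypothetical, not part of the current golr JSON:

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class EdgeLabeler {
    /** Step 2 of the lookup: the label of the node whose iri equals the edge IRI, or the IRI itself if there is none. */
    static String labelFor(String predicateIri, Map<String, String> labelByIri) {
        String label = labelByIri.get(predicateIri);
        return label != null ? label : predicateIri;
    }

    /** Builds one evidence-graph edge that already carries its label. */
    static Map<String, String> labeledEdge(String sub, String pred, String obj,
                                           Map<String, String> labelByIri) {
        Map<String, String> edge = new LinkedHashMap<>();
        edge.put("sub", sub);
        edge.put("pred", pred);
        edge.put("obj", obj);
        edge.put("lbl", labelFor(pred, labelByIri)); // "lbl" is an assumed field name
        return edge;
    }
}
```

Falling back to the IRI keeps the field populated for edges whose relation has no labelled node in the graph.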

Add taxon aspect to graph

Add taxon to both subject and object with the following pattern:
(subject)-[:subClassOf|type|part_of*0..]->(thing)-[:RO_0002162]->(taxon)
with the taxon being used as a closure
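To make the pattern concrete, here is a small in-memory sketch of the traversal. It is only an illustration under stated assumptions: the real loader would run this against Neo4j, and the `TaxonFinder` class, the `{source, relationType, target}` edge encoding, and the short relation names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class TaxonFinder {
    // Relation types that may be followed zero or more times before the in-taxon hop.
    static final Set<String> PATH_TYPES =
        new HashSet<>(Arrays.asList("subClassOf", "type", "part_of"));
    static final String IN_TAXON = "RO_0002162";

    /** Each edge is {source, relationType, target}. Returns every taxon reachable via the pattern. */
    static Set<String> taxaOf(String start, List<String[]> edges) {
        Set<String> taxa = new LinkedHashSet<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        seen.add(start);
        while (!queue.isEmpty()) {
            String node = queue.poll();
            for (String[] e : edges) {
                if (!e[0].equals(node)) continue;
                if (IN_TAXON.equals(e[1])) {
                    taxa.add(e[2]); // the final (thing)-[:RO_0002162]->(taxon) hop
                } else if (PATH_TYPES.contains(e[1]) && seen.add(e[2])) {
                    queue.add(e[2]); // keep walking subClassOf|type|part_of
                }
            }
        }
        return taxa;
    }
}
```

Because the start node is itself checked for an in-taxon edge, the zero-hop case of the `*0..` quantifier is covered.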

Add list of orthologs to each gene-phenotype and gene-disease document

When on a gene page, we want to see any phenotype/disease associated with any of its orthologs. At one point we were waiting on upgrading solr to join the orthology/gene-phenotype queries.

However, I think we can get this information if we store a list of orthologs for a gene for each record.

For example, for SHH, we would add:
orthologs: ['ZFIN:ZDB-GENE-980526-41', 'ZFIN:ZDB-GENE-980526-166', 'ZFIN:ZDB-GENE-980526-41', 'NCBIGene:716553','NCBIGene:100016531', 'NCBIGene:100557233', 'NCBIGene:42737','NCBIGene:608860', 'NCBIGene:100512749', 'NCBIGene:29499', 'MGI:98297', 'ZFIN:ZDB-GENE-980526-166']

The pattern is:
(gene)-[:RO:HOM0000017|RO:HOM0000020]-(ortholog)

Remove IRI field from search core

We want to remove IRI and equivalent IRIs from our search index, https://github.com/SciGraph/golr-loader/blob/master/src/main/java/org/monarch/golr/SimpleLoader.java#L99

Refactor loading into "aspects"

Move domain specific logic into more granular classes and add this to the configuration specification.

Rewrite golr loader to convert results to SolrInputDocument instead of JSON

Currently we serialize scigraph results as a list of JSON documents, and then post the entire JSON file to solr. This has worked well in the past, but seems to cause issues as these JSON documents have gotten larger.

As an alternative, we can use the SolrJ API to construct SolrInputDocument objects and post these to the server in batches of 100k or so. It's unclear whether this will result in any performance boost, as SolrJ appears to send them to the server over HTTP regardless. At a minimum this should fix #27 and possibly #30.

As a test I've reworked the golr worker to convert the JSON documents to SolrInputDocuments - which is a pretty minimal change but is slightly less performant. After chatting with @kltm and @cmungall I'm planning on moving ahead with the larger refactor of removing the JSON intermediate step.
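The batching part of this could look roughly like the sketch below. It uses only the standard library so that it stands alone: the `BatchPoster` class is hypothetical, and the `sender` callback stands in for whatever actually ships a batch. In the real loader the element type would presumably be `SolrInputDocument` and the callback would wrap SolrJ's `SolrClient.add(Collection<SolrInputDocument>)`, with its checked exceptions handled there.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class BatchPoster<T> {
    private final int batchSize;
    private final Consumer<List<T>> sender;
    private final List<T> buffer = new ArrayList<>();

    BatchPoster(int batchSize, Consumer<List<T>> sender) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.sender = sender;
    }

    /** Buffers one document and sends the buffer as soon as it reaches batchSize. */
    void add(T doc) {
        buffer.add(doc);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Sends whatever is buffered; call once more after the last document. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sender.accept(new ArrayList<>(buffer)); // copy, so the sender may keep the list
        buffer.clear();
    }
}
```

With this shape no complete JSON file is ever built or held in memory; only one batch is buffered at a time.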

@benwbooth I'm wondering if it will conflict with your work on #17 ?

Indexing behavior when a sub-object pair is linked by multiple relations

Consider the following pattern:

(subject:gene)<-[has_locus]-(variant)-[relation]->(object:disease)

Where relation is one of:

pathogenic
likely pathogenic
has phenotype
marker/mechanism
contributes to
...

In many cases, multiple variants of a single gene are linked to a disease via multiple relations (commonly pathogenic and likely pathogenic). Currently, the solr loader seems to pick a relation at random (although this may not be the case and it may in fact be deterministic for a given db).

This is also an issue with combining orthology statements from multiple sources (panther and zfin) where panther specifies whether two orthologs have a 1 to 1 relationship whereas zfin does not.

One option is to store the set of relations linking two nodes. Another option would be to configure a relation priority, where the relation with the highest priority is designated while the others are retrievable via the evidence graph.
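A rough sketch of the second option, assuming the priority order comes from configuration as a simple list. The `RelationPicker` class is hypothetical, and the lexicographic fallback for unlisted relations is just one way to keep the choice deterministic:

```java
import java.util.Collection;
import java.util.Collections;
import java.util.List;

final class RelationPicker {
    /**
     * Returns the first relation in priority order that links the pair.
     * If none of the relations is listed, falls back to the lexicographically
     * smallest one so that the result does not depend on iteration order.
     */
    static String pick(Collection<String> relations, List<String> priority) {
        for (String candidate : priority) {
            if (relations.contains(candidate)) {
                return candidate;
            }
        }
        return relations.isEmpty() ? null : Collections.min(relations);
    }
}
```

The relations that are not picked would still need to be kept somewhere, for example in the evidence graph as suggested above.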

@mbrush @selewis @cmungall thoughts?

Solr indexing sometimes fails

Not sure why this is happening, especially with a local solr instance. It doesn't seem to index the file at all.

The first thing to do is to make the process fail; for now the exception is swallowed silently. Then, since the file is not indexed at all, a retry can be attempted, or the JSON file can at least be kept for a manual retry.

INFO: Posting JSON genotype-phenotype.yaml.json to http://localhost:8983/solr/golr
Feb 27, 2017 6:07:32 AM org.monarch.golr.GolrWorker call
SEVERE: Failed to post JSON genotype-phenotype.yaml.json
java.net.SocketException: Broken pipe (Write failed)
	at java.net.SocketOutputStream.socketWrite0(Native Method)
	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
	at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
	at org.apache.http.impl.io.SessionOutputBufferImpl.streamWrite(SessionOutputBufferImpl.java:126)
	at org.apache.http.impl.io.SessionOutputBufferImpl.flushBuffer(SessionOutputBufferImpl.java:138)
	at org.apache.http.impl.io.SessionOutputBufferImpl.write(SessionOutputBufferImpl.java:169)
	at org.apache.http.impl.io.ContentLengthOutputStream.write(ContentLengthOutputStream.java:115)
	at org.apache.http.client.fluent.InternalFileEntity.writeTo(InternalFileEntity.java:75)
	at org.apache.http.impl.DefaultBHttpClientConnection.sendRequestEntity(DefaultBHttpClientConnection.java:158)
	at org.apache.http.impl.conn.CPoolProxy.sendRequestEntity(CPoolProxy.java:162)
	at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:237)
	at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:122)
	at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:271)
	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
	at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
	at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
	at org.apache.http.client.fluent.Request.internalExecute(Request.java:173)
	at org.apache.http.client.fluent.Executor.execute(Executor.java:262)
	at org.monarch.golr.GolrWorker.call(GolrWorker.java:79)
	at org.monarch.golr.GolrWorker.call(GolrWorker.java:25)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
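The fail-loudly-then-retry behaviour proposed above might be sketched like this. It is a generic helper, not the GolrWorker code: the `RetryingPoster` class is hypothetical, and the action passed in is assumed to be the HTTP post of one JSON file.

```java
import java.util.concurrent.Callable;

final class RetryingPoster {
    /**
     * Runs the action up to maxAttempts times and returns its first successful result.
     * If every attempt fails, throws instead of swallowing the error, so the caller
     * can stop the pipeline and keep the JSON file for a manual retry.
     */
    static <T> T withRetries(Callable<T> action, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e; // remember the failure and try again
            }
        }
        throw new IllegalStateException("failed after " + maxAttempts + " attempts", last);
    }
}
```

A real version would probably wait between attempts and log each failure; both are left out to keep the sketch short.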

Queries should accept CURIEs in START statements

At the moment, queries like:
START chromosomeClass = node:node_auto_index(iri='SO:0000340')
won't be resolved to IRIs.

Ideally it should behave like the /cypher/execute and /cypher/resolve services of SciGraph.
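One way to approximate this is to rewrite the query text before it is executed. The sketch below is an assumption-laden illustration, not how SciGraph's services do it: the `CurieQueryResolver` class is hypothetical, only `iri='...'` arguments are handled, and the prefix map is supplied by the caller.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CurieQueryResolver {
    // Matches iri='PREFIX:LOCAL'. The local part may not start with '/',
    // so full IRIs such as 'http://...' are left alone.
    private static final Pattern IRI_ARG =
        Pattern.compile("iri='([A-Za-z][\\w.-]*):([^'/][^']*)'");

    /** Expands every CURIE whose prefix is known; unknown prefixes are left unchanged. */
    static String resolve(String query, Map<String, String> prefixes) {
        Matcher m = IRI_ARG.matcher(query);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String base = prefixes.get(m.group(1));
            String replacement =
                base == null ? m.group() : "iri='" + base + m.group(2) + "'";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With a map entry from `SO` to `http://purl.obolibrary.org/obo/SO_`, the START clause above would be rewritten to use `http://purl.obolibrary.org/obo/SO_0000340`.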

Add closure_maps to load

In addition to label and ID closure there should be JSON closure maps that resemble:

regulates_closure_map: "{"GO:0051716":"cellular response to stimulus","GO:0007264":"small GTPase mediated signal transduction","GO:0009987":"cellular process","GO:0023052":"signaling","GO:0044699":"single-organism process","GO:0044763":"single-organism cellular process","GO:0065007":"biological regulation","GO:0005623":"cell","GO:0007265":"Ras protein signal transduction","GO:0005575":"cellular_component","GO:0005622":"intracellular","GO:0035556":"intracellular signal transduction","GO:0008150":"biological_process","GO:0007165":"signal transduction","GO:0007154":"cell communication","GO:0050794":"regulation of cellular process","GO:0044464":"cell part","GO:0050789":"regulation of biological process","GO:0044700":"single organism signaling","GO:0032482":"Rab protein signal transduction","GO:0050896":"response to stimulus"}",
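A minimal sketch of producing such a string from an ID-to-label map, assuming the closure IDs and labels have already been collected. The `ClosureMapJson` class is hypothetical, and the hand-rolled escaping covers only quotes and backslashes; a real implementation would more likely use a JSON library.

```java
import java.util.Map;

final class ClosureMapJson {
    /** Serializes the map as a single JSON object, preserving the map's iteration order. */
    static String toJson(Map<String, String> labelById) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : labelById.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            first = false;
            sb.append('"').append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append('"');
        }
        return sb.append('}').toString();
    }

    // Escapes backslashes and double quotes only; control characters are not handled.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```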

Refactor evidence processing

What is currently evidence should become an evidence_object closure.
The new evidence should be just the closure of all evidence in the graph.
Add a source closure with all the sources from associations.

taxon missing for instances

When something is declared as an instance of a taxon, the taxon_id/label/closure is not getting populated.

For example, from the omia data, their "breeds" are declared as instances of a taxon:

:_omiabreedkey100 a OBO:NCBITaxon_9913 ;
    rdfs:label "Red Spotted Czech (cattle)" .

But these aren't showing up in the various taxon closures. I believe this is because :type is not included in the path when trying to find the taxon. This needs fixing.

ArrayIndexOutOfBoundsException when splitting curies

java.lang.ArrayIndexOutOfBoundsException: 1

    at org.monarch.golr.SimpleLoader.generate (SimpleLoader.java:98)

    at org.monarch.golr.SimpleLoaderMain.main (SimpleLoaderMain.java:91)

    at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:254)

    at java.lang.Thread.run (Thread.java:748)

https://github.com/SciGraph/golr-loader/blob/master/src/main/java/org/monarch/golr/SimpleLoader.java#L98

looks like we get a curie without a colon (?)
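If that is the cause, a defensive split would avoid the exception. The sketch below is a guess at a fix rather than a patch for SimpleLoader: the `CurieParts` class is hypothetical, and it assumes the failing line indexes element 1 of a split on ':'.

```java
final class CurieParts {
    /**
     * Returns {prefix, localId}, or null when the input is null or has no colon
     * separating two non-empty parts. Only the first colon is treated as the separator.
     */
    static String[] split(String curie) {
        if (curie == null) {
            return null;
        }
        int i = curie.indexOf(':');
        if (i <= 0 || i == curie.length() - 1) {
            return null; // no colon, empty prefix, or empty local id
        }
        return new String[] { curie.substring(0, i), curie.substring(i + 1) };
    }
}
```

The caller can then skip or log the offending identifier instead of aborting the whole load.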

Too many values for UnInvertedField faceting on field object_closure

Solr-dev is currently returning an exception for any biolink call that facets over object_closures:
java.lang.IllegalStateException: Too many values for UnInvertedField faceting on field object_closure

We appear to have hit some limit, but it's not obvious from looking at the data:
Solr Production
Total Docs: 37502996
Unique values in object_closure: 4005165

Solr Dev:
Total Docs: 38759008
Unique values in object_closure: 4350271

It's also possible that the limit is based on the number of values per document, but this is harder to gather without iterating over each document.

Possible solutions:

Use facet.method=enum, but this is about 10x slower than method=fc
Use docValues - https://lucene.apache.org/solr/guide/6_6/docvalues.html, requires small update to golr schema, https://github.com/berkeleybop/golr-schema
Update solr to 7+
Reindex and hope for the best, possible that small data changes could affect these counts

cc @kltm @DoctorBud @deepakunni3

chromosome queries throw NPE

This is due to the fix for the evidence graphs, which assumes that a query will always have a subject and an object.

NotInTransactionException in golr loader

Stack trace:

May 05, 2017 1:09:48 AM org.monarch.golr.GolrWorker call
INFO: Deleting JSON pathway-phenotype.yaml.json
May 05, 2017 1:09:51 AM org.monarch.golr.GolrWorker call
INFO: pathway-phenotype.yaml.json done
May 05, 2017 1:09:51 AM org.monarch.golr.GolrWorker call
INFO: Posting JSON gene-phenotype.yaml.json to http://localhost:8983/solr/golr
May 05, 2017 1:47:40 AM org.monarch.golr.GolrWorker call
INFO: {"responseHeader":{"status":0,"QTime":2269002}}

**May 05, 2017 1:47:40 AM org.monarch.golr.GolrWorker call
INFO: Deleting JSON gene-phenotype.yaml.json
May 05, 2017 1:47:42 AM org.monarch.golr.GolrWorker call
INFO: gene-phenotype.yaml.json done**
May 05, 2017 1:47:42 AM org.monarch.golr.GolrWorker call
INFO: Posting JSON literature-variant.yaml.json to http://localhost:8983/solr/golr
May 05, 2017 3:52:05 AM org.monarch.golr.GolrWorker call
INFO: {"responseHeader":{"status":0,"QTime":7462131}}

May 05, 2017 3:52:05 AM org.monarch.golr.GolrWorker call
INFO: Deleting JSON literature-variant.yaml.json
May 05, 2017 3:52:07 AM org.monarch.golr.GolrWorker call
INFO: literature-variant.yaml.json done
Exception in thread "Golr processor - gene-phenotype.yaml.json" java.util.concurrent.ExecutionException: org.neo4j.graphdb.NotInTransactionException: The statement has been closed.
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
	at org.monarch.golr.Pipeline.main(Pipeline.java:190)
Caused by: org.neo4j.graphdb.NotInTransactionException: The statement has been closed.
	at org.neo4j.kernel.impl.api.KernelStatement.assertOpen(KernelStatement.java:153)
	at org.neo4j.kernel.impl.api.OperationsFacade.nodeGetRelationships(OperationsFacade.java:346)
	at org.neo4j.kernel.impl.core.NodeProxy$1.iterator(NodeProxy.java:189)
	at org.neo4j.kernel.impl.core.NodeProxy$1.iterator(NodeProxy.java:181)
	at io.scigraph.internal.EvidenceAspect.invoke(EvidenceAspect.java:80)
	at org.monarch.golr.EvidenceProcessor.addAssociations(EvidenceProcessor.java:68)
	at org.monarch.golr.GolrLoader.serializeGolrQuery(GolrLoader.java:433)
	at org.monarch.golr.GolrLoader.process(GolrLoader.java:337)
	at org.monarch.golr.GolrWorker.call(GolrWorker.java:53)
	at org.monarch.golr.GolrWorker.call(GolrWorker.java:25)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)

add flag to delete json as soon as they're indexed

Since the golr loader is stable, we can delete the generated json files on the fly to save disk space.

Add a generic object loader

The current loader is association-centric; i.e. each document is a relationship between two objects. This is useful for the majority of queries.

It would also be useful to have an object-centric loader. (TBD: define yaml in monarch repo). E.g. for monarch-initiative/monarch-app#756 (one row per variant).

Here we would have only one document per object. Relationships would be loaded into a multi-valued field named after the property. This means we have a less generic schema than oban.

Example fields (core):

id/curie
label
category
type
type_closure
description/definition
synonyms

This would be extended depending on the object type. E.g. for genomic features like variants we may have chrom, start, end (for simplicity we would flatten to a single reference; for more complexity use cypher). For variants we may have a pathogenicity score. Etc.
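As a rough illustration of "one document per object, one multi-valued field per property", here is a sketch. It is hypothetical throughout: the `ObjectDocBuilder` class, the `{property, value}` pair encoding, and the example field names are assumptions until the core schema is defined.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ObjectDocBuilder {
    /**
     * Builds a single document for one object. Each relationship is a
     * {property, value} pair and is appended to the field named after the property.
     */
    static Map<String, List<String>> build(String id, String label,
                                           List<String[]> relationships) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.computeIfAbsent("id", k -> new ArrayList<>()).add(id);
        doc.computeIfAbsent("label", k -> new ArrayList<>()).add(label);
        for (String[] rel : relationships) {
            doc.computeIfAbsent(rel[0], k -> new ArrayList<>()).add(rel[1]);
        }
        return doc;
    }
}
```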

@cmungall will define core schema, cc @nlwashington

Note that the mechanism here could be used to load ontology classes; but may as well just use owltools loader for this (cc @hdietze @kltm)

Add `relation` to all queries

All direct queries should fill in their relation. Inferred relations can be left null.

Extend golr pattern to include subj/obj equivalents and xrefs

In addition to sub/obj_closure, it would be useful to have

sub/obj_equivalents - all members of equivalence clique (reflexive: includes self)
sub/obj_xrefs- all xrefs annotations for all members of clique (reflexive: includes self.curie)

These would be useful for query purposes. Currently in biolink we take the input ID and use subject_closure, which includes equivalence, but also allows parents which we don't want.

xrefs are useful for near-equivalence. E.g. when querying for gene-disease, we want to allow input of uniprot protein xref

add a top-level meta tag to track the cypher query used to write the inferred association

From https://github.com/monarch-initiative/configs/issues/18

This will make the queries transparent and reusable (and is more meaningful than 'config' - this is our biological inferences, more than configuration!).

We should also add something to the golr export as provenance that states "we have association A as inferred using query Q", where Q is something like monarch:cypher/gene-phenotype
nodes: [...]
edges: [...]
meta: {
  query: "monarch:cypher/gene-phenotype"
}

Add configuration option to include / exclude evidence

Evidence graph, evidence, and evidence_object should be optional.

Index all associations instead of merging subject-object pairs per query

Currently we merge associations with the same subject-object pairs from a single query, and track evidence/provenance in an evidence graph object. It has been requested that we no longer do this in order to better show evidence/provenance of an association.

cc @cmungall

Filter out obsoleted terms from search

Currently obsoleted terms make it into our search index. Anchoring the category/categories can fix this, but it would be better not to index them in the first place - unless we foresee a use case where we want to search for obsoleted terms.

Version mismatch?

Hello, I was just exploring this project and wanted to compile it, but the compilation failed because of a SciGraph version mismatch: SciGraph is at version 2.2, while golr-loader looks for 2.1 when it compiles. Here is the output from my console: https://gist.github.com/yy20716/a5685e97e2125718939da095231b4b77 I wonder whether I simply need to update the version in pom.xml. Thank you.

add scores/stats for subject/object

I would like for all nodes to have some summary stats attached to them for us to perform some downstream enhancements on...

minimally, this would be the IC score for any ontology node, but in the future we could look at storing some other things like the annot sufficiency score on other node types, or perhaps the min/max/sum of the annotations for annotated nodes like genes/genotypes (though those could be computed on the fly if we had the IC).

we could start by simply adding a subject_ic and object_ic. alternatively, @cmungall should the "quad" just be updated to be a "quint" to have the IC always part of it?

Orphanet disease-phenotype associations not loading into Golr?

From: monarch-initiative/hpo-annotation-data#3

Solr has no results:

http://geoffrey.crbs.ucsd.edu:8080/solr/golr/select/?q=*:*&fq=object_category:phenotype&fq=subject:%22Orphanet:881%22&rows=10&facet=true&facet.field=object_category&facet.field=subject_category&wt=json

But the data is in SciGraph:

http://rosie.crbs.ucsd.edu:9000/scigraph/dynamic/diseases/Orphanet:881/phenotypes.json

However, I noticed the Orphanet node lacks a category (should be disease) - is this the cause?

cc @mellybelly

Fix sources and evidences

For scalability reasons, we removed the paths from the golr queries. As a result, we now lack the information needed to get the right associations.

Consider aligning getNeighbors and getClosure methods

We use depthFirst() when generating a closure for an ID; however, we should consider using breadthFirst() instead to get the full set of classes.

See

golr-loader/src/main/java/org/monarch/golr/ClosureUtil.java

Line 89 in 1f3c4af

TraversalDescription description = graphDb.traversalDescription().depthFirst().uniqueness(Uniqueness.NODE_GLOBAL);

vs.

https://github.com/SciGraph/SciGraph/blob/8e46c9682e7bdfb6afbd5e1bde830bee2a745db7/SciGraph-core/src/main/java/io/scigraph/internal/GraphApi.java#L84

Investigate slow queries

gene-gene.yaml
gene-phenotype.yaml
literature-phenotype.yaml

These ones are the slowest to complete.

English spelling/synonyms

For example, colour blindness: https://api.monarchinitiative.org/api/search/entity/Colour%20blindness?prefix=HP

Unable to get gene phenotype relationships in current view

I'm unable to fetch gene phenotype associations from the current GOlr view, although I can get them using the SciGraph dynamic services, for example: http://duckworth.crbs.ucsd.edu:9000/scigraph/dynamic/features/NCBIGene:6469/phenotypes

Example Golr attempt:
object_category = phenotype, subject_closure = NCBIGene:6469

Consider indexing non clique leaders

I ran into an issue today where I could not find a label for an HPO class, and tracked it to:
obophenotype/upheno#190 and,
obophenotype/upheno#149

Should we be indexing all non-clique leaders? This wouldn't work well with MONDO since non clique-leaders do not contain rdfs:label annotation properties.

Index synonyms from non-clique leader IDs

See monarch-initiative/monarch-ui#319

The clique merge does not merge node properties, meaning that things like synonyms will not get passed along to the search index (which only operates on clique leader ids)

Configure paths for anatomy queries

Typically when we populate the closure field, we traverse the closure over SubClassOf only

If the category is Anatomy, then the closure should be over SubClassOf|BFO:0000050.

This can be tested when this is implemented:
monarch-initiative/dipper#147 (bgee gene expression loads)

Side note: really this should not need the addition of procedural code. We should have a generic configuration for this. This will be a separate ticket, for now see geneontology/amigo#210 (comment)
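A generic configuration along those lines could be as small as a per-category override of the closure relations. This is only a sketch of the idea: the `ClosureRelations` class, the relation spellings, and the `anatomy` category key are assumptions, not existing configuration.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class ClosureRelations {
    // Default: traverse the closure over SubClassOf only.
    static final List<String> DEFAULT = Collections.singletonList("subClassOf");

    /** Returns the relations to traverse for a category, or the default if none is configured. */
    static List<String> forCategory(String category, Map<String, List<String>> overrides) {
        List<String> configured = overrides.get(category);
        return configured != null ? configured : DEFAULT;
    }
}
```

With an override mapping `anatomy` to `subClassOf` and `BFO:0000050`, anatomy queries would traverse part-of as well, while every other category keeps the default without any procedural special case.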

use abbreviations for evidence codes

we've had a request to display the abbreviation label for "evidence" instead of the full label. (see monarch-initiative/dipper#177)

does this information come in the current golr, or do we need a special way of adding this at golr-generation time? if it is already in the golr, then we need to move this ticket to the monarch-app layer.

This seems like it ought to be a configurable thing, where the label that is placed in golr may be the "label" by default, but could be a special synonym type.
