scigraph / golr-loader
Convert SciGraph queries into json that can be loaded by Golr
License: Apache License 2.0
The evidence graph currently has association objects as nodes, following the RDF reification pattern. This inflates the graph. It would be better to de-reify these by moving the association node's properties onto the relevant edge.
For each:
?annId oban:association_has_subject ?s
?annId oban:association_has_object ?o
OPTIONAL ?annId oban:association_has_predicate ?p
?annId dc:source ?src
?annId RO_0002558 ?evtype
?annId ?p1 ?v1
...
?annId ?pn ?vn
(the predicate edge is often missing. Why?)
Find this:
{sub: ?s,
obj: ?o,
pred: ?p}
and replace with:
{sub: ?s,
obj: ?o,
pred: ?p,
meta: {
id: ?annId,
xrefs: [?src],
...
}
}
Note it may be better to do this further upstream in SciGraph itself, in the BBOPGraph mapping algorithm.
Example here:
[
{
"obj": "GO:0030500PHENOTYPE",
"pred": "http://purl.obolibrary.org/obo/RO_0002200",
"sub": ":_ZDB-GENE-050417-357-ZDB-MRPHLNO-111209-6"
},
{
"obj": "ZFIN:ZDB-GENE-050417-357",
"pred": "http://purl.obolibrary.org/obo/GENO_0000443",
"sub": ":_ZDB-GENE-050417-357-ZDB-MRPHLNO-111209-7"
},
{
"obj": ":_ZDB-GENE-050417-357-ZDB-MRPHLNO-111209-6",
"pred": "http://purl.org/oban/association_has_subject",
"sub": "MONARCH:7c16f8c23d210cb4f7f6e2929333789b"
},
{
"obj": "GO:0030500PHENOTYPE",
"pred": "http://purl.org/oban/association_has_object",
"sub": "MONARCH:7c16f8c23d210cb4f7f6e2929333789b"
},
{
"obj": "PMID:22087291",
"pred": "http://purl.org/dc/elements/1.1/source",
"sub": "MONARCH:dfd7c1916c111384ce0649139c5d4eb6"
},
{
"obj": "GO:0030500PHENOTYPE",
"pred": "http://purl.obolibrary.org/obo/RO_0002200",
"sub": ":_ZDB-GENE-050417-357-ZDB-MRPHLNO-111209-7"
},
{
"obj": "GO:0030500PHENOTYPE",
"pred": "http://purl.org/oban/association_has_object",
"sub": "MONARCH:dfd7c1916c111384ce0649139c5d4eb6"
},
{
"obj": ":_ZDB-GENE-050417-357-ZDB-MRPHLNO-111209-7",
"pred": "http://purl.org/oban/association_has_subject",
"sub": "MONARCH:dfd7c1916c111384ce0649139c5d4eb6"
},
{
"obj": "PMID:22087291",
"pred": "http://purl.org/dc/elements/1.1/source",
"sub": "MONARCH:dfd7c1916c111384ce0649139c5d4eb6"
},
{
"obj": "PMID:22087291",
"pred": "http://purl.org/dc/elements/1.1/source",
"sub": "MONARCH:7c16f8c23d210cb4f7f6e2929333789b"
},
{
"obj": "ECO:0000059",
"pred": "http://purl.obolibrary.org/obo/RO_0002558",
"sub": "MONARCH:dfd7c1916c111384ce0649139c5d4eb6"
},
{
"obj": "PMID:22087291",
"pred": "http://purl.org/dc/elements/1.1/source",
"sub": "MONARCH:7c16f8c23d210cb4f7f6e2929333789b"
},
{
"obj": "ECO:0000059",
"pred": "http://purl.obolibrary.org/obo/RO_0002558",
"sub": "MONARCH:7c16f8c23d210cb4f7f6e2929333789b"
},
{
"obj": "ZFIN:ZDB-GENE-050417-357",
"pred": "http://purl.obolibrary.org/obo/GENO_0000443",
"sub": ":_ZDB-GENE-050417-357-ZDB-MRPHLNO-111209-6"
}
]
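The find-and-replace described above can be sketched in Python. This is a minimal sketch, not the loader's implementation: it assumes edges are dicts with `sub`, `pred` and `obj` keys as in the example, treats any node with both an `association_has_subject` and an `association_has_object` edge as an association, and adds an `evidence` key to `meta` that the issue does not name. If several associations describe the same edge, this sketch keeps only the last one.

```python
from collections import defaultdict

OBAN = "http://purl.org/oban/"
HAS_SUBJECT = OBAN + "association_has_subject"
HAS_OBJECT = OBAN + "association_has_object"
HAS_PREDICATE = OBAN + "association_has_predicate"
DC_SOURCE = "http://purl.org/dc/elements/1.1/source"
HAS_EVIDENCE = "http://purl.obolibrary.org/obo/RO_0002558"


def dereify(edges):
    """Fold association nodes into a `meta` block on the edge they describe."""
    # Collect every outgoing edge of each node, grouped by predicate.
    by_node = defaultdict(lambda: defaultdict(list))
    for e in edges:
        by_node[e["sub"]][e["pred"]].append(e["obj"])
    # Only nodes with both a subject and an object edge count as associations.
    assocs = {
        node: props
        for node, props in by_node.items()
        if HAS_SUBJECT in props and HAS_OBJECT in props
    }

    result = []
    for e in edges:
        if e["sub"] in assocs:
            continue  # reified edge: dropped, its content moves into meta
        out = dict(e)
        for assoc_id, props in assocs.items():
            same_ends = e["sub"] in props[HAS_SUBJECT] and e["obj"] in props[HAS_OBJECT]
            # The predicate edge is often missing, so only check it when present.
            same_pred = HAS_PREDICATE not in props or e["pred"] in props[HAS_PREDICATE]
            if same_ends and same_pred:
                out["meta"] = {
                    "id": assoc_id,
                    "xrefs": sorted(set(props.get(DC_SOURCE, []))),
                    "evidence": sorted(set(props.get(HAS_EVIDENCE, []))),  # assumed key
                }
        result.append(out)
    return result


ASSOC = "MONARCH:7c16f8c23d210cb4f7f6e2929333789b"
SUBJ = ":_ZDB-GENE-050417-357-ZDB-MRPHLNO-111209-6"
sample = [
    {"sub": SUBJ, "pred": "http://purl.obolibrary.org/obo/RO_0002200", "obj": "GO:0030500PHENOTYPE"},
    {"sub": ASSOC, "pred": HAS_SUBJECT, "obj": SUBJ},
    {"sub": ASSOC, "pred": HAS_OBJECT, "obj": "GO:0030500PHENOTYPE"},
    {"sub": ASSOC, "pred": DC_SOURCE, "obj": "PMID:22087291"},
    {"sub": ASSOC, "pred": HAS_EVIDENCE, "obj": "ECO:0000059"},
]
dereified = dereify(sample)
```

With the five sample edges taken from the example, `dereified` holds a single edge whose `meta` carries the association id, its source and its evidence type.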
It would be useful to have edge labels available when creating visuals of an evidence graph.
Getting labels for edges is currently a two step process:
Get the iri for an edge
Find the node with the same iri property and get its label
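The two steps above could be collapsed into one pass if labels were attached to edges up front. The sketch below assumes nodes are dicts with `iri` and `lbl` fields; both field names and the sample node are assumptions for illustration, not the evidence graph's confirmed shape.

```python
def label_edges(edges, nodes):
    """Return copies of `edges` with a `lbl` field taken from the matching node."""
    # Step 2 of the lookup, done once: index node labels by IRI.
    label_by_iri = {n["iri"]: n.get("lbl") for n in nodes if "iri" in n}
    # Step 1 is reading the edge IRI from `pred`; unknown IRIs get None.
    return [dict(e, lbl=label_by_iri.get(e["pred"])) for e in edges]


sample_nodes = [
    # Hypothetical node record for the predicate used in the example above.
    {"iri": "http://purl.obolibrary.org/obo/RO_0002200", "lbl": "has phenotype"},
]
sample_edges = [
    {"sub": "a", "pred": "http://purl.obolibrary.org/obo/RO_0002200", "obj": "b"},
    {"sub": "a", "pred": "http://example.org/unknown", "obj": "c"},
]
labelled = label_edges(sample_edges, sample_nodes)
```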
Two ideas:
Add taxon to both subject and object with the following pattern:
(subject)-[:subClassOf|type|part_of*0..]->(thing)-[:RO_0002162]->(taxon)
with the taxon being used as a closure
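The path pattern above reads as: follow zero or more `subClassOf`, `type` or `part_of` edges, then one `RO_0002162` edge. A minimal in-memory sketch of that traversal follows; the `(source, edge_type, target)` tuple format and the `EX:` identifiers are assumptions made for illustration.

```python
from collections import deque

CLOSURE_EDGES = {"subClassOf", "type", "part_of"}
IN_TAXON = "RO_0002162"


def find_taxa(start, edges):
    """Return every taxon reachable via CLOSURE_EDGES*0.. followed by IN_TAXON."""
    outgoing = {}
    for source, edge_type, target in edges:
        outgoing.setdefault(source, []).append((edge_type, target))

    taxa, seen, queue = set(), {start}, deque([start])
    while queue:
        node = queue.popleft()
        for edge_type, target in outgoing.get(node, []):
            if edge_type == IN_TAXON:
                taxa.add(target)  # final hop of the pattern
            elif edge_type in CLOSURE_EDGES and target not in seen:
                seen.add(target)
                queue.append(target)
    return taxa


# Hypothetical graph: an instance typed as a class that is annotated with a taxon.
taxon_edges = [
    ("EX:allele1", "type", "EX:gene1"),
    ("EX:gene1", "RO_0002162", "NCBITaxon:7955"),
    ("EX:gene1", "unrelated", "EX:other"),
]
taxa = find_taxa("EX:allele1", taxon_edges)
```

Because the start node is visited first, a subject that carries the `RO_0002162` edge directly is covered by the zero-hop case.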
When on a gene page, we want to see any phenotype/disease associated with any of its orthologs. At one point we were waiting on upgrading solr to join the orthology/gene-phenotype queries.
However, I think we can get this information if we store a list of orthologs for a gene for each record.
For example, for SHH, we would add:
orthologs: ['ZFIN:ZDB-GENE-980526-41', 'ZFIN:ZDB-GENE-980526-166', 'ZFIN:ZDB-GENE-980526-41', 'NCBIGene:716553','NCBIGene:100016531', 'NCBIGene:100557233', 'NCBIGene:42737','NCBIGene:608860', 'NCBIGene:100512749', 'NCBIGene:29499', 'MGI:98297', 'ZFIN:ZDB-GENE-980526-166']
The pattern is:
(gene)-[:RO:HOM0000017|RO:HOM0000020]-(ortholog)
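A sketch of how the `orthologs` field might be filled from that pattern is shown below. It treats the pattern as undirected, removes duplicates while preserving order, and uses a placeholder gene id `EX:shh`; the tuple format, the placeholder id, and the assignment of relations to the sample pairs are all assumptions.

```python
ORTHOLOGY_RELATIONS = {"RO:HOM0000017", "RO:HOM0000020"}


def orthologs_of(gene, edges):
    """Collect distinct orthologs of `gene`, following edges in either direction."""
    found = []
    for sub, pred, obj in edges:
        if pred not in ORTHOLOGY_RELATIONS:
            continue
        if sub == gene:
            other = obj
        elif obj == gene:
            other = sub
        else:
            continue
        if other != gene and other not in found:
            found.append(other)
    return found


ortholog_edges = [
    ("EX:shh", "RO:HOM0000017", "MGI:98297"),
    ("ZFIN:ZDB-GENE-980526-166", "RO:HOM0000020", "EX:shh"),
    ("EX:shh", "RO:HOM0000020", "MGI:98297"),  # same pair from a second source
    ("EX:shh", "RO:0002200", "HP:0000001"),    # not an orthology relation
]
record = {"subject": "EX:shh", "orthologs": orthologs_of("EX:shh", ortholog_edges)}
```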
We want to remove IRI and equivalent IRIs from our search index, https://github.com/SciGraph/golr-loader/blob/master/src/main/java/org/monarch/golr/SimpleLoader.java#L99
Move domain specific logic into more granular classes and add this to the configuration specification.
Currently we serialize scigraph results as a list of JSON documents, and then post the entire JSON file to solr. This has worked well in the past, but seems to cause issues as these JSON documents have gotten larger.
As an alternative, we can use the SolrJ API to construct SolrInputDocument objects and post these to the server in batches of 100k or so. It's unclear if this will result in any performance boost, as SolrJ appears to be sending them to the server using http regardless. At a minimum this should fix #27 and possibly #30.
As a test I've reworked the golr worker to convert the JSON documents to SolrInputDocuments - which is a pretty minimal change but is slightly less performant. After chatting with @kltm and @cmungall I'm planning on moving ahead with the larger refactor of removing the JSON intermediate step.
@benwbooth I'm wondering if it will conflict with your work on #17 ?
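The batching idea is independent of SolrJ and can be sketched in a few lines. The example below is a language-neutral illustration in Python, not the planned Java refactor; `post_batch` is a hypothetical callable standing in for whatever actually sends documents to the server.

```python
from itertools import islice


def batches(docs, size):
    """Yield lists of at most `size` documents without materialising all of them."""
    iterator = iter(docs)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def load(docs, post_batch, size=100_000):
    """Send documents in batches; returns the number of documents sent."""
    total = 0
    for chunk in batches(docs, size):
        post_batch(chunk)  # hypothetical: one request per batch
        total += len(chunk)
    return total


sent = []
count = load(({"id": i} for i in range(5)), sent.append, size=2)
```

Streaming from a generator means memory use is bounded by the batch size rather than by the size of the whole result set, which is the property the large JSON files lack.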
Consider the following pattern:
(subject:gene)<-[has_locus]-(variant)-[relation]->(object:disease)
Where relation is one of:
In many cases, multiple variants of a single gene are linked to a disease via multiple relations (commonly pathogenic and likely pathogenic). Currently, the solr loader seems to pick a relation at random (although this may not be the case and it may in fact be deterministic for a given db).
This is also an issue with combining orthology statements from multiple sources (panther and zfin) where panther specifies whether two orthologs have a 1 to 1 relationship whereas zfin does not.
One option is to store the set of relations linking two nodes. Another option would be to configure a relation priority, where the relation with the highest priority is designated while the others are retrievable via the evidence graph.
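The second option, a configured relation priority, could look roughly like this. The relation names and the priority list are hypothetical; unlisted relations sort after listed ones, and ties fall back to alphabetical order so the choice is deterministic.

```python
def pick_relation(relations, priority):
    """Return (designated relation, remaining relations) for one subject-object pair."""
    rank = {relation: position for position, relation in enumerate(priority)}
    ordered = sorted(set(relations), key=lambda r: (rank.get(r, len(rank)), r))
    return ordered[0], ordered[1:]


# Hypothetical configuration and input.
PRIORITY = ["pathogenic_for_condition", "likely_pathogenic_for_condition"]
designated, others = pick_relation(
    ["likely_pathogenic_for_condition", "pathogenic_for_condition", "contributes_to"],
    PRIORITY,
)
```

The `others` list corresponds to the relations that would remain retrievable through the evidence graph.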
Not sure why this is happening, especially with a local solr instance. It doesn't seem to index the file at all.
The first thing to do is to fail the process; for now the exception is silent. Then, since the file is not indexed at all, a retry can be attempted, or at least the json file can be kept for a manual retry.
INFO: Posting JSON genotype-phenotype.yaml.json to http://localhost:8983/solr/golr
Feb 27, 2017 6:07:32 AM org.monarch.golr.GolrWorker call
SEVERE: Failed to post JSON genotype-phenotype.yaml.json
java.net.SocketException: Broken pipe (Write failed)
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
at org.apache.http.impl.io.SessionOutputBufferImpl.streamWrite(SessionOutputBufferImpl.java:126)
at org.apache.http.impl.io.SessionOutputBufferImpl.flushBuffer(SessionOutputBufferImpl.java:138)
at org.apache.http.impl.io.SessionOutputBufferImpl.write(SessionOutputBufferImpl.java:169)
at org.apache.http.impl.io.ContentLengthOutputStream.write(ContentLengthOutputStream.java:115)
at org.apache.http.client.fluent.InternalFileEntity.writeTo(InternalFileEntity.java:75)
at org.apache.http.impl.DefaultBHttpClientConnection.sendRequestEntity(DefaultBHttpClientConnection.java:158)
at org.apache.http.impl.conn.CPoolProxy.sendRequestEntity(CPoolProxy.java:162)
at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:237)
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:122)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:271)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
at org.apache.http.client.fluent.Request.internalExecute(Request.java:173)
at org.apache.http.client.fluent.Executor.execute(Executor.java:262)
at org.monarch.golr.GolrWorker.call(GolrWorker.java:79)
at org.monarch.golr.GolrWorker.call(GolrWorker.java:25)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
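The fail-loudly-then-retry behaviour proposed above can be sketched as follows. This is an illustration in Python rather than a patch to GolrWorker; `post` is a hypothetical callable that uploads one file, and `PostFailed` is a name invented here.

```python
import time


class PostFailed(Exception):
    """Raised when a file could not be posted after all attempts."""


def post_with_retry(path, post, attempts=3, delay_seconds=0.0):
    """Call `post(path)`, retrying on I/O errors, and fail loudly at the end."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return post(path)
        except OSError as err:  # includes BrokenPipeError
            last_error = err
            if attempt < attempts:
                time.sleep(delay_seconds)
    # Nothing was indexed, so the caller should keep the file for a manual retry.
    raise PostFailed(f"could not post {path} after {attempts} attempts") from last_error


calls = []


def flaky_post(path):
    """Test double: fails once with a broken pipe, then succeeds."""
    calls.append(path)
    if len(calls) < 2:
        raise BrokenPipeError("Broken pipe (Write failed)")
    return "ok"


outcome = post_with_retry("genotype-phenotype.yaml.json", flaky_post)
```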
At the moment, queries like:
START chromosomeClass = node:node_auto_index(iri='SO:0000340')
won't be resolved to IRIs.
Ideally it should behave like the /cypher/execute and /cypher/resolve services of SciGraph.
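To make the expected behaviour concrete, here is a rough sketch of CURIE resolution over a query string. It is not how SciGraph implements `/cypher/resolve`; the regular expression, the restriction to single-quoted values, and the one-entry prefix map are assumptions for illustration.

```python
import re

# Assumed prefix map; a real loader would read this from its configuration.
CURIE_MAP = {"SO": "http://purl.obolibrary.org/obo/SO_"}

# A single-quoted PREFIX:LOCAL value; full IRIs are skipped because the
# local part may not contain a slash.
QUOTED_CURIE = re.compile(r"'([A-Za-z][\w.-]*):([^'\s/]+)'")


def resolve_curies(query, curie_map=CURIE_MAP):
    """Expand quoted CURIEs with a known prefix; leave everything else untouched."""
    def expand(match):
        prefix, local = match.group(1), match.group(2)
        if prefix not in curie_map:
            return match.group(0)
        return "'" + curie_map[prefix] + local + "'"

    return QUOTED_CURIE.sub(expand, query)


resolved = resolve_curies(
    "START chromosomeClass = node:node_auto_index(iri='SO:0000340')"
)
```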
In addition to label and ID closure there should be JSON closure maps that resemble:
regulates_closure_map: "{"GO:0051716":"cellular response to stimulus","GO:0007264":"small GTPase mediated signal transduction","GO:0009987":"cellular process","GO:0023052":"signaling","GO:0044699":"single-organism process","GO:0044763":"single-organism cellular process","GO:0065007":"biological regulation","GO:0005623":"cell","GO:0007265":"Ras protein signal transduction","GO:0005575":"cellular_component","GO:0005622":"intracellular","GO:0035556":"intracellular signal transduction","GO:0008150":"biological_process","GO:0007165":"signal transduction","GO:0007154":"cell communication","GO:0050794":"regulation of cellular process","GO:0044464":"cell part","GO:0050789":"regulation of biological process","GO:0044700":"single organism signaling","GO:0032482":"Rab protein signal transduction","GO:0050896":"response to stimulus"}",
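A closure map of this shape is an id-to-label dictionary serialised to a JSON string. A minimal sketch, assuming the closure ids and a label lookup are already available (both inputs are assumptions about what the loader has on hand):

```python
import json


def closure_map(closure_ids, labels):
    """Serialise {id: label} for every closure member that has a label."""
    mapping = {term: labels[term] for term in closure_ids if term in labels}
    return json.dumps(mapping, separators=(",", ":"))


term_labels = {
    "GO:0023052": "signaling",
    "GO:0009987": "cellular process",
}
encoded = closure_map(["GO:0023052", "GO:0009987", "GO:9999999"], term_labels)
```

The value is stored as a string, so a client has to parse it again with a JSON parser before using it.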
evidence should become an evidence_object closure.
evidence should be just the closure of all evidence in the graph.
source closure with all the sources from associations.
When something is declared as an instance of a taxon, the taxon_id/label/closure is not getting populated.
For example, from the omia data, their "breeds" are declared as instances of a taxon:
:_omiabreedkey100 a OBO:NCBITaxon_9913 ; rdfs:label "Red Spotted Czech (cattle)" .
But these aren't showing up in the various taxon closures. I believe this is because :type is not included in the path when trying to find the taxon. This needs fixing.
java.lang.ArrayIndexOutOfBoundsException: 1
at org.monarch.golr.SimpleLoader.generate (SimpleLoader.java:98)
at org.monarch.golr.SimpleLoaderMain.main (SimpleLoaderMain.java:91)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:254)
at java.lang.Thread.run (Thread.java:748)
Looks like we get a CURIE without a colon (?)
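If the cause is indeed a CURIE without a colon, splitting on `:` and indexing element 1 would fail exactly like this. A defensive split is sketched below in Python; it only illustrates the guard and is not the code at SimpleLoader.java:98.

```python
def split_curie(curie):
    """Return (prefix, local_id); prefix is None when there is no colon."""
    prefix, separator, local_id = curie.partition(":")
    if not separator:
        return None, curie
    return prefix, local_id


with_colon = split_curie("PMID:22087291")
without_colon = split_curie("22087291")
```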
Solr-dev is currently returning an exception for any biolink call that facets over object_closures:
java.lang.IllegalStateException: Too many values for UnInvertedField faceting on field object_closure
We appear to have hit some limit, but it's not obvious from looking at the data:
Solr Production
Total Docs: 37502996
Unique values in object_closure: 4005165
Solr Dev:
Total Docs: 38759008
Unique values in object_closure: 4350271
It's also possible the limit is based on the number of values in a single document, but this is harder to gather without iterating over each document.
Related:
https://issues.apache.org/jira/browse/SOLR-11240
Possible solutions:
This is due to the fix for the evidence graphs, which assumes that a query will always have a subject and an object.
Stack trace:
May 05, 2017 1:09:48 AM org.monarch.golr.GolrWorker call
INFO: Deleting JSON pathway-phenotype.yaml.json
May 05, 2017 1:09:51 AM org.monarch.golr.GolrWorker call
INFO: pathway-phenotype.yaml.json done
May 05, 2017 1:09:51 AM org.monarch.golr.GolrWorker call
INFO: Posting JSON gene-phenotype.yaml.json to http://localhost:8983/solr/golr
May 05, 2017 1:47:40 AM org.monarch.golr.GolrWorker call
INFO: {"responseHeader":{"status":0,"QTime":2269002}}
**May 05, 2017 1:47:40 AM org.monarch.golr.GolrWorker call
INFO: Deleting JSON gene-phenotype.yaml.json
May 05, 2017 1:47:42 AM org.monarch.golr.GolrWorker call
INFO: gene-phenotype.yaml.json done**
May 05, 2017 1:47:42 AM org.monarch.golr.GolrWorker call
INFO: Posting JSON literature-variant.yaml.json to http://localhost:8983/solr/golr
May 05, 2017 3:52:05 AM org.monarch.golr.GolrWorker call
INFO: {"responseHeader":{"status":0,"QTime":7462131}}
May 05, 2017 3:52:05 AM org.monarch.golr.GolrWorker call
INFO: Deleting JSON literature-variant.yaml.json
May 05, 2017 3:52:07 AM org.monarch.golr.GolrWorker call
INFO: literature-variant.yaml.json done
Exception in thread "Golr processor - gene-phenotype.yaml.json" java.util.concurrent.ExecutionException: org.neo4j.graphdb.NotInTransactionException: The statement has been closed.
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.monarch.golr.Pipeline.main(Pipeline.java:190)
Caused by: org.neo4j.graphdb.NotInTransactionException: The statement has been closed.
at org.neo4j.kernel.impl.api.KernelStatement.assertOpen(KernelStatement.java:153)
at org.neo4j.kernel.impl.api.OperationsFacade.nodeGetRelationships(OperationsFacade.java:346)
at org.neo4j.kernel.impl.core.NodeProxy$1.iterator(NodeProxy.java:189)
at org.neo4j.kernel.impl.core.NodeProxy$1.iterator(NodeProxy.java:181)
at io.scigraph.internal.EvidenceAspect.invoke(EvidenceAspect.java:80)
at org.monarch.golr.EvidenceProcessor.addAssociations(EvidenceProcessor.java:68)
at org.monarch.golr.GolrLoader.serializeGolrQuery(GolrLoader.java:433)
at org.monarch.golr.GolrLoader.process(GolrLoader.java:337)
at org.monarch.golr.GolrWorker.call(GolrWorker.java:53)
at org.monarch.golr.GolrWorker.call(GolrWorker.java:25)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Since the golr loader is stable, we can delete the generated json files on the fly to save disk space.
The current loader is association-centric; i.e. each document is a relationship between two objects. This is useful for the majority of queries.
It would also be useful to have an object-centric loader. (TBD: define yaml in monarch repo). E.g. for monarch-initiative/monarch-app#756 (one row per variant).
Here we would have only one document per object. Relationships would be loaded into a multi-valued field named after the property. This means we have a less generic schema than oban.
Example fields (core):
This would be extended depending on the object type. E.g. for genomic features like variants we may have chrom, start, end (for simplicity we would flatten to a single reference; for more complexity use cypher). For variants we may have a pathogenicity score. Etc.
@cmungall will define core schema, cc @nlwashington
Note that the mechanism here could be used to load ontology classes; but may as well just use owltools loader for this (cc @hdietze @kltm)
All direct queries should fill in their relation. Inferred relations can be left null.
In addition to sub/obj_closure, it would be useful to have:
sub/obj_equivalents - all members of the equivalence clique (reflexive: includes self)
sub/obj_xrefs - all xref annotations for all members of the clique (reflexive: includes self.curie)
These would be useful for query purposes. Currently in biolink we take the input ID and use subject_closure, which includes equivalence, but also allows parents, which we don't want.
xrefs are useful for near-equivalence. E.g. when querying for gene-disease, we want to allow input of uniprot protein xref
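The two proposed fields can be derived from a clique listing and a per-node xref table. The sketch below assumes cliques are given as sets of CURIEs and xrefs as a dict of lists; the data structures and the `EX:` and `XREF:` identifiers are hypothetical.

```python
def clique_fields(node_id, cliques, xrefs):
    """Return (equivalents, xrefs) for the clique containing `node_id`."""
    # Reflexive: a node outside any clique is its own one-member clique.
    members = next((c for c in cliques if node_id in c), {node_id})
    equivalents = sorted(members)
    # Reflexive: each member's own CURIE counts as an xref.
    all_xrefs = set(members)
    for member in members:
        all_xrefs.update(xrefs.get(member, []))
    return equivalents, sorted(all_xrefs)


sample_cliques = [{"EX:gene1", "EX:gene1_alt"}]
sample_xrefs = {"EX:gene1_alt": ["XREF:P1"]}
equivalents, clique_xrefs = clique_fields("EX:gene1", sample_cliques, sample_xrefs)
```

Unlike subject_closure, neither field contains parent classes, so a query on them matches only the clique and its cross-references.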
From https://github.com/monarch-initiative/configs/issues/18
This will make the queries transparent and reusable (and is more meaningful than 'config' - this is our biological inferences, more than configuration!).
We should also add something to the golr export as provenance that states "we have association A as inferred using query Q", where Q is something like monarch:cypher/gene-phenotype
nodes: [...] edges: [...] meta: { query: "monarch:cypher/gene-phenotype" }
Evidence graph, evidence, and evidence_object should be optional.
Currently we merge associations with the same subject-object pairs from a single query, and track evidence/provenance in an evidence graph object. It has been requested that we no longer do this in order to better show evidence/provenance of an association.
cc @cmungall
Currently obsoleted terms make it into our search index. Anchoring the category/categories can fix this, but it would be better to not index them in the first place - unless we foresee a use case where we want to search for obsoleted terms.
Hello, I was just exploring this project and wanted to compile it, but the compilation failed due to a SciGraph version mismatch, i.e. SciGraph's version is 2.2 while the Golr-loader looks for 2.1 before it compiles the code. Here is the output from my console: https://gist.github.com/yy20716/a5685e97e2125718939da095231b4b77 I wonder whether I simply need to update the version in pom.xml. Thank you.
I would like for all nodes to have some summary stats attached to them for us to perform some downstream enhancements on...
minimally, this would be the IC score for any ontology node, but in the future we could look at storing some other things like the annot sufficiency score on other node types, or perhaps the min/max/sum of the annotations for annotated nodes like genes/genotypes (though those could be computed on the fly if we had the IC).
we could start by simply adding a subject_ic and object_ic. alternatively, @cmungall should the "quad" just be updated to be a "quint" to have the IC always part of it?
From: monarch-initiative/hpo-annotation-data#3
Solr has no results:
But the data is in SciGraph:
http://rosie.crbs.ucsd.edu:9000/scigraph/dynamic/diseases/Orphanet:881/phenotypes.json
However, I noticed the Orphanet node lacks a category (should be disease) - is this the cause?
cc @mellybelly
For scalability reasons, we removed the paths from the golr queries. So now we lack the information needed to get the right associations.
We use depthFirst() when generating a closure for an ID; however, we should consider using breadthFirst() instead to get the full set of classes.
See
vs.
These ones are the slowest to complete.
For example, colour blindness: https://api.monarchinitiative.org/api/search/entity/Colour%20blindness?prefix=HP
I'm unable to fetch gene phenotype associations from the current GOlr view, although I can get them using the SciGraph dynamic services, for example: http://duckworth.crbs.ucsd.edu:9000/scigraph/dynamic/features/NCBIGene:6469/phenotypes
Example Golr attempt:
object_category = phenotype, subject_closure = NCBIGene:6469
I ran into an issue today where I could not find a label for an HPO class, and tracked it to:
obophenotype/upheno#190 and,
obophenotype/upheno#149
Should we be indexing all non-clique leaders? This wouldn't work well with MONDO since non clique-leaders do not contain rdfs:label annotation properties.
See monarch-initiative/monarch-ui#319
The clique merge does not merge node properties, meaning that things like synonyms will not get passed along to the search index (which only operates on clique leader ids)
Typically when we populate the closure field, we traverse the closure over SubClassOf only
If the category is Anatomy, then the closure should be over SubClassOf|BFO:0000050.
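One way to picture a category-dependent closure is a traversal whose allowed edge types come from a lookup table. The sketch below assumes a `(subject, predicate, object)` tuple format, a reflexive closure, and hypothetical `EX:` identifiers; the table stands in for the generic configuration mentioned in the side note below.

```python
from collections import deque

DEFAULT_EDGES = frozenset({"subClassOf"})
# Assumed configuration table: category name -> edge types used for the closure.
EDGES_BY_CATEGORY = {"anatomy": frozenset({"subClassOf", "BFO:0000050"})}


def closure(start, edges, category=None):
    """Return the reflexive closure of `start` over the category's edge types."""
    allowed = EDGES_BY_CATEGORY.get((category or "").lower(), DEFAULT_EDGES)
    parents = {}
    for sub, pred, obj in edges:
        if pred in allowed:
            parents.setdefault(sub, []).append(obj)

    seen, queue = {start}, deque([start])
    while queue:
        for parent in parents.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


anatomy_edges = [
    ("EX:finger", "BFO:0000050", "EX:hand"),
    ("EX:hand", "subClassOf", "EX:limb_segment"),
]
default_closure = closure("EX:finger", anatomy_edges)
anatomy_closure = closure("EX:finger", anatomy_edges, category="Anatomy")
```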
This can be tested when this is implemented:
monarch-initiative/dipper#147 (bgee gene expression loads)
Side note: really this should not need the addition of procedural code. We should have a generic configuration for this. This will be a separate ticket, for now see geneontology/amigo#210 (comment)
we've had a request to display the abbreviation label for "evidence" instead of the full label. (see monarch-initiative/dipper#177)
does this information come in the current golr, or do we need a special way of adding this at golr-generation time? if it is already in the golr, then we need to move this ticket to the monarch-app layer.
This seems like it ought to be a configurable thing, where the label that is placed in golr may be the "label" by default, but could be a special synonym type.