julielab / gepi

GePI (GEne - Protein Interactions) is a web portal for quick and convenient access to gene - protein interaction mentions automatically extracted from the biomedical literature, i.e. PubMed and PubMed Central (Open Access Subset).

License: GNU General Public License v3.0

Java 27.03% Shell 2.00% Python 0.57% JavaScript 67.23% CSS 0.60% Dockerfile 0.16% Less 0.02% SCSS 0.06% Perl 2.33%

bionlp interactions molecular ppi retrieval webapplication

gepi's Issues

input field - responsiveness

Id mapping (e.g. UniProt2Gene) - howto?

How exactly do we want to map UniProt IDs to NCBI Gene IDs? One possibility would be the file described at ftp://ftp.pir.georgetown.edu/databases/idmapping/idmapping.tb.readme. We have this on our server's hard disk and use it for GeNo resource creation. This would result in a kind of "static" mapping, since we would have to update our resources for a new mapping. Would that be an issue? Hosting the mapping ourselves would be much quicker than querying an external web service, I suppose, especially for long lists of IDs.
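A minimal sketch of the self-hosted variant, assuming a tab-separated mapping file with the UniProtKB accession in the first column and the Entrez Gene IDs in the third column, separated by semicolons. The column positions, the separator and the function name load_uniprot_to_gene are assumptions that need to be checked against the readme.

```python
import csv
import io
from collections import defaultdict


def load_uniprot_to_gene(lines, acc_col=0, gene_col=2, sep=";"):
    """Build an in-memory UniProt accession -> set of NCBI Gene IDs map.

    acc_col, gene_col and sep are assumptions about the file layout;
    verify them against the mapping file's readme before use.
    """
    mapping = defaultdict(set)
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) <= max(acc_col, gene_col):
            continue  # skip short or empty lines
        for gene_id in row[gene_col].split(sep):
            gene_id = gene_id.strip()
            if gene_id:
                mapping[row[acc_col]].add(gene_id)
    return dict(mapping)


# Tiny made-up sample in the assumed layout (not real mapping data).
sample = io.StringIO(
    "P00001\tEXAMPLE1_HUMAN\t111; 222\n"
    "P00002\tEXAMPLE2_HUMAN\t\n"
    "P00003\tEXAMPLE3_MOUSE\t333\n"
)
uniprot_to_gene = load_uniprot_to_gene(sample)
```

Loaded once at startup, such a dictionary answers long ID lists without any network round trip. The price is the "static" nature mentioned above: the file has to be reloaded whenever the resources are updated.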

automated update of resources and index

multiple line entries

Some of the results contain references, which are mainly titles of other publications but also supplementary material such as UniProt ID lists.
In the es_query CSV output, they are spread over multiple lines ('\n').

Should be handled one way (e.g. not included in database) or another (post-filtering).
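The post-filtering option could look roughly like the sketch below. It assumes that fields with embedded newlines are properly quoted in the CSV output, which has not been verified for es_query. The function name and the sample row are made up.

```python
import csv
import io


def flatten_multiline_fields(csv_text):
    """Rewrite CSV so that no field spans more than one line.

    Assumes fields with embedded newlines are properly quoted in the input.
    """
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        # " ".join(field.split()) collapses any run of whitespace,
        # including "\n", into a single space.
        writer.writerow([" ".join(field.split()) for field in row])
    return out.getvalue()


# Made-up example row; the column layout is purely illustrative.
raw = '12345,"Title of a cited paper\nP00001\nP00002",binding\n'
cleaned = flatten_multiline_fields(raw)
```

This keeps the reference text but puts it on one line. Dropping such entries entirely would instead have to happen at indexing time.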

user specific accounts

journal lists
saved searches
saved input ids

elasticsearch index: pmid/pmcid fields wrong?

The ES index has fields for the ID of the document (pmcid or pmid). However, the fields seem to be populated incorrectly: we got mappings for pmids that are actually pmcids (recognizable by their _id value). A further hint supporting this suspicion is that there is always a pmid field but never a pmcid field alone; the latter appears only when both are populated.
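To get a feeling for how many documents are affected, a sanity check along the following lines could be run over exported index documents. It is only a sketch: the key names, the "PMC" prefix for PMC IDs and the digits-only form of PubMed IDs are assumptions, and the PMC IDs in the sample are made up.

```python
def find_suspicious_ids(docs):
    """Return _id values of documents whose id fields look inconsistent.

    Assumptions (not confirmed by the index mapping): documents carry
    "_id", "pmid" and "pmcid" keys, PMC IDs start with "PMC", and
    PubMed IDs consist of digits only.
    """
    suspicious = []
    for doc in docs:
        pmid = doc.get("pmid")
        pmcid = doc.get("pmcid")
        is_pmc_doc = str(doc.get("_id", "")).startswith("PMC")
        pmid_looks_wrong = pmid is not None and not str(pmid).isdigit()
        pmcid_missing = is_pmc_doc and pmcid is None
        if pmid_looks_wrong or pmcid_missing:
            suspicious.append(doc["_id"])
    return suspicious


# Made-up sample documents for illustration only.
docs = [
    {"_id": "23700993", "pmid": "23700993"},
    {"_id": "PMC0000001", "pmid": "PMC0000001"},
    {"_id": "PMC0000002", "pmid": "111", "pmcid": "PMC0000002"},
]
flagged = find_suspicious_ids(docs)
```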

Add index field for number of arguments

With the current index layout, we have no way to filter events that have only a single argument within the ElasticSearch query.
Add a field to the index that only holds the number of arguments so we can filter against it.
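A rough sketch of both halves of this idea: computing the count when the event is indexed and filtering on it at query time. The key names "arguments" and "numArguments" are hypothetical. The query body is a plain ElasticSearch range query.

```python
def with_argument_count(event):
    """Return a copy of an event document with a numeric argument count.

    The keys "arguments" and "numArguments" are hypothetical names.
    """
    enriched = dict(event)
    enriched["numArguments"] = len(event.get("arguments", []))
    return enriched


def min_arguments_filter(minimum):
    """ElasticSearch range query body on the hypothetical count field."""
    return {"query": {"range": {"numArguments": {"gte": minimum}}}}


event = with_argument_count({"type": "Binding", "arguments": ["geneA", "geneB"]})
query = min_arguments_filter(2)
```

If events are stored as nested objects inside the document, the range clause would additionally have to be wrapped in a nested query.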

Problematic Medline Document: 23700993

 <MedlineCitation Status="MEDLINE" Owner="NLM">
        <PMID Version="1">23700993</PMID>
        <DateCreated>
            <Year>2013</Year>
            <Month>08</Month>
            <Day>06</Day>
        </DateCreated>
        <DateCompleted>
            <Year>2014</Year>
            <Month>02</Month>
            <Day>26</Day>
        </DateCompleted>
        <DateRevised>
            <Year>2013</Year>
            <Month>08</Month>
            <Day>06</Day>
        </DateRevised>
        <Article PubModel="Print">
            <Journal>
                <ISSN IssnType="Electronic">1875-6697</ISSN>
                <JournalIssue CitedMedium="Internet">
                    <Volume>9</Volume>
                    <Issue>2</Issue>
                    <PubDate>
                        <Year>2013</Year>
                        <Month>Jun</Month>
                    </PubDate>
                </JournalIssue>
                <Title>Current computer-aided drug design</Title>
                <ISOAbbreviation>Curr Comput Aided Drug Des</ISOAbbreviation>
            </Journal>
            <ArticleTitle>Molecular design and QSARs/QSPRs with molecular descriptors family.</ArticleTitle>
            <Pagination>
                <MedlinePgn>195-205</MedlinePgn>
            </Pagination>
            <Abstract>
                <AbstractText>The aim of the present paper is to present the methodology of the molecular descriptors family (MDF) as an integrative tool in molecular modeling and its abilities as a multivariate QSAR/QSPR modeling tool. An algorithm for extracting useful information from the topological and geometrical representation of chemical compounds was developed and integrated to calculate MDF members. The MDF methodology was implemented and the software is available online (http://l.academicdirect.org/Chemistry/SARs/MDF_SARs/). This integrative tool was developed in order to maximize performance, functionality, efficiency and portability. The MDF methodology is able to provide reliable and valid multiple linear regression models. Furthermore, in many cases, the MDF models were better than the published results in the literature in terms of correlation coefficients (statistically significant Steiger's Z test at a significance level of 5%) and/or in terms of values of information criteria and Kubinyi function. The MDF methodology developed and implemented as a platform for investigating and characterizing quantitative relationships between the chemical structure and the activity/property of active compounds was used on more than 50 study cases. In almost all cases, the methodology allowed obtaining of QSAR/QSPR models improved in explanatory power of structure-activity and structure-property relationships. The algorithms applied in the computation of geometric and topological descriptors (useful in modeling physicochemical or biological properties of molecules) and those used in searching for reliable and valid multiple linear regression models certain enrich the pool of low-cost low-time drug design tools.</AbstractText>
            </Abstract>
            <AuthorList CompleteYN="Y">
                <Author ValidYN="Y">
                    <LastName>Bolboacă</LastName>
                    <ForeName>Sorana D</ForeName>
                    <Initials>SD</Initials>
                    <AffiliationInfo>
                        <Affiliation>Department of Medical Informatics and Biostatistics, Iuliu Ha􀀅ieganu University of Medicine and Pharmacy Cluj-Napoca, 6 Louis Pasteur, 400349 Cluj, Romania.</Affiliation>
                    </AffiliationInfo>
                </Author>
                <Author ValidYN="Y">
                    <LastName>Jäntschi</LastName>
                    <ForeName>Lorentz</ForeName>
                    <Initials>L</Initials>
                </Author>
                <Author ValidYN="Y">
                    <LastName>Diudea</LastName>
                    <ForeName>Mircea V</ForeName>
                    <Initials>MV</Initials>
                </Author>
            </AuthorList>
            <Language>eng</Language>
            <PublicationTypeList>
                <PublicationType UI="D016428">Journal Article</PublicationType>
                <PublicationType UI="D013485">Research Support, Non-U.S. Gov't</PublicationType>
            </PublicationTypeList>
        </Article>
        <MedlineJournalInfo>
            <Country>United Arab Emirates</Country>
            <MedlineTA>Curr Comput Aided Drug Des</MedlineTA>
            <NlmUniqueID>101265750</NlmUniqueID>
            <ISSNLinking>1573-4099</ISSNLinking>
        </MedlineJournalInfo>
        <CitationSubset>IM</CitationSubset>
        <MeshHeadingList>
            <MeshHeading>
                <DescriptorName UI="D000465" MajorTopicYN="N">Algorithms</DescriptorName>
            </MeshHeading>
            <MeshHeading>
                <DescriptorName UI="D015195" MajorTopicYN="N">Drug Design</DescriptorName>
            </MeshHeading>
            <MeshHeading>
                <DescriptorName UI="D021281" MajorTopicYN="Y">Quantitative Structure-Activity Relationship</DescriptorName>
            </MeshHeading>
            <MeshHeading>
                <DescriptorName UI="D012984" MajorTopicYN="N">Software</DescriptorName>
            </MeshHeading>
        </MeshHeadingList>
    </MedlineCitation>
    <PubmedData>
        <History>
            <PubMedPubDate PubStatus="received">
                <Year>2013</Year>
                <Month>03</Month>
                <Day>10</Day>
            </PubMedPubDate>
            <PubMedPubDate PubStatus="revised">
                <Year>2012</Year>
                <Month>10</Month>
                <Day>26</Day>
            </PubMedPubDate>
            <PubMedPubDate PubStatus="accepted">
                <Year>2013</Year>
                <Month>04</Month>
                <Day>27</Day>
            </PubMedPubDate>
            <PubMedPubDate PubStatus="entrez">
                <Year>2013</Year>
                <Month>5</Month>
                <Day>25</Day>
                <Hour>6</Hour>
                <Minute>0</Minute>
            </PubMedPubDate>
            <PubMedPubDate PubStatus="pubmed">
                <Year>2013</Year>
                <Month>5</Month>
                <Day>25</Day>
                <Hour>6</Hour>
                <Minute>0</Minute>
            </PubMedPubDate>
            <PubMedPubDate PubStatus="medline">
                <Year>2014</Year>
                <Month>2</Month>
                <Day>27</Day>
                <Hour>6</Hour>
                <Minute>0</Minute>
            </PubMedPubDate>
        </History>
        <PublicationStatus>ppublish</PublicationStatus>
        <ArticleIdList>
            <ArticleId IdType="pubmed">23700993</ArticleId>
            <ArticleId IdType="pii">CCADD-EPUB-20130514-4</ArticleId>
        </ArticleIdList>
    </PubmedData>

Hahn's idea:

Prof. Hahn thought that it would add significant application value if a click on a portion of a diagram - e.g. a single beam of a sankey diagram - would take the user to the textual data (sentences?) from which the diagram portion was derived.
The vision is a close connection between extracted, condensed information and the underlying text. Like a zooming function: Papers are 100% zoom, a diagram is like 30% (just made up numbers)

index restructuring for performance

jpp: jules-longdocument-skipper (jls)

I can find the jls neither in our nexus nor in the svn?
The dependency is only used (as far as I'm aware) in CPEAllPMC.xml, so finding out where it has gone shouldn't be that time-sensitive.

include chemicals tagger

request from GlioPATH members: chem <> gene, chem <> chem interactions should be possible as well
identical chem ids available
is there an annotation for this in the available bionlp corpus? or better (for now): is there a chem tagger available?

Write FieldsGenerator

Write a FieldsGenerator (see jules-cas-to-elasticsearch-consumer) which generates Document objects appropriate for the GePi index format.

es_query - no pmc hits

Title says all:
Using es_query to query ES yields no event matches that are outside of title/abstract.
In other words, currently we can deliver only medline hits.
Was the PMC indexing output really written to the same ES as Medline was before?
Thoughts?

Provide article id in resulting table per sentence

Result Portal

On the current sketch of the GePi frontend, there is just a single page where, on the left side, search lists are entered and, on the right side, results are shown.
We have space issues with this. Also, the user still has to click through different result presentations (table, pie chart, sankey...).
What about a whole new page dedicated to result presentation that shows just the most important diagrams and the table in a single view? Users should just search and see. Of course there should be ways to interact, but everything should be as accessible as possible.

Table pager

Nature methods review

comparable papers (e.g. http://www.nature.com/nmeth/journal/v14/n5/full/nmeth.4260.html)
reasonable target journal?
format

complete medline & pmc index

GePi webapplication server location

ToDo: how, where, what - Benjamin & Erik (how is this handled for Semedico, for example?)

pie chart resizing

enable search exploration via widget interaction

Current status of modules relevant for GePi

Please add here information about what needs to be done and what is present, etc.

Organise semedico-app input resources from gene database

The two of you should come up with a solid strategy as to how Sascha can create the semedico-app resources and Franz can actually work with them. Possibly even some mechanism which discloses at which state the current resources are (easiest thing to do: look at creation date; is that enough to avoid confusion?)

Input list recoverable from left side

Optionally by showing a small hint sign/handle on the left side of the browser screen.

bar diagram - numbers

Full index gepi available on coling servers

web application and ES on same server with ssh
julielab ES cluster with ssh
open port from outside (public available)

stat widget

how many genes were in the input lists? how many homologous genes? how many different interaction targets
upon enlarging the widget show more stats

provide filter possibilities to narrow down search / for search exploration

possibility to ignore reviews
possibility to select interaction type
possibility to select likelihood level of events
high impact journals, journals in general selectable?

event class - atid order in meaningful order?

Currently, it is not guaranteed that atids are always provided in a fixed order, e.g. that the first atid is the one that links all homologous gene IDs.

proper Inject usage for googleChartManager

Who does what?

The title says it all, content is missing here.
I would like everyone to know what they are working on and what they need to accomplish.
Basically, how we organise ourselves.

Switch to an event-centric index structure

We should probably switch to another index structure in the future. Currently, we get all documents and get from each document all inner event hits. This is fine for the moment. But another index design would most probably be much more performant.

Related discussion on ElasticSearch: elastic/elasticsearch#14229
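To make the event-centric idea concrete, the sketch below flattens one document with inner events into one record per event, so that each event could be indexed and retrieved as a top-level hit. The key names "id", "events" and "docId" are hypothetical and do not reflect the actual GePi index format.

```python
def to_event_documents(document):
    """Flatten one document with inner events into one record per event.

    "id", "events" and the output key "docId" are hypothetical names.
    """
    doc_id = document["id"]
    return [
        {"_id": f"{doc_id}_{i}", "docId": doc_id, **event}
        for i, event in enumerate(document.get("events", []))
    ]


# Made-up document for illustration.
paper = {
    "id": "doc1",
    "events": [
        {"type": "Binding", "arguments": ["a", "b"]},
        {"type": "Phosphorylation", "arguments": ["c"]},
    ],
}
event_docs = to_event_documents(paper)
```

The trade-off is a larger index with duplicated document-level fields, in exchange for queries that return events directly instead of documents with inner hits.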

Download functionality

API functionality

interactivity of results

Upon (visual) search result, e.g. sankey edge, user can narrow down results upon click, etc.
In general, results are filterable upon base search result.

protected access for partners until publication

paper outline

points to address

who are our rivals? Where and why are we significantly more awesome?
how to tell the story? most likely, we should include several query scenarios, ideally showing something with existing data that has not been seen before?
- verification experiments?
where to go? Reach and try high: Nature Methods
- Policy, Requirements, etc.

Build Gene Database

Build the Neo4j Gene Database from scratch and export the resource file required by the semedico-app.

Allow gene name search

including stats on how these had been resolved

atid's

Once we get the results, we are able to deliver a sentence and one or two interaction partners.
While looking over the first results, I realised we are getting different names for the same Entrez IDs (e.g. 'Arnt', 'Arnt mRNA'). I guess there are different tids referring to these Entrez IDs.
For the sake of a powerful summary we will need to use one gene name for all grouped entrez ids, in other words a gene name representative for any atid with several tids.
I would like to discuss how we can achieve this efficiently.
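As a starting point for that discussion, one simple heuristic is to display the most frequently mentioned name per group. The sketch below assumes the results can be reduced to (group ID, name) pairs. The heuristic, the tie-breaking rule and all names in the code are assumptions, not a decided strategy.

```python
from collections import Counter, defaultdict


def representative_names(mentions):
    """Pick one display name per gene group: the most frequent mention.

    `mentions` is an iterable of (group_id, name) pairs; using the most
    frequent name (ties broken by shortest, then alphabetically) is an
    assumed heuristic, not the project's decided strategy.
    """
    counts = defaultdict(Counter)
    for group_id, name in mentions:
        counts[group_id][name] += 1
    return {
        group_id: min(names, key=lambda n: (-names[n], len(n), n))
        for group_id, names in counts.items()
    }


# Made-up mentions; "atid1" and "atid2" are placeholder group IDs.
mentions = [
    ("atid1", "Arnt"),
    ("atid1", "Arnt mRNA"),
    ("atid1", "Arnt"),
    ("atid2", "Ahr"),
]
names = representative_names(mentions)
```

An alternative would be to take the official symbol from the gene database for each group, which avoids depending on how authors happen to phrase their mentions.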

result pane small

only half of its size is used for charts.

correct table resizing

scrolling disabled after fullscreen expansion

Sentence highlighting

PMC XMI Parsing issue: Maximum attribute size limit exceeded

Some BioC PMC documents run into issues with the XmiSplitter class when trying to parse the XMI data created by the BioC-PMC-CollectionReader:

javax.xml.stream.XMLStreamException: Maximum attribute size limit (524288) exceeded
       at com.ctc.wstx.sr.StreamScanner.constructLimitViolation(StreamScanner.java:2469)
       at com.ctc.wstx.sr.StreamScanner.verifyLimit(StreamScanner.java:2462)
       at com.ctc.wstx.sr.BasicStreamReader.parseAttrValue(BasicStreamReader.java:1962)
       at com.ctc.wstx.sr.BasicStreamReader.handleNsAttrs(BasicStreamReader.java:3065)
       at com.ctc.wstx.sr.BasicStreamReader.handleStartElem(BasicStreamReader.java:2963)
       at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2839)
       at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1073)
       at org.codehaus.stax2.ri.Stax2EventReaderImpl.nextEvent(Stax2EventReaderImpl.java:255)
       at de.julielab.xml.XmiSplitter.processAndParse(XmiSplitter.java:356)
       at de.julielab.xml.XmiSplitter.storeSelected(XmiSplitter.java:314)
       at de.julielab.xml.XmiSplitter.process(XmiSplitter.java:241)
       at de.julielab.jules.consumer.CasToXmiDBConsumer.process(CasToXmiDBConsumer.java:410)
       at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
       at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:374)
       at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:298)
       at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:267)
       at org.apache.uima.collection.impl.cpm.engine.ProcessingUnit.processNext(ProcessingUnit.java:897)
       at org.apache.uima.collection.impl.cpm.engine.ProcessingUnit.run(ProcessingUnit.java:577)
Jan 20, 2017 10:13:35 AM org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl callAnalysisComponentProcess(406)

We use the Woodstox parser because it solves issues with Unicode characters that the default Java7 StAX parser has (I don't know if that's gone by Java8). The Woodstox parser factory is WstxInputFactory and it has a configuration ReaderConfig. ReaderConfig has a constant DEFAULT_MAX_ATTRIBUTE_LENGTH = 65536 * 8. By setting the property WstxInputProperties.P_MAX_ATTRIBUTE_SIZE to another value on WstxInputFactory I would expect to raise this limit.

The XmiSplitter currently initializes the XMLInputFactory in line 48. This should move into a constructor so that the above-mentioned property can be set.
