julielab / gepi

GePI (GEne - Protein Interactions) is a web portal for quick and convenient access to gene - protein interaction mentions automatically extracted from the biomedical literature, i.e. PubMed and PubMed Central (Open Access Subset).

License: GNU General Public License v3.0

Java 27.03% Shell 2.00% Python 0.57% JavaScript 67.23% CSS 0.60% Dockerfile 0.16% Less 0.02% SCSS 0.06% Perl 2.33%
bionlp interactions molecular ppi retrieval webapplication

gepi's People

Contributors

fmatthies, khituras, schsascha

gepi's Issues

Id mapping (e.g. UniProt2Gene) - howto?

How exactly do we want to map UniProt IDs to NCBI Gene IDs? One possibility would be the file described at ftp://ftp.pir.georgetown.edu/databases/idmapping/idmapping.tb.readme. We have this file on our server's hard disk and use it for creating the GeNo resources. This would result in a kind of "static" mapping, since we would have to update our resources to get a new mapping. Would that be an issue? I suppose hosting the mapping ourselves would be much quicker than querying an external web service, especially for long lists of IDs.
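
A minimal sketch of the self-hosted variant, assuming the mapping file is tab-separated with the UniProt accession in the first column and the NCBI Gene ID in the third. Both column positions, the ';' separator for multiple gene IDs, and the sample rows are assumptions for illustration; the readme is the authority on the real layout.

```python
import csv
import io

# Assumed column positions (hypothetical; verify against the readme).
UNIPROT_AC_COL = 0
GENE_ID_COL = 2


def load_uniprot_to_gene(lines, ac_col=UNIPROT_AC_COL, gene_col=GENE_ID_COL):
    """Build a dict from UniProt accession to a list of NCBI Gene IDs."""
    mapping = {}
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) <= max(ac_col, gene_col):
            continue  # skip short or malformed rows
        accession = row[ac_col].strip()
        # One accession may map to several gene IDs, assumed ';'-separated.
        gene_ids = [g.strip() for g in row[gene_col].split(";") if g.strip()]
        if accession and gene_ids:
            mapping.setdefault(accession, []).extend(gene_ids)
    return mapping


def map_ids(accessions, mapping):
    """Look up a list of accessions; unknown ones map to an empty list."""
    return {acc: mapping.get(acc, []) for acc in accessions}


# Tiny in-memory stand-in for the real file (made-up rows).
sample = io.StringIO(
    "P00001\tABC_HUMAN\t1111\n"
    "P00002\tDEF_HUMAN\t2222; 3333\n"
    "P00003\tGHI_HUMAN\t\n"
)
uniprot_to_gene = load_uniprot_to_gene(sample)
print(map_ids(["P00002", "P99999"], uniprot_to_gene))
```

Once the dict is loaded, each lookup is a constant-time operation, which is the main argument for hosting the mapping locally when users submit long ID lists.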

multiple line entries

Some of the results contain references, which are mainly titles of other publications but also supplementary material such as UniProt ID lists.
In the es_query CSV output, these entries are spread over multiple lines ('\n').

This should be handled one way (e.g. not including them in the database) or another (post-filtering).
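
The post-filtering option could look roughly like the sketch below. It assumes the CSV output quotes fields that contain line breaks, so a standard CSV parser can still recover the rows; the column names and sample values are made up.

```python
import csv
import io
import re


def flatten_multiline_fields(rows):
    """Yield rows with every run of whitespace (incl. newlines) collapsed."""
    for row in rows:
        yield [re.sub(r"\s+", " ", field).strip() for field in row]


# Made-up stand-in for es_query CSV output with one multi-line field.
raw = io.StringIO('id,text\n1,"first line\nsecond line"\n2,single line\n')

cleaned = io.StringIO()
writer = csv.writer(cleaned, lineterminator="\n")
writer.writerows(flatten_multiline_fields(csv.reader(raw)))
print(cleaned.getvalue())
```

If the output is not quoted, the rows cannot be reassembled reliably after the fact, and excluding the reference sections at indexing time would be the safer choice.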

elasticsearch index: pmid/pmcid fields wrong?

The ES index has fields for the document ID (pmcid or pmid). However, the fields seem to be populated incorrectly: we got pmid values that are actually PMC IDs (recognizable by their _id value). A further hint is that there is always a pmid field but never a pmcid field on its own; pmcid only appears when both are populated.
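
To check the suspicion systematically, something like the following sketch could be run over a sample of index documents. It assumes that PMC documents have an _id starting with "PMC" followed by the numeric ID and that genuine pmid values are purely numeric; both are assumptions about the index, and the sample documents are made up.

```python
import re

# Assumed shape of a PMC document _id (hypothetical).
PMC_ID_PATTERN = re.compile(r"^PMC(\d+)", re.IGNORECASE)


def check_id_fields(doc):
    """Return a list of human-readable problems with the pmid/pmcid fields."""
    problems = []
    pmid = doc.get("pmid")
    pmcid = doc.get("pmcid")
    match = PMC_ID_PATTERN.match(str(doc.get("_id", "")))
    if pmid is not None and not str(pmid).isdigit():
        problems.append("pmid is not purely numeric")
    if match:
        if pmcid is None:
            problems.append("PMC document without pmcid field")
        if pmid is not None and str(pmid) == match.group(1):
            problems.append("pmid equals the PMC number from _id")
    return problems


docs = [
    {"_id": "PMC100", "pmid": "100"},  # looks like a mislabelled PMC document
    {"_id": "200", "pmid": "200"},     # looks like a regular Medline document
]
for doc in docs:
    print(doc["_id"], check_id_fields(doc))
```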

Add index field for number of arguments

With the current index layout, we have no way to filter out events that have only a single argument within the Elasticsearch query.
Add a field to the index that holds just the number of arguments so we can filter on it.
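
A rough sketch of both halves, using plain dicts so it runs without a cluster. The field name num_arguments and the arguments list on the event are hypothetical names; the range clause itself is standard Elasticsearch query DSL.

```python
def with_argument_count(event, field="num_arguments"):
    """Return a copy of the event with the number of arguments added."""
    enriched = dict(event)
    enriched[field] = len(event.get("arguments", []))
    return enriched


def min_arguments_filter(minimum, field="num_arguments"):
    """Build a range clause that keeps events with at least `minimum` arguments."""
    return {"range": {field: {"gte": minimum}}}


# Made-up event as it might look before indexing.
event = {"type": "Binding", "arguments": ["geneA", "geneB"]}
print(with_argument_count(event))

# Query body that drops single-argument events.
query = {"query": {"bool": {"filter": [min_arguments_filter(2)]}}}
print(query)
```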

Problematic Medline Document: 23700993

 <MedlineCitation Status="MEDLINE" Owner="NLM">
        <PMID Version="1">23700993</PMID>
        <DateCreated>
            <Year>2013</Year>
            <Month>08</Month>
            <Day>06</Day>
        </DateCreated>
        <DateCompleted>
            <Year>2014</Year>
            <Month>02</Month>
            <Day>26</Day>
        </DateCompleted>
        <DateRevised>
            <Year>2013</Year>
            <Month>08</Month>
            <Day>06</Day>
        </DateRevised>
        <Article PubModel="Print">
            <Journal>
                <ISSN IssnType="Electronic">1875-6697</ISSN>
                <JournalIssue CitedMedium="Internet">
                    <Volume>9</Volume>
                    <Issue>2</Issue>
                    <PubDate>
                        <Year>2013</Year>
                        <Month>Jun</Month>
                    </PubDate>
                </JournalIssue>
                <Title>Current computer-aided drug design</Title>
                <ISOAbbreviation>Curr Comput Aided Drug Des</ISOAbbreviation>
            </Journal>
            <ArticleTitle>Molecular design and QSARs/QSPRs with molecular descriptors family.</ArticleTitle>
            <Pagination>
                <MedlinePgn>195-205</MedlinePgn>
            </Pagination>
            <Abstract>
                <AbstractText>The aim of the present paper is to present the methodology of the molecular descriptors family (MDF) as an integrative tool in molecular modeling and its abilities as a multivariate QSAR/QSPR modeling tool. An algorithm for extracting useful information from the topological and geometrical representation of chemical compounds was developed and integrated to calculate MDF members. The MDF methodology was implemented and the software is available online (http://l.academicdirect.org/Chemistry/SARs/MDF_SARs/). This integrative tool was developed in order to maximize performance, functionality, efficiency and portability. The MDF methodology is able to provide reliable and valid multiple linear regression models. Furthermore, in many cases, the MDF models were better than the published results in the literature in terms of correlation coefficients (statistically significant Steiger's Z test at a significance level of 5%) and/or in terms of values of information criteria and Kubinyi function. The MDF methodology developed and implemented as a platform for investigating and characterizing quantitative relationships between the chemical structure and the activity/property of active compounds was used on more than 50 study cases. In almost all cases, the methodology allowed obtaining of QSAR/QSPR models improved in explanatory power of structure-activity and structure-property relationships. The algorithms applied in the computation of geometric and topological descriptors (useful in modeling physicochemical or biological properties of molecules) and those used in searching for reliable and valid multiple linear regression models certain enrich the pool of low-cost low-time drug design tools.</AbstractText>
            </Abstract>
            <AuthorList CompleteYN="Y">
                <Author ValidYN="Y">
                    <LastName>Bolboacă</LastName>
                    <ForeName>Sorana D</ForeName>
                    <Initials>SD</Initials>
                    <AffiliationInfo>
                        <Affiliation>Department of Medical Informatics and Biostatistics, Iuliu Ha􀀅ieganu University of Medicine and Pharmacy Cluj-Napoca, 6 Louis Pasteur, 400349 Cluj, Romania.</Affiliation>
                    </AffiliationInfo>
                </Author>
                <Author ValidYN="Y">
                    <LastName>Jäntschi</LastName>
                    <ForeName>Lorentz</ForeName>
                    <Initials>L</Initials>
                </Author>
                <Author ValidYN="Y">
                    <LastName>Diudea</LastName>
                    <ForeName>Mircea V</ForeName>
                    <Initials>MV</Initials>
                </Author>
            </AuthorList>
            <Language>eng</Language>
            <PublicationTypeList>
                <PublicationType UI="D016428">Journal Article</PublicationType>
                <PublicationType UI="D013485">Research Support, Non-U.S. Gov't</PublicationType>
            </PublicationTypeList>
        </Article>
        <MedlineJournalInfo>
            <Country>United Arab Emirates</Country>
            <MedlineTA>Curr Comput Aided Drug Des</MedlineTA>
            <NlmUniqueID>101265750</NlmUniqueID>
            <ISSNLinking>1573-4099</ISSNLinking>
        </MedlineJournalInfo>
        <CitationSubset>IM</CitationSubset>
        <MeshHeadingList>
            <MeshHeading>
                <DescriptorName UI="D000465" MajorTopicYN="N">Algorithms</DescriptorName>
            </MeshHeading>
            <MeshHeading>
                <DescriptorName UI="D015195" MajorTopicYN="N">Drug Design</DescriptorName>
            </MeshHeading>
            <MeshHeading>
                <DescriptorName UI="D021281" MajorTopicYN="Y">Quantitative Structure-Activity Relationship</DescriptorName>
            </MeshHeading>
            <MeshHeading>
                <DescriptorName UI="D012984" MajorTopicYN="N">Software</DescriptorName>
            </MeshHeading>
        </MeshHeadingList>
    </MedlineCitation>
    <PubmedData>
        <History>
            <PubMedPubDate PubStatus="received">
                <Year>2013</Year>
                <Month>03</Month>
                <Day>10</Day>
            </PubMedPubDate>
            <PubMedPubDate PubStatus="revised">
                <Year>2012</Year>
                <Month>10</Month>
                <Day>26</Day>
            </PubMedPubDate>
            <PubMedPubDate PubStatus="accepted">
                <Year>2013</Year>
                <Month>04</Month>
                <Day>27</Day>
            </PubMedPubDate>
            <PubMedPubDate PubStatus="entrez">
                <Year>2013</Year>
                <Month>5</Month>
                <Day>25</Day>
                <Hour>6</Hour>
                <Minute>0</Minute>
            </PubMedPubDate>
            <PubMedPubDate PubStatus="pubmed">
                <Year>2013</Year>
                <Month>5</Month>
                <Day>25</Day>
                <Hour>6</Hour>
                <Minute>0</Minute>
            </PubMedPubDate>
            <PubMedPubDate PubStatus="medline">
                <Year>2014</Year>
                <Month>2</Month>
                <Day>27</Day>
                <Hour>6</Hour>
                <Minute>0</Minute>
            </PubMedPubDate>
        </History>
        <PublicationStatus>ppublish</PublicationStatus>
        <ArticleIdList>
            <ArticleId IdType="pubmed">23700993</ArticleId>
            <ArticleId IdType="pii">CCADD-EPUB-20130514-4</ArticleId>
        </ArticleIdList>
    </PubmedData>

Hahn's idea:

Prof. Hahn thought that it would add significant application value if a click on a portion of a diagram, e.g. a single band of a Sankey diagram, took the user to the textual data (sentences?) from which that portion was derived.
The vision is a close connection between the extracted, condensed information and the underlying text. Think of it as a zoom function: the papers are at 100% zoom, a diagram is at something like 30% (made-up numbers).
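
One possible data structure behind this, sketched under assumptions: every extracted interaction keeps its source sentence, and the diagram links are derived from the same grouping, so a clicked link can be resolved back to its evidence. The record fields (source, target, sentence) and the sample data are hypothetical.

```python
from collections import defaultdict


def index_evidence(interactions):
    """Group evidence sentences by (source, target) pair."""
    evidence = defaultdict(list)
    for item in interactions:
        evidence[(item["source"], item["target"])].append(item["sentence"])
    return dict(evidence)


def sankey_links(evidence):
    """Derive diagram links; the link width is the number of sentences."""
    return [
        {"source": source, "target": target, "value": len(sentences)}
        for (source, target), sentences in sorted(evidence.items())
    ]


def sentences_for_click(evidence, source, target):
    """Resolve a clicked link back to the sentences it was built from."""
    return evidence.get((source, target), [])


# Made-up interaction records.
interactions = [
    {"source": "A", "target": "B", "sentence": "Sentence 1 about A and B."},
    {"source": "A", "target": "B", "sentence": "Sentence 2 about A and B."},
    {"source": "A", "target": "C", "sentence": "Sentence 3 about A and C."},
]
evidence = index_evidence(interactions)
print(sankey_links(evidence))
print(sentences_for_click(evidence, "A", "B"))
```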

jpp: jules-longdocument-skipper (jls)

I can't find the jls in either our Nexus or the SVN.
As far as I'm aware, the dependency is only used in CPEAllPMC.xml, so finding out where it has gone shouldn't be that urgent.

include chemicals tagger

  • request from GlioPATH members: chem <> gene, chem <> chem interactions should be possible as well
  • identical chem ids available
  • is there an annotation for this in the available bionlp corpus? or better (for now): is there a chem tagger available?

es_query - no pmc hits

Title says all:
Using es_query to query ES yields no event matches that are outside of title/abstract.
In other words, currently we can deliver only medline hits.
Is the PMC data really indexed into the same ES as the Medline data was before?
Thoughts?

Result Portal

In the current sketch of the GePI frontend, there is just a single page where search lists are entered on the left side and results are shown on the right side.
We have space issues with this. Also, the user still has to click through the different result presentations (table, pie chart, Sankey...).
What about a whole new page dedicated to presenting results that shows just the most important diagrams and the table in a single view? Users should just search and see. Of course there should be ways to interact, but everything should be as accessible as possible.

Organise semedico-app input resources from gene database

The two of you should come up with a solid strategy for how Sascha can create the semedico-app resources and Franz can actually work with them. Possibly even some mechanism that shows what state the current resources are in (easiest thing to do: look at the creation date; is that enough to avoid confusion?)

stat widget

  • how many genes were in the input lists? how many homologous genes? how many different interaction targets?
  • show more stats when the widget is enlarged
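
A small sketch of how the basic counts might be computed. It interprets "homologous genes" as the number of distinct homology groups covered by the input, which is an assumption, and the homolog_group lookup, the pair format of the interactions, and the sample data are all hypothetical.

```python
def widget_stats(input_genes, homolog_group, interactions):
    """Compute the basic counts for the stat widget.

    input_genes:   iterable of gene IDs entered by the user
    homolog_group: dict from gene ID to a homology group ID (hypothetical)
    interactions:  iterable of (source gene ID, target gene ID) pairs
    """
    inputs = set(input_genes)
    groups = {homolog_group[g] for g in inputs if g in homolog_group}
    targets = {target for source, target in interactions if source in inputs}
    return {
        "input_genes": len(inputs),
        "homolog_groups": len(groups),
        "interaction_targets": len(targets),
    }


# Made-up example data.
stats = widget_stats(
    input_genes=["g1", "g2", "g2", "g3"],
    homolog_group={"g1": "h1", "g2": "h1", "g3": "h2"},
    interactions=[("g1", "t1"), ("g2", "t1"), ("g3", "t2"), ("x", "t3")],
)
print(stats)
```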

Who does what?

The title says it all; the content is still missing here.
I would like everyone to know what they are working on and what they need to accomplish.
Basically, this is about how we organise ourselves.

Switch to an event-centric index structure

We should probably switch to another index structure in the future. Currently, we retrieve all matching documents and then extract all inner event hits from each document. This is fine for the moment, but another index design would most probably perform much better.

Related discussion on ElasticSearch: elastic/elasticsearch#14229
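
As an illustration of what event-centric could mean, the sketch below turns one document with nested events into one flat index document per event. The field names (id, events, doc_id, event_id) are hypothetical and do not reflect the actual GePI index schema.

```python
def to_event_documents(document):
    """Yield one flat, self-contained document per nested event."""
    for position, event in enumerate(document.get("events", [])):
        event_doc = dict(event)
        # Keep a reference to the source document on every event.
        event_doc["doc_id"] = document["id"]
        event_doc["event_id"] = f"{document['id']}_{position}"
        yield event_doc


# Made-up document in the current, document-centric layout.
document = {
    "id": "doc1",
    "events": [
        {"type": "Binding", "arguments": ["geneA", "geneB"]},
        {"type": "Regulation", "arguments": ["geneC"]},
    ],
}
for event_doc in to_event_documents(document):
    print(event_doc)
```

With such a layout a query matches events directly, so no inner hits need to be collected per document; the price is that document-level fields are duplicated on every event.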

interactivity of results

Clicking an element of a (visual) search result, e.g. a Sankey edge, should let the user narrow down the results.
In general, results should be filterable starting from the base search result.

paper outline

points to address

  • who are our rivals? Where and why are we significantly more awesome?
  • how to tell the story? most likely, we should include several query scenarios, ideally showing something with existing data that has not been seen before?
    • verification experiments?
  • where to go? Reach and try high: Nature Methods
    • Policy, Requirements, etc.

Build Gene Database

Build the Neo4j Gene Database from scratch and export the resource file required by the semedico-app.

atid's

Once we get the results, we can deliver a sentence and one or two interaction partners.
While looking over the first results, I realised that we are getting different names for the same Entrez IDs (e.g. 'Arnt', 'Arnt mRNA'). I guess there are different tids referring to these Entrez IDs.
For the sake of a powerful summary, we will need to use one gene name for all grouped Entrez IDs, in other words a representative gene name for any atid with several tids.
I would like to discuss how we can achieve this efficiently.
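
One simple heuristic to start the discussion, sketched under assumptions: per Entrez ID, take the most frequently mentioned name and break ties by preferring the shorter name, then alphabetical order. The IDs below are placeholders, and a curated official symbol from the gene database would likely be a better representative where available.

```python
from collections import Counter, defaultdict


def representative_names(mentions):
    """Pick one name per Entrez ID from (entrez_id, name) pairs."""
    counts = defaultdict(Counter)
    for entrez_id, name in mentions:
        counts[entrez_id][name] += 1
    representatives = {}
    for entrez_id, name_counts in counts.items():
        # Most frequent first; ties go to the shorter, then alphabetical name.
        ranked = sorted(
            name_counts.items(),
            key=lambda item: (-item[1], len(item[0]), item[0]),
        )
        representatives[entrez_id] = ranked[0][0]
    return representatives


# Placeholder IDs; the names follow the example from the issue.
mentions = [
    ("ID_A", "Arnt"),
    ("ID_A", "Arnt mRNA"),
    ("ID_A", "Arnt"),
    ("ID_B", "Ahr"),
]
print(representative_names(mentions))
```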

PMC XMI Parsing issue: Maximum attribute size limit exceeded

Some BioC PMC documents cause errors in the XmiSplitter class when it parses the XMI data created by the BioC-PMC-CollectionReader:

javax.xml.stream.XMLStreamException: Maximum attribute size limit (524288) exceeded
       at com.ctc.wstx.sr.StreamScanner.constructLimitViolation(StreamScanner.java:2469)
       at com.ctc.wstx.sr.StreamScanner.verifyLimit(StreamScanner.java:2462)
       at com.ctc.wstx.sr.BasicStreamReader.parseAttrValue(BasicStreamReader.java:1962)
       at com.ctc.wstx.sr.BasicStreamReader.handleNsAttrs(BasicStreamReader.java:3065)
       at com.ctc.wstx.sr.BasicStreamReader.handleStartElem(BasicStreamReader.java:2963)
       at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2839)
       at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1073)
       at org.codehaus.stax2.ri.Stax2EventReaderImpl.nextEvent(Stax2EventReaderImpl.java:255)
       at de.julielab.xml.XmiSplitter.processAndParse(XmiSplitter.java:356)
       at de.julielab.xml.XmiSplitter.storeSelected(XmiSplitter.java:314)
       at de.julielab.xml.XmiSplitter.process(XmiSplitter.java:241)
       at de.julielab.jules.consumer.CasToXmiDBConsumer.process(CasToXmiDBConsumer.java:410)
       at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
       at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:374)
       at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:298)
       at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:267)
       at org.apache.uima.collection.impl.cpm.engine.ProcessingUnit.processNext(ProcessingUnit.java:897)
       at org.apache.uima.collection.impl.cpm.engine.ProcessingUnit.run(ProcessingUnit.java:577)
Jan 20, 2017 10:13:35 AM org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl callAnalysisComponentProcess(406)

We use the Woodstox parser because it solves issues with Unicode characters that the default Java 7 StAX parser has (I don't know whether those are gone in Java 8). The Woodstox parser factory is WstxInputFactory, and its configuration is held in ReaderConfig. ReaderConfig has a constant DEFAULT_MAX_ATTRIBUTE_LENGTH = 65536 * 8, which matches the limit of 524288 in the exception. I would expect that setting the property WstxInputProperties.P_MAX_ATTRIBUTE_SIZE to a higher value on the WstxInputFactory raises this limit.

The XmiSplitter currently initializes the XMLInputFactory in line 48. This should move into a constructor so that the above-mentioned property can be set.
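
Independent of the Java-side fix, a small offline check can show which attribute actually exceeds the limit and how large the new value needs to be. The sketch below uses Python's standard library parser, which does not enforce the Woodstox limit; the element and attribute names in the sample are made up, although in UIMA XMI the document text is typically the large attribute.

```python
import io
import xml.etree.ElementTree as ET

WOODSTOX_DEFAULT_LIMIT = 65536 * 8  # 524288, the value from the exception


def longest_attribute(xml_source):
    """Return (element tag, attribute name, length) of the longest attribute."""
    longest = ("", "", 0)
    for _event, element in ET.iterparse(xml_source, events=("start",)):
        for name, value in element.attrib.items():
            if len(value) > longest[2]:
                longest = (element.tag, name, len(value))
    return longest


# Made-up document with one oversized attribute value.
sample = (
    b'<root><sofa sofaString="' + b"a" * 600000 + b'"/><other x="1"/></root>'
)
tag, attribute, length = longest_attribute(io.BytesIO(sample))
print(tag, attribute, length, length > WOODSTOX_DEFAULT_LIMIT)
```

Running this over the failing documents would give a concrete upper bound to use when raising the limit, rather than picking an arbitrary value.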

Point to missing and useful documentation

Please create issues if you need further explanation of components authored by me. For now it doesn't make sense to describe everything regardless of whether you need it or not. But if you need something, don't hesitate to ask.
