jervenbolleman / faldo-paper


License: Other

TeX 100.00%

faldo-paper's People


tfuji cmungall peterjc rjpbonnal fstrozzi

faldo-paper's Issues

Figure 5: property chain

"Figure 5: OWL2 property chain axiom to infer that all positions described in a INSDC record are positioned relative to the main sequence of the record."

I'm suspicious of this - I would have to see the complete ontology with all axioms plus preferably some examples.

Presumably beginOf is the inverse of begin?

Note that property chains are unidirectional. You can infer some relationships given some chain of relationships. But given a relationship, you can't infer that there must be some specified chain of relationships. So you can infer INSDC:reference given the chain, but not the other way round.

I'm not sure the chain beginOf o endOf can ever be satisfied coherently.
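The one-way nature of property chains can be illustrated with a minimal sketch in plain Python. The property names below (ex:partOf, ex:mainSequence, ex:reference) are hypothetical stand-ins and are not the axiom from Figure 5; the point is only that a reasoner derives the implied relationship from the chain and never the chain from the relationship.

```python
def apply_chain(triples, chain, implied):
    """Forward-apply a two-step property chain: p1 o p2 -> implied."""
    p1, p2 = chain
    inferred = set()
    for s, p, mid in triples:
        if p != p1:
            continue
        for s2, q, o in triples:
            if q == p2 and s2 == mid:
                inferred.add((s, implied, o))
    return inferred


# Hypothetical data: pos1 is reachable through the chain, pos2 only has
# the implied property asserted directly.
asserted = {
    ("ex:pos1", "ex:partOf", "ex:record1"),
    ("ex:record1", "ex:mainSequence", "ex:seq1"),
    ("ex:pos2", "ex:reference", "ex:seq2"),
}

inferred = apply_chain(asserted, ("ex:partOf", "ex:mainSequence"), "ex:reference")

# The chain yields a reference for pos1, but nothing is derived about
# which record pos2 belongs to.
pos2_chain_links = [
    t for t in asserted | inferred
    if t[0] == "ex:pos2" and t[1] == "ex:partOf"
]
```

Here the direct assertion on ex:pos2 produces no ex:partOf link, which is the direction that cannot be inferred.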

Title: should it include FALDO?

e.g.
FALDO: a semantic standard describing the location of nucleotide and protein feature annotation.

Genome Variation Format (GVF)

Currently we mention GVF once in the BioInterchange text (without definition in place, or in the abbreviations).

We could expand this, perhaps cite Reese et al. (2010) http://dx.doi.org/10.1186/gb-2010-11-8-r88 and include GVF in the abbreviations?

In the short term I have simply removed the mention of GVF and just used GTF and GFF3 as example formats in the BioInterchange description.

Split locations.tex into multiple text files one per section

Trans-splicing example

"In a process called transplicing exons of one gene can be found on multiple chromosomes"

This is a sufficient but not a necessary condition. Trans-spliced genes can have exons from the same chromosome, either from a distant site (as is common in C. elegans) or from the same region.

http://purl.obolibrary.org/obo/SO_0001420 ! trans_splice_site [DEF: "Primary transcript region bordering trans-splice junction."]

My favorite is mod(mdg4):

Mongelard F, Labrador M, Baxter EM, Gerasimova TI, Corces VG: Trans-splicing as a novel mechanism to explain interallelic complementation in Drosophila.
Genetics 2002, 160:1481-1487

You can get the GFF3 from FlyBase:

http://flybase.org/reports/FBgn0002781.html

(but it may need "re-stitching", as the translation to GFF3 may lose some information)

Figure 8 and 9 use faldo:ExactlyKnownPosition instead of ExactPosition

(new) Figure 1: The classes and object properties used in FALDO

The caption needs to be expanded: The left half of the figure shows the classes, and the indentation and down arrows presumably indicate subclasses [I suspect a tree-like presentation would make this clearer, or at least increasing the indentation?]. The right side shows the properties, but what is the meaning of the blue icon (rectangle with white on the left end) versus the green icon (plain rectangle)?

Also, more visual separation between the classes (left) and properties (right) might help avoid an apparent mapping between a class and the property that happens to be horizontally aligned with it (e.g. class owl:Nothing and property after).

Write paragraphs for main paper

What we need from each sub-group is a contribution of 2-3 paragraphs describing your group's hackathon successes and ongoing activities. Also (to save Mark look-up time) please list all authors from your sub-group in that document.

Switch to licence CC-BY 4.0?

Currently FALDO uses the CC-BY 3.0 licence. Should it switch to the newer CC-BY 4.0 licence?

Position position

The lower-case predicate faldo:position is easily confused with the upper-case class faldo:Position. Should we change one of the labels, or should we point to the convention used in, e.g., DCAT?

Make it clearer that we mean a database record when we use the word sequence.

In review, the use of the word sequence led to confusion: it was read as referring to the real molecules in nature instead of the record imported into EMBL.

Address circular genomes

I think the bacterial folks would be most happy if you address circular genomes, even if it is just to say that it's currently underspecified or not supported, but possible in the future.
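To make the underspecified case concrete, here is a small sketch of why a feature spanning the origin of a circular sequence needs extra information. It assumes 1-based, inclusive coordinates and forward-strand features only; the function name and the circular flag are illustrative and are not FALDO terms.

```python
def region_length(begin, end, sequence_length, circular=False):
    """Length of a 1-based, inclusive forward-strand region.

    On a circular sequence a region may wrap past the origin, in which
    case end < begin. Without knowing the sequence is circular (and how
    long it is), such a region cannot be interpreted.
    """
    if begin <= end:
        return end - begin + 1
    if not circular:
        raise ValueError("end before begin on a linear sequence")
    # Part up to the last base, plus part from base 1 to end.
    return (sequence_length - begin + 1) + end


linear = region_length(10, 20, sequence_length=100)
wrapped = region_length(95, 5, sequence_length=100, circular=True)
```

The wrapped case depends on two facts that begin and end alone do not carry: the topology of the reference and its total length.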

JBrowse screenshot as new Figure?

We currently only mention the JBrowse example in passing. One idea would be an additional figure showing a JBrowse screenshot, perhaps displaying one of the real examples we already discuss, or a multi-dataset federated query?

Relation of Vario to Faldo

Could you please also check out the work of Mauno Vihinen to compare and comment? http://t.co/K4RXGFnPXj

ENA vs EMBL-Bank

Should we be talking about the ENA, or EMBL Nucleotide Sequence Database (EMBL-Bank), or both?

http://www.ebi.ac.uk/ena/about/formats
"Data tiers within ENA provide a level of abstraction from the underlying infrastructure that has resulted from the integration of three databases: the EMBL Nucleotide Sequence Database (EMBL-Bank), the Trace Archive and the Sequence Read Archive (SRA)."

Currently we mostly use ENA, but there are still references to EMBL (not currently in the abbreviations table). Probably in terms of annotation, we're mainly concerned with EMBL-Bank (as part of the triple mirror under the INSDC with NCBI/GenBank and DDBJ).

Alternative formats/content negotiation

This is a tricky issue and might require a URI change of the ontology :(

http://biohackathon.org/resources/faldo is 302 redirected to http://www.biohackathon.org/resource/faldo then is redirected to http://78462f86-a-7141bcef-s-sites.googlegroups.com/a/biohackathon.org/www/resource/faldo and is finally redirected to https://78462f86-a-7141bcef-s-sites.googlegroups.com/a/biohackathon.org/www/resource/faldo by Google Sites. It might be inconvenient for some applications.

Is there a way we could set up content-type negotiation and auto conversion between formats for FALDO?

I think it's a good idea to have a release process with automatic validation, JUnit suites and publishing of the ontology in different forms.

As the biohackathon.org web site is run by Google Sites, I don't know how much control we can have over it...

One possible solution would be to host those resources (including ontology files) on another server by assigning a new subdomain (e.g., purl.biohackathon.org); however, that requires a change of the ontology URI.
Alternatively, we may keep the current setup but also put the versions of the FALDO files on BioPortal.
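As a rough illustration of what content negotiation would involve on whichever server ends up hosting the files, the sketch below picks a serialization from an HTTP Accept header. The file names (faldo.ttl, faldo.owl, faldo.html) and the set of offered media types are assumptions made for the example, not the current layout, and partial wildcards such as text/* are deliberately not handled.

```python
# Hypothetical mapping from media type to a published file.
AVAILABLE = {
    "text/turtle": "faldo.ttl",
    "application/rdf+xml": "faldo.owl",
    "text/html": "faldo.html",
}


def parse_accept(header):
    """Return (media_type, q) pairs sorted by descending q value."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0
        for field in fields[1:]:
            name, _, value = field.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    # sorted() is stable, so entries with equal q keep header order.
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


def choose_file(accept_header, default="text/html"):
    """Pick the file for the best acceptable media type, or None."""
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type in AVAILABLE:
            return AVAILABLE[media_type]
        if media_type == "*/*":
            return AVAILABLE[default]
    return None


browser = choose_file("text/html,application/xhtml+xml;q=0.9,*/*;q=0.8")
rdf_client = choose_file("application/rdf+xml;q=0.9, text/turtle")
```

A server that returns None here would normally answer 406 Not Acceptable. This kind of logic needs control over the responding server, which is the part Google Sites may not give us.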

Cite 1st and 2nd BioHackathon papers too?

In the main text,

As part of the Integrated Database Project (http://lifesciencedb.mext.go.jp/en/)
and the Core Technology Development Program (http://biosciencedbc.jp/en/tec-dev-prog/programs)
to integrate life science databases in Japan, the National Bioscience Database
Center (NBDC) and the Database Center for Life Science (DBCLS) have hosted
an annual “BioHackathon” series of meetings bringing together biological
database teams, open source programmers, and domain experts in Semantic
Web and Linked Data [6,7].

Given the text covers the entire series, including the citations for the 1st and 2nd meeting too makes sense to me [Katayama et al 2010, 2011].
http://dx.doi.org/10.1186/2041-1480-1-8
http://dx.doi.org/10.1186/2041-1480-2-4

Figure 1 changes

Figure 1 is quite visually appealing, but I think could do with some improvement.

First of all, this should probably be figure 2. Figure 1 should orient the reader and give them some kind of overview. Perhaps the SubClass hierarchy of FALDO. This figure is already getting down in the weeds with some quite specific details.

In fact it may be better to precede this with a figure comparing a chunk of GFF with a FALDO instance graph, giving a "bigger picture" view.
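For a sense of what such a side-by-side comparison would contain, the sketch below turns one GFF3 line into a FALDO-style instance graph, represented as plain Python tuples. The example.org base URI, the "#region"/"#begin"/"#end" node naming and the sample gene are invented for illustration. The swap of begin and end on the reverse strand reflects my understanding of the FALDO convention and should be checked against the ontology documentation.

```python
FALDO = "http://biohackathon.org/resource/faldo#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def gff3_to_faldo(line, base="http://example.org/"):
    """Convert one GFF3 feature line into a list of (s, p, o) triples."""
    cols = line.rstrip("\n").split("\t")
    seqid, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
    attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)

    feature = base + attrs["ID"]
    region = feature + "#region"
    begin_node = feature + "#begin"
    end_node = feature + "#end"

    strand_class = {
        "+": "ForwardStrandPosition",
        "-": "ReverseStrandPosition",
    }.get(strand)
    if strand == "-":
        # Assumption: begin is the biological start, so it is the
        # larger coordinate on the reverse strand.
        start, end = end, start

    triples = [
        (feature, FALDO + "location", region),
        (region, RDF_TYPE, FALDO + "Region"),
        (region, FALDO + "begin", begin_node),
        (region, FALDO + "end", end_node),
    ]
    for node, coord in ((begin_node, start), (end_node, end)):
        triples.append((node, RDF_TYPE, FALDO + "ExactPosition"))
        if strand_class:
            triples.append((node, RDF_TYPE, FALDO + strand_class))
        triples.append((node, FALDO + "position", coord))
        triples.append((node, FALDO + "reference", base + seqid))
    return triples


gff_line = "chr1\texample\tgene\t100\t200\t.\t+\t.\tID=gene1;Name=demo"
triples = gff3_to_faldo(gff_line)
```

One nine-column line becomes twelve triples, which is the trade-off an overview figure would need to motivate: the graph is more verbose, but every position carries its own strand and reference.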

Comments on the figure as it stands:

The blank node notation ("_:foo") is quite geeky and probably needs to be explained.
Should chr1 be a blank node?
The diagram uses the same convention (circles) for resources and literals. Maybe it's just me, but I found this confusing.
The usual W3 convention of circles=classes boxes=individuals is reversed
1(a) and 1(b) should be reversed. Introduce the problem ("how do we represent this thing you're familiar with") and then show the solution ("here is how it is represented in FALDO")
The notation "a" should be defined (= rdf:type)
What is the type of fr and rr?
I would prefer a more concrete example. What kind of feature is this? What's its ID?
It's not clear from the diagram, but for this to be useful, :fr and :rr should connect to some feature of interest.

Claiming to handle all biological use cases

We have strong evidence of its power in that FALDO can handle all of the annotations in INSDC/DDBJ and UniProt, but biological systems have a habit of throwing up more strange cases. I therefore feel that the current wording in the abstract and conclusion is too strong: "expressive enough to describe all known biological use cases accurately" and "power to describe all biological feature positions". As a reviewer I would ask for this language to be toned down.

inverse properties

Figure 2: OWL2 property chain axiom - this refers to faldo:endOf. In the current version of FALDO, there are no inverses declared. These should be added to FALDO, or the document should substitute the named properties with OWL inverse property expressions (which starts to look ugly in RDF syntax).

Details for BioPerl's FALDO exporter

We say in the text that BioPerl now includes a FALDO feature exporter. From which version onwards, and is it in the main bundle or distributed separately?

Visual diagrams to supplement the (text only) examples?

The current text has a number of text-only "Figures", e.g. showing a partial INSDC feature table, or a fragment of a UniProt flat file, and the FALDO equivalent. It would brighten up the paper (and hopefully explain the annotation example more immediately) if these were supplemented with an actual figure.

I could probably produce some line art and/or generate figures using Biopython's GenomeDiagram for this if people thought it would be a sensible addition.