phenopacket-format's Issues
provide a top level entry page
We have a few possibilities for sending people to:
1. the org -- https://github.com/phenopackets/
2. the main format repo -- https://github.com/phenopackets/phenopacket-format
3. Getting Started in the wiki (linked from the README in 2)
all are a bit geeky to the non-github familiar
It may be better if we have a splash page (authored in github pages and visible on phenopackets/github.io). Could just be very minimal, with quick links to the wiki, possibly duplicating the getting started page?
negation
assertions of the absence of a phenotype might be needed.
Ensure molecular phenotypes are representable
Things like the impact of a mutation on the affected protein's Molecular Function
Extend, revise and synchronize examples
Many of the examples in this repo are out of date. The ref implementation test resources are currently the best place for these. We should either auto-sync these (possibly confusing) or simply designate the ref as the canonical place.
We should also have more examples of snippets in the wiki
Modifiers
Consider allowing modifiers such as bilateral, unilateral. For instance these modifiers could be taken from the HPO subontology for clinical modifiers.
For instance, we might want **bilateral** iris coloboma
Add a JSON-LD context file
Although the likely schema-level specification will be JSON-Schema (#31) this will live alongside a JSON-LD context that will specify the complete semantics of the format, and will be used to convert between RDF and JSON
need association types
such as "is causal for", "has been associated with", "is protective for" etc.
Need to define these.
see also monarch-initiative/dipper#195
Traceable author statements that are referred to in other papers
"Slavotinek et al. [6] have reviewed the phenotypes associated with 2q24-2q31 and 2q31-33 deletions. The 2q31.1 region includes the HOXD cluster, one of four highly evolutionally conserved, homologous gene clusters coding for transcription factors with crucial roles in embryonic development. More specifically, the HOXD cluster has been implicated in limb formation [4, 7]."
This was a statement in passing and not the assertion of the enclosing paper. PMC4498842
However, we probably do want to capture this stuff but how? As a GO annotation?
GSoC 2016 @ ga4gh.org
@cmungall, I wish to contribute to the project Phenotype exchange standard ( #6 at https://docs.google.com/document/d/15yUku7fdR3x_nH3eziI3VgXzvkUQrcVeN7C4pwqk3nU/edit ) and participate in GSoC for ga4gh.org in same project. I have experience making web-applications and have worked with NodeJS, HTML, CSS, Javascript. I also am familiar with Python and network usage (HTTP, SSH). Please help me get started.
I am sorry for opening a possibly irrelevant issue like this, but the IRC is dead and I have had no replies to my mails.
Use JSON-schema instead of kwalify
kwalify becomes unreadable for complex schemas.
We can adopt a JSON schema standard, so long as we stick to the JSON subset of YAML. ProtoBuf may be a possibility if there is a standard mapping to JSON.
Further afield, SHACL may be worth considering
Should YAML subset be restricted to what is expressible in JSON (meaning exclude the ability to use references)
YAML allows us to reference the same object (first mention with &, future mentions with *).
While we currently use this in the schema, this may be problematic in the packets. It could be confusing for producers: when to reuse vs when to duplicate? The semantics may be subtle here.
Also, this prohibits a translation to JSON, which is strictly trees.
Note that we do allow referencing of some entities via keys (see #22). But this is not at the level of YAML itself, it's at the level of our structural schema. Having two ways of doing this is definitely confusing
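The translation problem can be illustrated without any YAML library. The following is a minimal Python sketch, assuming a hypothetical packet in which two profiles share one source object (the in-memory equivalent of a YAML anchor and alias). The field names are borrowed from examples elsewhere in these issues and are not a fixed schema.

```python
import json

# Hypothetical source record shared by two parts of a packet
# (what a YAML anchor '&' plus alias '*' would give a producer).
source = {"id": "doi: 10.1101/mcs.a000661"}
packet = {
    "phenotype_profile": [{"evidence": {"source": source}}],
    "admin_profile": [{"source": source}],
}

def shared_source(p):
    """Return True if both profiles point at the very same object."""
    a = p["phenotype_profile"][0]["evidence"]["source"]
    b = p["admin_profile"][0]["source"]
    return a is b

# JSON has no reference mechanism, so serializing writes the object twice
# and parsing yields two equal but independent copies.
round_tripped = json.loads(json.dumps(packet))

print(shared_source(packet))         # True: one shared object
print(shared_source(round_tripped))  # False: two separate copies
```

The content survives the round trip, but the fact that both mentions were the same object does not. That is the information a key-based reference (as in #22) keeps explicit.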
Specify allowable age/DOB values
We need to be able to specify age at the time of phenotype capture, as well as DOB, or a range, or none. Many times age or DOB may not be known.
A few examples of each showing allowable values would be helpful
normalize documentation
Consider reviewing various documentation pages for consistency of tone and coverage. For example, the [identifiers page](https://github.com/phenopackets/phenopacket-format/wiki/Identifiers) refers to a default JSON-LD context, which might or might not make sense for YAML versions.
Similarly, we might want to help folks bridge JSON-LD vs. JSON-Schema....
Add JSON-Schema validator to travis
Dependent on #31
investigate possibility for a global IRI scheme for uniquely identifying a variant
There are schemes such as HGVS that unambiguously map a genomic variation to a string. It is suggested we use this in #10. However, in relation to #22, if we want to reference a variant we need to do it via an identifier, and not a string, and we have constraints on the syntax of identifiers to ensure uniqueness, persistence, resolvability.
In many cases we can use a pre-existing database identifier (e.g. if the variant is in clinvar), but for some cases there will be no public ID and we will need to reference via an identifier scheme.
As all identifiers in this format are URIs (although they are typically shortened to CURIEs), there are a few possibilities:
- A non-http URI. E.g. urn:hgvs:... See for example https://en.wikipedia.org/wiki/Uniform_Resource_Name#Examples
- An http URI (URL) with recommendations for a neutral URL prefix, TBD
There has always been a debate about coupling or decoupling of identity to resolvability, going back before LSIDs. The format can remain neutral here, but we could potentially push forward in this direction.
Of course, we need to ensure whatever the technical scheme that the coupled standard (e.g. HGVS) is sufficient and can do things like uniquely reference any build of any chromosome/scaffold in any species. This may require extensions, more research required.
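As a rough illustration of the non-http option, here is a minimal Python sketch that wraps an HGVS string in a URN and recovers it again. The `urn:hgvs:` prefix is only the illustrative one used above, not a registered URN namespace, and the function names are hypothetical.

```python
from urllib.parse import quote, unquote

# Assumption: "urn:hgvs:" is the illustrative prefix from this issue,
# not a registered URN namespace.
URN_PREFIX = "urn:hgvs:"

def hgvs_to_urn(hgvs: str) -> str:
    """Percent-encode an HGVS string so it can be used as an identifier."""
    # ':' and '.' are kept readable; characters such as '>' are encoded.
    return URN_PREFIX + quote(hgvs, safe=":.")

def urn_to_hgvs(urn: str) -> str:
    """Recover the original HGVS string from the URN form."""
    if not urn.startswith(URN_PREFIX):
        raise ValueError(f"not an hgvs URN: {urn!r}")
    return unquote(urn[len(URN_PREFIX):])

urn = hgvs_to_urn("NM_123:c.-123C>T")
print(urn)               # urn:hgvs:NM_123:c.-123C%3ET
print(urn_to_hgvs(urn))  # NM_123:c.-123C>T
```

This only addresses syntax. Whether the underlying HGVS string is unique across builds and species is the open question raised above.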
consider scope.
"The format is intended for rare and undiagnosed disease patients. It is not intended for cancer patients (although presence of cancer may be a feature of the disease). It is not intended for common disease patients."
Cancer is certainly a special case that will create many difficulties (some of which we're working on), but why not common diseases? Can we define the format as something that might reasonably be extended to handle common disease?
If not, we should say why common diseases won't work.
Confusion regarding entity versus admin profile declarations
Why are the DOBs or age not asserted within the "admin profile"?
from @cmungall :
Every assertion about an entity is partitioned into a module, and can have full provenance/audit info attached. By separating this into its own chunk, we have the flexibility of swapping out this piece and referencing a more dedicated format. This is the same principle for representing anything that is not a phenotype. There is a dedicated PED format, but we can capture this in the packet if we need to. Same for variants.
My confusion is more about the fact that we are recording sex and type on the entity declaration, but age on the admin profile. Is the idea that you could have the same person entity in the same phenopacket at different ages? What if the sex changes? What does the sex refer to anyway - chromosomal sex or phenotypic sex? Should potentially use the new PATO classes here?
entities:
  - id: "doi: 10.1101/mcs.a000661#patient1"
    type: human
    biological_sex: female
admin_profile:
  - entity: "doi: 10.1101/mcs.a000661#patient1"
    property: age
    value:
      literal: 23 years
      type: age
    source:
      id: "doi: 10.1101/mcs.a000661"
Suggestion for output format
Hi Tudor,
as we are discussing right now here are some suggestions
- Allow users to either upload a Word document or paste text into a window.
- After initial text mining, it is an issue that a paper may describe two or more patients. It would be useful to find a way for users to assign mined HPO terms to individual patients. One simple thing is to allow a user to mark part of the text that pertains to a single patient. Or allow users to enter the names of all patients described in an article and have the GUI present a table like this:
    | patient ID 1 | patient ID 2 | patient ID 3
HP1 | x            |              | x
HP2 | x            | x            | x
HP3 |              |              | x
etc
Each HP is the ID and preferred name of one of the mined terms. It should be possible to delete entire rows if the HPO term was a false positive.
Document mechanism for referencing entities within and across documents
the current proposal typically follows referencing by key value over nesting, for entities. This has some advantages - representation of entities can be shifted to a different document, and referenced from the ppkt (for example, pedigree info in a ped file, variant info in a vcf file, admin info in hospital records). At the same time the format currently allows these to be represented directly in the packet, for convenience. Recall also that the format is not just for cases, but also for entities such as variants, genes, genotypes, etc that may be represented in standard biomedical and bioinformatics databases.
Regardless of the entity type, we have 4 different scenarios.
1. referencing an entity within the same ppk document
2. referencing an external entity in a separate ppk document
3. referencing an external entity in a non-ppk document, e.g. a VCF or PED file; or in some transient database
4. referencing an external entity in a database that mints stable identifiers
Especially for 4, the need for a global unambiguous scheme is paramount. We will use CURIEs here, with a set of default prefixes, and the ability to add more. THIS NEEDS TO BE DOCUMENTED.
For other cases, the requirement to have either pre-registered prefixes or a URL scheme may be onerous. For case 1, it's not strictly required, as identifiers can remain local (so long as this is clearly indicated, and client code makes no assumptions that these are global). One idea is to use something semantically equivalent to the concept of blank nodes (existential variables) in RDF. Currently the variant example uses a blank node, with the RDF convention of '_' as the prefix. This is potentially confusing (@pnrobinson had a question about this). We could use a different convention here. (In this particular example, where we are referencing a variant, we can obviate the requirement by having a convention for universal global URIs for variants.)
It may be better to simply enforce urn:uuids here (see https://en.wikipedia.org/wiki/Uniform_Resource_Name#Examples )
2 and 3 may be more difficult. We can simply ban 2. For 3, it is hard because we may not be in control of how external formats handle identifiers. We may need a bipartite scheme - a way to reference a particular document, and a local scheme for entities in that document, that is format specific, with us referencing entities by concatenating this tuple.
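The difference between scenario 4 (global CURIEs) and scenario 1 (local identifiers) can be sketched in a few lines of Python. This is only an illustration under assumptions: the prefix map below is hypothetical and contains a single entry, since the real default prefixes are still to be documented, and the function names are made up for this example.

```python
import uuid

# Hypothetical default prefix map; the real set of default prefixes
# has not been documented yet.
DEFAULT_PREFIXES = {
    "HP": "http://purl.obolibrary.org/obo/HP_",
}

def expand_curie(curie, prefixes=DEFAULT_PREFIXES):
    """Expand 'PREFIX:local' to a full URI; pass through full URIs and URNs."""
    if curie.startswith(("http://", "https://", "urn:")):
        return curie
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise KeyError(f"unregistered prefix in {curie!r}")
    return prefixes[prefix] + local

def mint_local_id():
    """Mint a urn:uuid for an entity that has no public identifier."""
    return uuid.uuid4().urn

print(expand_curie("HP:0001608"))  # http://purl.obolibrary.org/obo/HP_0001608
print(mint_local_id())             # a fresh urn:uuid:... on every call
```

With this approach a blank-node style identifier such as `_:b1` is rejected as an unregistered prefix, so producers would have to use either a registered prefix or a urn:uuid.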
Flags for phenopacket types (manual, automatic, others?)
Related to #27 (comment)
Algorithms that aggregate phenopackets with the aim of determining causal relationships need the ability to distinguish phenopackets that are the result of automatic entity recognition (eg. from journal articles) from those that are the result of manual curation. I'm not sure if we want to get more granular than that (eg. computationally inferred and manually verified). Thoughts? @cmungall @DoctorBud @tudorgroza
Define ontology to be used for encoding relationships from PED files
Should be subset of RO
See also #7
Investigate versioning of phenopacket instances due to evolution in representation and disease progression
There are two types of versioning we need to consider: representational and temporal.
1) Representational
Evolution due to change in ontology or scientific understanding. (Perhaps even to correct an error.)
2) Temporal
Evolution of sequential observations over time in a given patient/cohort/organism.
In both cases, we need a way to uniquely reference a specific version of a phenopacket instance, while being able to trace its history. This may have implications for the phenopacket registry more broadly. Long-tail repos like Dryad, Zenodo, etc are great at issuing DOIs but currently not up to that challenge of exposing versions in a sensible way. I love the way that F1000 displays/handles versioning. We should aim for that with the temporal considerations somehow woven in.
consider RTD for documentation
Remove redundancy in JSON-schema
The json-schema has many repetitive sections. This is not in itself wrong, but it can be confusing and/or intimidating to people trying to introspect the format directly from the jsons (in fact they should use model documentation, which we need to provide - phenopackets/phenopacket-reference-implementation#17 )
If we use $ref tags we can simplify it - however, some tools may not respect these (cc @kshefchek )
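To make the trade-off concrete, here is a minimal Python sketch of a schema fragment that defines a class once and references it twice, plus a small helper that inlines local references for tools that do not follow `$ref` themselves. The schema content (`OntologyClass`, `phenotype`, `onset`) is hypothetical and not taken from the actual phenopacket JSON-Schema. The helper only handles local `#/...` references, without JSON Pointer escapes or recursive definitions.

```python
import copy

# Hypothetical schema fragment: 'OntologyClass' is defined once
# and referenced from two properties.
schema = {
    "definitions": {
        "OntologyClass": {
            "type": "object",
            "properties": {"id": {"type": "string"}, "label": {"type": "string"}},
        }
    },
    "type": "object",
    "properties": {
        "phenotype": {"$ref": "#/definitions/OntologyClass"},
        "onset": {"$ref": "#/definitions/OntologyClass"},
    },
}

def resolve_pointer(root, ref):
    """Follow a local reference of the form '#/a/b/c'."""
    if not ref.startswith("#/"):
        raise ValueError(f"only local references are handled: {ref!r}")
    node = root
    for part in ref[2:].split("/"):
        node = node[part]
    return node

def inline_refs(node, root):
    """Return a copy of node with every local $ref replaced by its target."""
    if isinstance(node, dict):
        if "$ref" in node:
            target = copy.deepcopy(resolve_pointer(root, node["$ref"]))
            return inline_refs(target, root)
        return {key: inline_refs(value, root) for key, value in node.items()}
    if isinstance(node, list):
        return [inline_refs(value, root) for value in node]
    return node

# The expanded form is what the schema looks like today: repetitive but ref-free.
expanded = inline_refs(schema, schema)
```

One possible arrangement is to maintain the compact form and generate the expanded form for tools that need it.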
Investigate approaches to patient identification
Should the standard make any effort to standardize the ways that patient identifiers are represented?
Eg. for a paper, is it adequate to say "patient 1"?
This is a bag of worms--vicious, slippery ones. Gahhhh ... I don't want to touch it with a 10-foot pole.
But we should nevertheless at least park it as an issue for (much) later.
We should first think about scenarios where machine-actionable identification of patients is important.
Situations that come to mind are the usual suspects:
Deduplicating results of parallel text-mining / data integration pipelines
We are a long way off from when that is going to be the bottleneck.
age
patient age is a rat's nest. we should allow age if that's all people have and prefer date of birth if needed.
Make phenopacket examples for model organism
For fly, maybe @dosumis can help.
a potential fly article:
Differential Masking of Natural Genetic Variation by miR-9a in Drosophila. Justin J. Cassidy, Alexander J. Straughan, Richard W. Carthew. GENETICS February 11, 2016 vol. 202 no. 2 675-687; DOI: 10.1534/genetics.115.183822
I'm thinking of doing one of these for zebrafish:
A Novel Ribosomopathy Caused by Dysfunction of RPL10 Disrupts Neurodevelopment and Causes X-Linked Microcephaly in Humans
Susan S. Brooks, Alissa L. Wall, Christelle Golzio, David W. Reid, Amalia Kondyles, Jason R. Willer, Christina Botti, Christopher V. Nicchitta, Nicholas Katsanis, Erica E. Davis
GENETICS October 14, 2014 vol. 198 no. 2 723-733; DOI: 10.1534/genetics.114.168211
I especially like this one as it has both human and zebrafish.
another option:
snow white, a Zebrafish Model of Hermansky-Pudlak Syndrome Type 5
Christina M. S. Daly, Jason Willer, Ronald Gregg, Jeffrey M. Gross
GENETICS October 2, 2013 vol. 195 no. 2 481-494; DOI: 10.1534/genetics.113.154898
Add CONTRIBUTING.md, and make contact info more explicit
We welcome any comments via github; this should be made more explicit in the intro. For non-github people, we should have either a list of email contacts or a mailing list.
phesub-model.md file mentioned everywhere doesn't exist
Revise documentation on local identifiers in wiki
https://github.com/phenopackets/phenopacket-format/wiki/Identifiers
cc @balhoff
hashes are problematic as they need to be quoted in YAML
Evaluate protobuf
See also #31
treatment histories
" For example, it would not record any history of treatment".
Certain treatments may be important for discussions of medication sensitivities, etc.
Variant representation
We should discuss how to best represent variants. Probably we need something flexible like
HGVS
NM_123:c.-123C>T
with various types that also work for chromosomes, microdeletions, and other sets of findings that might be protein biomarkers etc, so that this standard can be used with a wide range of diseases and publications.
In BioLark, enable user to configure ontologies used for mining
relationship to json:API
What is the relation of a phenopacket to the best practices described at http://jsonapi.org/
Are packets just nested under data: there?
temporal onset
relative temporal ordering of phenotype onset may be important for diagnosis. I think @mellybelly has some examples where a before b vs. b before a is a vital difference
negation
negation not handled in schema:
phenotype:
  type:
    not:
      id: HP:0001608
      label: Abnormality of the voice
does not validate
Further refine evidence references
Currently, we have evidence on the phenotype profile:
phenotype_profile:
  - entity: "doi: 10.1101/mcs.a000661#patient1"
    evidence:
      type: TAS # Traceable author statement
      source:
        id: "doi: 10.1101/mcs.a000661"
        title: "De novo pathogenic variants in CHAMP1 are associated with global developmental delay, intellectual disability, and dysmorphic facial features"
Does evidence go on any element? e.g. a phenotype profile, a genotype profile, a PED/Family reference? Does it go on individual phenotypes? Or does the whole phenopacket get just one or more evidence assertions?
We should also further decide how/which evidence codes to use and what source information should be described with different evidence codes. @mbrush can you help define a few. E.g. is TAS good here? For OMIM, the example has an IEA, for patient example1, it says "observation".
Add monarch links/screenshots for export phenopacket button
example for VCF files
Could VCF files have a file header that points at a phenopacket DOI?
This could be nice and flexible as you could have other types of things you might want to point at, such as microbiome distribution, metabolomics, etc.
latest VCF format spec: https://samtools.github.io/hts-specs/VCFv4.2.pdf
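VCF meta-information lines have the form `##key=value`, so a pointer to a phenopacket could in principle be carried there. The Python sketch below shows how a reader might pick such a pointer up. The `phenopacket` key and the DOI value are placeholders invented for this illustration and are not part of the VCF specification.

```python
def find_phenopacket_refs(vcf_lines, key="phenopacket"):
    """Collect the values of '##<key>=<value>' meta-information lines."""
    refs = []
    marker = f"##{key}="
    for line in vcf_lines:
        if not line.startswith("##"):
            break  # meta-information lines end at the '#CHROM' header line
        if line.startswith(marker):
            refs.append(line[len(marker):].strip())
    return refs

example = [
    "##fileformat=VCFv4.2",
    "##phenopacket=doi:10.1101/mcs.a000661",  # hypothetical key and placeholder DOI
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]

print(find_phenopacket_refs(example))  # ['doi:10.1101/mcs.a000661']
```

The same lookup would work for other linked resources (microbiome, metabolomics) by passing a different `key`.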
In BioLark output, include frequency of mined terms
In BioLark, allow user to relax precision stringency
In order to capture all possible synonyms if desired.
consider VCF format specification
in comparison to what we will do with phenopackets:
https://github.com/samtools/hts-specs/blob/master/VCFv4.3.tex
add table to paper that discusses need for genotype standardization
we need to think about this in the context of journal associated data.
make a table that has examples, reference GENO
@probinson
Document strategy for implementation in APIs
This repo defines an exchange format(s), it is API-neutral.
However, it would be useful to have a reference API, and to document strategies for incorporating pxf into existing APIs. A compatibility layer between both protobuf and swagger would be useful. Not clear the extent that could be autogenerated from the jsonschema (see #31).
Also consider hydra (related to #40 )
Automate generation of documentation from schema
This of course complements high level documentation, examples, etc
Dependent on #31
There are various options for JSON-Schema
This is for language-binding independent docs. See also phenopackets/phenopacket-reference-implementation#4
make data available outside paywall
We need to recommend that whatever minimum phenotype standard we implement be referenceable within the JATS standard for journals, so that it can be made available outside the paywall.
variant
It would be better to put the variant and the genotype into separate elements
- type: OPA1
value: "homozygous c.1601T>G"
- value: c.1601T>G
- genotype: homozygous
Also, c.1601T>G is not yet correct HGVS. We also need to demand a transcript for instance. We should, at least in the future for our uber-phenopacket-widget, run something like Mutalyzer to check the mutation nomenclature, this is extremely important for interoperability!
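A minimal Python sketch of the proposed split, assuming free-text values like the one above. The zygosity vocabulary and function names are hypothetical, and the reference-sequence check is only a rough pattern test, not a substitute for a real checker such as Mutalyzer.

```python
import re

# Hypothetical vocabulary of zygosity terms recognised by this sketch.
ZYGOSITY_TERMS = ("homozygous", "heterozygous", "hemizygous")

def split_variant(value):
    """Split e.g. 'homozygous c.1601T>G' into separate variant and genotype fields."""
    tokens = value.split()
    zygosity = [t for t in tokens if t.lower() in ZYGOSITY_TERMS]
    rest = [t for t in tokens if t.lower() not in ZYGOSITY_TERMS]
    return {
        "value": " ".join(rest),
        "genotype": zygosity[0] if zygosity else None,
    }

def has_reference_sequence(hgvs):
    """Rough check that a description names a reference, e.g. 'NM_123:c.'."""
    return re.match(r"^[A-Za-z0-9_.]+:[cgmnpr]\.", hgvs) is not None

record = split_variant("homozygous c.1601T>G")
print(record)                                   # {'value': 'c.1601T>G', 'genotype': 'homozygous'}
print(has_reference_sequence(record["value"]))  # False: no transcript given
```

Here the second check flags the example value because it lacks a transcript, which is the gap described above.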
tool + PED
If there is a tool to be developed, it should be possible to enter pedigree information in that tool as well. the output of the tool should then be two files:
- patient -> phenotypes
- PED file
This way the user won't have to use another tool for PED-file generation and it is ensured that PATIENT-IDs are consistent.
Add version info to JSON Schema, and implement a procedure to ensure in sync with reference implementation
Currently the JSON-Schema is purely derived. The procedure is to run SchemaGeneratorTest in the reference implementation. The Makefile in this repo copies this across. This is potentially confusing, things can get out of sync.
- The JSON-Schema should be tagged with version info, and this should be synced with the reference implementation release on mvn central
- There should be a proper maven assembly target to generate the schema in the target/ directory. Should this be part of the release?
- The format and reference repos should be better synced.
- The overall process should be better documented. We have minimal docs here: https://github.com/phenopackets/phenopacket-format/wiki/JSON-Schema
In many ways a merger of the two repos might make some of this easier