phenopackets / phenopacket-format

phenopacket-format's Introduction

PhenoPackets

CAUTION: THIS REPO HAS BEEN RETIRED!

This initial implementation has now been archived - please refer to the phenopacket-schema repository for the current implementation.

Overview

PhenoPackets is an open standard for representing and sharing detailed descriptions of phenotypic abnormalities and characteristics of individual patients, organisms, diseases, and publications. This repository serves as the primary documentation about the PhenoPacket Exchange Format (PXF), including the JSON and YAML representations. Other repositories (see Implementations below) contain Java, JavaScript, Python and other language-specific tools and implementations.

Motivation

The health of an individual organism results from a complex interplay between its genes and environment. Although great strides have been made in standardizing the representation of genetic information for exchange, there are no comparable standards to represent phenotypes (e.g. patient symptoms and disease features) and environmental factors. Phenotypic abnormalities of individual organisms are currently described in diverse places and in diverse formats: publications, databases, health records, registries, clinical trials, and even social media. However, the lack of standardization, accessibility, and computability among these contexts makes it extremely difficult to effectively extract and utilize these data, hindering the understanding of genetic and environmental contributions to disease.

Documentation

See the Phenopackets.org site for the public-facing project documentation.

Or, see the detailed Markdown-based documentation via GitHub.

The Wiki has additional documentation, although it may be out-of-date.

Implementations

Contributing

The PhenoPackets standard is still evolving, and there are many opportunities to help, including improving the expressivity of the format and providing implementations that enable.

The Issue Tracker is a good start.

phenopacket-format's People

Contributors

mellybelly jmcmurry harryhoch tudorgroza balhoff heuermh doctorbud dfear2112 menggezhao nlharris

phenopacket-format's Issues

relationship to json:API

What is the relation of a phenopacket to the best practices described at http://jsonapi.org/?

Are packets just nested under data: there?

phesub-model.md file mentioned everywhere doesn't exist

https://github.com/phenopackets/phenopacket-format/blob/master/phesub-model.md

Further refine evidence references

Currently, we have evidence on the phenotype profile:
phenotype_profile:
  - entity: "doi: 10.1101/mcs.a000661#patient1"
    evidence:
      type: TAS # Traceable author statement
      source:
        id: "doi: 10.1101/mcs.a000661"
        title: "De novo pathogenic variants in CHAMP1 are associated with global developmental delay, intellectual disability, and dysmorphic facial features"

Does evidence go on any element? e.g. a phenotype profile, a genotype profile, a PED/Family reference? Does it go on individual phenotypes? Or does the whole phenopacket get just one or more evidence assertions?

We should also further decide how/which evidence codes to use and what source information should be described with different evidence codes. @mbrush can you help define a few. E.g. is TAS good here? For OMIM, the example has an IEA, for patient example1, it says "observation".

In BioLark, enable user to configure ontologies used for mining

age

patient age is a rat's nest. we should allow age if that's all people have and prefer date of birth if needed.

Investigate versioning of phenopacket instances due to evolution in representation and disease progression

There are two types of versioning we need to consider: representational and temporal.

1) Representational

Evolution due to change in ontology or scientific understanding. (Perhaps even to correct an error.)

2) Temporal

Evolution of sequential observations over time in a given patient/cohort/organism.

In both cases, we need a way to uniquely reference a specific version of a phenopacket instance, while being able to trace its history. This may have implications for the phenopacket registry more broadly. Long-tail repos like Dryad, Zenodo, etc are great at issuing DOIs but currently not up to that challenge of exposing versions in a sensible way. I love the way that F1000 displays/handles versioning. We should aim for that with the temporal considerations somehow woven in.

consider scope.

"The format is intended for rare and undiagnosed disease patients. It is not intended for cancer patients (although presence of cancer may be a feature of the disease). It is not intended for common disease patients."

Cancer is certainly a special case that will create many difficulties (some of which we're working on), but why not common diseases? Can we define the format as something that might reasonably be extended to handle common disease?

If not, we should say why common diseases won't work.

Investigate approaches to patient identification

Should the standard make any effort to standardize the ways that patient identifiers are represented?

Eg. for a paper, is it adequate to say "patient 1"?

This is a bag of worms--vicious, slippery ones. Gahhhh ... I don't want to touch it with a 10-foot pole.
But we should nevertheless at least park it as an issue for (much) later.
We should first think about scenarios where machine-actionable identification of patients is important.

Situations that come to mind are the usual suspects:
Deduplicating results of parallel text-mining / data integration pipelines

We are a long way off from when that is going to be the bottleneck.

Evaluate protobuf

investigate possibility for a global IRI scheme for uniquely identifying a variant

There are schemes such as HGVS that unambiguously map a genomic variation to a string. It is suggested we use this in #10. However, in relation to #22, if we want to reference a variant we need to do it via an identifier, and not a string, and we have constraints on the syntax of identifiers to ensure uniqueness, persistence, resolvability.

In many cases we can use a pre-existing database identifier (e.g. if the variant is in clinvar), but for some cases there will be no public ID and we will need to reference via an identifier scheme.

As all identifiers in this format are URIs (although they are typically shortened to CURIEs), there are a few possibilities:

A non-http URI. E.g. urn:hgvs:... See for example https://en.wikipedia.org/wiki/Uniform_Resource_Name#Examples
An HTTP URI (URL) with recommendations for a neutral URL prefix, TBD
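
As a rough illustration of the CURIE/URI relationship and the non-HTTP option listed above, the sketch below expands a CURIE against a prefix map and wraps an HGVS string in a URN. The `EX` prefix, the `urn:hgvs:` namespace, and both function names are assumptions made for this example; none of them is defined by the format.

```python
from urllib.parse import quote

# Hypothetical prefix map; a real one would come from an agreed, documented context.
PREFIXES = {
    "HP": "http://purl.obolibrary.org/obo/HP_",
    "EX": "http://example.org/variant/",  # placeholder prefix, for illustration only
}


def expand_curie(curie, prefixes=PREFIXES):
    """Expand 'PREFIX:local' into a full URI; raises KeyError for unknown prefixes."""
    prefix, local = curie.split(":", 1)
    return prefixes[prefix] + local


def hgvs_to_urn(hgvs):
    """Wrap an HGVS string in an assumed 'urn:hgvs:' namespace.

    Characters such as '>' are percent-encoded so the result is a legal URI.
    """
    return "urn:hgvs:" + quote(hgvs, safe=":.-_")


hp_uri = expand_curie("HP:0001608")
variant_urn = hgvs_to_urn("NM_123:c.-123C>T")
```

Percent-encoding keeps the identifier syntactically valid, at the cost of making it slightly less readable than the raw HGVS string.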

There has always been a debate about coupling or decoupling of identity to resolvability, going back before LSIDs. The format can remain neutral here, but we could potentially push forward in this direction.

Of course, whatever the technical scheme, we need to ensure that the coupled standard (e.g. HGVS) is sufficient and can do things like uniquely reference any build of any chromosome/scaffold in any species. This may require extensions; more research is required.

consider VCF format specification

in comparison to what we will do with phenopackets:
https://github.com/samtools/hts-specs/blob/master/VCFv4.3.tex

@jmcmurry
@drseb
@pnrobinson

Variant representation

We should discuss how to best represent variants. Probably we need something flexible like

HGVS
NM_123:c.-123C>T

with various types that also work for chromosomes, microdeletions, and other sets of findings that might be protein biomarkers etc, so that this standard can be used with a wide range of diseases and publications.

Remove redundancy in JSON-schema

The json-schema has many repetitive sections. This is not in itself wrong, but it can be confusing and/or intimidating to people trying to introspect the format directly from the jsons (in fact they should use model documentation, which we need to provide - phenopackets/phenopacket-reference-implementation#17 )

If we use $ref tags we can simplify it - however, some tools may not respect these (cc @kshefchek )
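
To make the trade-off concrete, here is a minimal sketch of a schema that defines a shared structure once and points at it with `$ref`, plus a small helper that inlines the references for tools that do not follow them. The schema content (`OntologyClass`, `phenotype`, `onset`) and the `resolve_refs` helper are hypothetical; the helper handles only local `#/...` pointers and does no cycle detection or JSON Pointer unescaping.

```python
import copy


def resolve_refs(node, root):
    """Recursively replace local '#/...' references with a copy of their target."""
    if isinstance(node, dict):
        ref = node.get("$ref")
        if isinstance(ref, str) and ref.startswith("#/"):
            target = root
            for part in ref[2:].split("/"):
                target = target[part]
            return resolve_refs(copy.deepcopy(target), root)
        return {key: resolve_refs(value, root) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve_refs(item, root) for item in node]
    return node


# Hypothetical schema fragment: the repeated structure is defined once.
schema = {
    "definitions": {
        "OntologyClass": {
            "type": "object",
            "properties": {"id": {"type": "string"}, "label": {"type": "string"}},
        }
    },
    "type": "object",
    "properties": {
        "phenotype": {"$ref": "#/definitions/OntologyClass"},
        "onset": {"$ref": "#/definitions/OntologyClass"},
    },
}

# Expanded form for tools that ignore $ref.
flat = resolve_refs(schema, schema)
```

Keeping the `$ref` version as the source and generating the expanded version would let both kinds of tooling be served from one file.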

consider RTD for documentation

http://read-the-docs.readthedocs.org

Add JSON-Schema validator to travis

Dependent on #31

Add a JSON-LD context file

Although the likely schema-level specification will be JSON-Schema (#31) this will live alongside a JSON-LD context that will specify the complete semantics of the format, and will be used to convert between RDF and JSON

Add version info to JSON Schema, and implement a procedure to ensure in sync with reference implementation

Currently the JSON-Schema is purely derived. The procedure is to run SchemaGeneratorTest in the reference implementation. The Makefile in this repo copies this across. This is potentially confusing, things can get out of sync.

The JSON-Schema should be tagged with version info, and this should be synced with the reference implementation release on Maven Central.
There should be a proper maven assembly target to generate the schema in the target/ directory. Should this be part of the release?
The format and reference repos should be better synced.
The overall process should be better documented. We have minimal docs here: https://github.com/phenopackets/phenopacket-format/wiki/JSON-Schema

In many ways a merger of the two repos might make some of this easier

Make phenopacket examples for model organism

For fly, maybe @dosumis can help.
a potential fly article:

Differential Masking of Natural Genetic Variation by miR-9a in Drosophila. Justin J. Cassidy, Alexander J. Straughan, Richard W. Carthew. GENETICS February 11, 2016 vol. 202 no. 2 675-687; DOI: 10.1534/genetics.115.183822

I'm thinking of doing one of these for zebrafish:

A Novel Ribosomopathy Caused by Dysfunction of RPL10 Disrupts Neurodevelopment and Causes X-Linked Microcephaly in Humans
Susan S. Brooks, Alissa L. Wall, Christelle Golzio, David W. Reid, Amalia Kondyles, Jason R. Willer, Christina Botti, Christopher V. Nicchitta, Nicholas Katsanis, Erica E. Davis
GENETICS October 14, 2014 vol. 198 no. 2 723-733; DOI: 10.1534/genetics.114.168211

I especially like this one as it has both human and zebrafish.

another option:
snow white, a Zebrafish Model of Hermansky-Pudlak Syndrome Type 5
Christina M. S. Daly, Jason Willer, Ronald Gregg, Jeffrey M. Gross
GENETICS October 2, 2013 vol. 195 no. 2 481-494; DOI: 10.1534/genetics.113.154898

tool + PED

If there is a tool to be developed, it should be possible to enter pedigree information in that tool as well. the output of the tool should then be two files:

patient -> phenotypes
PED file

This way the user won't have to use another tool for PED-file generation and it is ensured that PATIENT-IDs are consistent.
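
A minimal sketch of the two-file idea, assuming the tool holds one list of member records and derives both outputs from it so the patient IDs cannot drift apart. The record fields, the JSON layout of the phenotype file, and the function name are assumptions; the PED columns follow the usual six-column convention (family, individual, father, mother, sex, affection status).

```python
import json

SEX_CODES = {"male": "1", "female": "2"}  # any other value is written as 0 (unknown)


def write_outputs(family_id, members):
    """Return (phenotype_json, ped_text), both built from the same member list."""
    phenotypes = {m["id"]: m.get("phenotypes", []) for m in members}
    ped_lines = []
    for m in members:
        affected = m.get("affected")
        # PED affection status: 0 = missing, 1 = unaffected, 2 = affected
        status = "0" if affected is None else ("2" if affected else "1")
        ped_lines.append("\t".join([
            family_id,
            m["id"],
            m.get("father", "0"),  # 0 = parent not present in the pedigree
            m.get("mother", "0"),
            SEX_CODES.get(m.get("sex"), "0"),
            status,
        ]))
    return json.dumps(phenotypes, indent=2), "\n".join(ped_lines) + "\n"


# Hypothetical family used only to exercise the function.
members = [
    {"id": "patient1", "father": "father1", "mother": "mother1",
     "sex": "female", "affected": True, "phenotypes": ["HP:0001608"]},
    {"id": "father1", "sex": "male", "affected": False},
    {"id": "mother1", "sex": "female", "affected": False},
]
pheno_json, ped_text = write_outputs("FAM1", members)
```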

temporal onset

relative temporal ordering of phenotype onset may be important for diagnosis. I think @mellybelly has some examples where a before b vs. b before a is a vital difference

Extend, revise and synchronize examples

Many of the examples in this repo are out of date. The ref implementation test resources are currently the best place for these. We should either auto-sync these (possibly confusing) or simply designate the ref as the canonical place.

We should also have more examples of snippets in the wiki

Revise documentation on local identifiers in wiki

https://github.com/phenopackets/phenopacket-format/wiki/Identifiers

cc @balhoff

hashes are problematic as they need to be quoted in YAML

Automate generation of documentation from schema

This of course complements high level documentation, examples, etc

Dependent on #31

There are various options for JSON-Schema

This is for language-binding independent docs. See also phenopackets/phenopacket-reference-implementation#4

add table to paper that discusses need for genotype standardization

we need to think about this in the context of journal associated data.
make a table that has examples, reference GENO
@probinson

need association types

such as "is causal for", "has been associated with", "is protective for" etc.
Need to define these.

Specify allowable age/DOB values

We need to be able to specify age at the time of phenotype capture, as well as DOB, or a range, or none. Many times age or DOB may not be known.

A few examples of each showing allowable values would be helpful
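
One possible starting point, sketched under the assumption that ages are written as a small subset of ISO 8601 durations (for example `P23Y`), dates of birth as ISO dates, ranges as a pair of ages, and "not known" as a missing value. None of these choices has been agreed for the format, and `classify_age` is a hypothetical helper.

```python
import re
from datetime import date

# Years/months/days only, e.g. "P23Y" or "P4Y6M"; deliberately not full ISO 8601.
_DURATION = re.compile(r"^P(?=\d)(?:\d+Y)?(?:\d+M)?(?:\d+D)?$")


def classify_age(value):
    """Return 'unknown', 'age', 'age_range' or 'date_of_birth'; raise ValueError otherwise."""
    if value is None:
        return "unknown"
    if isinstance(value, (list, tuple)) and len(value) == 2:
        if all(isinstance(v, str) and _DURATION.match(v) for v in value):
            return "age_range"
        raise ValueError("age range must be a pair of durations")
    if isinstance(value, str):
        if _DURATION.match(value):
            return "age"
        date.fromisoformat(value)  # raises ValueError for malformed dates
        return "date_of_birth"
    raise ValueError("unsupported age value")


examples = {
    "age": classify_age("P23Y"),
    "range": classify_age(("P2Y", "P3Y6M")),
    "dob": classify_age("1993-04-01"),
    "none": classify_age(None),
}
```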

Traceable author statements that are referred to in other papers

"Slavotinek et al. [6] have reviewed the phenotypes associated with 2q24-2q31 and 2q31-33 deletions. The 2q31.1 region includes the HOXD cluster, one of four highly evolutionally conserved, homologous gene clusters coding for transcription factors with crucial roles in embryonic development. More specifically, the HOXD cluster has been implicated in limb formation [4, 7]."

This was a statement in passing and not the assertion of the enclosing paper. PMC4498842

However, we probably do want to capture this stuff but how? As a GO annotation?

In BioLark, allow user to relax precision stringency

In order to capture all possible synonyms if desired.

Use JSON-schema instead of kwalify

kwalify becomes unreadable for complex schemas.

We can adopt a JSON schema standard, so long as we stick to the JSON subset of YAML. ProtoBuf may be a possibility if there is a standard mapping to JSON.

Further afield, SHACL may be worth considering

Add monarch links/screenshots for export phenopacket button

Dependent on:

https://github.com/monarch-initiative/monarch-app/pull/1212

Modifiers

Consider allowing modifiers such as bilateral or unilateral. These modifiers could be taken from the HPO subontology for clinical modifiers.
For instance, we might want **bilateral** iris coloboma.

Confusion regarding entity versus admin profile declarations

Why are the DOBs or age not asserted within the "admin profile"?

from @cmungall :
Every assertion about an entity is partitioned into a module, and can have full provenance/audit info attached. By separating this into its own chunk, we have the flexibility of swapping out this piece and referencing a more dedicated format. This is the same principle for representing anything that is not a phenotype. There is a dedicated PED format, but we can capture this in the packet if we need to. Same for variants.

My confusion is more about the fact that we are recording sex and type on the entity declaration, but age on the admin profile. Is the idea that you could have the same person entity in the same phenopacket at different ages? What if the sex changes? What does the sex refer to anyway - chromosomal sex or phenotypic sex? Should potentially use the new PATO classes here?

entities:
  - id: "doi: 10.1101/mcs.a000661#patient1"
    type: human
    biological_sex: female

admin_profile:
  - entity: "doi: 10.1101/mcs.a000661#patient1"
    property: age
    value:
      literal: 23 years
      type: age
    source:
      id: "doi: 10.1101/mcs.a000661"

Ensure molecular phenotypes are representable

Things like the impact of a mutation on the affected protein's Molecular Function

variant

It would be better to put the variant and the genotype into separate elements

type: OPA1
value: "homozygous c.1601T>G"
value: c.1601T>G
genotype: homozygous

Also, c.1601T>G is not yet correct HGVS. We also need to demand a transcript for instance. We should, at least in the future for our uber-phenopacket-widget, run something like Mutalyzer to check the mutation nomenclature, this is extremely important for interoperability!
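
A small sketch of the proposed split, assuming the combined string is whitespace-separated and the genotype is one of a few zygosity words. The `split_variant` helper and the regular expression are illustrative only: the pattern merely checks that some reference sequence precedes the description, and is in no way a substitute for a real nomenclature checker such as Mutalyzer.

```python
import re

ZYGOSITY = {"homozygous", "heterozygous", "hemizygous"}

# Very rough shape check: "<reference>:<type>.<description>", e.g. "NM_123:c.-123C>T".
_HAS_REFERENCE = re.compile(r"^[A-Za-z0-9_.]+:[cgmnpr]\..+$")


def split_variant(value):
    """Split e.g. 'homozygous c.1601T>G' into separate variant and genotype fields."""
    tokens = value.split()
    genotype = next((t.lower() for t in tokens if t.lower() in ZYGOSITY), None)
    variant = " ".join(t for t in tokens if t.lower() not in ZYGOSITY)
    return {
        "value": variant,
        "genotype": genotype,
        "has_reference_sequence": bool(_HAS_REFERENCE.match(variant)),
    }


record = split_variant("homozygous c.1601T>G")
```

Here `has_reference_sequence` comes out false for `c.1601T>G`, which flags exactly the missing-transcript problem described above.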

negation

assertions of the absence of a phenotype might be needed.

Define ontology to be used for encoding relationships from PED files

Should be subset of RO

provide a top level entry page

We have a few possibilities for sending people to:

the org -- https://github.com/phenopackets/
the main format repo -- https://github.com/phenopackets/phenopacket-format
Getting Started in the wiki (linked from the README in 2)

all are a bit geeky to the non-github familiar

It may be better if we have a splash page (authored in github pages and visible on phenopackets/github.io). Could just be very minimal, with quick links to the wiki, possibly duplicating the getting started page?

Should YAML subset be restricted to what is expressible in JSON (meaning exclude the ability to use references)

YAML allows us to reference the same object (first mention with &, future with *).

While we currently use this in the schema, this may be problematic in the packets. It could be confusing for producers: when to reuse vs when to duplicate? The semantics may be subtle here.

Also, this prohibits a translation to JSON, which is strictly trees.

Note that we do allow referencing of some entities via keys (see #22). But this is not at the level of YAML itself, it's at the level of our structural schema. Having two ways of doing this is definitely confusing
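
The tree restriction can be seen with a short sketch using Python's standard `json` module. The in-memory structure below shares one object between two places, much as a YAML anchor and alias would; the field names are made up for the example. Serialising to JSON writes the shared object out twice, and reading it back yields two independent copies.

```python
import json

shared_source = {"id": "doi: 10.1101/mcs.a000661"}

# Both entries point at the same object, like a YAML "&" anchor and "*" alias.
packet = {
    "evidence_a": {"source": shared_source},
    "evidence_b": {"source": shared_source},
}

# JSON has no reference mechanism, so the shared object is duplicated on output.
round_tripped = json.loads(json.dumps(packet))

same_before = packet["evidence_a"]["source"] is packet["evidence_b"]["source"]
same_after = round_tripped["evidence_a"]["source"] is round_tripped["evidence_b"]["source"]
```

The content survives the round trip, but the fact that the two sources were one object does not, which is why key-based referencing at the schema level is the more portable of the two mechanisms.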

GSoC 2016 @ ga4gh.org

@cmungall, I wish to contribute to the project Phenotype exchange standard ( #6 at https://docs.google.com/document/d/15yUku7fdR3x_nH3eziI3VgXzvkUQrcVeN7C4pwqk3nU/edit ) and participate in GSoC for ga4gh.org in same project. I have experience making web-applications and have worked with NodeJS, HTML, CSS, Javascript. I also am familiar with Python and network usage (HTTP, SSH). Please help me get started.
I am sorry for opening a possibly irrelevant issue like this, but the IRC is dead and I have had no replies to my emails.

Suggestion for output format

Hi Tudor,
as we are discussing right now here are some suggestions

Allow users either to upload a Word document or to paste text into a window.
After initial text mining, it is an issue that a paper may describe two or more patients. It would be useful to find a way for users to assign mined HPO terms to individual patients. One simple thing is to allow a user to mark part of the text that pertains to a single patient. Or allow users to enter the name of all patients described in an article and have the GUI present a table like this
```
     * patient ID 1 * patient ID 2 * patient ID 3
HP1  *      x       *              *      x
HP2  *      x       *      x       *      x
HP3  *              *              *      x
```

etc

Each HP is the ID and preferred name of one of the mined terms. It should be possible to delete entire rows if the HPO term was a false positive.
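
The data behind such a table could be as simple as a mapping from each mined term to the set of patients it was assigned to. The sketch below is one hypothetical way to hold and render it; the function names and the tab-separated rendering are assumptions, not part of BioLark.

```python
def build_matrix(assignments):
    """Map each mined term to the set of patient IDs the user assigned it to."""
    matrix = {}
    for term, patient in assignments:
        matrix.setdefault(term, set()).add(patient)
    return matrix


def render(matrix, patients):
    """Render the term-by-patient table as tab-separated text, one row per term."""
    header = "\t".join([""] + patients)
    rows = [
        "\t".join([term] + ["x" if p in matrix[term] else "" for p in patients])
        for term in sorted(matrix)
    ]
    return "\n".join([header] + rows)


patients = ["patient ID 1", "patient ID 2", "patient ID 3"]
assignments = [
    ("HP1", "patient ID 1"), ("HP1", "patient ID 3"),
    ("HP2", "patient ID 1"), ("HP2", "patient ID 2"), ("HP2", "patient ID 3"),
    ("HP3", "patient ID 3"),
]
matrix = build_matrix(assignments)

# Deleting a whole row when a mined term turns out to be a false positive.
matrix.pop("HP3", None)
table = render(matrix, patients)
```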

In BioLark output, include frequency of mined terms

example for VCF files

Could VCF files have a file header that points at a phenopacket DOI?
This could be nice and flexible as you could have other types of things you might want to point at, such as microbiome distribution, metabolomics, etc.

latest VCF format spec: https://samtools.github.io/hts-specs/VCFv4.2.pdf

Document strategy for implementation in APIs

This repo defines one or more exchange formats; it is API-neutral.

However, it would be useful to have a reference API, and to document strategies for incorporating pxf into existing APIs. A compatibility layer between both protobuf and swagger would be useful. It is not clear to what extent that could be autogenerated from the jsonschema (see #31).

Also consider hydra (related to #40 )

negation

negation not handled in schema:

phenotype:
  type:
    not:
      id: HP:0001608
      label: Abnormality of the voice

does not validate

Document mechanism for referencing entities within and across documents

The current proposal typically favors referencing entities by key value over nesting. This has some advantages: the representation of entities can be shifted to a different document and referenced from the ppkt (for example, pedigree info in a PED file, variant info in a VCF file, admin info in hospital records). At the same time, the format currently allows these to be represented directly in the packet, for convenience. Recall also that the format is not just for cases, but also for entities such as variants, genes, genotypes, etc. that may be represented in standard biomedical and bioinformatics databases.

Regardless of the entity type, we have 4 different scenarios.

1) referencing an entity within the same ppk document
2) referencing an external entity in a separate ppk document
3) referencing an external entity in a non-ppk document, e.g. a VCF or PED file; or in some transient database
4) referencing an external entity in a database that mints stable identifiers

Especially for 4, the need for a global unambiguous scheme is paramount. We will use CURIEs here, with a set of default prefixes, and the ability to add more. THIS NEEDS TO BE DOCUMENTED.

For other cases, the requirement to have either pre-registered prefixes or a URL scheme may be onerous. For case 1, it's not strictly required, as identifiers can remain local (so long as this is clearly indicated, and client code makes no assumptions that these are global). One idea is to use something semantically equivalent to the concept of blank nodes (existential variables) in RDF. Currently the variant example uses a blank node, with the RDF convention of '_' as the prefix. This is potentially confusing (@pnrobinson had a question about this). We could use a different convention here. (in this particular example, where we are referencing a variant, we can obviate the requirement by having a convention for universal global URIs for variants).

It may be better to simply enforce urn:uuids here (see https://en.wikipedia.org/wiki/Uniform_Resource_Name#Examples )

2 and 3 may be more difficult. We can simply ban 2. For 3, it is hard because we may not be in control of how external formats handle identifiers. We may need a bipartite scheme - a way to reference a particular document, and a local scheme for entities in that document, that is format specific, with us referencing entities by concatenating this tuple.
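
As a sketch of two of the ideas above, the snippet below mints a `urn:uuid` for a purely local entity and builds a two-part reference for an entity that lives in a non-ppk document. The `#` separator mirrors the `doi: ...#patient1` style used in the examples, but it is an assumption here, as are the function names and the file name.

```python
import uuid


def new_local_id():
    """Mint a urn:uuid identifier for an entity that has no public ID."""
    return uuid.uuid4().urn


def external_ref(document, local_id):
    """Reference an entity in another document as '<document>#<local id>'.

    The '#' separator is an assumed convention, not something the format defines.
    """
    return document + "#" + local_id


variant_id = new_local_id()
ped_member = external_ref("family1.ped", "patient1")
```

A random UUID needs no registry or prefix, which suits case 1; the two-part form leaves the local half entirely to the conventions of the external format, as case 3 requires.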

Add CONTRIBUTING.md, and make contact info more explicit

We welcome any comments via GitHub; this should be made more explicit in the intro. For non-GitHub people, we should have either a list of email contacts or a mailing list.

normalize documentation

Consider reviewing various documentation pages for consistency of tone and coverage. For example, the [identifiers page](https://github.com/phenopackets/phenopacket-format/wiki/Identifiers) refers to a default JSON-LD context, which might or might not make sense for YAML versions.

Similarly, we might want to help folks bridge JSON-LD vs. JSON-Schema....

treatment histories

"For example, it would not record any history of treatment."

Certain treatments may be important for discussions of medication sensitivities, etc.

Flags for phenopacket types (manual, automatic, others?)

Related to #27 (comment)

Algorithms that aggregate phenopackets with the aim of determining causal relationships need the ability to distinguish phenopackets that are the result of automatic entity recognition (eg. from journal articles) from those that are the result of manual curation. I'm not sure if we want to get more granular than that (eg. computationally inferred and manually verified). Thoughts? @cmungall @DoctorBud @tudorgroza

make data available outside paywall

We need to recommend that whatever minimum phenotype standard we implement be referenceable within the JATS standard for journals, so that it can be made available outside the paywall.

http://jats.nlm.nih.gov/