phenopacket-format's Introduction

PhenoPackets

CAUTION: THIS REPO HAS BEEN RETIRED!

This initial implementation has now been archived - please refer to the phenopacket-schema repository for the current implementation.

Overview

PhenoPackets is an open standard for representing and sharing detailed descriptions of phenotypic abnormalities and characteristics of individual patients, organisms, diseases, and publications. This repository serves as the primary documentation about the PhenoPacket Exchange Format (PXF), including the JSON and YAML representations. Other repositories (see Implementations below) contain Java, JavaScript, Python and other language-specific tools and implementations.

Motivation

The health of an individual organism results from a complex interplay between its genes and environment. Although great strides have been made in standardizing the representation of genetic information for exchange, there are no comparable standards to represent phenotypes (e.g. patient symptoms and disease features) and environmental factors. Phenotypic abnormalities of individual organisms are currently described in diverse places and in diverse formats: publications, databases, health records, registries, clinical trials, and even social media. However, the lack of standardization, accessibility, and computability among these contexts makes it extremely difficult to effectively extract and utilize these data, hindering the understanding of genetic and environmental contributions to disease.

Documentation

See the Phenopackets.org site for the public-facing project documentation.

Or, see the detailed Markdown-based documentation via GitHub.

The Wiki has additional documentation, although it may be out-of-date.

Implementations

Contributing

The PhenoPackets standard is still evolving, and there are many opportunities to help, including improving the expressivity of the format and providing implementations that enable.

The Issue Tracker is a good start.

phenopacket-format's People

Contributors

balhoff, cmungall, doctorbud, harryhoch, jmcmurry, julesjacobsen

phenopacket-format's Issues

Further refine evidence references

Currently, we have evidence on the phenotype profile:
phenotype_profile:
  - entity: "doi: 10.1101/mcs.a000661#patient1"
    evidence:
      type: TAS # Traceable author statement
      source:
        id: "doi: 10.1101/mcs.a000661"
        title: "De novo pathogenic variants in CHAMP1 are associated with global developmental delay, intellectual disability, and dysmorphic facial features"

Does evidence go on any element? e.g. a phenotype profile, a genotype profile, a PED/Family reference? Does it go on individual phenotypes? Or does the whole phenopacket get just one or more evidence assertions?

We should also further decide how/which evidence codes to use and what source information should be described with different evidence codes. @mbrush can you help define a few. E.g. is TAS good here? For OMIM, the example has an IEA, for patient example1, it says "observation".

age

Patient age is a rat's nest. We should allow age if that's all people have, and prefer date of birth if needed.

Investigate versioning of phenopacket instances due to evolution in representation and disease progression

There are two types of versioning we need to consider: representational and temporal.

1) Representational

Evolution due to change in ontology or scientific understanding. (Perhaps even to correct an error.)

2) Temporal

Evolution of sequential observations over time in a given patient/cohort/organism.

In both cases, we need a way to uniquely reference a specific version of a phenopacket instance, while being able to trace its history. This may have implications for the phenopacket registry more broadly. Long-tail repos like Dryad, Zenodo, etc. are great at issuing DOIs but are currently not up to the challenge of exposing versions in a sensible way. I love the way that F1000 displays/handles versioning. We should aim for that, with the temporal considerations somehow woven in.

consider scope.

"The format is intended for rare and undiagnosed disease patients. It is not intended for cancer patients (although presence of cancer may be a feature of the disease). It is not intended for common disease patients."

Cancer is certainly a special case that will create many difficulties (some of which we're working on), but why not common diseases? Can we define the format as something that might reasonably be extended to handle common disease?

If not, we should say why common diseases won't work.

Investigate approaches to patient identification

Should the standard make any effort to standardize the ways that patient identifiers are represented?

Eg. for a paper, is it adequate to say "patient 1"?

This is a bag of worms--vicious, slippery ones. Gahhhh ... I don't want to touch it with a 10-foot pole.
But we should nevertheless at least park it as an issue for (much) later.
We should first think about scenarios where machine-actionable identification of patients is important.

Situations that come to mind are the usual suspects:
Deduplicating results of parallel text-mining / data integration pipelines

We are a long way off from when that is going to be the bottleneck.

investigate possibility for a global IRI scheme for uniquely identifying a variant

There are schemes such as HGVS that unambiguously map a genomic variation to a string. It is suggested we use this in #10. However, in relation to #22, if we want to reference a variant we need to do it via an identifier, and not a string, and we have constraints on the syntax of identifiers to ensure uniqueness, persistence, resolvability.

In many cases we can use a pre-existing database identifier (e.g. if the variant is in clinvar), but for some cases there will be no public ID and we will need to reference via an identifier scheme.

As all identifiers in this format are URIs (although they are typically shortened to CURIEs), there are a few possibilities:

There has always been a debate about coupling or decoupling of identity to resolvability, going back before LSIDs. The format can remain neutral here, but we could potentially push forward in this direction.

Of course, whatever the technical scheme, we need to ensure that the coupled standard (e.g. HGVS) is sufficient and can do things like uniquely reference any build of any chromosome/scaffold in any species. This may require extensions; more research is required.
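One way to picture coupling an identifier to a string standard such as HGVS is to percent-encode the variant string and append it to a base IRI. The sketch below is only an illustration: the base `https://example.org/variant/` and the helpers `variant_iri` and `hgvs_from_iri` are hypothetical and not part of the format, and the HGVS string is the illustrative one used elsewhere in these issues.

```python
from urllib.parse import quote, unquote

# Hypothetical base IRI, for illustration only; the format does not define one.
BASE = "https://example.org/variant/"


def variant_iri(hgvs: str) -> str:
    """Mint an IRI by percent-encoding an HGVS-style string."""
    # safe="" also encodes ':' and '/', so the string becomes one path segment.
    return BASE + quote(hgvs, safe="")


def hgvs_from_iri(iri: str) -> str:
    """Recover the original string from an IRI minted by variant_iri."""
    if not iri.startswith(BASE):
        raise ValueError("not an IRI minted under BASE")
    return unquote(iri[len(BASE):])


iri = variant_iri("NM_123:c.-123C>T")
print(iri)
print(hgvs_from_iri(iri))
```

Because the encoding is reversible, identity stays tied to the variant string, while resolvability depends entirely on whoever operates the base IRI.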

Variant representation

We should discuss how to best represent variants. Probably we need something flexible like

HGVS
NM_123:c.-123C>T

with various types that also work for chromosomes, microdeletions, and other sets of findings that might be protein biomarkers etc, so that this standard can be used with a wide range of diseases and publications.

Add a JSON-LD context file

Although the likely schema-level specification will be JSON-Schema (#31), this will live alongside a JSON-LD context that will specify the complete semantics of the format and will be used to convert between RDF and JSON.

Add version info to JSON Schema, and implement a procedure to ensure in sync with reference implementation

Currently the JSON-Schema is purely derived. The procedure is to run SchemaGeneratorTest in the reference implementation. The Makefile in this repo copies this across. This is potentially confusing, and things can get out of sync.

  • The JSON-Schema should be tagged with version info, and this should be synced with the reference implementation release on mvn central
  • There should be a proper maven assembly target to generate the schema in the target/ directory. Should this be part of the release?
  • The format and reference repos should be better synced.
  • The overall process should be better documented. We have minimal docs here: https://github.com/phenopackets/phenopacket-format/wiki/JSON-Schema

In many ways a merger of the two repos might make some of this easier

Make phenopacket examples for model organism

For fly, maybe @dosumis can help.
a potential fly article:

Differential Masking of Natural Genetic Variation by miR-9a in Drosophila. Justin J. Cassidy, Alexander J. Straughan, Richard W. Carthew. GENETICS February 11, 2016 vol. 202 no. 2 675-687; DOI: 10.1534/genetics.115.183822

I'm thinking of doing one of these for zebrafish:

A Novel Ribosomopathy Caused by Dysfunction of RPL10 Disrupts Neurodevelopment and Causes X-Linked Microcephaly in Humans
Susan S. Brooks, Alissa L. Wall, Christelle Golzio, David W. Reid, Amalia Kondyles, Jason R. Willer, Christina Botti, Christopher V. Nicchitta, Nicholas Katsanis, Erica E. Davis
GENETICS October 14, 2014 vol. 198 no. 2 723-733; DOI: 10.1534/genetics.114.168211

I especially like this one as it has both human and zebrafish.

another option:
snow white, a Zebrafish Model of Hermansky-Pudlak Syndrome Type 5
Christina M. S. Daly, Jason Willer, Ronald Gregg, Jeffrey M. Gross
GENETICS October 2, 2013 vol. 195 no. 2 481-494; DOI: 10.1534/genetics.113.154898

tool + PED

If there is a tool to be developed, it should be possible to enter pedigree information in that tool as well. the output of the tool should then be two files:

  • patient -> phenotypes
  • PED file

This way the user won't have to use another tool for PED-file generation and it is ensured that PATIENT-IDs are consistent.
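A minimal sketch of the two-file output, assuming the common six-column PED layout (family ID, individual ID, father ID, mother ID, sex, affection status). The `Person` class, the `export` helper and the tab-separated phenotype file layout are hypothetical choices made for illustration; the point is that both outputs are generated from one list of patient IDs.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

SEX_CODES = {"male": "1", "female": "2"}  # anything else is written as 0 (unknown)


@dataclass
class Person:
    family_id: str
    patient_id: str
    father_id: str = "0"  # 0 means the parent is not in the pedigree
    mother_id: str = "0"
    sex: str = "unknown"
    affected: Optional[bool] = None  # None means status unknown


def ped_line(p: Person) -> str:
    """Format one person as a tab-separated PED row."""
    status = "0" if p.affected is None else ("2" if p.affected else "1")
    return "\t".join([p.family_id, p.patient_id, p.father_id, p.mother_id,
                      SEX_CODES.get(p.sex, "0"), status])


def export(people: List[Person],
           phenotypes: Dict[str, List[str]]) -> Tuple[str, str]:
    """Return (phenotype file text, PED file text) with consistent IDs."""
    known = {p.patient_id for p in people}
    missing = sorted(set(phenotypes) - known)
    if missing:
        raise ValueError(f"phenotypes given for unknown patients: {missing}")
    pheno_text = "\n".join(f"{pid}\t{','.join(terms)}"
                           for pid, terms in phenotypes.items())
    ped_text = "\n".join(ped_line(p) for p in people)
    return pheno_text, ped_text


people = [Person("FAM1", "patient1", sex="female", affected=True)]
pheno_text, ped_text = export(people, {"patient1": ["HP:0001608"]})
print(pheno_text)
print(ped_text)
```

Rejecting phenotype entries whose patient ID has no pedigree row is what keeps the two files from drifting apart.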

temporal onset

Relative temporal ordering of phenotype onset may be important for diagnosis. I think @mellybelly has some examples where a before b vs. b before a is a vital difference.

Specify allowable age/DOB values

We need to be able to specify age at the time of phenotype capture, as well as DOB, or a range, or none. Many times age or DOB may not be known.

A few examples of each showing allowable values would be helpful
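As a starting point for discussion, here is a rough sketch of how a consumer might cope with the four cases (DOB, stated age, age range, or nothing). The function name `age_at_capture`, its parameters, and the preference order (DOB first, then stated age, then range) are assumptions for illustration, not decisions recorded in this issue.

```python
from datetime import date
from typing import Optional, Tuple, Union

AgeValue = Union[int, Tuple[int, int], None]


def age_at_capture(
    captured_on: date,
    dob: Optional[date] = None,
    age_years: Optional[int] = None,
    age_range: Optional[Tuple[int, int]] = None,
) -> AgeValue:
    """Return the best available age in whole years, a (min, max) range, or None."""
    if dob is not None:
        years = captured_on.year - dob.year
        # Subtract one if the birthday has not yet occurred in the capture year.
        if (captured_on.month, captured_on.day) < (dob.month, dob.day):
            years -= 1
        return years
    if age_years is not None:
        return age_years
    return age_range  # None when nothing is known


print(age_at_capture(date(2016, 2, 11), dob=date(1993, 2, 12)))
print(age_at_capture(date(2016, 2, 11), age_years=23))
print(age_at_capture(date(2016, 2, 11), age_range=(20, 30)))
print(age_at_capture(date(2016, 2, 11)))
```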

Traceable author statements that are referred to in other papers

"Slavotinek et al. [6] have reviewed the phenotypes associated with 2q24-2q31 and 2q31-33 deletions. The 2q31.1 region includes the HOXD cluster, one of four highly evolutionally conserved, homologous gene clusters coding for transcription factors with crucial roles in embryonic development. More specifically, the HOXD cluster has been implicated in limb formation [4, 7]."

This was a statement in passing and not the assertion of the enclosing paper. PMC4498842

However, we probably do want to capture this stuff but how? As a GO annotation?

Use JSON-schema instead of kwalify

kwalify becomes unreadable for complex schemas.

We can adopt a JSON schema standard, so long as we stick to the JSON subset of YAML. ProtoBuf may be a possibility if there is a standard mapping to JSON.

Further afield, SHACL may be worth considering

Modifiers

Consider allowing modifiers such as bilateral and unilateral. These modifiers could be taken from the HPO subontology for clinical modifiers.
For instance, we might want bilateral iris coloboma.

Confusion regarding entity versus admin profile declarations

Why are the DOBs or age not asserted within the "admin profile"?

from @cmungall :
Every assertion about an entity is partitioned into a module, and can have full provenance/audit info attached. By separating this into its own chunk, we have the flexibility of swapping out this piece and referencing a more dedicated format. This is the same principle for representing anything that is not a phenotype. There is a dedicated PED format, but we can capture this in the packet if we need to. Same for variants.

My confusion is more about the fact that we are recording sex and type on the entity declaration, but age on the admin profile. Is the idea that you could have the same person entity in the same phenopacket at different ages? What if the sex changes? What does the sex refer to anyway - chromosomal sex or phenotypic sex? Should potentially use the new PATO classes here?

entities:
  - id: "doi: 10.1101/mcs.a000661#patient1"
    type: human
    biological_sex: female

admin_profile:
  - entity: "doi: 10.1101/mcs.a000661#patient1"
    property: age
    value:
      literal: 23 years
      type: age
    source:
      id: "doi: 10.1101/mcs.a000661"

variant

It would be better to put the variant and the genotype into separate elements

  • type: OPA1
    value: "homozygous c.1601T>G"
  • value: c.1601T>G
  • genotype: homozygous

Also, c.1601T>G is not yet correct HGVS. We also need to demand a transcript for instance. We should, at least in the future for our uber-phenopacket-widget, run something like Mutalyzer to check the mutation nomenclature, this is extremely important for interoperability!
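To make the two points concrete, the sketch below splits a combined value such as "homozygous c.1601T>G" into zygosity and variant, and then applies a rough shape check for a reference sequence prefix. This is not HGVS validation: the regular expression and the helper names are illustrative assumptions, and a real checker such as Mutalyzer would still be needed.

```python
import re

# Rough shape only: reference sequence, colon, coordinate type, description.
# This pattern is an illustrative assumption, not the HGVS grammar.
HGVS_LIKE = re.compile(
    r"^(?P<ref>[A-Z]{2}_\d+(?:\.\d+)?):(?P<kind>[cgmnpr])\.(?P<change>\S+)$"
)
ZYGOSITY = {"homozygous", "heterozygous", "hemizygous"}


def split_genotype(value):
    """Split 'homozygous c.1601T>G' into (zygosity, variant)."""
    first, _, rest = value.partition(" ")
    if first in ZYGOSITY and rest:
        return first, rest
    return None, value  # no recognizable zygosity prefix


def has_reference_sequence(variant):
    """True if the variant string names a reference sequence before the change."""
    return HGVS_LIKE.match(variant) is not None


zygosity, variant = split_genotype("homozygous c.1601T>G")
print(zygosity, variant, has_reference_sequence(variant))
print(has_reference_sequence("NM_123:c.-123C>T"))
```

The first value fails the check because it names no transcript, which is the interoperability problem raised above.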

negation

assertions of the absence of a phenotype might be needed.

provide a top level entry page

We have a few possibilities for sending people to:

  1. the org -- https://github.com/phenopackets/
  2. the main format repo -- https://github.com/phenopackets/phenopacket-format
  3. Getting Started in the wiki (linked from the README in 2)

all are a bit geeky to the non-github familiar

It may be better if we have a splash page (authored in github pages and visible on phenopackets/github.io). Could just be very minimal, with quick links to the wiki, possibly duplicating the getting started page?

Should YAML subset be restricted to what is expressible in JSON (meaning exclude the ability to use references)

YAML allows us to reference the same object (first mention with &, future with *).

While we currently use this in the schema, this may be problematic in the packets. It could be confusing for producers: when to reuse vs when to duplicate? The semantics may be subtle here.

Also, this prohibits a translation to JSON, which is strictly trees.

Note that we do allow referencing of some entities via keys (see #22). But this is not at the level of YAML itself; it's at the level of our structural schema. Having two ways of doing this is definitely confusing.
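The loss of shared references can be seen with the standard library alone. In the sketch below, one Python dict stands in for a YAML node that is anchored once and aliased twice; the packet structure and the identifier `example:source1` are made up for illustration. Serializing to JSON is possible, but the shared node is duplicated and its identity is gone after a round trip.

```python
import json

# One object referenced from two places, as a YAML anchor/alias pair would be.
shared_source = {"id": "example:source1"}  # hypothetical identifier
packet = {
    "first": {"source": shared_source},
    "second": {"source": shared_source},
}

# In memory, both entries point at the very same object.
same_before = packet["first"]["source"] is packet["second"]["source"]

# JSON is a tree, so the shared node is written out twice.
text = json.dumps(packet)
restored = json.loads(text)

# After the round trip the two copies are equal in value but independent.
same_after = restored["first"]["source"] is restored["second"]["source"]
equal_after = restored["first"]["source"] == restored["second"]["source"]

print(same_before, same_after, equal_after)
```

A consumer of the JSON form cannot tell whether two equal nodes were meant to be the same entity, which is why key-based referencing at the schema level is the safer single mechanism.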

GSoC 2016 @ ga4gh.org

@cmungall, I wish to contribute to the project Phenotype exchange standard ( #6 at https://docs.google.com/document/d/15yUku7fdR3x_nH3eziI3VgXzvkUQrcVeN7C4pwqk3nU/edit ) and participate in GSoC for ga4gh.org in same project. I have experience making web-applications and have worked with NodeJS, HTML, CSS, Javascript. I also am familiar with Python and network usage (HTTP, SSH). Please help me get started.
I am sorry for opening a possibly irrelevant issue like this, but the IRC is dead and I have had no replies to my mails.

Suggestion for output format

Hi Tudor,
as we are discussing right now, here are some suggestions:

  1. Allow users to upload either word or to paste in text into a window.

  2. After initial text mining, it is an issue that a paper may describe two or more patients. It would be useful to find a way for users to assign mined HPO terms to individual patients. One simple thing is to allow a user to mark part of the text that pertains to a single patient. Or allow users to enter the name of all patients described in an article and have the GUI present a table like this

            | patient ID 1 | patient ID 2 | patient ID 3
    HP1     | x            |              | x
    HP2     | x            | x            | x
    HP3     |              |              | x

etc

Each HP is the ID and preferred name of one of the mined terms. It should be possible to delete entire rows if the HPO term was a false positive.
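A small sketch of the data behind such a table, assuming each row is stored as a mined term mapped to the set of patients it was assigned to. The placeholder labels HP1 to HP3 and the patient IDs mirror the example table; the helper names are hypothetical.

```python
# Rows of the table: mined term -> patients marked with an "x".
matrix = {
    "HP1": {"patient ID 1", "patient ID 3"},
    "HP2": {"patient ID 1", "patient ID 2", "patient ID 3"},
    "HP3": {"patient ID 3"},
}


def drop_term(matrix, term):
    """Delete an entire row, e.g. when a mined term was a false positive."""
    return {t: patients for t, patients in matrix.items() if t != term}


def terms_by_patient(matrix):
    """Pivot the table into one sorted term list per patient."""
    result = {}
    for term, patients in matrix.items():
        for patient in patients:
            result.setdefault(patient, []).append(term)
    return {patient: sorted(terms) for patient, terms in sorted(result.items())}


print(terms_by_patient(matrix))
print(terms_by_patient(drop_term(matrix, "HP3")))
```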

Document strategy for implementation in APIs

This repo defines one or more exchange formats; it is API-neutral.

However, it would be useful to have a reference API, and to document strategies for incorporating pxf into existing APIs. A compatibility layer between both protobuf and swagger would be useful. Not clear the extent that could be autogenerated from the jsonschema (see #31).

Also consider hydra (related to #40 )

negation

negation not handled in schema:

phenotype:
  type:
    not:
      id: HP:0001608
      label: Abnormality of the voice

does not validate

Document mechanism for referencing entities within and across documents

The current proposal typically favors referencing by key value over nesting, for entities. This has some advantages: representation of entities can be shifted to a different document and referenced from the ppkt (for example, pedigree info in a PED file, variant info in a VCF file, admin info in hospital records). At the same time, the format currently allows these to be represented directly in the packet, for convenience. Recall also that the format is not just for cases, but also for entities such as variants, genes, genotypes, etc. that may be represented in standard biomedical and bioinformatics databases.

Regardless of the entity type, we have 4 different scenarios.

  1. referencing an entity within the same ppk document
  2. referencing an external entity in a separate ppk document
  3. referencing an external entity in a non-ppk document, e.g. a VCF or PED file; or in some transient database
  4. referencing an external entity in a database that mints stable identifiers

Especially for 4, the need for a global unambiguous scheme is paramount. We will use CURIEs here, with a set of default prefixes and the ability to add more. THIS NEEDS TO BE DOCUMENTED.
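Pending that documentation, CURIE expansion with default and user-supplied prefixes can be sketched as follows. The `HP` expansion is the OBO PURL form for HPO terms; the `expand_curie` helper, the prefix-map layout and the `EX` prefix are assumptions for illustration, not the format's actual defaults.

```python
# Default prefix map; only HP is shown, using its OBO PURL expansion.
DEFAULT_PREFIXES = {
    "HP": "http://purl.obolibrary.org/obo/HP_",
}


def expand_curie(curie, extra_prefixes=None):
    """Expand PREFIX:local into a full IRI using default plus extra prefixes."""
    prefixes = dict(DEFAULT_PREFIXES)
    prefixes.update(extra_prefixes or {})  # packet-level additions win
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise KeyError(f"no prefix registered for {curie!r}")
    return prefixes[prefix] + local


print(expand_curie("HP:0001608"))
# "EX" is a hypothetical prefix added by a packet producer.
print(expand_curie("EX:patient1", {"EX": "https://example.org/id/"}))
```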

For other cases, the requirement to have either pre-registered prefixes or a URL scheme may be onerous. For case 1, it's not strictly required, as identifiers can remain local (so long as this is clearly indicated, and client code makes no assumptions that these are global). One idea is to use something semantically equivalent to the concept of blank nodes (existential variables) in RDF. Currently the variant example uses a blank node, with the RDF convention of '_' as the prefix. This is potentially confusing (@pnrobinson had a question about this). We could use a different convention here. (in this particular example, where we are referencing a variant, we can obviate the requirement by having a convention for universal global URIs for variants).

It may be better to simply enforce urn:uuids here (see https://en.wikipedia.org/wiki/Uniform_Resource_Name#Examples )
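If urn:uuid identifiers were adopted for local entities, minting and checking them needs nothing beyond Python's standard `uuid` module. The helper names below are hypothetical; the sketch only shows that such identifiers are cheap to generate and easy to tell apart from blank-node style identifiers.

```python
import uuid


def mint_local_id() -> str:
    """Mint a fresh identifier of the form urn:uuid:<random UUID>."""
    return uuid.uuid4().urn


def is_urn_uuid(identifier: str) -> bool:
    """True if the identifier is a syntactically valid urn:uuid."""
    if not identifier.startswith("urn:uuid:"):
        return False
    try:
        uuid.UUID(identifier)  # the constructor accepts the urn:uuid: form
    except ValueError:
        return False
    return True


new_id = mint_local_id()
print(new_id, is_urn_uuid(new_id))
print(is_urn_uuid("_:variant1"))
```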

2 and 3 may be more difficult. We can simply ban 2. For 3, it is hard because we may not be in control of how external formats handle identifiers. We may need a bipartite scheme - a way to reference a particular document, and a local scheme for entities in that document, that is format specific, with us referencing entities by concatenating this tuple.

treatment histories

"For example, it would not record any history of treatment."

Certain treatments may be important for discussions of medication sensitivities, etc.

Flags for phenopacket types (manual, automatic, others?)

Related to #27 (comment)

Algorithms that aggregate phenopackets with the aim of determining causal relationships need the ability to distinguish phenopackets that are the result of automatic entity recognition (eg. from journal articles) from those that are the result of manual curation. I'm not sure if we want to get more granular than that (eg. computationally inferred and manually verified). Thoughts? @cmungall @DoctorBud @tudorgroza
