mapping-commons / disease-mappings

Repo to host disease ontology mappings

License: Creative Commons Zero v1.0 Universal

Makefile 31.93% Python 68.07%

disease-mappings's Introduction

Disease Mapping Commons

We collect publicly available mappings between disease ontologies and convert them to a common format (SSSOM).

As of 10.03.2022, this mapping commons is still under construction - in pre-pre-alpha state.

Editorial team

Nicolas Matentzoglu (@matentzn)
Harshad Hegde (@hrshdhgd)
Chris Mungall (@cmungall)
Nicole Vasilevsky (@nicolevasilevsky)
Sabrina Toro (@sabrinatoro)

For now, use the issue tracker if you have any questions.

disease-mappings's Issues

Change default license

While we are still working on propagating license statements, can we use

license: https://github.com/mapping-commons/mapping-commons.github.io/blob/main/docs/original_license_applies.md

as a default license for all mapping sets in dmc (disease mapping commons) that do not have one already?


Prefix maps do not exist

So far, generating mondo.sssom.tsv produces the following warnings:

WARNING:root:ICD9CM is used in the data frame but does not exist in prefix map
WARNING:root:Orphanet is used in the data frame but does not exist in prefix map
WARNING:root:Wikidata is used in the data frame but does not exist in prefix map
WARNING:root:SCITD is used in the data frame but does not exist in prefix map
WARNING:root:DERMO is used in the data frame but does not exist in prefix map
WARNING:root:PO_GIT is used in the data frame but does not exist in prefix map
WARNING:root:MTH is used in the data frame but does not exist in prefix map
WARNING:root:MEDGEN is used in the data frame but does not exist in prefix map
WARNING:root:KUPO is used in the data frame but does not exist in prefix map
WARNING:root:Reactome is used in the data frame but does not exist in prefix map
WARNING:root:HGNC is used in the data frame but does not exist in prefix map
WARNING:root:CSP is used in the data frame but does not exist in prefix map
WARNING:root:GC_ID is used in the data frame but does not exist in prefix map
WARNING:root:SUBSET_SIREN is used in the data frame but does not exist in prefix map
WARNING:root:url is used in the data frame but does not exist in prefix map
WARNING:root:LOINC is used in the data frame but does not exist in prefix map
WARNING:root:NDFRT is used in the data frame but does not exist in prefix map
WARNING:root:IMDRF is used in the data frame but does not exist in prefix map
WARNING:root:ICDO is used in the data frame but does not exist in prefix map
WARNING:root:OMOP is used in the data frame but does not exist in prefix map
WARNING:root:MeSH is used in the data frame but does not exist in prefix map
WARNING:root:ICD9 is used in the data frame but does not exist in prefix map
WARNING:root:GARD is used in the data frame but does not exist in prefix map
WARNING:root:COHD is used in the data frame but does not exist in prefix map
WARNING:root:Fyler is used in the data frame but does not exist in prefix map
WARNING:root:ICD11 is used in the data frame but does not exist in prefix map
WARNING:root:MEDDRA is used in the data frame but does not exist in prefix map
WARNING:root:Wikipedia is used in the data frame but does not exist in prefix map
WARNING:root:GTR is used in the data frame but does not exist in prefix map
WARNING:root:CALOHA is used in the data frame but does not exist in prefix map
WARNING:root:ONCOTREE is used in the data frame but does not exist in prefix map
WARNING:root:NIFSTD is used in the data frame but does not exist in prefix map
WARNING:root:EPCC is used in the data frame but does not exist in prefix map

Is there an automated way of deducing prefix maps, @matentzn?
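As a starting point, here is a minimal sketch that collects the prefixes needing an entry. It does not deduce the URI expansions themselves. The column names follow SSSOM (subject_id, predicate_id, object_id); the function name and the sample rows are made up for illustration, and a plain URL in one of these columns would show up as a prefix such as "http".

```python
# Hypothetical helper: list CURIE prefixes used in the data but absent
# from the prefix map. It only reports what is missing.
CURIE_COLUMNS = ("subject_id", "predicate_id", "object_id")


def missing_prefixes(rows, prefix_map, columns=CURIE_COLUMNS):
    """Return the sorted prefixes used in `columns` that `prefix_map` lacks."""
    used = set()
    for row in rows:
        for column in columns:
            value = row.get(column) or ""
            if ":" in value:
                # Everything before the first colon is treated as the prefix.
                used.add(value.split(":", 1)[0])
    return sorted(used - set(prefix_map))


# Illustrative rows only; the identifiers are invented.
rows = [
    {"subject_id": "MONDO:0000001", "predicate_id": "skos:exactMatch",
     "object_id": "Orphanet:000"},
    {"subject_id": "MONDO:0000002", "predicate_id": "skos:exactMatch",
     "object_id": "ICD9CM:000"},
]
known = {
    "MONDO": "http://purl.obolibrary.org/obo/MONDO_",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}
print(missing_prefixes(rows, known))
```

The resulting list could then be looked up in a shared prefix registry or filled in by hand.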

Document ICD10 mapping sources

@sabrinatoro

Provide a list of URLs that you think hold good quality ICD10CM mappings, e.g.:

https://www.nlm.nih.gov/research/umls/mapping_projects/snomedct_to_icd10cm.html

Include MedDRA mappings from Open Targets

See here: monarch-initiative/mondo#3122

Action items:

Include the Open Targets mappings in mapping commons as is (with metadata, attribution and all).
Provide a few lines of documentation on how to update the mapping with a pull request (docs/update-source-mappings.md).

These will then naturally be taken into account by a future "boomer" run (#9). Once the boomer pipeline is implemented, reconciled mappings will go straight back into mondo.owl.

Rules for disease mappings

We should prefer SKOS vocabulary over anything else (e.g. skos:exactMatch over owl:equivalentClass).
Every mapping_set must have a resolvable mapping_set_id.
mapping_set_ids defined in this mapping commons must adhere to this pattern:

http://w3id.org/sssom/commons/disease/[a-z][a-z0-9-]+.sssom.tsv

i.e. no underscores, upper-case letters, or non-ASCII characters in the ID part.
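A minimal sketch of how the ID pattern could be checked in Python. The function name is hypothetical, and escaping the dots (so that "." matches only a literal dot) is an assumption about the intent of the pattern above.

```python
import re

# The pattern from the rule above, with dots escaped (an assumption)
# so that they only match a literal ".".
MAPPING_SET_ID = re.compile(
    r"http://w3id\.org/sssom/commons/disease/[a-z][a-z0-9-]+\.sssom\.tsv"
)


def is_valid_mapping_set_id(mapping_set_id: str) -> bool:
    """True if the whole string matches the disease commons ID pattern."""
    return MAPPING_SET_ID.fullmatch(mapping_set_id) is not None


base = "http://w3id.org/sssom/commons/disease/"
print(is_valid_mapping_set_id(base + "mondo-ncit-exact.sssom.tsv"))  # hyphens are allowed
print(is_valid_mapping_set_id(base + "mondo_ncit_exact.sssom.tsv"))  # underscores are not
```

Note that the pattern requires at least two characters in the ID part, the first of which must be a lower-case letter.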

To be continued

Ingest Mondo mappings

Use the Makefile to sync all Mondo mappings from monarch-initiative/mondo#3058 to mapping commons (literally wget them).

No need to try and use sssom-py mapping extraction on Mondo!

Basic progress monitor for mappings

We need a way to understand how far we are along the mapping process and how far we still need to go.

Functional requirements:

We have statistics and tables about:
- Number of ICD terms (100%)
- ICD terms that are
  - not mapped
    - unmapped-in-scope: but in scope (it is a disease)
    - unmapped-excluded: but excluded (because out of scope)
  - mapped.
- unmapped-in-scope, unmapped-excluded and mapped add up to 100%.

Inputs to analysis:

Exclusion table (monarch-initiative/mondo#4570): sync from here.
Mappings (#4, @hrshdhgd will share a link to a mondo-icd sssom mapping when it's there)
ICD10 (owl file).

Output:

A table of all ICD codes with a column "category" whose value is one of unmapped-excluded, mapped or unmapped-in-scope
Summary statistics across

### Example Table:

code	category
ICD10CM:ABC	unmapped-in-scope
ICD10CM:A12	unmapped-excluded

Non-functional requirements:

We need to be able to do the same for any incoming ontology: the code should be agnostic to which ontology, mapping set, etc. it takes as input.
Should be delivered as a Makefile goal here:

analysis/icd10cm-mapping-progress.tsv: mirror/icd10cm.owl exclusions/icd10.tsv mappings/icd10cm.sssom.tsv
     python........
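A minimal sketch of the categorisation step behind such a goal, assuming the full list of codes, the mapped codes and the excluded codes have already been read into plain Python collections. The function names are hypothetical, and letting "mapped" take precedence over "excluded" when a code is in both is an assumption, not something stated in this issue. Nothing here is specific to ICD10CM.

```python
from collections import Counter


def categorise(all_codes, mapped_codes, excluded_codes):
    """Assign each code exactly one category, so the categories add up to 100%."""
    mapped = set(mapped_codes)
    excluded = set(excluded_codes)
    table = {}
    for code in all_codes:
        if code in mapped:  # assumption: mapped wins over excluded
            table[code] = "mapped"
        elif code in excluded:
            table[code] = "unmapped-excluded"
        else:
            table[code] = "unmapped-in-scope"
    return table


def summarise(table):
    """Return {category: (count, percentage of all codes)}."""
    counts = Counter(table.values())
    total = len(table)
    return {cat: (n, 100.0 * n / total) for cat, n in counts.items()} if total else {}


# Placeholder codes in the style of the example table above.
codes = ["ICD10CM:ABC", "ICD10CM:A12", "ICD10CM:XYZ", "ICD10CM:Q99"]
table = categorise(codes, mapped_codes={"ICD10CM:XYZ"}, excluded_codes={"ICD10CM:A12"})
for code, category in table.items():
    print(f"{code}\t{category}")
print(summarise(table))
```

Because every code lands in exactly one branch, the three percentages always sum to 100.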

Build ingest pipeline for SNOMED-ICD10 mappings

https://www.nlm.nih.gov/research/umls/mapping_projects/snomedct_to_icd10cm.html

This should be realised as an sssom-py extension, analogous to:

https://github.com/mapping-commons/sssom-py/blob/52ad5ae07ff1ee1dc14bfa84ba7cf02c92db7640/sssom/parsers.py#L415

If this turns out to be cumbersome, a standalone Python script here in mapping commons will do as well!

Boomer pipeline prototype implementation for Mondo ICD10CM mapping

Implement a goal in the Makefile that takes a set of mappings as input, then:

generates ptables
runs boomer
outputs a message that indicates "the next step" (i.e. which clique to review)

Integrate other mapping sources (drop links)

https://phewascatalog.org/#

Ingest DO, ORDO, OMIM, NCIT mappings

Have separate goals for obtaining the sources, which may eventually be extended to contain preprocessing (e.g. mirror/do.owl:).
Using sssom-py, extract all mappings from DO, ORDO, OMIM and NCIT.
Manually review a dozen or so mappings in the source ontology to check whether they contain extra curation information (ORDO ntbt, for example).

Ingest OMOP2OBO mappings

Let's focus on MONDO/ICD10 related ones for now.

cc @callahantiff

UMLS - HPO Mappings

Get UMLS key and download UMLS
Read this: https://www.ncbi.nlm.nih.gov/books/NBK9685/ and make sure all we need is in MRCONSO, and not MRMAP.RRF and MRSMAP.RRF
Obtain MRCONSO.RRF (schema)
Make sure this file covers all "UMLS-HPO" mappings. What is the % of all HPO terms in that file?
Filter MRCONSO.RRF to only the HP-relevant rows (grep).
Add a UMLS parser to sssom-py to read MRCONSO.RRF and convert it into SSSOM. The parser should have a switch to distinguish between MRCONSO and MRMAP files (both are needed: MRMAP has ID-to-ID mappings from sources, and MRCONSO the UMLS-to-source "mappings").
Make a PR here with a pipeline using sssom-py to extract hp-umls.sssom.tsv from MRCONSO.RRF.
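The filtering and coverage steps above can be sketched as follows. This is not the sssom-py parser, only a standalone illustration. It assumes that MRCONSO.RRF is pipe-delimited with CUI, SAB and CODE at 0-based positions 0, 11 and 13, that HPO rows carry the source abbreviation "HPO", and that CODE holds the HP identifier; all of this should be verified against the schema linked above. The "UMLS:" prefix and the helper names are hypothetical.

```python
# Assumed 0-based column positions in MRCONSO.RRF (check against the schema).
CUI, SAB, CODE = 0, 11, 13


def hpo_pairs(lines, source="HPO"):
    """Collect unique (UMLS CUI, source code) pairs for rows from `source`."""
    pairs = set()
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) > CODE and fields[SAB] == source:
            pairs.add((f"UMLS:{fields[CUI]}", fields[CODE]))
    return sorted(pairs)


def coverage(pairs, all_hpo_ids):
    """Percentage of `all_hpo_ids` that occur in at least one pair."""
    all_ids = set(all_hpo_ids)
    mapped = {code for _, code in pairs}
    return 100.0 * len(mapped & all_ids) / len(all_ids) if all_ids else 0.0


def make_row(cui, sab, code):
    """Demo-only helper: build an 18-field row with a trailing pipe."""
    fields = [""] * 18
    fields[CUI], fields[SAB], fields[CODE] = cui, sab, code
    return "|".join(fields) + "|"


# Invented rows for illustration; real input would be the lines of MRCONSO.RRF.
sample = [
    make_row("C0000001", "HPO", "HP:0000001"),
    make_row("C0000001", "MSH", "D000001"),
    make_row("C0000002", "HPO", "HP:0000002"),
]
pairs = hpo_pairs(sample)
print(pairs)
print(coverage(pairs, ["HP:0000001", "HP:0000002", "HP:0000003", "HP:0000004"]))
```

On the real file, the same function could be fed `open("MRCONSO.RRF", encoding="utf-8")` directly, since it reads line by line.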

Add mapping sets to disease mapping commons

From Mondo:

mondo_ncit_exact
mondo_do_exact
mondo_orphanet_exact
mondo_omim_exact
mondo_omimps_exact

Capturing confidence levels from the side of the registry

We should start thinking about capturing confidence in mappings from the side of the registry.

My suggestion is to have a separate element on the registry for the referenced mapping sets:

mapping_set_id: x:y
registry_confidence: 0.5

which captures how much we "trust" a mapping set. We can also specify a default_registry_confidence directly in the registry metadata, which captures the confidence for all registered mapping sets that do not have a registry_confidence value. I would suggest setting it to 0.75 or something similar.
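A minimal sketch of how a consumer could resolve the confidence per mapping set under this proposal. The registry is shown as a plain dict; the key mapping_set_references and the function name are assumptions for illustration, and 0.75 is only the fallback suggested above.

```python
def resolve_confidence(registry, fallback=0.75):
    """Map each mapping_set_id to its registry_confidence or the registry default."""
    default = registry.get("default_registry_confidence", fallback)
    return {
        ref["mapping_set_id"]: ref.get("registry_confidence", default)
        for ref in registry.get("mapping_set_references", [])  # assumed key name
    }


# Placeholder IDs in the style of the example above.
registry = {
    "default_registry_confidence": 0.75,
    "mapping_set_references": [
        {"mapping_set_id": "x:y", "registry_confidence": 0.5},
        {"mapping_set_id": "x:z"},  # no explicit value, so the default applies
    ],
}
print(resolve_confidence(registry))
```

Keeping the default at the registry level means individual entries only need a value when they deviate from it.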

Review general mapping rules for diseases and phenotypes

The idea is to figure out a clear recipe with which we can determine a match between two phenotypes and two diseases.

@sabrinatoro Can you help me with that? I would like to capture all the possible mapping rules that can lead to a mapping. This does not include your fine-grained work on distinguishing when to do "exact" vs "narrow" that you captured in your ICD10 work - just the general "thought processes" that can be applied to determine whether a mapping (exact or otherwise) holds.

Mapping diseases

When matching diseases, potentially across species, the following matching disease rules (MDR) can be applied:

MDR1: two diseases (across species) share phenotypic presentation
MDR2: two diseases (across species) share known genetic underpinnings
MDR3: two diseases share phenotypic presentation and genetic underpinnings
MDR4: two diseases share the same label
MDR5: two diseases share very similar textual descriptions that, from a curator's perspective, appear to be describing analogous concepts
MDR6: two diseases appear to be the same concept based on domain knowledge of the curator

Mapping phenotypes

MPR1: two phenotypes are associated with the exact same set of diseases
MPR2: two phenotypes inhere in homologous structures and exhibit the same quality (e.g. increased thickness)
MPR3: two phenotypes share very similar descriptions that, from a curator's perspective, appear to be describing analogous concepts
MPR4: two phenotypes are caused by the same set of (orthologous) genes

Slurp MedGen HPO mappings

See https://www.ncbi.nlm.nih.gov/medgen/docs/faq/
