disease-mappings's People

Contributors

hrshdhgd, matentzn

disease-mappings's Issues

Review general mapping rules for diseases and phenotypes

The idea is to figure out a clear recipe for determining a match between two phenotypes or between two diseases.

@sabrinatoro Can you help me with that? I would like to capture all the possible mapping rules that can lead to a mapping. This does not include your fine-grained work on distinguishing when to do "exact" vs "narrow" that you captured in your ICD10 work - just the general "thought processes" that can be applied to determine whether a mapping (exact or otherwise) holds.

Mapping diseases

When matching diseases, potentially across species, the following matching disease rules (MDR) can be applied:

  • MDR1: two diseases (across species) share phenotypic presentation
  • MDR2: two diseases (across species) share known genetic underpinnings
  • MDR3: two diseases share phenotypic presentation and genetic underpinnings
  • MDR4: two diseases share the same label
  • MDR5: two diseases share very similar textual descriptions that, from a curator's perspective, appear to be describing analogous concepts
  • MDR6: two diseases appear to be the same concept based on domain knowledge of the curator

Mapping phenotypes

  • MPR1: two phenotypes are associated with the exact same set of diseases
  • MPR2: two phenotypes inhere in homologous structures and exhibit the same quality (e.g. increased thickness)
  • MPR3: two phenotypes share very similar descriptions that, from a curator's perspective, appear to be describing analogous concepts
  • MPR4: two phenotypes are caused by the same set of (orthologous) genes

Rules for disease mappings

  1. We should prefer the SKOS vocabulary over anything else (skos:exactMatch over owl:equivalentClass)
  2. Every mapping_set must have a resolvable mapping_set_id
  3. mapping_set_ids defined in this mapping commons must adhere to this pattern for their ID:
http://w3id.org/sssom/commons/disease/[a-z][a-z0-9-]+.sssom.tsv

i.e. no underscores, upper-case letters, or non-ASCII characters in the ID part.
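
A minimal sketch of how rule 3 could be checked automatically. The function name `is_valid_mapping_set_id` is hypothetical, and the regular expression escapes the dots, which the pattern as written above leaves unescaped; treat that as an assumption about the intent.

```python
import re

# Assumption: the dots in the pattern are meant literally, so they are
# escaped here. Unescaped, "." would match any character.
MAPPING_SET_ID = re.compile(
    r"http://w3id\.org/sssom/commons/disease/[a-z][a-z0-9-]+\.sssom\.tsv"
)


def is_valid_mapping_set_id(mapping_set_id: str) -> bool:
    """Return True if the whole string matches the commons ID pattern."""
    return MAPPING_SET_ID.fullmatch(mapping_set_id) is not None
```

Note that `[a-z][a-z0-9-]+` requires the ID part to be at least two characters long, so a one-letter ID would be rejected.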

To be continued

Basic progress monitor for mappings

We need a way to understand how far we are along the mapping process and how far we still need to go.

Functional requirements:

  • We have statistics and tables about:
    • Number of ICD terms (100%)
    • ICD terms that are
      • not mapped
        • unmapped-in-scope: but in scope (it is a disease)
        • unmapped-excluded: but excluded (because out of scope)
      • mapped.
    • unmapped-in-scope, unmapped-excluded and mapped add up to 100%.

Inputs to analysis:

Output:

  • a table of all ICD codes with a column category whose value is one of unmapped-excluded, mapped or unmapped-in-scope
  • Summary statistics across

Example table:

code          category
ICD10CM:ABC   unmapped-in-scope
ICD10CM:A12   unmapped-excluded

Non-functional requirements:

  • We need to be able to do the same for any incoming ontology: the code should be agnostic to which ontology, mapping set, etc. it takes as input
  • Should be delivered as a Makefile goal here
analysis/icd10cm-mapping-progress.tsv: mirror/icd10cm.owl exclusions/icd10.tsv mappings/icd10cm.sssom.tsv
     python........
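
The categorisation described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `categorize` and `summarize` are hypothetical, the inputs are assumed to have already been reduced to plain sets of CURIEs, and a code that is both mapped and excluded is counted as mapped (the issue does not say which should win).

```python
from collections import Counter


def categorize(all_codes, mapped, excluded):
    """Assign every code exactly one of the three categories."""
    table = {}
    for code in sorted(all_codes):
        if code in mapped:
            # Assumption: "mapped" takes precedence over an exclusion.
            table[code] = "mapped"
        elif code in excluded:
            table[code] = "unmapped-excluded"
        else:
            table[code] = "unmapped-in-scope"
    return table


def summarize(table):
    """Return the percentage of codes in each category."""
    total = len(table)
    if total == 0:
        return {}
    counts = Counter(table.values())
    return {category: 100.0 * n / total for category, n in counts.items()}
```

Because each code receives exactly one category, the percentages add up to 100% by construction, and nothing in the sketch is specific to ICD.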

Ingest DO, ORDO, OMIM, NCIT mappings

  • Have separate goals for obtaining the sources, which may eventually be extended to contain preprocessing (e.g. mirror/do.owl:)
  • Using sssom-py, extract all mappings from DO, ORDO, OMIM and NCIT.
  • Manually review a dozen or so mappings in each source ontology to check whether they carry extra curation information (ORDO ntbt, for example)

UMLS - HPO Mappings

  1. Get UMLS key and download UMLS
  2. Read this: https://www.ncbi.nlm.nih.gov/books/NBK9685/ and make sure everything we need is in MRCONSO.RRF, and not in MRMAP.RRF or MRSMAP.RRF
  3. Obtain MRCONSO.RRF (schema)
  4. Make sure this file covers all "UMLS-HPO" mappings. What is the % of all HPO terms in that file?
  5. Filter MRCONSO.RRF to only the HP-relevant rows (grep)
  6. Add a UMLS parser to sssom-py that reads MRCONSO.RRF and converts it into SSSOM. The parser should have a switch to distinguish between MRCONSO and MRMAP files (both are needed: MRMAP has ID-to-ID mappings from sources, and MRCONSO has the UMLS-to-source "mappings")
  7. Make a PR here with a pipeline using sssom-py to extract hp-umls.sssom.tsv from MRCONSO.RRF.
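
Steps 4 and 5 could be prototyped in plain Python before any parser lands in sssom-py. The sketch below is not the sssom-py API; `hpo_mappings` and `coverage` are hypothetical helpers. It assumes the standard pipe-delimited MRCONSO.RRF column order, that HPO rows carry the source abbreviation `HPO` in the SAB column, and that the CODE column holds the HP identifier; all three should be verified against the downloaded release.

```python
# Column order of MRCONSO.RRF as documented in the UMLS reference manual.
MRCONSO_FIELDS = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]


def hpo_mappings(lines, sab="HPO"):
    """Collect unique (CUI, source code) pairs for rows from one source."""
    pairs = set()
    for line in lines:
        # Each row ends with a trailing "|"; zip() drops the extra empty field.
        row = dict(zip(MRCONSO_FIELDS, line.rstrip("\n").split("|")))
        if row.get("SAB") == sab:
            pairs.add((row["CUI"], row["CODE"]))
    return sorted(pairs)


def coverage(pairs, all_hpo_ids):
    """Percentage of the given HPO IDs that appear in at least one pair."""
    all_ids = set(all_hpo_ids)
    if not all_ids:
        return 0.0
    covered = {code for _, code in pairs} & all_ids
    return 100.0 * len(covered) / len(all_ids)
```

Passing an open file handle for MRCONSO.RRF as `lines` streams the file row by row, which matters because the full file is large.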

Capturing confidence levels from the side of the registry

We should start thinking about capturing confidence in mappings from the side of the registry.

My suggestion is to have a separate element on the registry for each of the referenced mapping sets:

mapping_set_id: x:y
registry_confidence: 0.5

which captures how much we "trust" a mapping set. We can also specify a default_registry_confidence directly in the registry metadata, which captures the confidence for all registered mapping sets that do not have a registry_confidence value. I would suggest setting it to 0.75 or something similar.
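
To illustrate the proposed fallback, here is a small sketch that resolves the confidence for each registered mapping set. The field names mapping_set_id, registry_confidence and default_registry_confidence come from the proposal above; the list key `mapping_set_references` and the function name `effective_confidence` are assumptions made for this example only.

```python
def effective_confidence(registry: dict) -> dict:
    """Map each mapping_set_id to its own or the default confidence."""
    default = registry.get("default_registry_confidence")
    resolved = {}
    # Assumption: the registry lists its mapping sets under this key.
    for reference in registry.get("mapping_set_references", []):
        resolved[reference["mapping_set_id"]] = reference.get(
            "registry_confidence", default
        )
    return resolved
```

A mapping set without its own value then inherits the registry-wide default, and the result is `None` for it if no default is set either.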

Include Meddra mappings from Open Targets

See here: monarch-initiative/mondo#3122

Action items:

  • include open targets mappings in mapping commons as is (with metadata and attribution and all)
  • Provide a few lines of documentation on how to update the mapping with a pull request (docs/update-source-mappings.md).

These will then naturally be taken into account by a future "boomer" run (#9). Once the boomer pipeline is implemented, reconciled mappings will go straight back into mondo.owl.

Prefix maps do not exist

So far I have collected this list of warnings while generating mondo.sssom.tsv:

WARNING:root:ICD9CM is used in the data frame but does not exist in prefix map
WARNING:root:Orphanet is used in the data frame but does not exist in prefix map
WARNING:root:Wikidata is used in the data frame but does not exist in prefix map
WARNING:root:SCITD is used in the data frame but does not exist in prefix map
WARNING:root:DERMO is used in the data frame but does not exist in prefix map
WARNING:root:PO_GIT is used in the data frame but does not exist in prefix map
WARNING:root:MTH is used in the data frame but does not exist in prefix map
WARNING:root:MEDGEN is used in the data frame but does not exist in prefix map
WARNING:root:KUPO is used in the data frame but does not exist in prefix map
WARNING:root:Reactome is used in the data frame but does not exist in prefix map
WARNING:root:HGNC is used in the data frame but does not exist in prefix map
WARNING:root:CSP is used in the data frame but does not exist in prefix map
WARNING:root:GC_ID is used in the data frame but does not exist in prefix map
WARNING:root:SUBSET_SIREN is used in the data frame but does not exist in prefix map
WARNING:root:url is used in the data frame but does not exist in prefix map
WARNING:root:LOINC is used in the data frame but does not exist in prefix map
WARNING:root:NDFRT is used in the data frame but does not exist in prefix map
WARNING:root:IMDRF is used in the data frame but does not exist in prefix map
WARNING:root:ICDO is used in the data frame but does not exist in prefix map
WARNING:root:OMOP is used in the data frame but does not exist in prefix map
WARNING:root:MeSH is used in the data frame but does not exist in prefix map
WARNING:root:ICD9 is used in the data frame but does not exist in prefix map
WARNING:root:GARD is used in the data frame but does not exist in prefix map
WARNING:root:COHD is used in the data frame but does not exist in prefix map
WARNING:root:Fyler is used in the data frame but does not exist in prefix map
WARNING:root:ICD11 is used in the data frame but does not exist in prefix map
WARNING:root:MEDDRA is used in the data frame but does not exist in prefix map
WARNING:root:Wikipedia is used in the data frame but does not exist in prefix map
WARNING:root:GTR is used in the data frame but does not exist in prefix map
WARNING:root:CALOHA is used in the data frame but does not exist in prefix map
WARNING:root:ONCOTREE is used in the data frame but does not exist in prefix map
WARNING:root:NIFSTD is used in the data frame but does not exist in prefix map
WARNING:root:EPCC is used in the data frame but does not exist in prefix map

Is there an automated way of deducing prefix_maps @matentzn ?
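
As a starting point, the set of prefixes that still need an entry can at least be computed automatically. The sketch below does not deduce the URI expansions themselves, and `missing_prefixes` is a hypothetical helper rather than part of sssom-py; it assumes identifiers are CURIEs of the form prefix:local_id.

```python
def missing_prefixes(curies, prefix_map):
    """List prefixes used in the CURIEs that the prefix map lacks."""
    # Take everything before the first colon; skip values without one.
    used = {curie.split(":", 1)[0] for curie in curies if ":" in curie}
    return sorted(used - set(prefix_map))
```

The resulting list could then be looked up in an existing prefix registry, or filled in by hand, before the mapping set is written.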
