
Medical Crossing: X-Lingual Clinical Concept Normalization Benchmark

This repository contains scripts that automate the evaluation of clinical entity linking in multiple languages.

The underlying philosophy is as follows.

  • It is better to submodule the code than to copy it...
  • ...and to write a wrapper around the eval scripts rather than change the original tool's code.
  • We assume that Python eval scripts usually need
    • core parameters commonly shared by many model implementations, and
    • custom parameters specific to a particular model (or model implementation).
  • Therefore, we need mappings between the shared parameters and a small DSL of our own.

All this should be done using a single script and be configurable via Hydra.
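
As a rough illustration of the parameter-mapping idea, a wrapper might translate shared parameter names into a wrapped tool's own flags like this (a minimal sketch; the function and the mapping below are hypothetical, not the repository's actual DSL):

    def build_tool_args(shared_params, mapping):
        """Translate our shared parameter names into a wrapped tool's own
        CLI flags, using a mapping such as {"data_directory": "data_dir"}."""
        args = []
        for our_name, value in shared_params.items():
            tool_name = mapping.get(our_name, our_name)  # fall back to our name
            args.extend([f"--{tool_name}", str(value)])
        return args

    # Hypothetical mapping: "data_directory" (our DSL) -> the tool's "data_dir"
    print(build_tool_args({"data_directory": "data/datasets/cantemist"},
                          {"data_directory": "data_dir"}))
    # ['--data_dir', 'data/datasets/cantemist']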


Overview of the repository

In general, evaluation is done by running the universal_runner.py script, which takes Hydra-style command-line arguments as input. The set of possible parameters and their default values is configured using YAML files in the config/ folder.
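
For example, a run might look as follows (the config group values here are purely illustrative; the actual names are defined by the YAML files in config/):

    python3 universal_runner.py dataset=cantemist model=some_model vocabulary=some_vocab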

Each run is saved to results/sessions_YYYY-MM-DD-... with the corresponding YAML config.

To aggregate the results, run universal_aggregator.py; it applies the relevant output parser (set in model/*.yaml; code in untils.output_parsers) to each session folder in the results.
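
Conceptually, an output parser is just a function from a session folder to a dictionary of scores. A minimal sketch, assuming a hypothetical log file name and metric format (the real interface in untils.output_parsers may differ):

    from pathlib import Path

    def parse_session(session_dir):
        """Hypothetical parser: scan a session folder for the wrapped tool's
        output and return a flat dict of metric name -> value."""
        metrics = {}
        log_path = Path(session_dir) / "output.log"    # assumed file name
        for line in log_path.read_text().splitlines():
            if line.startswith("acc@1"):               # assumed log format
                metrics["acc@1"] = float(line.split()[-1])
        return metrics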

As a result, a large CSV table is generated, with one row per parameter set. In case of duplicate parameter sets, only the latest scores are kept.
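
The "latest wins" deduplication rule can be pictured as follows (a sketch using pandas; the column names are hypothetical stand-ins for the real parameter set):

    import pandas as pd

    # One row per session; "model"/"dataset" stand in for the real parameter
    # columns, "timestamp" for the date encoded in the session folder name.
    runs = pd.DataFrame([
        {"model": "m1", "dataset": "cantemist", "timestamp": "2021-01-01", "acc@1": 0.80},
        {"model": "m1", "dataset": "cantemist", "timestamp": "2021-02-01", "acc@1": 0.82},
    ])

    # Keep only the latest score for each duplicate parameter set.
    latest = (runs.sort_values("timestamp")
                  .drop_duplicates(subset=["model", "dataset"], keep="last"))
    print(latest)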

Configs folder

config/
├── dataset/
│   └── ...   # dataset YAML configs specifying [names, paths, languages, fairness level]
├── model/
│   └── ...   # model YAML configs specifying [names, output parsers, model_path, parameters_mapping]
├── parameters_mapping/
│   └── ...   # mappings between argument sets in the form "our DSL: custom_args", e.g. "data_directory: data_dir"
├── vocabulary/
│   └── ...   # vocabulary YAML configs specifying [names, paths]
└── config.yaml

How to set up the benchmarks

Required manual work for preparing this benchmark

Unfortunately, these benchmarks require a few steps that cannot be automated: some datasets are available only on request, after certain formal procedures.

One should obtain:

  1. Several corpora: CANTEMIST, MCN, Mantra GSC, CodiEsp (see the "Datasets & vocabularies" section).
  2. UMLS 2020AA data, specifically the files MRCONSO.RRF and MRSTY.RRF.
  3. The CANTEMIST vocabulary used by the SINAI research group team; the file cieo-synonyms.csv is available on request from the authors.
  4. The MCN data introduced in the original paper.
  5. [optional] MedLexSp.zip, which is available on request and requires signing some paperwork.

Preparing vocabularies

  1. Put MRCONSO.RRF, MRSTY.RRF and [optional] MedLexSp.zip into the root of the repository.
  2. To generate the English vocabulary used for the MCN dataset, run ./vocab_generate_snomedct_all.sh. This will generate data/vocabs/SNOMEDCT_US-all-aggregated.txt, a large file.

Expected output (line count): 1324661 data/vocabs/SNOMEDCT_US-all-aggregated.txt

  3. English vocabulary used specifically for CLEF2013 (disorders only!): run ./vocab_generate_snomedct_clef2013.sh. This will generate data/vocabs/SNOMEDCT_US_clef2013-biosyn-aggregated.txt.

Expected output (line count): 363326 data/vocabs/SNOMEDCT_US_clef2013-biosyn-aggregated.txt

  4. Spanish vocabulary for CANTEMIST: run ./vocab_generate_cantemist_lopez_ubeda_et_al.sh. This generates data/vocabs/CANTEMIST-lopez-ubeda-et-al.txt; cieo-synonyms.csv should be at the root of the repo.
  5. CodiEsp vocabularies: run ./vocab_generate_icd10_codiesp.sh; the zenodo_get Python package should be installed. The files data/vocabs/codiesp-d-codes-es.txt and data/vocabs/codiesp-p-codes-es.txt are generated as a result.
  6. MANTRA vocabularies: run ./vocab_generate_mantra.sh [in progress].
  7. To generate the UMLS French DISO vocabulary, run ./vocab_generate_umls_fre_diso.sh. This will generate data/vocabs/umls_fre_diso.txt.
  8. [optional] To prepare MedLexSp, run ./vocab_generate_medlexsp.sh; the file data/vocabs/MedLexSp_v0.txt should be generated.
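
To double-check the generated vocabularies against the expected line counts quoted above, a small script like this can be used:

    from pathlib import Path

    # Expected line counts are taken from the "Expected output" notes above.
    expected = {
        "data/vocabs/SNOMEDCT_US-all-aggregated.txt": 1324661,
        "data/vocabs/SNOMEDCT_US_clef2013-biosyn-aggregated.txt": 363326,
    }

    for name, count in expected.items():
        path = Path(name)
        if not path.exists():
            print(f"MISSING {name}")
            continue
        lines = sum(1 for _ in path.open(encoding="utf-8"))
        print(f"{'OK ' if lines == count else 'BAD'} {lines:>9} {name}")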

Preparing datasets

  1. CANTEMIST: simply run ./data_generate_cantemist.sh (zenodo_get is required). The results will be saved to data/datasets/cantemist.
  2. MCN: the data can be downloaded here after registration. The unpacked data should be put into the folder data/datasets/MCN_n2c2 (so that the folders test, train, and gold are present). Then the scripts ./data_generate_mcn.sh and query_preprocess_in_the_style_of_biosyn.sh should be executed [in progress].
  3. CodiEsp: run ./data_generate_codiesp.sh (requires zenodo_get); the results will be written to data/datasets/codiesp.
  4. MANTRA: can be downloaded here [in progress].
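
Since the MCN step silently depends on the unpacked data landing in the right place, a quick check of the layout expected in step 2 may save some debugging:

    from pathlib import Path

    # Folders the MCN scripts expect, per step 2 above.
    root = Path("data/datasets/MCN_n2c2")
    for sub in ("train", "test", "gold"):
        print(sub, "OK" if (root / sub).is_dir() else "MISSING")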

Filtering datasets

Example:

    python3 fairification.py --test_dir data/datasets/codiesp/DIAGNOSTICO/test \
                             --train_dir data/datasets/codiesp/DIAGNOSTICO/train \
                             --vocabulary data/vocabs/codiesp-d-codes-es.txt \
                             --levenshtein_norm_method 1 \
                             --levenshtein_threshold 0.2

This may take a while; it should generate the following folders (a sketch of the filtering idea follows the tree below):

data/
└── datasets/
    └── codiesp/
        └── DIAGNOSTICO/
            ├── test-fair_exact/
            │   └── ...
            ├── test-fair_exact_vocab/
            │   └── ...
            ├── test-fair_levenshtein_0.2/
            │   └── ...
            └── test-fair_levenshtein_train_0.2/
                └── ...
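
The idea behind the levenshtein-based splits is, roughly, to drop test mentions that are near-duplicates of training mentions (or vocabulary entries) under a normalized edit distance threshold. A minimal sketch of that idea, assuming normalization by the longer string's length (the actual logic lives in fairification.py; the normalization used there is selected by --levenshtein_norm_method and may differ):

    def levenshtein(a, b):
        """Classic dynamic-programming edit distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def is_fair(mention, train_mentions, threshold=0.2):
        """Keep a test mention only if no training mention is within the
        normalized distance threshold (normalization choice is an assumption)."""
        return all(
            levenshtein(mention, t) / max(len(mention), len(t), 1) > threshold
            for t in train_mentions)

    print(is_fair("carcinoma", ["carcinomas", "melanoma"]))  # False: near-duplicate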

Evaluation without pretraining

To run the unified evaluation script:

    python3.6 universal_runner.py 

The default parameters can be configured in the config/ directory.

Generating evaluation scripts for multiple settings

Statistics aggregation

All results are reported in separate folders. To build a single CSV sheet with all the results, run the script:

    python3.6 universal_aggregator.py

Setting up a Docker image

    cd medical_crossing/
    docker build . -t medical_crossing  
    nvidia-docker run -p 8807:8807 -v "`pwd`:/root/medical_crossing/" -it medical_crossing:latest bash

References to the works used

Datasets & vocabularies

  • CodiEsp: Clinical Case Coding in Spanish Shared Task (eHealth CLEF 2020)
  • MCN: The 2019 n2c2/UMass Track 3
  • CANTEMIST (CANcer TExt Mining Shared Task – tumor named entity recognition)
  • Mantra GSC, link
  • [optional] MedLexSp.zip
  • UMLS 2020AA

