This repository contains scripts that automate the evaluation of clinical entity linking tasks in multiple languages.
The underlying philosophy is the following:
- It is better to add the original code as a git submodule than to copy it...
- ...and to write a wrapper around each evaluation script rather than change the original tool's code.
- We assume that Python evaluation scripts usually need
  - core parameters commonly shared by many model implementations, and
  - custom parameters specific to a particular model (or model implementation).
- Therefore we need mappings between the shared parameters and a small DSL of our own (see the sketch below).

All of this should be done by a single script and be configurable via Hydra.
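To make the mapping idea concrete, here is a minimal sketch assuming hypothetical parameter and flag names (only the `data_directory: data_dir` pair comes from the actual configs); it is not the repository's implementation:

```python
# Minimal sketch of the parameter-mapping idea (hypothetical names,
# not the actual implementation in this repository).

# Shared "DSL" parameters used for every model.
shared_params = {
    "data_directory": "data/datasets/codiesp",
    "vocabulary": "data/vocabs/codiesp-d-codes-es.txt",
}

# Mapping of the kind stored in config/parameters_mapping/*.yaml,
# in the form "our DSL: custom_args", e.g. "data_directory: data_dir".
mapping = {
    "data_directory": "data_dir",
    "vocabulary": "dictionary_path",   # hypothetical custom flag
}

def to_tool_args(shared: dict, mapping: dict) -> list:
    """Translate shared parameter names into the CLI flags expected by a wrapped eval script."""
    args = []
    for name, value in shared.items():
        args += [f"--{mapping.get(name, name)}", str(value)]
    return args

# e.g. ['--data_dir', 'data/datasets/codiesp', '--dictionary_path', 'data/vocabs/...']
print(to_tool_args(shared_params, mapping))
```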
In general, evaluation is done by running the `universal_runner.py` script, which takes Hydra-style command-line arguments as input. The set of possible parameters and their default values is configured via YAML files in the `config/` folder.
Each run is saved to `results/sessions_YYYY-MM-DD-...` together with the corresponding YAML config.
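For orientation, a Hydra entry point built on top of such a `config/` folder typically looks like the minimal sketch below; this is illustrative only and not necessarily how `universal_runner.py` is organized:

```python
# Minimal sketch of a Hydra entry point (illustrative only).
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="config", config_name="config")
def main(cfg: DictConfig) -> None:
    # cfg is composed from config/config.yaml and the selected
    # dataset/model/parameters_mapping/vocabulary group configs.
    print(OmegaConf.to_yaml(cfg))
    # ... run the wrapped evaluation script with the resolved parameters ...

if __name__ == "__main__":
    main()
```

Any value of the composed config can then be overridden on the command line in Hydra style, e.g. `python3 universal_runner.py some_group=some_config some.param=value` (the group and parameter names here are placeholders).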
To aggregate the results, run `universal_aggregator.py`; it applies the relevant output parser (set in `model/*.yaml`; code in `utils.output_parsers`) to each session folder under `results/`.
As a result, a single large CSV table is generated, with one row per set of parameters. If the same parameter set occurs more than once, only the latest scores are kept.
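The parse-then-deduplicate step can be pictured with the following sketch (column names and the output path are assumptions, not the actual `universal_aggregator.py` code):

```python
# Sketch of the aggregation/deduplication step (hypothetical column names and output path).
import pandas as pd

rows = [
    # one dict per parsed session, e.g. {"model": ..., "dataset": ..., "acc@1": ...}
    {"model": "m1", "dataset": "d1", "timestamp": "2021-01-01", "acc@1": 0.70},
    {"model": "m1", "dataset": "d1", "timestamp": "2021-01-05", "acc@1": 0.72},
]
df = pd.DataFrame(rows).sort_values("timestamp")

param_cols = ["model", "dataset"]                        # everything except scores/timestamps
df = df.drop_duplicates(subset=param_cols, keep="last")  # keep only the latest scores
df.to_csv("aggregated.csv", index=False)
```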
config/
├─── dataset/             # dataset YAML configs specifying [names, paths, languages, fairness level]
├─── model/               # model YAML configs specifying [names, output parsers, model_path, parameters_mapping]
├─── parameters_mapping/  # mappings between argument sets in the form "our DSL: custom_args", e.g. "data_directory: data_dir"
├─── vocabulary/          # vocabulary YAML configs specifying [names, paths]
└─── config.yaml
Unfortunately, these benchmarks require a few steps that cannot be automated: certain datasets are available only on request after certain formal procedures. One should obtain:
- Several corpora: CANTEMIST, MCN, Mantra GSC, CodiEsp (see Section "Datasets & Vocabularies").
- UMLS 2020AA data, specifically `MRCONSO.RRF` and `MRSTY.RRF`.
- The CANTEMIST vocabulary used by the SINAI research group. The file `cieo-synonyms.csv` is available on request from the authors.
- The MCN data introduced in the original paper.
- [optional] `MedLexSp.zip`, which is available on request and requires signing the corresponding paperwork.
- Put `MRCONSO.RRF`, `MRSTY.RRF`, and [optional] `MedLexSp.zip` into the root of the repository.
- To generate the English vocabulary used for the MCN dataset, run `./vocab_generate_snomedct_all.sh`. This generates `data/vocabs/SNOMEDCT_US-all-aggregated.txt`, a large file.
  Expected output:
  `1324661 data/vocabs/SNOMEDCT_US-all-aggregated.txt`
- English vocabulary used specifically for CLEF2013 (disorders only!): run `./vocab_generate_snomedct_clef2013.sh`. This generates `data/vocabs/SNOMEDCT_US_clef2013-biosyn-aggregated.txt`.
  Expected output:
  `363326 data/vocabs/SNOMEDCT_US_clef2013-biosyn-aggregated.txt`
- Spanish vocabulary for CANTEMIST: run `./vocab_generate_cantemist_lopez_ubeda_et_al.sh`. This generates `data/vocabs/CANTEMIST-lopez-ubeda-et-al.txt`. `cieo-synonyms.csv` should be at the root of the repo.
- CodiEsp vocabularies: run `./vocab_generate_icd10_codiesp.sh`; the `zenodo_get` Python package should be installed. The files `data/vocabs/codiesp-d-codes-es.txt` and `data/vocabs/codiesp-p-codes-es.txt` are generated as a result.
- MANTRA vocabularies: run `./vocab_generate_mantra.sh` [in progress].
- To generate the UMLS French DISO vocabulary, run `./vocab_generate_umls_fre_diso.sh`. This generates `data/vocabs/umls_fre_diso.txt`.
- [optional] To prepare MedLexSp, run `./vocab_generate_medlexsp.sh`; the file `data/vocabs/MedLexSp_v0.txt` should be generated.
- CANTEMIST: simply run `./data_generate_cantemist.sh`; `zenodo_get` is required. The results will be saved into `data/datasets/cantemist`.
- MCN: the data can be downloaded here, after registration. The unpacked data should be put into the folder `data/datasets/MCN_n2c2` (so that the folders `test`, `train`, and `gold` are present). Then the scripts `./data_generate_mcn.sh` and `query_preprocess_in_the_style_of_biosyn.sh` should be executed [in progress].
- CodiEsp: run `./data_generate_codiesp.sh` (requires `zenodo_get`); the results will be written to `data/datasets/codiesp`.
- MANTRA: can be downloaded here [in progress].
Example:
python3 fairification.py --test_dir data/datasets/codiesp/DIAGNOSTICO/test \
--train_dir data/datasets/codiesp/DIAGNOSTICO/train \
--vocabulary data/vocabs/codiesp-d-codes-es.txt \
--levenshtein_norm_method 1 \
--levenshtein_threshold 0.2
This may take a while; it should generate the following folders:
data/
└─── datasets/
     └─── codiesp/
          └─── DIAGNOSTICO/
               ├─── test-fair_exact/
               │        ...
               ├─── test-fair_exact_vocab/
               │        ...
               ├─── test-fair_levenshtein_0.2/
               │        ...
               └─── test-fair_levenshtein_train_0.2/
                        ...
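The `fair_levenshtein_*` subsets are controlled by the normalized Levenshtein threshold passed above; the exact filtering logic lives in `fairification.py`, but the general idea can be sketched as follows (the normalization and the filtering direction shown here are assumptions, not the script's actual behaviour):

```python
# Rough illustration of normalized Levenshtein filtering (not the actual
# fairification.py logic; normalization and filtering direction are assumptions).

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def normalized_distance(a: str, b: str) -> float:
    """Edit distance normalized by the longer string (one possible normalization)."""
    return levenshtein(a, b) / max(len(a), len(b), 1)

def is_fair(mention: str, seen: list, threshold: float = 0.2) -> bool:
    """Keep a test mention only if it is not too close to anything already seen
    (e.g. train mentions or vocabulary entries)."""
    return all(normalized_distance(mention, s) > threshold for s in seen)
```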
To run the unified evaluation script, execute, e.g.:
python3.6 universal_runner.py
The default parameters can be configured in the `config/` directory.
All the results are reported in separate folders. To build a single CSV sheet with the results, run the script:
python3.6 universal_aggregator.py
cd medical_crossing/
docker build . -t medical_crossing
nvidia-docker run -p 8807:8807 -v "`pwd`:/root/medical_crossing/" -it medical_crossing:latest bash
- CodiEsp: Clinical Case Coding in Spanish Shared Task (eHealth CLEF 2020)
- MCN: The 2019 n2c2/UMass Track 3
- CANTEMIST: CANcer TExt Mining Shared Task (tumor named entity recognition)
- Mantra GSC, link
- [optional] MedLexSp.zip
- UMLS 2020AA