This repository contains scripts that automate the evaluation of clinical entity linking tasks in multiple languages.
The underlying philosophy is the following:
- It is better to add the original code as a git submodule than to copy it...
- ...and to write a wrapper around each evaluation script rather than change the original tool's code.
- We assume that Python evaluation scripts usually need
  - core parameters commonly shared by many model implementations, and
  - custom parameters specific to a particular model (or model implementation).
- Therefore we need mappings between the shared parameters and a small DSL of our own (see the sketch below).

All of this should be done by a single script and be configurable via Hydra.
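To make the mapping idea concrete, here is a minimal sketch assuming hypothetical parameter and flag names (only the `data_directory: data_dir` pair comes from the actual configs); it is not the repository's implementation:

```python
# Minimal sketch of the parameter-mapping idea (hypothetical names,
# not the actual implementation in this repository).

# Shared "DSL" parameters used for every model.
shared_params = {
    "data_directory": "data/datasets/codiesp",
    "vocabulary": "data/vocabs/codiesp-d-codes-es.txt",
}

# Mapping of the kind stored in config/parameters_mapping/*.yaml,
# in the form "our DSL: custom_args", e.g. "data_directory: data_dir".
mapping = {
    "data_directory": "data_dir",
    "vocabulary": "dictionary_path",   # hypothetical custom flag
}

def to_tool_args(shared: dict, mapping: dict) -> list:
    """Translate shared parameter names into the CLI flags expected by a wrapped eval script."""
    args = []
    for name, value in shared.items():
        args += [f"--{mapping.get(name, name)}", str(value)]
    return args

# e.g. ['--data_dir', 'data/datasets/codiesp', '--dictionary_path', 'data/vocabs/...']
print(to_tool_args(shared_params, mapping))
```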
In general, evaluation is done by running the `universal_runner.py` script, which takes Hydra-style command-line arguments as input. The set of possible parameters and their default values is configured via YAML files in the `config/` folder.
Each run is saved to `results/sessions_YYYY-MM-DD-...` together with the corresponding YAML config.
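For orientation, a Hydra entry point built on top of such a `config/` folder typically looks like the minimal sketch below; this is illustrative only and not necessarily how `universal_runner.py` is organized:

```python
# Minimal sketch of a Hydra entry point (illustrative only).
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="config", config_name="config")
def main(cfg: DictConfig) -> None:
    # cfg is composed from config/config.yaml and the selected
    # dataset/model/parameters_mapping/vocabulary group configs.
    print(OmegaConf.to_yaml(cfg))
    # ... run the wrapped evaluation script with the resolved parameters ...

if __name__ == "__main__":
    main()
```

Any value of the composed config can then be overridden on the command line in Hydra style, e.g. `python3 universal_runner.py some_group=some_config some.param=value` (the group and parameter names here are placeholders).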
To aggregate the results, run `universal_aggregator.py`; it applies the relevant output parser (set in `model/*.yaml`; code in `utils.output_parsers`) to each session folder under `results/`.
As a result, a single large CSV table is generated, with one row per set of parameters. If the same parameter set occurs more than once, only the latest scores are kept.
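The parse-then-deduplicate step can be pictured with the following sketch (column names and the output path are assumptions, not the actual `universal_aggregator.py` code):

```python
# Sketch of the aggregation/deduplication step (hypothetical column names and output path).
import pandas as pd

rows = [
    # one dict per parsed session, e.g. {"model": ..., "dataset": ..., "acc@1": ...}
    {"model": "m1", "dataset": "d1", "timestamp": "2021-01-01", "acc@1": 0.70},
    {"model": "m1", "dataset": "d1", "timestamp": "2021-01-05", "acc@1": 0.72},
]
df = pd.DataFrame(rows).sort_values("timestamp")

param_cols = ["model", "dataset"]                        # everything except scores/timestamps
df = df.drop_duplicates(subset=param_cols, keep="last")  # keep only the latest scores
df.to_csv("aggregated.csv", index=False)
```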
config/
├─── dataset/             # dataset YAML configs specifying [names, paths, languages, fairness level]
├─── model/               # model YAML configs specifying [names, output parsers, model_path, parameters_mapping]
├─── parameters_mapping/  # mappings between argument sets in the form "our DSL: custom_args", e.g. "data_directory: data_dir"
├─── vocabulary/          # vocabulary YAML configs specifying [names, paths]
└─── config.yaml
Unfortunately, these benchmarks require a few steps that cannot be automated: certain datasets are available only on request after certain formal procedures. One should obtain:
- Several corpora: CANTEMIST, MCN, Mantra GSC, CodiEsp (see Section "Datasets & Vocabularies").
- UMLS 2020AA data, specifically `MRCONSO.RRF` and `MRSTY.RRF`.
- The CANTEMIST vocabulary used by the SINAI research group. The file `cieo-synonyms.csv` is available on request from the authors.
- The MCN data introduced in the original paper.
- [optional] `MedLexSp.zip`, which is available on request and requires signing the corresponding paperwork.
- Put `MRCONSO.RRF`, `MRSTY.RRF`, and [optional] `MedLexSp.zip` into the root of the repository.
- To generate the English vocabulary used for the MCN dataset, run `./vocab_generate_snomedct_all.sh`. This generates `data/vocabs/SNOMEDCT_US-all-aggregated.txt`, a large file.
  Expected output:
  `1324661 data/vocabs/SNOMEDCT_US-all-aggregated.txt`
- English vocabulary used specifically for CLEF2013 (disorders only!): run `./vocab_generate_snomedct_clef2013.sh`. This generates `data/vocabs/SNOMEDCT_US_clef2013-biosyn-aggregated.txt`.
  Expected output:
  `363326 data/vocabs/SNOMEDCT_US_clef2013-biosyn-aggregated.txt`
- Spanish vocabulary for CANTEMIST: run `./vocab_generate_cantemist_lopez_ubeda_et_al.sh`. This generates `data/vocabs/CANTEMIST-lopez-ubeda-et-al.txt`. `cieo-synonyms.csv` should be at the root of the repo.
- CodiEsp vocabularies: run `./vocab_generate_icd10_codiesp.sh`; the `zenodo_get` Python package should be installed. The files `data/vocabs/codiesp-d-codes-es.txt` and `data/vocabs/codiesp-p-codes-es.txt` are generated as a result.
- MANTRA vocabularies: run `./vocab_generate_mantra.sh` [in progress].
- To generate the UMLS French DISO vocabulary, run `./vocab_generate_umls_fre_diso.sh`. This generates `data/vocabs/umls_fre_diso.txt`.
- [optional] To prepare MedLexSp, run `./vocab_generate_medlexsp.sh`; the file `data/vocabs/MedLexSp_v0.txt` should be generated.
- CANTEMIST: simply run `./data_generate_cantemist.sh`; `zenodo_get` is required. The results will be saved into `data/datasets/cantemist`.
- MCN: the data can be downloaded here, after registration. The unpacked data should be put into the folder `data/datasets/MCN_n2c2` (so that the folders `test`, `train`, and `gold` are present). Then the scripts `./data_generate_mcn.sh` and `query_preprocess_in_the_style_of_biosyn.sh` should be executed [in progress].
- CodiEsp: run `./data_generate_codiesp.sh` (requires `zenodo_get`); the results will be written to `data/datasets/codiesp`.
- MANTRA: can be downloaded here [in progress].
Example:
python3 fairification.py --test_dir data/datasets/codiesp/DIAGNOSTICO/test \
--train_dir data/datasets/codiesp/DIAGNOSTICO/train \
--vocabulary data/vocabs/codiesp-d-codes-es.txt \
--levenshtein_norm_method 1 \
--levenshtein_threshold 0.2
This may take a while; it should generate the following folders:
data/
└─── datasets/
     └─── codiesp/
          └─── DIAGNOSTICO/
               ├─── test-fair_exact/
               │        ...
               ├─── test-fair_exact_vocab/
               │        ...
               ├─── test-fair_levenshtein_0.2/
               │        ...
               └─── test-fair_levenshtein_train_0.2/
                        ...
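The `fair_levenshtein_*` subsets are controlled by the normalized Levenshtein threshold passed above; the exact filtering logic lives in `fairification.py`, but the general idea can be sketched as follows (the normalization and the filtering direction shown here are assumptions, not the script's actual behaviour):

```python
# Rough illustration of normalized Levenshtein filtering (not the actual
# fairification.py logic; normalization and filtering direction are assumptions).

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def normalized_distance(a: str, b: str) -> float:
    """Edit distance normalized by the longer string (one possible normalization)."""
    return levenshtein(a, b) / max(len(a), len(b), 1)

def is_fair(mention: str, seen: list, threshold: float = 0.2) -> bool:
    """Keep a test mention only if it is not too close to anything already seen
    (e.g. train mentions or vocabulary entries)."""
    return all(normalized_distance(mention, s) > threshold for s in seen)
```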
To run the unified evaluation script, execute, e.g.:
python3.6 universal_runner.py
The default parameters can be configured in the `config/` directory.
All the results are reported in separate folders. To build a single CSV sheet with the results, run the script:
python3.6 universal_aggregator.py
cd medical_crossing/
docker build . -t medical_crossing
nvidia-docker run -p 8807:8807 -v "`pwd`:/root/medical_crossing/" -it medical_crossing:latest bash
- CodiEsp: Clinical Case Coding in Spanish Shared Task (eHealth CLEF 2020)
- MCN: The 2019 n2c2/UMass Track 3
- CANTEMIST: CANcer TExt Mining Shared Task (tumor named entity recognition)
- Mantra GSC, link
- [optional] MedLexSp.zip
- UMLS 2020AA