PICO_Parser

Parse PubMed abstracts of randomized controlled trials (RCTs) following the PICO framework to standardize PICO elements.

Author: Tian Kang
Affiliation: Department of Biomedical Informatics, Columbia University (Dr. Chunhua Weng's lab)
Citation: "Kang, T., Zou, S. and Weng, C., 2019. Pretraining to Recognize PICO Elements from Randomized Controlled Trial Literature. Studies in health technology and informatics, 264, p.188."

UPDATE May, 2020:

1. Fixed the issues with the BERT-based parser.

2. A pretrained sentence classification model for RCT abstracts is now available.

Major updates coming soon ^_^:

More modules coming soon for representing medical evidence information comprehensively from RCT abstracts.

User Guide

NEW: BlueBERT-based Parser (bugs solved, May 2020):

Adapted from NCBI-NLP BlueBERT

Install requirements.txt
If you want to use UMLS to standardize entities, please install 'UMLS' and 'QuickUMLS' locally
Download the pretrained BlueBERT models for PICO element recognition (link in BERT)
Edit parser_config.py to customize your own directories and BERT configuration
Run the parser, passing the input directory as --data_dir and the output directory as --output_dir. In the input directory, each abstract is stored in its own text file, with its PMID as the file name. Example data is provided in the test folder.
```
 python run_bluebert_ner_predict.py --data_dir= --output_dir=
```

To run examples:

`python run_bluebert_ner_predict.py --data_dir=test/txt --output_dir=test/json`
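
The input layout described above (one text file per abstract, named by its PMID) can be prepared with a few lines of Python. This is a minimal sketch, not part of the repository: the helper name `write_abstracts` and the `.txt` extension are assumptions, the latter based on the `test/txt` example folder.

```python
from pathlib import Path


def write_abstracts(abstracts, data_dir):
    """Write each abstract to <data_dir>/<pmid>.txt and return the written paths.

    `abstracts` maps a PMID string to the abstract text.
    The .txt extension is an assumption based on the test/txt example folder.
    """
    out = Path(data_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for pmid, text in abstracts.items():
        path = out / f"{pmid}.txt"
        # One abstract per file, terminated by a single newline.
        path.write_text(text.strip() + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

For example, `write_abstracts({"11264545": abstract_text}, "my_input")` creates `my_input/11264545.txt`, and `my_input` can then be passed as `--data_dir`.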

Example

Input test/txt
Parsing results test/json

Original: LSTM Parser:

PICO Element with attributes in JSON/XML

Install requirements.txt
If you want to use UMLS to standardize entities, please install 'UMLS' and 'QuickUMLS' locally
Edit parser_config.py to customize your own directories and installation
Run python Phase1_NER_predict.py to start parsing

Clustering parsed PICO elements to represent study design

Download the context vectors pretrained on all PubMed abstracts from 1990 to 2019 (download link in cluster/model/download.txt)
Extract 3 files and put them under cluster/model
TO BE CONTINUED

Example

JSON
Input example.txt contains more than 70 abstracts with methods sections
Parsing results folder example_json_out

{
  "pmid": "11264545",
  "sentences": {
    "sent_1": {
      "Section": "METHODS",
      "text": "METHODS AND RESULTS : To determine the relative power of radiographic heart measurements for predicting outcome in dilated cardiomyopathy , we retrospectively studied 88 adult patients with chest radiographs obtained within 35 days of echocardiography .",
      "entities": {
        "entity_1": {
          "text": "radiographic heart measurements",
          "class": "Outcome",
          "negation": 0,
          "UMLS": "C0018787:heart,C1306645:radiograph,",
          "index": 1,
          "start": 10
        },
        "entity_2": {
          "text": "predicting outcome",
          "class": "Outcome",
          "negation": 0,
          "UMLS": "",
          "index": 2,
          "start": 14
        },
        "entity_3": {
          "text": "dilated cardiomyopathy",
          "class": "Participant",
          "negation": 0,
          "UMLS": "C0007193:dilated cardiomyopathy,",
          "index": 3,
          "start": 17
        },
        "entity_4": {
          "text": "chest radiographs",
          "class": "Participant",
          "negation": 0,
          "UMLS": "C1306645:radiographs,C0817096:chest,",
          "index": 4,
          "start": 27
        },
        "entity_5": {
          "text": "echocardiography",
          "class": "Participant",
          "negation": 0,
          "UMLS": "C0013516:echocardiography,",
          "index": 5,
          "start": 34
        }
      },
      "relations": {}
    },
    "sent_2": {
      "Section": "METHODS",
      "text": "Standard radiographic variables were measured for each patient , and the cardiothoracic ( CT ) ratio , frontal cardiac area , and volume were calculated .",
      "entities": {
        "entity_6": {
          "text": "Standard radiographic variables",
          "class": "Outcome",
          "negation": 0,
          "UMLS": "C0038137:Standard,C1306645:radiograph,",
          "index": 1,
          "start": 0
        },
        "entity_7": {
          "text": "cardiothoracic ( CT ) ratio",
          "class": "Outcome",
          "negation": 0,
          "UMLS": "",
          "index": 2,
          "start": 11
        },
        "entity_8": {
          "text": "frontal cardiac area",
          "class": "Outcome",
          "negation": 0,
          "UMLS": "C0018787:cardiac,",
          "index": 3,
          "start": 17
        },
        "entity_9": {
          "text": "volume",
          "class": "Outcome",
          "negation": 0,
          "UMLS": "",
          "index": 4,
          "start": 22
        }
      },
      "relations": {}
    }
  }
}
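
A parsed abstract in this JSON layout can be read back with the standard library. The sketch below assumes only the keys visible in the example above (`sentences`, `entities`, `class`, `text`, `UMLS`); the function names are hypothetical and not part of the parser.

```python
import json
from collections import defaultdict


def load_parsed(path):
    """Load one parsed abstract from a JSON file produced by the parser."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def parse_umls(field):
    """Split a UMLS string like 'C0018787:heart,C1306645:radiograph,'
    into (CUI, term) pairs. An empty string yields an empty list."""
    pairs = []
    for item in field.split(","):
        if item:  # skip the empty piece left by the trailing comma
            cui, _, term = item.partition(":")
            pairs.append((cui, term))
    return pairs


def entities_by_class(doc):
    """Group entity texts from one parsed abstract by their PICO class."""
    grouped = defaultdict(list)
    for sent in doc["sentences"].values():
        for ent in sent["entities"].values():
            # .get() tolerates entities that lack a key.
            grouped[ent.get("class", "Unknown")].append(ent.get("text", ""))
    return dict(grouped)
```

Applied to the abstract shown above, `entities_by_class` would collect "dilated cardiomyopathy" under `Participant` and "radiographic heart measurements" under `Outcome`.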

XML
Input test.txt
Parsing results temp.xml

A double-blind crossover comparison of pindolol , metoprolol , atenolol and labetalol in mild to moderate hypertension . 1 This study was designed to compare in a double-blind randomized crossover trial , atenolol , labetalol , metoprolol and pindolol . Considerable differences in dose ( atenolol 138 +/- 13 mg daily ; labetalol 308 +/- 34 mg daily ; metoprolol 234 +/- 22 mg daily ; and pindolol 24 +/-2 mg daily were required to produce similar antihypertensive effects .

<abstract>
		<sent>
			<text>A double-blind crossover comparison of pindolol , metoprolol , atenolol and labetalol in mild to moderate hypertension .</text>
			<entity class='Intervention' UMLS='C0031937:pindolol' index='T1' start='5'> pindolol </entity>
			 <entity class='Intervention' UMLS='C0025859:metoprolol' index='T2' start='7'> metoprolol </entity>
			 <entity class='Intervention' UMLS='C0004147:atenolol' index='T3' start='9'> atenolol </entity>
			 <entity class='Intervention' UMLS='C0022860:labetalol' index='T4' start='11'> labetalol </entity>
			 <entity class='Participant' UMLS='C0020538:hypertension' index='T5' start='13'> mild to moderate hypertension </entity>
		</sent>
		<sent>
			<text>1 This study was designed to compare in a double-blind randomized crossover trial , atenolol , labetalol , metoprolol and pindolol .</text>
			<entity class='Intervention' UMLS='C0004147:atenolol' index='T6' start='14'> atenolol </entity>
			 <entity class='Intervention' UMLS='C0022860:labetalol' index='T7' start='16'> labetalol </entity>
			 <entity class='Intervention' UMLS='C0025859:metoprolol' index='T8' start='18'> metoprolol </entity>
			 <entity class='Intervention' UMLS='C0031937:pindolol' index='T9' start='20'> pindolol </entity>
		</sent>
		<sent>
			<text>Considerable differences in dose ( atenolol 138 +/- 13 mg daily ; labetalol 308 +/- 34 mg daily ; metoprolol 234 +/- 22 mg daily ; and pindolol 24 +/-2 mg daily were required to produce similar antihypertensive effects .</text>
			<attribute class='modifier' index='T10' start='1'> differences </attribute>
			 <entity class='Intervention' UMLS='C0004147:atenolol' index='T11' start='5'> atenolol </entity>
			 <attribute class='measure' index='T12' start='6'> 138 +/- 13 mg daily </attribute>
			 <entity class='Intervention' UMLS='C0022860:labetalol' index='T13' start='12'> labetalol </entity>
			 <attribute class='measure' index='T14' start='13'> 308 +/- 34 mg daily </attribute>
			 <entity class='Intervention' UMLS='C0025859:metoprolol' index='T15' start='19'> metoprolol </entity>
			 <attribute class='measure' index='T16' start='20'> 234 +/- 22 mg daily </attribute>
			 <entity class='Intervention' UMLS='C0031937:pindolol' index='T17' start='27'> pindolol </entity>
			 <attribute class='measure' index='T18' start='28'> 24 +/-2 mg daily </attribute>
			 <entity class='Outcome' UMLS='C0003364:antihypertensive' index='T19' start='37'> antihypertensive effects </entity>
		</sent>
</abstract>
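
The XML output can be read with Python's built-in `xml.etree.ElementTree`. The following is a rough sketch based only on the element and attribute names in the example above; the function names are hypothetical. In the examples shown here, `start` lines up with the position of the entity's first whitespace-separated token in the sentence, and `start_matches_text` checks that reading rather than asserting it as the parser's definition.

```python
import xml.etree.ElementTree as ET


def read_sentences(xml_text):
    """Return one dict per <sent> with its text and its <entity> elements."""
    root = ET.fromstring(xml_text)
    sentences = []
    for sent in root.iter("sent"):
        entities = []
        for ent in sent.findall("entity"):
            entities.append({
                "text": (ent.text or "").strip(),
                "class": ent.get("class"),
                "umls": ent.get("UMLS"),
                "index": ent.get("index"),
                "start": int(ent.get("start")),
            })
        sentences.append({
            "text": sent.findtext("text", default=""),
            "entities": entities,
        })
    return sentences


def start_matches_text(sentence_text, entity):
    """Check whether `start` is the whitespace-token offset of the entity
    (an interpretation inferred from the examples, not documented behavior)."""
    tokens = sentence_text.split()
    span = entity["text"].split()
    begin = entity["start"]
    return tokens[begin:begin + len(span)] == span
```

Attribute elements (`<attribute class='measure' ...>`) are skipped here; they could be collected the same way with `sent.findall("attribute")`.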

Reference

The parser architecture is adapted from my previous project, the eligibility criteria parser EliIE.
The LSTM-CRF scripts are modified from EBM-NLP.
