tyler-cranmer / ehrsql-2024

This project forked from glee4810/ehrsql-2024

Clinical NLP Shared Task @ NAACL'24

Home Page: https://sites.google.com/view/ehrsql-2024

License: Creative Commons Attribution 4.0 International

ehrsql-2024's Introduction

Reliable Text-to-SQL on Electronic Health Records - Clinical NLP Workshop @ NAACL 2024

Electronic Health Records (EHRs) are relational databases that store the entire medical histories of patients within hospitals. They record numerous aspects of a patient's medical care, from admission and diagnosis to treatment and discharge. While EHRs are vital sources of clinical data, exploring them beyond a predefined set of queries requires skills in query languages like SQL. To make this process more accessible, one could develop a text-to-SQL system that automatically translates natural language questions into corresponding SQL queries. In this task, we aim to develop a reliable text-to-SQL system specifically tailored for EHRs.

This is part of the shared tasks at NAACL 2024 - Clinical NLP.

Timeline | Dataset | Evaluation | Baselines | Submission | Contact | Organizer

Timeline

All deadlines are 11:59PM UTC-12:00 (Anywhere on Earth), unless stated otherwise.

  • Registration opens: Monday January 29, 2024
  • Training and validation data release: Monday January 29, 2024
  • Test data release: Tuesday March 26, 2024
  • Run submission due: Thursday March 28, 2024 (11:59PM UTC)
  • Code submission and fact sheet deadline: Friday March 29, 2024
  • Final result release: Monday April 1, 2024
  • Paper submission period starts: Monday April 8, 2024
  • Paper submission due: Wednesday April 10, 2024
  • Notification of acceptance: Thursday April 18, 2024
  • Final versions of papers due: Wednesday April 24, 2024
  • Clinical NLP Workshop @ NAACL 2024: June 21 or 22, 2024, Mexico City, Mexico

Dataset

Statistics

#Train   #Valid   #Test
5124     1163     1167

Data Format

For the task, we have two types of files for each of the train, dev, and test sets: data files (with names like *_data.json) and label files (with names like *_label.json). Data files contain the input data for the model, and label files contain the expected model outputs, keyed by the same 'id' values as the corresponding data files.

Input Data (data.json)

{
  "version" : dataset version,
  "data" : [
    {
      "id" : sample identifier,
      "question" : natural language question (either answerable or unanswerable given the MIMIC-IV schema)
    },
    ...
  ]
}

Each object in the data list consists of an ID and the corresponding natural language question.

Output Data (label.json)

{
  id -> sample identifier : label -> SQL query or 'null' if subject to abstention,
  ...
}

Each object has a key of a sample's ID and a value of the corresponding label.
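
As a minimal sketch of how the two file types fit together, the Python snippet below pairs each question with its label by ID. The function names and the toy records are illustrative assumptions, not part of the task's code, and it assumes abstention is stored as the string "null", as the format description above suggests.

```python
import json

def pair_questions_with_labels(data, labels):
    """Return (id, question, label) triples; label is None if an id has no label."""
    return [(s["id"], s["question"], labels.get(s["id"])) for s in data["data"]]

def is_abstention(label):
    """True when the label marks the question as unanswerable (assumed to be the string 'null')."""
    return label == "null"

# Toy stand-ins for the contents of a *_data.json and a *_label.json file (hypothetical values).
data = json.loads(
    '{"version": "demo", "data": ['
    '{"id": "a1", "question": "How many patients were admitted?"},'
    '{"id": "a2", "question": "What is the weather today?"}]}'
)
labels = json.loads('{"a1": "SELECT COUNT(*) FROM admissions", "a2": "null"}')

pairs = pair_questions_with_labels(data, labels)
answerable = [(qid, q, sql) for qid, q, sql in pairs if not is_abstention(sql)]
```

With real files, the two dictionaries would come from json.load on the data and label files of the same split.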

Table Schema

We follow the same table information style used in Spider. tables.json contains the following information for the database:

  • db_id: the ID of the database
  • table_names_original: the original table names stored in the database.
  • table_names: the cleaned and normalized table names.
  • column_names_original: the original column names stored in the database. Each column has the format [0, "id"]. 0 is the index of the table name in table_names. "id" is the column name.
  • column_names: the cleaned and normalized column names.
  • column_types: the data type of each column
  • foreign_keys: the foreign keys in the database. Each pair, such as [7, 2], gives the indices in column_names of two columns in different tables that are linked by a foreign key.
  • primary_keys: the primary keys in the database. Each number is an index into column_names.

{
    "column_names": [
      [
        -1,
        "*"
      ],      
      [
        0,
        "row id"
      ],
      [
        0,
        "subject id"
      ],
      ...
    ],
    "column_names_original": [
      [
        -1,
        "*"
      ],      
      [
        0,
        "row_id"
      ],
      [
        0,
        "subject_id"
      ],
      ...
    ],
    "column_types": [
      "text",
      "number",
      "number",
      ...
    ],
    "db_id": "mimic_iv",
    "foreign_keys": [
      [
        7,
        2
      ],
      ...
    ],
    "primary_keys": [
      1,
      6,
      ...
    ],
    "table_names": [
      "patients",
      "admissions",
      ...
    ],
    "table_names_original": [
      "patients",
      "admissions",
      ...
    ]
  }
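
The index-based layout above can be resolved into readable names with a few lines of Python. This is a sketch under assumptions: the helper names are hypothetical, the toy schema is a hand-made miniature rather than the real tables.json, and it assumes table_names_original is index-aligned with table_names, as in Spider.

```python
def columns_by_table(schema):
    """Group original column names under their original table names."""
    tables = {name: [] for name in schema["table_names_original"]}
    for table_idx, column in schema["column_names_original"]:
        if table_idx >= 0:  # index -1 is the "*" placeholder, which belongs to no table
            tables[schema["table_names_original"][table_idx]].append(column)
    return tables

def qualified_name(schema, column_idx):
    """Turn a column index into a 'table.column' string."""
    table_idx, column = schema["column_names_original"][column_idx]
    return f'{schema["table_names_original"][table_idx]}.{column}'

def foreign_key_pairs(schema):
    """Resolve each foreign-key index pair into a pair of 'table.column' names."""
    return [(qualified_name(schema, a), qualified_name(schema, b))
            for a, b in schema["foreign_keys"]]

# Hand-made miniature schema in the same style (not the real MIMIC-IV indices).
toy_schema = {
    "db_id": "toy",
    "table_names_original": ["patients", "admissions"],
    "column_names_original": [
        [-1, "*"],
        [0, "row_id"], [0, "subject_id"],
        [1, "row_id"], [1, "subject_id"],
    ],
    "foreign_keys": [[4, 2]],
    "primary_keys": [1, 3],
}

tables = columns_by_table(toy_schema)
fks = foreign_key_pairs(toy_schema)
pks = [qualified_name(toy_schema, i) for i in toy_schema["primary_keys"]]
```

Resolved names like these are a common way to serialize a schema into the prompt or input sequence of a text-to-SQL model.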

Database

We use the MIMIC-IV database demo, whose files anyone can access as long as they conform to the terms of the Open Data Commons Open Database License v1.0. If you agree to the terms, use the bash commands below to download the database.

wget https://physionet.org/static/published-projects/mimic-iv-demo/mimic-iv-clinical-database-demo-2.2.zip
unzip mimic-iv-clinical-database-demo-2.2
gunzip -r mimic-iv-clinical-database-demo-2.2

Once downloaded, run the code below to preprocess the database. This step involves time-shifting, value deduplication in tables, and more.

cd preprocess
bash preprocess.sh
cd ..

Evaluation

The scorer (scoring.py in the scoring_program module) will report the official evaluation score for the task. For more details about the metric, please refer to the Evaluation tab on the Codabench website.

Baselines

We provide three sample baseline code examples on Colab as starters.

Generates 'null' for all predictions. This will mark all questions as unanswerable, and the reliability scores will match the percentage of unanswerable questions in the evaluation set.

Generates predictions using T5.

Generates predictions using ChatGPT.
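
The first baseline, which abstains on every question, can be sketched in a few lines of Python. This is not the Colab code itself: the function names and toy records are assumptions, and fraction_unanswerable is only a simplified illustration of the share of unanswerable questions, not the official scorer.

```python
def null_predictions(data):
    """Predict 'null' (abstain) for every question in a data-file dictionary."""
    return {sample["id"]: "null" for sample in data["data"]}

def fraction_unanswerable(labels):
    """Share of gold labels equal to 'null', i.e. questions marked unanswerable."""
    return sum(1 for label in labels.values() if label == "null") / len(labels)

# Toy stand-ins for a data file and its label file (hypothetical values).
data = {
    "version": "demo",
    "data": [
        {"id": "q1", "question": "How many patients are there?"},
        {"id": "q2", "question": "List all admission times."},
        {"id": "q3", "question": "Count the distinct subjects."},
        {"id": "q4", "question": "Who won the game last night?"},
    ],
}
labels = {
    "q1": "SELECT COUNT(*) FROM patients",
    "q2": "SELECT admittime FROM admissions",
    "q3": "SELECT COUNT(DISTINCT subject_id) FROM patients",
    "q4": "null",
}

predictions = null_predictions(data)
share = fraction_unanswerable(labels)  # 0.25 for the toy labels above
```

In this toy set, abstaining everywhere is correct only on q4, which is why the score of this baseline tracks the proportion of unanswerable questions.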

Submission

File Format

After saving your prediction file, compress (zip) it using a bash command, for example:

zip predictions.zip prediction.json
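
For those who prefer to stay in Python, the same packaging step can be sketched with the standard library. The function name write_submission and the toy predictions are assumptions; the file names mirror the bash example above, and the prediction file is assumed to use the same ID-to-label mapping as the label files.

```python
import json
import os
import tempfile
import zipfile

def write_submission(predictions, out_dir):
    """Write predictions to prediction.json and compress it into predictions.zip."""
    json_path = os.path.join(out_dir, "prediction.json")
    zip_path = os.path.join(out_dir, "predictions.zip")
    with open(json_path, "w") as f:
        json.dump(predictions, f)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        # arcname keeps the JSON file at the top level of the archive.
        zf.write(json_path, arcname="prediction.json")
    return zip_path

# Toy predictions (hypothetical IDs and queries).
toy_predictions = {"q1": "SELECT COUNT(*) FROM patients", "q2": "null"}

with tempfile.TemporaryDirectory() as tmp_dir:
    archive = write_submission(toy_predictions, tmp_dir)
    with zipfile.ZipFile(archive) as zf:
        archive_members = zf.namelist()
        round_trip = json.loads(zf.read("prediction.json"))
```

Reading the archive back, as the last lines do, is a cheap check that the file sits at the top level of the zip before uploading.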

Submitting the File

Submit your prediction file on our task website on Codabench. For more details, see the Submission tab.

Contact

For more updates, join our Google group https://groups.google.com/g/ehrsql-2024/.

Organizer

Organizers are from EdLab @ KAIST.
