CHASE: A Large-Scale and Pragmatic Chinese Dataset for Cross-Database Context-Dependent Text-to-SQL

CHASE is a large-scale and pragmatic Chinese dataset for the cross-database context-dependent text-to-SQL task (natural language interfaces for relational databases). It is released along with our ACL 2021 paper, "CHASE: A Large-Scale and Pragmatic Chinese Dataset for Cross-Database Context-Dependent Text-to-SQL". This repo contains the CHASE dataset.

Citation

Data Content and Format

Question, SQL, and Parsed SQL

Each entry in train.json and dev.json contains the following fields:

database_id: the id of the database that this interaction queries.
interaction: the interaction itself, a sequence of questions addressed to the database. Each question in the interaction includes:
- utterance: the natural language question
- utterance_toks: the natural language question tokens
- query: the SQL query corresponding to the question.
- sql: parsed results of this SQL query using process_sql.py. Please refer to the Spider Github page for the detailed documentation.

    {
        "database_id": "party_host",
        "interaction": [
            {
                "utterance": "主办方都有谁？",
                "utterance_toks": [
                    "主",
                    "办",
                    "方",
                    ...
                    "？"
                ],
                "query": "select 姓名 from 主办方",
                "sql": {
                    "except": null,
                    "from": {
                        "conds": [],
                        "table_units": [
                            [
                                "table_unit",
                                1
                            ]
                        ]
                    },
                    ...
                    "where": []
                }
            },
            {
                "utterance": "他们来自哪些不同的国家？",
                "utterance_toks": [
                    "他",
                    "们",
                    ...
                    "？"
                ],
                "query": "select distinct 国籍 from 主办方",
                "sql": {
                    "except": null,
                    "from": {
                        "conds": [],
                        "table_units": [
                            [
                                "table_unit",
                                1
                            ]
                        ]
                    },
                    ...
                    "where": []
                }
            },
            {
                "utterance": "每个国家有多少个主办方？",
                "utterance_toks": [
                    "每",
                    "个",
                    "国",
                    "家",
                    ...
                    "？"
                ],
                "query": "select 国籍 , count(*) from 主办方 group by 国籍",
                "sql": {
                    "except": null,
                    "from": {
                        "conds": [],
                        "table_units": [
                            [
                                "table_unit",
                                1
                            ]
                        ]
                    },
                    ...
                    "where": []
                }
            }
        ]
    }
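
The fields above can be read with a few lines of Python. This is a minimal sketch, not part of the official tooling: it assumes that train.json and dev.json each hold a top-level JSON array of objects shaped like the example above, and the helper names load_interactions and iter_turns are hypothetical.

```python
import json


def load_interactions(path):
    """Load a CHASE split (assumed to be a JSON array of interactions)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def iter_turns(interactions):
    """Yield (database_id, turn_index, utterance, query) for every question."""
    for interaction in interactions:
        db_id = interaction["database_id"]
        # Turns are kept in order because later questions depend on earlier ones.
        for turn_index, turn in enumerate(interaction["interaction"]):
            yield db_id, turn_index, turn["utterance"], turn["query"]


# In-memory sample mirroring the example above; the "utterance_toks" and
# parsed "sql" fields are left out here for brevity.
sample = [
    {
        "database_id": "party_host",
        "interaction": [
            {"utterance": "主办方都有谁？", "query": "select 姓名 from 主办方"},
            {"utterance": "他们来自哪些不同的国家？", "query": "select distinct 国籍 from 主办方"},
        ],
    }
]

for db_id, turn_index, utterance, query in iter_turns(sample):
    print(db_id, turn_index, utterance, "->", query)
```

To run it on the real data, pass the result of load_interactions("train.json") to iter_turns instead of the sample, adjusting the path to wherever the file lives in your checkout.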

Tables

tables.json contains the following information for each database:

db_id: database id
table_names_original: original table names stored in the database.
table_names: cleaned and normalized table names. We make sure the table names are meaningful. [to be changed]
column_names_original: original column names stored in the database. Each column looks like [0, "派对主题"], where 0 is the index of the table in table_names ("派对" in this case) and "派对主题" is the column name.
column_names: cleaned and normalized column names. We make sure the column names are meaningful. [to be changed]
column_types: data type of each column
foreign_keys: foreign keys in the database. Each pair, such as [12, 7], holds two column indices into column_names; the two columns belong to two different tables and are linked by a foreign key.
primary_keys: primary keys in the database. Each number is an index into column_names.

    {
        "db_id": "party_host",
        "table_names_original": [
            "派对",
            "主办方",
            "派对主办方"
        ],
        "table_names": [
            "派对",
            "主办方",
            "派对主办方"
        ],
        "column_names_original": [
            [
                -1,
                "*"
            ],
            [
                0,
                "派对"
            ],
            [
                0,
                "派对主题"
            ],
            [
                0,
                "地点"
            ],
            ...
        ],
        "column_names": [
            [
                -1,
                "*"
            ],
            [
                0,
                "派对"
            ],
            [
                0,
                "派对主题"
            ],
            [
                0,
                "地点"
            ],
            ...
        ],
        "column_types": [
            "text",
            "number",
            "text",
            "text",
            ...
        ],
        "foreign_keys": [
            [
                11,
                1
            ],
            [
                12,
                7
            ]
        ],
        "primary_keys": [
            1,
            7,
            11
        ]
    }
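
Because foreign_keys and primary_keys store column indices rather than names, it is often convenient to resolve them back to table and column names. The sketch below illustrates one way to do this; the helper names qualified_columns and foreign_key_pairs are hypothetical, and the toy schema uses made-up tables and columns rather than the real contents of party_host.

```python
def qualified_columns(schema):
    """Return 'table.column' strings aligned with column_names_original."""
    tables = schema["table_names_original"]
    names = []
    for table_idx, column in schema["column_names_original"]:
        # Table index -1 marks the special "*" column, which has no table.
        names.append(column if table_idx < 0 else f"{tables[table_idx]}.{column}")
    return names


def foreign_key_pairs(schema):
    """Resolve each foreign-key index pair to a pair of qualified names."""
    names = qualified_columns(schema)
    return [(names[a], names[b]) for a, b in schema["foreign_keys"]]


# Hypothetical toy schema in the tables.json layout (not taken from CHASE).
toy_schema = {
    "db_id": "toy",
    "table_names_original": ["party", "host"],
    "column_names_original": [
        [-1, "*"],
        [0, "party_id"],
        [0, "host_id"],
        [1, "host_id"],
        [1, "name"],
    ],
    "foreign_keys": [[2, 3]],
    "primary_keys": [1, 3],
}

print(foreign_key_pairs(toy_schema))
print([qualified_columns(toy_schema)[i] for i in toy_schema["primary_keys"]])
```

The same functions apply to any entry of tables.json, since they rely only on the fields documented above.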
