CHASE: A Large-Scale and Pragmatic Chinese Dataset for Cross-Database Context-Dependent Text-to-SQL

CHASE is a large-scale and pragmatic Chinese dataset for the cross-database, context-dependent text-to-SQL task (natural language interfaces for relational databases). It is released along with our ACL 2021 paper, CHASE: A Large-Scale and Pragmatic Chinese Dataset for Cross-Database Context-Dependent Text-to-SQL. This repo contains the CHASE dataset.

Citation

Data Content and Format

Question, SQL, and Parsed SQL

Each entry in train.json and dev.json contains the following fields:

  • database_id: the database id to which this interaction is addressed.
  • interaction: the sequence of questions asked against this database. Each question in the interaction includes:
    • utterance: the natural language question
    • utterance_toks: the natural language question tokens
    • query: the SQL query corresponding to the question.
    • sql: the parsed form of this SQL query, produced by process_sql.py. Please refer to the Spider GitHub page for detailed documentation.
    {
        "database_id": "party_host",
        "interaction": [
            {
                "utterance": "主办方都有谁?",
                "utterance_toks": [
                    "主",
                    "办",
                    "方",
                    ...
                    "?"
                ],
                "query": "select 姓名 from 主办方",
                "sql": {
                    "except": null,
                    "from": {
                        "conds": [],
                        "table_units": [
                            [
                                "table_unit",
                                1
                            ]
                        ]
                    },
                    ...
                    "where": []
                }
            },
            {
                "utterance": "他们来自哪些不同的国家?",
                "utterance_toks": [
                    "他",
                    "们",
                    ...
                    "?"
                ],
                "query": "select distinct 国籍 from 主办方",
                "sql": {
                    "except": null,
                    "from": {
                        "conds": [],
                        "table_units": [
                            [
                                "table_unit",
                                1
                            ]
                        ]
                    },
                    ...
                    "where": []
                }
            },
            {
                "utterance": "每个国家有多少个主办方?",
                "utterance_toks": [
                    "每",
                    "个",
                    "国",
                    "家",
                    ...
                    "?"
                ],
                "query": "select 国籍 , count(*) from 主办方 group by 国籍",
                "sql": {
                    "except": null,
                    "from": {
                        "conds": [],
                        "table_units": [
                            [
                                "table_unit",
                                1
                            ]
                        ]
                    },
                    ...
                    "where": []
                }
            }
        ]
    }
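
The fields above can be walked with a few lines of Python. The following is a minimal sketch, not part of the released code: it assumes that train.json and dev.json are each a JSON array of interaction objects shaped like the example, and the helper names load_interactions and iter_turns are hypothetical. It pairs every question with the earlier utterances of the same interaction, which is the context a context-dependent parser would usually consume.

```python
import json


def load_interactions(path):
    """Read a CHASE split (e.g. train.json); assumes a top-level JSON array."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def iter_turns(interactions):
    """Yield (database_id, history, utterance, query) for every question.

    history is a tuple of the earlier utterances in the same interaction.
    """
    for interaction in interactions:
        history = []
        for turn in interaction["interaction"]:
            yield (
                interaction["database_id"],
                tuple(history),
                turn["utterance"],
                turn["query"],
            )
            history.append(turn["utterance"])


# Inline sample copied from the first two turns of the example above
# (utterance_toks and sql are left out for brevity).
sample = [
    {
        "database_id": "party_host",
        "interaction": [
            {"utterance": "主办方都有谁?", "query": "select 姓名 from 主办方"},
            {
                "utterance": "他们来自哪些不同的国家?",
                "query": "select distinct 国籍 from 主办方",
            },
        ],
    }
]

for db_id, history, utterance, query in iter_turns(sample):
    print(db_id, len(history), utterance, "->", query)
```

To run this over the real data, replace sample with load_interactions("train.json"), adjusting the path to wherever the file is stored.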

Tables

tables.json contains the following information for each database:

  • db_id: database id
  • table_names_original: original table names stored in the database.
  • table_names: cleaned and normalized table names. We make sure the table names are meaningful. [to be changed]
  • column_names_original: original column names stored in the database. Each column is a pair such as [0, "派对主题"], where 0 is the index of the owning table in table_names ("派对" in this case) and "派对主题" is the column name.
  • column_names: cleaned and normalized column names. We make sure the column names are meaningful. [to be changed]
  • column_types: data type of each column
  • foreign_keys: foreign keys in the database. Each pair, such as [12, 7], holds two column indices into column_names. The two columns belong to two different tables and are linked by the foreign key.
  • primary_keys: primary keys in the database. Each number is an index into column_names.
    {
        "db_id": "party_host",
        "table_names_original": [
            "派对",
            "主办方",
            "派对主办方"
        ],
        "table_names": [
            "派对",
            "主办方",
            "派对主办方"
        ],
        "column_names_original": [
            [
                -1,
                "*"
            ],
            [
                0,
                "派对"
            ],
            [
                0,
                "派对主题"
            ],
            [
                0,
                "地点"
            ],
            ...
        ],
        "column_names": [
            [
                -1,
                "*"
            ],
            [
                0,
                "派对"
            ],
            [
                0,
                "派对主题"
            ],
            [
                0,
                "地点"
            ],
            ...
        ],
        "column_types": [
            "text",
            "number",
            "text",
            "text",
            ...
        ],
        "foreign_keys": [
            [
                11,
                1
            ],
            [
                12,
                7
            ]
        ],
        "primary_keys": [
            1,
            7,
            11
        ]
    }
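
Because columns, primary keys, and foreign keys are all stored as indices, it can help to resolve them back to readable names. The sketch below is one possible way to do that. It uses a made-up two-table schema (toy_schema and its English names are purely illustrative and do not come from CHASE), and the helper names qualified_name, foreign_key_pairs, and primary_key_names are hypothetical. It makes no assumption about which side of a foreign key pair is the referencing column.

```python
def qualified_name(schema, col_idx):
    """Return "table.column" for a column index, or "*" for the wildcard entry."""
    table_idx, column = schema["column_names"][col_idx]
    if table_idx < 0:  # the [-1, "*"] entry belongs to no table
        return column
    return f'{schema["table_names"][table_idx]}.{column}'


def foreign_key_pairs(schema):
    """Resolve every [i, j] foreign key pair to a pair of qualified names."""
    return [
        (qualified_name(schema, a), qualified_name(schema, b))
        for a, b in schema["foreign_keys"]
    ]


def primary_key_names(schema):
    """Resolve every primary key index to a qualified name."""
    return [qualified_name(schema, i) for i in schema["primary_keys"]]


# Hypothetical miniature schema in the tables.json layout (not real CHASE data).
toy_schema = {
    "db_id": "toy_db",
    "table_names": ["party", "host"],
    "column_names": [
        [-1, "*"],
        [0, "party_id"],
        [0, "theme"],
        [1, "host_id"],
        [1, "party_id"],
    ],
    "column_types": ["text", "number", "text", "number", "number"],
    "foreign_keys": [[4, 1]],
    "primary_keys": [1, 3],
}

print(foreign_key_pairs(toy_schema))
print(primary_key_names(toy_schema))
```

The same functions should work on an entry loaded from tables.json, since they only rely on the table_names, column_names, foreign_keys, and primary_keys fields described above.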

Contributors

jasperguo
