
Closing the Feedback Loop: Improving Natural Language to SQL Translation Using Natural Language Explanations

Improve NL2SQL with Natural Language Explanations as Self-provided Feedback

This is the official repository, containing the code and pre-trained models for our paper Closing the Feedback Loop: Improving Natural Language to SQL Translation Using Natural Language Explanations. (The paper will be made public after acceptance 😊)


📖 Overview

This code implements:

  • A unified iterative framework built upon self-provided feedback to enhance the translation accuracy of existing end-to-end models.

🚀 About CycleSQL

TL;DR: We introduce CycleSQL -- a unified framework that enables flexible integration into existing end-to-end NL2SQL models. Inspired by the feedback mechanisms used in modern recommendation systems and iterative refinement methods introduced in LLMs, CycleSQL introduces NL explanations of query results as a form of internal feedback to create a self-contained feedback loop within the end-to-end translation process, facilitating iterative self-evaluation of translation correctness.

The objective of NL2SQL translation is to convert a natural language query into an SQL query.
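
For example, a typical Spider-style question and its target SQL look as follows (an illustrative pair):

NL query:  How many singers do we have?
SQL query: SELECT count(*) FROM singer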

Despite significant advancements in enhancing overall translation accuracy, current end-to-end models still struggle to produce output of the desired quality on their first attempt, owing to their treatment of the translation as a "one-time deal".

To tackle this problem, CycleSQL introduces natural language explanations of query results as self-provided feedback, and uses this feedback to iteratively validate the correctness of the translation, thereby improving overall translation accuracy.

❓ How it works

CycleSQL uses the following four steps to establish the feedback loop for the NL2SQL translation process:

  1. Provenance Tracking: Track the provenance of the query result to be explained, retrieving data-level information from the database.
  2. Semantics Enrichment: Enrich the provenance by associating it with operation-level semantics derived from the translated SQL query.
  3. Explanation Generation: Generate a natural language explanation by interpreting the enriched provenance information.
  4. Translation Verification: Use the generated NL explanation to verify the correctness of the underlying NL2SQL translation. Steps 1-4 are repeated until a validated correct translation is achieved.

This process is illustrated in the diagram below:
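
In code, the loop can be sketched roughly as follows. Note that every helper function in this sketch (translate, track_provenance, enrich_semantics, generate_explanation, verify) is a hypothetical stand-in for the corresponding component described above, not the repository's actual API:

# Minimal sketch of the CycleSQL feedback loop (illustration only).
# All helper functions are hypothetical stand-ins for the four
# components described above, not the repository's actual API.
def cyclesql(nl_query, db, max_iterations=5):
    sql = None
    for attempt in range(max_iterations):
        # The base NL2SQL model proposes a candidate translation,
        # e.g. the next-best beam candidate on later attempts.
        sql = translate(nl_query, db, attempt)
        result = db.execute(sql)

        provenance = track_provenance(sql, result, db)    # step 1
        enriched = enrich_semantics(provenance, sql)      # step 2
        explanation = generate_explanation(enriched)      # step 3
        if verify(nl_query, explanation):                 # step 4
            return sql  # validated translation
    return sql  # fall back to the last candidate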

⚡️ Quick Start

🙇 Prerequisites

First, you should set up a Python environment. This code base has been tested under Python 3.8.

  1. Install the required packages:
pip install -r requirements.txt
  2. Download the Spider dataset and its three robustness variants (Spider-Realistic, Spider-Syn, and Spider-DK), and put the data into the data folder. Unpack the datasets to create the following directory structure:
/data
├── database
│   └── ...
├── dev.json
├── dev_gold.sql
├── tables.json
├── train_gold.sql
└── train.json

πŸ‹οΈβ€β™€οΈ Training

📃 Natural Language Inference Model: We implement the natural language inference model based on T5-large, and use various NL2SQL models (i.e., SmBoP, PICARD, RESDSQL, and ChatGPT) to generate its training data. You can train the model from scratch with the following command:

$ python scripts/run_classification.py --model_name_or_path t5-large --shuffle_train_dataset --do_train --do_eval --num_train_epochs 5 --learning_rate 5e-6 --per_device_train_batch_size 8 --per_device_eval_batch_size 1 --evaluation_strategy steps --train_file data/nli/train.json  --validation_file data/nli/dev.json --output_dir tmp/ --load_best_model_at_end --save_total_limit 5
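
After training, the classifier can be applied to a (question, explanation) pair to judge whether the explanation is consistent with the question's intent. Below is a minimal usage sketch; the checkpoint path, pair format, and label convention are assumptions for illustration, not the repository's documented interface:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical checkpoint path; adjust to where the trained model was saved.
model_path = "saved_models/nli-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

question = "How many singers do we have?"
explanation = "The result 5 counts all rows in the singer table."

# Score the pair; which logit index means "consistent" depends on the
# label mapping used during training (assumed here to be 1).
inputs = tokenizer(question, explanation, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print("consistent" if logits.argmax(dim=-1).item() == 1 else "inconsistent")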

👍 Download the checkpoint

The natural language inference model checkpoint will be made available at the following link:

Model           Download
nli-classifier  nli-classifier.tar.gz

Put the model checkpoint into the saved_models folder.

👀 Inference

The evaluation script run_infer.sh is located in the root directory. You can run it with:

$ bash run_infer.sh <dataset_name> <model_name> <test_file_path> <model_raw_beam_output_file_path> <table_path> <db_dir> <test_suite_db_dir>
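
For instance, a run on the Spider dev set might look like the following; all file paths here are hypothetical placeholders and should be adapted to your local setup:

$ bash run_infer.sh spider resdsql data/dev.json outputs/resdsql_beam_output.json data/tables.json data/database data/test_suite_database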

The evaluation script will create an outputs directory under the current directory and write the evaluation results there.
