
dail-sql's Issues

Detailed README

Hello, I just read your paper and am looking at the code. I appreciate your efforts on this new method. I want to use it on my own database, but after reading the README I still don't understand how to do this, or whether it only works for the Spider data. I would be grateful if you could guide us on how to use it on our own data.
Thanks!!

Script or code for evaluation?

Hi,
I would appreciate it if you could provide the evaluation script, since there is no information about it in the repository. I am aware that you use other packages for evaluation, but I am not sure how you used them to assess the results.
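Not the authors' script, but for readers looking for a starting point: below is a minimal sketch of execution-based matching against the Spider SQLite databases, assuming one database file per example. The function name and comparison convention are my own; the official test-suite-sql-eval evaluator (the one the paper reports) is stricter, e.g. about ORDER BY and executing against multiple database instances.

import sqlite3

def execution_match(db_path: str, gold_sql: str, pred_sql: str) -> bool:
    """Return True iff the predicted SQL yields the same rows as the gold SQL."""
    conn = sqlite3.connect(db_path)
    try:
        gold_rows = conn.execute(gold_sql).fetchall()
        try:
            pred_rows = conn.execute(pred_sql).fetchall()
        except sqlite3.Error:
            return False  # predicted query does not even execute
        # Compare as multisets, i.e. ignore row order.
        return sorted(map(repr, gold_rows)) == sorted(map(repr, pred_rows))
    finally:
        conn.close()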

dev.json

Hello, I saw that in your code, train_spider.json and train_others.json are merged to produce train_spider_and_others. May I ask whether you used all of the Spider training data in your paper, or just the dev.json set?

dail

I see that the default data in generate_question.py uses dev.json.

dev

Did you train the model using train_spider_and_others and then test it using dev.json?

Question on Pre-Predicted s'

Hello, I just read your paper and am taking a look at the code. Thank you for your contributions.

There was one element in the paper that I did not fully understand, and I hope you can assist me with it.

[image: figure from the paper showing the pre-predicted s']

Where does the pre-predicted s' come from? Does it come from a separate, preliminary model, and if so, is that model included in the codebase?

python generate_question.py \
--data_type spider \
--split test \
--tokenizer gpt-3.5-turbo \
--max_seq_len 4096 \
--selector_type EUCDISMASKPRESKLSIMTHR \
--pre_test_result [your_pre_generated_queries_file] \
--prompt_repr SQL \
--k_shot 9 \
--example_type QA

I saw that in an example run setup for considering both question similarity and query similarity, the "pre_test_result" flag was included. I am unsure where these pre-generated queries come from, and I would greatly appreciate an explanation. Thank you.
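For intuition only, here is a rough sketch of how a pre-predicted query can drive example selection, under my own assumptions (all names and the crude skeleton function are illustrative, not the repository's API): candidate examples are ranked by embedding distance between masked questions, and candidates whose query skeleton resembles the skeleton of the pre-predicted SQL are preferred.

import re
import numpy as np

SQL_KEYWORDS = {
    "select", "from", "where", "group", "by", "order", "having", "join",
    "on", "and", "or", "not", "in", "limit", "distinct", "count", "avg",
    "sum", "min", "max", "like", "between", "desc", "asc",
}

def sql_skeleton(sql: str) -> set:
    # Crude skeleton: keep only SQL keywords, dropping identifiers and literals.
    # The repository's actual skeleton extraction is more elaborate.
    return {tok for tok in re.findall(r"[a-z_]+", sql.lower()) if tok in SQL_KEYWORDS}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a | b) else 1.0

def select_examples(question_vec, pre_pred_sql, candidates, k=9, threshold=0.85):
    # candidates: dicts with "question_vec" (np.ndarray) and "query" (str).
    pre_skel = sql_skeleton(pre_pred_sql)
    # Rank by Euclidean distance between (masked) question embeddings.
    ranked = sorted(candidates,
                    key=lambda c: float(np.linalg.norm(question_vec - c["question_vec"])))
    # Prefer candidates whose query skeleton matches the pre-predicted one.
    similar = [c for c in ranked if jaccard(pre_skel, sql_skeleton(c["query"])) >= threshold]
    rest = [c for c in ranked if c not in similar]
    return (similar + rest)[:k]

On this reading, the pre-predicted SQL simply comes from an earlier Text-to-SQL pass; as another issue below notes, a ready-made file is included at ./results/DAIL-SQL+GPT-4.txt.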

Error during data_preprocess

Hello!

Thanks for the great work! When I followed the instructions, I got stuck running data_preprocess.py: it reported 'Error while loading a tagger model (probably missing model file)'. Why does this happen, and how can I resolve it?

Thanks a lot!
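Not an official answer, but for future readers: this class of error usually means an NLP library's model data was never downloaded. If the preprocessing relies on NLTK (an assumption on my part), fetching its data packages sometimes resolves it:

import nltk

# Assumption: the missing tagger model is NLTK data that is fetched lazily.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")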

How are foreign keys added to the schema?

Hello,

I just read your work and noticed that you add foreign keys to the BS_p, OD_p, TR_p, and AS_p prompt formats. I'm curious about these formats with foreign keys, but I can't find their details in your paper. Did I miss something, or could you give some examples of these formats?

Thank you.
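Not the authors' answer, but as a sketch of the general idea: schema prompts commonly list each table with its columns and then append one line per foreign-key pair. The function below is purely illustrative and is not the paper's exact BS_p/OD_p/TR_p/AS_p format:

def schema_prompt_with_fks(tables, foreign_keys):
    # tables: {table_name: [column, ...]}
    # foreign_keys: [("table.col", "referenced_table.col"), ...]
    lines = [f"{name}({', '.join(cols)})" for name, cols in tables.items()]
    lines += [f"{src} = {dst}" for src, dst in foreign_keys]
    return "\n".join(lines)

print(schema_prompt_with_fks(
    {"singer": ["id", "name", "country_id"], "country": ["id", "name"]},
    [("singer.country_id", "country.id")],
))
# singer(id, name, country_id)
# country(id, name)
# singer.country_id = country.id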

open-source for BIRD dataset?

Thanks for your nice work.

Could you open-source the code and the dev-predicted SQLs for reproducing the BIRD dataset results? I cannot find the related code in this repo.

Thanks.

Database schema

Hello, I would like to ask: in the text-to-SQL experiment, is your database schema bound to the natural-language question (i.e., the question is written with a specific database schema in mind), or is the natural-language question independent of the database schema (i.e., a matching database schema is found for the question and placed into the corresponding prompt)?

question.json + pre_test_result [your_pre_generated_queries_file]

[UPDATE] Nevermind, your_pre_generated_queries_file is already included at ./results/DAIL-SQL+GPT-4.txt

I hope all is well. I'm attempting to re-implement this research paper, and I was wondering if there is any chance you have a pre-generated-queries file that I could use for some testing of the functionality. If not, no worries. Thanks, and the paper is awesome; great work!

 
python generate_question.py \
--data_type spider \
--split test \
--tokenizer gpt-3.5-turbo \
--max_seq_len 4096 \
--selector_type EUCDISMASKPRESKLSIMTHR \
--pre_test_result [your_pre_generated_queries_file] \
--prompt_repr SQL \
--k_shot 9 \
--example_type QA

Help wanted! What is the significance of extracting the skeleton?

I am very interested in your paper and code. While reading the code, I found that you use sentence-transformers to compute Euclidean distances for selecting examples. I also noticed that you extract the question skeleton and the SQL skeleton, and that they are used when generating question.json.

    # Methods of the prompt class (the class header is not shown in this excerpt).
    def __init__(self, tokenizer: str, *args, **kwargs):
        self.tokenizer = get_tokenizer(tokenizer)
        self.example_qualities = []
        self.pattern_similarities = []

    def record_example_quality(self, examples, target):
        quality_list = []
        for example in examples:
            quality_list.append(jaccard_similarity(example["query_skeleton"], target["query_skeleton"]))
        self.example_qualities.append(quality_list)

    def get_example_quality(self):
        if self.example_qualities:
            return np.mean([num for row in self.example_qualities for num in row])
        else:
            return 1

    def get_example_quality_for_each(self):
        if self.example_qualities:
            return [np.mean(row) for row in self.example_qualities]
        else:
            return []

    def record_pattern_similarity(self, examples, target):
        similarity_list = []
        for example in examples:
            similarity_list.append(jaccard_similarity(example["question_pattern"], target["question_pattern"]))
        self.pattern_similarities.append(similarity_list)

    def get_pattern_similarity(self):
        if self.pattern_similarities:
            return np.mean([num for row in self.pattern_similarities for num in row])
        else:
            return 1

However, to my surprise, the skeletons are only used to compute average scores for reporting; they are not used in ask_llm.py.

So I'm curious: what is the purpose of extracting the skeletons?

I am very interested in your work, could you please help me clarify my doubts? Thank you.
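For context, the jaccard_similarity referenced above is presumably a token-set Jaccard over the two skeleton strings. A plausible reconstruction (not necessarily the repository's exact helper):

def jaccard_similarity(skeleton_a: str, skeleton_b: str) -> float:
    # Token-set Jaccard similarity between two skeleton strings.
    a, b = set(skeleton_a.split()), set(skeleton_b.split())
    return len(a & b) / len(a | b) if (a | b) else 1.0

Read this way, record_example_quality and record_pattern_similarity look like diagnostic metrics for how well the selected examples match the target, rather than inputs to the prompt itself.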

issue with generate_question.py

Hi there!
I was initially implementing DAIL-SQL, and the version I pulled at the beginning of December worked perfectly fine. However, with the new commits and the updated version, I'm encountering the following error when I run this command:

python generate_question.py \
--data_type spider \
--split test \
--tokenizer gpt-3.5-turbo \
--max_seq_len 4096 \
--prompt_repr SQL \
--k_shot 9 \
--example_type QA \
--selector_type EUCDISQUESTIONMASK

Error:

File "generate_question.py", line 29, in <module>
    REPR_TYPE.OPENAI_DEMOSTRATION_WFK,
AttributeError: type object 'REPR_TYPE has no attribute  'OPENAI_DEMOSTRATION_WFK'

pre_test_results BIRD-SQL

Hi,

Like BugMaker-Boyan, I'm super grateful that you have added the BIRD-SQL results -- it's definitely the best benchmark in Text-to-SQL right now.

I'm strapped for compute at the moment (aren't we all), and don't think I can afford to run the pre_test_result generation phase.

I want to check whether the URL below contains the pre_test_results you used for BIRD-SQL.

https://github.com/AlibabaResearch/DAMO-ConvAI/blob/main/bird/llm/exp_result/turbo_output_kg/predict_dev.json  

If they are not the correct pre_test_results, I was wondering if you could release the pre_test_results you originally generated, similar to the results/graphix_result.txt for Spider dev?

I'm excited to see what you and your team do next!

NameError in data_preprocess.py

if data_type == "spider":

You need to add data_type = args.data_type on the preceding line to avoid the following error:

Traceback (most recent call last):
  File "data_preprocess.py", line 112, in <module>
    if data_type == "spider":
NameError: name 'data_type' is not defined
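Concretely, the suggested fix (placement approximate):

# data_preprocess.py, just before the failing check:
data_type = args.data_type  # the missing assignment
if data_type == "spider":
    ...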

Error During Data Processing

Hi!

Thanks for the great work! I followed the instructions in the README but had no luck running python data_preprocess.py.

It gets stuck at "test section linking: " at 0%. Any clues as to the possible reason?

Thanks a lot!

Why is the execution accuracy on the Spider dev set different from the paper?

I evaluated Spider dev using the official test-suite-sql-eval scripts with the GPT-4 results from your repo, but the execution accuracy differs.

The paper's result: 83.5%

But I get: 76.2%

run cmd:

python3 evaluation.py --gold ./my_test/gold_sqls.txt --pred ./my_test/DAIL-SQL+GPT-4+self-consistency.txt --db ./database/database/ --etype exec

                    easy      medium    hard      extra     all
count               248       446       174       166       1034
===================== EXECUTION ACCURACY =====================
execution           0.903     0.818     0.661     0.506     0.762
