beachwang / dail-sql
An efficient and effective few-shot NL2SQL method on GPT-4.
License: Apache License 2.0
Hello, I just read your paper and am looking at the code. I appreciate your efforts on this new method. I want to use it on my own database, but after reading the README I still don't understand how to do this, or whether it only works on the Spider data. I would be grateful if you could guide us on using it on our own data.
Thanks!
Hi,
I would appreciate it if you could provide me with the evaluation script since there is no information about it in the repository. Although I am aware that you use some other packages for evaluation, I am not sure how you utilized them to assess the results.
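For context, execution-accuracy evaluation generally comes down to running the gold and the predicted SQL against the same database and comparing result sets. A minimal sketch of that idea (my own illustration, not the repo's evaluation code; the paper relies on the test-suite-sql-eval package):

```python
import sqlite3

def same_execution_result(db_path: str, gold_sql: str, pred_sql: str) -> bool:
    """Return True if gold and predicted SQL yield the same rows (order-insensitive)."""
    conn = sqlite3.connect(db_path)
    try:
        gold = conn.execute(gold_sql).fetchall()
        try:
            pred = conn.execute(pred_sql).fetchall()
        except sqlite3.Error:
            # A predicted query that fails to execute counts as wrong.
            return False
        return sorted(map(tuple, gold)) == sorted(map(tuple, pred))
    finally:
        conn.close()

# Trivial demo on an empty in-memory database:
print(same_execution_result(":memory:", "SELECT 1", "SELECT 1"))  # True
print(same_execution_result(":memory:", "SELECT 1", "SELECT 2"))  # False
```

The real test-suite evaluation is stricter (it runs each query pair against many database variants), but the comparison above is the core of "execution accuracy".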
Hello, I saw that in your code, "train_spider_and_others.json" and "train_others.json" are merged to produce train_spider_and_others. I would like to ask whether you used all of the Spider datasets in your paper, or just the dev.json set.
I see that the default data in generate_question.py uses dev.json
Did you train the model using train_spider_and_others and then test it using dev.json?
Hello, I just read your paper and am taking a look at the code. Thank you for your contributions.
There was one element in the paper that I did not fully understand, and I hope you can assist me with it.
Where does the pre-predicted s' come from? Does it come from a separate, preliminary model, and is that model included in the codebase?
python generate_question.py \
--data_type spider \
--split test \
--tokenizer gpt-3.5-turbo \
--max_seq_len 4096 \
--selector_type EUCDISMASKPRESKLSIMTHR \
--pre_test_result [your_pre_generated_queries_file] \
--prompt_repr SQL \
--k_shot 9 \
--example_type QA
I saw that in an example run setup for considering both question similarity and query similarity, the "pre_test_result" flag was included. I am unsure where these pre-generated queries come from, and would greatly appreciate an explanation. Thank you.
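If I read the selector name (EUCDISMASKPRESKLSIMTHR) right, the pre-generated queries are used for their SQL skeletons: candidate examples are kept only when their query skeleton is similar enough to the skeleton of the pre-predicted query. A hedged sketch of that idea under my own assumptions (toy skeletons; `jaccard` and `filter_by_query_skeleton` are hypothetical helpers, not the repo's code):

```python
def jaccard(a: str, b: str) -> float:
    # Token-set Jaccard similarity between two skeleton strings.
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def filter_by_query_skeleton(examples, pre_predicted_skeleton, threshold=0.85):
    """Keep candidates whose query skeleton is close to the pre-predicted one."""
    return [ex for ex in examples
            if jaccard(ex["query_skeleton"], pre_predicted_skeleton) >= threshold]

candidates = [
    {"query_skeleton": "select _ from _ where _"},
    {"query_skeleton": "select _ from _ group by _"},
]
print(filter_by_query_skeleton(candidates, "select _ from _ where _", 0.9))
```

The pre_test_result file would supply the predicted SQL from which such a skeleton is derived; see also the update below noting that a pre-generated file ships with the repo.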
Hello!
Thanks for the great work! When I followed the instructions, I got stuck running data_preprocess.py; it told me 'Error while loading a tagger model (probably missing model file)'. Why does this occur, and how can I resolve it?
Thanks a lot!
Hello,
I just read your work and found that you add foreign keys to the BS_p, OD_p, TR_p, and AS_p prompt formats. I'm curious about these formats with foreign keys, but I can't find their details in your paper. Did I miss something, or could you give some examples of these formats?
Thank you.
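For what it's worth, a foreign-key-augmented schema prompt usually just appends the foreign-key pairs to the schema listing. A hedged illustration of what such a prompt string might look like (my own guess at the layout, not taken from the paper; table names are invented):

```python
# Hypothetical schema with a foreign key between two tables.
tables = {
    "singer": ["singer_id", "name", "country"],
    "concert": ["concert_id", "singer_id", "year"],
}
foreign_keys = [("concert.singer_id", "singer.singer_id")]

# One line per table, then one line listing the foreign-key pairs.
lines = [f"# {t}({', '.join(cols)})" for t, cols in tables.items()]
lines.append("# Foreign keys: " + "; ".join(f"{a} = {b}" for a, b in foreign_keys))
prompt = "\n".join(lines)
print(prompt)
```

The exact wording and delimiters for each variant (BS_p, OD_p, TR_p, AS_p) would differ; this only shows the general shape of adding foreign keys to a schema prompt.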
Thanks for your nice work.
Could you open-source the code and dev-predicted SQLs for re-implementing the BIRD dataset results? I cannot find the related code in this repo.
Thanks.
Hello, I would like to ask: in the text2sql experiment, is the natural-language question written with the database schema in mind (i.e., the question is based on the schema), or is the question independent of the schema, so that you first find the matching database schema for the question and then put it into the corresponding prompt?
[UPDATE] Nevermind, your_pre_generated_queries_file is already included at ./results/DAIL-SQL+GPT-4.txt
I hope all is well. I'm attempting to re-implement this research paper, and I was wondering if there is any chance you have a pre_generated_queries file that I could use to test the functionality. If not, no worries. Thanks, and the paper is awesome. Great work!
python generate_question.py \
--data_type spider \
--split test \
--tokenizer gpt-3.5-turbo \
--max_seq_len 4096 \
--selector_type EUCDISMASKPRESKLSIMTHR \
--pre_test_result [your_pre_generated_queries_file] \
--prompt_repr SQL \
--k_shot 9 \
--example_type QA
I am very interested in your article and code. When reading the code, I found that you use sentence-transformers to compute Euclidean distances for extracting examples. I also noticed that you extract the question skeleton and SQL skeleton, and that they are used when generating question.json.
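The Euclidean-distance selection itself is straightforward once questions are embedded. A minimal numpy sketch (the toy vectors below stand in for sentence-transformers output; `nearest_examples` is my own illustrative helper, not the repo's):

```python
import numpy as np

def nearest_examples(query_emb, example_embs, k=2):
    """Indices of the k training examples closest in Euclidean distance."""
    dists = np.linalg.norm(example_embs - query_emb, axis=1)
    return np.argsort(dists)[:k].tolist()

# Toy 3-dim "embeddings"; in practice these come from a sentence-transformers model.
examples = np.array([[0.0, 0.0, 1.0],
                     [0.9, 0.1, 0.0],
                     [1.0, 0.0, 0.0]])
query = np.array([1.0, 0.0, 0.1])
print(nearest_examples(query, examples, k=2))  # [2, 1]
```

The selected indices would then be used to pull the corresponding question/SQL pairs into the few-shot prompt.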
import numpy as np

# get_tokenizer and jaccard_similarity are helpers defined elsewhere in this repo.

def __init__(self, tokenizer: str, *args, **kwargs):
    self.tokenizer = get_tokenizer(tokenizer)
    self.example_qualities = []      # per-question Jaccard scores between example and target SQL skeletons
    self.pattern_similarities = []   # per-question Jaccard scores between question patterns

def record_example_quality(self, examples, target):
    quality_list = []
    for example in examples:
        quality_list.append(jaccard_similarity(example["query_skeleton"], target["query_skeleton"]))
    self.example_qualities.append(quality_list)

def get_example_quality(self):
    if self.example_qualities:
        return np.mean([num for row in self.example_qualities for num in row])
    else:
        return 1

def get_example_quality_for_each(self):
    if self.example_qualities:
        return [np.mean(row) for row in self.example_qualities]
    else:
        return []

def record_pattern_similarity(self, examples, target):
    similarity_list = []
    for example in examples:
        similarity_list.append(jaccard_similarity(example["question_pattern"], target["question_pattern"]))
    self.pattern_similarities.append(similarity_list)

def get_pattern_similarity(self):
    if self.pattern_similarities:
        return np.mean([num for row in self.pattern_similarities for num in row])
    else:
        return 1
However, to my surprise, you only use the skeletons to compute these averages for the output, and do not use them in ask_llm.py.
So I'm curious about what is the purpose of extracting skeletons?
I am very interested in your work, could you please help me clarify my doubts? Thank you.
Hi there!
I was initially implementing DAIL-SQL, and the version I pulled at the beginning of December worked perfectly fine. However, with the new commits and the updated version, I'm encountering the following error when I run this command:
python generate_question.py \
--data_type spider \
--split test \
--tokenizer gpt-3.5-turbo \
--max_seq_len 4096 \
--prompt_repr SQL \
--k_shot 9 \
--example_type QA \
--selector_type EUCDISQUESTIONMASK
Error:
File "generate_question.py", line 29, in <module>
REPR_TYPE.OPENAI_DEMOSTRATION_WFK,
AttributeError: type object 'REPR_TYPE' has no attribute 'OPENAI_DEMOSTRATION_WFK'
Hi,
Like BugMaker-Boyan, I'm super grateful that you have added the BIRD-SQL results -- it's definitely the best benchmark in Text-to-SQL right now.
I'm strapped for compute at the moment (aren't we all), and don't think I can afford to run the pre_test_result generation phase.
I want to check that the URL below contains the pre_test_results you used for BIRD-SQL.
If they are not the correct pre_test_results, I was wondering if you could release the pre_test_results you originally generated, similar to the results/graphix_result.txt for Spider dev?
I'm excited to see what you and your team do next!
Line 112 in d78ce34
Need to add data_type = args.data_type
in the previous line to avoid the following error:
Traceback (most recent call last):
File "data_preprocess.py", line 112, in <module>
if data_type == "spider":
NameError: name 'data_type' is not defined
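A runnable sketch of the suggested one-line fix (argument parsing reconstructed for illustration only; the real script reads its flags from the command line):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--data_type", default="spider")
args = parser.parse_args([])  # empty argv so this sketch runs standalone

# The reported fix: bind data_type from args before its first use;
# without this line, the check below raises NameError.
data_type = args.data_type
if data_type == "spider":
    print("preprocessing Spider data")
```

In other words, the local name simply needs to be assigned from the parsed arguments before the `if data_type == "spider":` branch.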
Hi!
Thanks for the great work! I followed the instructions in the readme but had no luck running python data_preprocess.py.
It gets stuck at "test section linking: " at 0%. Any clues on what the possible reason could be?
Thanks a lot!
I evaluated spider-dev using the official test-suite-sql-eval scripts with the GPT-4 results in your repo, but the execution accuracy is different.
The paper reports: 83.5%
But I get: 76.2%
run cmd:
python3 evaluation.py --gold ./my_test/gold_sqls.txt --pred ./my_test/DAIL-SQL+GPT-4+self-consistency.txt --db ./database/database/ --etype exec
easy medium hard extra all
count 248 446 174 166 1034
===================== EXECUTION ACCURACY =====================
execution 0.903 0.818 0.661 0.506 0.762