microsoft / contextualsp

Open-source code for multiple papers from the Microsoft Research Asia DKI group

License: MIT

Languages: Python 87.39%, Jupyter Notebook 9.41%, Shell 1.56%, Jsonnet 1.35%, Batchfile 0.22%, Makefile 0.05%, Dockerfile 0.02%
Topics: semantic-parsing, compositional-generalization, conversational-semantic-parsing, utterance-rewriting, microsoft-research-asia, text-to-sql

contextualsp's Introduction

📫 Paper Code Collection (MSRA DKI Group)

License: MIT

This repo hosts the open-source code for multiple papers from the Microsoft Research Asia DKI Group. You can find the corresponding code below:

News

Code Release (Click Title to Locate the Code)

Reasoning

Reasoning Like Program Executors Xinyu Pi*, Qian Liu*, Bei Chen, Morteza Ziyadi, Zeqi Lin, Qiang Fu, Yan Gao, Jian-Guang Lou, Weizhu Chen, EMNLP 2022.

LEMON: Language-Based Environment Manipulation via Execution-guided Pre-training Qi Shi, Qian Liu, Bei Chen, Yu Zhang, Ting Liu, Jian-Guang Lou, EMNLP 2022 Findings.

LogiGAN: Learning Logical Reasoning via Adversarial Pre-training Xinyu Pi*, Wanjun Zhong*, Yan Gao, Nan Duan, Jian-Guang Lou, NeurIPS 2022.

Text-to-SQL

MultiSpider: Towards Benchmarking Multilingual Text-to-SQL Semantic Parsing Longxu Dou, Yan Gao, Mingyang Pan, Dingzirui Wang, Wanxiang Che, Dechen Zhan, Jian-Guang Lou, AAAI 2023.

Towards Knowledge-Intensive Text-to-SQL Semantic Parsing with Formulaic Knowledge Longxu Dou, Yan Gao, Xuqi Liu, Mingyang Pan, Dingzirui Wang, Wanxiang Che, Min-Yen Kan, Dechen Zhan, Jian-Guang Lou, EMNLP 2022.

UniSAr: A Unified Structure-Aware Autoregressive Language Model for Text-to-SQL Longxu Dou, Yan Gao, Mingyang Pan, Dingzirui Wang, Wanxiang Che, Dechen Zhan, Jian-Guang Lou, arXiv 2022.

Towards Robustness of Text-to-SQL Models Against Natural and Realistic Adversarial Table Perturbation Xinyu Pi*, Bing Wang*, Yan Gao, Jiaqi Guo, Zhoujun Li, Jian-Guang Lou, ACL 2022.

Awakening Latent Grounding from Pretrained Language Models for Semantic Parsing Qian Liu*, Dejian Yang*, Jiahui Zhang*, Jiaqi Guo, Bin Zhou, Jian-Guang Lou, ACL 2021 Findings.

Compositional Generalization

Learning Algebraic Recombination for Compositional Generalization Chenyao Liu*, Shengnan An*, Zeqi Lin, Qian Liu, Bei Chen, Jian-Guang Lou, Lijie Wen, Nanning Zheng, Dongmei Zhang, ACL 2021 Findings.

Hierarchical Poset Decoding for Compositional Generalization in Language Yinuo Guo, Zeqi Lin, Jian-Guang Lou, Dongmei Zhang, NeurIPS 2020.

Compositional Generalization by Learning Analytical Expressions Qian Liu*, Shengnan An*, Jian-Guang Lou, Bei Chen, Zeqi Lin, Yan Gao, Bin Zhou, Nanning Zheng, Dongmei Zhang, NeurIPS 2020.

Conversation

"What Do You Mean by That?" A Parser-Independent Interactive Approach for Enhancing Text-to-SQL Yuntao Li, Bei Chen, Qian Liu, Yan Gao, Jian-Guang Lou, Yan Zhang, Dongmei Zhang, EMNLP 2020

Incomplete Utterance Rewriting as Semantic Segmentation Qian Liu, Bei Chen, Jian-Guang Lou, Bin Zhou, Dongmei Zhang, EMNLP 2020

How Far are We from Effective Context Modeling ? An Exploratory Study on Semantic Parsing in Context Qian Liu, Bei Chen, Jiaqi Guo, Jian-Guang Lou, Bin Zhou, Dongmei Zhang, IJCAI 2020

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Question

If you have any questions or find any bugs, please open an issue. Issues also serve as an acceptable discussion forum.

If you want to contact the author, please email: qian DOT liu AT buaa.edu.cn.

contextualsp's People

Contributors

an1006634493, bellabei, gaoyancheerup, longxudou, microsoft-github-operations[bot], microsoftopensource, qshi95, siviltaram, zhongwanjun


contextualsp's Issues

nlp function doesn't work

Hi,
At corpus_construction/mlm_corpus/corpus_construction.py, line 46, a function named "nlp" is used to filter out instances where the "so" indicator does not signal logical reasoning (e.g., "so happy"), but it is never defined beforehand, which raises an error when producing results.
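A possible fix, as a sketch under assumptions (the original pipeline is not shown, so the spaCy model choice and the heuristic below are hypothetical): "nlp" follows the common spaCy convention of naming the loaded language pipeline, which can then be used to check whether "so" behaves as an intensifier or as a conclusion marker.

import spacy

# Assumption: `nlp` was meant to be a spaCy pipeline; the small English
# model must be installed (python -m spacy download en_core_web_sm).
nlp = spacy.load("en_core_web_sm")

def so_indicates_conclusion(sentence: str) -> bool:
    # Hypothetical filter: intensifier "so" (as in "so happy") modifies an
    # adjective or adverb, while conclusion "so" attaches elsewhere.
    doc = nlp(sentence)
    for token in doc:
        if token.text.lower() == "so" and token.head.pos_ not in ("ADJ", "ADV"):
            return True
    return False

print(so_indicates_conclusion("I was so happy."))        # False
print(so_indicates_conclusion("It rained, so we left.")) # True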

Issue while preprocessing CANARD data using download.sh

I am getting this error while running the download.sh script:

--2021-12-19 22:06:48--  https://obj.umiacs.umd.edu/elgohary/CANARD_Release.zip
Resolving obj.umiacs.umd.edu (obj.umiacs.umd.edu)... 128.8.122.11
Connecting to obj.umiacs.umd.edu (obj.umiacs.umd.edu)|128.8.122.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3258983 (3.1M) [application/zip]
Saving to: ‘CANARD_Release.zip’

CANARD_Release.zip                           100%[============================================================================================>]   3.11M   458KB/s    in 7.0s

2021-12-19 22:06:57 (458 KB/s) - ‘CANARD_Release.zip’ saved [3258983/3258983]

Archive:  CANARD_Release.zip
replace ._CANARD_Release? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: ._CANARD_Release
  inflating: multiple_refs.json
replace ._multiple_refs.json? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: ._multiple_refs.json
  inflating: test.json
replace ._test.json? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: ._test.json
  inflating: dev.json
replace ._dev.json? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: ._dev.json
  inflating: train.json
replace ._train.json? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: ._train.json
  inflating: readme.txt
replace ._readme.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: ._readme.txt
Traceback (most recent call last):
  File "../../preprocess.py", line 197, in <module>
    unified_dataset_format("Multi")
  File "../../preprocess.py", line 85, in unified_dataset_format
    src_f = open(src_file, "r", encoding="utf8")
FileNotFoundError: [Errno 2] No such file or directory: 'train.sr'

The requirements are not described in LEMON

I would like to reproduce the results from the LEMON paper. Since I have to pre-train and fine-tune a model using BART, but I don't know which versions of Python, fairseq, etc. were used in this work.

Can anyone help?

Questions about RUN model

Dear Dr. Liu Qian,

I appreciate your work; combining CV's semantic segmentation with NLP is a fantastic idea. I have run the code and have some questions that I hope you can help with. Thank you very much.
Below are my questions:

  1. When using the similarity function to build the feature map, the pixel values for "ellipsis" and "coreference" will be close; how can semantic segmentation distinguish them during prediction?
  2. Related to the first question: since the similarity function builds the feature map, the pixel values of replicas in the context utterance will also be close, which means they are easily predicted as the same class, resulting in replicated operations in the output. This could be due to the invariance inherent in CNNs.
  3. Do you have any other tricks? I still cannot reproduce your results after retraining with "train_multi.sh" several times.

Best Regards,
Yong

Error while training using turn.none.jsonnet

Hi,

I am trying to run the code with turn.none.jsonnet and am getting the following error:

Traceback (most recent call last):
  File "./dataset_reader/sparc_reader.py", line 143, in build_instance
    sql_query_list=sql_query_list
  File "./dataset_reader/sparc_reader.py", line 417, in text_to_instance
    action_non_terminal, action_seq, all_valid_actions = world.get_action_sequence_and_all_actions()
  File "./context/world.py", line 155, in get_action_sequence_and_all_actions
    action_sequence = self.sql_converter.translate_to_intermediate(self.sql_clause)
  File "./context/converter.py", line 87, in translate_to_intermediate
    return self._process_statement(sql_clause=sql_clause)
  File "./context/converter.py", line 117, in _process_statement
    inter_seq.extend(self._process_root(sql_clause))
  File "./context/converter.py", line 657, in _process_root
    step_inter_seq = _process_step(cur_state)
  File "./context/converter.py", line 511, in _process_step
    return call_back_mapping[step_state](sql_clause)
  File "./context/converter.py", line 262, in _process_join
    if self.col_names[col_ind].refer_table.name == join_tab_name:
KeyError: 'C'

After this, the code fails with:

site-packages/allennlp/data/vocabulary.py", line 399, in from_instances
    instance.count_vocab_items(namespace_token_counts)
AttributeError: 'NoneType' object has no attribute 'count_vocab_items'

I have downloaded the GloVe embeddings into the glove folder, and the dataset is in dataset_sparc along with the code. Do you have any suggestions as to what might be the issue?

Thanks

Semantic parsing in context predict sql

Hello everyone. In the Semantic Parsing in Context repository, predicted SQL queries with WHERE clauses are never correct.
Example: "what is the abbreviation for Jetblue?"
The predicted query is "SELECT airlines.abbreviation FROM airlines WHERE airlines.airline = 1".
As you can see, the value associated with WHERE is 1 instead of Jetblue; it is the same for all queries with WHERE.
Is there a way to resolve this?
Thanks in advance
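For context, parsers trained on Spider-style data usually ignore literal values, since the official evaluation compares SQL structure only, which is why a placeholder such as 1 appears in every WHERE clause. A crude post-processing sketch (hypothetical; fill_where_value is not part of this repository) that copies a value back from the question by string matching:

import re

def fill_where_value(sql: str, question: str) -> str:
    # Crude heuristic: grab a capitalized token from the question and use it
    # to replace the placeholder value; real systems instead match the
    # question against the database content.
    candidates = re.findall(r"[A-Z][a-zA-Z]+", question)
    if candidates and "= 1" in sql:
        return sql.replace("= 1", f"= '{candidates[-1]}'", 1)
    return sql

sql = "SELECT airlines.abbreviation FROM airlines WHERE airlines.airline = 1"
print(fill_where_value(sql, "what is the abbreviation for Jetblue?"))
# SELECT airlines.abbreviation FROM airlines WHERE airlines.airline = 'Jetblue'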

custom dataset creation for unisar

Hello @SivilTaram,

I have two tables that can be linked via primary and foreign keys, and I would like to use UniSAr on them. Could you please share steps or hints on how I can create a custom dataset for UniSAr? I appreciate your help :)

MultiSpider repo

Hi! I tried to access the MultiSpider link from the README file, but it does not exist. Where can I find the code of the model?

Reproducing LEMON

Hello,

Firstly, thanks for the great work! After reading "LEMON: Language-Based Environment Manipulation via Execution-Guided Pre-training", I wanted to reproduce the results on ProPara; however, I obtained a very low accuracy score.

Python version: 3.9.0
Fairseq version: 0.12.2

Here are the steps I followed:

1 - Cloning the repository.
2 - Downloading the data/BART models.
3 - Preprocessing propara for both the pretraining and finetuning.
4 - Pretraining with BART-large.
5 - Finetuning the pretrained BART-large model.

For preprocessing, I used the preprocess_pretrain.sh and preprocess_finetune.sh files. For pretraining and finetuning, I used the pretrain.sh and finetune.sh files without any parameter changes. These steps led to the following performance:

Correct / Total : 19 / 368, Denotation Accuracy : 0.052
path: bart_large_finetuned/checkpoint_best.pt, stage: valid, 1utts: 0.017, 3utts: 0.018, 5utts: 0.0
path: bart_large_finetuned/checkpoint_best.pt, stage: test, 1utts: 0.041, 3utts: 0.068, 5utts: 0.068

I would really appreciate your help for reproducing the results.
Thanks in advance.

about prediction problem

I'm not very familiar with the AllenNLP API. How do you use the prediction code? I wrote the following code, which reports a
"TypeError: is_bidirectional() missing 1 required positional argument: 'self'"
error:

@Predictor.register("rewrite")
class RewritePredictor(Predictor):

    @overrides
    def _json_to_instance(self, json_dict: JsonDict) -> Instance:
        """
        Expects JSON that looks like `{"source": "..."}`.
        """
        context = json_dict["context"]
        current = json_dict["current"]
        # placeholder
        # restate_utt = "hi"
        restate_utt = json_dict["restate_utt"]
        return self._dataset_reader.text_to_instance(context, current, restate_utt, training=False)


inputs = {
    "context": '浙 江 省 温 州 市 鹿 城 区 有 好 天 气 这 种 天 气 最 适 合 出 门 了 骑 骑 车 兜 兜 风',
    "current": '明 天 天 气 咋 样',
    "restate_utt": 'hi',
}
model = UnifiedFollowUp(
    Vocabulary,
    Seq2SeqEncoder,
    TextFieldEmbedder,
)
dataset_reader = RewriteDatasetReader()

pred_fun = RewritePredictor(model=model, dataset_reader=dataset_reader)
result = pred_fun._json_to_instance(inputs)
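The TypeError arises because UnifiedFollowUp is constructed with the Vocabulary, Seq2SeqEncoder, and TextFieldEmbedder classes themselves rather than instances, so methods like is_bidirectional() are called on a class with no self bound. A minimal sketch of the usual AllenNLP route, assuming a trained archive such as the ../pretrained_weights/multi_bert.tar.gz mentioned elsewhere on this page (the predictor name "rewrite" matches the registration above):

from allennlp.models.archival import load_archive
from allennlp.predictors import Predictor

# Load the trained model together with its config and vocabulary from the
# archive, then build the registered "rewrite" predictor on top of it.
archive = load_archive("../pretrained_weights/multi_bert.tar.gz")
predictor = Predictor.from_archive(archive, "rewrite")

result = predictor.predict_json({
    "context": "浙 江 省 温 州 市 鹿 城 区 有 好 天 气",  # dialogue history
    "current": "明 天 天 气 咋 样",                      # follow-up utterance
    "restate_utt": "hi",                                # placeholder at inference time
})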

semantic_parsing_in_context cuda out of memory

Hi there.
I ran the training via Colab (I have 16 GB of GPU memory).
I'm using concat.none.jsonnet for BERT and getting a "CUDA out of memory" error at 54% of epoch 0.
I would like to know how much memory is needed to launch BERT-based training, or whether there is a way to do it with the 16 GB.
Thanks

ETA with downstream (awakening_latent_grounding)

Hello, dear researchers. I would like to ask whether the code that couples ETA to downstream text-to-SQL parsers can be open-sourced.

In the paper, SLSQL (Lei et al., 2020) is used for the downstream task. However, I found that the shape of the matrix generated by the grounding module in ETA differs from that of the schema-linking module in SLSQL.

I would like to know how to apply the matrix generated by the ETA grounding module to SLSQL.

It would also be great if the code that couples ETA to ALIGN could be open-sourced, conditions permitting. Thank you!

Potential performance issue: .apply slow in pandas below 1.5 version

Issue Description:

Hello.
I have discovered a performance degradation in the .apply function of pandas versions below 1.5, and I notice that parts of the repository depend on pandas below 1.5, such as robustness_of_text_to_sql/CTA/requirements.txt. I am not sure whether this performance problem in pandas affects this repository. I found some discussions on the pandas GitHub related to this issue, including #44172 and #45404.
I also found that poset_decoding/traversal_path_prediction/MatchZoo-py/matchzoo/data_pack/data_pack.py uses the affected API. There may be more files using the affected API and more parts depending on pandas below 1.5.

Suggestion

I would recommend upgrading to pandas >= 1.5 or exploring other ways to optimize the performance of .apply.
Any other workarounds or solutions would be greatly appreciated.
Thank you!
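As a generic illustration (the DataFrame and columns below are hypothetical, not taken from the repository), row-wise .apply can often be replaced with vectorized column operations, which sidesteps the regression regardless of the pandas version:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.random.rand(100_000), "b": np.random.rand(100_000)})

# Row-wise .apply calls a Python function once per row; this is the slow
# path affected by the pandas issues referenced above.
slow = df.apply(lambda row: row["a"] + row["b"] ** 2, axis=1)

# Vectorized equivalent: one pass over whole columns in C, typically
# orders of magnitude faster.
fast = df["a"] + df["b"] ** 2

assert np.allclose(slow, fast)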

LogiGAN: dataset creation

Hi! Thank you for sharing the code for the LogiGAN paper.
I'm having trouble creating the training set. In particular:

  1. Here the code refers to a non-existent script. I have replaced the commands with "python corpus_construction.py --start 0 --end 500 --indicator_type conclusion &"; is this the right way to do it?
  2. elastic_search/build_gen_train/ver_train refer to files that do not exist in the BookCorpus, and there are no instructions on how to create them. Is there a script/link to generate the gan_corpus_new/beta/gen_train_B.jsonl and gan_corpus_new/beta/ver_train.jsonl files?

A question about `python predict.py`

Hello!
Thank you to the authors for open-sourcing this!
After executing python install -r requirement.txt, I ran
cd src && python predict.py and got the following error.
It seems the model "../pretrained_weights/multi_bert.tar.gz" was not loaded.

I'm really looking forward to your reply!
Happy May Day holiday!

Model name 'bert-base-chinese' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese). We assumed 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese.tar.gz' was a path or url but couldn't find any file associated to this path or url.
Traceback (most recent call last):
  File "predict.py", line 28, in <module>
    manager = PredictManager("../pretrained_weights/multi_bert.tar.gz")
  File "predict.py", line 12, in __init__
    archive = load_archive(archive_file)
  File "/home/cingti/anaconda3/envs/cpu-py36/lib/python3.6/site-packages/allennlp/models/archival.py", line 230, in load_archive
    cuda_device=cuda_device)
  File "/home/cingti/anaconda3/envs/cpu-py36/lib/python3.6/site-packages/allennlp/models/model.py", line 327, in load
    return cls.by_name(model_type)._load(config, serialization_dir, weights_file, cuda_device)
  File "/home/cingti/anaconda3/envs/cpu-py36/lib/python3.6/site-packages/allennlp/models/model.py", line 265, in _load
    model = Model.from_params(vocab=vocab, params=model_params)
  File "/home/cingti/anaconda3/envs/cpu-py36/lib/python3.6/site-packages/allennlp/common/from_params.py", line 365, in from_params
    return subclass.from_params(params=params, **extras)
  File "/home/cingti/anaconda3/envs/cpu-py36/lib/python3.6/site-packages/allennlp/common/from_params.py", line 386, in from_params
    kwargs = create_kwargs(cls, params, **extras)
  File "/home/cingti/anaconda3/envs/cpu-py36/lib/python3.6/site-packages/allennlp/common/from_params.py", line 133, in create_kwargs
    kwargs[name] = construct_arg(cls, name, annotation, param.default, params, **extras)
  File "/home/cingti/anaconda3/envs/cpu-py36/lib/python3.6/site-packages/allennlp/common/from_params.py", line 229, in construct_arg
    return annotation.from_params(params=subparams, **subextras)
  File "/home/cingti/anaconda3/envs/cpu-py36/lib/python3.6/site-packages/allennlp/common/from_params.py", line 365, in from_params
    return subclass.from_params(params=params, **extras)
  File "/home/cingti/anaconda3/envs/cpu-py36/lib/python3.6/site-packages/allennlp/modules/text_field_embedders/basic_text_field_embedder.py", line 168, in from_params
    token_embedders[key] = TokenEmbedder.from_params(vocab=vocab, params=embedder_params)
  File "/home/cingti/anaconda3/envs/cpu-py36/lib/python3.6/site-packages/allennlp/common/from_params.py", line 365, in from_params
    return subclass.from_params(params=params, **extras)
  File "/home/cingti/anaconda3/envs/cpu-py36/lib/python3.6/site-packages/allennlp/common/from_params.py", line 388, in from_params
    return cls(**kwargs)  # type: ignore
  File "/home/cingti/anaconda3/envs/cpu-py36/lib/python3.6/site-packages/allennlp/modules/token_embedders/bert_token_embedder.py", line 272, in __init__
    for param in model.parameters():
AttributeError: 'NoneType' object has no attribute 'parameters'
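From the error message above, the archive itself is found, but the underlying bert-base-chinese weights cannot be downloaded from S3, so the BERT token embedder receives None and crashes on model.parameters(). A possible workaround, as a sketch under assumptions (the exact override key depends on the archived config, which is not shown here): download the weights manually and point the config at the local copy.

# Hypothetical workaround: first download the weights manually, e.g.
#   wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese.tar.gz
# then override the embedder's model path when loading the archive.
from allennlp.models.archival import load_archive

archive = load_archive(
    "../pretrained_weights/multi_bert.tar.gz",
    # NOTE: this override key is an assumption about the archived config.
    overrides='{"model.text_field_embedder.token_embedders.bert.pretrained_model": "/local/path/bert-base-chinese.tar.gz"}',
)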

MCD2 and MCD3 specific data processing?

Hi authors, @SivilTaram

I see there is some specialized logic for processing the CFQ dataset for the MCD2 and MCD3 splits. We are confused about why this special path is present. Why did you add this special logic? And what would the behavior be if you preprocessed MCD2 and MCD3 with the MCD1 preprocessing code path?

if query.startswith("Did M") or query.startswith("Was M") or query.startswith("Were M") or query.startswith("Was a"):
    if type in ['mcd2', 'mcd3']:
        nl_pattern = query.split()[0] + " " + query.split()[1]
        terms.append((nl_pattern, [f'?x0#is#{query.split()[1]}'], (0, 1)))
    else:
        nl_pattern = query.split()[0] + " M"
        terms.append((nl_pattern, ['?x0#is#M'], (0, 1)))

if candidate_term.count("M") == 1:
    if candidate_term.startswith("?x0 is M") and split in ['mcd2', 'mcd3']:
        candidate_triplets[candidate_skeleton] += [candidate_term]
    else:
        candidate_triplets[candidate_skeleton] += [''.join(candidate_term.replace("M", entity[0][0])) for entity in entities]

Thanks,
Paras

MultiSpider dataset availability

Hello,

I just finished reading your work on "MultiSpider: Towards Benchmarking Multilingual Text-to-SQL Semantic Parsing". Is the data available somewhere? Unfortunately I can't find it.

Thank you!
