
SmBoP: Semi-autoregressive Bottom-up Semantic Parsing

The authors' implementation of this NAACL 2021 paper.

Install & Configure

  1. Install PyTorch 1.8.1 matching your CUDA version (see the example command after this list).

  2. Install the rest of the required packages:

    pip install -r requirements.txt
    
  3. Run this command to download the NLTK punkt and stopwords data:

    python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"
    
  4. Download the Spider dataset with the following command:

    bash scripts/download_spider.sh 
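
For step 1, the exact command depends on your CUDA version. For example, for CUDA 11.1 it would be something along these lines (check the PyTorch site for the wheel matching your setup):

    pip install torch==1.8.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html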
    

Training the parser

Use the following command to train:

python exec.py 

The first load of the dataset may take a while (a few hours), since the model loads values from the tables and computes similarity features against the relevant question. The result is then cached for subsequent runs. Use the disable_db_content argument to reduce the pre-processing time, at the cost of not performing IR on some (extremely large) tables.
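For example, assuming the flag follows the same --disable_* pattern as the other exec.py switches (such as --disable_value_pred mentioned further down this page):

python exec.py --disable_db_content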

Evaluation

To create predictions run the following command:

python eval.py --archive_path {model_path} --output preds.sql

To run the evaluation with the official Spider script:

python smbop/eval_final/evaluation.py --gold dataset/dev_gold.sql --pred preds.sql --etype all --db  dataset/database  --table dataset/tables.json

Pretrained model

You can download a pretrained model from here. It achieves the following results with the official script:

                     easy                 medium               hard                 extra                all                 
count                248                  446                  174                  166                  1034                
=====================   EXECUTION ACCURACY     =====================
execution            0.883                0.791                0.684                0.530                0.753             

====================== EXACT MATCHING ACCURACY =====================
exact match          0.883                0.791                0.655                0.512                0.746

Demo

You can run SmBoP on a Google Colab notebook here.

Docker

You can also run the demo with Docker:

docker build -t smbop .
docker run -it --gpus=all smbop:latest

This will open an inference terminal similar to the Google Colab demo. For example, you can run:

>>>inference("Which films cost more than 50 dollars or less than 10?","cinema")
SELECT film.title FROM schedule JOIN film ON schedule.film_id = film.film_id WHERE schedule.price > 50 OR schedule.price<10

smbop's People

Contributors

hxtreme, ohadrubin, pedroestevespt, spectraldoy


smbop's Issues

Why add “added_values”?

Hi, I see in spider.py:

    added_values = [
        "1",
        "2",
        "3",
        "4",
        "5",
        "yes",
        "no",
        "y",
        "t",
        "f",
        "m",
        "n",
        "null",
    ]

I haven't quite figured out why these values need to be added to the entities.

Documentation of smbop module for better understanding

Hi @OhadRubin,

Congratulations on this great work and thank you for open-sourcing the code.

It would be very helpful if you could also provide some documentation for the code inside smbop module.
I'm finding it a bit difficult to match the parts of your code to the ideas in Section 3 of your paper.
A brief documentation would help significantly in gaining a better understanding.

Thank you in Advance :)

[Possible Bug?] Should is_levelorder_list always have at least one element equal to 1?

Hi @OhadRubin ,

Regarding is_levelorder_list defined at the following line:

is_levelorder_list = vec_utils.isin(

For debugging, I was feeding the validation dataset while training the SmBOP model.
I see that is_levelorder_list contains all zero elements for some examples at some decoding steps.
Is this expected?
As per my understanding, is_levelorder_list should have at least 1 non-zero element for each example, at each decoding step.

How can I fine-tune the model?

I am trying to fine-tune SmBoP on my own small dataset. I added my sqlite file to the database directory, updated the tables file, and created a config.json like this:

{
    "dataset_reader": {
        "type": "smbop",
        "dataset_path": "dataset/database",
        "keep_if_unparsable": false,
        "lazy": false,
        "limit_instances": -1,
        "max_instances": 1000000,
        "question_token_indexers": {
            "tokens": {
                "type": "pretrained_transformer",
                "model_name": "Salesforce/grappa_large_jnt"
            }
        },
        "tables_file": "dataset/output_new_table.json",
        "value_pred": true
    },
     "validation_dataset_reader": {
        "type": "smbop",
        "dataset_path": "dataset/database",
        "keep_if_unparsable": true,
        "lazy": false,
        "limit_instances": -1,
        "max_instances": 1000000,
        "question_token_indexers": {
            "tokens": {
                "type": "pretrained_transformer",
                "model_name": "Salesforce/grappa_large_jnt"
            }
        },
        "tables_file": "dataset/output_new_table.json",
        "value_pred": true
    },
    "train_data_path": "dataset/fine_train.json",
    "validation_data_path": "dataset/fine_test.json",
    "model": {
        "type": "from_archive",
        "archive_file": "model.tar.gz"
    },
    "data_loader": {
        "batch_sampler": {
            "type": "bucket",
            "batch_size": 8,
            "sorting_keys": [
                "enc",
                "depth"
            ]
        }
    },
    "validation_data_loader": {
        "batch_size": 8,
        "shuffle": true
    },
    "trainer": {
        "num_epochs": 10,
        "optimizer": {
            "type": "adam",
            "lr": 1.86e-06,
            "parameter_groups": [
                [
                    [
                        "question_embedder"
                    ],
                    {
                        "lr": 3e-08
                    }
                ]
            ]
        },
        "learning_rate_scheduler": {
            "type": "polynomial_decay",
            "power": 0.5,
            "warmup_steps": 1
        },
        "num_gradient_accumulation_steps": 4,
        "grad_norm": null,
        "patience": 100,
        "checkpointer": {
            "num_serialized_models_to_keep": 1
        },
        "grad_clipping": null,
        "validation_metric": "+spider",
        "use_amp": true,
        "cuda_device": 0
    }
}

and then ran resume.py with this config file, but it doesn't look like the model is being trained. What am I doing wrong?

RecursionError: maximum recursion depth exceeded

When I run exec.py, I get the error below:

Traceback (most recent call last):
File "exec.py", line 146, in <module>
run()
File "exec.py", line 137, in run
train_model(
File "/home/puritysarah/anaconda3/envs/py38/lib/python3.8/site-packages/allennlp/commands/train.py", line 236, in train_model
model = _train_worker(
File "/home/puritysarah/anaconda3/envs/py38/lib/python3.8/site-packages/allennlp/commands/train.py", line 453, in _train_worker
train_loop = TrainModel.from_params(
File "/home/puritysarah/anaconda3/envs/py38/lib/python3.8/site-packages/allennlp/common/from_params.py", line 589, in from_params
return retyped_subclass.from_params(
File "/home/puritysarah/anaconda3/envs/py38/lib/python3.8/site-packages/allennlp/common/from_params.py", line 623, in from_params
return constructor_to_call(**kwargs) # type: ignore
File "/home/puritysarah/anaconda3/envs/py38/lib/python3.8/site-packages/allennlp/commands/train.py", line 658, in from_partial_objects
"train": data_loader.construct(reader=dataset_reader, data_path=train_data_path)
File "/home/puritysarah/anaconda3/envs/py38/lib/python3.8/site-packages/allennlp/common/lazy.py", line 80, in construct
return self.constructor(**contructor_kwargs)
File "/home/puritysarah/anaconda3/envs/py38/lib/python3.8/site-packages/allennlp/common/lazy.py", line 64, in constructor_to_use
return self.constructor.from_params( # type: ignore[union-attr]
File "/home/puritysarah/anaconda3/envs/py38/lib/python3.8/site-packages/allennlp/common/from_params.py", line 589, in from_params
return retyped_subclass.from_params(
File "/home/puritysarah/anaconda3/envs/py38/lib/python3.8/site-packages/allennlp/common/from_params.py", line 623, in from_params
return constructor_to_call(**kwargs) # type: ignore
File "/home/puritysarah/anaconda3/envs/py38/lib/python3.8/site-packages/allennlp/data/data_loaders/multiprocess_data_loader.py", line 281, in __init__
deque(self.iter_instances(), maxlen=0)
File "/home/puritysarah/anaconda3/envs/py38/lib/python3.8/site-packages/allennlp/data/data_loaders/multiprocess_data_loader.py", line 349, in iter_instances
for instance in Tqdm.tqdm(
File "/home/puritysarah/.local/lib/python3.8/site-packages/tqdm/std.py", line 1180, in __iter__
for obj in iterable:
File "/home/puritysarah/anaconda3/envs/py38/lib/python3.8/site-packages/allennlp/data/dataset_readers/dataset_reader.py", line 192, in read
for instance in self._multi_worker_islice(self._read(file_path)): # type: ignore
File "/home/puritysarah/SmBop/smbop/dataset_readers/spider.py", line 180, in _read
yield from self._read_examples_file(file_path)
File "/home/puritysarah/SmBop/smbop/dataset_readers/spider.py", line 214, in _read_examples_file
ins = self.create_instance(ex)
File "/home/puritysarah/SmBop/smbop/dataset_readers/spider.py", line 244, in create_instance
ins = self.text_to_instance(
File "/home/puritysarah/SmBop/smbop/dataset_readers/spider.py", line 265, in text_to_instance
tree_dict = msp.parse(sql)
...
File "/home/puritysarah/.local/lib/python3.8/site-packages/pyparsing/core.py", line 907, in _parseCache
value = self._parseNoCache(instring, loc, doActions, callPreParse)
File "/home/puritysarah/.local/lib/python3.8/site-packages/pyparsing/core.py", line 801, in _parseNoCache
pre_loc = self.preParse(instring, loc)
File "/home/puritysarah/.local/lib/python3.8/site-packages/pyparsing/core.py", line 751, in preParse
loc = self._skipIgnorables(instring, loc)
File "/home/puritysarah/.local/lib/python3.8/site-packages/pyparsing/core.py", line 743, in _skipIgnorables
loc, dummy = e._parse(instring, loc)
File "/home/puritysarah/.local/lib/python3.8/site-packages/pyparsing/core.py", line 903, in _parseCache
value = cache.get(lookup)
File "/home/puritysarah/.local/lib/python3.8/site-packages/pyparsing/util.py", line 109, in get
return cache_get(key, not_in_cache)
RecursionError: maximum recursion depth exceeded

I do not know where this comes from...
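
One thing worth trying (an assumption, not a confirmed fix for this repository) is raising Python's recursion limit before the dataset is parsed, since deep pyparsing grammars can exceed the default limit of roughly 1000 frames:

import sys

# Raise the interpreter's recursion limit before dataset reading starts,
# e.g. near the top of exec.py. The exact value needed may vary.
sys.setrecursionlimit(10000)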

Question about the Representation of Query Trees

Hi,
I'd like to confirm 3 points of confusion:

  1. The join operation is absent from the relational algebra grammar; could that be the cause of the search errors (52%) you mention in the paper?
  2. What is the meaning of the unary ops' literal?
  3. I saw that you decreased self._num_values from 15 to 10. Is 10 enough, given that the words in a question that relate to the schema are usually fewer than 10?

Looking forward to your reply

How is reranking applied in the code?

Hi,

I could not find the code related to the reranking part described in the paper. Several config options, such as "cntx_reranker" and "should_rerank", do not appear to be used. Could you please give me some hints on that? Many thanks!

How can we add new samples to the Spider dataset to fine-tune SmBoP?

none to the best of my knowledge.

Hi @hXtreme, @OhadRubin , I managed to make the custom data set, but I'm facing some problems:

  1. I'm unable to understand the sql: {} field in train_spider.json. Can you please explain how it is used in SmBoP, what role it plays in the Spider dataset, and whether SmBoP needs it?

  2. There is a script in the Spider repository, preprocess/parse_one_sql.py, that generates this parsed SQL query. It works perfectly for some queries, but when I tried the query SELECT DATE_FORMAT(date_complaint_raised,'%Y-%m'), count(*) FROM Complaints WHERE YEAR(date_complaint_raised) = value GROUP BY DATE_FORMAT(date_complaint_raised,'%Y-%m'), it gives a date-format assertion error. Going through the code, I found that the script is hard-coded and does not support functions. What can we do to add function support?
    Thanks
    (error screenshot attached)

PS: The strftime function in sqlite is also not working.
Originally posted by @alan-ai-learner in #39 (comment)

The problem of reproducibility.

Hi,
Thank you for your excellent paper and open-source code.
However, when I re-trained directly with the default configuration, I could not reproduce the reported performance.

Here is my result:

                     easy                 medium               hard                 extra                all
count                248                  446                  174                  166                  1034
=====================   EXECUTION ACCURACY     =====================
execution            0.867                0.776                0.644                0.524                0.735

====================== EXACT MATCHING ACCURACY =====================
exact match          0.879                0.787                0.644                0.500                0.739

What did I miss? I've run it three times and get nearly the same result each time.

Issue in running Colab

I am getting the following error:

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-071043e09cdd> in <cell line: 10>()
      8 from allennlp.models.archival import Archive, load_archive, archive_model
      9 from allennlp.data.vocabulary import Vocabulary
---> 10 from smbop.modules.relation_transformer import *
     11 from allennlp.common import Params
     12 from smbop.models.smbop import SmbopParser

6 frames
/content/SmBop/smbop/modules/relation_transformer.py in <module>
      6 
      7 @Seq2SeqEncoder.register("relation_transformer")
----> 8 class RelationTransformer(Seq2SeqEncoder):
      9     def __init__(
     10         self,

/content/SmBop/smbop/modules/relation_transformer.py in RelationTransformer()
     55 
     56     @overrides
---> 57     def is_bidirectional(self):
     58         return False
     59 

/usr/local/lib/python3.10/dist-packages/overrides/overrides.py in overrides(method, check_signature, check_at_runtime)
     81     """
     82     if method is not None:
---> 83         return _overrides(method, check_signature, check_at_runtime)
     84     else:
     85         return functools.partial(

/usr/local/lib/python3.10/dist-packages/overrides/overrides.py in _overrides(method, check_signature, check_at_runtime)
    168                 return wrapper  # type: ignore
    169             else:
--> 170                 _validate_method(method, super_class, check_signature)
    171                 return method
    172     raise TypeError(f"{method.__qualname__}: No super class method found")

/usr/local/lib/python3.10/dist-packages/overrides/overrides.py in _validate_method(method, super_class, check_signature)
    187         and not isinstance(super_method, property)
    188     ):
--> 189         ensure_signature_is_compatible(super_method, method, is_static)
    190 
    191 

/usr/local/lib/python3.10/dist-packages/overrides/signature.py in ensure_signature_is_compatible(super_callable, sub_callable, is_static)
    100 
    101     if super_type_hints is not None and sub_type_hints is not None:
--> 102         ensure_return_type_compatibility(super_type_hints, sub_type_hints, method_name)
    103         ensure_all_kwargs_defined_in_sub(
    104             super_sig, sub_sig, super_type_hints, sub_type_hints, is_static, method_name

/usr/local/lib/python3.10/dist-packages/overrides/signature.py in ensure_return_type_compatibility(super_type_hints, sub_type_hints, method_name)
    300     sub_return = sub_type_hints.get("return", None)
    301     if not _issubtype(sub_return, super_return) and super_return is not None:
--> 302         raise TypeError(
    303             f"{method_name}: return type `{sub_return}` is not a `{super_return}`."
    304         )

TypeError: RelationTransformer.is_bidirectional: return type `None` is not a `<class 'bool'>`.

I had to remove the version pins from the earlier block's pip installs, as it would not work otherwise.
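
Judging from the traceback, newer releases of the overrides package enforce that an overriding method's return annotation matches the base class. A minimal local workaround (a sketch, not an official patch) is to annotate the return type of the method flagged in smbop/modules/relation_transformer.py:

@overrides
def is_bidirectional(self) -> bool:  # annotating the return type satisfies the signature check
    return False

Pinning overrides to an older version in the pip install cell may also avoid the check, though the exact compatible version is not confirmed here.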

Executing exec.py

I am using a GPU to follow along with the training process of the model. I have cloned the GitHub repo, ran the requirements.txt command, installed NLTK punkt, and downloaded the Spider dataset.
Then, while executing the "training the parser" step as follows:

(SmBop) rajkrishna@deeplearning-02:~/SmBop$ python3 exec.py

I get the following error:

{'name': 'None', 'force': 'false', 'gpu': '0', 'recover': 'false', 'debug': 'false', 'detect_anomoly': 'false', 'profile': 'false', 'is_oracle': 'false', 'tiny_dataset': 'false', 'load_less': 'false', 'cntx_rep': 'false', 'cntx_beam': 'false', 'disentangle_cntx': 'true', 'cntx_reranker': 'true', 'value_pred': 'true', 'use_longdb': 'true', 'uniquify': 'false', 'use_bce': 'false', 'tfixup': 'false', 'train_as_dev': 'false', 'amp': 'true', 'utt_aug': 'true', 'should_rerank': 'false', 'use_treelstm': 'false', 'db_content': 'true', 'lin_after_cntx': 'false', 'optimizer': 'adam', 'rat_layers': '8', 'beam_size': '30', 'base_dim': '32', 'num_heads': '8', 'beam_encoder_num_layers': '1', 'tree_rep_transformer_num_layers': '1', 'dropout': '0.1', 'rat_dropout': '0.2', 'lm_lr': '3e-06', 'lr': '0.000186', 'batch_size': '20', 'grad_acum': '4', 'max_steps': '60000', 'power': '0.5', 'temperature': '1.0', 'grad_clip': '-1', 'grad_norm': '-1'}
experiment_name: jumpy-viridian-guppy
Traceback (most recent call last):
File "exec.py", line 146, in <module>
run()
File "exec.py", line 137, in run
train_model(
File "/home/rajkrishna/.local/lib/python3.8/site-packages/allennlp/commands/train.py", line 226, in train_model
training_util.create_serialization_dir(params, serialization_dir, recover, force)
File "/home/rajkrishna/.local/lib/python3.8/site-packages/allennlp/training/util.py", line 232, in create_serialization_dir
os.makedirs(serialization_dir, exist_ok=True)
File "/usr/lib/python3.8/os.py", line 213, in makedirs
makedirs(head, exist_ok=exist_ok)
File "/usr/lib/python3.8/os.py", line 213, in makedirs
makedirs(head, exist_ok=exist_ok)
File "/usr/lib/python3.8/os.py", line 213, in makedirs
makedirs(head, exist_ok=exist_ok)
File "/usr/lib/python3.8/os.py", line 223, in makedirs
mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/media/disk1'

I am also attaching a screenshot of the error.

Kindly help me resolve the error.

Model Performance

The model works fine for smaller DBs, but when there are multiple levels of foreign-key connections it seems to struggle.
Why is that? Is there a known reason?

setting bug

Traceback (most recent call last):
File "exec.py", line 148, in <module>
run()
File "exec.py", line 111, in run
settings = Params.from_file(
File "/root/miniconda3/envs/myconda/lib/python3.8/site-packages/allennlp/common/params.py", line 488, in from_file
file_dict = json.loads(evaluate_file(params_file, ext_vars=ext_vars))
RuntimeError: RUNTIME ERROR: field does not exist: parseJson
configs/defaults.jsonnet:5:26-39 function
configs/defaults.jsonnet:168:23-47 object
configs/defaults.jsonnet:(161:21)-(169:6) object
configs/defaults.jsonnet:(145:12)-(205:4) object
During manifestation

How do I deal with this bug? I don't know what language the config uses; what is std.parseJson?
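
For context, the config files are written in Jsonnet (hence the .jsonnet extension), and std.parseJson is a function from the Jsonnet standard library. If the installed jsonnet Python package is old enough to lack it, upgrading is one thing to try (an assumption, not a verified fix):

pip install -U jsonnet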

lmdb.Error: cache/exp1000train: No such file or directory

Hi there! Thanks for putting this repo up! After reading the paper I was excited to try and mess around with it. As is, it appears that SmbopSpiderDatasetReader is expecting some sort of cached files?

2021-04-21 20:51:26,983 - CRITICAL - root - Uncaught exception
Traceback (most recent call last):
  File "exec.py", line 140, in <module>
    run()
  File "exec.py", line 131, in run
    train_model(
  File "/Users/maltersj/miniconda/envs/smbop/lib/python3.8/site-packages/allennlp/commands/train.py", line 236, in train_model
    model = _train_worker(
  File "/Users/maltersj/miniconda/envs/smbop/lib/python3.8/site-packages/allennlp/commands/train.py", line 453, in _train_worker
    train_loop = TrainModel.from_params(
  File "/Users/maltersj/miniconda/envs/smbop/lib/python3.8/site-packages/allennlp/common/from_params.py", line 589, in from_params
    return retyped_subclass.from_params(
  File "/Users/maltersj/miniconda/envs/smbop/lib/python3.8/site-packages/allennlp/common/from_params.py", line 621, in from_params
    kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
  File "/Users/maltersj/miniconda/envs/smbop/lib/python3.8/site-packages/allennlp/common/from_params.py", line 199, in create_kwargs
    constructed_arg = pop_and_construct_arg(
  File "/Users/maltersj/miniconda/envs/smbop/lib/python3.8/site-packages/allennlp/common/from_params.py", line 307, in pop_and_construct_arg
    return construct_arg(class_name, name, popped_params, annotation, default, **extras)
  File "/Users/maltersj/miniconda/envs/smbop/lib/python3.8/site-packages/allennlp/common/from_params.py", line 341, in construct_arg
    return annotation.from_params(params=popped_params, **subextras)
  File "/Users/maltersj/miniconda/envs/smbop/lib/python3.8/site-packages/allennlp/common/from_params.py", line 589, in from_params
    return retyped_subclass.from_params(
  File "/Users/maltersj/miniconda/envs/smbop/lib/python3.8/site-packages/allennlp/common/from_params.py", line 623, in from_params
    return constructor_to_call(**kwargs)  # type: ignore
  File "/Users/maltersj/Documents/School/DL/SmBop/smbop/dataset_readers/spider.py", line 68, in __init__
    self.cache = TensorCache(cache_directory)
  File "/Users/maltersj/Documents/School/DL/SmBop/smbop/utils/cache.py", line 126, in __init__
    self.lmdb_env = lmdb.open(
lmdb.Error: cache/exp1000train: No such file or directory
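
Judging from the traceback, the LMDB-backed TensorCache expects its cache directory to already exist. One workaround to try (an assumption based on the error, not a documented step) is to create it before running exec.py:

mkdir -p cache/exp1000train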

Question regarding final_beam_acc and BEM

Hi @OhadRubin ,

Could you please let me know whether the final_beam_acc metric in the code is the same as the BEM metric reported in Table 3 of your paper?

I'm finding final_beam_acc to be much lower than the EM and BEM scores.
Is this expected?

{
"best_epoch": 204,
"peak_worker_0_memory_MB": 11110.15234375,
"peak_gpu_0_memory_MB": 17076.2431640625,
"training_duration": "1 day, 11:19:15.241460",
"training_start_epoch": 0,
"training_epochs": 304,
"epoch": 304,
"training_final_beam_acc": 0.9714408725602756,
"training_loss": 1.5601123754273762,
"training_worker_0_memory_MB": 11110.15234375,
"training_gpu_0_memory_MB": 17076.2431640625,
"validation_final_beam_acc": 0.6634335596508244,
"validation_spider": 0.7177497575169738,
"validation_reranker": 0.5470417070805044,
"validation_leafs_acc": 0.965082444228904,
"validation_loss": 0.0,
"best_validation_final_beam_acc": 0.6702230843840931,
"best_validation_spider": 0.7381183317167799,
"best_validation_reranker": 0.5606207565470417,
"best_validation_leafs_acc": 0.9515033947623667,
"best_validation_loss": 0.0
}

EM evaluation failed

Hi there,

Thanks for releasing the code. I ran the model with a filtered Spider dataset that excludes extra-hard samples. However, I encountered the errors below; although training can still go on, the evaluation accuracy is 0. Any idea what might cause this?
(error screenshot attached)

Problems using SmBop without values

Hello, I recently tried to train and evaluate the method without values and ran into some problems; hopefully you can help me with that.
I trained the model with:

python exec.py --disable_value_pred

and afterward tried to evaluate with the lines:

python eval.py --archive_path model_name  --output preds.sql 
python smbop/eval_final/evaluation.py --gold dataset/dev_gold.sql --pred preds.sql --etype all --db  dataset/database  --table dataset/tables.json

The preds.sql contained lines like:

concert.concert_id != singer.age    concert_singer
concert.concert_id != *    concert_singer
concert.concert_name != *    concert_singer

The result in every run was:


=====================   EXECUTION ACCURACY     =====================
execution            0.000                0.000                0.000                0.000                0.000               

====================== EXACT MATCHING ACCURACY =====================
exact match          0.000                0.000                0.000                0.000                0.000  

Thank you in advance!

lmdb.Error: cache/exp1000train: No such file or directory

I installed everything as described. When I try to run start_demo or the eval script, I get the error:

Traceback (most recent call last):
File "eval.py", line 77, in <module>
main()
File "eval.py", line 49, in main
args.archive_path, cuda_device=args.gpu, overrides=overrides
File "/mount/arbeitsdaten31/studenten1/aterima/.conda/envs/SmBop/lib/python3.6/site-packages/allennlp/predictors/predictor.py", line 362, in from_path
load_archive(archive_path, cuda_device=cuda_device, overrides=overrides),
File "/mount/arbeitsdaten31/studenten1/aterima/.conda/envs/SmBop/lib/python3.6/site-packages/allennlp/models/archival.py", line 205, in load_archive
config.duplicate(), serialization_dir
File "/mount/arbeitsdaten31/studenten1/aterima/.conda/envs/SmBop/lib/python3.6/site-packages/allennlp/models/archival.py", line 231, in _load_dataset_readers
dataset_reader_params, serialization_dir=serialization_dir
File "/mount/arbeitsdaten31/studenten1/aterima/.conda/envs/SmBop/lib/python3.6/site-packages/allennlp/common/from_params.py", line 593, in from_params
**extras,
File "/mount/arbeitsdaten31/studenten1/aterima/.conda/envs/SmBop/lib/python3.6/site-packages/allennlp/common/from_params.py", line 623, in from_params
return constructor_to_call(**kwargs) # type: ignore
File "/mount/arbeitsdaten31/studenten1/aterima/SmBop/smbop/dataset_readers/spider.py", line 69, in __init__
self.cache = TensorCache(cache_directory)
File "/mount/arbeitsdaten31/studenten1/aterima/SmBop/smbop/utils/cache.py", line 137, in __init__
lock=use_lock,
lmdb.Error: cache/exp1000train: No such file or directory

Do I have to run training first, although model weights are given?

ra_postproc.remove_keep does not remove the top-level keep

For example, consider the tree:

keep
╰── keep
    ╰── keep
        ╰── keep
            ╰── keep
                ╰── keep
                    ╰── Orderby_asc
                        ├── keep
                        │   ╰── keep
                        │       ╰── Value
                        ╰── Project
                            ├── keep
                            │   ╰── Value
                            ╰── keep
                                ╰── Table

After running it through remove_keep, we get:

keep
╰── Orderby_asc
    ├── Value
    ╰── Project
        ├── Value
        ╰── Table

I'll be submitting a pr to fix this.
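
For reference, here is a minimal sketch of the intended behaviour, using a plain (name, children) tuple tree rather than the repository's actual tree class:

# Sketch only: splice out single-child "keep" wrapper nodes,
# including a top-level "keep".
def remove_keep(node):
    name, children = node
    children = [remove_keep(c) for c in children]
    if name == "keep" and len(children) == 1:
        return children[0]
    return name, children

tree = ("keep", [("keep", [("Project", [("keep", [("Value", [])]),
                                        ("keep", [("Table", [])])])])])
print(remove_keep(tree))
# -> ('Project', [('Value', []), ('Table', [])])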

Running SmBoP in multi-threaded contexts

I'm trying to write a Flask web app where I want to write out some English text, translate it to SQL using an SmBoP model, and execute it on my backend database. However, Flask seems to create multiple threads to run the app, and when I try to run the model through the web app I get the following SQLite3 error:

ProgrammingError: SQLite objects created in a thread can only be used in that same thread

This seems to be caused by SmBoP's enc_preproc functionality accessing the SQLite database from multiple threads. After some research, following a suggestion from this stackoverflow question, I found that changing lines 185-186 of smbop/dataset_readers/enc_preproc.py from:

with sqlite3.connect(sqlite_path) as source:
    dest = sqlite3.connect(":memory:")

to:

with sqlite3.connect(sqlite_path, check_same_thread=False) as source:
    dest = sqlite3.connect(":memory:", check_same_thread=False)

solved my problem. Would it be possible to include this in the official codebase so that I can just clone the repository when I need it, and can properly keep my clone up to date, or are there reasons to avoid this type of change?

How do I make predictions on Custom Data

I created the store_small.sqlite and schema.sql files in a folder called store_small and placed this folder inside the Spider database folder.
Now, when I try to select this database in the Colab notebook, it raises KeyError: 'store_small'.

How should I make inferences and run custom queries on this database? Someone, please help me out here.

The traceback:

KeyError                                  Traceback (most recent call last)
<ipython-input-9-714b9f458c15> in <module>()
     26       )
     27       return out[0]["sql_list"]
---> 28 inference("How many regions are there?","store_small")

<ipython-input-9-714b9f458c15> in inference(question, db_id)
     18 def inference(question,db_id):
     19   instance = predictor._dataset_reader.text_to_instance(
---> 20       utterance=question, db_id=db_id,
     21   )
     22   predictor._dataset_reader.apply_token_indexers(instance)

/content/SmBop/smbop/dataset_readers/spider.py in text_to_instance(self, utterance, db_id, sql, sql_with_values)
    316             )
    317 
--> 318         desc = self.enc_preproc.get_desc(tokenized_utterance, db_id)
    319         entities, added_values, relation = self.extract_relation(desc)
    320 

/content/SmBop/smbop/dataset_readers/enc_preproc.py in get_desc(self, tokenized_utterance, db_id)
    193             text=[x.text for x in tokenized_utterance[1:-1]],
    194             code=None,
--> 195             schema=self.schemas[db_id],
    196             orig=None,
    197             orig_schema=self.schemas[db_id].orig,

KeyError: 'store_small'
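
One quick, hypothetical sanity check: since the reader builds its schemas dict from the Spider-style tables file, confirm the new db_id actually appears there and not only as a .sqlite file:

import json

with open("dataset/tables.json") as f:
    tables = json.load(f)

print(any(t["db_id"] == "store_small" for t in tables))
# False would explain the KeyError: a schema entry for 'store_small'
# has to be added to the tables file as well.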

Extract beam instead of best query

Hi again,

Is it possible to extract a beam with N queries, instead of the top one?

(screenshot)

It seems the variable "out" in the image above always has size 2.

I have been checking allennlp/models/model.py, which contains the class Predictor (which has the method forward_on_instances(...)), but after grepping it I cannot find any mention of a beam in that file.

Thanks a lot,

Question about train/validation set.

Hi, this is just one question

Spider train_set has 8659 instances, although it comes divided into train_spider.json which has 7000, and train_others.json which has the other instances and which is used in most models as a validation set.

I would like clarification about whether SmBoP is trained on the 7000 instances or on the full 8659 to achieve state-of-the-art performance.

I've been checking the config file from a high-level perspective, and I am not sure...

(screenshot)

Thanks for your work and attention

About multi GPUs for training

Hi, again

I want to use multiple GPUs. Is it enough to just add

"distributed": { "cuda_devices": [0, 1, 2], }

to defaults.jsonnet?

Poor Performance with bigbird-roberta-base

Hi @OhadRubin ,

Replacing the Grappa-based encoder with bigbird-roberta-base yields a validation accuracy (best_validation_spider) of just 40.8%.
Is this expected, because bigbird-roberta-base is not pre-trained using the Grappa-based objectives?

Thanks,

Resuming model training

Hi,

First of all, Thanks for your work.

I've managed to train the model for 100 epochs, but I would like to continue. I noticed you prepared a script "resume.py" but when I try running it, after changing the serialization_dir:

(screenshot)

I got some errors with the imports, so I commented them out:

(screenshot)

However, when running resume.py I now get this error:

(screenshot)

Any clue how to solve it, @OhadRubin?

Thanks a lot

How to get the confidence of a prediction?

I am trying to get the confidence of each prediction from SmBoP. In your code, it seems that to make the final prediction only the score for the last decoding step is needed, i.e., in this line all other scores are set to the minimum. So can I just use the max score in beam_scores_el there as the confidence of a prediction? Or is there a more reasonable way to do it? Any suggestions would be greatly appreciated!
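
For illustration only, here is a rough way to turn those last-step scores into a single number, assuming beam_scores_el is an unnormalized score tensor over the beam candidates; this is a sketch, not a calibrated confidence measure:

import torch

def prediction_confidence(beam_scores_el: torch.Tensor) -> float:
    # Softmax-normalize over the beam dimension and report the probability
    # mass assigned to the top-scoring candidate.
    probs = torch.softmax(beam_scores_el, dim=-1)
    return probs.max().item()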

ValueError: TextField's token_indexers have not been set.

I get the following error when running the demo notebook on Colab. I have not made any changes to your code:

ValueError: TextField's token_indexers have not been set.
Did you forget to call DatasetReader.apply_token_indexers(instance) on your instance?
If apply_token_indexers() is being called but you're still seeing this error, it may not be implemented correctly..

Resume training with the pretrained model

Hi,

I've trained a SmBoP model from scratch with the 8659 instances instead of the 7000 and got better results; however, they were still worse than the pre-trained model's. So currently I am trying to continue training the pre-trained model with the 8659 instances to see if I can get further improvements.

However, I am not able to resume training of the pre-trained model provided on this page.

This is what I am currently doing:

python resume.py --dir /app/SmBop/experiments/pretrained_model --gpu 0

However, it seems the training starts from scratch instead of resuming (as can be seen from the very low validation score):

(screenshot)

Do I need to change the weights.th file? What can I do to resume training from the best pretrained model available?

Meaning of tensor beam_scores_el

What is beam_scores_el mathematically? Is it the sum, average, or product of the log probabilities of each predicted token of each beam? What do the dimensions of this tensor indicate? @entslscheia @OhadRubin, please answer at your earliest convenience. Thanks.
