
rowanz / r2c

464 stars, 16 watchers, 91 forks, 199 KB

Recognition to Cognition Networks (code for the model in "From Recognition to Cognition: Visual Commonsense Reasoning", CVPR 2019)

Home Page: https://visualcommonsense.com

License: MIT License

Language: Python 100.00%
Topics: visual-reasoning, vision, vcr, visual-commonsense-reasoning, commonsense

r2c's People

Contributors

cclauss, jizecao, rowanz


r2c's Issues

meaning of ctx_answer

I'm looking at the BERT representations of the question. It seems that ctx_answer0, ctx_answer1, ctx_answer2, and ctx_answer3 should all be the same representation of the question for each sample, but in the downloaded BERT data they are not. Why does this happen?
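
One way to check this directly is the sketch below, assuming the bert_da HDF5 layout that dataloaders/vcr.py reads (one group per example, keyed by the string index) and a hypothetical filename. A likely explanation for the difference: extract_features.py encodes the question jointly with each answer candidate, so the contextualized question tokens depend on which answer occupies the second segment.

import h5py
import numpy as np

# Hypothetical filename; point this at whichever bert_da *.h5 file holds the features.
with h5py.File('data/bert_da_answer_train.h5', 'r') as h5:
    grp = h5['0']  # features for the first example
    ctx = [np.array(grp['ctx_answer%d' % i], dtype=np.float16) for i in range(4)]
    for i in range(1, 4):
        same = ctx[0].shape == ctx[i].shape and np.allclose(ctx[0], ctx[i], atol=1e-2)
        print('ctx_answer0 == ctx_answer%d:' % i, same)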

Issue with dataset extraction using Colab.

I tried uploading both the zipped and unzipped files to Google Drive, and both ways it reports that files in the dataset may be corrupted. And since the dataset is huge, Colab won't support the computation. Could you recommend a way to work with this dataset?

KeyError: "Unable to open object (object '194190' doesn't exist)"

Thanks for your great code.
I got the error below, but I don't know what is going wrong or how to fix it. I hope I can get some help.

0%| Traceback (most recent call last):
  File "train.py", line 119, in <module>
    for b, (time_per_batch, batch) in enumerate(time_batch(train_loader if args.no_tqdm else tqdm(train_loader), reset_every=ARGS_RESET_EVERY)):
  File "/home/songzijie/r2cmaster/utils/pytorch_misc.py", line 29, in time_batch
    for i, item in enumerate(gen):
  File "/home/songzijie/.conda/envs/r2c/lib/python3.6/site-packages/tqdm/std.py", line 1127, in __iter__
    for obj in iterable:
  File "/home/songzijie/.conda/envs/r2c/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 582, in __next__
    return self._process_next_batch(batch)
  File "/home/songzijie/.conda/envs/r2c/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 606, in _process_next_batch
    raise Exception("KeyError:" + batch.exc_msg)
Exception: KeyError:Traceback (most recent call last):
  File "/home/songzijie/.conda/envs/r2c/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/songzijie/.conda/envs/r2c/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/songzijie/r2cmaster/dataloaders/vcr.py", line 235, in __getitem__
    grp_items = {k: np.array(v, dtype=np.float16) for k, v in h5[str(index)].items()}
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/home/songzijie/.conda/envs/r2c/lib/python3.6/site-packages/h5py/_hl/group.py", line 264, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: "Unable to open object (object '194190' doesn't exist)"
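
A quick way to narrow this down (a sketch, not code from the repo) is to check whether the failing index exists in the HDF5 file at all; a missing key usually means the downloaded .h5 features do not match the .jsonl split that dataloaders/vcr.py is loading:

import h5py

# Hypothetical path; use whichever bert_da *.h5 file vcr.py is configured to read.
with h5py.File('data/bert_da_answer_train.h5', 'r') as h5:
    print('number of groups in file:', len(h5))
    print("contains key '194190':", '194190' in h5)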

FileNotFoundError: [Errno 2] No such file or directory: '/disk4/libuwei/r2c-master/data/vcr1images/movieclips_S.W.A.T./[email protected]'

Hi Rowan,

When I run:
python train.py -params multiatt/default.json -folder saves/flagship_answer

There is an error:
FileNotFoundError: [Errno 2] No such file or directory: '/disk4/libuwei/r2c-master/data/vcr1images/movieclips_S.W.A.T./[email protected]'

Environment information:
PyTorch version: 1.0.1.post2
CUDA used to build PyTorch: 8.0.44
OS: CentOS 7.3
GCC version: 5.5
Python version: 3.6
Is CUDA available: yes
GPU: Tesla K20c
Nvidia driver version: 375.66

Low training accuracy

Hi @rowanz,

I was trying to replicate the training of the R2C model on my end, but training reaches an accuracy of only around 24.9%, as opposed to the 60%+ we get from the best checkpoint linked in the README.md.

The run replicates the model on 2 GPUs with all settings at their defaults. The standard output file is attached as train.txt. Could you please help explain what the issue could be and what can be done to get comparable accuracy? Please let me know if you need any details from my side.

Loss not decreasing on default config settings

@rowanz
Hi, I am trying to train the model from scratch but am not able to reproduce the published results. Specifically, the loss does not decrease from epoch to epoch. I ran it for 20 epochs and the results are below. Has anyone faced this issue, or does anyone know a possible reason for it? Any suggestions would be a great help. Thank you.

TRAIN EPOCH 0:
loss 1.356284
crl 0.144345
accuracy 0.311996
sec_per_batch 1.702358
hr_per_epoch 1.048369
dtype: float64

Val epoch 0 has acc 0.249 and loss 1.386
Best validation performance so far. Copying weights to 'saves/flagship_rationale/best.th'.

TRAIN EPOCH 1:
loss 1.386393
crl 0.089470
accuracy 0.249471
sec_per_batch 2.008696
hr_per_epoch 1.237022
dtype: float64

Val epoch 1 has acc 0.249 and loss 1.386

TRAIN EPOCH 2:
loss 1.386381
crl 0.075422
accuracy 0.251220
sec_per_batch 1.946174
hr_per_epoch 1.198519
dtype: float64

Epoch 2: reducing learning rate of group 0 to 1.0000e-04.
Val epoch 2 has acc 0.249 and loss 1.386

TRAIN EPOCH 3:
loss 1.386379
crl 0.050537
accuracy 0.248640
sec_per_batch 1.870728
hr_per_epoch 1.152057
dtype: float64

Val epoch 3 has acc 0.249 and loss 1.386

TRAIN EPOCH 4:
loss 1.386330
crl 0.042339
accuracy 0.250779
sec_per_batch 2.006369
hr_per_epoch 1.235589
dtype: float64

Val epoch 4 has acc 0.249 and loss 1.386

TRAIN EPOCH 5:
loss 1.386332
crl 0.037035
accuracy 0.250581
sec_per_batch 1.735174
hr_per_epoch 1.068578
dtype: float64

Val epoch 5 has acc 0.249 and loss 1.386

TRAIN EPOCH 6:
loss 1.386333
crl 0.032566
accuracy 0.249394
sec_per_batch 2.384569
hr_per_epoch 1.468497
dtype: float64

Epoch 6: reducing learning rate of group 0 to 5.0000e-05.
Val epoch 6 has acc 0.249 and loss 1.386

TRAIN EPOCH 7:
loss 1.386345
crl 0.020694
accuracy 0.247829
sec_per_batch 2.088539
hr_per_epoch 1.286192
dtype: float64

Val epoch 7 has acc 0.249 and loss 1.386

TRAIN EPOCH 8:
loss 1.386309
crl 0.017643
accuracy 0.251004
sec_per_batch 1.965981
hr_per_epoch 1.210717
dtype: float64

Val epoch 8 has acc 0.249 and loss 1.386

TRAIN EPOCH 9:
loss 1.386299
crl 0.015537
accuracy 0.251415
sec_per_batch 1.872479
hr_per_epoch 1.153135
dtype: float64

Val epoch 9 has acc 0.249 and loss 1.386

TRAIN EPOCH 10:
loss 1.386302
crl 0.014494
accuracy 0.251420
sec_per_batch 1.644809
hr_per_epoch 1.012928
dtype: float64

Epoch 10: reducing learning rate of group 0 to 2.5000e-05.
Val epoch 10 has acc 0.249 and loss 1.386

TRAIN EPOCH 11:
loss 1.386306
crl 0.009551
accuracy 0.252025
sec_per_batch 1.408009
hr_per_epoch 0.867099
dtype: float64

Val epoch 11 has acc 0.249 and loss 1.386

TRAIN EPOCH 12:
loss 1.386314
crl 0.007876
accuracy 0.250382
sec_per_batch 1.419217
hr_per_epoch 0.874001
dtype: float64

Val epoch 12 has acc 0.249 and loss 1.386

TRAIN EPOCH 13:
loss 1.386337
crl 0.007333
accuracy 0.248957
sec_per_batch 1.800047
hr_per_epoch 1.108529
dtype: float64

Val epoch 13 has acc 0.249 and loss 1.386

TRAIN EPOCH 14:
loss 1.386308
crl 0.006972
accuracy 0.251202
sec_per_batch 1.691500
hr_per_epoch 1.041682
dtype: float64

Epoch 14: reducing learning rate of group 0 to 1.2500e-05.
Val epoch 14 has acc 0.249 and loss 1.386

TRAIN EPOCH 15:
loss 1.386294
crl 0.004941
accuracy 0.250033
sec_per_batch 1.976553
hr_per_epoch 1.217227
dtype: float64

Val epoch 15 has acc 0.249 and loss 1.386

TRAIN EPOCH 16:
loss 1.386299
crl 0.004361
accuracy 0.250594
sec_per_batch 2.385966
hr_per_epoch 1.469357
dtype: float64

Val epoch 16 has acc 0.249 and loss 1.386

TRAIN EPOCH 17:
loss 1.386329
crl 0.004206
accuracy 0.249658
sec_per_batch 2.463118
hr_per_epoch 1.516870
dtype: float64

Val epoch 17 has acc 0.249 and loss 1.386

TRAIN EPOCH 18:
loss 1.386311
crl 0.003819
accuracy 0.249090
sec_per_batch 2.041939
hr_per_epoch 1.257494
dtype: float64

Epoch 18: reducing learning rate of group 0 to 6.2500e-06.
Val epoch 18 has acc 0.249 and loss 1.386

TRAIN EPOCH 19:
loss 1.386334
crl 0.003092
accuracy 0.249248
sec_per_batch 1.784414
hr_per_epoch 1.098902
dtype: float64

Val epoch 19 has acc 0.249 and loss 1.386

Import error: undefined symbol issue

(screenshot from 2019-10-24 showing the undefined-symbol ImportError)

I've followed the setup process from creating a new conda environment onward, but I cannot get the leaderboard output; I get an import error about an undefined CUDA symbol.

Do you have any idea?

Cannot create a tensor proto whose content is larger than 2GB

For rationale mode, the repository provides code for computing BERT embeddings of question and correct-answer pairs only for the train and validation sets; for the test set, embeddings are computed for all question/answer pairs. If we try to extend that behavior to the train set, the following error occurs:

Traceback (most recent call last):
  File "extract_features.py", line 245, in <module>
    for result in tqdm(estimator.predict(input_fn, yield_single_examples=True)):
  File "/usr/local/lib/python3.5/dist-packages/tqdm/_tqdm.py", line 1022, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2437, in predict
    rendezvous.raise_errors()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/error_handling.py", line 128, in raise_errors
    six.reraise(typ, value, traceback)
  File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2431, in predict
    yield_single_examples=yield_single_examples):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 549, in predict
    input_fn, model_fn_lib.ModeKeys.PREDICT)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1024, in _get_features_from_input_fn
    result = self._call_input_fn(input_fn, mode)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2354, in _call_input_fn
    return input_fn(**kwargs)
  File "/vol/vcr/r2c/data/get_bert_embeddings/vcr_loader.py", line 57, in input_fn
    dtype=tf.int32),
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/constant_op.py", line 207, in constant
    value, dtype=dtype, shape=shape, verify_shape=verify_shape))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/tensor_util.py", line 506, in make_tensor_proto
    "Cannot create a tensor proto whose content is larger than 2GB.")

How can I download vcr v1.0 data?

Hi rowanz, thanks for your great work. But I have a problem: I cannot find the link to download the VCR data, i.e. the images and annotations. I click 'Annotations' and 'Images', but nothing happens.

I got an error "OSError: broken data stream when reading image file"

Training had been working before, but now I get the error below. I don't know what happened or how to fix it.

  0%|▏                                               | 5/1019 [00:45<2:49:41, 10.04s/it]Traceback (most recent call last):
  File "train.py", line 132, in <module>
    for b, (time_per_batch, batch) in enumerate(time_batch(train_loader if args.no_tqdm else tqdm(train_loader), reset_every=ARGS_RESET_EVERY)):
  File "/home/ailab/r2c/utils/pytorch_misc.py", line 29, in time_batch
    for i, item in enumerate(gen):
  File "/home/ailab/anaconda3/envs/vcr/lib/python3.6/site-packages/tqdm/std.py", line 1104, in __iter__
    for obj in iterable:
  File "/home/ailab/anaconda3/envs/vcr/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 568, in __next__
    return self._process_next_batch(batch)
  File "/home/ailab/anaconda3/envs/vcr/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
OSError: Traceback (most recent call last):
  File "/home/ailab/anaconda3/envs/vcr/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/ailab/anaconda3/envs/vcr/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/ailab/r2c/dataloaders/vcr.py", line 392, in __getitem__
    image = load_image(os.path.join(VCR_IMAGES_DIR, item['img_fn']))
  File "/home/ailab/r2c/dataloaders/box_utils.py", line 15, in load_image
    return default_loader(img_fn)
  File "/home/ailab/anaconda3/envs/vcr/lib/python3.6/site-packages/torchvision/datasets/folder.py", line 147, in default_loader
    return pil_loader(path)
  File "/home/ailab/anaconda3/envs/vcr/lib/python3.6/site-packages/torchvision/datasets/folder.py", line 130, in pil_loader
    return img.convert('RGB')
  File "/home/ailab/anaconda3/envs/vcr/lib/python3.6/site-packages/PIL/Image.py", line 930, in convert
    self.load()
  File "/home/ailab/anaconda3/envs/vcr/lib/python3.6/site-packages/PIL/ImageFile.py", line 272, in load
    raise_ioerror(err_code)
  File "/home/ailab/anaconda3/envs/vcr/lib/python3.6/site-packages/PIL/ImageFile.py", line 59, in raise_ioerror
    raise IOError(message + " when reading image file")
OSError: broken data stream when reading image file
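
A sketch (not from the repo) for locating the damaged file: fully decode every image referenced by the annotations, and the file that raises is the one with the corrupted stream. The 'img_fn' key and VCR_IMAGES_DIR mirror the usage visible in the traceback above; the concrete paths are assumptions.

import json
import os
from PIL import Image

VCR_IMAGES_DIR = 'data/vcr1images'  # adjust to your layout

# Try to fully decode every image referenced by train.jsonl; a truncated or
# corrupted file raises the same "broken data stream" OSError seen above.
with open('data/train.jsonl') as f:
    for line in f:
        fn = os.path.join(VCR_IMAGES_DIR, json.loads(line)['img_fn'])
        try:
            Image.open(fn).convert('RGB')
        except OSError as e:
            print(fn, e)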

Question about adversarial Matching

In the paper, it's said that "each answer appears exactly four times in the dataset". I tried to verify this, but I cannot reproduce that conclusion from the train.jsonl and val.jsonl files. Can you explain more about this? Also, does it mean that every correct answer appears 4 times, or that every answer candidate appears 4 times? Thanks.
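
A sketch of one way to run this check, assuming the standard VCR train.jsonl layout where 'answer_choices' is a list of four token lists (tokens may be strings or detection-tag lists):

import json
from collections import Counter

# Count how often each answer candidate string appears across the split.
counts = Counter()
with open('train.jsonl') as f:
    for line in f:
        for ans in json.loads(line)['answer_choices']:
            counts[' '.join(str(tok) for tok in ans)] += 1

# Distribution of appearance counts; "each answer appears exactly four times"
# would show up as every candidate having count 4.
print(Counter(counts.values()))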

No corresponding module find in AllenNLP

Hi, thanks for the code for helping load the dataset.

But I find there are two small issues in the code.
In dataloaders/vcr.py, from allennlp.data.dataset import Batch: Batch cannot be found.
In dataloaders/bert_field.py, from allennlp.data.token_indexers.token_indexer import TokenIndexer, TokenType: TokenType cannot be found.

Maybe it's due to the AllenNLP version at the time the code was written. I'm using version 2.10.0.

Many thanks!
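
For what it's worth, both symbols moved rather than disappeared in newer AllenNLP releases, so a small compat shim may be enough. A sketch, assuming AllenNLP >= 1.0 keeps Batch in allennlp.data.batch (TokenType was dropped in favor of plain type annotations, so references to it can usually just be removed):

# In dataloaders/vcr.py:
try:
    from allennlp.data.dataset import Batch  # AllenNLP 0.x, as the repo expects
except ImportError:
    from allennlp.data.batch import Batch    # AllenNLP 1.x / 2.x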

need ~10 hrs to run one epoch

Hi,

Thanks for your great work! I tried to run your train script in the "models" folder, and it showed that one epoch would take roughly 10 hours. After using the line_profiler tool to analyze the running time of the train script, I found that this line of code takes 90% of the total running time:
for b, (time_per_batch, batch) in enumerate(time_batch(train_loader if args.no_tqdm else tqdm(train_loader), reset_every=ARGS_RESET_EVERY)):

I think this line basically calls the collate_fn() of the DataLoader object and the __getitem__() of the Dataset object. Do you have any idea why one epoch takes so long to run?

I'm using 4 CPUs with 20 GB memory in total and a Tesla V100 on a Google Cloud VM instance.

PS: I also tried replacing everything inside that loop with "pass":
for b, (time_per_batch, batch) in enumerate(time_batch(train_loader if args.no_tqdm else tqdm(train_loader), reset_every=ARGS_RESET_EVERY)): pass
and it still needed the same amount of time for one epoch.
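
One way to isolate the data pipeline (a sketch, not repo code) is to time the loader alone at different worker counts. If seconds per batch drop sharply as num_workers grows, the bottleneck is __getitem__ (image decoding plus per-item HDF5 reads) rather than the model; dataset and collate_fn here are assumed to come from dataloaders/vcr.py.

import time
from torch.utils.data import DataLoader

def time_loader(dataset, collate_fn, batch_size=96, n_batches=20):
    # Measure seconds per batch of the data pipeline alone at several worker counts.
    for workers in (0, 4, 8, 16):
        loader = DataLoader(dataset, batch_size=batch_size, num_workers=workers,
                            collate_fn=collate_fn, shuffle=False)
        start = time.time()
        for i, _ in enumerate(loader):
            if i == n_batches:  # time only the first n_batches batches
                break
        print('num_workers=%d: %.2f s/batch' % (workers, (time.time() - start) / n_batches))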

cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Hi, I am hitting the following problem:
File "train.py", line 131, in
output_dict = model(**batch)
File "/root/anaconda3/envs/r2c_1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "../models/multiatt/model.py", line 157, in forward
obj_reps = self.detector(images=images, boxes=boxes, box_mask=box_mask, classes=objects, segms=segms)
File "/root/anaconda3/envs/r2c_1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "../utils/detector.py", line 111, in forward
img_feats = self.backbone(images)
File "/root/anaconda3/envs/r2c_1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/root/anaconda3/envs/r2c_1/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/root/anaconda3/envs/r2c_1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/root/anaconda3/envs/r2c_1/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/root/anaconda3/envs/r2c_1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/root/anaconda3/envs/r2c_1/lib/python3.6/site-packages/torchvision/models/resnet.py", line 98, in forward
out = self.conv2(out)
File "/root/anaconda3/envs/r2c_1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/root/anaconda3/envs/r2c_1/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 338, in forward
self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

The environment I use:
Python 3.6.6
CUDA 9.0.176
cuDNN 7.5.1
torch 1.1.0
torchvision 0.3.0

I have tried several environment configs (like cuDNN 7.4, torch 1.0, etc.) but none of them works. What should I do?
Thank you :)
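
A minimal sanity check (a sketch, independent of r2c) to see whether cuDNN convolutions work at all in this environment; if this also fails with CUDNN_STATUS_EXECUTION_FAILED, the CUDA/cuDNN/driver combination is at fault rather than anything in the training code:

import torch

# Tiny conv forward/backward on the GPU.
x = torch.randn(2, 3, 64, 64, device='cuda')
conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).cuda()
y = conv(x)
y.mean().backward()
print('cuDNN conv OK:', tuple(y.shape))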

no module named 'torchvision.layers'

When I run your code, there is an error: no module named 'torchvision.layers', and I cannot install via the link

pip install git+git://github.com/pytorch/vision.git@24577864e92b72f7066e1ed16e978e873e19d13d

The suggested fix of
sudo apt-get install git-core
pip install git+git://github.com/pytorch/vision.git@24577864e92b72f7066e1ed16e978e873e19d13d
doesn't work either.

Error

When I run your code, there is an error: no module named 'torchvision.layers', and I cannot install via the link

pip install git+git://github.com/pytorch/vision.git@24577864e92b72f7066e1ed16e978e873e19d13d

Running sudo apt-install git-core and then the pip install command above also does not work.

RuntimeError: copy_if failed to synchronize: the launch timed out and was terminated

I have a problem like this:
RuntimeError: copy_if failed to synchronize: the launch timed out and was terminated

Traceback (most recent call last):
  File "train.py", line 125, in <module>
    loss = output_dict['loss'].mean() + output_dict['cnn_regularization_loss'].mean()
  File "/home/ailab/anaconda2/envs/r2c/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ailab/r2c/models/multiatt/model.py", line 156, in forward
    obj_reps = self.detector(images=images, boxes=boxes, box_mask=box_mask, classes=objects, segms=segms)
  File "/home/ailab/anaconda2/envs/r2c/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ailab/r2c/utils/detector.py", line 112, in forward
    box_inds = box_mask.nonzero()
RuntimeError: copy_if failed to synchronize: the launch timed out and was terminated

Training stops when this happens. What can I do?

Environment:
Titan X
CUDA 9.0

cannot restore_checkpoint and resume training

The return values epoch_to_return, val_metric_per_epoch of the restore_checkpoint() function in utils/pytorch_misc.py

return epoch_to_return, val_metric_per_epoch

are always 0, [], even though I restore from the existing folder of a previous training run and see the output "Found folder! restoring".

r2c/models/train.py

Lines 102 to 105 in 71ee684

if os.path.exists(args.folder):
    print("Found folder! restoring", flush=True)
    start_epoch, val_metric_per_epoch = restore_checkpoint(model, optimizer, serialization_dir=args.folder,
                                                           learning_rate_scheduler=scheduler)

AttributeError: 'ScatterableList' object has no attribute 'cuda'

I have a problem like this:

Traceback (most recent call last):
  File "eval_for_leaderboard.py", line 110, in <module>
    batch = _to_gpu(batch)
  File "eval_for_leaderboard.py", line 74, in _to_gpu
    td[k] = {k2: v.cuda(async=True) for k2, v in td[k].items()} if isinstance(td[k], dict) else td[k].cuda(
AttributeError: 'ScatterableList' object has no attribute 'cuda'

What can I do?

Thank you :)

I used BERT-large for pretraining on VCR and encountered the error ResourceExhaustedError: OOM when allocating tensor

I tried using BERT-large instead of BERT-base in the original code, and modified three parameters in the BERT config (hidden size = 1024, hidden layers = 24, attention heads = 16).
Here's the error log:
https://gist.github.com/AeroXi/d4d273da9f443c0f2cf9f6d6872eeffe
My device is 4x 1080 Ti.
Maybe I can skip domain adaptation and just extract features? However, the generated filenames start with "bert" instead of "bert_da", so I can't use them directly when training r2c even after renaming them. Should I make other modifications?

Corrupted zip file

In a Linux terminal, while unzipping vcr1images.zip with the command unzip vcr1images.zip, the extraction process is killed midway through. I tried unzipping on Windows with WinRAR, and the following error appears.
(screenshot of the WinRAR error)
Has anyone else faced this problem?

How do you use the test data set?

I want to test the original r2c model and check its accuracy, but eval_q2ar.py uses only the validation data set.

Is the test data set only used with eval_for_leaderboard?

Thank you!

segmentation fault error

I try to run train.py, but the code fails on this line:
model = Model.from_params(vocab=train.vocab, params=params['model'])
The error information is:
01/10/2019 21:36:36 - INFO - allennlp.common.params - model.initializer = [['.*final_mlp.*weight', {'type': 'xavier_uniform'}], ['.*final_mlp.*bias', {'type': 'zero'}], ['.weight_ih.', {'type': 'xavier_uniform'}], ['.weight_hh.', {'type': 'orthogonal'}], ['.bias_ih.', {'type': 'zero'}], ['.bias_hh.', {'type': 'lstm_hidden_bias'}]]
01/10/2019 21:36:36 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.nn.initializers.Initializer'> from params {'type': 'xavier_uniform'} and extras {}
01/10/2019 21:36:36 - INFO - allennlp.common.params - model.initializer.list.list.type = xavier_uniform
01/10/2019 21:36:36 - INFO - allennlp.common.params - Converting Params object to dict; logging of default values will not occur when dictionary parameters are used subsequently.
01/10/2019 21:36:36 - INFO - allennlp.common.params - CURRENTLY DEFINED PARAMETERS:
01/10/2019 21:36:36 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.nn.initializers.Initializer'> from params {'type': 'zero'} and extras {}
01/10/2019 21:36:36 - INFO - allennlp.common.params - model.initializer.list.list.type = zero
01/10/2019 21:36:36 - INFO - allennlp.common.params - Converting Params object to dict; logging of default values will not occur when dictionary parameters are used subsequently.
01/10/2019 21:36:36 - INFO - allennlp.common.params - CURRENTLY DEFINED PARAMETERS:
01/10/2019 21:36:36 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.nn.initializers.Initializer'> from params {'type': 'xavier_uniform'} and extras {}
01/10/2019 21:36:36 - INFO - allennlp.common.params - model.initializer.list.list.type = xavier_uniform
01/10/2019 21:36:36 - INFO - allennlp.common.params - Converting Params object to dict; logging of default values will not occur when dictionary parameters are used subsequently.
01/10/2019 21:36:36 - INFO - allennlp.common.params - CURRENTLY DEFINED PARAMETERS:
01/10/2019 21:36:36 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.nn.initializers.Initializer'> from params {'type': 'orthogonal'} and extras {}
01/10/2019 21:36:36 - INFO - allennlp.common.params - model.initializer.list.list.type = orthogonal
01/10/2019 21:36:36 - INFO - allennlp.common.params - Converting Params object to dict; logging of default values will not occur when dictionary parameters are used subsequently.
01/10/2019 21:36:36 - INFO - allennlp.common.params - CURRENTLY DEFINED PARAMETERS:
01/10/2019 21:36:36 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.nn.initializers.Initializer'> from params {'type': 'zero'} and extras {}
01/10/2019 21:36:36 - INFO - allennlp.common.params - model.initializer.list.list.type = zero
01/10/2019 21:36:36 - INFO - allennlp.common.params - Converting Params object to dict; logging of default values will not occur when dictionary parameters are used subsequently.
01/10/2019 21:36:36 - INFO - allennlp.common.params - CURRENTLY DEFINED PARAMETERS:
01/10/2019 21:36:36 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.nn.initializers.Initializer'> from params {'type': 'lstm_hidden_bias'} and extras {}
01/10/2019 21:36:36 - INFO - allennlp.common.params - model.initializer.list.list.type = lstm_hidden_bias
01/10/2019 21:36:36 - INFO - allennlp.common.params - Converting Params object to dict; logging of default values will not occur when dictionary parameters are used subsequently.
01/10/2019 21:36:36 - INFO - allennlp.common.params - CURRENTLY DEFINED PARAMETERS:
01/10/2019 21:36:37 - INFO - allennlp.nn.initializers - Initializing parameters
01/10/2019 21:36:37 - INFO - allennlp.nn.initializers - Initializing span_encoder._module._module.weight_ih_l0 using .weight_ih. intitializer
01/10/2019 21:36:37 - INFO - allennlp.nn.initializers - Initializing span_encoder._module._module.weight_hh_l0 using .weight_hh. intitializer
Segmentation fault

Why not read h5 file in VCR __init__ function

r2c/dataloaders/vcr.py

Lines 229 to 231 in 71ee684

# grp_items = {k: np.array(v, dtype=np.float16) for k, v in self.get_h5_group(index).items()}
with h5py.File(self.h5fn, 'r') as h5:
    grp_items = {k: np.array(v, dtype=np.float16) for k, v in h5[str(index)].items()}

Hi @rowanz, thanks for sharing your nice code again!
I am confused about this and would like to know why the h5 file is opened in __getitem__ rather than in __init__.

Thank you very much!
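
The usual reason is that an h5py file handle opened in __init__ is not safe to share across the worker processes that a multi-worker DataLoader forks, so the file is reopened on each access instead. A common compromise, sketched below (not the repo's actual class), is to open lazily, once per worker:

import h5py
import numpy as np
from torch.utils.data import Dataset

class H5Features(Dataset):
    # Open the HDF5 file lazily in each worker instead of in __init__,
    # since h5py handles cannot be shared across forked processes.
    def __init__(self, h5fn, length):
        self.h5fn = h5fn
        self.length = length
        self.h5 = None  # opened on first access, once per worker process

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        if self.h5 is None:
            self.h5 = h5py.File(self.h5fn, 'r')
        return {k: np.array(v, dtype=np.float16) for k, v in self.h5[str(index)].items()}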

No idea about params file or any other command line arguments

In the code, a params file is asked for as input, but I cannot see any params file, or even a demo params file, in the codebase.

The file to be executed is likewise mentioned without demonstrating its command-line arguments.

Please help me understand what the params file looks like, or what the path of a demo params file is, so that I can understand what parameters are being passed into the program.

Error

When I run your code, there is an error: no module named 'torchvision.layers', and I cannot install via the link

pip install git+git://github.com/pytorch/vision.git@24577864e92b72f7066e1ed16e978e873e19d13d

I have a problem extracting vcr1images.zip

I downloaded your dataset file vcr1images.zip and tried to extract it, but I get an error like this:
"an error occurred while extracting files"

What can I do? Please help!
Thank you :)

Update: I figured out this problem! It was the extraction command.

Issue with directory structure

Hi,

I think the main directory is missing a setup.py file to install this code as a module, which should probably look something like this:

from setuptools import setup, find_packages

setup(
    name='r2c',
    version='0.1',
    packages=find_packages()
)
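
With a setup.py like this at the repository root, running pip install -e . should make the dataloaders, models, and utils packages importable from anywhere (a suggested usage, not something the repository itself documents).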

Also, when I downloaded the data, train.jsonl lies in the vcr1/vcr1annots/ folder and cocoontology.json lies in the dataloaders folder, but this line and this line indicate that dataloaders should be inside the vcr1/vcr1annots/ folder, while the instructions on the website say that the data can live in a separate downloaded folder. Can you please help clarify the confusion?

Thanks!

eval for test data

Hello,
When I evaluate my model on the test data, should I submit an 'answer_preds.npy' and a 'ration_preds.npy', respectively?

Thank you

Baseline for Q->AR

First up, thanks for the great work and releasing the code!

I'm trying to reproduce the baselines from the code, and it works like a charm for the Q->A and QA->R tasks, but I don't see any code for the Q->AR task. Could you please share some details as to how this is computed?

Is the baseline validation accuracy of 43.1 mentioned in the paper for the Q->AR task obtained by first running the Q->A task and then, conditioned on those predicted answers, running QA->R? If so, I believe the bert_da embeddings for ctx_rationale<i> would need to be recomputed from the (question + predicted answer) as opposed to what has been precomputed (question + ground-truth answer). To avoid having to pretrain the bert_da embeddings as mentioned in https://github.com/rowanz/r2c/blob/master/data/get_bert_embeddings/README.md, would you be able to share the init_checkpoint file that I could use in extract_features.py?

Thank you!
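
For reference, Q->AR is conventionally scored as joint correctness of the two sub-tasks. A minimal sketch under that assumption, with hypothetical .npy files holding (N, 4) probability arrays and (N,) ground-truth label indices (not filenames the repo guarantees):

import numpy as np

# Hypothetical inputs from the two separately trained models.
answer_preds = np.load('answer_preds.npy')          # (N, 4) probabilities
rationale_preds = np.load('rationale_preds.npy')    # (N, 4) probabilities
answer_labels = np.load('answer_labels.npy')        # (N,) ground-truth indices
rationale_labels = np.load('rationale_labels.npy')  # (N,) ground-truth indices

answer_hits = answer_preds.argmax(1) == answer_labels
rationale_hits = rationale_preds.argmax(1) == rationale_labels
print('Q->A  accuracy:', answer_hits.mean())
print('QA->R accuracy:', rationale_hits.mean())
print('Q->AR accuracy:', (answer_hits & rationale_hits).mean())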

Submission to leaderboard

Hi! Can results be submitted to the leaderboard? I tried contacting you via email, but I didn't receive a reply. Thank you!
