
seqdiffuseq's Introduction

This is the official repo for the paper SeqDiffuSeq: Text Diffusion with Encoder-Decoder Transformers

Parts of our code are modified from the DiffusionLM and minimaldiffusion repos.

Environment

Before running our code, you can set up the environment with the following commands.

conda create -n seqdiffuseq python=3.8
conda install mpi4py
pip install torch==1.10.0+cu111 torchvision==0.11.0+cu111 torchaudio==0.10.0
pip install -r requirements.txt

Preparing dataset

For the non-translation tasks, we follow DiffuSeq for the dataset settings.

For IWSLT14 and WMT14, we follow the data preprocessing from fairseq. We also provide the processed datasets at this link. (Update 04/13/2023: sorry for the missing WMT14 data; it has now been uploaded. Download it from here.)

Training

To run the code, we use iwslt14 en-de as an illustrative example:

  1. Prepare the IWSLT14 data under the ./data/iwslt14/ directory;
  2. Learn the BPE tokenizer with:
python ./tokenizer_utils.py train-byte-level iwslt14 10000
  3. Train with the following commands:
mkdir ckpts
bash ./train_scripts/iwslt_en_de.sh 0 de en
# for en-to-de translation: bash ./train_scripts/iwslt_en_de.sh 0 en de

You may modify the scripts in ./train_scripts for your own training settings.

Inference

After training completes, you can run the following command for inference:

bash ./inference_scrpts/iwslt_inf.sh path-to-ckpts/ema_0.9999_280000.pt path-to-save-results path-to-ckpts/alpha_cumprod_step_260000.npy

The ema_0.9999_280000.pt file contains the model weights, and alpha_cumprod_step_260000.npy is the saved noise schedule. You have to use the most recent .npy schedule file saved before the .pt model weight file.
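
For convenience, here is a small sketch (not part of the repo; the ckpts/ directory and filename patterns are assumed from the examples above) that picks the newest schedule file saved before a given checkpoint:

# A helper sketch (not part of the repo): pick the newest noise-schedule file
# saved before a given EMA checkpoint, following the rule stated above.
import glob
import os
import re

def find_schedule_for_checkpoint(ckpt_path, ckpt_dir="ckpts"):
    # e.g. ema_0.9999_280000.pt -> training step 280000
    ckpt_step = int(re.search(r"_(\d+)\.pt$", ckpt_path).group(1))
    candidates = []
    for path in glob.glob(os.path.join(ckpt_dir, "alpha_cumprod_step_*.npy")):
        step = int(re.search(r"step_(\d+)\.npy$", path).group(1))
        if step < ckpt_step:  # "saved before" the .pt weights
            candidates.append((step, path))
    if not candidates:
        raise FileNotFoundError("no schedule file saved before this checkpoint")
    return max(candidates)[1]  # the most recent schedule preceding the checkpoint

# find_schedule_for_checkpoint("ckpts/ema_0.9999_280000.pt")
# -> "ckpts/alpha_cumprod_step_260000.npy" (if that file exists)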

Other Comments

Note that for all training experiments, we set the maximum training steps to 1000000 and the warmup steps to 10000. For most datasets, there is no need to train until the maximum number of steps: IWSLT14 uses a checkpoint around 300000 training steps, WMT14 around 500000 steps, and the non-translation tasks around 100000 steps.

You can change the hyperparameter settings for your own experiments; increasing the training batch size or modifying the training schedule may bring some improvements.

Citation

If you find our work and codes interesting and useful, please cite:

@article{Yuan2022SeqDiffuSeqTD,
  title={SeqDiffuSeq: Text Diffusion with Encoder-Decoder Transformers},
  author={Hongyi Yuan and Zheng Yuan and Chuanqi Tan and Fei Huang and Songfang Huang},
  journal={ArXiv},
  year={2022},
  volume={abs/2212.10325}
}


seqdiffuseq's Issues

About the paper results.

sacre_results = metric_sacrebleu.compute(predictions=preds, references=refs)

I want to ask whether the result of 29.83 compared with the AR and CNAT baselines is calculated by this code, and whether this result is on the test set or the validation set.

Thanks~
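
For context, a minimal, self-contained sketch of how such a sacreBLEU score can be computed with the evaluate library (the toy preds/refs below are placeholders, not the repo's outputs):

# A self-contained sketch using the `evaluate` library; preds/refs are toy
# placeholders, not outputs from this repo.
import evaluate

metric_sacrebleu = evaluate.load("sacrebleu")
preds = ["the cat sat on the mat"]    # one hypothesis string per example
refs = [["the cat sat on the mat"]]   # a list of reference strings per example
sacre_results = metric_sacrebleu.compute(predictions=preds, references=refs)
print(sacre_results["score"])         # corpus-level BLEU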

Can't run any train script

When I tried to run the training script, I was reminded that mpi4py was missing, so I installed mpi4py

pip install mpi4py

Looking in indexes: http://mirrors.aliyun.com/pypi/simple
Collecting mpi4py
  Using cached http://mirrors.aliyun.com/pypi/packages/2e/1a/1393e69df9cf7b04143a51776727dd048586781bca82543594ab439e2eb4/mpi4py-3.1.5.tar.gz (2.5 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: mpi4py
  Building wheel for mpi4py (pyproject.toml) ... done
  Created wheel for mpi4py: filename=mpi4py-3.1.5-cp38-cp38-linux_x86_64.whl size=6024408 sha256=64ef1c54d03ecb2c862c4e57da02d6dd8d9e33673ad3948afafca08d60edfd64
  Stored in directory: /root/.cache/pip/wheels/9d/2a/7e/c6575a1d595c7d8cce796177f1b9827975c5b48b31e28f25b9
Successfully built mpi4py
Installing collected packages: mpi4py
Successfully installed mpi4py-3.1.5

Then I re-ran the training script, and there was no output at all.

bash ./train_scripts/iwslt_en_de.sh 0 de en

I waited for a while, but the program still didn't output anything. I don't know what's wrong. My operating system is Ubuntu. Is it possible that it's an MPI problem?

About Inference Problem

Hello! Thanks for your code sharing.
When I run your default inference code:

bash ./inference_scrpts/iwslt_inf.sh path-to-ckpts/ema_0.9999_280000.pt path-to-save-results path-to-ckpts/alpha_cumprod_step_260000.npy

I met a problem:

created 6800 samples
sampling complete
Traceback (most recent call last):
File "inference_main.py", line 210, in
main()
File "inference_main.py", line 159, in main
write_outputs(args=args,
File "inference_main.py", line 192, in write_outputs
with open(output_file_basepath, "w") as text_fout:
FileNotFoundError: [Errno 2] No such file or directory: 'path-to-ckpts/ema_0.9999_280000.pt.samples_6750.steps-2000.clamp-no_clamp-normal_10708.txt'

So how can I find the txt file?
I would appreciate it if you could share your thoughts.

About code share

Hello! I am a student working on diffusion and seq2seq models. I am very interested in your work and code. If possible, could you share the code with me? My email is: [email protected]

Data for wmt14 is missing

Hi, thanks for your great work. I would like to train your model on WMT14, but I haven't found the data in the Google Drive link provided in this repo. Would you please double-check it? Thanks.

Issues of reproducing Table 1 results on Commonsense Conversation Dataset (CCD)

Hi, I tried to use your script (ccd.sh) to reproduce the Table 1 results on the Commonsense Conversation Dataset, but my reproduced results (BLEU: 0.154, Rouge-L: 6.38) are far below your reported values (BLEU: 1.02, Rouge-L: 8.59). Could you check whether the hyperparameters in ccd.sh are the optimal ones you used? It would also help if you could provide the evaluation scripts for computing BLEU and Rouge-L (currently the inference scripts only save the test outputs, with no metric evaluation, if I ran them correctly). Besides, are any model checkpoints and test outputs available?

The training recipe for SeqDiffuSeq without adaptive noise schedule

Hi, I tried to reproduce your result in Table 3, especially SeqDiffuSeq without the adaptive noise schedule. Here is how I did it:

  1. I removed these two lines:

    self.update_time_discretized_parameters(interp_alpha_cumprod)

    self.update_time_discretized_parameters(alphas_cumprod)

    After removing the first line, the noise schedule is not updated during training. After removing the second line, the model does not load a saved noise schedule and only uses the pre-defined one, i.e. the one in the init function.

  2. I used your default training and sampling scripts (sqrt noise schedule with uniform timestep sampling) and ran sampling with different checkpoints, e.g. the 280K and 400K checkpoints.

However, my best result is about 18, which is far from your reported number of 28.94. May I ask what I should change to get the reported number? BTW, I can reproduce the result of the full SeqDiffuSeq.
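
For reference, a rough sketch of a fixed (non-adaptive) sqrt schedule in the style of DiffusionLM; this is an assumption about the pre-defined schedule, not code taken from this repo:

# A rough sketch of a fixed sqrt noise schedule (assumption: DiffusionLM-style
# alpha_bar(t) = 1 - sqrt(t + 0.0001), discretized into betas; not this repo's code).
import numpy as np

def sqrt_alphas_cumprod(num_steps=2000, max_beta=0.999):
    alpha_bar = lambda t: 1.0 - np.sqrt(t + 0.0001)
    betas = []
    for i in range(num_steps):
        t1, t2 = i / num_steps, (i + 1) / num_steps
        betas.append(min(1.0 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return np.cumprod(1.0 - np.array(betas))

# e.g. np.save("alpha_cumprod_fixed.npy", sqrt_alphas_cumprod())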

The self-conditioning part

I applied this technique by modifying my code, but I get the following error:
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by
making sure all forward function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable)

Can you give me a hint? Is it about using 0 half the time and the previous prediction's Zt the other half?

Thanks!
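
For reference, a generic self-conditioning sketch (an assumption about the usual recipe, not this repo's exact code; the model signature is hypothetical). Running the first pass under torch.no_grad() is also the usual way to avoid the DDP unused-parameter error quoted above:

# A generic self-conditioning sketch (assumed model signature, not this repo's code).
# Half the time the model receives zeros as the self-condition; the other half it
# receives its own detached previous prediction of x0.
import torch

def loss_with_self_conditioning(model, x_t, t, x_start):
    prev_pred = torch.zeros_like(x_t)
    if torch.rand(()) < 0.5:
        with torch.no_grad():  # first pass is detached, so DDP only tracks the second pass
            prev_pred = model(x_t, t, self_cond=prev_pred)
    pred_x0 = model(x_t, t, self_cond=prev_pred)  # trained pass
    return torch.mean((pred_x0 - x_start) ** 2)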

Question about EMA

Hi,

I'm confused about the EMA implementation. Could you tell me whether my understanding is correct?

There are two ways to do EMA. The first one is

1. after optimizer.step() you obtain model_new
2. ema_model_new = a * ema_model_old + (1-a) * model_new   # a is ema_decay
3. model_new = copy(ema_model_new)

The second one is:

1. after optimizer.step() you obtain model_new
2. ema_model_new = a * ema_model_old + (1-a) * model_new   # a is ema_decay

The difference between these two ways is:

  • The first one uses the new ema_model parameters to initialize the model for the next step. By convergence, ema_model and model should be very similar, so we can use either for sampling.
  • The second one just stores the ema_model and uses it for sampling. The update of ema_model does not affect the model.

May I ask which one you implement? I assume it is the second one, since the training outputs don't change when I change ema_decay.
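
For illustration, a minimal sketch of the second variant described above, keeping a shadow copy that is updated after each optimizer step (names are illustrative, not taken from the repo):

# A minimal sketch of the shadow-copy EMA variant (assumed names, not repo code).
import copy
import torch

@torch.no_grad()
def update_ema(ema_model, model, decay=0.9999):
    # ema <- decay * ema + (1 - decay) * current weights; the live model is untouched
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

# Typical usage inside the training loop:
#   ema_model = copy.deepcopy(model)
#   ... optimizer.step() ...
#   update_ema(ema_model, model)   # model weights are not overwritten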

Inference Problem

Hello, I would like to ask how to solve this problem. When I run the code for inference, I get:
Traceback (most recent call last):
File "inference_main.py", line 209, in <module>
main()
File "inference_main.py", line 124, in main
if args.decoder_attention_mask:
AttributeError: 'Namespace' object has no attribute 'decoder_attention_mask'
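
A possible workaround sketch (an assumption, not an official fix): read the flag defensively so an args namespace that lacks it does not crash:

# A workaround sketch (assumption, not an official fix): read the flag defensively
# so an args namespace that was created without it does not crash.
from argparse import Namespace

args = Namespace()  # stands in for an args object missing the flag
decoder_attention_mask = getattr(args, "decoder_attention_mask", False)
print(decoder_attention_mask)  # False instead of AttributeError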

'TransformerNetModel_encoder_decoder' object has no attribute 'module'

Hi authors,

Thank you for such a well written repository! Really appreciate researchers like you all.

We are interested in your work and are trying to run your code on our data. However, we are encountering this issue. While we dig into this, could you please also take a look (since you would be more well-versed)?

FWIW, we have not changed anything except the dataloader_utils.py file with code to load in our data.

Thank you!

Traceback (most recent call last):
File "main.py", line 107, in
main()
File "main.py", line 95, in main
warmup=args.warmup,
File "SeqDiffuSeq/trainer.py", line 177, in run_loop
self.run_step(batch, cond)
File "SeqDiffuSeq/trainer.py", line 196, in run_step
self.forward_backward(batch, cond)
File "SeqDiffuSeq/trainer.py", line 265, in forward_backward
losses = compute_losses()
File "SeqDiffuSeq/src/modeling/diffusion/respace.py", line 98, in training_losses
return super().training_losses(self._wrap_model(model), *args, **kwargs)
File "SeqDiffuSeq/src/modeling/diffusion/gaussian_diffusion.py", line 376, in training_losses
x_start_mean = model.model.module.get_embeds(input_ids)
File ".pyenv/versions/3.7.14/lib/python3.7/site-packages/torch/nn/modules/module.py", line 948, in getattr
type(self).name, name))
AttributeError: 'TransformerNetModel_encoder_decoder' object has no attribute 'module'
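
One plausible cause (an assumption, not a confirmed diagnosis) is that .module only exists when the network is wrapped in DistributedDataParallel. A defensive sketch:

# A defensive sketch (assumption, not the repo's official fix): `.module` exists only
# when the network is wrapped in DistributedDataParallel, so fall back to the bare module.
import torch.nn as nn

def unwrap_ddp(module: nn.Module) -> nn.Module:
    return getattr(module, "module", module)

# e.g. x_start_mean = unwrap_ddp(model.model).get_embeds(input_ids)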

Checkpoints

Thanks for the work! For all datasets, could you upload the checkpoints used for the results in the paper?

Initialize model with pre-trained BART

Hi @Yuanhy1997 ,

Thanks for your great work.

I wonder if you try to use the weights of pre-trained BART to initialize your model. I didn't find these results in the paper while I can see you have a knob to keep pre-trained weights in the code.

Best,
Chiyu

the meaning of loss name in the output

Hi,

when I run your code, I get the output like:

| loss      | 0.0201   |
| loss_q0   | 0.0202   |
| loss_q1   | 0.0201   |
| loss_q2   | 0.0202   |
| loss_q3   | 0.0199   |
| mse       | 0.0201   |
| mse_q0    | 0.0202   |
| mse_q1    | 0.0201   |
| mse_q2    | 0.0202   |
| mse_q3    | 0.0199   |

What is the difference between loss and loss_qi for i in [0, 3]?
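
For context, an assumption based on the DiffusionLM/improved-diffusion logging style this repo builds on (not a confirmed answer): loss is the average over all sampled timesteps in the batch, while loss_qi restricts the average to samples whose timestep falls in the i-th quartile of [0, T). A sketch of that bookkeeping:

# A sketch (assumption, mirroring improved-diffusion-style logging) of how per-quartile
# losses can be derived from per-sample losses and their sampled timesteps.
import torch

def loss_quartile_stats(losses: torch.Tensor, ts: torch.Tensor, num_timesteps: int):
    stats = {"loss": losses.mean().item()}
    quartiles = (4 * ts.float() / num_timesteps).long().clamp_max(3)
    for q in range(4):
        sel = losses[quartiles == q]
        if sel.numel() > 0:
            stats[f"loss_q{q}"] = sel.mean().item()
    return stats

# e.g. loss_quartile_stats(per_sample_loss, sampled_t, num_timesteps=2000)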

Question about decoder_nll loss

Hi,

thank you again for your clean code. Here I have a question about the decoder_nll loss.

According to your code, you calculate the decoder_nll loss in this way:

x_start_mean = model.model.module.get_embeds(input_ids)
x_start = self.get_x_start(x_start_mean, std)
decoder_nll = self.token_discrete_loss(x_start, get_logits, input_ids, mask=loss_mask)

def token_discrete_loss(self, x_t, get_logits, input_ids, mask=None):
    logits = get_logits(x_t)  # bsz, seqlen, vocab
    loss_fct = th.nn.CrossEntropyLoss(reduction="none", ignore_index=-100)
    decoder_nll = loss_fct(logits.view(-1, logits.size(-1)), input_ids.view(-1)).view(
        input_ids.shape
    )
    if mask is not None:
        decoder_nll[mask == 0] = 0
        decoder_nll = decoder_nll.sum(dim=-1) / mask.sum(dim=-1)
    else:
        decoder_nll = decoder_nll.mean(dim=-1)
    return decoder_nll

which means that x_start is the noisy word embedding. You calculate the cross entropy between the noisy word embedding and input_ids. However, in the diffusion_lm, "We now describe the inverse process of rounding a predicted x0 back to discrete text" (second sentence in Section 4.2). It seems they use the predicted x_start ( model_output) rather than the noisy x_start.

I know the original code also implements it in this way, but it confuses me. Why don't we replace x_start in the function self.token_discrete_loss with model_out? The noisy word embedding x_start should be very close to the original word embedding, since at the beginning we only add a little noise, so we don't need to calculate its loss. Instead, we should make sure the predicted x_start (model_out) is close to the word embedding.

problems met during inference

Hi,

I could successfully train a model on IWSLT De-En. Now I have some questions about inference:

  1. decoder_attention_mask is not in the args. Should I just comment it out?
    if args.decoder_attention_mask:
  2. How could I get the BLEU score after I run
    bash ./inference_scrpts/iwslt_inf.sh path-to-ckpts/ema_0.9999_280000.pt path-to-save-results path-to-ckpts/alpha_cumprod_step_260000.npy
  3. It seems you removed the code for DDIM. Do I have to keep the number of timesteps for inference the same as the one used for training?

About the decoding of Tokenizer.

Hi, when I run decoding with the trained BPE tokenizer, I found that the 'skip_special_tokens' parameter does not work, so the final generated text includes '</s>', '<s>', '<pad>', etc.

Did you meet this problem? I know the 'replace' function can be used to strip them, but is that really reasonable?

Thanks for sharing your code!
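
As a simple post-processing workaround (a sketch, assuming the special-token strings listed above), the tokens can be stripped after decoding:

# A post-processing sketch: remove the special-token strings mentioned above
# from decoded text and collapse leftover whitespace.
import re

SPECIAL_TOKENS = ["<s>", "</s>", "<pad>"]

def strip_special_tokens(text: str) -> str:
    for tok in SPECIAL_TOKENS:
        text = text.replace(tok, "")
    return re.sub(r"\s+", " ", text).strip()

print(strip_special_tokens("<s> das ist ein test </s> <pad> <pad>"))  # "das ist ein test"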

most recent .npy schedule file saved before .pt model weight file

You have to use the most recent .npy schedule file saved before .pt model weight file.
Does this sentence mean that the .pt weight file's step must be greater than the .npy noise schedule file's step? Can't you use the same time step for both? What happens if you run the command: bash ./inference_scrpts/iwslt_inf.sh path-to-ckpts/ema_0.9999_280000.pt path-to-save-results path-to-ckpts/alpha_cumprod_step_280000.npy

Question about training time and whether multi-GPU training is possible

Hello, roughly how long did it take to train 100k steps with a V100? Is there a multi-GPU training version so that training can run faster on a 3090? Forgive my limited knowledge; I have not been able to solve this myself. I would be very grateful for a reply. Salute!
1. The training command is: bash ./train_scripts/iwslt_en_de.sh 6 de en
2. The GPU memory usage is shown below:
[0] NVIDIA GeForce RTX 3090 | 62°C, 100 % | 8085 / 24576 MB | seqdiffuseq(253M) seqdiffuseq(253M) seqdiffuseq(253M) seqdiffuseq(651M) xx (6357M) gdm(4M)
[6] NVIDIA GeForce RTX 3090 | 76°C, 99 % | 17810 / 24576 MB | seqdiffuseq(11141M) xx (6351M) gdm(4M)
3. Training status after about two days:

| grad_norm | 0.026 |
| loss | 0.0315 |
| loss_q0 | 0.0172 |
| loss_q1 | 0.0266 |
| loss_q2 | 0.0354 |
| loss_q3 | 0.046 |
| mse | 0.0304 |
| mse_q0 | 0.0161 |
| mse_q1 | 0.0254 |
| mse_q2 | 0.0341 |
| mse_q3 | 0.0448 |
| samples | 2.53e+07 |
| step | 1.98e+05 |

Code Share

Hey guys, like the other issues, this is about sharing the code for research purposes. Would it be possible for you to send me the code, or do you plan to make the code public in the near future? My email is [email protected]
