
seqdiffuseq's Introduction

This is the official repo for the paper SeqDiffuSeq: Text Diffusion with Encoder-Decoder Transformers

Parts of our code are modified from the DiffusionLM and minimaldiffusion repos.

Environment

Before running our code, you can set up the environment with the following commands.

conda create -n seqdiffuseq python=3.8
conda install mpi4py
pip install torch==1.10.0+cu111 torchvision==0.11.0+cu111 torchaudio==0.10.0
pip install -r requirements.txt

Preparing dataset

For the non-translation tasks, we follow DiffuSeq for the dataset settings.

For IWSLT14 and WMT14, we follow the data preprocessing from fairseq. We also provide the processed datasets at this link. (Update 04/13/2023: sorry for the missing WMT14 data; it has now been uploaded. Download it from here.)

Training

To run the code, we use iwslt14 en-de as an illustrative example:

  1. Prepare the IWSLT14 data under the ./data/iwslt14/ directory;
  2. Learn the BPE tokenizer with:
python ./tokenizer_utils.py train-byte-level iwslt14 10000
  3. Train with the following commands:
mkdir ckpts
bash ./train_scripts/iwslt_en_de.sh 0 de en
# for en-to-de translation: bash ./train_scripts/iwslt_en_de.sh 0 en de

You may modify the scripts in ./train_scripts for your own training settings.

Inference

After training completes, you can run the following command for inference:

bash ./inference_scrpts/iwslt_inf.sh path-to-ckpts/ema_0.9999_280000.pt path-to-save-results path-to-ckpts/alpha_cumprod_step_260000.npy

The ema_0.9999_280000.pt file contains the model weights, and alpha_cumprod_step_260000.npy is the saved noise schedule. You have to use the most recent .npy schedule file saved before the .pt model weight file.
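
For convenience, here is a small sketch (not part of the repo; the ckpts/ directory and filename patterns are assumed from the examples above) that picks the newest schedule file saved before a given checkpoint:

# A helper sketch (not part of the repo): pick the newest noise-schedule file
# saved before a given EMA checkpoint, following the rule stated above.
import glob
import os
import re

def find_schedule_for_checkpoint(ckpt_path, ckpt_dir="ckpts"):
    # e.g. ema_0.9999_280000.pt -> training step 280000
    ckpt_step = int(re.search(r"_(\d+)\.pt$", ckpt_path).group(1))
    candidates = []
    for path in glob.glob(os.path.join(ckpt_dir, "alpha_cumprod_step_*.npy")):
        step = int(re.search(r"step_(\d+)\.npy$", path).group(1))
        if step < ckpt_step:  # "saved before" the .pt weights
            candidates.append((step, path))
    if not candidates:
        raise FileNotFoundError("no schedule file saved before this checkpoint")
    return max(candidates)[1]  # the most recent schedule preceding the checkpoint

# find_schedule_for_checkpoint("ckpts/ema_0.9999_280000.pt")
# -> "ckpts/alpha_cumprod_step_260000.npy" (if that file exists)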

Other Comments

Note that for all training experiments, we set the maximum training steps to 1000000 and the warmup steps to 10000. For most datasets, there is no need to train until the maximum number of steps: IWSLT14 uses a checkpoint around 300000 training steps, WMT14 around 500000 steps, and the non-translation tasks around 100000 steps.

You can change the hyperparameter settings for your own experiments; increasing the training batch size or modifying the training schedule may bring some improvements.

Citation

If you find our work and codes interesting and useful, please cite:

@article{Yuan2022SeqDiffuSeqTD,
  title={SeqDiffuSeq: Text Diffusion with Encoder-Decoder Transformers},
  author={Hongyi Yuan and Zheng Yuan and Chuanqi Tan and Fei Huang and Songfang Huang},
  journal={ArXiv},
  year={2022},
  volume={abs/2212.10325}
}


seqdiffuseq's Issues

About the paper results.

sacre_results = metric_sacrebleu.compute(predictions=preds, references=refs)

I want to ask whether the result of 29.83 compared with the AR and CNAT baselines is calculated by this code, and whether this result is on the test set or the validation set.

Thanks~
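
For context, a minimal, self-contained sketch of how such a sacreBLEU score can be computed with the evaluate library (the toy preds/refs below are placeholders, not the repo's outputs):

# A self-contained sketch using the `evaluate` library; preds/refs are toy
# placeholders, not outputs from this repo.
import evaluate

metric_sacrebleu = evaluate.load("sacrebleu")
preds = ["the cat sat on the mat"]    # one hypothesis string per example
refs = [["the cat sat on the mat"]]   # a list of reference strings per example
sacre_results = metric_sacrebleu.compute(predictions=preds, references=refs)
print(sacre_results["score"])         # corpus-level BLEU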

Can't run any train script

When I tried to run the training script, I was reminded that mpi4py was missing, so I installed mpi4py

pip install mpi4py

Looking in indexes: http://mirrors.aliyun.com/pypi/simple
Collecting mpi4py
  Using cached http://mirrors.aliyun.com/pypi/packages/2e/1a/1393e69df9cf7b04143a51776727dd048586781bca82543594ab439e2eb4/mpi4py-3.1.5.tar.gz (2.5 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: mpi4py
  Building wheel for mpi4py (pyproject.toml) ... done
  Created wheel for mpi4py: filename=mpi4py-3.1.5-cp38-cp38-linux_x86_64.whl size=6024408 sha256=64ef1c54d03ecb2c862c4e57da02d6dd8d9e33673ad3948afafca08d60edfd64
  Stored in directory: /root/.cache/pip/wheels/9d/2a/7e/c6575a1d595c7d8cce796177f1b9827975c5b48b31e28f25b9
Successfully built mpi4py
Installing collected packages: mpi4py
Successfully installed mpi4py-3.1.5

Then I re-ran the training script, and there was no output at all.

bash ./train_scripts/iwslt_en_de.sh 0 de en

I waited for a while, but the program still didn't output anything. I don't know what's wrong. My operating system is Ubuntu. Is it possible that it's an MPI problem?

About Inference Problem

Hello! Thanks for your code sharing.
When I run your default inference code:

bash ./inference_scrpts/iwslt_inf.sh path-to-ckpts/ema_0.9999_280000.pt path-to-save-results path-to-ckpts/alpha_cumprod_step_260000.npy

I met a problem:

created 6800 samples
sampling complete
Traceback (most recent call last):
File "inference_main.py", line 210, in
main()
File "inference_main.py", line 159, in main
write_outputs(args=args,
File "inference_main.py", line 192, in write_outputs
with open(output_file_basepath, "w") as text_fout:
FileNotFoundError: [Errno 2] No such file or directory: 'path-to-ckpts/ema_0.9999_280000.pt.samples_6750.steps-2000.clamp-no_clamp-normal_10708.txt'

So how can I find the txt file?
I would appreciate it if you could share your thoughts.

About code share

Hello! I am a student working on diffusion and seq2seq models. I am very interested in your work and code. If possible, could you share the code with me? My email is: [email protected]

Data for wmt14 is missing

Hi, thanks for your great work. I would like to train your model on WMT14, but I haven't found the data in the Google Drive link provided in this repo. Would you please double-check it? Thanks.

Issues of reproducing Table 1 results on Commonsense Conversation Dataset (CCD)

Hi, I tried to use your script (ccd.sh) to reproduce the Table 1 results on the Commonsense Conversation Dataset, but my reproduced results (BLEU: 0.154, Rouge-L: 6.38) are far below your reported values (BLEU: 1.02, Rouge-L: 8.59). Could you check whether the hyperparameters in ccd.sh are the optimal ones you used? It would also help if you could provide the evaluation scripts for computing BLEU and Rouge-L (currently the inference scripts only save the test outputs, with no metric evaluation, if I ran them correctly). Besides, are any model checkpoints and test outputs available?

The training recipe for SeqDiffuSeq without adaptive noise schedule

Hi, I tried to reproduce your result in Table 3, especially SeqDiffuSeq without the adaptive noise schedule. Here is how I did it:

  1. I removed these two lines:

    self.update_time_discretized_parameters(interp_alpha_cumprod)

    self.update_time_discretized_parameters(alphas_cumprod)

    After removing the first line, the noise schedule is not updated during training. After removing the second line, the model does not load a saved noise schedule and only uses the pre-defined one, i.e. the one in the init function.

  2. I used your default training and sampling scripts (sqrt noise schedule with uniform timestep sampling) and ran sampling with different checkpoints, e.g. the 280K and 400K checkpoints.

However, my best result is about 18, which is far from your reported number of 28.94. May I ask what I should change to get the reported number? BTW, I can reproduce the result of the full SeqDiffuSeq.
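
For reference, a rough sketch of a fixed (non-adaptive) sqrt schedule in the style of DiffusionLM; this is an assumption about the pre-defined schedule, not code taken from this repo:

# A rough sketch of a fixed sqrt noise schedule (assumption: DiffusionLM-style
# alpha_bar(t) = 1 - sqrt(t + 0.0001), discretized into betas; not this repo's code).
import numpy as np

def sqrt_alphas_cumprod(num_steps=2000, max_beta=0.999):
    alpha_bar = lambda t: 1.0 - np.sqrt(t + 0.0001)
    betas = []
    for i in range(num_steps):
        t1, t2 = i / num_steps, (i + 1) / num_steps
        betas.append(min(1.0 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return np.cumprod(1.0 - np.array(betas))

# e.g. np.save("alpha_cumprod_fixed.npy", sqrt_alphas_cumprod())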

The self-conditioning part

I applied this technique by modifying my code, but I get the following error:
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by
making sure all forward function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable)

Can you give me a hint? Is it about using 0 half the time and the previous prediction's Zt the other half?

Thanks!
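
For reference, a generic self-conditioning sketch (an assumption about the usual recipe, not this repo's exact code; the model signature is hypothetical). Running the first pass under torch.no_grad() is also the usual way to avoid the DDP unused-parameter error quoted above:

# A generic self-conditioning sketch (assumed model signature, not this repo's code).
# Half the time the model receives zeros as the self-condition; the other half it
# receives its own detached previous prediction of x0.
import torch

def loss_with_self_conditioning(model, x_t, t, x_start):
    prev_pred = torch.zeros_like(x_t)
    if torch.rand(()) < 0.5:
        with torch.no_grad():  # first pass is detached, so DDP only tracks the second pass
            prev_pred = model(x_t, t, self_cond=prev_pred)
    pred_x0 = model(x_t, t, self_cond=prev_pred)  # trained pass
    return torch.mean((pred_x0 - x_start) ** 2)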

Question about EMA

Hi,

I'm confused about the EMA implementation. Could you tell me whether my understanding is correct?

There are two ways to do EMA. The first one is

1. after optimizer.step() you obtain model_new
2. ema_model_new = a * ema_model_old + (1-a) * model_new   # a is ema_decay
3. model_new = copy(ema_model_new)

The second one is:

1. after optimizer.step() you obtain model_new
2. ema_model_new = a * ema_model_old + (1-a) * model_new   # a is ema_decay

The difference between these two ways is:

  • The first one uses the new ema_model parameters to initialize the model for the next step. By convergence, ema_model and model should be very similar, so we can use either for sampling.
  • The second one just stores the ema_model and uses it for sampling. The update of ema_model does not affect the model.

May I ask which one you implement? I assume it is the second one, since the training outputs don't change when I change ema_decay.
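
For illustration, a minimal sketch of the second variant described above, keeping a shadow copy that is updated after each optimizer step (names are illustrative, not taken from the repo):

# A minimal sketch of the shadow-copy EMA variant (assumed names, not repo code).
import copy
import torch

@torch.no_grad()
def update_ema(ema_model, model, decay=0.9999):
    # ema <- decay * ema + (1 - decay) * current weights; the live model is untouched
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

# Typical usage inside the training loop:
#   ema_model = copy.deepcopy(model)
#   ... optimizer.step() ...
#   update_ema(ema_model, model)   # model weights are not overwritten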

Inference Problem

Hello, I would like to ask how to solve this problem. When I run the code for inference, I get:
Traceback (most recent call last):
File "inference_main.py", line 209, in <module>
main()
File "inference_main.py", line 124, in main
if args.decoder_attention_mask:
AttributeError: 'Namespace' object has no attribute 'decoder_attention_mask'
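
A possible workaround sketch (an assumption, not an official fix): read the flag defensively so an args namespace that lacks it does not crash:

# A workaround sketch (assumption, not an official fix): read the flag defensively
# so an args namespace that was created without it does not crash.
from argparse import Namespace

args = Namespace()  # stands in for an args object missing the flag
decoder_attention_mask = getattr(args, "decoder_attention_mask", False)
print(decoder_attention_mask)  # False instead of AttributeError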

'TransformerNetModel_encoder_decoder' object has no attribute 'module'

Hi authors,

Thank you for such a well written repository! Really appreciate researchers like you all.

We are interested in your work and are trying to run your code on our data. However, we are encountering this issue. While we dig into this, could you please also take a look (since you would be more well-versed)?

FWIW, we have not changed anything except the dataloader_utils.py file with code to load in our data.

Thank you!

Traceback (most recent call last):
File "main.py", line 107, in
main()
File "main.py", line 95, in main
warmup=args.warmup,
File "SeqDiffuSeq/trainer.py", line 177, in run_loop
self.run_step(batch, cond)
File "SeqDiffuSeq/trainer.py", line 196, in run_step
self.forward_backward(batch, cond)
File "SeqDiffuSeq/trainer.py", line 265, in forward_backward
losses = compute_losses()
File "SeqDiffuSeq/src/modeling/diffusion/respace.py", line 98, in training_losses
return super().training_losses(self._wrap_model(model), *args, **kwargs)
File "SeqDiffuSeq/src/modeling/diffusion/gaussian_diffusion.py", line 376, in training_losses
x_start_mean = model.model.module.get_embeds(input_ids)
File ".pyenv/versions/3.7.14/lib/python3.7/site-packages/torch/nn/modules/module.py", line 948, in getattr
type(self).name, name))
AttributeError: 'TransformerNetModel_encoder_decoder' object has no attribute 'module'
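
One plausible cause (an assumption, not a confirmed diagnosis) is that .module only exists when the network is wrapped in DistributedDataParallel. A defensive sketch:

# A defensive sketch (assumption, not the repo's official fix): `.module` exists only
# when the network is wrapped in DistributedDataParallel, so fall back to the bare module.
import torch.nn as nn

def unwrap_ddp(module: nn.Module) -> nn.Module:
    return getattr(module, "module", module)

# e.g. x_start_mean = unwrap_ddp(model.model).get_embeds(input_ids)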

Checkpoints

Thanks for the work! For all datasets, could you upload the checkpoints used for the results in the paper?

Initialize model with pre-trained BART

Hi @Yuanhy1997 ,

Thanks for your great work.

I wonder if you try to use the weights of pre-trained BART to initialize your model. I didn't find these results in the paper while I can see you have a knob to keep pre-trained weights in the code.

Best,
Chiyu

the meaning of loss name in the output

Hi,

when I run your code, I get the output like:

| loss      | 0.0201   |
| loss_q0   | 0.0202   |
| loss_q1   | 0.0201   |
| loss_q2   | 0.0202   |
| loss_q3   | 0.0199   |
| mse       | 0.0201   |
| mse_q0    | 0.0202   |
| mse_q1    | 0.0201   |
| mse_q2    | 0.0202   |
| mse_q3    | 0.0199   |

What is the difference between loss and loss_qi for i in [0, 3]?
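
For context, an assumption based on the DiffusionLM/improved-diffusion logging style this repo builds on (not a confirmed answer): loss is the average over all sampled timesteps in the batch, while loss_qi restricts the average to samples whose timestep falls in the i-th quartile of [0, T). A sketch of that bookkeeping:

# A sketch (assumption, mirroring improved-diffusion-style logging) of how per-quartile
# losses can be derived from per-sample losses and their sampled timesteps.
import torch

def loss_quartile_stats(losses: torch.Tensor, ts: torch.Tensor, num_timesteps: int):
    stats = {"loss": losses.mean().item()}
    quartiles = (4 * ts.float() / num_timesteps).long().clamp_max(3)
    for q in range(4):
        sel = losses[quartiles == q]
        if sel.numel() > 0:
            stats[f"loss_q{q}"] = sel.mean().item()
    return stats

# e.g. loss_quartile_stats(per_sample_loss, sampled_t, num_timesteps=2000)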

Question about decoder_nll loss

Hi,

thank you again for your clean code. Here I have a question about the decoder_nll loss.

According to your code, you calculate the decoder_nll loss in this way:

x_start_mean = model.model.module.get_embeds(input_ids)
x_start = self.get_x_start(x_start_mean, std)
decoder_nll = self.token_discrete_loss(x_start, get_logits, input_ids, mask=loss_mask)

def token_discrete_loss(self, x_t, get_logits, input_ids, mask=None):
    logits = get_logits(x_t)  # bsz, seqlen, vocab
    loss_fct = th.nn.CrossEntropyLoss(reduction="none", ignore_index=-100)
    decoder_nll = loss_fct(logits.view(-1, logits.size(-1)), input_ids.view(-1)).view(
        input_ids.shape
    )
    if mask is not None:
        decoder_nll[mask == 0] = 0
        decoder_nll = decoder_nll.sum(dim=-1) / mask.sum(dim=-1)
    else:
        decoder_nll = decoder_nll.mean(dim=-1)
    return decoder_nll

which means that x_start is the noisy word embedding. You calculate the cross entropy between the noisy word embedding and input_ids. However, in the diffusion_lm, "We now describe the inverse process of rounding a predicted x0 back to discrete text" (second sentence in Section 4.2). It seems they use the predicted x_start ( model_output) rather than the noisy x_start.

I know the original code also implements it in this way, but it confuses me. Why don't we replace x_start in the function self.token_discrete_loss with model_out? The noisy word embedding x_start should be very close to the original word embedding, since at the beginning we only add a little noise, so we don't need to calculate its loss. Instead, we should make sure the predicted x_start (model_out) is close to the word embedding.

problems met during inference

Hi,

I could successfully train a model on IWSLT De-En. Now I have some questions about inference:

  1. decoder_attention_mask is not in the args. Should I just comment it out?
    if args.decoder_attention_mask:
  2. How could I get the BLEU score after I run
    bash ./inference_scrpts/iwslt_inf.sh path-to-ckpts/ema_0.9999_280000.pt path-to-save-results path-to-ckpts/alpha_cumprod_step_260000.npy
  3. It seems you removed the code for DDIM. Do I have to keep the number of timesteps for inference the same as the one used for training?

About the decoding of Tokenizer.

Hi, when I run decoding with the trained BPE tokenizer, I found that the 'skip_special_tokens' parameter does not work, so the final generated text includes '</s>', '<s>', '<pad>', etc.

Did you meet this problem? I know the 'replace' function can be used to strip them, but is that really reasonable?

Thanks for sharing your code!
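
As a simple post-processing workaround (a sketch, assuming the special-token strings listed above), the tokens can be stripped after decoding:

# A post-processing sketch: remove the special-token strings mentioned above
# from decoded text and collapse leftover whitespace.
import re

SPECIAL_TOKENS = ["<s>", "</s>", "<pad>"]

def strip_special_tokens(text: str) -> str:
    for tok in SPECIAL_TOKENS:
        text = text.replace(tok, "")
    return re.sub(r"\s+", " ", text).strip()

print(strip_special_tokens("<s> das ist ein test </s> <pad> <pad>"))  # "das ist ein test"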

most recent .npy schedule file saved before .pt model weight file

You have to use the most recent .npy schedule file saved before .pt model weight file.
Does this sentence mean that the .pt weight file's step must be greater than the .npy noise schedule file's step? Can't you use the same time step for both? What happens if you run the command: bash ./inference_scrpts/iwslt_inf.sh path-to-ckpts/ema_0.9999_280000.pt path-to-save-results path-to-ckpts/alpha_cumprod_step_280000.npy

Question about training time and whether multi-GPU training is possible

Hello, roughly how long did it take to train 100k steps with a V100? Is there a multi-GPU training version so that training can run faster on a 3090? Forgive my limited knowledge; I have not been able to solve this myself. I would be very grateful for a reply. Salute!
1. The training command is: bash ./train_scripts/iwslt_en_de.sh 6 de en
2. The GPU memory usage is shown below:
[0] NVIDIA GeForce RTX 3090 | 62°C, 100 % | 8085 / 24576 MB | seqdiffuseq(253M) seqdiffuseq(253M) seqdiffuseq(253M) seqdiffuseq(651M) xx (6357M) gdm(4M)
[6] NVIDIA GeForce RTX 3090 | 76°C, 99 % | 17810 / 24576 MB | seqdiffuseq(11141M) xx (6351M) gdm(4M)
3. Training status after about two days:

| grad_norm | 0.026 |
| loss | 0.0315 |
| loss_q0 | 0.0172 |
| loss_q1 | 0.0266 |
| loss_q2 | 0.0354 |
| loss_q3 | 0.046 |
| mse | 0.0304 |
| mse_q0 | 0.0161 |
| mse_q1 | 0.0254 |
| mse_q2 | 0.0341 |
| mse_q3 | 0.0448 |
| samples | 2.53e+07 |
| step | 1.98e+05 |

Code Share

Hey guys, like the other issues, this is about sharing the code for research purposes. Would it be possible for you to send me the code, or do you plan to make the code public in the near future? My email is [email protected]
