
sincerass / powernorm


[ICML 2020] code for "PowerNorm: Rethinking Batch Normalization in Transformers" https://arxiv.org/abs/2003.07845

License: GNU General Public License v3.0

Languages: Python 96.36%, C++ 0.89%, Cuda 1.75%, Shell 1.00%

powernorm's People

Contributors: sincerass


powernorm's Issues

Does PowerNorm still work for the NMT task after removing the GroupScaling layer?

Hi, PN is an interesting piece of work, and the performance reported in the manuscript is exciting.

However, I'm wondering whether PN still works after removing GroupScaling. As described in the manuscript, GroupScaling looks like a trick to improve performance, while it is actually a variant of LayerNorm and probably plays a key role in the architecture.

Would you mind sharing an ablation study that removes GroupScaling from PN?

Comparisons with RMSNorm?

Hi, I have seen LLaMA use RMSNorm in a pre-norm configuration. I read your paper a while ago, and I realize that the forward propagations are nearly the same. Am I missing something?
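For context on the comparison being asked about, here is a minimal RMSNorm sketch (following the published RMSNorm formula, not this repository's code): it rescales by the root mean square over the feature dimension without subtracting the mean, which is the operation the question compares against PowerNorm's forward pass.

    import torch

    def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
        # RMSNorm over the last dimension: no mean subtraction, only RMS rescaling
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
        return x / rms * weight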

A few questions regarding fairseq/modules/norms/mask_powernorm.py

Hi, first of all thank you for your work. I've been spending some time trying to understand what is happening in the script fairseq/modules/norms/mask_powernorm.py, but I've been having some trouble. Could you please answer these questions?

  1. Was 'GroupScaling1D' (starting at line 17) specific to the data or model architecture used in the experiments, rather than a part of the general PowerNorm method?
    Based on my understanding, the input is shaped (T tokens or instances, B batches, C channels), and the operation is a modified layer norm where each value in the input tensor is divided by the mean of the squared values over its group of 4 channels, per batch element and per token (a sketch of this reading is given after these questions). I believe this was not mentioned in the paper.

  2. On these lines in the forward function of PowerFunction:

if current_iter < warmup_iters:
    running_phi.copy_(running_phi * (current_iter-1)/current_iter + var.mean(dim=0, keepdim=True)/current_iter)

running_phi.copy_(afwd*running_phi + (1-afwd)*var.mean(dim=0, keepdim=True))

Since 'var' has shape (1, C, 1, 1), var.mean(dim=0, keepdim=True) is the same tensor as 'var'. Was this intentional, or perhaps an artifact from an earlier version of the code?
Also, did you mean to put an else statement before 'running_phi.copy_(afwd*running_phi + (1-afwd)*var.mean(dim=0, keepdim=True))'?

Thank you, I'd very much appreciate your time.
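For readers following question 1, here is a minimal, self-contained sketch of the group scaling operation as described there (input shaped T x B x C, groups of 4 channels, rescaling by the per-group mean of squares). This paraphrases the question's reading rather than the repository code, and the exact handling of the square root and epsilon should be checked against GroupScaling1D in mask_powernorm.py:

    import torch

    def group_scale(x: torch.Tensor, group_size: int = 4, eps: float = 1e-5) -> torch.Tensor:
        # x: (T, B, C); assumes C is divisible by group_size
        T, B, C = x.shape
        g = x.reshape(T, B, C // group_size, group_size)
        # per-group mean of squared values, computed per token and per batch element
        moment2 = (g * g).mean(dim=-1, keepdim=True)
        # RMS-style rescaling of every value by its group's second moment
        return (g / torch.sqrt(moment2 + eps)).reshape(T, B, C)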

Cannot reproduce the results on IWSLT14.

Hi, I ran your code with different settings but got an unexpected result: the model with PN performs worse than the model with LN.
The results are shown below.
Transformer with LN:

Namespace(beam=5, bpe=None, cpu=False, criterion='cross_entropy', data='data-bin/iwslt14.tokenized.de-en.joined/', dataset_impl=None, decoding_format=None, diverse_beam_groups=-1, diverse_beam_strength=0.5, empty_cache_freq=0, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, gen_subset='test', iter_decode_eos_penalty=0.0, iter_decode_force_max_iter=False, iter_decode_max_iter=10, lazy_load=False, left_pad_source='True', left_pad_target='False', lenpen=1.0, load_alignments=False, log_format='simple', log_interval=1000, lr_scheduler='fixed', lr_shrink=0.1, match_source_len=False, max_len_a=0, max_len_b=200, max_sentences=128, max_source_positions=1024, max_target_positions=1024, max_tokens=None, memory_efficient_fp16=False, min_len=1, min_loss_scale=0.0001, model_overrides='{}', momentum=0.99, nbest=1, no_beamable_mm=False, no_early_stop=False, no_progress_bar=False, no_repeat_ngram_size=0, num_shards=1, num_workers=1, optimizer='nag', path='log/iwslt14_de_en/transformer_iwslt_de_en_v2_layer_layer_layer_layer_warm/averaged_model.pt', prefix_size=0, print_alignment=False, print_step=False, quiet=True, raw_text=False, remove_bpe='@@ ', replace_unk=None, required_batch_size_multiple=8, results_path=None, sacrebleu=False, sampling=False, sampling_topk=-1, sampling_topp=-1.0, score_reference=False, seed=1, shard_id=0, skip_invalid_size_inputs_valid_test=False, source_lang='de', target_lang='en', task='translation', tbmf_wrapper=False, temperature=1.0, tensorboard_logdir='', threshold_loss_scale=None, tokenizer=None, unkpen=0, unnormalized=False, upsample_primary=1, user_dir=None, warmup_updates=0, weight_decay=0.0)
| [de] dictionary: 10152 types
| [en] dictionary: 10152 types
| loaded 6750 examples from: data-bin/iwslt14.tokenized.de-en.joined/test.de-en.de
| loaded 6750 examples from: data-bin/iwslt14.tokenized.de-en.joined/test.de-en.en
| data-bin/iwslt14.tokenized.de-en.joined/ test de-en 6750 examples
| loading model(s) from log/iwslt14_de_en/transformer_iwslt_de_en_v2_layer_layer_layer_layer_warm/averaged_model.pt
| Translated 6750 sentences (148676 tokens) in 105.6s (63.91 sentences/s, 1407.62 tokens/s)
| Generate test with beam=5: BLEU4 = 35.44, 69.6/44.1/30.0/20.7 (BP=0.954, ratio=0.955, syslen=125196, reflen=131156)

Transformer with PN:

Namespace(beam=5, bpe=None, cpu=False, criterion='cross_entropy', data='data-bin/iwslt14.tokenized.de-en.joined/', dataset_impl=None, decoding_format=None, diverse_beam_groups=-1, diverse_beam_strength=0.5, empty_cache_freq=0, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, gen_subset='test', iter_decode_eos_penalty=0.0, iter_decode_force_max_iter=False, iter_decode_max_iter=10, lazy_load=False, left_pad_source='True', left_pad_target='False', lenpen=1.0, load_alignments=False, log_format='simple', log_interval=1000, lr_scheduler='fixed', lr_shrink=0.1, match_source_len=False, max_len_a=0, max_len_b=200, max_sentences=128, max_source_positions=1024, max_target_positions=1024, max_tokens=None, memory_efficient_fp16=False, min_len=1, min_loss_scale=0.0001, model_overrides='{}', momentum=0.99, nbest=1, no_beamable_mm=False, no_early_stop=False, no_progress_bar=False, no_repeat_ngram_size=0, num_shards=1, num_workers=1, optimizer='nag', path='log/iwslt14_de_en/transformer_iwslt_de_en_v2_power_power_power_power_warm/averaged_model.pt', prefix_size=0, print_alignment=False, print_step=False, quiet=True, raw_text=False, remove_bpe='@@ ', replace_unk=None, required_batch_size_multiple=8, results_path=None, sacrebleu=False, sampling=False, sampling_topk=-1, sampling_topp=-1.0, score_reference=False, seed=1, shard_id=0, skip_invalid_size_inputs_valid_test=False, source_lang='de', target_lang='en', task='translation', tbmf_wrapper=False, temperature=1.0, tensorboard_logdir='', threshold_loss_scale=None, tokenizer=None, unkpen=0, unnormalized=False, upsample_primary=1, user_dir=None, warmup_updates=0, weight_decay=0.0)
| [de] dictionary: 10152 types
| [en] dictionary: 10152 types
| loaded 6750 examples from: data-bin/iwslt14.tokenized.de-en.joined/test.de-en.de
| loaded 6750 examples from: data-bin/iwslt14.tokenized.de-en.joined/test.de-en.en
| data-bin/iwslt14.tokenized.de-en.joined/ test de-en 6750 examples
| loading model(s) from log/iwslt14_de_en/transformer_iwslt_de_en_v2_power_power_power_power_warm/averaged_model.pt
| Translated 6750 sentences (148074 tokens) in 122.0s (55.34 sentences/s, 1214.00 tokens/s)
| Generate test with beam=5: BLEU4 = 35.27, 69.6/44.0/29.8/20.6 (BP=0.953, ratio=0.954, syslen=125107, reflen=131156)

Looking forward to your reply.

Why use group scaling?

Hi Sheng,

I see there is a GroupScaling1D operation at the very beginning of the PN layer. If I understand correctly, it scales the input features across the channel dimension. I'm confused about why this is necessary, since the operation does not seem to be mentioned in the paper. Is it a standard way to preprocess features in NLP tasks?

Thanks in advance!

Haotao

Language Modelling code?

Hi, I was wondering if you could make the code for your language modelling experiments available? Thanks.

Different backward implementation from what is written in the paper

Hi.

In my opinion, the backward implementation differs from what is written in the paper.

approx_grad_g = (g - (1 - abkw) * ema_gz * z)
ema_gz.add_((approx_grad_g * z).mean(dim=3, keepdim=True).mean(dim=2, keepdim=True).mean(dim=0, keepdim=True))

In line 90,

approx_grad_g = (g - (1 - abkw) * ema_gz * z)

But, "Algorithm 2" in paper,

[Algorithm 2 from the paper, included as an image in the original issue]

So I expected the intermediate estimated gradient to be implemented like this:

approx_grad_g = (g - ema_gz * z)

If my opinion is correct, line 91 should also be modified like this:

ema_gz.add_((approx_grad_g * z * (1 - abkw)).mean(dim=3, keepdim=True).mean(dim=2, keepdim=True).mean(dim=0, keepdim=True))

In conclusion, I wonder why the implementation and the paper differ.
Is there another reason?

Thanks.

Jumin
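To make the two variants easier to compare side by side, here is a minimal sketch that restates the snippets above as functions (not the repository code; the chained means follow the quoted line 91, and the tensors are whatever the surrounding backward pass supplies):

    def grad_update_as_implemented(g, z, ema_gz, abkw):
        # variant quoted from lines 90-91 of mask_powernorm.py
        approx_grad_g = g - (1 - abkw) * ema_gz * z
        ema_gz.add_((approx_grad_g * z).mean(dim=3, keepdim=True).mean(dim=2, keepdim=True).mean(dim=0, keepdim=True))
        return approx_grad_g

    def grad_update_as_expected_from_paper(g, z, ema_gz, abkw):
        # variant the issue author expects from Algorithm 2 of the paper
        approx_grad_g = g - ema_gz * z
        ema_gz.add_((approx_grad_g * z * (1 - abkw)).mean(dim=3, keepdim=True).mean(dim=2, keepdim=True).mean(dim=0, keepdim=True))
        return approx_grad_g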

Gradient overflow (NaN problem)

Hi,

Thank you for your code! I have a question: I am trying to apply power normalization (PN) to Tacotron 2, but after I changed batch norm (BN) to PN, an overflow occurred after several thousand training steps. When I checked, the MaskPowerNorm class's ema_gz parameter kept shrinking during training and finally became NaN. Do you have any thoughts or a solution?

Thanks,

Heejo

Feature request: improved documentation

Hi! Thank you for releasing the paper code!

I had some issues understanding the implementation that are solved by now. However, I expect that many of the people who decide to use PowerNorm in their projects will face them too. Fixing them would increase the impact of this research and make the lives of some people just a bit easier.

What I propose to do:

  • Indicate the location of the PowerNorm code in the readme
  • Improve docstrings

Fairseq is a pretty big repository, and finding the module you are looking for is like finding a needle in a haystack. Explicitly stating where the module is located right in the readme is an easy solution. Another, even better, solution would be to create a new project that contains only the PowerNorm implementation and the corresponding tests.

Currently, the docstring for MaskPowerNorm

    """
    An implementation of masked batch normalization, used for testing the numerical
    stability.
    """

does not indicate that this is exactly the PowerNorm described in the paper. It is confusing because it gives the impression that this module is only used for testing. After reading the docstring, I spent some extra time searching for a different implementation and verifying that this one is exactly the one used in the experiments.

Initialization parameters are not documented at all, and while some of them (num_features, affine, ...) behave exactly like in nn.BatchNorm1d, others are specific to PowerNorm (alpha_fwd, alpha_bwd, warmup_iters, ...). It is not clear what they do without going back to the paper and reading the source code.

PowerFunction is not documented at all 😞

Could you please add these clarifications to the docstrings? It should not take more than an hour, and it will definitely save time for many people who want to use PowerNorm in their projects.
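As a starting point, here is a rough sketch of what such a docstring could look like. The parameter descriptions are inferred only from the code snippets quoted in the other issues (running_phi, ema_gz, the warmup branch) and from the paper's terminology, so they are assumptions for the maintainers to confirm or correct:

    import torch.nn as nn

    class MaskPowerNorm(nn.Module):
        """PowerNorm, the normalization layer from "PowerNorm: Rethinking Batch
        Normalization in Transformers" (ICML 2020), as used in the paper's
        experiments (not merely a numerical-stability test module).

        Args:
            num_features: size of the channel dimension, as in nn.BatchNorm1d.
            affine: if True, apply a learnable per-channel weight and bias.
            alpha_fwd: forward decay for the running quadratic statistic
                (running_phi) once warmup is over.
            alpha_bwd: backward decay for the running gradient statistic (ema_gz).
            warmup_iters: number of initial iterations during which running_phi is
                accumulated as a simple running average rather than an EMA.
        """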

Question regarding the batch norm vs masked batch norm

The paper mentions that batch normalization can have large fluctuations in the batch statistics. In vanilla BN this occurs because the statistics are computed over inputs of varying lengths padded with zeros. I was wondering whether this fluctuation still occurs in the masked version of BN (where padding is ignored). Additionally, how much of a performance gain can be expected by switching from BN to masked BN? Thanks.
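For readers unfamiliar with the distinction, here is a minimal sketch (not the repository code) of how per-channel statistics can be computed while ignoring padded positions. The tensor layout (a T x B x C input and a B x T pad mask with True marking padding) follows the mask-handling code quoted in the last issue on this page:

    import torch

    def masked_channel_stats(x: torch.Tensor, pad_mask: torch.Tensor):
        # x: (T, B, C); pad_mask: (B, T) with True marking padded positions
        keep = ~pad_mask.transpose(0, 1)           # (T, B), True where tokens are real
        valid = x[keep]                            # (#real tokens, C)
        return valid.mean(dim=0), valid.var(dim=0, unbiased=False)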

ImportError: cannot import name 'libbleu' from 'fairseq'

I have already run "sudo python3 setup.py build develop". The output ends with "Using /usr/local/lib/python3.6/dist-packages/pycparser-2.20-py3.6.egg" and "Finished processing dependencies for fairseq==0.8.0", so I believe this step was successful.
I also tried "pip3 install fairseq".

But I still get the error "cannot import name 'libbleu' from 'fairseq'". Could you give me some clues about it? Thank you very much.

The broken affine parameter and redundant code

Hello:
It seems the affine parameter of MaskPowerNorm in line 109 of mask_powernorm.py is not working properly: the weight and bias are used regardless of the boolean value of affine. The code in lines 86 and 88 of mask_powernorm.py also seems redundant. I think the code in lines 142-154 of mask_powernorm.py could be changed to the following to reduce redundancy.

        # construct the mask_input, size to be (BxL) x C: L is the real length here
        if pad_mask is None:
            mask_input = input.clone()
        else:
            # Transpose the bn_mask (B x T -> T x B)
            bn_mask = ~pad_mask
            bn_mask = bn_mask.transpose(0, 1)
            # pad_size = (~bn_mask).sum()  # is pad_size redundant?
            mask_input = input[bn_mask, :]

For reference, these appear to be the corresponding snippets from the current mask_powernorm.py (the unconditionally registered weight and bias, the gz computation, and the current lines 142-154):

self.affine = affine
self.register_parameter('weight', nn.Parameter(torch.ones(num_features)))
self.register_parameter('bias', nn.Parameter(torch.zeros(num_features)))

gz = (g * z).mean(dim=3).mean(dim=2).mean(dim=0)

# construct the mask_input, size to be (BxL) x C: L is the real length here
if pad_mask is None:
    mask_input = input.clone()
else:
    # Transpose the bn_mask (B x T -> T x B)
    bn_mask = ~pad_mask
    bn_mask = bn_mask.transpose(0, 1)
if pad_mask is not None:
    pad_size = (~bn_mask).sum()
    mask_input = input[bn_mask, :]
else:
    mask_input = input.clone()

Thanks,
grassking100
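On the affine point, here is a minimal, stand-alone sketch (not the repository code) of one common way to make an affine flag actually gate the learnable parameters, mirroring how nn.BatchNorm1d handles it; whether this matches the maintainers' intent is for them to decide:

    import torch
    import torch.nn as nn

    class AffineGateExample(nn.Module):
        # toy module illustrating the conditional-registration pattern
        def __init__(self, num_features: int, affine: bool = True):
            super().__init__()
            self.affine = affine
            if affine:
                self.weight = nn.Parameter(torch.ones(num_features))
                self.bias = nn.Parameter(torch.zeros(num_features))
            else:
                # keep the attributes so forward code and state_dicts stay uniform
                self.register_parameter('weight', None)
                self.register_parameter('bias', None)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return x * self.weight + self.bias if self.affine else x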
