sincerass / powernorm Goto Github PK


[ICML 2020] code for "PowerNorm: Rethinking Batch Normalization in Transformers" https://arxiv.org/abs/2003.07845

License: GNU General Public License v3.0

Python 96.36% C++ 0.89% Cuda 1.75% Shell 1.00%


powernorm's Issues

Does PowerNorm still work for NMT task after removing the GroupScaling layer?

Hi, PN is an interesting work and the performance reported in the manuscript is exciting.

However, I'm wondering whether PN still works after removing GroupScaling. As described in the manuscript, GroupScaling seems like a trick to improve performance, but it is actually a variant of LayerNorm and probably plays a key role in the architecture.

Would you mind showing an ablation study that removes GroupScaling from PN?

Comparisons with RMSNorm?

Hi, I have seen LLaMA using RMSNorm in a pre-norm manner. I read your paper a long time ago, and I now realize that the forward passes are nearly the same. Am I missing something?
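
For readers comparing the two forward passes, a minimal sketch in plain Python may help. It assumes RMSNorm divides each token by the root mean square over its own channels, and that PowerNorm (as described in the paper) divides each channel by a quadratic mean gathered across the batch. The function names are hypothetical; the learnable scale and bias, the running statistics, and GroupScaling are all left out, so this is not the repository's code.

```python
import math

def rms_norm(x, eps=1e-5):
    # x: list of tokens, each token a list of channel values.
    # Assumption: statistics are taken per token, across channels.
    out = []
    for token in x:
        rms = math.sqrt(sum(v * v for v in token) / len(token) + eps)
        out.append([v / rms for v in token])
    return out

def power_norm_batch_stats(x, eps=1e-5):
    # Assumption: statistics are taken per channel, across all tokens in the batch.
    n_tokens, n_channels = len(x), len(x[0])
    psi = [
        math.sqrt(sum(x[t][c] ** 2 for t in range(n_tokens)) / n_tokens + eps)
        for c in range(n_channels)
    ]
    return [[x[t][c] / psi[c] for c in range(n_channels)] for t in range(n_tokens)]

x = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
per_token = rms_norm(x)
per_channel = power_norm_batch_stats(x)
```

Under these assumptions the formulas look alike but reduce over different axes: `rms_norm` leaves every token with a mean square of about 1, while `power_norm_batch_stats` leaves every channel with a mean square of about 1.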

A few questions regarding fairseq/modules/norms/mask_powernorm.py

Hi, first of all, thank you for your work. I've been spending some time trying to understand what happens in fairseq/modules/norms/mask_powernorm.py, but I've been having some trouble. Could you please answer these questions?

Was 'GroupScaling1D' (starting at line 17) specific to the data or model architecture used for the experiments, and not necessarily part of the general PowerNorm method?
Based on my understanding, the input is supposed to be shaped (T tokens or instances, B batches, C channels). It seems to be a modified layer norm in which each value in the input tensor is divided by the mean of the squared values across each group of 4 channels, for each batch and each token. I believe this was not mentioned in the paper.
On these few lines in the forward function of PowerFunction:

    if current_iter < warmup_iters:
        running_phi.copy_(running_phi * (current_iter-1)/current_iter + var.mean(dim=0, keepdim=True)/current_iter)

    running_phi.copy_(afwd*running_phi + (1-afwd)*var.mean(dim=0, keepdim=True))

Since 'var' is (1,C,1,1), var.mean(dim=0, keepdim=True) is the same tensor as 'var'. Was this intentional, or perhaps an artifact from an earlier version of the code?
Also, did you mean to put an else statement before 'running_phi.copy_(afwd*running_phi + (1-afwd)*var.mean(dim=0, keepdim=True))'?

Thank you, I'd very much appreciate your time.
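
To make the last question concrete, here is a small sketch of the two readings of those lines, using a single float in place of the (1,C,1,1) tensor. The function name and the `use_else` flag are hypothetical and exist only for this comparison; which behavior the authors intended is not stated in the issue.

```python
def update_running_phi(running_phi, batch_phi, current_iter, warmup_iters, afwd, use_else):
    # Scalars stand in for the tensors; batch_phi plays the role of `var`.
    in_warmup = current_iter < warmup_iters
    if in_warmup:
        # Cumulative average used during warmup.
        running_phi = running_phi * (current_iter - 1) / current_iter + batch_phi / current_iter
    if not (use_else and in_warmup):
        # Exponential moving average. Without an `else`, this also runs during warmup.
        running_phi = afwd * running_phi + (1 - afwd) * batch_phi
    return running_phi

# During warmup the two readings disagree.
as_quoted = update_running_phi(1.0, 3.0, current_iter=2, warmup_iters=10, afwd=0.9, use_else=False)
with_else = update_running_phi(1.0, 3.0, current_iter=2, warmup_iters=10, afwd=0.9, use_else=True)

# After warmup they agree, because only the moving average runs.
late_quoted = update_running_phi(1.0, 3.0, current_iter=20, warmup_iters=10, afwd=0.9, use_else=False)
late_else = update_running_phi(1.0, 3.0, current_iter=20, warmup_iters=10, afwd=0.9, use_else=True)
```

In this toy setting the quoted form applies both updates during warmup (about 2.1), while the `else` form applies only the cumulative average (2.0); after warmup both give about 1.2.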

Cannot reproduce the results on IWSLT14.

Hi, I ran your code with different settings but got unexpected results: the model with PN performs worse than the model with LN.
The results are as follows:
Transformer with LN:

Namespace(beam=5, bpe=None, cpu=False, criterion='cross_entropy', data='data-bin/iwslt14.tokenized.de-en.joined/', dataset_impl=None, decoding_format=None, diverse_beam_groups=-1, diverse_beam_strength=0.5, empty_cache_freq=0, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, gen_subset='test', iter_decode_eos_penalty=0.0, iter_decode_force_max_iter=False, iter_decode_max_iter=10, lazy_load=False, left_pad_source='True', left_pad_target='False', lenpen=1.0, load_alignments=False, log_format='simple', log_interval=1000, lr_scheduler='fixed', lr_shrink=0.1, match_source_len=False, max_len_a=0, max_len_b=200, max_sentences=128, max_source_positions=1024, max_target_positions=1024, max_tokens=None, memory_efficient_fp16=False, min_len=1, min_loss_scale=0.0001, model_overrides='{}', momentum=0.99, nbest=1, no_beamable_mm=False, no_early_stop=False, no_progress_bar=False, no_repeat_ngram_size=0, num_shards=1, num_workers=1, optimizer='nag', path='log/iwslt14_de_en/transformer_iwslt_de_en_v2_layer_layer_layer_layer_warm/averaged_model.pt', prefix_size=0, print_alignment=False, print_step=False, quiet=True, raw_text=False, remove_bpe='@@ ', replace_unk=None, required_batch_size_multiple=8, results_path=None, sacrebleu=False, sampling=False, sampling_topk=-1, sampling_topp=-1.0, score_reference=False, seed=1, shard_id=0, skip_invalid_size_inputs_valid_test=False, source_lang='de', target_lang='en', task='translation', tbmf_wrapper=False, temperature=1.0, tensorboard_logdir='', threshold_loss_scale=None, tokenizer=None, unkpen=0, unnormalized=False, upsample_primary=1, user_dir=None, warmup_updates=0, weight_decay=0.0)
| [de] dictionary: 10152 types
| [en] dictionary: 10152 types
| loaded 6750 examples from: data-bin/iwslt14.tokenized.de-en.joined/test.de-en.de
| loaded 6750 examples from: data-bin/iwslt14.tokenized.de-en.joined/test.de-en.en
| data-bin/iwslt14.tokenized.de-en.joined/ test de-en 6750 examples
| loading model(s) from log/iwslt14_de_en/transformer_iwslt_de_en_v2_layer_layer_layer_layer_warm/averaged_model.pt
| Translated 6750 sentences (148676 tokens) in 105.6s (63.91 sentences/s, 1407.62 tokens/s)
| Generate test with beam=5: BLEU4 = 35.44, 69.6/44.1/30.0/20.7 (BP=0.954, ratio=0.955, syslen=125196, reflen=131156)

Transformer with PN:

Namespace(beam=5, bpe=None, cpu=False, criterion='cross_entropy', data='data-bin/iwslt14.tokenized.de-en.joined/', dataset_impl=None, decoding_format=None, diverse_beam_groups=-1, diverse_beam_strength=0.5, empty_cache_freq=0, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, gen_subset='test', iter_decode_eos_penalty=0.0, iter_decode_force_max_iter=False, iter_decode_max_iter=10, lazy_load=False, left_pad_source='True', left_pad_target='False', lenpen=1.0, load_alignments=False, log_format='simple', log_interval=1000, lr_scheduler='fixed', lr_shrink=0.1, match_source_len=False, max_len_a=0, max_len_b=200, max_sentences=128, max_source_positions=1024, max_target_positions=1024, max_tokens=None, memory_efficient_fp16=False, min_len=1, min_loss_scale=0.0001, model_overrides='{}', momentum=0.99, nbest=1, no_beamable_mm=False, no_early_stop=False, no_progress_bar=False, no_repeat_ngram_size=0, num_shards=1, num_workers=1, optimizer='nag', path='log/iwslt14_de_en/transformer_iwslt_de_en_v2_power_power_power_power_warm/averaged_model.pt', prefix_size=0, print_alignment=False, print_step=False, quiet=True, raw_text=False, remove_bpe='@@ ', replace_unk=None, required_batch_size_multiple=8, results_path=None, sacrebleu=False, sampling=False, sampling_topk=-1, sampling_topp=-1.0, score_reference=False, seed=1, shard_id=0, skip_invalid_size_inputs_valid_test=False, source_lang='de', target_lang='en', task='translation', tbmf_wrapper=False, temperature=1.0, tensorboard_logdir='', threshold_loss_scale=None, tokenizer=None, unkpen=0, unnormalized=False, upsample_primary=1, user_dir=None, warmup_updates=0, weight_decay=0.0)
| [de] dictionary: 10152 types
| [en] dictionary: 10152 types
| loaded 6750 examples from: data-bin/iwslt14.tokenized.de-en.joined/test.de-en.de
| loaded 6750 examples from: data-bin/iwslt14.tokenized.de-en.joined/test.de-en.en
| data-bin/iwslt14.tokenized.de-en.joined/ test de-en 6750 examples
| loading model(s) from log/iwslt14_de_en/transformer_iwslt_de_en_v2_power_power_power_power_warm/averaged_model.pt
| Translated 6750 sentences (148074 tokens) in 122.0s (55.34 sentences/s, 1214.00 tokens/s)
| Generate test with beam=5: BLEU4 = 35.27, 69.6/44.0/29.8/20.6 (BP=0.953, ratio=0.954, syslen=125107, reflen=131156)

Looking forward to your reply.

a question about the image of layer normalization in README.md

The picture of LN in README.md looks like this:

In my understanding, LN should cover a whole layer instead of just one line of a layer. Am I wrong somewhere?

Why use group scaling?

Hi Sheng,

I see there is a GroupScaling1D operation at the very beginning of the PN layer. If I understand correctly, the GroupScaling1D operation scales the input feature across the channel dimension. I'm kind of confused why this is necessary. Seems this operation is not mentioned in the paper. Is it a standard way to preprocess features in NLP tasks?

Thanks in advance!

Haotao
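
For readers following the GroupScaling questions, a rough sketch of the operation as these issues describe it: the channels of each token are split into groups, and every value is divided by a second-moment statistic of its own group. Two details here are assumptions of this sketch, not confirmed by the thread: that the divisor is the square root of the group's mean square, and that `group_num` counts groups rather than channels per group. The function name is hypothetical; see GroupScaling1D in mask_powernorm.py for the actual code.

```python
import math

def group_scaling(token, group_num=4, eps=1e-5):
    # token: the C channel values at one (token, batch) position.
    channels = len(token)
    assert channels % group_num == 0, "C must be divisible by group_num"
    size = channels // group_num
    out = []
    for g in range(group_num):
        group = token[g * size:(g + 1) * size]
        # Second moment of the group; no mean is subtracted.
        moment2 = sum(v * v for v in group) / size
        scale = math.sqrt(moment2 + eps)
        out.extend(v / scale for v in group)
    return out

scaled = group_scaling([1.0, 1.0, 2.0, 2.0, 10.0, -10.0, 0.5, 0.5], group_num=4)
```

In this sketch the statistics come from a single token rather than from the batch, which is the reason several commenters compare the operation to LayerNorm.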

Language Modelling code?

Hi, I was wondering if you could make the code for your language modelling experiments available? Thanks.

Different backward implementation from the content written in paper

Hi.

In my opinion, the backward implementation differs from what is written in the paper.

powernorm/fairseq/modules/norms/mask_powernorm.py

Lines 90 to 91 in bffa7dd

    
        approx_grad_g = (g - (1 - abkw) * ema_gz * z)
        ema_gz.add_((approx_grad_g * z).mean(dim=3, keepdim=True).mean(dim=2, keepdim=True).mean(dim=0, keepdim=True))

In line 90,

    approx_grad_g = (g - (1 - abkw) * ema_gz * z)

But "Algorithm 2" in the paper is written differently.

So I expected the intermediate estimated gradient to be implemented like this:

    approx_grad_g = (g - ema_gz * z)

If my opinion is correct, line 91 should also be modified like this:

    ema_gz.add_((approx_grad_g * z * (1 - abkw)).mean(dim=3, keepdim=True).mean(dim=2, keepdim=True).mean(dim=0, keepdim=True))

In conclusion, I wonder why your implementation differs from what is written in the paper.
Is there another reason?

Thanks.

Jumin
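
One observation that may help here: the two forms quoted in this issue appear to produce the same gradients if the stored statistic is simply rescaled by (1 - abkw). The sketch below checks this numerically with scalars standing in for the tensors and their mean reductions. It assumes both versions start from ema_gz = 0, and the function names are hypothetical. It compares only the two snippets from this issue and says nothing about whether either one matches Algorithm 2 in the paper.

```python
def repo_form(gs, zs, abkw):
    # Form quoted from lines 90 to 91 of mask_powernorm.py.
    ema_gz, grads = 0.0, []
    for g, z in zip(gs, zs):
        approx_grad_g = g - (1 - abkw) * ema_gz * z
        ema_gz += approx_grad_g * z
        grads.append(approx_grad_g)
    return grads, ema_gz

def proposed_form(gs, zs, abkw):
    # Form proposed in this issue.
    ema_gz, grads = 0.0, []
    for g, z in zip(gs, zs):
        approx_grad_g = g - ema_gz * z
        ema_gz += approx_grad_g * z * (1 - abkw)
        grads.append(approx_grad_g)
    return grads, ema_gz

gs = [0.5, -1.0, 0.25, 2.0]   # made-up upstream gradients
zs = [1.0, 0.5, -2.0, 0.3]    # made-up normalized activations
abkw = 0.9

repo_grads, repo_ema = repo_form(gs, zs, abkw)
prop_grads, prop_ema = proposed_form(gs, zs, abkw)
```

With these inputs the gradients agree step by step, and the final statistics differ only by the factor (1 - abkw). If that holds in general, the difference would matter mainly for how a saved ema_gz buffer is interpreted.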

Gradient overflow (NaN problem)

Hi,

Thank you for your code! I have a question. I am trying to apply power normalization (PN) to Tacotron2. However, after I changed batch norm (BN) to PN, an overflow occurred after several thousand training steps. When I checked, the ema_gz parameter of the MaskPowerNorm class became smaller and smaller during training and finally became NaN. Do you have any suggestions or a solution?

Thanks,

Heejo

Feature request: improved documentation

Hi! Thank you for releasing the paper code!

I had some issues understanding the implementation that are solved by now. However, I expect that many of the people who decide to use PowerNorm in their projects will face them too. Fixing them would increase the impact of this research and make lives of some people just a bit easier.

What I propose to do:

Indicate the location of the PowerNorm code in the readme
Improve docstrings

Fairseq is a pretty big repository, and finding the module you are looking for is like finding a needle in a haystack. Explicitly showing where the module is located, right in the readme, is an easy solution. Another, even better, solution would be to create a new project that contains only the PowerNorm implementation and the corresponding tests.

Currently, the docstring for MaskPowerNorm

    """
    An implementation of masked batch normalization, used for testing the numerical
    stability.
    """

does not indicate that this is exactly the PowerNorm described in the paper. This is confusing, because it gives the impression that the module is only used for testing. After reading the docstring, I spent some extra time searching for a different implementation and verifying that this one is exactly the one used in the experiments.

Initialization parameters are not documented at all and while some of them -- num_features, affine, ... behave exactly like in nn.BatchNorm1d, the others are specific to PowerNorm (alpha_fwd, alpha_bwd, warmup_iters, ...). It is not clear what they do without going back to the paper and reading the source code.

PowerFunction is not documented at all 😞

Could you please add these clarifications to the docstrings? It should not take more than an hour, and it will definitely save time for many people wanting to use PowerNorm in their projects.

Question regarding the batch norm vs masked batch norm

The paper mentions that batch normalization can have large fluctuations in the batch statistics. This occurs in vanilla BN because it calculates the statistics over input of varying lengths padded with 0. I was wondering whether this fluctuation still occurs in the masked version of BN (where padding is ignored). Additionally, how much of a performance gain can be expected by switching from BN to masked BN? Thanks.
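
As a small illustration of the padding effect described above (not an answer to the performance question), the sketch below compares per-channel statistics computed over all positions with statistics computed over real tokens only. The data is made up, the helper name is hypothetical, and `pad_mask` follows the convention from mask_powernorm.py as quoted in these issues, where True marks padding.

```python
def channel_stats(values):
    # Mean and (biased) variance of one channel over the given positions.
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var

# Two sequences with one channel, padded with 0.0 to length 4.
batch = [[2.0, 4.0, 6.0, 8.0], [5.0, 0.0, 0.0, 0.0]]
pad_mask = [[False, False, False, False], [False, True, True, True]]

all_positions = [v for seq in batch for v in seq]
real_positions = [
    v
    for seq, mask in zip(batch, pad_mask)
    for v, is_pad in zip(seq, mask)
    if not is_pad
]

unmasked_mean, unmasked_var = channel_stats(all_positions)
masked_mean, masked_var = channel_stats(real_positions)
```

Here the padded zeros pull the mean down and inflate the variance, and the size of that effect depends on how much padding a batch happens to contain. Whether the masked statistics still fluctuate from batch to batch in real training is a separate question this sketch does not address.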

ImportError: cannot import name 'libbleu' from 'fairseq'

I have already run "sudo python3 setup.py build develop". The output ends with "Using /usr/local/lib/python3.6/dist-packages/pycparser-2.20-py3.6.egg
Finished processing dependencies for fairseq==0.8.0", so I believe this step succeeded.
I also tried "pip3 install fairseq".

But I still get the error "cannot import name 'libbleu' from 'fairseq'". Could you give me some clues about it? Thank you very much.

PowerNorm link broken

Hi, the PowerNorm link in the readme is broken.

Is MaskPowerNorm the powernorm proposed by the paper?

Hello,

Thanks for sharing.
I'm just confused about the MaskPowerNorm class: it implements masking in the GitHub code, but masking is not mentioned in the paper.

The broken affine parameter and redundant code

Hello,
It seems the affine parameter of MaskPowerNorm on line 109 of mask_powernorm.py is not working properly: the weight and bias are used regardless of the boolean value of affine. The code on lines 86 and 88 of mask_powernorm.py seems to be redundant. I also think lines 142-154 of mask_powernorm.py could be changed to the following code to reduce redundancy.

        # construct the mask_input, size to be (BxL) x C: L is the real length here
        if pad_mask is None:
            mask_input = input.clone()
        else:
            # Transpose the bn_mask (B x T -> T x B)
            bn_mask = ~pad_mask
            bn_mask = bn_mask.transpose(0, 1)
            # pad_size = (~bn_mask).sum()  # is pad_size redundant?
            mask_input = input[bn_mask, :]

powernorm/fairseq/modules/norms/mask_powernorm.py

Lines 109 to 112 in bffa7dd

    
        self.affine = affine
        self.register_parameter('weight', nn.Parameter(torch.ones(num_features)))
        self.register_parameter('bias', nn.Parameter(torch.zeros(num_features)))

powernorm/fairseq/modules/norms/mask_powernorm.py

Line 86 in bffa7dd

g = g * 1

powernorm/fairseq/modules/norms/mask_powernorm.py

Line 88 in bffa7dd

gz = (g * z).mean(dim=3).mean(dim=2).mean(dim=0)

powernorm/fairseq/modules/norms/mask_powernorm.py

Lines 142 to 154 in bffa7dd

    
        # construct the mask_input, size to be (BxL) x C: L is the real length here
        if pad_mask is None:
            mask_input = input.clone()
        else:
            # Transpose the bn_mask (B x T -> T x B)
            bn_mask = ~pad_mask
            bn_mask = bn_mask.transpose(0, 1)
        if pad_mask is not None:
            pad_size = (~bn_mask).sum()
            mask_input = input[bn_mask, :]
        else:
            mask_input = input.clone()

Thanks,
grassking100
