sincerass / powernorm
[ICML 2020] code for "PowerNorm: Rethinking Batch Normalization in Transformers" https://arxiv.org/abs/2003.07845
License: GNU General Public License v3.0
Hi, PN is an interesting work and the performance reported in the manuscript is exciting.
However, I'm wondering whether PN still works after removing GroupScaling. As described in the manuscript, GroupScaling seems like a trick to improve the performance, but it is actually a variant of LayerNorm and probably plays a key role in the architecture.
Would you mind showing an ablation study that removes GroupScaling from PN?
Hi, first of all, thank you for your work. I've been spending some time trying to understand what is happening in the script fairseq/modules/norms/mask_powernorm.py, but I've been having some trouble. Could you please answer these questions?
Was 'GroupScaling1D' (starting at line 17) specific to the data or model architecture used for the experiments, rather than a part of the general PowerNorm method?
Based on my understanding, the input is supposed to be shaped (T tokens or instances, B batches, C channels). It seems to be a modified layer norm where each value in the input tensor is divided by the mean of the squared values across each group of 4 channels, for each batch element and each token. I believe this was not mentioned in the paper.
On these few lines in the forward function of PowerFunction:
    if current_iter < warmup_iters:
        running_phi.copy_(running_phi * (current_iter-1)/current_iter + var.mean(dim=0, keepdim=True)/current_iter)
    running_phi.copy_(afwd*running_phi + (1-afwd)*var.mean(dim=0, keepdim=True))
Since 'var' is (1,C,1,1), var.mean(dim=0,keepdim=True) is the same tensor as 'var'. Was this intentional, or perhaps an artifact from an earlier version of the code?
Also, did you mean to put an else statement before 'running_phi.copy_(afwd*running_phi + (1-afwd)*var.mean(dim=0, keepdim=True))'?
Thank you, I'd very much appreciate your time.
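To make the question about the missing else concrete, here is a minimal scalar sketch of the two possible readings of the quoted lines. The function names are hypothetical and plain floats stand in for the tensors, so this is an illustration under those assumptions, not the repository's code.

```python
def update_as_quoted(running_phi, var, current_iter, warmup_iters, afwd):
    """Reading 1: mirrors the quoted lines, so the EMA step always runs."""
    if current_iter < warmup_iters:
        # Cumulative average over the iterations seen so far.
        running_phi = running_phi * (current_iter - 1) / current_iter + var / current_iter
    # No `else`: during warm-up this runs on top of the line above.
    return afwd * running_phi + (1 - afwd) * var


def update_with_else(running_phi, var, current_iter, warmup_iters, afwd):
    """Reading 2: hypothetical variant where the EMA step is the `else` branch."""
    if current_iter < warmup_iters:
        return running_phi * (current_iter - 1) / current_iter + var / current_iter
    return afwd * running_phi + (1 - afwd) * var


if __name__ == "__main__":
    args = dict(running_phi=1.0, var=3.0, warmup_iters=10, afwd=0.9)
    for it in (2, 20):  # one step inside warm-up, one step after it
        print(it, update_as_quoted(current_iter=it, **args),
              update_with_else(current_iter=it, **args))
```

With running_phi=1.0, var=3.0 and current_iter=2, the first reading gives about 2.1 and the second gives 2.0, so the two only differ during warm-up. After warm-up both reduce to the same moving average.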
Hi, I ran your code with different settings but got unexpected results: the model with PN performs worse than the model with LN.
The results are as follows:
Transformer with LN:
Namespace(beam=5, bpe=None, cpu=False, criterion='cross_entropy', data='data-bin/iwslt14.tokenized.de-en.joined/', dataset_impl=None, decoding_format=None, diverse_beam_groups=-1, diverse_beam_strength=0.5, empty_cache_freq=0, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, gen_subset='test', iter_decode_eos_penalty=0.0, iter_decode_force_max_iter=False, iter_decode_max_iter=10, lazy_load=False, left_pad_source='True', left_pad_target='False', lenpen=1.0, load_alignments=False, log_format='simple', log_interval=1000, lr_scheduler='fixed', lr_shrink=0.1, match_source_len=False, max_len_a=0, max_len_b=200, max_sentences=128, max_source_positions=1024, max_target_positions=1024, max_tokens=None, memory_efficient_fp16=False, min_len=1, min_loss_scale=0.0001, model_overrides='{}', momentum=0.99, nbest=1, no_beamable_mm=False, no_early_stop=False, no_progress_bar=False, no_repeat_ngram_size=0, num_shards=1, num_workers=1, optimizer='nag', path='log/iwslt14_de_en/transformer_iwslt_de_en_v2_layer_layer_layer_layer_warm/averaged_model.pt', prefix_size=0, print_alignment=False, print_step=False, quiet=True, raw_text=False, remove_bpe='@@ ', replace_unk=None, required_batch_size_multiple=8, results_path=None, sacrebleu=False, sampling=False, sampling_topk=-1, sampling_topp=-1.0, score_reference=False, seed=1, shard_id=0, skip_invalid_size_inputs_valid_test=False, source_lang='de', target_lang='en', task='translation', tbmf_wrapper=False, temperature=1.0, tensorboard_logdir='', threshold_loss_scale=None, tokenizer=None, unkpen=0, unnormalized=False, upsample_primary=1, user_dir=None, warmup_updates=0, weight_decay=0.0)
| [de] dictionary: 10152 types
| [en] dictionary: 10152 types
| loaded 6750 examples from: data-bin/iwslt14.tokenized.de-en.joined/test.de-en.de
| loaded 6750 examples from: data-bin/iwslt14.tokenized.de-en.joined/test.de-en.en
| data-bin/iwslt14.tokenized.de-en.joined/ test de-en 6750 examples
| loading model(s) from log/iwslt14_de_en/transformer_iwslt_de_en_v2_layer_layer_layer_layer_warm/averaged_model.pt
| Translated 6750 sentences (148676 tokens) in 105.6s (63.91 sentences/s, 1407.62 tokens/s)
| Generate test with beam=5: BLEU4 = 35.44, 69.6/44.1/30.0/20.7 (BP=0.954, ratio=0.955, syslen=125196, reflen=131156)
Transformer with PN:
Namespace(beam=5, bpe=None, cpu=False, criterion='cross_entropy', data='data-bin/iwslt14.tokenized.de-en.joined/', dataset_impl=None, decoding_format=None, diverse_beam_groups=-1, diverse_beam_strength=0.5, empty_cache_freq=0, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, gen_subset='test', iter_decode_eos_penalty=0.0, iter_decode_force_max_iter=False, iter_decode_max_iter=10, lazy_load=False, left_pad_source='True', left_pad_target='False', lenpen=1.0, load_alignments=False, log_format='simple', log_interval=1000, lr_scheduler='fixed', lr_shrink=0.1, match_source_len=False, max_len_a=0, max_len_b=200, max_sentences=128, max_source_positions=1024, max_target_positions=1024, max_tokens=None, memory_efficient_fp16=False, min_len=1, min_loss_scale=0.0001, model_overrides='{}', momentum=0.99, nbest=1, no_beamable_mm=False, no_early_stop=False, no_progress_bar=False, no_repeat_ngram_size=0, num_shards=1, num_workers=1, optimizer='nag', path='log/iwslt14_de_en/transformer_iwslt_de_en_v2_power_power_power_power_warm/averaged_model.pt', prefix_size=0, print_alignment=False, print_step=False, quiet=True, raw_text=False, remove_bpe='@@ ', replace_unk=None, required_batch_size_multiple=8, results_path=None, sacrebleu=False, sampling=False, sampling_topk=-1, sampling_topp=-1.0, score_reference=False, seed=1, shard_id=0, skip_invalid_size_inputs_valid_test=False, source_lang='de', target_lang='en', task='translation', tbmf_wrapper=False, temperature=1.0, tensorboard_logdir='', threshold_loss_scale=None, tokenizer=None, unkpen=0, unnormalized=False, upsample_primary=1, user_dir=None, warmup_updates=0, weight_decay=0.0)
| [de] dictionary: 10152 types
| [en] dictionary: 10152 types
| loaded 6750 examples from: data-bin/iwslt14.tokenized.de-en.joined/test.de-en.de
| loaded 6750 examples from: data-bin/iwslt14.tokenized.de-en.joined/test.de-en.en
| data-bin/iwslt14.tokenized.de-en.joined/ test de-en 6750 examples
| loading model(s) from log/iwslt14_de_en/transformer_iwslt_de_en_v2_power_power_power_power_warm/averaged_model.pt
| Translated 6750 sentences (148074 tokens) in 122.0s (55.34 sentences/s, 1214.00 tokens/s)
| Generate test with beam=5: BLEU4 = 35.27, 69.6/44.0/29.8/20.6 (BP=0.953, ratio=0.954, syslen=125107, reflen=131156)
Looking forward to your reply.
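As a quick sanity check on the two "Generate test" lines, the sketch below recomputes the length ratio and the brevity penalty from the logged syslen and reflen, using the standard BLEU definition of the brevity penalty. The helper name and the dictionary layout are illustrative only; the numbers are copied from the logs above.

```python
import math


def brevity_penalty(syslen, reflen):
    """Standard BLEU brevity penalty: 1 if the output is long enough, else exp(1 - r/c)."""
    if syslen >= reflen:
        return 1.0
    return math.exp(1.0 - reflen / syslen)


# Values copied from the two "Generate test" log lines.
runs = {
    "LN": {"bleu": 35.44, "syslen": 125196, "reflen": 131156},
    "PN": {"bleu": 35.27, "syslen": 125107, "reflen": 131156},
}

for name, r in runs.items():
    ratio = r["syslen"] / r["reflen"]
    bp = brevity_penalty(r["syslen"], r["reflen"])
    print(f"{name}: ratio={ratio:.3f} BP={bp:.3f} BLEU4={r['bleu']}")

gap = runs["LN"]["bleu"] - runs["PN"]["bleu"]
print(f"BLEU4 gap (LN - PN): {gap:.2f}")
```

The recomputed values agree with the logged BP and ratio to three decimals, and the output lengths of the two models differ by fewer than 100 tokens. That suggests the 0.17 BLEU gap is probably not a length effect, although a single run per model cannot settle the question.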
Hi Sheng,
I see there is a GroupScaling1D operation at the very beginning of the PN layer. If I understand correctly, the GroupScaling1D operation scales the input feature across the channel dimension. I'm confused about why this is necessary. It seems this operation is not mentioned in the paper. Is it a standard way to preprocess features in NLP tasks?
Thanks in advance!
Haotao
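For readers trying to picture the operation being asked about, here is a minimal NumPy sketch of a group-wise scaling over the channel dimension of a (T, B, C) input. It is a hypothetical re-implementation for illustration: the function name, the `group_num` and `eps` parameters, their defaults, and the choice to divide by the square root of the group-wise mean square are all assumptions rather than the repository's code.

```python
import numpy as np


def group_scaling_1d(x, group_num=4, eps=1e-5):
    """Scale a (T, B, C) array by the root mean square of each channel group.

    Hypothetical sketch: `group_num` and `eps` are assumed names and defaults.
    """
    T, B, C = x.shape
    if C % group_num != 0:
        raise ValueError("C must be divisible by group_num")
    grouped = x.reshape(T, B, group_num, C // group_num)
    # Second moment per (token, batch element, group); no mean is subtracted.
    moment2 = np.mean(grouped * grouped, axis=3, keepdims=True)
    scaled = grouped / np.sqrt(moment2 + eps)
    return scaled.reshape(T, B, C)


if __name__ == "__main__":
    x = np.arange(1, 17, dtype=float).reshape(1, 2, 8)  # T=1, B=2, C=8
    y = group_scaling_1d(x)
    # After scaling, the mean square inside every group is close to 1.
    print(np.mean(y.reshape(1, 2, 4, 2) ** 2, axis=3))
```

Unlike LayerNorm, this sketch subtracts no mean and has no learned weight or bias. It only rescales each token's channels group by group, which is why it can look like a LayerNorm variant placed in front of PN.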
Hi, I was wondering if you could make the code for your language modelling experiments available? Thanks.
Hi.
I believe the backward implementation differs from what is written in the paper.
powernorm/fairseq/modules/norms/mask_powernorm.py
Lines 90 to 91 in bffa7dd
In line 90,
approx_grad_g = (g - (1 - abkw) * ema_gz * z)
But according to "Algorithm 2" in the paper, I expected the intermediate estimated gradient to be implemented like this:
approx_grad_g = (g - ema_gz * z)
If I am correct, line 91 should also be modified like this:
ema_gz.add_((approx_grad_g * z * (1 - abkw)).mean(dim=3, keepdim=True).mean(dim=2, keepdim=True).mean(dim=0, keepdim=True))
In short, I wonder why your implementation differs from what is written in the paper.
Is there another reason?
Thanks.
Jumin
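The update proposed in this issue can be written out as a small NumPy sketch. It implements only the two lines suggested above, not the repository code. The function name is hypothetical, and the (B, C, H, W) and (1, C, 1, 1) shapes are assumptions inferred from the chained means over dims 3, 2 and 0.

```python
import numpy as np


def proposed_backward_step(g, z, ema_gz, abkw):
    """One step of the update proposed in this issue (not the repository code).

    g, z   : arrays of shape (B, C, H, W), incoming gradient and normalized input
    ema_gz : array of shape (1, C, 1, 1), running estimate, updated in place
    abkw   : backward moving-average coefficient
    """
    approx_grad_g = g - ema_gz * z
    # Chained .mean(dim=3).mean(dim=2).mean(dim=0) equals one mean over axes (0, 2, 3).
    ema_gz += (approx_grad_g * z * (1 - abkw)).mean(axis=(0, 2, 3), keepdims=True)
    return approx_grad_g


if __name__ == "__main__":
    g = np.ones((2, 3, 1, 1))
    z = 2.0 * np.ones((2, 3, 1, 1))
    ema_gz = np.zeros((1, 3, 1, 1))
    for step in range(2):
        grad = proposed_backward_step(g, z, ema_gz, abkw=0.9)
        print(step, grad.ravel()[0], ema_gz.ravel()[0])
```

For comparison, the line quoted from the repository scales ema_gz by (1 - abkw) inside the subtraction, whereas the proposal moves that factor into the accumulation step.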
Hi,
Thank you for your code! I have a question. I am trying to apply power normalization (PN) to Tacotron2. However, after I changed batch norm (BN) to PN, an overflow occurred after several thousand training steps. When I checked, the MaskPowerNorm class's ema_gz parameter became smaller and smaller during training and finally became NaN. Do you have any suggestions or a solution?
Thanks,
Heejo
Hi! Thank you for releasing the paper code!
I had some issues understanding the implementation, which are solved by now. However, I expect that many of the people who decide to use PowerNorm in their projects will face them too. Fixing them would increase the impact of this research and make some people's lives just a bit easier.
What I propose to do:
Fairseq is a pretty big repository, and finding the module you are looking for is like finding a needle in a haystack. Explicitly showing where the module is located right in the README is an easy solution. Another, even better, solution would be to create a new project that contains only the PowerNorm implementation and the corresponding tests.
Currently, the docstring for MaskPowerNorm
"""
An implementation of masked batch normalization, used for testing the numerical
stability.
"""
does not indicate that this is exactly the PowerNorm described in the paper. This is confusing because it gives the impression that this module is only used for testing. After reading the docstring, I spent some extra time searching for a different implementation and verifying that this one is exactly the one used in the experiments.
Initialization parameters are not documented at all, and while some of them -- num_features, affine, ... behave exactly like in nn.BatchNorm1d, the others are specific to PowerNorm (alpha_fwd, alpha_bwd, warmup_iters, ...). It is not clear what they do without going back to the paper and reading the source code.
PowerFunction is not documented at all 😞
Could you please add these clarifications to the docstrings? It should not take more than an hour, and it will definitely save time for many people wanting to use PowerNorm in their projects.
The paper mentions that batch normalization can have large fluctuations in the batch statistics. This occurs in vanilla BN because it calculates the statistics over input of varying lengths padded with 0. I was wondering whether this fluctuation still occurs in the masked version of BN (where padding is ignored). Additionally, how much of a performance gain can be expected by switching from BN to masked BN? Thanks.
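To illustrate why padding matters for the batch statistics, the sketch below compares per-channel mean and variance computed over all positions of a (T, B, C) batch with the same statistics computed over real tokens only. It is a minimal NumPy example under assumed conventions: `channel_stats` is a hypothetical helper, and `pad_mask` is taken to be a boolean (B, T) array that is True at padded positions.

```python
import numpy as np


def channel_stats(x, pad_mask=None):
    """Per-channel mean and variance of a (T, B, C) array.

    pad_mask: optional boolean (B, T) array, True at padded positions
    (assumed convention). Without it, padded zeros count as real tokens.
    """
    T, B, C = x.shape
    if pad_mask is None:
        tokens = x.reshape(T * B, C)
    else:
        tokens = x[~pad_mask.T]  # keep only real tokens, shape (L, C)
    return tokens.mean(axis=0), tokens.var(axis=0)


if __name__ == "__main__":
    # Two sequences of true lengths 4 and 1, padded with zeros to T=4.
    x = np.ones((4, 2, 1))
    x[1:, 1, :] = 0.0
    pad_mask = np.array([[False, False, False, False],
                         [False, True, True, True]])
    print("unmasked:", channel_stats(x))
    print("masked:  ", channel_stats(x, pad_mask))
```

In this toy batch every real token equals 1, yet the unmasked mean is 0.625 and the unmasked variance is nonzero, purely because of padding. The masked statistics recover mean 1 and variance 0. This only shows the bias that padding introduces; it does not by itself say how much fluctuation remains in masked BN on real data.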
I have already run "sudo python3 setup.py build develop". The output ends with "Using /usr/local/lib/python3.6/dist-packages/pycparser-2.20-py3.6.egg
Finished processing dependencies for fairseq==0.8.0", so I believe this step succeeded.
I also tried "pip3 install fairseq".
But I still get the error "cannot import name 'libbleu' from 'fairseq'". Could you give me some clues about it? Thank you very much.
Hi, the PowerNorm link in the README is broken.
Hello,
Thanks for sharing.
I'm just confused about the MaskPowerNorm class, which implements masking in the GitHub code but is not mentioned in the paper.
Hello:
It seems the affine parameter of MaskPowerNorm in line 109 of mask_powernorm.py is not working properly: the weight and bias are used regardless of the boolean value of affine. The code in lines 86 and 88 of mask_powernorm.py seems to be redundant. I also think the code in lines 142-154 of mask_powernorm.py could be changed to the following to reduce redundancy.
# construct the mask_input, size to be (BxL) x C: L is the real length here
if pad_mask is None:
    mask_input = input.clone()
else:
    # Transpose the bn_mask (B x T -> T x B)
    bn_mask = ~pad_mask
    bn_mask = bn_mask.transpose(0, 1)
    # pad_size = (~bn_mask).sum()  # is pad_size redundant?
    mask_input = input[bn_mask, :]
powernorm/fairseq/modules/norms/mask_powernorm.py
Lines 109 to 112 in bffa7dd
powernorm/fairseq/modules/norms/mask_powernorm.py
Lines 142 to 154 in bffa7dd
Thanks,
grassking100