
Comments (9)

sovrasov commented on June 1, 2024

I ran the previous experiment with batch_first=False; when I set batch_first=True, the output is indeed zero.
The reason lies in the PyTorch internals: MultiheadAttention.forward is not actually called (an optimized fused kernel is invoked directly instead), so ptflops cannot hook it. Therefore this bug with batch_first=True and param.requires_grad = False is not fixable.
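
A practical workaround that follows from this (a sketch, not part of ptflops): count the complexity while the parameters still require grad, so the regular MultiheadAttention.forward path is traced, and freeze the parameters only afterwards. This mirrors the repro snippet below.

import torch
from torch.nn import TransformerEncoder, TransformerEncoderLayer
from ptflops import get_model_complexity_info

def prepare_input(resolution):
    # Batch 1, sequence length 64, d_model 512, as in the repro below.
    return dict(src=torch.randn(1, 64, 512))

layer = TransformerEncoderLayer(d_model=512, nhead=4, dim_feedforward=512, batch_first=True)
model = TransformerEncoder(layer, 1)

# Count MACs first, while requires_grad is still True, so the non-fused path is traced.
flops, params = get_model_complexity_info(model, input_res=(1, 1, 1),
                                          input_constructor=prepare_input,
                                          as_strings=True, print_per_layer_stat=False)
print(flops, params)

# Freeze the parameters only after counting.
for param in model.parameters():
    param.requires_grad = False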


sovrasov commented on June 1, 2024

@ssk1997 please provide a code snippet to reproduce


ssk1997 commented on June 1, 2024

Test case with requires_grad=False:

import torch
from torch.nn import TransformerEncoder, TransformerEncoderLayer
from ptflops import get_model_complexity_info

def prepare_input(resolution):
    # Batch 1, sequence length 64, d_model 512.
    input1 = torch.randn(1, 64, 512)
    return dict(src=input1)

layer = TransformerEncoderLayer(d_model=512, nhead=4, dim_feedforward=512, batch_first=True)
model = TransformerEncoder(layer, 1)
for param in model.parameters():
    param.requires_grad = False

flop1, params = get_model_complexity_info(model, input_res=(1, 1, 1),
                                          input_constructor=prepare_input,
                                          as_strings=True, print_per_layer_stat=True)
print(flop1, params)

Output:

TransformerEncoder(
  0, 0.000% Params, 0.0 Mac, 0.000% MACs, 
  (layers): ModuleList(
    0, 0.000% Params, 0.0 Mac, 0.000% MACs, 
    (0): TransformerEncoderLayer(
      0, 0.000% Params, 0.0 Mac, 0.000% MACs, 
      (self_attn): MultiheadAttention(
        0, 0.000% Params, 0.0 Mac, 0.000% MACs, 
        (out_proj): NonDynamicallyQuantizableLinear(0, 0.000% Params, 0.0 Mac, 0.000% MACs, in_features=512, out_features=512, bias=True)
      )
      (linear1): Linear(0, 0.000% Params, 0.0 Mac, 0.000% MACs, in_features=512, out_features=512, bias=True)
      (dropout): Dropout(0, 0.000% Params, 0.0 Mac, 0.000% MACs, p=0.1, inplace=False)
      (linear2): Linear(0, 0.000% Params, 0.0 Mac, 0.000% MACs, in_features=512, out_features=512, bias=True)
      (norm1): LayerNorm(0, 0.000% Params, 0.0 Mac, 0.000% MACs, (512,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm(0, 0.000% Params, 0.0 Mac, 0.000% MACs, (512,), eps=1e-05, elementwise_affine=True)
      (dropout1): Dropout(0, 0.000% Params, 0.0 Mac, 0.000% MACs, p=0.1, inplace=False)
      (dropout2): Dropout(0, 0.000% Params, 0.0 Mac, 0.000% MACs, p=0.1, inplace=False)
    )
  )
)
0.0 Mac 0

Test case with requires_grad=True:

import torch
from torch.nn import TransformerEncoder, TransformerEncoderLayer
from ptflops import get_model_complexity_info

def prepare_input(resolution):
    # Batch 1, sequence length 64, d_model 512.
    input1 = torch.randn(1, 64, 512)
    return dict(src=input1)

# Parameters are left trainable (requires_grad=True, the default).
layer = TransformerEncoderLayer(d_model=512, nhead=4, dim_feedforward=512, batch_first=True)
model = TransformerEncoder(layer, 1)

flop1, params = get_model_complexity_info(model, input_res=(1, 1, 1),
                                          input_constructor=prepare_input,
                                          as_strings=True, print_per_layer_stat=True)
print(flop1, params)

Output:

TransformerEncoder(
  1.58 M, 99.870% Params, 105.04 MMac, 100.000% MACs, 
  (layers): ModuleList(
    1.58 M, 99.870% Params, 105.04 MMac, 100.000% MACs, 
    (0): TransformerEncoderLayer(
      1.58 M, 99.870% Params, 105.04 MMac, 100.000% MACs, 
      (self_attn): MultiheadAttention(
        1.05 M, 66.580% Params, 71.48 MMac, 68.054% MACs, 
        (out_proj): NonDynamicallyQuantizableLinear(0, 0.000% Params, 0.0 Mac, 0.000% MACs, in_features=512, out_features=512, bias=True)
      )
      (linear1): Linear(262.66 k, 16.645% Params, 16.78 MMac, 15.973% MACs, in_features=512, out_features=512, bias=True)
      (dropout): Dropout(0, 0.000% Params, 0.0 Mac, 0.000% MACs, p=0.1, inplace=False)
      (linear2): Linear(262.66 k, 16.645% Params, 16.78 MMac, 15.973% MACs, in_features=512, out_features=512, bias=True)
      (norm1): LayerNorm(0, 0.000% Params, 0.0 Mac, 0.000% MACs, (512,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm(0, 0.000% Params, 0.0 Mac, 0.000% MACs, (512,), eps=1e-05, elementwise_affine=True)
      (dropout1): Dropout(0, 0.000% Params, 0.0 Mac, 0.000% MACs, p=0.1, inplace=False)
      (dropout2): Dropout(0, 0.000% Params, 0.0 Mac, 0.000% MACs, p=0.1, inplace=False)
    )
  )
)
105.04 MMac 1.58 M

Thanks for your response. @sovrasov


sovrasov commented on June 1, 2024

If I launch this snippet with param.requires_grad = False, my output is 100.89 MMac 0. ptflops returns 0 params in that case because it counts only the parameters that have gradients (that is the natural definition of learnable parameters). Which version of ptflops do you use?
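
If you need the total parameter count regardless of requires_grad, you can compute it directly (a minimal sketch, independent of ptflops):

# Count all parameters vs. only the trainable ones.
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"total: {total}, trainable: {trainable}")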


sovrasov commented on June 1, 2024

For reference, you can grep for torch._transformer_encoder_layer_fwd in the PyTorch source code.
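
To check at runtime whether that fused op is actually hit, you can wrap it before running the model (a sketch; torch._transformer_encoder_layer_fwd is a private PyTorch op, so this may change between releases):

import torch

_orig_fwd = torch._transformer_encoder_layer_fwd

def _logged_fwd(*args, **kwargs):
    # Print whenever the fused fast path is taken, then call the original op.
    print("fast path hit: torch._transformer_encoder_layer_fwd")
    return _orig_fwd(*args, **kwargs)

torch._transformer_encoder_layer_fwd = _logged_fwd
# ... run model(src) or ptflops here; if the message prints, the hooks on
#     self_attn / linear1 / linear2 were bypassed ...
torch._transformer_encoder_layer_fwd = _orig_fwd  # restore afterwards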


ssk1997 commented on June 1, 2024

Thanks a lot. It works fine with batch_first=False.


sovrasov commented on June 1, 2024

This bug may occur in other cases as well because of PyTorch's inference optimizations (BetterTransformer): https://pytorch.org/tutorials/beginner/bettertransformer_tutorial.html https://pytorch.org/blog/a-better-transformer-for-fast-transformer-encoder-inference/


quancs commented on June 1, 2024

I've encountered the same bug. Is there any hook to fix this?


sovrasov commented on June 1, 2024

@quancs As I already wrote above, this is a wontfix problem.
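
For readers who still need a number with the fast path enabled, the bypassed self-attention MACs can be estimated analytically (a rough sketch, assuming batch size 1 and ignoring softmax and bias terms):

def mha_macs(L, d):
    # L: sequence length, d: embedding dimension.
    qkv_proj = 3 * L * d * d   # Q, K, V projections
    scores = L * L * d         # Q @ K^T
    values = L * L * d         # softmax(scores) @ V
    out_proj = L * d * d       # output projection
    return qkv_proj + scores + values + out_proj

print(mha_macs(64, 512) / 1e6)  # ~71.3 MMac, close to the 71.48 MMac reported above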

