
Comments (9)

sovrasov commented on June 1, 2024

I ran the previous experiment with batch_first=False; when I set batch_first=True, the output is indeed zero.
The reason lies in the PyTorch internals: MultiheadAttention.forward is not actually called (an optimized fused kernel is invoked directly instead), so ptflops cannot hook it. Therefore this bug with batch_first=True and param.requires_grad = False is not fixable.
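
A practical workaround that follows from this (a sketch, not part of ptflops): count the complexity while the parameters still require grad, so the regular MultiheadAttention.forward path is traced, and freeze the parameters only afterwards. This mirrors the repro snippet below.

import torch
from torch.nn import TransformerEncoder, TransformerEncoderLayer
from ptflops import get_model_complexity_info

def prepare_input(resolution):
    # Batch 1, sequence length 64, d_model 512, as in the repro below.
    return dict(src=torch.randn(1, 64, 512))

layer = TransformerEncoderLayer(d_model=512, nhead=4, dim_feedforward=512, batch_first=True)
model = TransformerEncoder(layer, 1)

# Count MACs first, while requires_grad is still True, so the non-fused path is traced.
flops, params = get_model_complexity_info(model, input_res=(1, 1, 1),
                                          input_constructor=prepare_input,
                                          as_strings=True, print_per_layer_stat=False)
print(flops, params)

# Freeze the parameters only after counting.
for param in model.parameters():
    param.requires_grad = False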


sovrasov commented on June 1, 2024

@ssk1997 please provide a code snippet to reproduce


ssk1997 commented on June 1, 2024

Test case with requires_grad=False:

import torch
from torch.nn import TransformerEncoder, TransformerEncoderLayer
from ptflops import get_model_complexity_info

def prepare_input(resolution):
    # Batch 1, sequence length 64, d_model 512.
    input1 = torch.randn(1, 64, 512)
    return dict(src=input1)

layer = TransformerEncoderLayer(d_model=512, nhead=4, dim_feedforward=512, batch_first=True)
model = TransformerEncoder(layer, 1)
for param in model.parameters():
    param.requires_grad = False

flop1, params = get_model_complexity_info(model, input_res=(1, 1, 1),
                                          input_constructor=prepare_input,
                                          as_strings=True, print_per_layer_stat=True)
print(flop1, params)

Output:

TransformerEncoder(
  0, 0.000% Params, 0.0 Mac, 0.000% MACs, 
  (layers): ModuleList(
    0, 0.000% Params, 0.0 Mac, 0.000% MACs, 
    (0): TransformerEncoderLayer(
      0, 0.000% Params, 0.0 Mac, 0.000% MACs, 
      (self_attn): MultiheadAttention(
        0, 0.000% Params, 0.0 Mac, 0.000% MACs, 
        (out_proj): NonDynamicallyQuantizableLinear(0, 0.000% Params, 0.0 Mac, 0.000% MACs, in_features=512, out_features=512, bias=True)
      )
      (linear1): Linear(0, 0.000% Params, 0.0 Mac, 0.000% MACs, in_features=512, out_features=512, bias=True)
      (dropout): Dropout(0, 0.000% Params, 0.0 Mac, 0.000% MACs, p=0.1, inplace=False)
      (linear2): Linear(0, 0.000% Params, 0.0 Mac, 0.000% MACs, in_features=512, out_features=512, bias=True)
      (norm1): LayerNorm(0, 0.000% Params, 0.0 Mac, 0.000% MACs, (512,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm(0, 0.000% Params, 0.0 Mac, 0.000% MACs, (512,), eps=1e-05, elementwise_affine=True)
      (dropout1): Dropout(0, 0.000% Params, 0.0 Mac, 0.000% MACs, p=0.1, inplace=False)
      (dropout2): Dropout(0, 0.000% Params, 0.0 Mac, 0.000% MACs, p=0.1, inplace=False)
    )
  )
)
0.0 Mac 0

Test case with requires_grad=True:

import torch
from torch.nn import TransformerEncoder, TransformerEncoderLayer
from ptflops import get_model_complexity_info

def prepare_input(resolution):
    # Batch 1, sequence length 64, d_model 512.
    input1 = torch.randn(1, 64, 512)
    return dict(src=input1)

# Parameters are left trainable (requires_grad=True, the default).
layer = TransformerEncoderLayer(d_model=512, nhead=4, dim_feedforward=512, batch_first=True)
model = TransformerEncoder(layer, 1)

flop1, params = get_model_complexity_info(model, input_res=(1, 1, 1),
                                          input_constructor=prepare_input,
                                          as_strings=True, print_per_layer_stat=True)
print(flop1, params)

Output:

TransformerEncoder(
  1.58 M, 99.870% Params, 105.04 MMac, 100.000% MACs, 
  (layers): ModuleList(
    1.58 M, 99.870% Params, 105.04 MMac, 100.000% MACs, 
    (0): TransformerEncoderLayer(
      1.58 M, 99.870% Params, 105.04 MMac, 100.000% MACs, 
      (self_attn): MultiheadAttention(
        1.05 M, 66.580% Params, 71.48 MMac, 68.054% MACs, 
        (out_proj): NonDynamicallyQuantizableLinear(0, 0.000% Params, 0.0 Mac, 0.000% MACs, in_features=512, out_features=512, bias=True)
      )
      (linear1): Linear(262.66 k, 16.645% Params, 16.78 MMac, 15.973% MACs, in_features=512, out_features=512, bias=True)
      (dropout): Dropout(0, 0.000% Params, 0.0 Mac, 0.000% MACs, p=0.1, inplace=False)
      (linear2): Linear(262.66 k, 16.645% Params, 16.78 MMac, 15.973% MACs, in_features=512, out_features=512, bias=True)
      (norm1): LayerNorm(0, 0.000% Params, 0.0 Mac, 0.000% MACs, (512,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm(0, 0.000% Params, 0.0 Mac, 0.000% MACs, (512,), eps=1e-05, elementwise_affine=True)
      (dropout1): Dropout(0, 0.000% Params, 0.0 Mac, 0.000% MACs, p=0.1, inplace=False)
      (dropout2): Dropout(0, 0.000% Params, 0.0 Mac, 0.000% MACs, p=0.1, inplace=False)
    )
  )
)
105.04 MMac 1.58 M

Thanks for your response. @sovrasov


sovrasov commented on June 1, 2024

If I launch this snippet with param.requires_grad = False, my output is 100.89 MMac 0. ptflops returns 0 params in that case because it counts only the parameters that have gradients (that is the natural definition of learnable parameters). Which version of ptflops do you use?
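
If you need the total parameter count regardless of requires_grad, you can compute it directly (a minimal sketch, independent of ptflops):

# Count all parameters vs. only the trainable ones.
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"total: {total}, trainable: {trainable}")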


sovrasov commented on June 1, 2024

For reference, you can grep for torch._transformer_encoder_layer_fwd in the PyTorch source code.
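
To check at runtime whether that fused op is actually hit, you can wrap it before running the model (a sketch; torch._transformer_encoder_layer_fwd is a private PyTorch op, so this may change between releases):

import torch

_orig_fwd = torch._transformer_encoder_layer_fwd

def _logged_fwd(*args, **kwargs):
    # Print whenever the fused fast path is taken, then call the original op.
    print("fast path hit: torch._transformer_encoder_layer_fwd")
    return _orig_fwd(*args, **kwargs)

torch._transformer_encoder_layer_fwd = _logged_fwd
# ... run model(src) or ptflops here; if the message prints, the hooks on
#     self_attn / linear1 / linear2 were bypassed ...
torch._transformer_encoder_layer_fwd = _orig_fwd  # restore afterwards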


ssk1997 commented on June 1, 2024

Thanks a lot. It works fine with batch_first=False.


sovrasov commented on June 1, 2024

This bug may occur in other cases as well because of PyTorch's inference optimizations (BetterTransformer): https://pytorch.org/tutorials/beginner/bettertransformer_tutorial.html https://pytorch.org/blog/a-better-transformer-for-fast-transformer-encoder-inference/


quancs commented on June 1, 2024

I've encountered the same bug. Is there any hook to fix this?


sovrasov commented on June 1, 2024

@quancs As I already wrote above, this is a wontfix problem.
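
For readers who still need a number with the fast path enabled, the bypassed self-attention MACs can be estimated analytically (a rough sketch, assuming batch size 1 and ignoring softmax and bias terms):

def mha_macs(L, d):
    # L: sequence length, d: embedding dimension.
    qkv_proj = 3 * L * d * d   # Q, K, V projections
    scores = L * L * d         # Q @ K^T
    values = L * L * d         # softmax(scores) @ V
    out_proj = L * d * d       # output projection
    return qkv_proj + scores + values + out_proj

print(mha_macs(64, 512) / 1e6)  # ~71.3 MMac, close to the 71.48 MMac reported above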

