leaderj1001 / attention-augmented-conv2d

Implementing Attention Augmented Convolutional Networks using PyTorch

License: MIT License

Python 100.00%

attention-augmented-conv pytorch

attention-augmented-conv2d's Issues

1d version

Thanks a lot for sharing your work! I would very much appreciate it if you could also include a 1D version.

Why is the size of rel_embedding 2w-1 or 2h-1? Can you explain?

Replace einsum operation with matmul

First, thank you for posting this; it has been helpful for understanding the method, especially the relative encoding part.

Now, I see on the main page of this repo that you say the einsum operation is slow. So why not replace it with matmul here?

Instead of

rel_logits = torch.einsum('bhxyd,md->bhxym', q, rel_k)

We can have

rel_logits = q.matmul(rel_k.transpose(-1, -2))

I have tested it, and in my network (the SimpleResNet56 used for CIFAR-10 in the original paper) matmul is, on average, 2x faster (110 μs vs. 220 μs).

See also this discussion:

pytorch/pytorch#32591

Output 0 of ReshapeAliasBackward0 is a view and is being modified inplace

I can't run the AA-Conv net; it fails with:
RuntimeError: Output 0 of ReshapeAliasBackward0 is a view and is being modified inplace. This view is the output of a function that returns multiple views. Such functions do not allow the output views to be modified inplace. You should replace the inplace operation by an out-of-place one.

The net structure is:
class AACNN(nn.Module):
    def __init__(self):
        super(AACNN, self).__init__()
        self.conv1_model = nn.Sequential(OrderedDict([
            ('conv1', nn.Conv2d(1, 3, 3, padding=1)),
            ('relu1', nn.ReLU()),
        ]))

        self.augmented_conv1 = AugmentedConv(in_channels=3, out_channels=20, kernel_size=3, dk=40, dv=4, Nh=4, relative=True,
                                             stride=1, shape=64).to(device)

        self.conv2_model = nn.Sequential(OrderedDict([
            ('conv2', nn.Conv2d(32, 16, 5, padding=2)),
            ('pool2', nn.MaxPool2d(2)),
            ('conv3', nn.Conv2d(16, 8, 5)),  # out: 8*12*12
            ('relu', nn.ReLU()),
        ]))

        self.linear = nn.Sequential(OrderedDict([
            ('linear1', nn.Linear(8*12*12, 512)),
            ('linear2', nn.Linear(512, 128)),
            ('linear3', nn.Linear(128, 1)),
        ]))

    def forward(self, x):
        x = self.conv1_model(x)
        print(x.shape)
        x = self.augmented_conv1(x)
        x = self.conv2_model(x)
        x = self.linear(x)
        return x

The print statement outputs torch.Size([10, 3, 64, 64]).

Thanks for any help

I am working with 256x256x3 images on an Attention Augmented Convolution ResUNet, and whenever I start training the model I get an "OOM when allocating tensor with shape [2,2,256,256,256,256]" error

2022-05-28 12:16:19.811844: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
2022-05-28 12:16:19.811984: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2022-05-28 12:17:59.868122: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'nvcuda.dll'; dlerror: nvcuda.dll not found
2022-05-28 12:17:59.887137: W tensorflow/stream_executor/cuda/cuda_driver.cc:312] failed call to cuInit: UNKNOWN ERROR (303)
2022-05-28 12:17:59.890233: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: DESKTOP-M8C53RA
2022-05-28 12:17:59.890303: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: DESKTOP-M8C53RA
2022-05-28 12:18:15.246483: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-05-28 12:18:15.254740: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1c4cef7ff80 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-05-28 12:18:15.254775: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2022-05-30 11:02:52.095422: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 535822336 exceeds 10% of free system memory.
2022-05-30 11:02:52.065682: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 535822336 exceeds 10% of free system memory.
2022-05-30 11:02:52.095431: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 68719476736 exceeds 10% of free system memory.
2022-05-30 11:02:52.651227: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at batch_matmul_op_impl.h:730 : Resource exhausted: OOM when allocating tensor with shape[2,2,65536,65536] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2022-05-30 11:02:59.958471: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 536870912 exceeds 10% of free system memory.
2022-05-30 11:02:59.958471: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 536870912 exceeds 10% of free system memory.
2022-05-30 11:03:09.770279: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at tile_ops.cc:223 : Resource exhausted: OOM when allocating tensor with shape[2,2,256,256,256,256] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2022-05-30 11:03:09.916520: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at tile_ops.cc:223 : Resource exhausted: OOM when allocating tensor with shape[2,2,256,256,256,256] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2022-05-30 11:14:34.514953: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at batch_matmul_op_impl.h:730 : Resource exhausted: OOM when allocating tensor with shape[2,2,65536,65536] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2022-05-30 11:15:06.529323: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at tile_ops.cc:223 : Resource exhausted: OOM when allocating tensor with shape[2,2,256,256,256,256] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2022-05-30 11:15:06.529514: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at tile_ops.cc:223 : Resource exhausted: OOM when allocating tensor with shape[2,2,256,256,256,256] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
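
For reference, the allocation sizes in the log above follow directly from the tensor shapes. A rough sketch of the arithmetic, assuming 4-byte float32 elements (the log only says "type float"); the helper name tensor_bytes is made up for illustration:

```python
import math


def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor; 4-byte float32 elements are assumed."""
    return math.prod(shape) * bytes_per_element


# Shapes copied from the OOM messages in the log.
tiled_logits = (2, 2, 256, 256, 256, 256)
attention_logits = (2, 2, 65536, 65536)

for shape in (tiled_logits, attention_logits):
    size = tensor_bytes(shape)
    print(shape, size, "bytes =", size / 2**30, "GiB")
```

Under that assumption both shapes come to 68719476736 bytes (64 GiB), the same figure as the "Allocation of 68719476736" warning: each of the 256x256 positions attends to all 256x256 positions, so the cost grows with the square of the number of pixels.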

Bug in forward of attention_augmented_conv.py

Hi! I think there's a bug at this line in the forward function. Specifically, if the attention tensor attn_out is as follows for an input image with shape (channels, h(=2), w(=3)) and self-attention channels dv = 2:

# attention values of the 6 pixels
Att tensor([[-3.5002, -1.2102],
        [-4.3694, -1.5107],
        [-4.7621, -1.6465],
        [-4.9178, -1.7003],
        [-2.2335, -0.7722],
        [-5.0056, -1.7307]], grad_fn=<SliceBackward>)

you should not reshape it directly using

attn_out = torch.reshape(attn_out, (batch, Nh, dv // Nh, height, width)) # Method 1

but instead you should use

attn_out = torch.reshape(attn_out.permute(0, 1, 3, 2), (bs, Nh, dv // Nh, H, W)) # Method 2

The output difference:

# Method 1
Att tensor([[[-3.5002, -1.2102, -4.3694],
         [-1.5107, -4.7621, -1.6465]],

        [[-4.9178, -1.7003, -2.2335],
         [-0.7722, -5.0056, -1.7307]]], grad_fn=<SliceBackward>)

vs.

# Method 2
Att tensor([[[-3.5002, -4.3694, -4.7621],
         [-4.9178, -2.2335, -5.0056]],

        [[-1.2102, -1.5107, -1.6465],
         [-1.7003, -0.7722, -1.7307]]], grad_fn=<SliceBackward>)

Hope it helps!
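
The difference between the two methods can be reproduced without PyTorch. The sketch below is a pure-Python stand-in for a row-major reshape (the helper reshape_3d is hypothetical, not part of the repository) applied to the six attention rows quoted above:

```python
# Attention output for 6 pixels (h=2, w=3) with dv=2 channels: shape (h*w, dv).
att = [
    [-3.5002, -1.2102],
    [-4.3694, -1.5107],
    [-4.7621, -1.6465],
    [-4.9178, -1.7003],
    [-2.2335, -0.7722],
    [-5.0056, -1.7307],
]
H, W, DV = 2, 3, 2


def reshape_3d(flat, d0, d1, d2):
    """Row-major reshape of a flat list into nested lists of shape (d0, d1, d2)."""
    assert len(flat) == d0 * d1 * d2
    return [
        [flat[(i * d1 + j) * d2:(i * d1 + j + 1) * d2] for j in range(d1)]
        for i in range(d0)
    ]


# Method 1: reshape the (h*w, dv) buffer directly into (dv, h, w).
flat_direct = [value for row in att for value in row]
method1 = reshape_3d(flat_direct, DV, H, W)

# Method 2: transpose to (dv, h*w) first, then reshape into (dv, h, w).
transposed = [[row[c] for row in att] for c in range(DV)]
flat_transposed = [value for row in transposed for value in row]
method2 = reshape_3d(flat_transposed, DV, H, W)

print(method1)
print(method2)
```

With Method 1 the two channels of each pixel end up interleaved inside one spatial map; with Method 2 each map holds a single channel for all six pixels, which is the layout the issue argues for.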

Any reason for dk_k higher than 1?

According to the paper, they used a dk_k value less than 1 (mostly equal to 2*dv_v or dv_v).

Is there any reason for such a value (dk_k = 2)? I'm just curious.

A question about relative position embeddings

How long does it take to train AA-Wide-ResNet on the CIFAR100 dataset?

How long does it take to train AA-Wide-ResNet on the CIFAR100 dataset?
Could you also tell me what device you trained on?

Thank you!

Could you share the code of ResNet-50/101 with AA-Conv2d?

relative positional encoding not shared between heads

In the paper, it's stated that "The relative positional embeddings rH and rW are learned and shared across heads but not layers." I think that in your implementation, as well as in the one printed in the paper, they are learned separately for each head. I would expect a repeat per head, as is done in line 104 for the height. Could you explain whether I am overlooking something in your code?

TypeError: can't multiply sequence by non-int of type 'float' in _wide_layer

I get this error when using _wide_layer:

 30     def make_layer(self, block, out_channels, n_blocks, dropout_rate, stride):
---> 31         strides = [stride] + [1]*(n_blocks-1)
     32         layers = []
     33 

TypeError: can't multiply sequence by non-int of type 'float'
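
The traceback does not show where n_blocks comes from, so the following is only a guess at the cause: in Python 3 the / operator always returns a float, and a list cannot be multiplied by a float. The sketch reuses the expression from line 31; the depth value and the (depth - 4) / 6 formula are assumptions chosen for illustration.

```python
def make_strides(stride, n_blocks):
    """Same expression as line 31 of the traceback."""
    return [stride] + [1] * (n_blocks - 1)


depth = 28                          # hypothetical network depth
n_blocks_float = (depth - 4) / 6    # true division yields a float
n_blocks_int = (depth - 4) // 6     # floor division yields an int

try:
    make_strides(1, n_blocks_float)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
print(make_strides(2, n_blocks_int))
```

If that is what happens here, computing the block count with // or wrapping it in int() before it reaches make_layer should avoid the error.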

Confused about the size of key_rel

When self.relative == True, self.key_rel_w = nn.Parameter(torch.randn((2 * self.shape - 1, dk // Nh), requires_grad=True)).
However, I'm confused about why dim 0 should be 2 * self.shape - 1.
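
One way to see where 2 * shape - 1 comes from is to count the distinct relative offsets along a single axis. This is a small illustration rather than the repository's code; the width value and the indexing convention at the end are assumptions.

```python
def relative_offsets(length):
    """All distinct values of (key position - query position) along one axis."""
    return sorted({key - query for query in range(length) for key in range(length)})


W = 5  # illustrative feature-map width
offsets = relative_offsets(W)
print(offsets)
print(len(offsets), 2 * W - 1)

# Assumed indexing convention: shift each offset so that the smallest one,
# -(W - 1), maps to row 0 of the embedding table.
row_index = {offset: offset + (W - 1) for offset in offsets}
print(row_index)
```

Offsets run from -(W - 1) to W - 1, so a table with one learned row per offset needs 2W - 1 rows, and likewise 2H - 1 for the height.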

Is there an inconsistency in this code?

    # flat_q, flat_k, flat_v
    # (batch_size, Nh, height * width, dvh or dkh)

    flat_q = torch.reshape(q, (N, Nh, dk // Nh, H * W))
    flat_k = torch.reshape(k, (N, Nh, dk // Nh, H * W))
    flat_v = torch.reshape(v, (N, Nh, dv // Nh, H * W))

Memory complexity of self attention

Is the memory complexity of self-attention O((HW)^2 * Nh)? Is that correct?
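
A quick way to sanity-check the bound is to count entries, assuming the attention logits are stored densely as a (B, Nh, HW, HW) tensor, as in standard multi-head self-attention. The helper name and the sizes below are illustrative only.

```python
def attention_logit_count(height, width, num_heads, batch_size=1):
    """Number of entries in a dense logit tensor of shape (B, Nh, H*W, H*W)."""
    positions = height * width
    return batch_size * num_heads * positions * positions


base = attention_logit_count(32, 32, num_heads=4)
doubled = attention_logit_count(64, 64, num_heads=4)
print(base, doubled, doubled // base)
```

Under that assumption the count is linear in Nh and in the batch size, and quadratic in HW, so doubling both H and W multiplies the memory by 16.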

torch.einsum() compatibility

Hi, I am testing your sample code in attention_augmented_conv.py with:

    tmp = torch.randn((16, 3, 32, 32))
    a = AugmentedConv(3, 20, kernel_size=3, dk=40, dv=4, Nh=2, relative=True)
    print(a(tmp).shape)

But it raises:

    Traceback (most recent call last):
      File "attention_augmented_conv.py", line 131, in <module>
        print(a(tmp).shape)
      File "/Users/scouly/anaconda3/envs/Pytorch_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
        result = self.forward(*input, **kwargs)
      File "attention_augmented_conv.py", line 44, in forward
        h_rel_logits, w_rel_logits = self.relative_logits(q)
      File "attention_augmented_conv.py", line 90, in relative_logits
        rel_logits_w = self.relative_logits_1d(q, key_rel_w, H, W, Nh, "w")
      File "attention_augmented_conv.py", line 99, in relative_logits_1d
        rel_logits = torch.einsum('bhxyd,md->bhxym', q, rel_k)
    TypeError: einsum() takes 2 positional arguments but 3 were given

I'm guessing this is caused by a PyTorch version compatibility issue.
BTW, I am currently using PyTorch 0.4.1 on macOS.

Must the height and width of the input features be equal?

How can I change the padding of the convolution layer, for example to 0?

Problems with Parameter registration

Attention-Augmented-Conv2d/attention_augmented_conv.py

Lines 95 to 99 in c04acfb

    
    key_rel_w = nn.Parameter(torch.randn((2 * W - 1, dk), requires_grad=True)).to(device)
    rel_logits_w = self.relative_logits_1d(q, key_rel_w, H, W, Nh, "w")
    key_rel_h = nn.Parameter(torch.randn((2 * H - 1, dk), requires_grad=True)).to(device)
    rel_logits_h = self.relative_logits_1d(torch.transpose(q, 2, 3), key_rel_h, W, H, Nh, "h")

I think that if you create your Parameters here, they cannot be optimized correctly.
Generally, your optimizer takes model.named_parameters() as input, so optimizer.step() and optimizer.zero_grad() will ignore key_rel_w and key_rel_h because they are not in model.named_parameters(). (The gradients will still be calculated normally when loss.backward() is called, though.)
Use self.key_rel_w and self.key_rel_h instead.

The matrix multiplication here seems to be wrong

Attention-Augmented-Conv2d/in_paper_attention_augmented_conv/attention_augmented_conv.py

Line 51 in c04acfb

attn_out = torch.matmul(weights, flat_v.transpose(2, 3))

Intuitively, the correct version should be:

attn_out = torch.matmul(flat_v, weights)

CUDA out of memory: tried to allocate 32768.00 GiB (only 10.00 GiB free)

Getting a memory error of unreal proportions on Wide-resnet

Possible bugs in relative_logits functions

In the relative_logits() function, you have

q = torch.transpose(q, 2, 4)

which gives a tensor with shape (B, Nh, W, H, dkh), not (B, Nh, H, W, dkh).

In the relative_logits_1d() function, you have

rel_logits = torch.einsum('bhxyd,md->bhmxy', q, rel_k)
rel_logits = torch.reshape(rel_logits, (-1, Nh * H, W, 2 * W - 1))

Shouldn't the einsum string be 'bhxyd,md->bhxym'? Otherwise, you are reshaping a tensor with shape (B, Nh, 2 * W - 1, H, W) to a tensor with shape (B, Nh * H, W, 2 * W - 1) in the second line.

Identity is not the same thing as equality in Python

Use ==/!= to compare str, bytes, and int literals

flake8 testing of https://github.com/leaderj1001/Attention-Augmented-Conv2d on Python 3.7.1

$ flake8 . --count --select=E9,F63,F72,F82 --show-source --statistics

./attention_augmented_conv.py:106:12: F632 use ==/!= to compare str, bytes, and int literals
        if case is "w":
           ^
./attention_augmented_conv.py:108:14: F632 use ==/!= to compare str, bytes, and int literals
        elif case is "h":
             ^
./AA-Wide-ResNet/attention_augmented_conv.py:107:12: F632 use ==/!= to compare str, bytes, and int literals
        if case is "w":
           ^
./AA-Wide-ResNet/attention_augmented_conv.py:109:14: F632 use ==/!= to compare str, bytes, and int literals
        elif case is "h":
             ^
./AA-Wide-ResNet/preprocess.py:13:8: F632 use ==/!= to compare str, bytes, and int literals
    if args.dataset_mode is "CIFAR100":
       ^
./AA-Wide-ResNet/preprocess.py:38:10: F632 use ==/!= to compare str, bytes, and int literals
    elif args.dataset_mode is "CIFAR10":
         ^
./AA-Wide-ResNet/preprocess.py:63:10: F632 use ==/!= to compare str, bytes, and int literals
    elif args.dataset_mode is "MNIST":
         ^
./AA-Wide-ResNet/main.py:86:8: F632 use ==/!= to compare str, bytes, and int literals
    if args.dataset_mode is "CIFAR10":
       ^
./AA-Wide-ResNet/main.py:88:10: F632 use ==/!= to compare str, bytes, and int literals
    elif args.dataset_mode is "CIFAR100":
         ^
./AA-Wide-ResNet/main.py:90:10: F632 use ==/!= to compare str, bytes, and int literals
    elif args.dataset_mode is "MNIST":
         ^
10    F632 use ==/!= to compare str, bytes, and int literals
10
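
The reason flake8 flags these lines is that is tests object identity while == tests value. A small sketch of the difference; the function name num_classes is hypothetical and only the dataset names come from the report above:

```python
expected = "CIFAR100"

# Build an equal string at runtime, similar to a value parsed from the command line.
dataset_mode = "".join(["CIFAR", "100"])

# Value comparison: true whenever the characters match.
print(dataset_mode == expected)

# Identity comparison: depends on whether the interpreter happens to reuse
# the same object, so the result is an implementation detail.
print(dataset_mode is expected)


def num_classes(mode):
    """Select the class count by value, using == as flake8 suggests."""
    if mode == "CIFAR100":
        return 100
    elif mode in ("CIFAR10", "MNIST"):
        return 10
    raise ValueError(f"unknown dataset mode: {mode}")


print(num_classes(dataset_mode))
```

With is, a branch such as the ones at main.py:86-90 can silently fall through for a string that compares equal but is a different object.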

Can the width and height be different?

It seems the width and height of the feature map must be the same.
Can they be different?

Attention-Augmented-Conv2d/AA-Wide-ResNet/attention_augmented_conv.py

Lines 35 to 36 in 1ce94a3

    
    self.key_rel_w = nn.Parameter(torch.randn((2 * self.shape - 1, dk // Nh), requires_grad=True))
    self.key_rel_h = nn.Parameter(torch.randn((2 * self.shape - 1, dk // Nh), requires_grad=True))

Has anyone successfully added this convolution to a ResNet network?

Memory blow up issue

Hi,

Thanks for the open impl of AAConv/AAWRN 😄
I have access to a 16 GB GPU to run a few experiments on AA Wide ResNet, but the memory grows out of bounds at the start of training. An AAWRN28-10 requires approx. 2 GB of memory for its 206229580 parameters. At that point, running the model with a batch of 128 images from CIFAR-100 causes RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 15.90 GiB total capacity; 11.70 GiB already allocated; 1.26 GiB free; 2.24 GiB cached).

The error output points at line 113 of the AA Conv class: rel_logits = rel_logits.repeat((1, 1, 1, H, 1, 1)).

On the other hand, it happens in the first convolution of the first layer. I tried switching every AA Conv to relative=False, which performs a bit better: it gets to the 2nd conv of the first layer.

I had to downscale the model to either batch size = 16 or a terribly low widen factor. If you have any idea/plan on how to improve the memory efficiency, it would be neat! 😆
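
A back-of-the-envelope estimate of what the repeat call quoted above costs. The batch size comes from the report; everything else (a 32x32 feature map, 4 heads, float32, a singleton at dim 3 before the repeat) is an assumption made for illustration:

```python
def num_elements(shape):
    """Product of all dimensions of a shape tuple."""
    count = 1
    for dim in shape:
        count *= dim
    return count


B = 128          # batch size from the report
H = W = 32       # assumed feature-map size (CIFAR images are 32x32)
NH = 4           # hypothetical number of heads, not stated in the report
BYTES = 4        # float32 assumed

before = (B, NH, H, 1, W, W)   # assumed shape before .repeat((1, 1, 1, H, 1, 1))
after = (B, NH, H, H, W, W)    # dim 3 is tiled H times by the repeat

print(num_elements(before) * BYTES / 2**20, "MiB before repeat")
print(num_elements(after) * BYTES / 2**30, "GiB after repeat")
```

Whatever the exact shapes are, the repeat multiplies the tensor size by H. Under these particular assumptions the result is 2 GiB, which happens to coincide with the 2.00 GiB in the error message, though the head count here is a guess.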

kernel_size in qkv_conv

Is it right to use self.kernel_size here for kernel_size instead of 1?

Attention-Augmented-Conv2d/AA-Wide-ResNet/attention_augmented_conv.py

Line 30 in 2f6b41f

    
           self.qkv_conv = nn.Conv2d(self.in_channels, 2 * self.dk + self.dv, kernel_size=self.kernel_size, stride=stride, padding=self.padding)

Memory/Time Complexity of the relative positional encoding

Thanks for your project.

I have some questions about the implementation of the relative positional encoding.
According to your implementation, the memory cost is O(H^2 W^2), while the paper mentions that they optimize the memory cost to O(HW).

Besides, I have also tried your method on semantic segmentation tasks and found it very slow and memory-hungry.

I am wondering whether you have improved the memory and time issues.

	key_rel_w = nn.Parameter(torch.randn((2 * W - 1, dk), requires_grad=True)).to(device)
	rel_logits_w = self.relative_logits_1d(q, key_rel_w, H, W, Nh, "w")

	key_rel_h = nn.Parameter(torch.randn((2 * H - 1, dk), requires_grad=True)).to(device)
	rel_logits_h = self.relative_logits_1d(torch.transpose(q, 2, 3), key_rel_h, W, H, Nh, "h")

	self.key_rel_w = nn.Parameter(torch.randn((2 * self.shape - 1, dk // Nh), requires_grad=True))
	self.key_rel_h = nn.Parameter(torch.randn((2 * self.shape - 1, dk // Nh), requires_grad=True))