leaderj1001 / attention-augmented-conv2d
Implementing Attention Augmented Convolutional Networks using PyTorch
License: MIT License
Thanks a lot for sharing your work! I would very much appreciate it if you could also include a 1D version.
First, thank you for posting this; it has been helpful for understanding the method, especially the relative encoding part.
Now, I see on the main page of this repo that you say the einsum operation is slow, so why not replace it with matmul here?
Instead of
rel_logits = torch.einsum('bhxyd,md->bhxym', q, rel_k)
We can have
rel_logits = q.matmul(rel_k.transpose(-1, -2))
I have tested it, and in my network (the SimpleResNet56 used for CIFAR-10 in the original paper) matmul is, on average, 2x faster (110 μs vs. 220 μs).
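As a rough illustration of why the two forms agree, the sketch below works on the last two axes only and uses plain Python lists instead of tensors. The einsum string 'bhxyd,md->bhxym' sums over the shared index d, which is what a matrix product with the transposed rel_k does; the leading b, h, x axes are just carried along. The helper names and the small integer matrices are made up for the example.

```python
def einsum_yd_md_to_ym(q, rel_k):
    """Loop form of 'yd,md->ym': contract the shared last index d."""
    return [
        [sum(q_row[d] * k_row[d] for d in range(len(k_row))) for k_row in rel_k]
        for q_row in q
    ]


def transpose(matrix):
    """Swap the two axes of a list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [
        [sum(a_row[i] * b[i][j] for i in range(len(b))) for j in range(len(b[0]))]
        for a_row in a
    ]


# Hypothetical data: 2 query positions with d = 3, and m = 4 relative embeddings.
q = [[1, 2, 3], [4, 5, 6]]
rel_k = [[1, 0, 1], [0, 1, 0], [2, 2, 2], [1, 1, 1]]

via_einsum = einsum_yd_md_to_ym(q, rel_k)
via_matmul = matmul(q, transpose(rel_k))
print(via_einsum == via_matmul)
```

Any speed difference between the two calls in PyTorch is a property of the library kernels and is not something this sketch can show.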
You can see also this discussion:
I can't run the AA-conv net; it fails with:
RuntimeError: Output 0 of ReshapeAliasBackward0 is a view and is being modified inplace. This view is the output of a function that returns multiple views. Such functions do not allow the output views to be modified inplace. You should replace the inplace operation by an out-of-place one.
The network structure is:
from collections import OrderedDict

import torch.nn as nn

# AugmentedConv comes from this repo's attention_augmented_conv.py;
# `device` is defined elsewhere in the reporter's script.


class AACNN(nn.Module):
    def __init__(self):
        super(AACNN, self).__init__()
        self.conv1_model = nn.Sequential(OrderedDict([
            ('conv1', nn.Conv2d(1, 3, 3, padding=1)),
            ('relu1', nn.ReLU()),
        ]))
        self.augmented_conv1 = AugmentedConv(in_channels=3, out_channels=20, kernel_size=3, dk=40, dv=4, Nh=4,
                                             relative=True, stride=1, shape=64).to(device)
        self.conv2_model = nn.Sequential(OrderedDict([
            ('conv2', nn.Conv2d(32, 16, 5, padding=2)),
            ('pool2', nn.MaxPool2d(2)),
            ('conv3', nn.Conv2d(16, 8, 5)),  # out: 8*12*12
            ('relu', nn.ReLU()),
        ]))
        self.linear = nn.Sequential(OrderedDict([
            ('linear1', nn.Linear(8*12*12, 512)),
            ('linear2', nn.Linear(512, 128)),
            ('linear3', nn.Linear(128, 1)),
        ]))

    def forward(self, x):
        x = self.conv1_model(x)
        print(x.shape)
        x = self.augmented_conv1(x)
        x = self.conv2_model(x)
        x = self.linear(x)
        return x
The print statement outputs torch.Size([10, 3, 64, 64]).
Thanks for any help.
2022-05-28 12:16:19.811844: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
2022-05-28 12:16:19.811984: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2022-05-28 12:17:59.868122: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'nvcuda.dll'; dlerror: nvcuda.dll not found
2022-05-28 12:17:59.887137: W tensorflow/stream_executor/cuda/cuda_driver.cc:312] failed call to cuInit: UNKNOWN ERROR (303)
2022-05-28 12:17:59.890233: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: DESKTOP-M8C53RA
2022-05-28 12:17:59.890303: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: DESKTOP-M8C53RA
2022-05-28 12:18:15.246483: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-05-28 12:18:15.254740: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1c4cef7ff80 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-05-28 12:18:15.254775: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2022-05-30 11:02:52.095422: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 535822336 exceeds 10% of free system memory.
2022-05-30 11:02:52.065682: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 535822336 exceeds 10% of free system memory.
2022-05-30 11:02:52.095431: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 68719476736 exceeds 10% of free system memory.
2022-05-30 11:02:52.651227: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at batch_matmul_op_impl.h:730 : Resource exhausted: OOM when allocating tensor with shape[2,2,65536,65536] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2022-05-30 11:02:59.958471: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 536870912 exceeds 10% of free system memory.
2022-05-30 11:02:59.958471: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 536870912 exceeds 10% of free system memory.
2022-05-30 11:03:09.770279: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at tile_ops.cc:223 : Resource exhausted: OOM when allocating tensor with shape[2,2,256,256,256,256] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2022-05-30 11:03:09.916520: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at tile_ops.cc:223 : Resource exhausted: OOM when allocating tensor with shape[2,2,256,256,256,256] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2022-05-30 11:14:34.514953: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at batch_matmul_op_impl.h:730 : Resource exhausted: OOM when allocating tensor with shape[2,2,65536,65536] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2022-05-30 11:15:06.529323: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at tile_ops.cc:223 : Resource exhausted: OOM when allocating tensor with shape[2,2,256,256,256,256] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2022-05-30 11:15:06.529514: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at tile_ops.cc:223 : Resource exhausted: OOM when allocating tensor with shape[2,2,256,256,256,256] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
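The sizes in the log above can be checked by hand. The sketch below multiplies out the two tensor shapes reported in the OOM messages, assuming 4 bytes per element for "type float" (float32). The helper name tensor_bytes is made up for the example.

```python
from math import prod

FLOAT32_BYTES = 4  # assumption: "type float" in the log means 32-bit floats


def tensor_bytes(shape, itemsize=FLOAT32_BYTES):
    """Bytes needed to store a dense tensor of the given shape."""
    return prod(shape) * itemsize


# Shapes copied from the two kinds of OOM messages in the log.
matmul_bytes = tensor_bytes((2, 2, 65536, 65536))
tile_bytes = tensor_bytes((2, 2, 256, 256, 256, 256))

print(matmul_bytes, tile_bytes)
print(matmul_bytes / 2**30, "GiB")
```

Both shapes come to 68719476736 bytes (64 GiB), which matches the "Allocation of 68719476736" warning. Since 65536 = 256 * 256, the shapes are consistent with attention over every pair of positions of a 256 x 256 feature map, so a smaller spatial size at the attention layer is the most direct way to bring the number down.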
Hi! I think there's a bug at this line in the forward function. Specifically, if the attention tensor attn_out is as follows for an input image with shape (channels, h(=2), w(=3)) and self-attention channels dv = 2:
# attention values of the 6 pixels
Att tensor([[-3.5002, -1.2102],
[-4.3694, -1.5107],
[-4.7621, -1.6465],
[-4.9178, -1.7003],
[-2.2335, -0.7722],
[-5.0056, -1.7307]], grad_fn=<SliceBackward>)
you should not reshape it directly using
attn_out = torch.reshape(attn_out, (batch, Nh, dv // Nh, height, width)) # Method 1
but instead you should use
attn_out = torch.reshape(attn_out.permute(0, 1, 3, 2), (bs, Nh, dv // Nh, H, W)) # Method 2
The output difference:
# Method 1
Att tensor([[[-3.5002, -1.2102, -4.3694],
[-1.5107, -4.7621, -1.6465]],
[[-4.9178, -1.7003, -2.2335],
[-0.7722, -5.0056, -1.7307]]], grad_fn=<SliceBackward>)
vs.
# Method 2
Att tensor([[[-3.5002, -4.3694, -4.7621],
[-4.9178, -2.2335, -5.0056]],
[[-1.2102, -1.5107, -1.6465],
[-1.7003, -0.7722, -1.7307]]], grad_fn=<SliceBackward>)
Hope it helps!
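The difference between the two methods can be reproduced without PyTorch. The sketch below uses the six pixel values from the report as a plain list of (pixel, channel) rows and a made-up helper, reshape_chw, that performs a row-major reshape like torch.reshape does. It drops the batch and head axes, so it is only a minimal model of the real tensor.

```python
# Attention output from the report: 6 pixels (H=2, W=3), dv=2 channels each.
att = [
    [-3.5002, -1.2102],
    [-4.3694, -1.5107],
    [-4.7621, -1.6465],
    [-4.9178, -1.7003],
    [-2.2335, -0.7722],
    [-5.0056, -1.7307],
]
H, W, DV = 2, 3, 2


def reshape_chw(flat, channels, height, width):
    """Row-major reshape of a flat list into (channels, height, width)."""
    assert len(flat) == channels * height * width
    return [
        [flat[(c * height + y) * width:(c * height + y + 1) * width] for y in range(height)]
        for c in range(channels)
    ]


# Method 1: reshape the (pixel, channel) layout directly; channels get interleaved.
method1 = reshape_chw([v for pixel in att for v in pixel], DV, H, W)

# Method 2: move the channel axis first (the permute), then reshape.
channel_major = [list(column) for column in zip(*att)]
method2 = reshape_chw([v for channel in channel_major for v in channel], DV, H, W)

print(method1[0])
print(method2[0])
```

With Method 2, each output channel holds one value per pixel in image order, which is the layout the report argues for.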
According to the paper, they used a dk_k value less than 1 (mostly equal to 2*dv_v or dv_v).
Is there any reason for such a value (dk_k = 2)? I'm just curious.
How long does it take to train AA-Wide-ResNet on the CIFAR-100 dataset?
And could you tell me what hardware you trained on?
Thank you!
In the paper, it's stated that "The relative positional embeddings rH and rW are learned and shared across heads but not layers." I think that in your implementation, as well as in the one printed in the paper, they are learned separately for each head. I would expect a repeat per head, as is done in line 104 for the height. Could you explain whether I am overlooking something in your code?
I get this error when using _wide_layer:
30 def make_layer(self, block, out_channels, n_blocks, dropout_rate, stride):
---> 31 strides = [stride] + [1]*(n_blocks-1)
32 layers = []
33
TypeError: can't multiply sequence by non-int of type 'float'
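This error means n_blocks reached make_layer as a float. One common way that happens, assumed here because the calling code is not shown, is computing the block count with true division. The sketch below reproduces the message and shows that integer division avoids it; depth = 28 and the helper name block_strides are hypothetical.

```python
def block_strides(stride, n_blocks):
    """First block uses `stride`; the remaining n_blocks - 1 blocks use stride 1."""
    return [stride] + [1] * (n_blocks - 1)


depth = 28  # hypothetical network depth

n_float = (depth - 4) / 6   # true division gives 4.0, a float
n_int = (depth - 4) // 6    # floor division gives 4, an int

try:
    block_strides(2, n_float)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

strides = block_strides(2, n_int)
print(error_message)
print(strides)
```

Wrapping the value in int(...) before the call would work as well, provided the division is known to be exact.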
When self.relative == True, self.key_rel_w = nn.Parameter(torch.randn((2 * self.shape - 1, dk // Nh), requires_grad=True)).
However, I'm confused about why dim 0 should be 2 * self.shape - 1.
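One likely reading: along an axis of length L, the offset between a key position and a query position can be anything from -(L - 1) to L - 1, which is 2L - 1 distinct values, and the parameter holds one embedding row per offset. The sketch below enumerates the offsets for a small axis; the function names and the offset-to-row mapping are illustrative assumptions, not taken from the repo.

```python
def relative_offsets(length):
    """All distinct (key - query) offsets along an axis of the given length."""
    return sorted({key - query for query in range(length) for key in range(length)})


def offset_to_row(offset, length):
    """Map an offset in [-(length - 1), length - 1] to a row index in [0, 2*length - 2]."""
    return offset + length - 1


length = 4  # hypothetical axis length
offsets = relative_offsets(length)
rows = [offset_to_row(o, length) for o in offsets]
print(offsets)
print(len(offsets) == 2 * length - 1)
```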
# flat_q, flat_k, flat_v
# (batch_size, Nh, height * width, dvh or dkh)
flat_q = torch.reshape(q, (N, Nh, dk // Nh, H * W))
flat_k = torch.reshape(k, (N, Nh, dk // Nh, H * W))
flat_v = torch.reshape(v, (N, Nh, dv // Nh, H * W))
Is the memory complexity of self-attention O((HW)^2 * Nh)? Is that correct?
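A quick way to sanity-check that estimate is to count the attention weights directly. Given the flattened shapes above, the product of queries and keys has one entry per pair of positions per head, i.e. shape (N, Nh, HW, HW). The sketch below counts the bytes for that tensor alone, assuming float32; the function name and the example sizes are made up.

```python
FLOAT32_BYTES = 4  # assumption: weights are stored as 32-bit floats


def attention_weight_bytes(batch, heads, height, width, itemsize=FLOAT32_BYTES):
    """Bytes for an attention-weight tensor of shape (batch, heads, H*W, H*W)."""
    positions = height * width
    return batch * heads * positions * positions * itemsize


small = attention_weight_bytes(batch=1, heads=1, height=32, width=32)
large = attention_weight_bytes(batch=1, heads=1, height=64, width=64)

print(small / 2**20, "MiB")
print(large // small)
```

Doubling both H and W multiplies the cost by 16, while the cost grows only linearly in Nh and in the batch size, which fits O((HW)^2 * Nh) per sample. Activations kept for the backward pass add to this.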
Hi, I am testing your sample code in attention_augmented_conv.py with:
tmp = torch.randn((16, 3, 32, 32))
a = AugmentedConv(3, 20, kernel_size=3, dk=40, dv=4, Nh=2, relative=True)
print(a(tmp).shape)
But it raises:
Traceback (most recent call last):
  File "attention_augmented_conv.py", line 131, in <module>
    print(a(tmp).shape)
  File "/Users/scouly/anaconda3/envs/Pytorch_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "attention_augmented_conv.py", line 44, in forward
    h_rel_logits, w_rel_logits = self.relative_logits(q)
  File "attention_augmented_conv.py", line 90, in relative_logits
    rel_logits_w = self.relative_logits_1d(q, key_rel_w, H, W, Nh, "w")
  File "attention_augmented_conv.py", line 99, in relative_logits_1d
    rel_logits = torch.einsum('bhxyd,md->bhxym', q, rel_k)
TypeError: einsum() takes 2 positional arguments but 3 were given
I'm guessing it's caused by a PyTorch version compatibility issue.
BTW, I am currently using PyTorch 0.4.1 on macOS.
Attention-Augmented-Conv2d/attention_augmented_conv.py
Lines 95 to 99 in c04acfb
[...] model.named_parameters() as input. And optimizer.step() and optimizer.zero_grad() will ignore your key_rel_w and key_rel_h because they are not in model.named_parameters(). [Though the gradients will be calculated normally when loss.backward() is called.] [...] self.key_rel_w and self.key_rel_h instead.
The correct intuition should be this:
attn_out = torch.matmul(flat_v, weights)
In the relative_logits() function, you have
q = torch.transpose(q, 2, 4)
which gives a tensor with shape (B, Nh, W, H, dkh), not (B, Nh, H, W, dkh).
In the relative_logits_1d() function, you have
rel_logits = torch.einsum('bhxyd,md->bhmxy', q, rel_k)
rel_logits = torch.reshape(rel_logits, (-1, Nh * H, W, 2 * W - 1))
Shouldn't the einsum string be 'bhxyd,md->bhxym'? Otherwise, you are reshaping a tensor with shape (B, Nh, 2 * W - 1, H, W) to a tensor with shape (B, Nh * H, W, 2 * W - 1) in the second line.
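The shapes in this report can be traced with tuples alone. The sketch below assumes q starts as (B, Nh, dkh, H, W), picks hypothetical sizes with H different from W so the axes are distinguishable, and uses two made-up helpers: one that swaps two entries of a shape and one that reads the output shape off an einsum string. The second swap of dims 2 and 3 is one possible way to reach (B, Nh, H, W, dkh), not something confirmed by the report.

```python
def transpose_shape(shape, dim0, dim1):
    """Shape after swapping two axes, as torch.transpose would."""
    dims = list(shape)
    dims[dim0], dims[dim1] = dims[dim1], dims[dim0]
    return tuple(dims)


def einsum_output_shape(equation, *shapes):
    """Output shape implied by an explicit einsum string such as 'ab,bc->ac'."""
    inputs, output = equation.split("->")
    sizes = {}
    for labels, shape in zip(inputs.split(","), shapes):
        assert len(labels) == len(shape)
        for label, size in zip(labels, shape):
            # A label must have the same size everywhere it appears.
            assert sizes.setdefault(label, size) == size
    return tuple(sizes[label] for label in output)


B, Nh, dkh, H, W = 2, 4, 10, 5, 7  # hypothetical sizes
q_shape = (B, Nh, dkh, H, W)       # assumed starting layout of q

once = transpose_shape(q_shape, 2, 4)   # (B, Nh, W, H, dkh)
twice = transpose_shape(once, 2, 3)     # (B, Nh, H, W, dkh)

rel_k_shape = (2 * W - 1, dkh)
reported = einsum_output_shape("bhxyd,md->bhmxy", twice, rel_k_shape)
proposed = einsum_output_shape("bhxyd,md->bhxym", twice, rel_k_shape)
print(once, twice)
print(reported, proposed)
```

With the proposed string the relative-offset axis of size 2 * W - 1 ends up last, so the following reshape to (-1, Nh * H, W, 2 * W - 1) only merges leading axes instead of mixing offsets with positions.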
Use ==/!= to compare str, bytes, and int literals
flake8 testing of https://github.com/leaderj1001/Attention-Augmented-Conv2d on Python 3.7.1
$ flake8 . --count --select=E9,F63,F72,F82 --show-source --statistics
./attention_augmented_conv.py:106:12: F632 use ==/!= to compare str, bytes, and int literals
if case is "w":
^
./attention_augmented_conv.py:108:14: F632 use ==/!= to compare str, bytes, and int literals
elif case is "h":
^
./AA-Wide-ResNet/attention_augmented_conv.py:107:12: F632 use ==/!= to compare str, bytes, and int literals
if case is "w":
^
./AA-Wide-ResNet/attention_augmented_conv.py:109:14: F632 use ==/!= to compare str, bytes, and int literals
elif case is "h":
^
./AA-Wide-ResNet/preprocess.py:13:8: F632 use ==/!= to compare str, bytes, and int literals
if args.dataset_mode is "CIFAR100":
^
./AA-Wide-ResNet/preprocess.py:38:10: F632 use ==/!= to compare str, bytes, and int literals
elif args.dataset_mode is "CIFAR10":
^
./AA-Wide-ResNet/preprocess.py:63:10: F632 use ==/!= to compare str, bytes, and int literals
elif args.dataset_mode is "MNIST":
^
./AA-Wide-ResNet/main.py:86:8: F632 use ==/!= to compare str, bytes, and int literals
if args.dataset_mode is "CIFAR10":
^
./AA-Wide-ResNet/main.py:88:10: F632 use ==/!= to compare str, bytes, and int literals
elif args.dataset_mode is "CIFAR100":
^
./AA-Wide-ResNet/main.py:90:10: F632 use ==/!= to compare str, bytes, and int literals
elif args.dataset_mode is "MNIST":
^
10 F632 use ==/!= to compare str, bytes, and int literals
10
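The reason F632 matters: `is` tests object identity, not value, so a string that arrives at runtime (for example from argparse) may be equal to a literal without being the same object. The sketch below builds an equal string at runtime and compares it both ways; the variable names are illustrative, and the result of the identity check depends on the interpreter, so it is printed rather than relied on.

```python
expected = "CIFAR100"

# Build an equal string at runtime, much like a value parsed from the command line.
dataset_mode = "".join(["CIFAR", "100"])

equal_by_value = dataset_mode == expected   # compares contents
same_object = dataset_mode is expected      # compares identity; interpreter-dependent


def is_mode(value, name):
    """Value comparison, the form flake8 asks for."""
    return value == name


print(equal_by_value, same_object)
print(is_mode(dataset_mode, "CIFAR100"), is_mode(dataset_mode, "MNIST"))
```

Replacing `is` with `==` in each flagged line (for example `if case == "w":`) keeps the intended behavior regardless of how the string was created.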
It seems the width and height of the feature map must be the same.
Can they be different?
Attention-Augmented-Conv2d/AA-Wide-ResNet/attention_augmented_conv.py
Lines 35 to 36 in 1ce94a3
Has anyone successfully added this convolution to a ResNet network?
Hi,
Thanks for the open impl of AAConv/AAWRN 😄
I have access to a 16 GB GPU to do a few experiments on AA Wide ResNet, but the memory grows out of bounds at the start of training. An AAWRN28-10 requires approx. 2 GB of memory for its 206229580 parameters. At that point, running the model with a batch of 128 images from CIFAR-100 causes RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 15.90 GiB total capacity; 11.70 GiB already allocated; 1.26 GiB free; 2.24 GiB cached).
The error output points at line 113 of the AA Conv class: rel_logits = rel_logits.repeat((1, 1, 1, H, 1, 1)).
It happens in the first convolution of the first layer. I tried switching every AA Conv to relative=False, which does a bit better and gets as far as the 2nd conv of the first layer.
I had to downscale the model to either batch size = 16 or a terribly low widen factor. If you have any idea or plan for improving the memory efficiency, it would be neat! 😆
Is it right to use self.kernel_size here for kernel_size instead of 1?
Thanks for your project.
I have some questions about the implementation of the relative positional encoding.
According to your implementation, the memory cost is O(H^2 W^2), while the paper mentions that they optimize the memory cost to O(HW).
Besides, I have also tried your method on semantic segmentation tasks and found it very slow and very memory-hungry.
I am wondering whether you have addressed the memory and time issues.