leaplabthu / dat Goto Github PK

Repository of Vision Transformer with Deformable Attention (CVPR2022) and DAT++: Spatially Dynamic Vision Transformerwith Deformable Attention

Home Page: https://arxiv.org/abs/2309.01430

License: Apache License 2.0

Python 99.23% Shell 0.77%

deep-learning deformable-attention image-classification pytorch vision-transformer

dat's People

Contributors

Stargazers

Watchers

Forkers

recogsal yicho-yue benjamesbabala li-qingyun cv-ip semiai lb-hit neonsign247 noob919 alan-yly pugangqiang zax130 gabbysuwichaya wanboyang ucasqcz tyzaizl wstchhwp ahmadibtehaj likyoo zhiqi-li shuowang-ai asdf2kr danny0559 0xboater qc-sw zxp771 kimx3966 zivzone praful1993 misterluka poonono qianjinfighter wentaozhu hongbo-sun suyanzhou626 qilei123 yncao qhfan mryu001 blackfeather-wang chi1027 udayeon lxxxdovo soxhwhat shuvo001 rfarssi00 gmum ahwhbc shitoudidi young1906 virtualxqf matteo-fincato gauenk whuhxb shadowior nikeshdevkota wcswcswcs dl-vit winter-jon 88aggressive wzhengkai giaobruce sorokinvld cenkbircanoglu sprinkletian jwdang hebychen

dat's Issues

Changing model input size from 384 -> 1024

I'd like to know what layers to change if I wanted to input an image of size 1024 instead of 384. I'd also want to know if there are any additional concerns about using this model for a much bigger size input. Thanks.

How to run the model

if name == "main":
x = torch.ones((2, 3, 224, 224))
model = DAT()
y = model(x)
print(y.shape)

I tried to run the model with the above code to learn its details, but the following error occurred.

File "Model\DAT\DAT.py", line 232, in
model = DAT()
File "Model\DAT\DAT.py", line 134, in init
use_dwc_mlps[i])
File "Model\DAT\DAT.py", line 59, in init
no_off, fixed_pe, stage_idx)
File "Model\DAT\DAT_Block.py", line 201, in init
nn.Conv2d(self.n_group_channels, self.n_group_channels, kk, stride, kk // 2, groups=self.n_group_channels),
File "env\lib\site-packages\torch\nn\modules\conv.py", line 446, in init
False, _pair(0), groups, bias, padding_mode, **factory_kwargs)
File "env\lib\site-packages\torch\nn\modules\conv.py", line 132, in init
(out_channels, in_channels // groups, *kernel_size), **factory_kwargs))
RuntimeError: Trying to create tensor with negative dimension -96: [-96, 1, 9, 9]

Process finished with exit code 1

Deformable Attention Journal Paper not referenced

Dear Authors
I had way back in May 2021 - already had published a journal article on Deformable Attention
https://pubmed.ncbi.nlm.nih.gov/34022421/

It was even published much earlier in MedArxiv in August 2020
https://www.medrxiv.org/content/10.1101/2020.08.25.20181834v1

I would have expected that you atleast cite my paper in your journal.

I am surprised that you had not done thorough search of prior art and report all prior work in this space of Deformable Attention (and also the reviewers of CVPR also did not thoroughly check if good prior art was done.

Can you please cite my paper in any further publications of Deformable Attention?

Best Regards
Kumar

When you're going to release pretrained weights?

evaluate.sh: line 6: path-to-imagenet: No such file or directory

when running bash evaluate.sh 1 configs/dat_tiny.yaml ./dat_tiny_in1k_224.pth, i get the following error:

evaluate.sh: line 6: path-to-imagenet: No such file or directory

some questions about the reference points and offset network

Really nice work! I have some questions about the code. I see your implementation about the conv_offset and I find you use stride of 1 so the reference points is actually the whole map. But the paper says there is a stride of r. If there is no stride larger than 1, the complexity is the same as standard MHSA even larger! I think there maybe something wrong here.

Face negative dimension issue when running on CIFAR10

Hi, I am Lukas Wang, a master's student from Columbia. I am planning to review cutting-edged VIT-based models on medium-size datasets and found your work really interesting! I was trying to run the code using CIFAT10 dataset for testing but the following error came out.
RuntimeError: Trying to create tensor with negative dimension -96: [-96, 1, 9, 9]

I have noticed that the environment variable groups is set to groups=[-1, -1, 3, 6] as default in DAT model while the operation for DAttentionBaseline in dat_block.py will compute a negative value for first two stages. Could you please check out this issue? Really appreciate your help :)!

Why set the reference point coordinates like this

    def _get_ref_points(self, H_key, W_key, B, dtype, device):

        ref_y, ref_x = torch.meshgrid(
            torch.linspace(0.5, H_key - 0.5, H_key, dtype=dtype, device=device),
            torch.linspace(0.5, W_key - 0.5, W_key, dtype=dtype, device=device)
        )
        ref = torch.stack((ref_y, ref_x), -1)
        ref[..., 1].div_(W_key).mul_(2).sub_(1)
        ref[..., 0].div_(H_key).mul_(2).sub_(1)
        ref = ref[None, ...].expand(B * self.n_groups, -1, -1, -1)  # B * g H W 2

        return ref

i don't understand this one ref[..., 1].div_(W_key).mul_(2).sub_(1) ,
specially why use .mul_(2).sub_(1)?

RuntimeError: Trying to create tensor with negative dimension -96: [-96, 1, 9, 9]

你好，我直接在dat.py文件中使用如下代码测试你的模型，但是报错了，
model = DAT()
x = torch.randn((1, 3, 48, 48)).cuda(0)
y = model(x)
print(y.shape)

报错如下：
RuntimeError: Trying to create tensor with negative dimension -96: [-96, 1, 9, 9]

请问这是什么原因导致的，谢谢~

Pretrain Model

Would you release the pretrain model of DAT.

Thanks.

The computational cost of deformabel attention

Hi, thanks for your excellent work.
I notice that the number of the sampled keys/values is the same as the querys. Therefore, the computational cost of deformable attention is the same as global attention, is it right? So I'm curious why don't you use a global self-attention at the last two stages?

How can i get the most important keys in visualizations?

I wanna get the overall important keys like the figure from visualization in the github main page.
Which variable should i utilize?

Thanks in advance!

Different depthwise convolution kernel sizes?

In your paper, I see you use a depthwise convolution with kernel size 5 to generate offsets, but in the code you use different kernel sizes for different input sizes. Am I missing something?

how to visual the Receptive Field like the figure 1 in the paper ?

固定输入图片尺寸

你好，非常引人注目的工作，我注意到position embeddings采用relative方法的话，需要固定测试图片的大小，而采用dwconv方法可能会占用非常大的显存，不知道作者有没有处理动态输入图像尺寸的方法。

Controlling the number of keys per query

In the appendix of DAT vs D-DETR, you mentioned changing the number of keys in Stage 3 and Stage 4. I was wondering where in the code, can you change for that? Thank you.

处理文本数据或一维序列数据

您好，请问Deformable Attention处理文本数据或一维序列数据是否有效？

About Low Accuracy

Hi, thanks for your excellent work.But when I use your DAT model and pre trained weights for image segmentation tasks, the effect is not ideal. I do this: take out the features of each layer and then recover the image size through simple deconvolution up sampling and skip connection operations. The code is as follows:

If possible, please tell me where the error is. I hope you can publish the segmentation model as soon as possible and look forward to your reply. Thank you

Error: Trying to create tensor with negative dimension -96: [-96, 1, 9, 9]

@Vladimir2506 @Panxuran @LeapLabTHU

I tried to use your basic DAT module, iam getting error below:

Trying to create tensor with negative dimension -96: [-96, 1, 9, 9].

It is because of

DAT/models/dat_blocks.py

Line 163 in 1029c76

    
           nn.Conv2d(self.n_group_channels, self.n_group_channels, kk, stride, kk//2, groups=self.n_group_channels),

which is because of
https://github.com/LeapLabTHU/DAT/issues/new?permalink=https%3A%2F%2Fgithub.com%2FLeapLabTHU%2FDAT%2Fblob%2F1029c76003b346ddcc80de6293ae9c7e2b6c3565%2Fmodels%2Fdat.py%23L98

Kindly help

How to visualize the learned offsets?

Hi,nice work!
I'm confusing about visualizing the learned offsets as shown in Fig. 6?
Could you give some reference code?

采样点个数问题

您好啊，我调试的代码是DAT不是DAT++，
在类DAttentionBaseline的forward函数中有这样一段内容：in dat_blocks.py line 222

q = self.proj_q(x)
q_off = einops.rearrange(q, 'b (g c) h w -> (b g) c h w', g=self.n_groups, c=self.n_group_channels)
offset = self.conv_offset(q_off) # B * g 2 Hg Wg
Hk, Wk = offset.size(2), offset.size(3)
n_sample = Hk * Wk

这一段代码我有疑问，self.conv_offset是这样初始化的：

self.conv_offset = nn.Sequential(
            nn.Conv2d(self.n_group_channels, self.n_group_channels, kk, stride, kk//2, groups=self.n_group_channels),
            LayerNormProxy(self.n_group_channels),
            nn.GELU(),
            nn.Conv2d(self.n_group_channels, 2, 1, 1, 0, bias=False)
        )

这意味这offset的高宽只受stride的影响而和offset_range_factor无关，这是否和你在论文中阐述的内容有冲突；
其次这里的stride在config文件中是1，这也就意味着采样点的是HW而不是HW/r**2.

想请教一下您是我有地方没有考虑到吗。我非常期待您的答复，谢谢！

训练问题

不使用命令行的情况下，该如何通过修改运行配置来调试代码呢？而且可以迁移到其他跟小的数据集上训练吗？

displacement problem

hello sir,
why the displacement is calculated as below ?
displacement = (q_grid.reshape(B * self.n_groups, H * W, 2).unsqueeze(2) - pos.reshape(B * self.n_groups, n_sample, 2).unsqueeze(1)).mul(0.5)
why not
displacement = (pos.reshape(B * self.n_groups, H * W, 2).unsqueeze(2) - pos.reshape(B * self.n_groups, n_sample, 2).unsqueeze(1))

what does -1 mean in strides and groups variable

Hi, thanks for your nice work. I'm very confused with the negative values (-1) in strides and groups. For example, strides=[-1,-1,1,1] or groups=[-1.-1.3,6]. This also triggers errors when creating weight tensors due to that negative dimensions are not allowed in tensors:

RuntimeError: Trying to create tensor with negative dimension -96: [-96, 1, 9, 9]

It would be appreciated if you can explain what does -1 mean here.

Thanks in advance!

About the range of offset

Thank you for your work!
I have a question, whether the range of offset is related to the kernel size of offset_conv, for example, whether the offset_conv of 3*3 can only consider the offset of only one point around each pixel in that feature map layer?

Exporting of object detection & segmentation code

Hello
How are you?
Thanks for contributing to this project.
When can we expect to get the object detection & segmentation code?

您好，DAT可以当作即插即用的模块吗

how to visualize Figure 5,Figure 6, Figure7

Interested in your paper, how to visualize Figure 5,Figure 6, Figure7？Thanks

训练时计算维度出错

在使用DAT时出现了如下错误，参数是按照config设置的，与config一致，每一种方案都试过了，都会报错，麻烦帮忙看下
File "/root/BasicSR-master/basicsr/archs/discriminator_arch.py", line 202, in forward
x_total = einops.rearrange(x, 'b c (r1 h1) (r2 w1) -> b (r1 r2) (h1 w1) c', h1=self.window_size[0], w1=self.window_size[1]) # B x Nr x Ws x C
File "/root/miniconda3/lib/python3.8/site-packages/einops/einops.py", line 487, in rearrange
return reduce(tensor, pattern, reduction='rearrange', **axes_lengths)
File "/root/miniconda3/lib/python3.8/site-packages/einops/einops.py", line 418, in reduce
raise EinopsError(message + '\n {}'.format(e))
einops.EinopsError: Error while processing rearrange-reduction pattern "b c (r1 h1) (r2 w1) -> b (r1 r2) (h1 w1) c".
Input tensor shape: torch.Size([128, 96, 32, 32]). Additional info: {'h1': 7, 'w1': 7}.
Shape mismatch, can't divide axis of length 32 in chunks of 7

what's the difference between DAT and deformable DETR's attention?

build_loader function missing

DAT/main.py

Line 27 in 4bb699e

from data import build_loader

I think ./data is missing. Is build_loader function supposed to be:
https://github.com/microsoft/Swin-Transformer/blob/5d2aede42b4b12cb0e7a2448b58820aeda604426/data/build.py#L38

Which version of imagenet are you using? Is it ILSVRC 2012?

an unused parameters for class TransformerStage:"ns_per_pt" and "sr_ratio"

Hello,I found an unused parameters "ns_per_pt" and "sr_ratio",I was wondering what is it used for?
Thank you very much!

class TransformerStage(nn.Module):
def init(self, fmap_size, window_size, ns_per_pt,
dim_in, dim_embed, depths, stage_spec, n_groups,
use_pe, sr_ratio,
heads, stride, offset_range_factor, stage_idx,
dwc_pe, no_off, fixed_pe,
attn_drop, proj_drop, expansion, drop, drop_path_rate, use_dwc_mlp):

unalignment of classification result on imageNet

thanks for the contribution .
I trained the 224 x 224 imageNet classification model, while the acc has a gap between mine and yours. Hope there could be the pretrained model and related setting.
Thanks.