Comments (7)

alihassanijr avatar alihassanijr commented on June 30, 2024

Hello and thank you for your interest,

I'd refer you to our classification training configs here. We trained on 8xA100s with a total batch size of 1024 (128 per GPU).

When we were working on the original NAT paper and training NAT-Tiny, we only had very early versions of our naive kernels at hand, so training took a few days. But if you install NATTEN right now on an A100 machine and train with our config, it should take less than 24 hours with mixed precision.
If you build NATTEN from source on your machine, you'll be running our GEMM kernels, which should be even faster than the public version.
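For readers unfamiliar with the mixed-precision part, here is a minimal sketch of an AMP training step in PyTorch. The model, data, and optimizer below are placeholder stand-ins to keep it self-contained, not the repo's actual training code:

import torch
import torch.nn as nn

# Placeholder model and data; the real training uses the repo's config.
model = nn.Linear(8, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
loader = [(torch.randn(4, 8).cuda(), torch.randint(0, 2, (4,)).cuda())]

scaler = torch.cuda.amp.GradScaler()
for images, targets in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():          # run forward pass in mixed precision
        loss = criterion(model(images), targets)
    scaler.scale(loss).backward()            # scale loss to avoid fp16 underflow
    scaler.step(optimizer)                   # unscales gradients, then steps
    scaler.update()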


jamesben6688 avatar jamesben6688 commented on June 30, 2024

I installed NATTEN with

pip3 install natten -f https://shi-labs.com/natten/wheels/{cu_version}/torch{torch_version}/index.html
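(For context, the placeholders resolve to concrete CUDA and torch versions. As a hedged example only, an environment with CUDA 11.8 and torch 2.0 might use something like the line below; the exact version strings are defined by NATTEN's wheel index, so check their website for the combinations actually published:)

pip3 install natten -f https://shi-labs.com/natten/wheels/cu118/torch2.0.0/index.html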

I tested the functions:

import natten

# Check if NATTEN was built with CUDA
print(natten.has_cuda())

# Check if NATTEN with CUDA was built with support for float16
print(natten.has_half())

# Check if NATTEN with CUDA was built with support for bfloat16
print(natten.has_bfloat())

# Check if NATTEN with CUDA was built with the new GEMM kernels
print(natten.has_gemm())

However, it raises: AttributeError: module 'natten' has no attribute 'has_cuda' (and likewise for has_half, has_bfloat, and has_gemm).
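(A defensive variant of that check, sketched under the assumption that these helpers may simply be absent from some NATTEN builds, as the error above suggests:)

import natten

# Probe each feature flag only if this NATTEN build exposes it.
for flag in ("has_cuda", "has_half", "has_bfloat", "has_gemm"):
    fn = getattr(natten, flag, None)
    if fn is None:
        print(f"{flag}: not available in this NATTEN build")
    else:
        print(f"{flag}: {fn()}")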

By the way, I found that a batch size of 128 does not fully utilize the 80 GB of memory on the A100, so I increased the batch size to 832; one epoch then took 8 minutes to train. Will this affect performance? May I ask why you set the batch size to 128 per GPU?


alihassanijr avatar alihassanijr commented on June 30, 2024

This is because you installed a NATTEN release; to get the GEMM kernels, you need to build from source.
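(A rough sketch of the from-source route; the exact build steps may vary by NATTEN version, so treat this as an assumption and consult the NATTEN README:)

git clone https://github.com/SHI-Labs/NATTEN
cd NATTEN
pip install -e .    # compiles the CUDA kernels locally, including GEMM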

And re: batch size, there is absolutely nothing wrong with leaving GPU memory free. Just because the GPU has free memory doesn't mean it has enough compute to keep up with a larger batch. Without a specific reason, increasing your batch size just to fill up GPU memory is generally not a good idea.

And given that batch statistics heavily impact training, if you're looking to reproduce a reported number, you should follow the exact settings.
A total batch size of 1024 is very common for models of similar architecture and size trained on ImageNet-1K from scratch.


jamesben6688 avatar jamesben6688 commented on June 30, 2024

Thank you so much for your response.

Does this mean I need to change the per-GPU batch size from 128 to 256 when using 4 GPUs?

I will build NATTEN from source and try it.


alihassanijr avatar alihassanijr commented on June 30, 2024

Yes, that's correct: 256 per GPU on 4 GPUs is still 1024 total. Although we don't use batch norm anywhere, training might still converge to a slightly different accuracy with that change, but the difference should be minimal.
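(Spelled out as a quick check, using the values from this thread:)

# per-GPU batch size x number of GPUs = total batch size
assert 8 * 128 == 1024   # original config: 8 GPUs at 128 each
assert 4 * 256 == 1024   # 4-GPU equivalent: 4 GPUs at 256 each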


jamesben6688 avatar jamesben6688 commented on June 30, 2024

Ok. Thank you so much.


alihassanijr avatar alihassanijr commented on June 30, 2024

Closing this due to inactivity.

