Comments (7)
Hello and thank you for your interest,
I'd refer you to our classification training configs here. We trained on 8xA100s with a total batch size of 1024 (128 per GPU).
When we were working on the original NAT paper and training NAT Tiny, we only had early versions of our naive kernels, so training took a few days. But if you install the current NATTEN on an A100 machine and train with our config, it should take less than 24 hours with mixed precision.
If you build NATTEN from source on your machine, you'll be running our GEMM kernels, which should be even faster than the public version.
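As a rough illustration of the mixed-precision part (not our actual training script; the tiny model and random batch below are placeholders), a PyTorch AMP training step looks roughly like this:
import torch
import torch.nn as nn

# Illustrative mixed-precision (AMP) step with a placeholder model and
# random data; the real training loop lives in the repo's configs/scripts.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1000)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

images = torch.randn(8, 3, 224, 224, device=device)    # stand-in batch
targets = torch.randint(0, 1000, (8,), device=device)   # stand-in labels

optimizer.zero_grad()
with torch.cuda.amp.autocast(enabled=(device == "cuda")):
    loss = criterion(model(images), targets)
scaler.scale(loss).backward()  # scale the loss so fp16 gradients don't underflow
scaler.step(optimizer)
scaler.update()
print(f"loss: {loss.item():.4f}")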
from neighborhood-attention-transformer.
I installed NATTEN with:
pip3 install natten -f https://shi-labs.com/natten/wheels/{cu_version}/torch{torch_version}/index.html
I tested the functions:
import natten
# Check if NATTEN was built with CUDA
print(natten.has_cuda())
# Check if NATTEN with CUDA was built with support for float16
print(natten.has_half())
# Check if NATTEN with CUDA was built with support for bfloat16
print(natten.has_bfloat())
# Check if NATTEN with CUDA was built with the new GEMM kernels
print(natten.has_gemm())
However, it shows AttributeError: module 'natten' has no attribute 'has_cuda' (and likewise for has_half, has_bfloat, and has_gemm).
By the way, I found that a batch size of 128 cannot fully utilize the 80 GB of memory on an A100, so I increased the batch size to 832; one epoch then took 8 minutes to train. Will this affect performance? May I ask why you set the batch size to 128 per GPU?
from neighborhood-attention-transformer.
This is because you installed a NATTEN release; to get the GEMM kernels, you need to build from source.
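If you want to see which of those helpers your installed copy actually exposes without hitting that AttributeError, a minimal sketch (using only the function names from your snippet, plus a plain getattr guard) is:
import natten

# Probe for the capability helpers before calling them, so this also
# works on older NATTEN releases that predate some of these functions.
for name in ("has_cuda", "has_half", "has_bfloat", "has_gemm"):
    fn = getattr(natten, name, None)
    if fn is None:
        print(f"{name}: not exposed by this NATTEN build/release")
    else:
        print(f"{name}: {fn()}")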
And re: batch size, there is absolutely nothing wrong with leaving memory free on your GPU. Just because the GPU has free memory doesn't mean it has enough compute to handle a larger batch efficiently. Without a specific reason, increasing your batch size just to fill up GPU memory is generally not a good idea.
And given that batch statistics heavily impact training, if you're looking to reproduce a number, you should follow the exact settings.
Batch size 1024 is very common for similar models (architecture and size) trained on ImageNet-1K from scratch.
from neighborhood-attention-transformer.
Thank you so much for your response.
Does this mean I need to change the batch size from 128 to 256 for 4 GPUs?
Will install NATTEN from source and try it.
from neighborhood-attention-transformer.
Yes, that's correct: 256 on 4 GPUs is still 1024 total. Although we don't use batch norm anywhere, training might still converge to a slightly different accuracy with that change, but the difference should be minimal.
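To make the arithmetic explicit, here is a tiny sketch of how the effective batch size comes together (the gradient-accumulation argument is a hypothetical extra knob, not something our config uses):
# Effective batch size under data-parallel training:
# total = per-GPU batch size x number of GPUs (x accumulation steps, if any).
def total_batch_size(per_gpu: int, num_gpus: int, grad_accum_steps: int = 1) -> int:
    # grad_accum_steps is hypothetical here; our released config does not use it.
    return per_gpu * num_gpus * grad_accum_steps

print(total_batch_size(128, 8))  # 1024 -- the setting we trained with
print(total_batch_size(256, 4))  # 1024 -- equivalent total on 4 GPUs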
from neighborhood-attention-transformer.
Ok. Thank you so much.
from neighborhood-attention-transformer.
Closing this due to inactivity.
from neighborhood-attention-transformer.