Comments (7)

alihassanijr avatar alihassanijr commented on June 30, 2024

Hello and thank you for your interest,

I'd refer you to our classification training configs here. We trained on 8xA100s with a total batch size of 1024 (128 per GPU).

When we were working on the original NAT paper and training NAT-Tiny, we only had very early versions of our naive kernels at hand, so training took a few days. But if you install NATTEN right now on an A100 machine and train with our config, it should take less than 24 hours with mixed precision.
If you build NATTEN from source on your machine, you'll be running our GEMM kernels, which should be even faster than the public version.
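For readers unfamiliar with the mixed-precision part, here is a minimal sketch of an AMP training step in PyTorch. The model, data, and optimizer below are placeholder stand-ins to keep it self-contained, not the repo's actual training code:

import torch
import torch.nn as nn

# Placeholder model and data; the real training uses the repo's config.
model = nn.Linear(8, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
loader = [(torch.randn(4, 8).cuda(), torch.randint(0, 2, (4,)).cuda())]

scaler = torch.cuda.amp.GradScaler()
for images, targets in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():          # run forward pass in mixed precision
        loss = criterion(model(images), targets)
    scaler.scale(loss).backward()            # scale loss to avoid fp16 underflow
    scaler.step(optimizer)                   # unscales gradients, then steps
    scaler.update()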


jamesben6688 avatar jamesben6688 commented on June 30, 2024

I installed NATTEN with

pip3 install natten -f https://shi-labs.com/natten/wheels/{cu_version}/torch{torch_version}/index.html
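(For context, the placeholders resolve to concrete CUDA and torch versions. As a hedged example only, an environment with CUDA 11.8 and torch 2.0 might use something like the line below; the exact version strings are defined by NATTEN's wheel index, so check their website for the combinations actually published:)

pip3 install natten -f https://shi-labs.com/natten/wheels/cu118/torch2.0.0/index.html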

I tested the functions:

import natten

# Check if NATTEN was built with CUDA
print(natten.has_cuda())

# Check if NATTEN with CUDA was built with support for float16
print(natten.has_half())

# Check if NATTEN with CUDA was built with support for bfloat16
print(natten.has_bfloat())

# Check if NATTEN with CUDA was built with the new GEMM kernels
print(natten.has_gemm())

However, it raises: AttributeError: module 'natten' has no attribute 'has_cuda' (and likewise for has_half, has_bfloat, and has_gemm).
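(A defensive variant of that check, sketched under the assumption that these helpers may simply be absent from some NATTEN builds, as the error above suggests:)

import natten

# Probe each feature flag only if this NATTEN build exposes it.
for flag in ("has_cuda", "has_half", "has_bfloat", "has_gemm"):
    fn = getattr(natten, flag, None)
    if fn is None:
        print(f"{flag}: not available in this NATTEN build")
    else:
        print(f"{flag}: {fn()}")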

By the way, I found that a batch size of 128 does not fully utilize the 80 GB of memory on the A100, so I increased the batch size to 832; one epoch then took 8 minutes to train. Will this affect performance? May I ask why you set the batch size to 128 per GPU?


alihassanijr avatar alihassanijr commented on June 30, 2024

This is because you installed a NATTEN release; to get the GEMM kernels, you need to build from source.
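(A rough sketch of the from-source route; the exact build steps may vary by NATTEN version, so treat this as an assumption and consult the NATTEN README:)

git clone https://github.com/SHI-Labs/NATTEN
cd NATTEN
pip install -e .    # compiles the CUDA kernels locally, including GEMM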

And re: batch size, there is absolutely nothing wrong with leaving GPU memory free. Just because the GPU has free memory doesn't mean it has enough compute to keep up with a larger batch. Without a specific reason, increasing your batch size just to fill up GPU memory is generally not a good idea.

And given that batch statistics heavily impact training, if you're looking to reproduce a reported number, you should follow the exact settings.
A total batch size of 1024 is very common for models of similar architecture and size trained on ImageNet-1K from scratch.


jamesben6688 avatar jamesben6688 commented on June 30, 2024

Thank you so much for your response.

Does this mean I need to change the per-GPU batch size from 128 to 256 when using 4 GPUs?

I will build NATTEN from source and try it.


alihassanijr avatar alihassanijr commented on June 30, 2024

Yes, that's correct: 256 per GPU on 4 GPUs is still 1024 total. Although we don't use batch norm anywhere, training might still converge to a slightly different accuracy with that change, but the difference should be minimal.
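(Spelled out as a quick check, using the values from this thread:)

# per-GPU batch size x number of GPUs = total batch size
assert 8 * 128 == 1024   # original config: 8 GPUs at 128 each
assert 4 * 256 == 1024   # 4-GPU equivalent: 4 GPUs at 256 each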


jamesben6688 avatar jamesben6688 commented on June 30, 2024

Ok. Thank you so much.


alihassanijr avatar alihassanijr commented on June 30, 2024

Closing this due to inactivity.

