Benchmarking (tiny-cuda-nn)

rmbrualla commented on May 14, 2024
Benchmarking

from tiny-cuda-nn.

Comments (4)

Tom94 commented on May 14, 2024

Yes, that's correct! The command line was

tiny-cuda-nn> .\build\bench_image_ours.exe .\data\images\albert.exr .\data\config.json

with n_neurons: 128 and n_neurons: 64, respectively.

The benchmark was run on Windows / MSVC 2019 / CUDA 11.3. The GPU's fan speed and power envelope were also cranked to 100% and 114%, respectively, to minimize the impact of dynamic clocking. Unfortunately, the artificial 10-second pauses in between the measurements aren't quite enough to work around this in all cases. It's best to monitor GPU clock and temperature (e.g. using MSI Afterburner) to confirm that the clocks stay stable during a run.
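For anyone without MSI Afterburner, a rough alternative is to poll `nvidia-smi` while the benchmark runs. The sketch below is only an illustration: the helper names (`parse_sample`, `read_gpu`, `clock_drop`) and the 5% tolerance are assumptions made here, not part of tiny-cuda-nn, and some GPUs may report `[N/A]` for individual fields, which this parser does not handle.

```python
import subprocess

# Fields passed to `nvidia-smi --query-gpu`: SM clock (MHz), temperature (C), power draw (W).
QUERY = "clocks.sm,temperature.gpu,power.draw"


def parse_sample(line):
    """Parse one line of 'csv,noheader,nounits' output into a dict of floats."""
    clock_mhz, temp_c, power_w = (float(x) for x in line.split(","))
    return {"sm_clock_mhz": clock_mhz, "temp_c": temp_c, "power_w": power_w}


def read_gpu(index=0):
    """Take a single reading from the GPU with the given index (requires nvidia-smi)."""
    out = subprocess.run(
        ["nvidia-smi", f"--id={index}", f"--query-gpu={QUERY}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_sample(out.strip().splitlines()[0])


def clock_drop(samples, tolerance=0.05):
    """Return the samples whose SM clock is more than `tolerance` below the peak clock seen."""
    peak = max(s["sm_clock_mhz"] for s in samples)
    return [s for s in samples if s["sm_clock_mhz"] < peak * (1 - tolerance)]
```

Calling `read_gpu()` about once per second in a second terminal and passing the collected readings to `clock_drop` gives a quick indication of whether the clock sagged during a measurement.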

rmbrualla commented on May 14, 2024

Thanks for clarifying!

I'm getting somewhat confusing results though. I had issues building the project in my environment: it is linked against CUTLASS 2.3, and some loop unrolling failed. Unfortunately, it's hard to pinpoint which loop unroll failed.

In any case, I observe lower performance than yours, except for the case of neurons=128, where I get 2x throughput, which is actually faster than the case of neurons=64 (close to 1e9 elements per second). Maybe there is a bug in my patches, I haven't checked for correctness. I also haven't looked into the profiler carefully -- I'm guessing some of the kernels are spilling. I am benchmarking on a 3090 without any power/fan tricks.

Also, what is the extent of the modifications of CUTLASS wrt the latest version available on github? I saw the PreReLU options in GemmShape, but those are only used for the resnet and I ignored them.

Tom94 commented on May 14, 2024

> In any case, I observe lower performance than yours, except for the case of neurons=128, where I get 2x throughput, which is actually faster than the case of neurons=64 (close to 1e9 elements per second). Maybe there is a bug in my patches, I haven't checked for correctness. I also haven't looked into the profiler carefully -- I'm guessing some of the kernels are spilling. I am benchmarking on a 3090 without any power/fan tricks.

It might be worth verifying the correctness of the results (are the output images trained correctly?) to see whether something wrong under the hood is affecting the performance numbers. As you say, 2x higher throughput sounds too good to be true. :)

Also, if your assessment is based on the console output rather than the emitted .json files, it's worth double-checking the ordering: the program first benchmarks CutlassMLP, which is expected to be slower than the numbers shown in the README graphs, before benchmarking FullyFusedMLP. It also interleaves training and inference. In pseudocode, the ordering is:

for network in ["CutlassMLP", "FullyFusedMLP"]:
    for batch_size in [2**i for i in range(14, 21)]:
        bench_training_speed(network, batch_size)
        bench_inference_speed(network, batch_size)
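
To match console lines to configurations, the ordering above can be enumerated explicitly. This is a small sketch that assumes each `bench_*` call in the pseudocode produces exactly one measurement; the names `measurement_order` and `label_for` are made up for this illustration and do not exist in tiny-cuda-nn.

```python
from itertools import product

NETWORKS = ["CutlassMLP", "FullyFusedMLP"]
BATCH_SIZES = [2**i for i in range(14, 21)]  # 2^14 .. 2^20
MODES = ["training", "inference"]  # training is benchmarked before inference


def measurement_order():
    """Return (network, batch_size, mode) tuples in the order of the nested loops above."""
    # product() varies its last argument fastest, which matches the loop nesting.
    return list(product(NETWORKS, BATCH_SIZES, MODES))


def label_for(index):
    """Describe the configuration behind the index-th measurement (0-based)."""
    network, batch_size, mode = measurement_order()[index]
    return f"{network} batch_size={batch_size} {mode}"
```

Under that assumption there are 28 measurements in total, and the FullyFusedMLP results only begin at the 15th one (`label_for(14)`).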

Tom94 commented on May 14, 2024

> Also, what is the extent of the modifications of CUTLASS wrt the latest version available on github? I saw the PreReLU options in GemmShape, but those are only used for the resnet and I ignored them.

I haven't actually followed CUTLASS development for a while, but the PreReLU option is indeed the only change I remember making at the time.
