Hi! First, thanks for sharing this! It's super impressive. I'm tryin


Benchmarking about tiny-cuda-nn HOT 4 OPEN

rmbrualla commented on May 14, 2024

Benchmarking

from tiny-cuda-nn.

Comments (4)

Tom94 commented on May 14, 2024

Yes, that's correct! The command line was

tiny-cuda-nn> .\build\bench_image_ours.exe .\data\images\albert.exr .\data\config.json

with n_neurons: 128 and n_neurons: 64, respectively.

The benchmark was run on Windows / MSVC 2019 / CUDA 11.3. Fan speed & power envelope of the GPU were also cranked to 100% and 114%, respectively, to minimize the impact of dynamic clocking. Unfortunately, the artificial 10-second pauses in between the measurements aren't quite enough to work around this in all cases. It's best to monitor GPU clock and temperature (e.g. using MSI Afterburner) to confirm that the clocks stay stable during the run.
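For anyone without MSI Afterburner, a rough command-line alternative is to poll nvidia-smi while the benchmark runs. The sketch below is only an illustration: the helper names (parse_sample, clock_drop, poll_gpu), the 5% tolerance, and the 1-second interval are assumptions made here, not part of tiny-cuda-nn. It also assumes GPU 0 is the card under test and that the driver reports a numeric power draw.

```python
import subprocess
import time

# Standard nvidia-smi query properties: SM clock (MHz), temperature (C), power (W).
QUERY_FIELDS = ["clocks.sm", "temperature.gpu", "power.draw"]


def parse_sample(line):
    """Parse one CSV line such as '1695, 64, 348.12' into a dict of floats."""
    values = [float(v.strip()) for v in line.split(",")]
    return dict(zip(QUERY_FIELDS, values))


def clock_drop(samples, tolerance=0.05):
    """Return True if the SM clock ever fell more than `tolerance` below its peak.

    The 5% default is an arbitrary threshold chosen for this sketch.
    """
    clocks = [s["clocks.sm"] for s in samples]
    return min(clocks) < (1.0 - tolerance) * max(clocks)


def poll_gpu(n_samples=10, interval_s=1.0):
    """Collect `n_samples` readings from GPU 0 (assumed to be the benchmarked card)."""
    cmd = [
        "nvidia-smi",
        "--id=0",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    samples = []
    for _ in range(n_samples):
        out = subprocess.run(cmd, capture_output=True, text=True, check=True)
        # Some GPUs report '[N/A]' for power.draw; parse_sample would raise then.
        samples.append(parse_sample(out.stdout.strip().splitlines()[0]))
        time.sleep(interval_s)
    return samples
```

Running poll_gpu() in a second terminal during the benchmark and passing the result to clock_drop() gives a quick hint as to whether the GPU throttled mid-measurement.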


rmbrualla commented on May 14, 2024

Thanks for clarifying!

I'm getting somewhat confusing results though. I had issues building the project in my environment; it is linked against CUTLASS 2.3, and some loop unrolling failed. Unfortunately, it's hard to pinpoint which loop unroll failed.

In any case, I observe lower performance than yours, except for the case of neurons=128, where I get 2x throughput, which is actually faster than the case of neurons=64 (close to 1e9 elements per second). Maybe there is a bug in my patches, I haven't checked for correctness. I also haven't looked into the profiler carefully -- I'm guessing some of the kernels are spilling. I am benchmarking on a 3090 without any power/fan tricks.

Also, what is the extent of the modifications of CUTLASS wrt the latest version available on github? I saw the PreReLU options in GemmShape, but those are only used for the resnet and I ignored them.


Tom94 commented on May 14, 2024

> In any case, I observe lower performance than yours, except for the case of neurons=128, where I get 2x throughput, which is actually faster than the case of neurons=64 (close to 1e9 elements per second). Maybe there is a bug in my patches, I haven't checked for correctness. I also haven't looked into the profiler carefully -- I'm guessing some of the kernels are spilling. I am benchmarking on a 3090 without any power/fan tricks.

It might be worth verifying the correctness of the results (are the output images trained correctly?) to see whether something wrong under the hood is affecting the performance numbers. As you say, 2x higher throughput sounds too good to be true. :)
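As one possible way to do such a check numerically rather than by eye, the output image can be compared against the reference with a simple error metric. The sketch below is an assumption-laden illustration: it works on flat lists of pixel values that have already been loaded by some other means, the function names are hypothetical, and peak=1.0 is a placeholder (an HDR .exr image may need a different peak or a relative error instead).

```python
import math


def has_non_finite(pixels):
    """Return True if any value is NaN or infinite (a common sign of a broken kernel)."""
    return any(not math.isfinite(p) for p in pixels)


def mse(reference, output):
    """Mean squared error between two equally sized flat pixel lists."""
    if len(reference) != len(output):
        raise ValueError("images must have the same number of values")
    return sum((r - o) ** 2 for r, o in zip(reference, output)) / len(reference)


def psnr(reference, output, peak=1.0):
    """Peak signal-to-noise ratio in dB; `peak` is an assumed maximum pixel value."""
    err = mse(reference, output)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / err)
```

A patched build whose PSNR is far below that of a known-good build, or whose output contains non-finite values, is probably skipping work, which would explain an implausibly high throughput.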

Also, if your assessment is based on the console output rather than the emitted .json files, it's worth double-checking the ordering: the program first benchmarks CutlassMLP, which is expected to be slower than the numbers shown in the README graphs, before benchmarking FullyFusedMLP. It also interleaves training and inference. In pseudocode, the ordering is:

for network in ["CutlassMLP", "FullyFusedMLP"]:
    for batch_size in [2**i for i in range(14, 21)]:
        bench_training_speed(network, batch_size)
        bench_inference_speed(network, batch_size)
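
To make it easier to match console numbers to the right configuration, the loop above can be unrolled into an explicit list. This is only a sketch built from the pseudocode: benchmark_order and label_measurements are hypothetical helpers, and it assumes exactly one throughput figure is read off the console per benchmark step.

```python
def benchmark_order():
    """List (network, batch_size, mode) tuples in the order given by the pseudocode."""
    order = []
    for network in ["CutlassMLP", "FullyFusedMLP"]:
        for batch_size in [2**i for i in range(14, 21)]:
            order.append((network, batch_size, "training"))
            order.append((network, batch_size, "inference"))
    return order


def label_measurements(values):
    """Attach each measured value to its (network, batch_size, mode) slot.

    Assumes `values` holds one number per benchmark step, in console order.
    """
    order = benchmark_order()
    if len(values) != len(order):
        raise ValueError(f"expected {len(order)} values, got {len(values)}")
    return [entry + (value,) for entry, value in zip(order, values)]
```

With 2 networks, 7 batch sizes, and 2 modes there are 28 slots; the first 14 belong to CutlassMLP, so only the last 14 should be compared against the FullyFusedMLP numbers.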


Tom94 commented on May 14, 2024

> Also, what is the extent of the modifications of CUTLASS wrt the latest version available on github? I saw the PreReLU options in GemmShape, but those are only used for the resnet and I ignored them.

I haven't actually followed CUTLASS development for a while, but the PreReLU option is indeed the only change I remember making at the time.

