Hi Mustafa and the team, I seem to have managed to install it correc

CUDA Error during validation about humanrf HOT 8 CLOSED

gafniguy commented on August 29, 2024

CUDA Error during validation

from humanrf.

Comments (8)

ZhaoLizz commented on August 29, 2024

I guess I just figured this out. The error occurs when the rays sampled uniformly during validation all hit the background and are all filtered out. My solution is to sample rays randomly during validation as well, while making sure each pixel is sampled exactly once. To do so, we can define a pixel-level mapping tensor:

    # `__init__` of `data_loader.py` (Line 356)
    self.random_pixel_indices = torch.randperm(self.num_pixels_per_camera, device=self.device)

Then, when defining the ray sampling indices in __next__ (Line 582), we can apply an index mapping like this:

    # data_loader, __next__, Line 582
    ray_indices = torch.arange(ray_index_start, ray_index_end, dtype=torch.int64, device=self.device)
    ray_indices = self.random_pixel_indices[ray_indices]  # random 1-to-1 mapping

With this mapping, a batch of randomly sampled rays is very unlikely to hit only the background, so validation can run properly.
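
To illustrate why the permutation keeps every pixel sampled exactly once, here is a minimal standalone sketch. The image size, batch size, and variable names are hypothetical and not taken from the humanrf data loader; only `torch.randperm` and `torch.arange` mirror the snippets above.

```python
import torch

num_pixels = 12   # hypothetical tiny image
batch_size = 5    # hypothetical number of rays per validation batch

# Fixed permutation of all pixel indices, created once (as in `__init__`).
generator = torch.Generator().manual_seed(0)
random_pixel_indices = torch.randperm(num_pixels, generator=generator)

visited = []
for start in range(0, num_pixels, batch_size):
    end = min(start + batch_size, num_pixels)
    # Contiguous indices for this batch, as in the original validation loop.
    ray_indices = torch.arange(start, end, dtype=torch.int64)
    # Remap them through the permutation (random one-to-one mapping).
    ray_indices = random_pixel_indices[ray_indices]
    visited.append(ray_indices)

visited = torch.cat(visited)
# Sorting the visited indices must give back 0..num_pixels-1 with no repeats.
covers_every_pixel_once = torch.equal(visited.sort().values, torch.arange(num_pixels))
print(covers_every_pixel_once)
```

Because the permutation is built once and then only indexed, the batches still partition the image; only the order in which pixels are visited changes.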

isikmustafa commented on August 29, 2024

Hey Guy, could you try reducing the parameter here.

As for the reason: during training we sample rays randomly, so each batch has a similar number of samples per ray on average. During validation, however, we sample rays uniformly, so some batches hit only the background while others hit the actor, which causes high variance in the number of samples per ray. I hope this helps if you run into the issue again.
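
The effect described above can be reproduced with a small toy simulation. This is only a sketch under stated assumptions: the 64x64 image, the centered 16x16 "actor" mask, and the batch size are all made up for illustration, and "uniform" is taken to mean visiting pixels row by row.

```python
import random

WIDTH = HEIGHT = 64   # hypothetical image size
BATCH_SIZE = 512      # hypothetical rays per validation batch


def is_foreground(pixel: int) -> bool:
    """Hypothetical mask: a 16x16 actor in the middle of the image."""
    row, col = divmod(pixel, WIDTH)
    return 24 <= row < 40 and 24 <= col < 40


def foreground_per_batch(order: list[int]) -> list[int]:
    """Count foreground rays in each consecutive batch of `order`."""
    return [
        sum(is_foreground(p) for p in order[start:start + BATCH_SIZE])
        for start in range(0, len(order), BATCH_SIZE)
    ]


uniform_order = list(range(WIDTH * HEIGHT))   # row by row
shuffled_order = uniform_order.copy()
random.Random(0).shuffle(shuffled_order)      # fixed one-to-one permutation

uniform_counts = foreground_per_batch(uniform_order)
shuffled_counts = foreground_per_batch(shuffled_order)

print(uniform_counts)  # → [0, 0, 0, 128, 128, 0, 0, 0]
print(min(shuffled_counts), max(shuffled_counts))
```

In this toy setup six of the eight row-by-row batches contain no foreground rays at all, while the shuffled batches each get a share of the actor.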

gafniguy commented on August 29, 2024

Hey Mustafa, thanks for the prompt reply!

What you say makes sense. I tried reducing the number of rays, but it didn't change anything (that probably would have been the fix for a CUDA OOM error). After looking further, I think the problem might be here:

    def forward(self, query_input: QueryInput) -> QueryOutput:
        output = self.density(query_input)

        # query_input.directions has value range [-1, 1]. To query the color_net this is transformed to [0, 1].
        color_net_input = [(query_input.directions + 1) * 0.5, output.geometry_features]
        if self.camera_embedding_dim > 0:
            if query_input.is_training:
                color_net_input.append(self.camera_embeddings(query_input.camera_numbers.squeeze(1).long()))
            else:
                # Using zeros during validation & test time.
                color_net_input.append(
                    torch.zeros(
                        (query_input.directions.shape[0], self.camera_embedding_dim),
                        dtype=query_input.directions.dtype,
                        device=query_input.directions.device,
                    )
                )

        output.radiance = self.color_net(torch.cat(color_net_input, dim=-1))

        return output

What ends up happening is that during validation, self.color_net is called with an input of shape [0,20] (in training it's roughly [640000,20]). I wonder if in your setup you also get shape [0,20] and it's handled fine?

Edit: attaching a screenshot with prints of the shapes. It seems to call this forward function twice during validation; the second time, query_input.directions.shape[0] = 0.

gafniguy commented on August 29, 2024

Hi again @isikmustafa, I have adapted my data (so far just with AABB) and am experiencing a similar issue in validation. I have localized the issue to after the CUDA kernel in the __next__ function of the dataloader, in the section that is only for val/test.

In the screenshot (bottom part) you can see the inputs to get_samples_aabb_minmax seem to make sense (I am also able to view the cameras properly with your snippet after invoking the dataloader), and on the left you can see all the outputs (even the pixel colors) for the batch have the first dimension of the shape as 0. This is happening on the last image of the validation set.

Really appreciate any tips you have :)
Thanks!

isikmustafa commented on August 29, 2024

Hey Guy, this is something we have seen as well. After the ray sampler is called, the rays that hit the background are eliminated. So, it is possible to see shapes like (0, 3) etc. If you just run with the default config without changing anything, this shouldn't really happen.

So I'm wondering whether this is somehow related to the PyTorch or tiny-cuda-nn version, because they might have changed behavior.

One hotfix could be to add if-checks for these scenarios and skip running the networks.

I’m on a vacation now, so I can’t dig deep into the code but I’ll take a closer look if the problem persists somehow.
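
The hotfix mentioned above could look roughly like the following sketch. `EmptySafe` is a hypothetical wrapper, not part of humanrf, and `nn.Linear` only stands in for the real color network; the input width of 20 is taken from the shapes reported earlier in this thread.

```python
import torch
from torch import nn


class EmptySafe(nn.Module):
    """Hypothetical wrapper that skips the wrapped network for empty batches."""

    def __init__(self, net: nn.Module, out_dim: int):
        super().__init__()
        self.net = net
        self.out_dim = out_dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if x.shape[0] == 0:
            # No rays survived filtering: return an empty result with the
            # expected width instead of calling the network.
            return x.new_zeros((0, self.out_dim))
        return self.net(x)


# nn.Linear is a stand-in for the real color network (assumption).
color_net = EmptySafe(nn.Linear(20, 3), out_dim=3)

empty_out = color_net(torch.zeros((0, 20)))   # shape [0, 20], as seen in validation
full_out = color_net(torch.zeros((4, 20)))
print(tuple(empty_out.shape), tuple(full_out.shape))
```

This only covers the empty-batch case. It would not by itself address errors for very small non-empty batches.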

gafniguy commented on August 29, 2024

Thanks! That makes sense. For now I am working on a workaround that skips the networks as you suggested, by creating an input batch of zeros with unmasked rays whose RenderOutput is all zeros.

After implementing that, I noticed that if I get a batch with just 1 ray to compute, I encounter the same error... (when I change the hyperparameters around a bit, it happens less often)

Do you remember which version of tinycudann you used? Mine shows 1.7.

Enjoy your time there :)

ZhaoLizz commented on August 29, 2024

@gafniguy @isikmustafa
Hey guys! Thanks for all your comments!
I have encountered the same issue for validation only (training works fine).
The input of self.ray_sampler_func is valid but the output results are empty. My tinycudann shows 1.7 as well.
Have you solved this problem yet?
Thanks for any tips in advance!

isikmustafa commented on August 29, 2024

I just noticed that an empty batch might cause issues in some versions of PyTorch (e.g., 2.1) but not in others. So I'm guessing there might be other relevant factors as well, but as @ZhaoLizz pointed out (many thanks!), this problem can be easily avoided by using random indices.
