
CUDA Error during validation (humanrf issue, closed)

gafniguy commented on August 29, 2024
CUDA Error during validation

from humanrf.

Comments (8)

ZhaoLizz commented on August 29, 2024

I think I just figured this out. The error occurs because the rays sampled uniformly during validation can all hit the background within a batch, so they are all filtered out. My solution is to sample rays randomly during validation as well, while making sure each pixel is sampled exactly once. To do so, we can define a pixel-level mapping tensor:

    # `__init__` of `data_loader.py` (Line 356)
    self.random_pixel_indices = torch.randperm(self.num_pixels_per_camera, device=self.device)

Then, when defining the ray sampling indices in __next__ (Line 582), we can apply an index mapping like:

    # data_loader, __next__, Line 582
    ray_indices = torch.arange(ray_index_start, ray_index_end, dtype=torch.int64, device=self.device)
    ray_indices = self.random_pixel_indices[ray_indices]  # random 1-to-1 mapping

As a result, the randomly sampled rays in a batch are very unlikely to all hit the background, so validation can run properly.
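To illustrate why the permutation helps, here is a minimal, self-contained sketch. It does not use the humanrf data loader: `validation_batches`, `count_empty`, the 64x64 toy image, and the band-shaped foreground mask are all hypothetical stand-ins, and everything runs on the CPU.

```python
import torch

def validation_batches(num_pixels, batch_size, pixel_map=None):
    """Yield index batches that together cover every pixel exactly once."""
    for start in range(0, num_pixels, batch_size):
        end = min(start + batch_size, num_pixels)
        ray_indices = torch.arange(start, end, dtype=torch.int64)
        if pixel_map is not None:
            ray_indices = pixel_map[ray_indices]  # random 1-to-1 remapping
        yield ray_indices

# Hypothetical toy image: 64x64 pixels, the "actor" covers rows 24..39.
height, width, batch_size = 64, 64, 256
foreground = torch.zeros(height, width, dtype=torch.bool)
foreground[24:40, :] = True
foreground = foreground.flatten()
num_pixels = foreground.numel()

torch.manual_seed(0)
pixel_map = torch.randperm(num_pixels)

def count_empty(batches):
    """Count batches in which no pixel belongs to the foreground."""
    return sum(int(not foreground[b].any()) for b in batches)

sequential_empty = count_empty(validation_batches(num_pixels, batch_size))
shuffled_empty = count_empty(validation_batches(num_pixels, batch_size, pixel_map))

# Every pixel is still visited exactly once with the permutation.
visited = torch.cat(list(validation_batches(num_pixels, batch_size, pixel_map)))
```

In this toy setup, 12 of the 16 sequential batches contain no foreground pixel at all. With the permutation, each batch draws pixels from the whole image, so an all-background batch becomes extremely unlikely (though not strictly impossible).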


isikmustafa commented on August 29, 2024

Hey Guy, could you try reducing the parameter here?

As for the reason: during training, we sample rays randomly and get a similar number of samples per ray on average. During validation, however, we sample rays uniformly. So some batches hit only the background and some hit the actor, which causes high variance in the number of samples per ray. I hope this helps if you experience this issue further.


gafniguy commented on August 29, 2024

Hey Mustafa, thanks for the prompt reply!

What you say makes sense. I tried reducing the number of rays, but it didn't change anything (that would probably have been the fix for a CUDA OOM error). After looking further, I think the problem might be here:

    def forward(self, query_input: QueryInput) -> QueryOutput:
        output = self.density(query_input)

        # query_input.directions has value range [-1, 1]. To query the color_net this is transformed to [0, 1].
        color_net_input = [(query_input.directions + 1) * 0.5, output.geometry_features]
        if self.camera_embedding_dim > 0:
            if query_input.is_training:
                color_net_input.append(self.camera_embeddings(query_input.camera_numbers.squeeze(1).long()))
            else:
                # Using zeros during validation & test time.
                color_net_input.append(
                    torch.zeros(
                        (query_input.directions.shape[0], self.camera_embedding_dim),
                        dtype=query_input.directions.dtype,
                        device=query_input.directions.device,
                    )
                )

        output.radiance = self.color_net(torch.cat(color_net_input, dim=-1))

        return output

What ends up happening is that during validation, self.color_net is called with an input of shape [0, 20] (in training it's roughly [640000, 20]). I wonder whether you also get shape [0, 20] in your setup and it is handled fine?

Edit: attaching a screenshot with prints of the shapes. It seems this forward function is called twice during validation; the second time, query_input.directions.shape[0] = 0.

[screenshot: printed tensor shapes]


gafniguy commented on August 29, 2024

Hi again @isikmustafa, I have adapted my data (so far just with AABB) and am experiencing a similar issue in validation. I have localized it to after the CUDA kernel in the __next__ function of the dataloader, in the section that runs only for val/test.

In the screenshot (bottom part) you can see the inputs to get_samples_aabb_minmax seem to make sense (I am also able to view the cameras properly with your snippet after invoking the dataloader), and on the left you can see all the outputs (even the pixel colors) for the batch have the first dimension of the shape as 0. This is happening on the last image of the validation set.

Really appreciate any tips you have :)
Thanks!
[screenshot: ray sampler inputs and output shapes]


isikmustafa commented on August 29, 2024

Hey Guy, this is something we have seen as well. After the ray sampler is called, the rays that hit the background are eliminated, so it is possible to see shapes like (0, 3) etc. If you just run with the default config without changing anything, this shouldn’t really happen.

So, I’m wondering if this could somehow be related to the PyTorch or tiny-cuda-nn version, because their behavior might have changed.

One hotfix could be to add if-checks for these scenarios and skip running the networks.
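A minimal sketch of such a check, assuming the networks take a 2-D tensor with one row per sample. `run_if_nonempty` is a hypothetical helper, and the `nn.Linear` layer is only a stand-in for the real color network (which, per the thread, receives 20 input features):

```python
import torch
from torch import nn

def run_if_nonempty(net: nn.Module, inputs: torch.Tensor, out_dim: int) -> torch.Tensor:
    """Skip the network for an empty batch and return an empty (0, out_dim) tensor."""
    if inputs.shape[0] == 0:
        # Same dtype and device as the inputs, but no rows.
        return inputs.new_zeros((0, out_dim))
    return net(inputs)

color_net = nn.Linear(20, 3)  # stand-in, not the actual humanrf network

empty_out = run_if_nonempty(color_net, torch.zeros(0, 20), out_dim=3)
full_out = run_if_nonempty(color_net, torch.rand(5, 20), out_dim=3)
```

Returning an empty tensor with the expected trailing dimension keeps downstream code such as concatenation working without special cases.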

I’m on vacation now, so I can’t dig deep into the code, but I’ll take a closer look if the problem persists.


gafniguy commented on August 29, 2024

Thanks! That makes sense. For now I am working on a workaround that just skips the networks, as you suggested, by creating an input batch of zeros with unmasked rays that have a RenderOutput of zeros.

After implementing that, I noticed that if I get a batch with just 1 ray to compute, I encounter the same error... (if I change the hyperparameters around a bit, it happens less often)
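A rough sketch of the kind of workaround described above, under stated assumptions: `render_with_background_fill` and `fake_render` are hypothetical names, the output is a plain tensor rather than the real RenderOutput, and `hit_mask` marks the rays that survive background filtering. As noted above, a guard like this covers only the fully empty case and does not explain the single-ray failure.

```python
import torch

def render_with_background_fill(hit_mask, render_hits, num_channels=3):
    """Return one output row per ray; rays filtered out as background stay zero."""
    output = torch.zeros(hit_mask.shape[0], num_channels)
    if hit_mask.any():
        # Only call the (expensive) renderer when at least one ray survived.
        hit_indices = hit_mask.nonzero(as_tuple=True)[0]
        output[hit_mask] = render_hits(hit_indices)
    return output

calls = []

def fake_render(hit_indices):
    """Hypothetical renderer: records the batch size and returns white pixels."""
    calls.append(hit_indices.numel())
    return torch.ones(hit_indices.numel(), 3)

all_background = render_with_background_fill(torch.zeros(4, dtype=torch.bool), fake_render)
mixed = render_with_background_fill(torch.tensor([True, False, True, False]), fake_render)
```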

Do you remember which version of tinycudann you used? Mine shows 1.7.

Enjoy your time there :)


ZhaoLizz commented on August 29, 2024

@gafniguy @isikmustafa
Hey guys! Thanks for all your comments!
I have encountered the same issue for validation only (training works fine).
The input of self.ray_sampler_func is valid but the output results are empty. My tinycudann shows 1.7 as well.
Have you solved this problem yet?
Thanks for any tips in advance!


isikmustafa commented on August 29, 2024

I just noticed that an empty batch might cause issues in some versions of PyTorch (e.g., 2.1) but not in others. So I'm guessing there might be other relevant factors as well, but as @ZhaoLizz pointed out (many thanks!), this problem can be easily avoided by using random indices.

