Thanks @FangGet for chiming in here.
I have looked back over the paper and the code, and I'm afraid I may have muddied things: I believe there is a typo in the paper in the computation of \mu: the < should be a > in equation (5)! The text and explanations are all correct; the idea is that the mask should be zero when we want to ignore a pixel, i.e. when the reprojection loss of the unwarped frame is lower than the loss of the warped frame. We will re-check and correct this in the next arxiv update.
Here is my explanation of why the code and the paper produce equivalent results. Notice the correction of the formula for \mu between eqns (2) and (3):
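The explanation itself appears not to have survived this transcript (it was likely an attached image). A rough sketch of the argument, reconstructed by the editor in the paper's notation (not the author's original equations), might read:

```latex
% Editor's sketch of the equivalence argument (not the author's original).
% With the sign in eq. (5) corrected, the mask is
\mu = \Big[\, \min_{t'} pe(I_t, I_{t'}) \;>\; \min_{t'} pe(I_t, I_{t' \to t}) \,\Big],
% and the per-pixel photometric term the code optimises is
L_p = \min\Big( \min_{t'} pe(I_t, I_{t'}),\; \min_{t'} pe(I_t, I_{t' \to t}) \Big).
% Where \mu = 1 the two agree; where \mu = 0, L_p reduces to the identity
% loss, which carries no gradient through the depth or pose networks, so
% both formulations optimise the same quantity up to a gradient-free term.
```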
from monodepth2.
Thank you so much for this discussion. I have learned a lot. I am closing this issue now.
Correct, but that line is only used for visualisation purposes. The actual 'application' of the mask occurs in the torch.min line.
The line you refer to should really be something like `idxs > len(self.opt.frame_ids) - 1`, to be agnostic to the number of source frames. I'll look into a code fix soon.
Thanks, me too! And thanks for leading us to the discovery of the paper typo.
Thanks for your interest! You are basically right. The only thing is that the reprojection_loss and identity_reprojection_loss variables are not a single channel each; in fact, they have as many channels as there are source frames. So in the mono-only training case, each of these variables has two channels, and combined has four. The idea of taking the min over this four-channel tensor is to fold the min in the automasking and the min in the min-reprojection loss into one single operation, in the hope of making it all a bit more efficient.
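To make this concrete, here is a toy numpy sketch (editor's illustration with made-up numbers, not the repo's torch code) of the four-channel min and the idxs > 1 trick:

```python
import numpy as np

# Toy per-pixel losses for two source frames, over a row of 4 pixels.
# identity: pe(I_t, I_t')   reproj: pe(I_t, I_t'->t)
identity = np.array([[0.1, 0.5, 0.2, 0.9],
                     [0.3, 0.6, 0.4, 0.8]])  # 2 identity channels
reproj = np.array([[0.2, 0.1, 0.3, 0.1],
                   [0.4, 0.2, 0.5, 0.2]])    # 2 reprojection channels

# Channels 0-1 are identity losses, channels 2-3 are reprojection losses.
combined = np.concatenate([identity, reproj], axis=0)  # shape (4, 4)

to_optimise = combined.min(axis=0)  # one min does automask + min-reproj
idxs = combined.argmin(axis=0)

# Visualisation trick: idx > 1 means a reprojection channel won,
# i.e. the pixel is kept (not automasked out).
mask = (idxs > 1).astype(np.float32)
print(mask)  # -> [0. 1. 0. 1.]
```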
We compare idxs > 1 to get the mask for visualisation as a dirty/undocumented trick, which only works in the mono case. Because identity_reprojection_loss has two channels, idx > 1 indicates that one of the reprojection_loss channels was smallest for this pixel.
Does that help explain what is happening?
@aj96 @mdfirman thanks guys for the discussion so far. It has been quite informative reading through your conversation.
I have to agree with @aj96 that masking the loss as shown in the paper is not equivalent to picking the smaller of two differences (losses).
That said, I'm not sure how much difference these two implementations make in practice. The whole idea of comparing the two differences originated from the assumption that if a scene or an object is relatively static, the loss of the corresponding pixels will be very small (or zero in theory, for example in a completely static frame).
I see. So because identity_reprojection_loss and reprojection_loss each have two channels, one channel for each target-source pair, checking where idx > 1 finds the pixels where the identity loss is greater than the reprojection loss, i.e. likely dynamic pixels, the ones we want to keep. So it really only works if you have two source frames.
Sorry, one more thing: you also take the mean of the reprojection losses along the channel dimension before doing all these computations for the static mask. Isn't there ambiguity in that? Say you compare an RGB pixel of [25, 25, 75] with an RGB pixel of [75, 25, 25]. Because you computed the mean along the channels, those pixels will look the same when you compare identity_reprojection_loss and reprojection_loss, when they are clearly different.
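For reference, the ambiguity described here is easy to see numerically (editor's illustration):

```python
import numpy as np

# Two clearly different RGB error triples with identical channel means:
a = np.array([25.0, 25.0, 75.0])
b = np.array([75.0, 25.0, 25.0])

print(a.mean(), b.mean())  # both ~41.67, so after the per-channel mean
                           # the comparison cannot tell these pixels apart
```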
@mdfirman I promise this is the last question. I am working with tensorflow, and by just reading the paper, I came up with this naive implementation for computing the mask:
def compute_static_mask(target, source, warped):
    """
    Computes the static mask, masking out likely-static pixels to prevent
    them from contaminating the loss.
    Args:
        target: target image as tensor [B, H, W, 3]
        source: source image as tensor [B, H, W, 3]
        warped: source warped into target as tensor [B, H, W, 3]
    Returns:
        tensor of shape [B, H, W, 1] and type tf.float32 containing 0.0 at
        pixel locations that are likely to be static and 1.0 at pixel
        locations that are likely to be dynamic
    """
    target = tf.image.rgb_to_grayscale(target)
    source = tf.image.rgb_to_grayscale(source)
    warped = tf.image.rgb_to_grayscale(warped)
    auto_mask = tf.abs(target - source) >= tf.abs(target - warped)
    auto_mask = tf.cast(auto_mask, tf.float32)
    return auto_mask
But the static mask looks all wrong. It looks nothing like the paper's examples. I understand that you need to keep the minimum of the two when computing the loss, but for visualization purposes, this seems to be correct?
Hmmm. I think you're almost there. It seems you're converting RGB to greyscale, then computing an L1 loss on the comparison? In our work we compute L1+SSIM on the colour images, which might give quite different results.
I understand that the loss was being calculated incorrectly. But what about the visualization for the mask?
I re-did it and implemented it as you did in the code, and the mask now looks correct. I am only using this concept for the reconstruction loss, not SSIM at the moment, just for rapid prototyping purposes. I've already seen an improvement in the evaluation results after just one epoch. But what I don't understand is why my previous mask visualization was not working.
def compute_static_mask(target, source, warped):
    identity_warped_error = tf.reduce_mean(tf.abs(target - source), axis=-1)
    warped_error = tf.reduce_mean(tf.abs(target - warped), axis=-1)
    identity_warped_error = tf.expand_dims(identity_warped_error, axis=-1)
    warped_error = tf.expand_dims(warped_error, axis=-1)
    combined = tf.concat((identity_warped_error, warped_error), axis=-1)
    to_optimise = tf.reduce_min(combined, axis=-1)
    to_optimise = tf.expand_dims(to_optimise, axis=-1)
    static_mask = tf.math.argmin(combined, axis=-1)
    # Because combined only has two channels, wherever the index of the
    # min is 1, the pixel is dynamic
    static_mask = tf.expand_dims(static_mask, axis=-1)
    static_mask = tf.cast(static_mask, tf.float32)
    static_mask2 = identity_warped_error >= warped_error
    static_mask2 = tf.cast(static_mask2, tf.float32)
    return to_optimise, static_mask, static_mask2
static_mask is the new correct mask while static_mask2 is the old mask.
You also don't seem to elaborate on taking the minimum in your paper. I only realized that after reading through the code. Is the idea that we want to reconstruct the target image, so we choose the pixel that is closest to resembling the target? For cases where pixels are static, you would choose from target - source. For cases where pixels are dynamic, we would choose target - warped. I don't quite understand how this is preventing the infinite loss problem. Is this because during backpropagation, if the pixel was chosen from target - source, then it can't be backpropagated through the view-synthesis loss and therefore can't be projected to infinity?
I'm not fluent enough in tensorflow to be sure, but your two implementations do look like they should do the same thing. So I'm not sure why one looks more right than the other, sorry.
In the code we take the min, while in the paper we directly compare and then apply the mask. These two are functionally equivalent.
The idea of automasking isn't to prevent infinite loss, but to mask out regions corresponding to objects which are approximately moving at the same speed as the camera (to help to prevent infinite depth predictions). Pixels belonging to these objects generally don't move much between source and target images, and automasking is a way of finding these pixels.
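As a toy illustration of that last point (editor's sketch with made-up numbers, not monodepth2 code): a pixel on an object moving at camera speed barely changes between frames, so its identity loss is tiny and the automask removes it.

```python
import numpy as np

# Per-pixel L1 losses for a toy row of three pixels.
# Pixel 0 sits on a car moving at camera speed: it looks the same in the
# source and target frames, so its identity loss is ~0, while its warped
# loss is larger.
identity_loss = np.array([0.01, 0.40, 0.35])  # pe(I_t, I_t')
warped_loss = np.array([0.30, 0.10, 0.15])    # pe(I_t, I_t'->t)

# Mask is 1 where warping actually improved the match (pixel kept).
mask = (identity_loss > warped_loss).astype(np.float32)
print(mask)  # -> [0. 1. 1.]  pixel 0 is automasked out, so it cannot
             # push the co-moving object's depth towards infinity
```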
Sorry, I mis-spoke. I meant to prevent infinite depth, not infinite loss. So that is how it works. But because you are passing the static pixels through a different loss function that does not depend on the depth or pose net, those pixels do not affect how the depth and pose net weights are updated, correct? So what would be the difference if you just didn't include those pixels at all when computing the loss?
I don't see how taking the min and applying the mask are the same thing. The mask will either keep a pixel or zero it out. That is not the same function as taking a minimum.
You are basically just adding a constant to the loss for the static pixels, and it doesn't affect any of the gradients.
Hi both, I'm out of the office at the moment but will come back to these tomorrow. Thanks
@aj96 you may be misunderstanding the automask computation. For example:
there are two reprojection-error maps between the warped images and the target image: reproj_warped_1, reproj_warped_2;
there are also two reprojection-error maps between the source images and the target image: reproj_source_1, reproj_source_2.
As the paper intends (with the sign typo fixed), the mask is computed from: [min pe(I(t), I(t')) > min pe(I(t), I(t'->t))]
so we can get the binary mask like this:
    reproj_source = tf.concat([reproj_source_1, reproj_source_2], axis=3)
    reproj_warped = tf.concat([reproj_warped_1, reproj_warped_2], axis=3)
    min_source = tf.reduce_min(reproj_source, axis=3, keepdims=True)  # min pe(I(t), I(t'))
    min_warped = tf.reduce_min(reproj_warped, axis=3, keepdims=True)  # min pe(I(t), I(t'->t))
    # now we concatenate and compare
    binary_mask = tf.cast(
        tf.argmin(tf.concat([min_source, min_warped], axis=3), axis=3),
        tf.float32)
This should be essentially equal to what the author did in the pytorch code.
Put simply, for a1, a2, b1, b2:
method 1: binary_mask = (min(a1, a2) > min(b1, b2))
method 2: binary_mask = (argmin(a1, a2, b1, b2) > 1)
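The equivalence of the two methods is easy to check numerically. A small numpy sketch (editor's addition, assuming no exact ties between losses):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((2, 1000))  # identity (source) losses, one row per source frame
b = rng.random((2, 1000))  # warped (reprojection) losses, one row per source frame

# method 1: compare the two per-pixel minima directly
mask1 = a.min(axis=0) > b.min(axis=0)

# method 2: one argmin over all four channels; indices 2 and 3 belong to b
mask2 = np.concatenate([a, b], axis=0).argmin(axis=0) > 1

assert np.array_equal(mask1, mask2)  # identical masks for every pixel
```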