Thanks @FangGet for chiming in here.
I have looked back over the paper and the code, and I'm afraid I may have muddied things: I believe there is a typo in the paper in the computation of \mu: the < should be a > in equation (5)! The text and explanations are all correct; the idea is that the mask should be zero when we want to ignore a pixel, i.e. when the reprojection loss of the unwarped frame is lower than the loss of the warped frame. We will re-check and correct this in the next arxiv update.
Here is my explanation of why the code and the paper produce equivalent results. Notice the correction of the formula for \mu between eqns (2) and (3):
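The explanation itself appears not to have survived this transcript (it was likely an attached image). A rough sketch of the argument, reconstructed by the editor in the paper's notation (not the author's original equations), might read:

```latex
% Editor's sketch of the equivalence argument (not the author's original).
% With the sign in eq. (5) corrected, the mask is
\mu = \Big[\, \min_{t'} pe(I_t, I_{t'}) \;>\; \min_{t'} pe(I_t, I_{t' \to t}) \,\Big],
% and the per-pixel photometric term the code optimises is
L_p = \min\Big( \min_{t'} pe(I_t, I_{t'}),\; \min_{t'} pe(I_t, I_{t' \to t}) \Big).
% Where \mu = 1 the two agree; where \mu = 0, L_p reduces to the identity
% loss, which carries no gradient through the depth or pose networks, so
% both formulations optimise the same quantity up to a gradient-free term.
```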
from monodepth2.
Thank you so much for this discussion. I have learned a lot. I am closing this issue now.
Correct, but that line is only used for visualisation purposes. The actual 'application' of the mask occurs in the torch.min line.
The line you refer to should really be something like `idxs > len(self.opt.frame_ids) - 1`, to be agnostic to the number of source frames. I'll look into a code fix soon.
Thanks, me too! And thanks for leading us to the discovery of the paper typo.
Thanks for your interest! You are basically right. The only thing is that the reprojection_loss and identity_reprojection_loss variables are not a single channel each; in fact, they have as many channels as there are source frames. So in the mono-only training case, each of these variables has two channels, and combined has four. The idea of taking the min over this four-channel tensor is to fold the min in the automasking and the min in the min-reprojection loss into one single operation, in the hope of making it all a bit more efficient.
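To make this concrete, here is a toy numpy sketch (editor's illustration with made-up numbers, not the repo's torch code) of the four-channel min and the idxs > 1 trick:

```python
import numpy as np

# Toy per-pixel losses for two source frames, over a row of 4 pixels.
# identity: pe(I_t, I_t')   reproj: pe(I_t, I_t'->t)
identity = np.array([[0.1, 0.5, 0.2, 0.9],
                     [0.3, 0.6, 0.4, 0.8]])  # 2 identity channels
reproj = np.array([[0.2, 0.1, 0.3, 0.1],
                   [0.4, 0.2, 0.5, 0.2]])    # 2 reprojection channels

# Channels 0-1 are identity losses, channels 2-3 are reprojection losses.
combined = np.concatenate([identity, reproj], axis=0)  # shape (4, 4)

to_optimise = combined.min(axis=0)  # one min does automask + min-reproj
idxs = combined.argmin(axis=0)

# Visualisation trick: idx > 1 means a reprojection channel won,
# i.e. the pixel is kept (not automasked out).
mask = (idxs > 1).astype(np.float32)
print(mask)  # -> [0. 1. 0. 1.]
```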
We compare idxs > 1 to get the mask for visualisation as a dirty/undocumented trick, which only works in the mono case. Because identity_reprojection_loss has two channels, idx > 1 indicates that one of the reprojection_loss channels was smallest for this pixel.
Does that help explain what is happening?
@aj96 @mdfirman thanks guys for the discussion so far. It has been quite informative reading through your conversation.
I have to agree with @aj96 that masking the loss as shown in the paper is not equivalent to picking the smaller of two differences (losses).
That said, I'm not sure how much difference these two implementations make in practice. The whole idea of comparing the two differences originated from the assumption that if a scene or an object is relatively static, the loss of the corresponding pixels will be very small (or zero in theory, for example in a completely static frame).
I see. So because identity_reprojection_loss and reprojection_loss each have two channels, one channel for each target-source pair, checking where idx > 1 finds the pixels where the identity loss is greater than the reprojection loss, i.e. likely dynamic pixels, the ones we want to keep. So it really only works if you have two source frames.
Sorry, one more thing: you also take the mean of the reprojection losses along the channel dimension before doing all these computations for the static mask. Isn't there ambiguity in that? Say you compare an RGB pixel of [25, 25, 75] with an RGB pixel of [75, 25, 25]. Because you computed the mean along the channels, those pixels will look the same when you compare identity_reprojection_loss and reprojection_loss, when they are clearly different.
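For reference, the ambiguity described here is easy to see numerically (editor's illustration):

```python
import numpy as np

# Two clearly different RGB error triples with identical channel means:
a = np.array([25.0, 25.0, 75.0])
b = np.array([75.0, 25.0, 25.0])

print(a.mean(), b.mean())  # both ~41.67, so after the per-channel mean
                           # the comparison cannot tell these pixels apart
```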
@mdfirman I promise this is the last question. I am working with tensorflow, and by just reading the paper, I came up with this naive implementation for computing the mask:
def compute_static_mask(target, source, warped):
    """
    Computes the static mask, masking out likely-static pixels to prevent
    them from contaminating the loss.
    Args:
        target: target image as tensor [B, H, W, 3]
        source: source image as tensor [B, H, W, 3]
        warped: source warped into target as tensor [B, H, W, 3]
    Returns:
        tensor of shape [B, H, W, 1] and type tf.float32 containing 0.0 at
        pixel locations that are likely to be static and 1.0 at pixel
        locations that are likely to be dynamic
    """
    target = tf.image.rgb_to_grayscale(target)
    source = tf.image.rgb_to_grayscale(source)
    warped = tf.image.rgb_to_grayscale(warped)
    auto_mask = tf.abs(target - source) >= tf.abs(target - warped)
    auto_mask = tf.cast(auto_mask, tf.float32)
    return auto_mask
But the static mask looks all wrong. It looks nothing like the paper's examples. I understand that you need to keep the minimum of the two when computing the loss, but for visualization purposes, this seems to be correct?
Hmmm. I think you're almost there. It seems you're converting RGB to greyscale, then computing an L1 loss on the comparison? In our work we compute L1+SSIM on the colour images, which might give quite different results.
I understand that the loss was being calculated incorrectly. But what about the visualization for the mask?
I re-did it and implemented it as you did in the code, and the mask now looks correct. I am only using this concept for the reconstruction loss, not SSIM at the moment, just for rapid prototyping purposes. I've already seen an improvement in the evaluation results after just one epoch. But what I don't understand is why my previous mask visualization was not working.
def compute_static_mask(target, source, warped):
    identity_warped_error = tf.reduce_mean(tf.abs(target - source), axis=-1)
    warped_error = tf.reduce_mean(tf.abs(target - warped), axis=-1)
    identity_warped_error = tf.expand_dims(identity_warped_error, axis=-1)
    warped_error = tf.expand_dims(warped_error, axis=-1)
    combined = tf.concat((identity_warped_error, warped_error), axis=-1)
    to_optimise = tf.reduce_min(combined, axis=-1)
    to_optimise = tf.expand_dims(to_optimise, axis=-1)
    static_mask = tf.math.argmin(combined, axis=-1)
    # Because combined only has two channels, wherever the index of the
    # min is 1, the pixel is dynamic
    static_mask = tf.expand_dims(static_mask, axis=-1)
    static_mask = tf.cast(static_mask, tf.float32)
    static_mask2 = identity_warped_error >= warped_error
    static_mask2 = tf.cast(static_mask2, tf.float32)
    return to_optimise, static_mask, static_mask2
static_mask is the new correct mask while static_mask2 is the old mask.
You also don't seem to elaborate on taking the minimum in your paper. I only realized that after reading through the code. Is the idea that we want to reconstruct the target image, so we choose the pixel that is closest to resembling the target? For cases where pixels are static, you would choose from target - source. For cases where pixels are dynamic, we would choose target - warped. I don't quite understand how this is preventing the infinite loss problem. Is this because during backpropagation, if the pixel was chosen from target - source, then it can't be backpropagated through the view-synthesis loss and therefore can't be projected to infinity?
I'm not fluent enough in tensorflow to be sure, but your two implementations do look like they should do the same thing. So I'm not sure why one looks more right than the other, sorry.
In the code we take the min, while in the paper we directly compare and then apply the mask. These two are functionally equivalent.
The idea of automasking isn't to prevent infinite loss, but to mask out regions corresponding to objects which are approximately moving at the same speed as the camera (to help to prevent infinite depth predictions). Pixels belonging to these objects generally don't move much between source and target images, and automasking is a way of finding these pixels.
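As a toy illustration of that last point (editor's sketch with made-up numbers, not monodepth2 code): a pixel on an object moving at camera speed barely changes between frames, so its identity loss is tiny and the automask removes it.

```python
import numpy as np

# Per-pixel L1 losses for a toy row of three pixels.
# Pixel 0 sits on a car moving at camera speed: it looks the same in the
# source and target frames, so its identity loss is ~0, while its warped
# loss is larger.
identity_loss = np.array([0.01, 0.40, 0.35])  # pe(I_t, I_t')
warped_loss = np.array([0.30, 0.10, 0.15])    # pe(I_t, I_t'->t)

# Mask is 1 where warping actually improved the match (pixel kept).
mask = (identity_loss > warped_loss).astype(np.float32)
print(mask)  # -> [0. 1. 1.]  pixel 0 is automasked out, so it cannot
             # push the co-moving object's depth towards infinity
```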
Sorry, I mis-spoke. I meant to prevent infinite depth, not infinite loss. So that is how it works. But because you are passing the static pixels through a different loss function that does not depend on the depth or pose net, those pixels do not affect how the depth and pose net weights are updated, correct? So what would be the difference if you just didn't include those pixels at all when computing the loss?
I don't see how taking the min and applying the mask are the same thing. The mask will either keep a pixel or zero it out. That is not the same function as taking a minimum.
You are basically just adding a constant to the loss for the static pixels, and it doesn't affect any of the gradients.
Hi both, I'm out of the office at the moment but will come back to these tomorrow. Thanks
@aj96 you may be misunderstanding the automask computation. For example:
there are two reprojection-error maps between the warped images and the target image: reproj_warped_1, reproj_warped_2;
there are also two reprojection-error maps between the source images and the target image: reproj_source_1, reproj_source_2.
As the paper intends (with the sign typo fixed), the mask is computed from: [min pe(I(t), I(t')) > min pe(I(t), I(t'->t))]
so we can get the binary mask like this:
    reproj_source = tf.concat([reproj_source_1, reproj_source_2], axis=3)
    reproj_warped = tf.concat([reproj_warped_1, reproj_warped_2], axis=3)
    min_source = tf.reduce_min(reproj_source, axis=3, keepdims=True)  # min pe(I(t), I(t'))
    min_warped = tf.reduce_min(reproj_warped, axis=3, keepdims=True)  # min pe(I(t), I(t'->t))
    # now we concatenate and compare
    binary_mask = tf.cast(
        tf.argmin(tf.concat([min_source, min_warped], axis=3), axis=3),
        tf.float32)
This should be essentially equal to what the author did in the pytorch code.
Put simply, for a1, a2, b1, b2:
method 1: binary_mask = (min(a1, a2) > min(b1, b2))
method 2: binary_mask = (argmin(a1, a2, b1, b2) > 1)
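The equivalence of the two methods is easy to check numerically. A small numpy sketch (editor's addition, assuming no exact ties between losses):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((2, 1000))  # identity (source) losses, one row per source frame
b = rng.random((2, 1000))  # warped (reprojection) losses, one row per source frame

# method 1: compare the two per-pixel minima directly
mask1 = a.min(axis=0) > b.min(axis=0)

# method 2: one argmin over all four channels; indices 2 and 3 belong to b
mask2 = np.concatenate([a, b], axis=0).argmin(axis=0) > 1

assert np.array_equal(mask1, mask2)  # identical masks for every pixel
```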