Comments (7)

JUGGHM avatar JUGGHM commented on June 3, 2024
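          In training, most of the data is not at 1000 either. For example, in Taskonomy most values are around 500-700, and test datasets such as NYU are not around 500 either. So the average is not at 1000. The ablation on this part could be explored further.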

Originally posted by @YvanYin in #3 (comment)

Hi, thanks for your great work! I think I hit a duplicate of this issue when testing on a different dataset. When inferring on outdoor images with 'intrinsic': [1966.9, 1969.5, 948.7, 498.4], the predicted depth is unexpectedly poor. I tried different crop_size values, but it didn't help much; RMSE is about 10m on sparse lidar measurements.

image

For the SHIFT dataset, where 'intrinsic': [640, 640, 640, 400], the result is much more reasonable: RMSE is about 7m.

image

This performance difference seems to be related to the training data, could you share your ideas on this? Thank you!

Thanks for your cases. We will check the first sample later; the depth map looks terrible.

from metric3d.

mwdotzom avatar mwdotzom commented on June 3, 2024

Appreciated! Please use the original image here.
Size: 1920*1080
Intrinsic: [1966.9, 1969.5, 948.7, 498.4]

JUGGHM avatar JUGGHM commented on June 3, 2024

We tested this case with several models. Our ConvNeXt models cannot predict reasonable depth maps, while an in-progress ViT model can output the basic elements (road, vehicles, sound-insulation wall). Since this model still has problems, it will not be released very soon.

mwdotzom avatar mwdotzom commented on June 3, 2024

Thanks for looking into this! Metric3D has a convincing theory and shows good generalization in practice. As a novice, I can only make blind guesses about this extreme scenario (focal length = 1967):

  1. Is it due to the large difference from the training data distribution (focal lengths mainly below 1000), such that the ConvNeXt model doesn't perform well in the CNN regression fashion?
  2. Or does it have something to do with the drastic hypothetical move in canonical camera space? I'm trying to tell which of the two patterns shown on this website applies to Metric3D:

https://exposuretherapy.ca/photography-guide/perspective-and-camera-position/

In the first set of images, 4 cameras shoot from the same spot, and the photos are then cropped/enlarged to get identical images.

image

However, in the second set of images, the 4 cameras are positioned differently so that the target object appears the same size in each shot. The background gets wider as a result of pinhole imaging with a shorter focal length.

image

Are we "cropping / dragging" the objects or "moving closer/away" ourselves? If the latter, does this cone of vision contribute to the distorted prediction?

Forgive me for any silly mistakes, and please do correct me, clearly I could use some help on optics... Cheers!

JUGGHM avatar JUGGHM commented on June 3, 2024

  1. Personally, I do not think focal length affects the shapes of objects in the prediction. However, it might affect how the scale is learned.

  2. For your case, I think the following explains it well:
    [image]

  • CROP does not affect the scale.
  • RESIZE to a LARGER / SMALLER size means the focal length becomes larger / smaller.
    For one specific object, this can be regarded as moving closer to / away from it. In the real world, however, when cameras are posed differently, objects at different depths are resized by different amounts, as derived in the figure above.
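
The crop/resize distinction above can be sketched with a plain pinhole model. This is only an illustration under stated assumptions, not Metric3D's actual code: the `Intrinsics` class, the helper names, the 2 m object size, and the depths are all hypothetical, and the half-pixel offset that some resize conventions apply to the principal point is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole intrinsics in the [fx, fy, cx, cy] order used in this thread."""
    fx: float
    fy: float
    cx: float
    cy: float


def crop(k: Intrinsics, left: float, top: float) -> Intrinsics:
    """Cropping only shifts the principal point; the focal length is unchanged."""
    return Intrinsics(k.fx, k.fy, k.cx - left, k.cy - top)


def resize(k: Intrinsics, scale: float) -> Intrinsics:
    """Resizing by `scale` multiplies the focal length and the principal point."""
    return Intrinsics(k.fx * scale, k.fy * scale, k.cx * scale, k.cy * scale)


def projected_size(f: float, size_m: float, depth_m: float) -> float:
    """Extent in pixels of an object of `size_m` metres at `depth_m` metres."""
    return f * size_m / depth_m


k = Intrinsics(1966.9, 1969.5, 948.7, 498.4)  # intrinsics quoted in the thread
cropped = crop(k, left=100.0, top=50.0)       # hypothetical crop offset
halved = resize(k, 0.5)                       # hypothetical resize factor

# Resizing: every object shrinks by the same factor, whatever its depth.
near_resized = projected_size(halved.fx, 2.0, 10.0) / projected_size(k.fx, 2.0, 10.0)
far_resized = projected_size(halved.fx, 2.0, 50.0) / projected_size(k.fx, 2.0, 50.0)

# Moving the camera 5 m backward: the change depends on each object's depth.
near_moved = projected_size(k.fx, 2.0, 15.0) / projected_size(k.fx, 2.0, 10.0)
far_moved = projected_size(k.fx, 2.0, 55.0) / projected_size(k.fx, 2.0, 50.0)

print(cropped.fx == k.fx, near_resized, far_resized, near_moved, far_moved)
```

In this sketch the two resize ratios are equal, while the two ratios for the moved camera differ, which is the depth-dependent effect described above.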

kwea123 avatar kwea123 commented on June 3, 2024

I'm also very suspicious of training using image crops, like if your crop looks like this, how do you know how far it is? It looks the same at multiple different distances, as you don't know the surroundings

[image: Screenshot 2023-10-29, 10:20:58 PM]

YvanYin avatar YvanYin commented on June 3, 2024

The case shown has a very small field of view. If you enlarge the training crop size, this problem can be alleviated.
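
To make the field-of-view point concrete, here is a rough sketch that computes the horizontal field of view of a crop under the pinhole model. The 1920-pixel width and the focal length 1966.9 come from the sample earlier in this thread; the 512- and 1024-pixel crop widths are hypothetical values chosen for illustration, not Metric3D's actual training settings.

```python
import math


def horizontal_fov_deg(width_px: float, fx: float) -> float:
    """Horizontal field of view (degrees) of a pinhole view `width_px` wide."""
    return math.degrees(2.0 * math.atan(width_px / (2.0 * fx)))


fx = 1966.9  # focal length of the sample discussed in this thread

full = horizontal_fov_deg(1920, fx)        # the full 1920-pixel-wide image
crop_small = horizontal_fov_deg(512, fx)   # hypothetical 512-pixel-wide crop
crop_large = horizontal_fov_deg(1024, fx)  # hypothetical 1024-pixel-wide crop

print(f"full: {full:.1f} deg, 512 crop: {crop_small:.1f} deg, "
      f"1024 crop: {crop_large:.1f} deg")
```

At a fixed focal length, a wider crop covers a larger angle and so keeps more of the surroundings in view.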
