Comments (7)
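In training, most of the data is not at 1000 either. Taskonomy, for example, is mostly around 500-700, and test datasets such as NYU are likewise around 500 rather than 1000. So the average is not at 1000. The ablation on this part could be explored further.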
Originally posted by @YvanYin in #3 (comment)
Hi, thanks for your great work! I think I ran into a duplicate of this issue when testing on different datasets. When inferring on outdoor images with 'intrinsic': [1966.9, 1969.5, 948.7, 498.4], the predicted depth is unexpectedly poor. I tried different crop_size values, but it didn't help much; RMSE is about 10 m on sparse lidar measurements. For the SHIFT dataset, where 'intrinsic': [640, 640, 640, 400], the result is much more reasonable; RMSE is about 7 m. This performance difference seems to be related to the training data. Could you share your ideas on this? Thank you!
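For readers who want to reproduce a comparable number, here is a minimal sketch of an RMSE computed only at pixels that have a lidar return. The function name `sparse_rmse` and the `min_depth` threshold are hypothetical choices for illustration; the evaluation script actually used in this issue is not shown in the thread and may differ.

```python
import math


def sparse_rmse(pred, gt, min_depth=1e-3):
    """RMSE between predicted and lidar depth, over valid lidar pixels only.

    pred, gt: 2D sequences (rows of depth values in metres) of equal shape.
    Pixels where gt <= min_depth are treated as "no lidar return" and skipped.
    (min_depth is an assumed convention, not taken from the thread.)
    """
    squared_errors = [
        (p - g) ** 2
        for pred_row, gt_row in zip(pred, gt)
        for p, g in zip(pred_row, gt_row)
        if g > min_depth
    ]
    if not squared_errors:
        raise ValueError("no valid lidar measurements")
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy example: only two pixels carry a lidar measurement.
pred = [[10.0, 20.0], [30.0, 40.0]]
lidar = [[0.0, 23.0], [34.0, 0.0]]
print(sparse_rmse(pred, lidar))
```

Because the mask keeps only a small fraction of pixels, a few badly wrong far-range points can dominate the result, which is worth keeping in mind when comparing the two datasets.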
Thanks for sharing your cases. We will check the first sample later; the depth map looks terrible.
Appreciated! Please use the original image here.
Size: 1920*1080
Intrinsic: [1966.9, 1969.5, 948.7, 498.4]
We tested this case with several models. Our ConvNeXt models cannot predict reasonable depth maps, while an in-progress ViT model can recover the basic elements (road, vehicles, sound insulation wall). Since this model still has problems, it will not be released very soon.
Thanks for looking into this! Metric3D has a convincing theory and shows good generalization ability in practice. As a novice, I can only make blind guesses about this extreme scenario (focal length = 1967):
- Is it due to the large difference from the training data distribution (focal length mainly below 1000), such that the ConvNeXt model doesn't perform well in the CNN regression fashion?
- Or does it have something to do with the drastic hypothetical move in canonical camera space? I'm trying to tell which of the 2 patterns shown on this website applies to Metric3D:
https://exposuretherapy.ca/photography-guide/perspective-and-camera-position/
In the first set of images, 4 cameras shoot from the same spot, and the results are then cropped/enlarged to get identical images.
In the second set of images, however, the 4 cameras are positioned differently so that the target object appears at the same size. The background gets wider as a result of pinhole imaging with a shorter focal length.
Are we "cropping / dragging" the objects or "moving closer/away" ourselves? If the latter, does this cone of vision contribute to the distorted prediction?
Forgive me for any silly mistakes, and please do correct me. Clearly I could use some help on optics... Cheers!
Personally, I do not think focal length affects the shapes of objects in the prediction. However, it might affect scale learning.
- CROP does not affect the scale.
- RESIZE to a LARGER/SMALLER size means the focal length becomes larger/smaller.
For one specific object, a resize can be regarded as moving it closer to or farther from the camera. In the real world, however, when cameras are posed differently, objects at different depths are resized differently, as we have derived in the figure above.
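The crop/resize statements above can be sketched with the standard pinhole model, using the same [fx, fy, cx, cy] layout as the intrinsics quoted in this thread. The helper names below are hypothetical and are not taken from the Metric3D code base; half-pixel conventions for the principal point are ignored for simplicity.

```python
def crop_intrinsic(intrinsic, left, top):
    """Cropping shifts the principal point but leaves the focal length alone."""
    fx, fy, cx, cy = intrinsic
    return [fx, fy, cx - left, cy - top]


def resize_intrinsic(intrinsic, scale):
    """Resizing by `scale` multiplies focal length and principal point by it."""
    fx, fy, cx, cy = intrinsic
    return [fx * scale, fy * scale, cx * scale, cy * scale]


def projected_size_px(focal_px, size_m, depth_m):
    """Pinhole projection: image size in pixels of an object facing the camera."""
    return focal_px * size_m / depth_m


intrinsic = [1966.9, 1969.5, 948.7, 498.4]

# Crop: fx and fy are unchanged, only cx and cy move.
print(crop_intrinsic(intrinsic, left=100, top=50))

# Downscale by 0.5: the focal length drops to roughly 983 px.
print(resize_intrinsic(intrinsic, 0.5))

# For a single object, doubling the focal length looks the same as
# halving its distance: both give the same projected size.
print(projected_size_px(1000.0, 2.0, 10.0), projected_size_px(2000.0, 2.0, 20.0))
```

The last line only holds for one object at one depth. With an actual camera move, every object's depth changes by the same offset rather than the same factor, so near and far objects are rescaled by different amounts, which is the distinction drawn in the reply above.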
I'm also very suspicious of training on image crops. If your crop looks like this, how do you know how far away it is? It looks the same at multiple different distances, since you can't see the surroundings.
The case shown has a very small field of view. If you enlarge the training crop size, this problem can be alleviated.
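To make the field-of-view point concrete, here is a small sketch that computes the horizontal field of view of an image or crop from its width and focal length in pixels, using the pinhole relation FOV = 2 * atan(width / (2 * fx)). The function name is hypothetical. The 1280 px width used for SHIFT is an assumption inferred from its principal point (cx = 640), and the 512 px crop width is purely illustrative.

```python
import math


def horizontal_fov_deg(width_px, fx):
    """Horizontal field of view in degrees for a pinhole camera."""
    return math.degrees(2.0 * math.atan(width_px / (2.0 * fx)))


# Full 1920-pixel-wide image from this thread (fx = 1966.9): about 52 degrees.
print(horizontal_fov_deg(1920, 1966.9))

# SHIFT-like camera, assuming a 1280-pixel-wide image with fx = 640.
print(horizontal_fov_deg(1280, 640))

# An illustrative 512-pixel-wide crop of the first image sees far less context.
print(horizontal_fov_deg(512, 1966.9))
```

A crop keeps the focal length but shrinks the width, so its field of view shrinks with it. Enlarging the crop restores more of the surroundings, in line with the suggestion above.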