compvis / depth-fm

DepthFM: Fast Monocular Depth Estimation with Flow Matching

License: MIT License

Python 17.20% Jupyter Notebook 82.80%

depth-fm's Issues

timesteps during the training

Hi, thanks for the great work. I found that the provided checkpoint uses a fixed timestep of 400, which is not explained in the paper. I would really appreciate your help in clarifying the following questions:

Do you use a fixed timestep as the input to the UNet during training?
During training, do you add Gaussian noise directly to the RGB latent, without using the q_sample function?

Thanks very much.

Vector displacement? [feature request]

Hi, I want to know if it's possible to add vector displacement. It gives better results than a depth map on 3D objects.
Depth vs. vector displacement: (comparison image not preserved)

Why is the Marigold result at NFE=1 pure noise?

Hi, thanks for the inspiring work.
In Fig. 6, the Marigold result at NFE=1 is pure noise. That seems counter-intuitive. At NFE=1, we should just get the conditional mean of the prediction, i.e. x0 hat. It may be blurry, but it's hard to see why it should be pure noise.

training problem

Hi, authors! Thanks for open-sourcing the inference code. I am interested in reimplementing the training process, but my results look weird. Here is the code snippet I use. I would appreciate it if you could help me find the problem in the code.

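  import torch
  import torch.nn.functional as F

  x_1 = depth_latents  # flow target: depth latents, assumed shape (B, C, H, W)
  t = torch.rand((rgb_latents.size(0),), device=rgb_latents.device)  # one t in [0, 1) per sample
  num_train_timesteps = 1000
  x_0 = q_sample(rgb_latents, t=t * num_train_timesteps)  # noised RGB latents as the flow source
  t_b = t.view(-1, 1, 1, 1)  # reshape so t broadcasts over (C, H, W)
  x_t = (1 - t_b) * x_0 + t_b * x_1  # linear interpolation between source and target
  targets = x_1 - x_0  # velocity of the straight path from x_0 to x_1
  pred = UNet(x_t, t)  # predicted velocity at (x_t, t)
  loss = F.mse_loss(pred, targets)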
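
For comparison, here is a minimal, self-contained sketch of a generic flow-matching training pair, with NumPy arrays standing in for latents. It is not the authors' training code: the array shapes, the helper name `flow_matching_pair`, and the plain linear path are assumptions for illustration. It mainly shows that `t` has to be reshaped per sample before it is mixed with 4D latents, and that the regression target is the constant velocity `x_1 - x_0`.

```python
import numpy as np

def flow_matching_pair(x_0, x_1, t):
    """Return (x_t, target) for a straight path from x_0 to x_1.

    x_0, x_1: arrays of shape (B, C, H, W); t: array of shape (B,) in [0, 1].
    """
    t_b = t.reshape(-1, 1, 1, 1)          # one t per sample, broadcast over C, H, W
    x_t = (1.0 - t_b) * x_0 + t_b * x_1   # linear interpolation between the endpoints
    target = x_1 - x_0                    # velocity of the straight path (constant in t)
    return x_t, target

rng = np.random.default_rng(0)
x_0 = rng.normal(size=(2, 4, 8, 8))   # stand-in for the source latents
x_1 = rng.normal(size=(2, 4, 8, 8))   # stand-in for the depth latents
t = rng.uniform(size=(2,))
x_t, target = flow_matching_pair(x_0, x_1, t)

# Sanity check: following the target velocity for the remaining time (1 - t) lands on x_1.
reached = x_t + (1.0 - t.reshape(-1, 1, 1, 1)) * target
```

If `reached` does not match `x_1` in a reimplementation, the interpolation and the target are inconsistent with each other, which is worth ruling out before looking at the network.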

Explanation of # of Steps and # of Ensembles

Can you please explain in general terms what the number of steps and the number of ensembles mean? I do not have a good understanding of diffusion and flow matching models. Also, what would you suggest as a good combination? I see the defaults are set to 2 and 4, but as I increase the number of steps, I see more detail in the depth map.
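
As a rough illustration of the two settings, the sketch below treats the number of steps as the number of solver steps used to integrate an ODE from t=0 to t=1, and the ensemble size as the number of independently noised runs that get averaged. This is a toy under stated assumptions: `toy_velocity` is a hypothetical stand-in for the trained network, and `euler_sample` and `ensemble_predict` are illustrative names, not functions from the repository.

```python
import numpy as np

def toy_velocity(x, t, target):
    # Hypothetical stand-in for the trained network: points straight at `target`.
    return (target - x) / max(1.0 - t, 1e-6)

def euler_sample(x_start, target, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with `num_steps` Euler steps."""
    x = x_start.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * toy_velocity(x, i * dt, target)
    return x

def ensemble_predict(image, target, num_steps, ensemble_size, noise_std=0.1, seed=0):
    """Average `ensemble_size` runs that start from differently noised copies of `image`."""
    rng = np.random.default_rng(seed)
    runs = [
        euler_sample(image + noise_std * rng.normal(size=image.shape), target, num_steps)
        for _ in range(ensemble_size)
    ]
    return np.mean(runs, axis=0)

image = np.zeros((4, 4))    # stand-in for the image latent
target = np.ones((4, 4))    # stand-in for the depth latent
pred = ensemble_predict(image, target, num_steps=2, ensemble_size=4)
```

Because the toy field is perfectly straight, even a single step reaches the target here. A learned field is generally not perfectly straight, so more steps reduce integration error, which is one plausible reason for the extra detail at higher step counts. Each step and each ensemble member costs additional network evaluations.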

Mismatch Between Different Runs

I tried running depth_fm on the same image twice with the same settings, but I am getting visual differences between the two runs. Is that due to the generative nature of depth_fm?
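
One possible explanation, assuming the pipeline draws fresh random noise on every run, is that the random generator is not seeded. The sketch below uses NumPy as a stand-in to show the effect; `noisy_start` is a hypothetical helper, not part of depth_fm. In PyTorch the analogous control is `torch.manual_seed`.

```python
import numpy as np

def noisy_start(image, noise_std=0.1, seed=None):
    """Return the image plus Gaussian noise; a fixed seed makes the noise repeatable."""
    rng = np.random.default_rng(seed)
    return image + noise_std * rng.normal(size=image.shape)

image = np.zeros((2, 3))
unseeded_a = noisy_start(image)           # fresh entropy on every call
unseeded_b = noisy_start(image)
seeded_a = noisy_start(image, seed=42)    # same seed, identical noise
seeded_b = noisy_start(image, seed=42)

same_when_seeded = np.array_equal(seeded_a, seeded_b)
same_when_unseeded = np.array_equal(unseeded_a, unseeded_b)
```

Even with a fixed seed, some GPU kernels are not bit-for-bit deterministic, so small numerical differences between runs can remain.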

Surface Normal Loss

Where can I find the surface normal loss mentioned in the paper?
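
For readers unfamiliar with the idea, here is a generic sketch of one way a surface normal loss can be built from depth maps: estimate normals with finite differences and penalize the angular disagreement. This is not the loss from the paper, whose exact form is not given in this thread; the function names and the cosine-based penalty are assumptions for illustration.

```python
import numpy as np

def normals_from_depth(depth):
    """Estimate unit surface normals from an (H, W) depth map via finite differences."""
    dz_dy, dz_dx = np.gradient(depth)   # gradients along rows (y) and columns (x)
    n = np.stack([-dz_dx, -dz_dy, np.ones_like(depth)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def normal_loss(pred_depth, gt_depth):
    """Mean (1 - cosine similarity) between predicted and ground-truth normals."""
    cos = np.sum(normals_from_depth(pred_depth) * normals_from_depth(gt_depth), axis=-1)
    return float(np.mean(1.0 - cos))

ys, xs = np.mgrid[0:8, 0:8].astype(float)
plane = 0.5 * xs + 0.25 * ys                       # a tilted plane
loss_same = normal_loss(plane, plane)              # identical surfaces
loss_flat = normal_loss(np.zeros((8, 8)), plane)   # flat prediction vs. tilted ground truth
```

The loss is zero when the two surfaces have the same orientation everywhere and grows as the normals diverge, regardless of a constant depth offset.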

What processing resolution produces the best results?

How to achieve depth inpainting with your model

Thank you for the great work, but we cannot find how to achieve depth inpainting. Can you offer some information?

Assumes CUDA, doesn't use safetensors

Hi, any chance you could change the checkpoint to the safetensors format, and maybe change the code so it does not assume the device is CUDA?

There are a lot of people that want to run stuff like this on MPS for example, and pickle files can contain executable code so should not be offered as a file format for security reasons.
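
A device-agnostic setup could look roughly like the sketch below. `pick_device` is a hypothetical helper, not part of the repository; it takes availability flags as arguments so the selection logic stays independent of any particular install.

```python
def pick_device(cuda_available, mps_available):
    """Choose a device string from availability flags: CUDA first, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

# With PyTorch installed, the flags would come from
# torch.cuda.is_available() and torch.backends.mps.is_available().
device_on_mac = pick_device(cuda_available=False, mps_available=True)
device_on_cpu = pick_device(cuda_available=False, mps_available=False)
```

The returned string can then be passed wherever the code currently hard-codes CUDA, for example in `.to(device)` calls.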

Are you ready to open source train files?

I think your work is great. Do you plan to open source the training code? I would like to try training my own model based on your work, and I would appreciate it if you could release it.

size of tensor mismatch

Traceback (most recent call last):
  File "/content/depth-fm/inference.py", line 134, in <module>
    main(args)
  File "/content/depth-fm/inference.py", line 89, in main
    depth = model.predict_depth(im, num_steps=args.num_steps, ensemble_size=args.ensemble_size)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/content/depth-fm/depthfm/dfm.py", line 97, in predict_depth
    return self.forward(ims, num_steps, ensemble_size)
  File "/content/depth-fm/depthfm/dfm.py", line 81, in forward
    depth_z = self.generate(x_source, num_steps=num_steps, context=context, context_ca=conditioning)
  File "/content/depth-fm/depthfm/dfm.py", line 51, in generate
    ode_results = odeint(ode_fn, z, t, **ode_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torchdiffeq/_impl/odeint.py", line 77, in odeint
    solution = solver.integrate(t)
  File "/usr/local/lib/python3.10/dist-packages/torchdiffeq/_impl/solvers.py", line 105, in integrate
    dy, f0 = self._step_func(self.func, t0, dt, t1, y0)
  File "/usr/local/lib/python3.10/dist-packages/torchdiffeq/_impl/fixed_grid.py", line 10, in _step_func
    f0 = func(t0, y0, perturb=Perturb.NEXT if self.perturb else Perturb.NONE)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torchdiffeq/_impl/misc.py", line 189, in forward
    return self.base_func(t, y)
  File "/content/depth-fm/depthfm/dfm.py", line 34, in ode_fn
    return self.model(x=x, t=t, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/content/depth-fm/depthfm/unet/openaimodel.py", line 841, in forward
    h = th.cat([h, hs.pop()], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 20 but got size 19 for tensor number 1 in the list.
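
One common cause of this kind of mismatch in a UNet skip connection is an input whose side lengths are not divisible by the network's total downsampling factor, so a feature map with an odd side is halved and then upsampled to a different size. A possible workaround is to resize the image to the nearest multiple of that factor before inference. The sketch below assumes a factor of 64, which is a guess and not a value confirmed by the repository; the helper names are hypothetical.

```python
def round_to_multiple(size, multiple=64):
    """Round a side length to the nearest positive multiple of `multiple`."""
    return max(multiple, int(round(size / multiple)) * multiple)

def safe_resolution(width, height, multiple=64):
    """Return (width, height) rounded so both sides are divisible by `multiple`."""
    return round_to_multiple(width, multiple), round_to_multiple(height, multiple)

new_w, new_h = safe_resolution(1000, 600)
```

The predicted depth map can be resized back to the original resolution afterwards.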

OSError: runwayml/stable-diffusion-v1-5 does not appear to have a file named config.json.

Traceback (most recent call last):
  File "/home/lhs/project/nerf...wu/depth-fm-main/inference.py", line 113, in <module>
    main(args)
  File "/home/lhs/project/nerf...wu/depth-fm-main/inference.py", line 64, in main
    model = DepthFM(args.ckpt)
            ^^^^^^^^^^^^^^^^^^
  File "/home/lhs/project/nerf...wu/depth-fm-main/depthfm/dfm.py", line 21, in __init__
    self.vae = AutoencoderKL.from_pretrained(vae_id, subfolder="vae")
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lhs/.conda/envs/depthfm/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 119, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/lhs/.conda/envs/depthfm/lib/python3.11/site-packages/diffusers/models/modeling_utils.py", line 569, in from_pretrained
    config, unused_kwargs, commit_hash = cls.load_config(
                                         ^^^^^^^^^^^^^^^^
  File "/home/lhs/.conda/envs/depthfm/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 119, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/lhs/.conda/envs/depthfm/lib/python3.11/site-packages/diffusers/configuration_utils.py", line 402, in load_config
    raise EnvironmentError(
OSError: runwayml/stable-diffusion-v1-5 does not appear to have a file named config.json.

Can't reproduce the image on the project website

Used the default config with num_steps=2 and ensemble_size=4.

about the encoder

Hi! Is the pre-trained encoder for RGB images and depth images the same one?

Nice work, the training code

Nice work! Would you release the training code? Thanks.

Question about the flow-matching formulation in the paper

Hi!

Congrats for the great work.

I have a question regarding the flow matching formulation in the paper.
In many parts of the paper you refer to X_0 as the clean image and X_1 as its depth map, aiming to learn the flow between the two distributions. But the training pipeline figure seems to show that you predict the flow for a noised version of the real image, rather than for the interpolation of the image latent and the depth latent, as stated in equation (4) (and as is consistent with the flow matching formulation, to my understanding).
Additionally, in both eq. (2) and eq. (4), z appears on the LHS of the equation but not on the RHS, so I wasn't sure what z refers to there.

Could you please clarify your training paradigm? Essentially, my question is: do you define x_t as an interpolation of the depth and the image, or of the image and random noise? If it's the latter, could you please explain how it aligns with the flow matching formulation and with eq. (4) in the paper?

Thanks in advance,

Michal

Results converted to 3D look very inaccurate

Results converted to 3D look very inaccurate, unfortunately.

Attached is one of our standard test images, converted to 3D using 3D combine, default settings of 0.20, yes, yes.

Many things look to be on the wrong plane. The rocks/crystals in the foreground are floating above the ground. The character also appears to be floating above the ground.

It's rough on the eyes.

With other images, some outputs are poor and unusable. Other conversions suffer from the 'bulge effect', where the mid-range bulges forward. It looks like a hill where it should be flat.

Some depth maps look fantastic, but they don't perform when actually put to use.

These were converted using the 'binary' colormap output, equivalent to --no_color output that has been inverted to white front / black back.

Here is the original image if you don't believe the results. If you can get better results, I'd be glad to learn how to use the program better, but so far, I'm seeing many inaccuracies.

MAJOR WARNING - jpg files get overwritten by the output

jpg files get overwritten by the output

The output should add a suffix or something, so it doesn't overwrite the original.

Luckily, I was smart enough to use copies of my images and not my one and only original. Otherwise it would have been overwritten and lost.
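
The suggested fix could be sketched as follows: derive the output path from the input path by adding a suffix to the file stem. `output_path_for`, the `_depth` suffix, and the `.png` extension are illustrative choices, not the repository's actual behavior.

```python
from pathlib import Path

def output_path_for(input_path, suffix="_depth", ext=".png"):
    """Build an output path next to the input, with a suffix added to the file stem."""
    src = Path(input_path)
    return src.with_name(f"{src.stem}{suffix}{ext}")

out = output_path_for("photos/cat.jpg")
```

With a non-empty suffix the output file name always differs from the input file name, so the original image is never written over.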