compvis / depth-fm
DepthFM: Fast Monocular Depth Estimation with Flow Matching
License: MIT License
Hi, thanks for the great work. I found that the provided checkpoint has a fixed timestep of 400, which is not explained in the paper. I would really appreciate your help in clarifying the following questions:
Thanks very much.
Hi, thanks for the inspiring work.
In Fig. 6, for Marigold with NFE=1, the result is pure noise. That seems counter-intuitive: at NFE=1, we should just get the conditional mean of the prediction, i.e. x0 hat. It may be blurry, but it's hard to see why it should be pure noise.
Hi, authors! Thanks for open-sourcing the inference code. I am interested in reimplementing the training process. However, the results look weird. Here is the code snippet I use; I would appreciate it if you could help me find the problem in it.
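import torch
import torch.nn.functional as F

x_1 = depth_latents  # target: depth latents
# One random time per sample, shaped (B, 1, 1, 1) so it broadcasts over the latents
t = torch.rand((rgb_latents.size(0), 1, 1, 1), device=rgb_latents.device)
num_train_timesteps = 1000
# Source: noised RGB latents (q_sample is the poster's own helper)
x_0 = q_sample(rgb_latents, t=t.flatten() * num_train_timesteps)
x_t = (1 - t) * x_0 + t * x_1  # linear interpolation between source and target
targets = x_1 - x_0  # constant velocity along the straight path
pred = UNet(x_t, t.flatten())  # originally `xt`, which is an undefined name
loss = F.mse_loss(pred, targets)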
Could you please explain in general terms what the number of steps and the number of ensembles mean? I do not have a good understanding of diffusion and flow matching models. Also, what would you suggest as a good combination? I see the defaults are set to 2 and 4, but as I increase the number of steps, I see more detail in the depth map.
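For readers with the same question, here is a toy sketch, not the repository's implementation. It assumes that the number of steps is the number of fixed Euler steps used to integrate the learned ODE from t=0 to t=1 (one model evaluation per step), and that the ensemble size is the number of independently noised starts whose results are averaged. The names `toy_velocity`, `euler_integrate`, and `predict` are hypothetical.

```python
import random

calls = {"n": 0}  # counts model evaluations


def toy_velocity(x, t):
    """Stand-in for the trained network: pulls x toward the value 2.0."""
    calls["n"] += 1
    return 2.0 - x


def euler_integrate(velocity, x_start, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x, dt = x_start, 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity(x, i * dt)  # one model evaluation per step
    return x


def predict(velocity, source, num_steps, ensemble_size, noise_std=0.1, seed=0):
    """Average `ensemble_size` Euler solutions started from noised copies of `source`."""
    rng = random.Random(seed)  # fixed seed, so repeated calls give the same output
    results = [
        euler_integrate(velocity, source + rng.gauss(0.0, noise_std), num_steps)
        for _ in range(ensemble_size)
    ]
    return sum(results) / ensemble_size


first = predict(toy_velocity, source=0.0, num_steps=2, ensemble_size=4)
second = predict(toy_velocity, source=0.0, num_steps=2, ensemble_size=4)
```

In this toy setting the cost is `num_steps * ensemble_size` model evaluations per prediction. More steps reduce the discretization error of the ODE solve, while more ensemble members average out the randomness of the noised start. With an unseeded random generator, two runs on the same input would differ slightly.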
I tried running depth_fm on an image twice with the same settings, but I am getting visual differences between the two runs. Is that due to the generative nature of depth_fm?
Where can I find the surface normal loss mentioned in the paper?
Thank you for the great work, but we cannot find how to do depth inpainting. Could you offer some information?
Hi, any chance you could change the checkpoint to the safetensors format, and maybe change the code so it does not assume the device is CUDA?
A lot of people want to run models like this on MPS, for example. Pickle files can also contain executable code, so for security reasons they should not be offered as a file format.
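A minimal sketch of the device fallback requested above. `pick_device` is a hypothetical helper, not part of the repository; it takes the availability flags as arguments so that it runs without PyTorch installed.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the preferred device string, falling back to CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# With PyTorch installed, the flags would come from
# torch.cuda.is_available() and torch.backends.mps.is_available().
print(pick_device(cuda_available=False, mps_available=True))
```

The resulting string could then be passed to `model.to(device)` and to `torch.load(..., map_location=device)` in place of a hard-coded `"cuda"`.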
I think your work is great. Do you plan to open-source the training code? I would like to try training my own model based on your work, and I would appreciate it if you could release it.
Traceback (most recent call last):
File "/content/depth-fm/inference.py", line 134, in <module>
main(args)
File "/content/depth-fm/inference.py", line 89, in main
depth = model.predict_depth(im, num_steps=args.num_steps, ensemble_size=args.ensemble_size)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/content/depth-fm/depthfm/dfm.py", line 97, in predict_depth
return self.forward(ims, num_steps, ensemble_size)
File "/content/depth-fm/depthfm/dfm.py", line 81, in forward
depth_z = self.generate(x_source, num_steps=num_steps, context=context, context_ca=conditioning)
File "/content/depth-fm/depthfm/dfm.py", line 51, in generate
ode_results = odeint(ode_fn, z, t, **ode_kwargs)
File "/usr/local/lib/python3.10/dist-packages/torchdiffeq/_impl/odeint.py", line 77, in odeint
solution = solver.integrate(t)
File "/usr/local/lib/python3.10/dist-packages/torchdiffeq/_impl/solvers.py", line 105, in integrate
dy, f0 = self._step_func(self.func, t0, dt, t1, y0)
File "/usr/local/lib/python3.10/dist-packages/torchdiffeq/_impl/fixed_grid.py", line 10, in _step_func
f0 = func(t0, y0, perturb=Perturb.NEXT if self.perturb else Perturb.NONE)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torchdiffeq/_impl/misc.py", line 189, in forward
return self.base_func(t, y)
File "/content/depth-fm/depthfm/dfm.py", line 34, in ode_fn
return self.model(x=x, t=t, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/content/depth-fm/depthfm/unet/openaimodel.py", line 841, in forward
h = th.cat([h, hs.pop()], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 20 but got size 19 for tensor number 1 in the list.
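A mismatch like this in a UNet skip connection commonly appears when the input size is not divisible by the network's total downsampling factor. The sketch below is a simplified size model, not the repository's code. It assumes three stride-2 downsampling stages in the UNet and an 8x reduction in the VAE, and both helper names are hypothetical.

```python
def skip_sizes_match(latent_size: int, num_downsamples: int = 3) -> bool:
    """Check whether upsampled sizes line up with the stored skip-connection sizes."""
    sizes = [latent_size]
    for _ in range(num_downsamples):
        # A stride-2 conv with kernel 3 and padding 1 rounds odd sizes up.
        sizes.append((sizes[-1] + 1) // 2)
    size = sizes.pop()  # size at the bottleneck
    while sizes:
        size *= 2  # upsampling doubles the size
        if size != sizes.pop():
            return False
    return True


def pad_to_multiple(pixels: int, multiple: int = 64) -> int:
    """Round an image side up to the next multiple (8x VAE factor times 8x UNet factor)."""
    return -(-pixels // multiple) * multiple


print(skip_sizes_match(75))  # 75 -> 38 -> 19 -> 10, then 10 * 2 = 20 but the skip is 19
print(skip_sizes_match(80))
print(pad_to_multiple(600))
```

Under these assumptions, an image side of 600 pixels gives a latent side of 75, which reproduces the "expected 20 but got 19" pattern. Resizing or padding the image so that each side is a multiple of 64 avoids it.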
Traceback (most recent call last):
File "/home/lhs/project/nerf...wu/depth-fm-main/inference.py", line 113, in <module>
main(args)
File "/home/lhs/project/nerf...wu/depth-fm-main/inference.py", line 64, in main
model = DepthFM(args.ckpt)
^^^^^^^^^^^^^^^^^^
File "/home/lhs/project/nerf...wu/depth-fm-main/depthfm/dfm.py", line 21, in __init__
self.vae = AutoencoderKL.from_pretrained(vae_id, subfolder="vae")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lhs/.conda/envs/depthfm/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 119, in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/lhs/.conda/envs/depthfm/lib/python3.11/site-packages/diffusers/models/modeling_utils.py", line 569, in from_pretrained
config, unused_kwargs, commit_hash = cls.load_config(
^^^^^^^^^^^^^^^^
File "/home/lhs/.conda/envs/depthfm/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 119, in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/lhs/.conda/envs/depthfm/lib/python3.11/site-packages/diffusers/configuration_utils.py", line 402, in load_config
raise EnvironmentError(
OSError: runwayml/stable-diffusion-v1-5 does not appear to have a file named config.json.
Hi! Is the pre-trained encoder for RGB images and depth images the same one?
Nice work! Would you release the training code? Thanks.
Hi!
Congrats on the great work.
I have a question regarding the flow matching formulation in the paper.
In many parts of the paper you refer to X_0 as the clean image and X_1 as its depth map, aiming to learn the flow between the two distributions. But the training pipeline figure seems to depict that you predict the flow for a noised version of the real image, rather than for the interpolation of the image latent and the depth latent, as stated in equation (4) (and as aligns with the flow matching formulation, to my understanding).
Additionally, in both eq (2) and (4), you also have z in the LHS of the equation, but not on the RHS, so I wasn't sure what was z there.
Could you please clarify your training paradigm? Essentially my question is: do you define x_t as an interpolation of the depth and the image, or of the image and random noise? If it's the latter, could you please explain how it aligns with the flow matching formulation, and with eq (4) in the paper?
Thanks in advance,
Michal
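The two readings contrasted in the question can be written side by side. This only restates the question in symbols and is not the authors' answer. The notation is assumed here: $x_0$ is the image latent, $x_1$ the depth latent, $\epsilon$ Gaussian noise, and $\alpha_t, \sigma_t$ a generic noise schedule.

```latex
% Reading A: interpolate between image latent and depth latent (flow matching)
x_t = (1 - t)\, x_0 + t\, x_1, \qquad u_t = x_1 - x_0

% Reading B: noise the image latent only (diffusion-style forward process)
x_t = \alpha_t\, x_0 + \sigma_t\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
```

The two are not mutually exclusive: Reading A still holds if the starting point $x_0$ is itself a noised copy of the image latent.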
Results converted to 3D look very inaccurate, unfortunately.
Attached is one of our standard test images, converted to 3D using 3D combine, default settings of 0.20, yes, yes.
Many things look to be on the wrong plane. The rocks/crystals in the foreground are floating above the ground. The character also appears to be floating above the ground.
It's rough on the eyes.
For other images, some outputs are poor and unusable. Other conversions suffer from the 'bulge effect', where the mid-range bulges forward: it looks like a hill where it should be flat.
Some depth maps look fantastic, but they don't perform when actually put to use.
These were converted using 'binary' colormap output, equivalent to --no_color output that's been inverted to white front / black back...
Here is the original image if you don't believe the results. If you can get better results, I'd be glad to learn how to use the program better, but so far, I'm seeing many inaccuracies.
JPG files get overwritten by the output.
The output should add a suffix or something similar, so it doesn't overwrite the original.
Luckily I was smart enough to use copies of my images and not my one and only image. Otherwise it would have been overwritten and lost.
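One way to implement the suggested suffix, as a minimal sketch. `depth_output_path` and its default suffix and extension are hypothetical and are not taken from the repository.

```python
from pathlib import Path


def depth_output_path(image_path: str, suffix: str = "_depth", ext: str = ".png") -> Path:
    """Build an output path next to the input that cannot collide with it."""
    src = Path(image_path)
    out = src.with_name(src.stem + suffix + ext)
    if out == src:
        # Only possible with an empty suffix and an unchanged extension.
        raise ValueError(f"output path would overwrite the input: {src}")
    return out


print(depth_output_path("photos/cat.jpg"))
```

Deriving the output name from the input stem keeps results next to their sources while leaving the original file untouched.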
Is there no batch processing of folders? Only single images?
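Until folder support exists, a small wrapper can loop over a directory. This is a sketch under assumptions: `run_inference` is a hypothetical callback standing in for whatever processes a single image, and the extension list is illustrative.

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def list_images(folder):
    """Return the image files directly inside `folder`, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )


def process_folder(folder, run_inference):
    """Call `run_inference(path)` once per image and collect the results."""
    return [run_inference(path) for path in list_images(folder)]
```

For example, `process_folder("photos", lambda p: p.name)` would return the names of the images found, and a real callback would run depth prediction on each path instead.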
Hi! Can you release the command that outputs both the predicted depth and the uncertainty? Thank you!
Please output non-colored maps so they're usable in programs that convert to 3D using white front / black back.