
depth-fm's People

Contributors

dongzhuoyao, fannovel16, joh-fischer, mgui7


depth-fm's Issues

Timesteps during training

Hi, thanks for the great work. I found that the provided checkpoint uses a fixed timestep of 400, which is not explained in the paper. I would appreciate your help in clarifying the following questions:

  1. Do you use a fixed timestep as the UNet input during training?
  2. During training, do you add Gaussian noise directly to the RGB latent, without using the q_sample function?

Thanks very much.
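
For context, q_sample usually denotes the standard DDPM forward (noising) step. A minimal sketch of what that looks like; the beta schedule and signature here are generic assumptions, not DepthFM's actual code:

  import torch

  # Standard DDPM noising: x_t ~ N(sqrt(a_bar_t) * x0, (1 - a_bar_t) * I).
  # The linear beta schedule below is illustrative.
  betas = torch.linspace(1e-4, 0.02, 1000)
  alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

  def q_sample(x0, t):
      a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
      return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * torch.randn_like(x0)

  # e.g. noising the RGB latents at the fixed timestep 400:
  # x_noisy = q_sample(rgb_latents, torch.full((rgb_latents.size(0),), 400))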

Why is the Marigold NFE=1 result pure noise?

Hi, thanks for the inspiring work.
In Fig. 6, the Marigold result at NFE=1 is pure noise. That seems counter-intuitive: at NFE=1 we should just get the conditional mean of the prediction, i.e. x0-hat. It may be blurry, but it's hard to see why it should be pure noise.
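
For reference, the "x0 hat" mentioned above is the one-step clean-sample estimate of a noise-prediction diffusion model. A generic sketch, not Marigold's actual code; `model` and `alphas_cumprod` are assumed:

  import torch

  def x0_hat(model, x_t, t, alphas_cumprod):
      # One-step estimate of the clean sample from an eps-prediction model:
      # x0_hat = (x_t - sqrt(1 - a_bar_t) * eps_hat) / sqrt(a_bar_t)
      eps_hat = model(x_t, t)
      a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
      return (x_t - (1.0 - a_bar).sqrt() * eps_hat) / a_bar.sqrt()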

Training problem

Hi, authors! Thanks for open-sourcing the inference code. I am interested in reimplementing the training process; however, my results look weird. Here is the code snippet I use. I would appreciate it if you could help me find the problem in the code.

  import torch
  import torch.nn.functional as F

  x_1 = depth_latents
  t = torch.rand((rgb_latents.size(0),), device=rgb_latents.device)
  num_train_timesteps = 1000
  # q_sample expects integer timesteps
  x_0 = q_sample(rgb_latents, t=(t * num_train_timesteps).long())
  # broadcast t over the latent dimensions before interpolating
  t_ = t.view(-1, 1, 1, 1)
  x_t = (1 - t_) * x_0 + t_ * x_1
  targets = x_1 - x_0
  pred = UNet(x_t, t)
  loss = F.mse_loss(pred, targets)

Explanation of # of Steps and # of Ensembles

Can you please explain, in general terms, what the number of steps and the number of ensembles are? I do not have a good understanding of diffusion and flow matching models. Also, what would you suggest as a good combination? I see the defaults set to 2 and 4, but as I increase the number of steps, I see more detail in the depth map.
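
For concreteness, here is roughly how the two knobs enter the call, based on the predict_depth signature visible in a traceback further down; the import path, checkpoint path, and input tensor are assumptions:

  import torch
  from depthfm import DepthFM  # import path assumed

  model = DepthFM("checkpoints/depthfm-v1.ckpt")  # checkpoint path illustrative
  im = torch.randn(1, 3, 512, 512)  # stand-in for a preprocessed RGB tensor

  # num_steps: how many ODE solver steps are taken from the image latent to
  # the depth latent -- more steps integrate the flow more finely.
  # ensemble_size: how many independent predictions are averaged, which
  # reduces variance from the stochastic parts of the pipeline.
  depth = model.predict_depth(im, num_steps=4, ensemble_size=4)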

Mismatch Between Different Runs

I tried running depth_fm on an image twice with the same settings, but I am getting visual differences between the two runs -- is that due to the generative nature of depth_fm?
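
If the difference comes from randomly drawn noise (e.g. per-ensemble perturbations), fixing the RNG state before each run should make the outputs match. A minimal sketch:

  import torch

  # Fix the RNG state before each run so the randomly drawn noise /
  # ensemble perturbations are identical across runs.
  torch.manual_seed(0)
  if torch.cuda.is_available():
      torch.cuda.manual_seed_all(0)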

Assumes CUDA, doesn't use safetensors

Hi, any chance you could change the checkpoint to safetensors format, and maybe change the code so it doesn't assume the device is CUDA?

A lot of people want to run models like this on MPS, for example, and pickle files can contain executable code, so for security reasons they should not be offered as a distribution format.
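
Until that lands upstream, both the conversion and the device selection can be done on the user side. A sketch, assuming the checkpoint filename and key layout:

  import torch
  from safetensors.torch import save_file

  # One-off conversion of the pickle checkpoint to safetensors.
  ckpt = torch.load("depthfm-v1.ckpt", map_location="cpu")
  state_dict = ckpt.get("state_dict", ckpt)  # key layout assumed
  save_file(state_dict, "depthfm-v1.safetensors")

  # Device selection that doesn't assume CUDA:
  device = ("cuda" if torch.cuda.is_available()
            else "mps" if torch.backends.mps.is_available()
            else "cpu")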

Are you planning to open-source the training files?

I think your work is great, but do you plan to open-source the training code? I would like to try training my own model based on your work, and I would appreciate it if you could release it.

Tensor size mismatch

Traceback (most recent call last):
  File "/content/depth-fm/inference.py", line 134, in <module>
    main(args)
  File "/content/depth-fm/inference.py", line 89, in main
    depth = model.predict_depth(im, num_steps=args.num_steps, ensemble_size=args.ensemble_size)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/content/depth-fm/depthfm/dfm.py", line 97, in predict_depth
    return self.forward(ims, num_steps, ensemble_size)
  File "/content/depth-fm/depthfm/dfm.py", line 81, in forward
    depth_z = self.generate(x_source, num_steps=num_steps, context=context, context_ca=conditioning)
  File "/content/depth-fm/depthfm/dfm.py", line 51, in generate
    ode_results = odeint(ode_fn, z, t, **ode_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torchdiffeq/_impl/odeint.py", line 77, in odeint
    solution = solver.integrate(t)
  File "/usr/local/lib/python3.10/dist-packages/torchdiffeq/_impl/solvers.py", line 105, in integrate
    dy, f0 = self._step_func(self.func, t0, dt, t1, y0)
  File "/usr/local/lib/python3.10/dist-packages/torchdiffeq/_impl/fixed_grid.py", line 10, in _step_func
    f0 = func(t0, y0, perturb=Perturb.NEXT if self.perturb else Perturb.NONE)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torchdiffeq/_impl/misc.py", line 189, in forward
    return self.base_func(t, y)
  File "/content/depth-fm/depthfm/dfm.py", line 34, in ode_fn
    return self.model(x=x, t=t, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/content/depth-fm/depthfm/unet/openaimodel.py", line 841, in forward
    h = th.cat([h, hs.pop()], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 20 but got size 19 for tensor number 1 in the list.
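
This kind of skip-connection mismatch in the UNet decoder usually means the input's height or width is not divisible by the model's total downsampling factor, so encoder and decoder feature maps come out one unit apart. A possible workaround is to pad the input first; a sketch, where the factor of 64 (8x VAE downsampling times the UNet's internal downsampling) is an assumption for this model:

  import torch.nn.functional as F

  def pad_to_multiple(im, multiple=64):
      # Pad (B, C, H, W) so H and W are divisible by `multiple`;
      # crop the prediction back to (h, w) afterwards.
      _, _, h, w = im.shape
      pad_h = (multiple - h % multiple) % multiple
      pad_w = (multiple - w % multiple) % multiple
      return F.pad(im, (0, pad_w, 0, pad_h), mode="reflect")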

OSError: runwayml/stable-diffusion-v1-5 does not appear to have a file named config.json.

Traceback (most recent call last):
  File "/home/lhs/project/nerf...wu/depth-fm-main/inference.py", line 113, in <module>
    main(args)
  File "/home/lhs/project/nerf...wu/depth-fm-main/inference.py", line 64, in main
    model = DepthFM(args.ckpt)
            ^^^^^^^^^^^^^^^^^^
  File "/home/lhs/project/nerf...wu/depth-fm-main/depthfm/dfm.py", line 21, in __init__
    self.vae = AutoencoderKL.from_pretrained(vae_id, subfolder="vae")
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lhs/.conda/envs/depthfm/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 119, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/lhs/.conda/envs/depthfm/lib/python3.11/site-packages/diffusers/models/modeling_utils.py", line 569, in from_pretrained
    config, unused_kwargs, commit_hash = cls.load_config(
                                         ^^^^^^^^^^^^^^^^
  File "/home/lhs/.conda/envs/depthfm/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 119, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/lhs/.conda/envs/depthfm/lib/python3.11/site-packages/diffusers/configuration_utils.py", line 402, in load_config
    raise EnvironmentError(
OSError: runwayml/stable-diffusion-v1-5 does not appear to have a file named config.json.
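
For context, the runwayml/stable-diffusion-v1-5 repository was removed from the Hugging Face Hub, so from_pretrained can no longer resolve that id. A possible workaround is to point the VAE at a mirror or a local copy; the mirror repo id below is an assumption, so verify it before relying on it:

  from diffusers import AutoencoderKL

  # Point at a community mirror of SD 1.5 (repo id assumed) ...
  vae = AutoencoderKL.from_pretrained(
      "stable-diffusion-v1-5/stable-diffusion-v1-5", subfolder="vae")
  # ... or at a local download of the original weights:
  # vae = AutoencoderKL.from_pretrained("/path/to/stable-diffusion-v1-5",
  #                                     subfolder="vae")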

About the encoder

Hi! Is the same pre-trained encoder used for both RGB images and depth images?

Question about the flow-matching formulation in the paper

Hi!

Congrats for the great work.

I have a question regarding the flow matching formulation in the paper.
In many parts of the paper you refer to X_0 as the clean image and X_1 as its depth map, aiming to learn the flow between the two distributions. But the training pipeline figure seems to depict that you predict the flow for a noised version of the real image, rather than for the interpolation of the image latent and the depth latent, as stated in equation (4) (which, to my understanding, is what aligns with the flow matching formulation).
Additionally, in both eq. (2) and eq. (4) you have z on the LHS of the equation but not on the RHS, so I wasn't sure what z refers to there.

Could you please clarify your training paradigm? Essentially, my question is: do you define x_t as an interpolation of the depth and the image, or of the image and random noise? If it's the latter, could you please explain how it aligns with the flow matching formulation and with eq. (4) in the paper?

Thanks in advance,

Michal
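
For reference, the standard data-coupled flow matching setup the question is contrasting against reads as follows (a generic sketch of the formulation, not necessarily the paper's exact eq. (4)):

  % x_0: (possibly noised) image latent, x_1: depth latent
  x_t = (1 - t)\,x_0 + t\,x_1, \qquad t \in [0, 1]
  % the regression target is the constant velocity of the straight path:
  \mathcal{L} = \mathbb{E}_{t,\,x_0,\,x_1}\big\lVert v_\theta(x_t, t) - (x_1 - x_0)\big\rVert^2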

Results converted to 3D look very inaccurate

Results converted to 3D look very inaccurate, unfortunately.

Attached is one of our standard test images, converted to 3D using 3D combine, default settings of 0.20, yes, yes.

Many things appear to be on the wrong plane. The rocks/crystals in the foreground are floating above the ground, and the character also appears to float above the ground.

It's rough on the eyes.

With other images, some outputs are poor and unusable; other conversions suffer from the 'bulge effect', where the mid-range bulges forward. It looks like a hill where it should be flat.

Some depth maps look fantastic, but they don't hold up when actually put to use.

[Attached: pandora anaglyph, pandora crosseye, pandora SBS]

These were converted using the 'binary' colormap output, equivalent to --no_color output that has been inverted to white front / black back...

Here is the original image if you don't believe the results. If you can get better results, I'd be glad to learn how to use the program better, but so far I'm seeing many inaccuracies.

[Attached: DAK_Pandora_LowAngleNight_DC_v04MerchHD]

MAJOR WARNING - jpg files get overwritten by the output

The output overwrites the input jpg files. It should add a suffix or something so it doesn't overwrite the original.

Luckily, I was smart enough to work on copies of my images rather than my one and only originals; otherwise they would have been overwritten and lost.
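
A minimal sketch of the kind of non-destructive naming this asks for (the suffix and output extension are illustrative):

  from pathlib import Path

  def output_path(src, suffix="_depth"):
      # e.g. img.jpg -> img_depth.png, so the prediction never
      # overwrites the source image.
      p = Path(src)
      return p.with_name(p.stem + suffix + ".png")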
