
Comments (7)

jannerm commented on August 23, 2024
  1. Section 3.2 of the DDPM paper has a great discussion of the different prediction targets and how they are related. In particular:

To summarize, we can train the reverse process mean function approximator $\mu_\theta$ to predict $\tilde{\mu}_t$, or by modifying its parameterization, we can train it to predict $\epsilon$. (There is also the possibility of predicting $x_0$, but we found this to lead to worse sample quality early in our experiments.)

We found the choice between predicting $\epsilon$ (the added noise) and $x_0$ (the noiseless data) to be less important for trajectories than for images. (See the note after this list for how the different prediction targets are related.)

  2. (a) At initialization, $x^N$ is Gaussian noise of dimensionality [ horizon x (state_dim + action_dim) ]. You could possibly say that the points already have an order because of their position in $x^N$, but I'm not sure it's meaningful to say "$x_0^N$ precedes $x_1^N$", since they are both just noise anyway. (b) We don't enforce that collision constraint explicitly; the samples respect it by virtue of there being no data that walks through the walls. (c) Yep!
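
For reference, the standard DDPM relations behind that quote, written in the usual $\alpha_t$, $\bar{\alpha}_t$, $\beta_t$ notation (this is just the textbook parameterization, not anything specific to this repo):

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t$$

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right)$$

So predicting $\epsilon$, predicting $x_0$, or predicting $\tilde{\mu}_t$ are interchangeable parameterizations of the same reverse-process mean; they mainly differ in how the training loss weights errors.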


xmlyqing00 commented on August 23, 2024

Thanks for your detailed and quick response! I really appreciate it!

  1. I missed this comment in their paper, thanks for pointing it out.
  2. I see. For maze planning, I know horizon = 128/256/384 (from your Appendix). Can we say state_dim = 2 for 2D positions? I'm confused about the definition of the action: which one did you use in your experiments, (a) the direction to the next point, (b) the index of the next point, (c) the position of the next move, or none of these?
  3. As for the reward part, please correct me if I am wrong:
    a. $O_t = 1$ means the current state-action pair $(s_t, a_t)$ is valid (not inside a wall).
    b. $p(\tau \mid O_{1:T} = 1)$ means the sampled trajectory $\tau$ is valid at every timestep, and this plan (trajectory) will be executed starting from the first action.
    c. In Maze2D, $\mathcal{J}_\phi$ is not a network but a deterministic function that returns the reward $\mathcal{J}(\mu)$ and its gradient during training. I'm confused about how to compute the gradients $\nabla_{s_t, a_t} r(s_t, a_t)$, where $(s_t, a_t) = \mu_t$ here. Does $r$ return either 0 or 1? And I'm not sure what "no data" means in your earlier answer. (I've tried to write out my understanding in formulas just after this list.)
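
To write out my understanding more concretely (I'm using the usual control-as-inference convention of an exponentiated-reward optimality likelihood; please correct me if the paper uses a different form):

$$p(O_t = 1 \mid s_t, a_t) \propto \exp\big(r(s_t, a_t)\big)$$

$$\tilde{p}(\tau) \;\propto\; p_\theta(\tau)\, p(O_{1:T} = 1 \mid \tau) \;=\; p_\theta(\tau)\, \exp\Big(\textstyle\sum_t r(s_t, a_t)\Big)$$

so the guidance term used while sampling would be $\nabla_\tau \log p(O_{1:T} = 1 \mid \tau) = \sum_t \nabla_{s_t, a_t}\, r(s_t, a_t)$.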

Sorry to pile on more questions here. These are the problems I run into when trying to implement your work on Maze2D, and my main confusions are about the action and the reward. I'd really appreciate it if you could take some time to answer them.

Cheers


xmlyqing00 commented on August 23, 2024

After reading the D4RL dataset's Maze2D settings, I have answers to some of the questions above, but now I have new confusions.

  1. It seems that the diffusion model learns correct paths in the maze during training, which implicitly encodes the walls. At evaluation time it should be the same maze map, and the diffusion model can then estimate a correct path between two given points.
  2. The conditions in the maze setting are only the start and end locations (and maybe velocities). The trajectory is state + action, where the state is $(x, y, v_x, v_y)$ and the action is a 2D force in the D4RL settings. I'm curious how the diffusion model guarantees that the observation plus the action is consistent with the next observation. (A sketch of how I imagine the start/goal conditioning might work follows this list.)
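
For point 2, my current guess is that the start/goal conditioning is imposed inpainting-style, by overwriting the corresponding state entries of the sampled trajectory after every denoising step, so only the remaining entries have to be generated. A minimal sketch of what I mean (the names `apply_state_conditions`, `conditions`, and `reverse_step` are my own placeholders, not the repo's API):

```python
import torch

def apply_state_conditions(trajectory, conditions, action_dim):
    """Overwrite the state portion of selected timesteps with known values.

    trajectory: tensor of shape [batch, horizon, action_dim + state_dim]
    conditions: dict mapping a timestep index to a known state, e.g.
                {0: start_state, horizon - 1: goal_state}
    """
    for t, state in conditions.items():
        # states are assumed to be stored after the action dimensions
        trajectory[:, t, action_dim:] = state
    return trajectory

# During sampling, the constraint would be re-applied after every reverse step:
#   for i in reversed(range(n_diffusion_steps)):
#       trajectory = reverse_step(model, trajectory, i)
#       trajectory = apply_state_conditions(trajectory, conditions, action_dim)
```

Is that roughly right?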

Thanks.


yilundu commented on August 23, 2024

Hi,

The diffusion model implicitly learns from data that the next observation should be consistent with the previous observation and action.

The optimality condition is both that the dynamics are valid and that the reward is optimal. The reward function is a neural network that is trained separately on rewards collected from the data. We use the gradient of that network to bias planning. Feel free to let us know if you have any other questions.
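
Roughly, a guided reverse step can be pictured like the sketch below. This is a simplified illustration rather than the exact code in the repo; `denoise_mean_fn`, `reward_model`, and `guide_scale` are placeholder names.

```python
import torch

def guided_reverse_step(denoise_mean_fn, reward_model, x, t, guide_scale=0.1):
    """One reverse-diffusion step whose mean is nudged by the reward gradient.

    denoise_mean_fn: returns (mean, variance) of the unguided reverse step
    reward_model:    maps a (noisy) trajectory and step t to a scalar reward estimate
    x:               trajectory tensor of shape [batch, horizon, transition_dim]
    """
    # gradient of the predicted reward with respect to the trajectory
    x_in = x.detach().requires_grad_(True)
    reward = reward_model(x_in, t).sum()
    grad = torch.autograd.grad(reward, x_in)[0]

    # unguided posterior mean/variance, then bias the mean uphill on reward
    mean, variance = denoise_mean_fn(x, t)
    mean = mean + guide_scale * variance * grad

    noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
    return mean + variance.sqrt() * noise
```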


xmlyqing00 commented on August 23, 2024

Thanks @yilundu. I see, using a neural network for the reward function makes sense. I think the input to the reward network is a trajectory of shape horizon × (state_dim + act_dim), and the target is the reward. The architecture could be a regression network such as an MLP.
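
Something along the lines of this minimal sketch is what I have in mind (the class name, hidden sizes, and training target are my own guesses, not taken from the paper):

```python
import torch
import torch.nn as nn

class TrajectoryRewardMLP(nn.Module):
    """Regresses a scalar reward/return from a flattened trajectory.

    Input:  [batch, horizon, state_dim + act_dim]
    Output: [batch, 1] predicted reward
    """
    def __init__(self, horizon, state_dim, act_dim, hidden_dim=256):
        super().__init__()
        in_dim = horizon * (state_dim + act_dim)
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, trajectory):
        # flatten [batch, horizon, state_dim + act_dim] -> [batch, in_dim]
        return self.net(trajectory.flatten(start_dim=1))

# trained with a plain regression loss against rewards from the dataset:
#   loss = torch.nn.functional.mse_loss(model(trajectory_batch), reward_batch)
```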

  1. How do we prepare the samples for reward-network training? Directly from the D4RL maze dataset, or do we also add negative samples, such as trajectories that cross walls?
  2. Similar to my earlier question: are the diffusion network and reward network tied to one maze map after training? If I have a new maze map, do they need to be retrained to fit it?

Thanks for your answers. I appreciate it.


yilundu commented on August 23, 2024
  1. The samples for reward-network training come from the same D4RL maze dataset.
  2. Yes, you would need to retrain the model on a new map (although if the map changes only slightly, you can add test-time constraints for the changed region; one illustrative way is sketched after this list).
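
To make the parenthetical about test-time constraints concrete, one illustrative option is to add an extra guidance term that penalizes trajectories entering the changed region. The sketch below is hypothetical (the function name, box representation, and indexing are not from the repo):

```python
import torch

def wall_penalty(xy, new_walls):
    """Penalty that is positive only when a waypoint lies inside an added wall box.

    xy:        [batch, horizon, 2] waypoint positions extracted from the trajectory
    new_walls: list of (x_min, x_max, y_min, y_max) boxes added at test time
    """
    penalty = xy.new_zeros(xy.shape[0])
    for x_min, x_max, y_min, y_max in new_walls:
        in_x = (xy[..., 0] - x_min).clamp(min=0) * (x_max - xy[..., 0]).clamp(min=0)
        in_y = (xy[..., 1] - y_min).clamp(min=0) * (y_max - xy[..., 1]).clamp(min=0)
        penalty = penalty + (in_x * in_y).sum(dim=-1)
    return penalty

# During guided sampling, its gradient would simply be subtracted from the
# reward-guidance direction, scaled by a weight you would tune.
```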


xmlyqing00 commented on August 23, 2024

I see. Thanks for your answers! Very insightful!


