
Comments (7)

jannerm commented on August 23, 2024
  1. Section 3.2 of the DDPM paper has a great discussion of the different prediction targets and how they are related. In particular:

To summarize, we can train the reverse process mean function approximator $\mu_\theta$ to predict $\tilde{\mu}_t$, or by modifying its parameterization, we can train it to predict $\epsilon$. (There is also the possibility of predicting $x_0$, but we found this to lead to worse sample quality early in our experiments.)

We found the choice between predicting $\epsilon$ (the added noise) and $x_0$ (the noiseless data) to be less important for trajectories than for images. (See the note after this list for how the different prediction targets are related.)

  2. (a) At initialization, $x^N$ is Gaussian noise of dimensionality [ horizon x (state_dim + action_dim) ]. You could possibly say that the points already have an order because of their position in $x^N$, but I'm not sure it's meaningful to say "$x_0^N$ precedes $x_1^N$", since they are both just noise anyway. (b) We don't enforce that collision constraint explicitly; the samples respect it by virtue of there being no data that walks through the walls. (c) Yep!
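
For reference, the standard DDPM relations behind that quote, written in the usual $\alpha_t$, $\bar{\alpha}_t$, $\beta_t$ notation (this is just the textbook parameterization, not anything specific to this repo):

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t$$

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right)$$

So predicting $\epsilon$, predicting $x_0$, or predicting $\tilde{\mu}_t$ are interchangeable parameterizations of the same reverse-process mean; they mainly differ in how the training loss weights errors.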


xmlyqing00 commented on August 23, 2024

Thanks for your detailed and quick response! I really appreciate it!

  1. I missed this comment in their paper, thanks for pointing it out.
  2. I see. For maze planning, I know horizon = 128/256/384 (from your Appendix). Can we say state_dim = 2 for 2D positions? I'm confused about the definition of the action: which one did you use in your experiments, (a) the direction to the next point, (b) the index of the next point, (c) the position of the next move, or none of these?
  3. As for the reward part, please correct me if I am wrong:
    a. $O_t = 1$ means the current state-action pair $(s_t, a_t)$ is valid (not inside a wall).
    b. $p(\tau \mid O_{1:T} = 1)$ means the sampled trajectory $\tau$ is valid at every timestep, and this plan (trajectory) will be executed starting from the first action.
    c. In Maze2D, $\mathcal{J}_\phi$ is not a network but a deterministic function that returns the reward $\mathcal{J}(\mu)$ and its gradient during training. I'm confused about how to compute the gradients $\nabla_{s_t, a_t} r(s_t, a_t)$, where $(s_t, a_t) = \mu_t$ here. Does $r$ return either 0 or 1? And I'm not sure what "no data" means in your earlier answer. (I've tried to write out my understanding in formulas just after this list.)
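
To write out my understanding more concretely (I'm using the usual control-as-inference convention of an exponentiated-reward optimality likelihood; please correct me if the paper uses a different form):

$$p(O_t = 1 \mid s_t, a_t) \propto \exp\big(r(s_t, a_t)\big)$$

$$\tilde{p}(\tau) \;\propto\; p_\theta(\tau)\, p(O_{1:T} = 1 \mid \tau) \;=\; p_\theta(\tau)\, \exp\Big(\textstyle\sum_t r(s_t, a_t)\Big)$$

so the guidance term used while sampling would be $\nabla_\tau \log p(O_{1:T} = 1 \mid \tau) = \sum_t \nabla_{s_t, a_t}\, r(s_t, a_t)$.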

Sorry to pile on more questions here. These are the problems I run into when trying to implement your work on Maze2D, and my main confusions are about the action and the reward. I'd really appreciate it if you could take some time to answer them.

Cheers


xmlyqing00 commented on August 23, 2024

After reading the D4RL dataset's Maze2D settings, I have answers to some of the questions above, but now I have new confusions.

  1. It seems that the diffusion model learns correct paths in the maze during training, which implicitly encodes the walls. At evaluation time it should be the same maze map, and the diffusion model can then estimate a correct path between two given points.
  2. The conditions in the maze setting are only the start and end locations (and maybe velocities). The trajectory is state + action, where the state is $(x, y, v_x, v_y)$ and the action is a 2D force in the D4RL settings. I'm curious how the diffusion model guarantees that the observation plus the action is consistent with the next observation. (A sketch of how I imagine the start/goal conditioning might work follows this list.)
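
For point 2, my current guess is that the start/goal conditioning is imposed inpainting-style, by overwriting the corresponding state entries of the sampled trajectory after every denoising step, so only the remaining entries have to be generated. A minimal sketch of what I mean (the names `apply_state_conditions`, `conditions`, and `reverse_step` are my own placeholders, not the repo's API):

```python
import torch

def apply_state_conditions(trajectory, conditions, action_dim):
    """Overwrite the state portion of selected timesteps with known values.

    trajectory: tensor of shape [batch, horizon, action_dim + state_dim]
    conditions: dict mapping a timestep index to a known state, e.g.
                {0: start_state, horizon - 1: goal_state}
    """
    for t, state in conditions.items():
        # states are assumed to be stored after the action dimensions
        trajectory[:, t, action_dim:] = state
    return trajectory

# During sampling, the constraint would be re-applied after every reverse step:
#   for i in reversed(range(n_diffusion_steps)):
#       trajectory = reverse_step(model, trajectory, i)
#       trajectory = apply_state_conditions(trajectory, conditions, action_dim)
```

Is that roughly right?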

Thanks.


yilundu commented on August 23, 2024

Hi,

The diffusion model implicitly learns from data that the next observation should be consistent with the previous observation and action.

The optimality condition is both that the dynamics are valid and that the reward is optimal. The reward function is a neural network that is trained separately on rewards collected from the data. We use the gradient of that network to bias planning. Feel free to let us know if you have any other questions.
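
Roughly, a guided reverse step can be pictured like the sketch below. This is a simplified illustration rather than the exact code in the repo; `denoise_mean_fn`, `reward_model`, and `guide_scale` are placeholder names.

```python
import torch

def guided_reverse_step(denoise_mean_fn, reward_model, x, t, guide_scale=0.1):
    """One reverse-diffusion step whose mean is nudged by the reward gradient.

    denoise_mean_fn: returns (mean, variance) of the unguided reverse step
    reward_model:    maps a (noisy) trajectory and step t to a scalar reward estimate
    x:               trajectory tensor of shape [batch, horizon, transition_dim]
    """
    # gradient of the predicted reward with respect to the trajectory
    x_in = x.detach().requires_grad_(True)
    reward = reward_model(x_in, t).sum()
    grad = torch.autograd.grad(reward, x_in)[0]

    # unguided posterior mean/variance, then bias the mean uphill on reward
    mean, variance = denoise_mean_fn(x, t)
    mean = mean + guide_scale * variance * grad

    noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
    return mean + variance.sqrt() * noise
```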


xmlyqing00 commented on August 23, 2024

Thanks @yilundu. I see, using a neural network for the reward function makes sense. I think the input to the reward network is a trajectory of shape horizon × (state_dim + act_dim), and the target is the reward. The architecture could be a regression network such as an MLP.
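
Something along the lines of this minimal sketch is what I have in mind (the class name, hidden sizes, and training target are my own guesses, not taken from the paper):

```python
import torch
import torch.nn as nn

class TrajectoryRewardMLP(nn.Module):
    """Regresses a scalar reward/return from a flattened trajectory.

    Input:  [batch, horizon, state_dim + act_dim]
    Output: [batch, 1] predicted reward
    """
    def __init__(self, horizon, state_dim, act_dim, hidden_dim=256):
        super().__init__()
        in_dim = horizon * (state_dim + act_dim)
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, trajectory):
        # flatten [batch, horizon, state_dim + act_dim] -> [batch, in_dim]
        return self.net(trajectory.flatten(start_dim=1))

# trained with a plain regression loss against rewards from the dataset:
#   loss = torch.nn.functional.mse_loss(model(trajectory_batch), reward_batch)
```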

  1. How do we prepare the samples for reward-network training? Directly from the D4RL maze dataset, or do we also add negative samples, such as trajectories that cross walls?
  2. Similar to my earlier question: are the diffusion network and reward network tied to one maze map after training? If I have a new maze map, do they need to be retrained to fit it?

Thanks for your answers. I appreciate it.


yilundu commented on August 23, 2024
  1. The samples for reward-network training come from the same D4RL maze dataset.
  2. Yes, you would need to retrain the model on a new map (although if the map changes only slightly, you can add test-time constraints for the changed region; one illustrative way is sketched after this list).
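
To make the parenthetical about test-time constraints concrete, one illustrative option is to add an extra guidance term that penalizes trajectories entering the changed region. The sketch below is hypothetical (the function name, box representation, and indexing are not from the repo):

```python
import torch

def wall_penalty(xy, new_walls):
    """Penalty that is positive only when a waypoint lies inside an added wall box.

    xy:        [batch, horizon, 2] waypoint positions extracted from the trajectory
    new_walls: list of (x_min, x_max, y_min, y_max) boxes added at test time
    """
    penalty = xy.new_zeros(xy.shape[0])
    for x_min, x_max, y_min, y_max in new_walls:
        in_x = (xy[..., 0] - x_min).clamp(min=0) * (x_max - xy[..., 0]).clamp(min=0)
        in_y = (xy[..., 1] - y_min).clamp(min=0) * (y_max - xy[..., 1]).clamp(min=0)
        penalty = penalty + (in_x * in_y).sum(dim=-1)
    return penalty

# During guided sampling, its gradient would simply be subtracted from the
# reward-guidance direction, scaled by a weight you would tune.
```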


xmlyqing00 commented on August 23, 2024

I see. Thanks for your answers! Very insightful!


