from safepo.common.lagrange import Lagrange
nu = 1.0
nu_lr = 0.1


Doubt about the updating method of Lagrange Multipliers (safe-policy-optimization, closed)

pku-alignment commented on June 9, 2024

Doubt about the updating method of Lagrange Multipliers

from safe-policy-optimization.

Comments (6)

Gaiejj commented on June 9, 2024

Why do FOCOPS and CUP also use the Adam optimizer? Both CUP and FOCOPS are first-order optimization algorithms and depend heavily on hyperparameters, so we believe that using the Adam optimizer instead of the original SGD optimizer gives smoother updates and thereby improves the algorithms' performance.

As for supporting the original implementation in the future: we're considering introducing it, that is, the SGD optimizer, as an option in our code. We will share the ablation results with the community when we update the code.
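To make the SGD-versus-Adam distinction concrete, here is a minimal pure-Python sketch of the two multiplier updates. It is not the safepo `Lagrange` class: the names `sgd_step`, `AdamMultiplier`, and `multiplier_grad`, the episode costs, and the cost limit are all hypothetical, and the multiplier loss is assumed to be `-lam * (mean_ep_cost - cost_limit)`, the usual Lagrangian form.

```python
import math


def multiplier_grad(mean_ep_cost, cost_limit):
    # Assumed loss: -lam * (mean_ep_cost - cost_limit); its gradient w.r.t. lam.
    return -(mean_ep_cost - cost_limit)


def sgd_step(lam, grad, lr):
    """Plain gradient descent on the multiplier, projected onto lam >= 0."""
    return max(0.0, lam - lr * grad)


class AdamMultiplier:
    """Scalar Adam update for the multiplier, projected onto lam >= 0."""

    def __init__(self, init=0.0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lam = init
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = 0.0  # first-moment estimate
        self.v = 0.0  # second-moment estimate
        self.t = 0    # step counter for bias correction

    def step(self, grad):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad * grad
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        self.lam = max(0.0, self.lam - self.lr * m_hat / (math.sqrt(v_hat) + self.eps))
        return self.lam


# Hypothetical mean episode costs over three training epochs.
costs = [60.0, 30.0, 26.0]
cost_limit = 25.0

lam_sgd = 0.0
adam = AdamMultiplier(init=0.0, lr=0.1)
for cost in costs:
    grad = multiplier_grad(cost, cost_limit)
    lam_sgd = sgd_step(lam_sgd, grad, lr=0.1)
    lam_adam = adam.step(grad)
```

With SGD the step size scales with the constraint violation, so a large early violation moves the multiplier a long way at once. Adam normalizes the step by the running gradient magnitude, so each step is roughly bounded by the learning rate, which is one plausible reading of the "smoother" behaviour described above.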


lijie9527 commented on June 9, 2024

In this case, can I conclude that FOCOPS and CUP handle cost constraints in the same way as Lagrangian methods such as PPO-Lagrangian, and that their main difference is how the actor is updated?


Gaiejj commented on June 9, 2024

Sure. In the code implementation, these three algorithms (FOCOPS, CUP, and PPO-Lagrangian) are strikingly similar; their only difference lies in the actor-update process.


lijie9527 commented on June 9, 2024

My last question is about TRPO-class algorithms such as TRPO, TRPO-Lagrangian, and CPO: should they update the critic networks with multiple epochs of full-batch updates or multiple epochs of mini-batch updates? Most TRPO-class implementations I have found online use multiple epochs of full-batch updates for the critic, while most PPO-class implementations use mini-batches. Your implementation uniformly uses multiple epochs of mini-batch updates for the critic. Is that because it is more effective, and because it keeps the comparison with the first-order PPO-class methods fair?

Also, I found that updating the critic of TRPO-like algorithms with multiple epochs of full-batch updates trains much faster than with multiple epochs of mini-batch updates, because the number of updates is much lower. Is it possible to adopt full-batch critic updates in TRPO-based algorithms?


Gaiejj commented on June 9, 2024

In our implementation, we referred to Tianshou and Stable-Baselines and update the critic over multiple epochs of mini-batches. We experimented with full-batch critic updates in earlier tests, but their performance didn't measure up to the mini-batch approach.
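The training-time gap raised in the question comes down to the number of gradient steps per policy iteration. The sketch below counts them for both schemes; the helper names and the rollout size, epoch count, and batch size are hypothetical illustrations, not values taken from the repository.

```python
import math


def critic_update_count(num_samples, epochs, batch_size=None):
    """Number of critic gradient steps per policy iteration.

    batch_size=None (or >= num_samples) means full-batch: one step per epoch.
    """
    if batch_size is None or batch_size >= num_samples:
        return epochs
    return epochs * math.ceil(num_samples / batch_size)


def minibatch_indices(num_samples, batch_size):
    """Yield contiguous index ranges covering the rollout once.

    Real implementations usually shuffle the indices first; that is omitted here.
    """
    for start in range(0, num_samples, batch_size):
        yield range(start, min(start + batch_size, num_samples))


# Hypothetical settings: 20000 samples per rollout, 10 critic epochs.
full_batch_steps = critic_update_count(20000, epochs=10)
mini_batch_steps = critic_update_count(20000, epochs=10, batch_size=64)
print(full_batch_steps, mini_batch_steps)  # prints "10 3130"
```

Under these assumed settings the mini-batch scheme performs 313 times as many critic updates, which accounts for the slower wall-clock time and gives the critic far more optimization steps on the same data.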


lijie9527 commented on June 9, 2024

I will further verify the effectiveness of full-batch critic updates. Thank you very much for your patient answers, which have resolved a long-standing question of mine.

