moral_rl's Introduction

Multi-Objective Reinforced Active Learning

Dependencies

wandb
tqdm
pytorch >= 1.7.0
numpy >= 1.20.0
scipy >= 1.1.0
pycolab == 1.2

Weights and Biases

Our code depends on Weights and Biases (wandb) for visualizing and logging results during training. As a result, we call wandb.init(), which will prompt you to add an API key that links the training runs with your personal wandb account. This can be done by pasting your WANDB_API_KEY into the respective box when running the code for the first time.
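As a rough pre-flight check, the sketch below tests whether a key is already available through the WANDB_API_KEY environment variable, which wandb reads instead of prompting. The helper name wandb_key_configured is hypothetical and not part of this repository.

```python
import os


def wandb_key_configured(env=None):
    """Return True if a non-empty WANDB_API_KEY is present in the given mapping."""
    # Fall back to the real process environment when no mapping is supplied.
    env = os.environ if env is None else env
    return bool(env.get("WANDB_API_KEY", "").strip())


if __name__ == "__main__":
    if wandb_key_configured():
        print("WANDB_API_KEY found; wandb should not prompt for a key.")
    else:
        print("WANDB_API_KEY not set; wandb may prompt unless you have logged in before.")
```

Exporting the variable in your shell before launching a training script is one way to avoid the interactive prompt, for example on a remote machine.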

Environments

Our gridworlds (Emergency: randomized_v2.py, Delivery: randomized_v3.py) build on the pycolab game engine, with a custom wrapper that provides functionality similar to gym. The engine comes with a user interface, and any environment can be played in the console by running python environment.py, using the arrow keys and w, a, s, d as controls.
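To illustrate what a gym-style interface looks like in general, here is a minimal, self-contained toy gridworld exposing reset() and step() with the classic 4-tuple return. It does not use pycolab and is not the wrapper from this repository; the class TinyGridEnv and all of its parameters are hypothetical.

```python
import random


class TinyGridEnv:
    """Hypothetical gym-style env: reset() -> obs, step(a) -> (obs, reward, done, info)."""

    # Action indices mapped to (row, col) moves: up, down, left, right.
    MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}

    def __init__(self, size=4, max_steps=20, seed=0):
        self.size = size
        self.max_steps = max_steps
        self.rng = random.Random(seed)

    def reset(self):
        self.steps = 0
        self.agent = (0, 0)
        # Place the goal on any cell except the start cell.
        cells = [(r, c) for r in range(self.size) for c in range(self.size) if (r, c) != (0, 0)]
        self.goal = self.rng.choice(cells)
        return self._obs()

    def step(self, action):
        dr, dc = self.MOVES[action]
        # Clamp the move so the agent stays inside the grid.
        r = min(max(self.agent[0] + dr, 0), self.size - 1)
        c = min(max(self.agent[1] + dc, 0), self.size - 1)
        self.agent = (r, c)
        self.steps += 1
        reached = self.agent == self.goal
        done = reached or self.steps >= self.max_steps
        return self._obs(), (1.0 if reached else 0.0), done, {}

    def _obs(self):
        return (self.agent, self.goal)


# Random rollout until the episode ends (goal reached or step limit hit).
env = TinyGridEnv()
obs = env.reset()
done = False
while not done:
    obs, reward, done, info = env.step(random.randrange(4))
```

An agent written against this reset/step contract does not need to know which engine renders the grid, which is presumably the motivation for wrapping the game engine in the first place.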

Training

There are four training scripts for

manually training a PPO agent on custom rewards (ppo_train.py),
training AIRL on a single expert dataset (airl_train.py),
active MORL with custom/automatic preferences (moral_train.py) and
training DRLHP with custom/automatic preferences (drlhp_train.py).

When using automatic preferences, a desired ratio can be passed as an argument. For example,

python moral_train.py --ratio a b c

will run MORAL using a (real-valued) ratio of a:b:c among the three explicit objectives in Delivery.
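A three-value flag like this can be parsed with argparse. The sketch below is illustrative only and is not taken from moral_train.py: the function parse_ratio, the default of 1:1:1, and the normalization to weights that sum to 1 are all assumptions.

```python
import argparse


def parse_ratio(argv=None):
    """Parse --ratio A B C and return the three values normalized to sum to 1."""
    parser = argparse.ArgumentParser(description="Illustrative --ratio parser (not the repository's).")
    parser.add_argument("--ratio", nargs=3, type=float, metavar=("A", "B", "C"),
                        default=[1.0, 1.0, 1.0])
    args = parser.parse_args(argv)
    total = sum(args.ratio)
    # Reject ratios that cannot be turned into a weight vector.
    if total <= 0 or any(w < 0 for w in args.ratio):
        parser.error("--ratio needs non-negative values with a positive sum")
    return [w / total for w in args.ratio]
```

With this sketch, parse_ratio(["--ratio", "1", "2", "1"]) gives the weights [0.25, 0.5, 0.25], so a ratio of 1:2:1 and a ratio of 2:4:2 would describe the same preference.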

Hyperparameters

Hyperparameters are passed as arguments to wandb.init() and can be changed by modifying the respective training files.
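As a sketch of this pattern, hyperparameters can be collected in a dictionary and handed to wandb.init() through its config argument. The keys, the placeholder values, the project name, and the helpers build_config and start_run are assumptions for illustration; the actual training files may name things differently.

```python
# Hypothetical defaults; the real training scripts define their own keys and values.
DEFAULT_CONFIG = {
    "env_id": "randomized_v3",  # Delivery, per the Environments section
    "lr": 3e-4,                 # placeholder value
    "gamma": 0.99,              # placeholder value
}


def build_config(**overrides):
    """Return a copy of the defaults with the given overrides applied."""
    unknown = set(overrides) - set(DEFAULT_CONFIG)
    if unknown:
        raise KeyError(f"unknown hyperparameters: {sorted(unknown)}")
    return {**DEFAULT_CONFIG, **overrides}


def start_run(config, project="moral_rl_demo"):
    """Start a wandb run that records the hyperparameters in its config."""
    import wandb  # imported lazily so the helpers above work without wandb installed
    return wandb.init(project=project, config=config)
```

Calling start_run(build_config(lr=1e-3)) would then log the run with the overridden learning rate stored alongside the results.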

moral_rl's Issues

Backbone RL algorithm

Hello! Thanks for sharing the code.

I am just a beginner in RL, and I am wondering whether the active learning method can be built upon other RL methods, such as SAC. If so, could you please provide any advice on how to modify your code?

Why are log action probabilities compared with advantages?

I am working with this code for my bachelor's thesis, and I am confused as to why, when computing classification accuracy, the code compares advantages (which can be both negative and positive) with log action probabilities (which are never positive).

Relevant snippet (update_discriminator() in airl.py):

class_predictions = torch.cat([torch.log(action_probabilities).unsqueeze(1), advantages], dim=1)
# Compute Loss function
loss = criterion(class_predictions, labels)
# Compute Accuracies
label_predictions = torch.argmax(class_predictions, dim=1)  # takes a max between probabilities and advantages
predicted_fake = (label_predictions[labels == 0] == 0).float()
predicted_expert = (label_predictions[labels == 1] == 1).float()

It seems to me that this code cannot predict a 0 in label_predictions for good actions: their advantage will be greater than 0, but the log probability can never exceed 0, since log(1) = 0.

From my experiments, I also observe that 'Fake accuracy' is consistently lower than 'Real accuracy'.

Could you please point me to an explanation? Thank you in advance.
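One possible reading, assuming the discriminator follows the standard AIRL form D = exp(f) / (exp(f) + pi) and that criterion is a cross-entropy loss over the two columns: the pair [log pi, f] then acts as two-class logits, and the softmax probability of class 1 equals D. The sketch below checks this numerically in plain Python rather than torch; the function names and sample values are illustrative and not taken from airl.py.

```python
import math


def discriminator_prob(f_value, action_prob):
    """AIRL-style discriminator output D = exp(f) / (exp(f) + pi)."""
    return math.exp(f_value) / (math.exp(f_value) + action_prob)


def softmax_expert_prob(f_value, action_prob):
    """Softmax over the logit pair [log pi, f]; index 1 is the expert class."""
    logits = [math.log(action_prob), f_value]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)


# Arbitrary sample values of f (the "advantage" column) and pi(a|s).
for f_value, action_prob in [(-3.0, 0.5), (-0.2, 0.5), (1.5, 0.9)]:
    d = discriminator_prob(f_value, action_prob)
    s = softmax_expert_prob(f_value, action_prob)
    assert abs(d - s) < 1e-12
    # argmax over [log pi, f] picks class 1 exactly when f > log pi, i.e. D > 0.5.
    predicted_expert = f_value > math.log(action_prob)
    assert predicted_expert == (d > 0.5)
```

Under that reading, the argmax predicts class 0 whenever f falls below log pi, which can happen when f is negative. The sign difference alone would therefore not rule out "fake" predictions; it would only do so if f stayed positive for generated samples.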

Repository: mlpeschl / moral_rl
