
lm-human-preferences's Introduction

Status: Archive (code is provided as-is, no updates expected)

Status: All references to gs://lm-human-preferences/ were updated to https://openaipublic.blob.core.windows.net/lm-human-preferences, as we migrated from GCP to Azure. The code, provided as-is, may no longer work. Pull requests welcome.
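
For example, an old gs:// path maps onto the new base URL by a simple string rewrite. A small illustrative helper (not a function from this repository):

    def gs_to_azure(gs_path):
        # Rewrites a legacy gs://lm-human-preferences/... path to the Azure
        # blob URL above (illustrative helper only, not part of the repo).
        prefix = "gs://lm-human-preferences/"
        assert gs_path.startswith(prefix), gs_path
        return ("https://openaipublic.blob.core.windows.net/lm-human-preferences/"
                + gs_path[len(prefix):])

    # e.g. gs_to_azure("gs://lm-human-preferences/runs/descriptiveness")
    # -> "https://openaipublic.blob.core.windows.net/lm-human-preferences/runs/descriptiveness"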

lm-human-preferences

This repository contains code for the paper Fine-Tuning Language Models from Human Preferences. See also our blog post.

We provide code for:

  • Training reward models from human labels
  • Fine-tuning language models using those reward models

It does not contain code for generating labels. However, we have released human labels collected for our experiments, at gs://lm-human-preferences/labels. For those interested, the question and label schemas are simple and documented in label_types.py.
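
As an illustration, one way to fetch a label file over HTTPS using the Azure base URL above (a hedged sketch: the blob name below is hypothetical, and the assumption that the files are JSON should be checked against label_types.py):

    import json
    import urllib.request

    BASE = "https://openaipublic.blob.core.windows.net/lm-human-preferences"
    # "descriptiveness/offline_5k.json" is a hypothetical blob name; browse
    # the labels/ prefix to find the real file names.
    url = BASE + "/labels/descriptiveness/offline_5k.json"
    with urllib.request.urlopen(url) as f:
        labels = json.load(f)  # assumes the label files are JSON
    print(len(labels))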

The code has only been tested using the smallest GPT-2 model (124M parameters).

Instructions

This code has only been tested using Python 3.7.3. Training has been tested on GCE machines with 8 V100s, running Ubuntu 16.04, but development also works on Mac OS X.

Installation

  • Install pipenv.

  • Install tensorflow: Install CUDA 10.0 and cuDNN 7.6.2, then pipenv install tensorflow-gpu==1.13.1. The code may technically run with tensorflow on CPU but will be very slow.

  • Install gsutil

  • Clone this repo. Then:

    pipenv install
    
  • (Recommended) Install horovod to speed up the code, or otherwise substitute some fast implementation in the mpi_allreduce_sum function of core.py. Make sure to use pipenv for the install, e.g. pipenv install horovod==0.18.1.
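
For reference, the shape of such a substitution might look like the following. This is a minimal sketch only; it assumes mpi_allreduce_sum operates on NumPy arrays, which should be verified against core.py before swapping anything in:

    import numpy as np
    from mpi4py import MPI

    def mpi_allreduce_sum_numpy(x):
        # Elementwise sum of x across all MPI ranks. Illustrative only; the
        # real mpi_allreduce_sum in core.py may expect a different input type.
        x = np.ascontiguousarray(x)
        out = np.empty_like(x)
        MPI.COMM_WORLD.Allreduce(x, out, op=MPI.SUM)
        return out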

Running

The following examples assume we are aiming to train a model to continue text in a physically descriptive way. You can read launch.py to see how the descriptiveness experiments and others are defined.

Note that we provide pre-trained models, so you can skip directly to RL fine-tuning or even to sampling from a trained policy, if desired.

Training a reward model

To train a reward model, use a command such as

experiment=descriptiveness
reward_experiment_name=testdesc-$(date +%y%m%d%H%M)
pipenv run ./launch.py train_reward $experiment $reward_experiment_name

This will save outputs (and tensorboard event files) to the directory /tmp/save/train_reward/$reward_experiment_name. The directory can be changed via the --save_dir flag.

Finetuning a language model

Once you have trained a reward model, you can finetune against it.

First, set

trained_reward_model=/tmp/save/train_reward/$reward_experiment_name

or if using our pretrained model,

trained_reward_model=gs://lm-human-preferences/runs/descriptiveness/reward_model

Then,

experiment=descriptiveness
policy_experiment_name=testdesc-$(date +%y%m%d%H%M)
pipenv run ./launch.py train_policy $experiment $policy_experiment_name --rewards.trained_model $trained_reward_model --rewards.train_new_model 'off'

This will save outputs (and tensorboard event files) to the directory /tmp/save/train_policy/$policy_experiment_name. The directory can be changed via the --save_dir flag.

Both steps at once

You can run a single command to train a reward model and then finetune against it

experiment=descriptiveness
experiment_name=testdesc-$(date +%y%m%d%H%M)
pipenv run ./launch.py train_policy $experiment $experiment_name

In this case, outputs are in the directory /tmp/save/train_policy/$experiment_name, and the reward model is saved to a subdirectory reward_model. The directory can be changed via the --save_dir flag.

Sampling from a trained policy

Specify the policy to load:

save_dir=/tmp/save/train_policy/$policy_experiment_name

or if using our pretrained model,

save_dir=gs://lm-human-preferences/runs/descriptiveness

Then run:

pipenv run ./sample.py sample --save_dir $save_dir --savescope policy

Note that this script can run on fewer than 8 GPUs. For example, you can pass the flag --mpi 1 if you only have one GPU.

LICENSE

MIT

Citation

Please cite the paper with the following bibtex entry:

@article{ziegler2019finetuning,
  title={Fine-Tuning Language Models from Human Preferences},
  author={Ziegler, Daniel M. and Stiennon, Nisan and Wu, Jeffrey and Brown, Tom B. and Radford, Alec and Amodei, Dario and Christiano, Paul and Irving, Geoffrey},
  journal={arXiv preprint arXiv:1909.08593},
  url={https://arxiv.org/abs/1909.08593},
  year={2019}
}

lm-human-preferences's People

Contributors

karthik-rangarajan, leondz, wuthefwasthat


lm-human-preferences's Issues

The installation steps don't work for me

I had to install (on Mac) MPI, google-cloud-storage, ftfy, etc., and it still doesn't work.
Could you please update the installation and usage docs? On usage: a simple one-page notebook for starting experiments would be a great help if you really want others to try this out. Thanks!

PPO training

Hi, thanks very much for sharing. I have a quick question about training the policy with PPO. Here,

outputs = self.policy.analyze_responses_op(rollouts['queries'], rollouts['responses'])

aren't you analyzing the responses with the same policy that generated the rollouts,

rollouts = self.policy.respond(queries, length=self.hparams.task.response_length)

except that for training you divide the logits by the temperature? Maybe I am missing something, but for PPO the loss needs the ratio pi_theta / pi_old, as in

ratio = tf.exp(logprob - old_logprob)

yet it seems the old and new policies are the same. I would really appreciate your answer.
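
For reference, the standard clipped PPO objective this question is about looks roughly like the following. This is a generic sketch, not an excerpt from this repository: old_logprob is recorded when the rollout is sampled, while logprob is recomputed from the current parameters during each optimization step, so the ratio departs from 1 as updates are applied within an iteration:

    import tensorflow as tf

    def ppo_policy_loss(logprob, old_logprob, advantages, cliprange=0.2):
        # pi_theta(a|s) / pi_theta_old(a|s), computed in log space.
        ratio = tf.exp(logprob - tf.stop_gradient(old_logprob))
        unclipped = -advantages * ratio
        clipped = -advantages * tf.clip_by_value(ratio, 1.0 - cliprange, 1.0 + cliprange)
        # Pessimistic (clipped) surrogate objective, averaged over the batch.
        return tf.reduce_mean(tf.maximum(unclipped, clipped))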

Trouble with accessing bucket / Google credentials

I am running this on Google Colab and initially ran into the following error at the line bucket = client.get_bucket(bucket_name) in gcs.py (bucket_name is gpt-2):

google.auth.exceptions.DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started

Starts at:

File "/content/gdrive/My Drive/Glasgow/MSc Project/code/2496762_cluster_files/lm-human-preferences-master/lm_human_preferences/utils/gcs.py", line 75, in get_blob
bucket = client.get_bucket(bucket_name)

After adding my own credentials I get another error at the same place:

google.api_core.exceptions.Forbidden: 403 GET https://www.googleapis.com/storage/v1/b/gpt-2?projection=noAcl: [email protected] does not have storage.buckets.get access to the Google Cloud Storage bucket.

Do I have to apply for a permission or are there publicly available credentials that should be used?

Thanks in advance,
Alexander

EDIT:

Could this be related to "role" when generating a credential key?

How to let GPT-2 deviate from the reference model?

Hi,

We know that the KL term in the loss constrains how far the active GPT-2, which produces the responses used for reward feedback, can move from the original GPT-2.
How can I tune the parameters to relax this constraint? I want the active GPT-2 to be able to deviate further from the original reference GPT-2, because in my experiments the rewards do not improve as expected, possibly due to this constraint.
I am new to PPO and would appreciate any suggestions.

Thanks.
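
For reference, the paper's reward adds a KL penalty against the frozen reference model, R(x, y) = r(x, y) - beta * log(pi(y|x) / rho(y|x)); a smaller coefficient beta (or a looser KL target, if the adaptive controller is used) lets the policy drift further. A rough sketch of that penalty (the names below are illustrative, not this repository's actual hyperparameters or functions):

    import tensorflow as tf

    def penalized_reward(reward, logprob, ref_logprob, kl_coef):
        # Per-token KL estimate between the active policy and the frozen
        # reference model; lowering kl_coef weakens the constraint.
        kl = logprob - ref_logprob
        return reward - kl_coef * kl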

Permission Denied on Google Cloud Storage

Hi,
I ran gcloud auth application-default login and logged in with my email,
but I keep getting a permission denied error:
[screenshot: 403 permission denied error, 2023-08-16]
I also couldn't make sense of most of the posts I found when searching on Google. 😭

About the calculated returns for value loss

How should I understand the line

returns = advantages + values

In the original A3C paper, the losses for the policy and the value function are calculated as follows (as far as I know, PPO uses the same mechanism):

[image: policy-gradient and value-loss equations from the A3C paper]

which (as I understand it) would mean that the value-function loss in this code should just minimize the advantage function.

Would you please help me understand it? @WuTheFWasThat
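
For reference, this line is the usual generalized advantage estimation (GAE) bookkeeping: advantages are computed with GAE(lambda), and adding the values back reconstructs the lambda-return, which is then used as the regression target for the value head. A generic sketch (illustrative, not this repository's implementation):

    import numpy as np

    def gae_advantages_and_returns(rewards, values, gamma=1.0, lam=0.95):
        # rewards, values: sequences of length T; the value after the last
        # step is assumed to be 0 (end of episode).
        rewards = np.asarray(rewards, dtype=np.float64)
        values = np.asarray(values, dtype=np.float64)
        T = len(rewards)
        advantages = np.zeros(T)
        lastgaelam = 0.0
        for t in reversed(range(T)):
            next_value = values[t + 1] if t + 1 < T else 0.0
            delta = rewards[t] + gamma * next_value - values[t]
            lastgaelam = delta + gamma * lam * lastgaelam
            advantages[t] = lastgaelam
        # The lambda-return used as the value-function regression target:
        returns = advantages + values
        return advantages, returns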

Got an error that I can't trace

Hi all, I'm getting an error that is difficult to trace - any advice?

Traceback (most recent call last):
  File "./sample.py", line 73, in <module>
    sample=launch_sample,
  File "/Users/mysterefrank/deep_collective_fun/lm-human-preferences/lm_human_preferences/utils/launch.py", line 65, in main
    fire.Fire(_Commands)
  File "/Users/mysterefrank/.local/share/virtualenvs/lm-human-preferences-9-VNjZ2b/lib/python3.7/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/Users/mysterefrank/.local/share/virtualenvs/lm-human-preferences-9-VNjZ2b/lib/python3.7/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/Users/mysterefrank/.local/share/virtualenvs/lm-human-preferences-9-VNjZ2b/lib/python3.7/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "./sample.py", line 69, in launch_sample
    launch.launch('sample', partial(sample_policy, **kwargs), mode=mode, mpi=mpi)
  File "/Users/mysterefrank/deep_collective_fun/lm-human-preferences/lm_human_preferences/utils/launch.py", line 13, in launch
    subprocess.check_call(['mpiexec', '-n', str(mpi), 'python', '-c', 'import sys; import pickle; pickle.loads(open("/tmp/pickle_fn", "rb").read())()'], stderr=subprocess.STDOUT)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/subprocess.py", line 347, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['mpiexec', '-n', '1', 'python', '-c', 'import sys; import pickle; pickle.loads(open("/tmp/pickle_fn", "rb").read())()']' returned non-zero exit status 1.
