The rewardedsoups from alexrame

Number of samples for inference/evaluation

Interesting paper!

How many samples do you use for inference on each of the datasets?
The default in args_utis.py seems to be 200. Is that value used for all of the datasets? I ask because it seems to have a big impact on evaluation performance.
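
As a rough illustration of why the sample count matters: assuming the per-sample rewards are independent with a fixed standard deviation (the value below is hypothetical, not measured), the standard error of the mean reward shrinks as 1/sqrt(n), so going from 200 to 800 samples would only halve the noise.

```python
import math


def standard_error(std: float, n: int) -> float:
    """Standard error of a mean over n independent samples with std deviation `std`."""
    if n <= 0:
        raise ValueError("n must be a positive number of samples")
    return std / math.sqrt(n)


# ASSUMED_STD is a placeholder for illustration; it is not taken from any experiment.
ASSUMED_STD = 1.0

for n in (50, 200, 800):
    print(n, standard_error(ASSUMED_STD, n))
```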

Many thanks!

About finetuning LLAMA for summarization task

Hi all,

I followed the README in the llama folder and ran these commands:

python3 train_ppo.py --task summary --dataset_name news --reward_models Tristan/gpt2_reward_summarization --output_folder ${folder_r1}
python3 train_ppo.py --task summary --dataset_name news --reward_models CogComp/bart-faithful-summary --dataset_name news-detector --reward_formats '1-0' --output_folder ${folder_r2}

python3 inference_rewardedsoups.py --task summary --dataset_name news --peft_names ${folder_r1} ${folder_r2}

The results I got are as below:

d[ 0.0 ] = [{'LABEL_0': 1.0334709184616804, 'n': 'grs'}, {'HALLUCINATED': 1.3255301121249794, 'FAITHFUL': -0.5046869936399162, 'n': 'bfsd'}, {'length': 200}]
d[ 0.1 ] = [{'LABEL_0': 1.0334709184616804, 'n': 'grs'}, {'HALLUCINATED': 1.3255301121249794, 'FAITHFUL': -0.5046869936399162, 'n': 'bfsd'}, {'length': 200}]
d[ 0.2 ] = [{'LABEL_0': 1.0334709184616804, 'n': 'grs'}, {'HALLUCINATED': 1.3255301121249794, 'FAITHFUL': -0.5046869936399162, 'n': 'bfsd'}, {'length': 200}]
d[ 0.3 ] = [{'LABEL_0': 1.0334709184616804, 'n': 'grs'}, {'HALLUCINATED': 1.3255301121249794, 'FAITHFUL': -0.5046869936399162, 'n': 'bfsd'}, {'length': 200}]
d[ 0.4 ] = [{'LABEL_0': 1.0334709184616804, 'n': 'grs'}, {'HALLUCINATED': 1.3255301121249794, 'FAITHFUL': -0.5046869936399162, 'n': 'bfsd'}, {'length': 200}]
d[ 0.5 ] = [{'LABEL_0': 1.0334709184616804, 'n': 'grs'}, {'HALLUCINATED': 1.3255301121249794, 'FAITHFUL': -0.5046869936399162, 'n': 'bfsd'}, {'length': 200}]
d[ 0.6 ] = [{'LABEL_0': 1.0334709184616804, 'n': 'grs'}, {'HALLUCINATED': 1.3255301121249794, 'FAITHFUL': -0.5046869936399162, 'n': 'bfsd'}, {'length': 200}]
d[ 0.7 ] = [{'LABEL_0': 1.0334709184616804, 'n': 'grs'}, {'HALLUCINATED': 1.3255301121249794, 'FAITHFUL': -0.5046869936399162, 'n': 'bfsd'}, {'length': 200}]
d[ 0.8 ] = [{'LABEL_0': 1.0334709184616804, 'n': 'grs'}, {'HALLUCINATED': 1.3255301121249794, 'FAITHFUL': -0.5046869936399162, 'n': 'bfsd'}, {'length': 200}]
d[ 0.9 ] = [{'LABEL_0': 1.0334709184616804, 'n': 'grs'}, {'HALLUCINATED': 1.3255301121249794, 'FAITHFUL': -0.5046869936399162, 'n': 'bfsd'}, {'length': 200}]
d[ 1.0 ] = [{'LABEL_0': 1.0334709184616804, 'n': 'grs'}, {'HALLUCINATED': 1.3255301121249794, 'FAITHFUL': -0.5046869936399162, 'n': 'bfsd'}, {'length': 200}]

Could you give me some hints about the cause of these results? Every interpolation coefficient from 0.0 to 1.0 yields exactly the same scores, so changing the objective weights has no effect. Could this be because I did not tune the number of fine-tuning epochs?
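
For context, my understanding is that rewarded soups linearly interpolates the weights of the two fine-tuned models. Identical scores at every coefficient would then suggest either that the two sets of adapter weights are identical, or that the interpolated weights are not actually being loaded. Below is a minimal sketch of that check. It uses plain Python lists as hypothetical stand-ins for the adapter tensors; the function names, the toy values, and the direction of the coefficient are my own assumptions, not taken from the repository.

```python
from typing import Dict, List

# Hypothetical stand-in for a state dict: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def interpolate(w1: Weights, w2: Weights, lam: float) -> Weights:
    """Return (1 - lam) * w1 + lam * w2, parameter by parameter."""
    if w1.keys() != w2.keys():
        raise ValueError("both weight sets must have the same parameter names")
    return {
        name: [(1.0 - lam) * a + lam * b for a, b in zip(w1[name], w2[name])]
        for name in w1
    }


def max_abs_diff(w1: Weights, w2: Weights) -> float:
    """Largest absolute difference between two weight sets with matching names."""
    return max(abs(a - b) for name in w1 for a, b in zip(w1[name], w2[name]))


# Toy values for illustration only; real adapters would be loaded from
# ${folder_r1} and ${folder_r2}.
w_r1 = {"lora_A": [0.1, -0.2], "lora_B": [0.0, 0.3]}
w_r2 = {"lora_A": [0.4, 0.2], "lora_B": [0.5, -0.1]}

# If this prints 0.0 for the real adapters, the two runs produced the same
# weights and every interpolation would necessarily score the same.
print("difference between endpoints:", max_abs_diff(w_r1, w_r2))

for lam in (0.0, 0.5, 1.0):
    print(lam, interpolate(w_r1, w_r2, lam))
```

If the two endpoints differ but the scores still do not change, the problem is more likely in how the interpolated weights are loaded before generation than in the number of training epochs.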

Kind regards,
Ethan
