
CoderUJB

This is the official repository for CoderUJB: An Executable and Unified Java Benchmark for Practical Programming Scenarios, accepted to the ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA) 2024.

CoderUJB (Unified Java Benchmark): A new benchmark designed to evaluate LLMs across diverse Java programming tasks that are executable and reflective of actual development scenarios, acknowledging Java’s prevalence in real-world software production.

Contents

  • Install
  • CodeUJB
  • QuickStart Scripts

Install

  1. Install codeujb.

    # create a new conda environment
    conda create -n ujb python=3.10
    conda activate ujb
    
    # clone and install codeujb
    git clone https://github.com/ZZR0/ISSTA24-CoderUJB.git
    cd ISSTA24-CoderUJB
    pip install -r requirements.txt
    pip install -e .
    

    For detailed package versions, please refer to requirements.txt.
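
    You can optionally verify the installation by importing the package (the module name code_ujb matches the scripts used below):

    # quick sanity check; assumes the editable install above succeeded
    python -c "import code_ujb; print('code_ujb installed')"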

  2. Refer to the defects4j repository to install the execution environment.
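
    For reference, a typical Defects4J setup looks roughly like the following; please follow the Defects4J README for the authoritative, version-specific instructions (Java, Git, SVN, and Perl are prerequisites):

    # clone Defects4J and initialize its project repositories
    git clone https://github.com/rjust/defects4j.git
    cd defects4j
    cpanm --installdeps .                     # install Perl dependencies
    ./init.sh                                 # download project repositories and external tools
    export PATH=$PATH:$(pwd)/framework/bin    # make the defects4j command available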

CodeUJB

Evaluate a model on CodeUJB

Step 1. Generate model answers to CodeUJB questions

We support three backbones for generating CodeUJB answers: hf (Hugging Face transformers), openai (OpenAI API), and tgi (Text Generation Inference).
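
The commands below use shell variables as placeholders. For example, you might set them as follows (these values are illustrative, taken from the examples later in this README):

# example placeholder values (adjust to your model and task)
model_name_or_path=$HOME/models/deepseekcoder-instruct-7b   # local weights or a Hugging Face repo ID
run_id=deepseekcoder-instruct-7b                            # a name you give to this run/model
gen_mode=chat                                               # chat for instruction-tuned models, complete otherwise
dataset=codeujbcomplete                                     # one of the five CodeUJB tasks
num_samples=10                                              # samples generated per question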

# generate answers with huggingface `transformers` backbone.
python code_ujb/generate_hf.py \
    --model-path $model_name_or_path \
    --model-id $run_id \
    --gen-mode $gen_mode \
    --bench-name $dataset \
    --num-samples $num_samples \
    --save-generations-path ./log/$run_id/$dataset/generations-$gen_mode.json 

# generate answers with openai API backbone.

export OPENAI_API_BASE=''
export OPENAI_API_KEY=''

python code_ujb/generate_api.py \
    --model-path $run_id \
    --model-id $run_id \
    --gen-mode $gen_mode \
    --bench-name $dataset \
    --num-samples $num_samples \
    --parallel 8 \
    --save-generations-path ./log/$run_id/$dataset/generations-$gen_mode.json 
# If `model-id` is not in the OpenAI model list, `generate_api.py` will generate answers with the Text Generation Inference (TGI) backbone.
# Please refer to Text Generation Inference (https://github.com/huggingface/text-generation-inference) to deploy your TGI server first.

export TGI_API_URL_${run_id//-/_}=http://127.0.0.1:8081,http://127.0.0.1:8082 # The Text Generation Inference API URL.

python code_ujb/generate_api.py \
    --model-path $run_id \
    --model-id $run_id \
    --gen-mode $gen_mode \
    --bench-name $dataset \
    --num-samples $num_samples  \
    --parallel 32 \
    --save-generations-path ./log/$run_id/$dataset/generations-$gen_mode.json 
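
If you do not already have a TGI server running, a Docker launch typically looks something like the sketch below; the image tag and flags depend on your TGI version, so consult the TGI documentation for exact options:

# illustrative only: serve a local model with TGI on port 8081 (matches the TGI_API_URL export above)
docker run --gpus all --shm-size 1g -p 8081:80 \
    -v $HOME/models:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id /data/deepseekcoder-instruct-7b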

Arguments:

  • [model-path] is the path to the model weights, which can be a local folder or a Hugging Face repo ID. If you are using generate_api.py, it should be the same as the model ID.
  • [model-id] is a name you give to the model.
  • [gen-mode] has two options: complete for models without instruction fine-tuning and chat for models with instruction fine-tuning.
  • [bench-name] is the name of the dataset you want to evaluate. There are five datasets in CodeUJB: codeujbrepair, codeujbcomplete, codeujbtestgen, codeujbtestgenissue, and codeujbdefectdetection.
  • [num-samples] is the number of samples for each coding question you want to generate.
  • [save-generations-path] is the path to save the generated answer.
  • [parallel] is the number of parallel API calls.

e.g.,

python code_ujb/generate_api.py --model-path gpt-3.5-turbo --model-id gpt-3.5-turbo --gen-mode chat --bench-name codeujbcomplete --num-samples 10 --save-generations-path log/gpt-3.5-turbo/codeujbcomplete/generations-chat.jsonl

The answers will be saved to log/gpt-3.5-turbo/codeujbcomplete/generations-chat.jsonl.
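
To cover all five tasks, one option is to loop over the dataset names with the same settings as in the example above:

# run generation for every CodeUJB task
for dataset in codeujbcomplete codeujbrepair codeujbtestgen codeujbtestgenissue codeujbdefectdetection; do
    python code_ujb/generate_api.py --model-path gpt-3.5-turbo --model-id gpt-3.5-turbo \
        --gen-mode chat --bench-name $dataset --num-samples 10 \
        --save-generations-path log/gpt-3.5-turbo/$dataset/generations-chat.jsonl
done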

Step 2. Evaluate model answers of CodeUJB

Please make sure you have installed defects4j first.

python3 code_ujb/evaluate.py \
    --model-path $model_name_or_path \
    --model-id $run_id \
    --gen-mode $gen_mode \
    --bench-name $dataset \
    --num-samples $num_samples \
    --load-generations-path ./log/$run_id/$dataset/generations-$gen_mode.json \
    --eval-output-path ./log/$run_id/$dataset/evaluation-$gen_mode.json

Arguments:

  • [load-generations-path] is the path to the generated answer.
  • [eval-output-path] is the path to save the evaluation results.

e.g.,

python code_ujb/evaluate.py --model-path gpt-3.5-turbo --model-id gpt-3.5-turbo --gen-mode chat --bench-name codeujbcomplete --num-samples 10 --load-generations-path log/gpt-3.5-turbo/codeujbcomplete/generations-chat.jsonl --eval-output-path ./log/gpt-3.5-turbo/codeujbcomplete/evaluation-chat.json

The evaluation results will be saved to ./log/gpt-3.5-turbo/codeujbcomplete/evaluation-chat.json.
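
The output is a JSON file; you can pretty-print it for a quick look at the scores (the exact fields depend on CodeUJB's output format):

# preview the evaluation results
python -m json.tool ./log/gpt-3.5-turbo/codeujbcomplete/evaluation-chat.json | head -n 40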

QuickStart Scripts

# generate and evaluate with the OpenAI API; please set the OpenAI API key first.
# export OPENAI_API_BASE=''
# export OPENAI_API_KEY=''
./scripts/run_code_ujb.sh api_gen chat multiplepython gpt-3.5-turbo gpt-3.5-turbo
./scripts/run_code_ujb.sh eval chat multiplepython gpt-3.5-turbo gpt-3.5-turbo

# generate with ray inference
./scripts/run_code_ujb.sh local_gen chat multiplepython $HOME/models/deepseekcoder-instruct-7b deepseekcoder-instruct-7b
./scripts/run_code_ujb.sh eval chat multiplepython $HOME/models/deepseekcoder-instruct-7b deepseekcoder-instruct-7b

# generate with tgi inference
./scripts/run_code_ujb.sh tgi_gen chat multiplepython $HOME/models/deepseekcoder-instruct-7b deepseekcoder-instruct-7b
./scripts/run_code_ujb.sh eval chat multiplepython $HOME/models/deepseekcoder-instruct-7b deepseekcoder-instruct-7b
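
Judging from the examples above, the script appears to take five positional arguments; a generic invocation therefore looks like the sketch below (the argument labels and the model path/ID are illustrative, not names defined by the script itself):

# ./scripts/run_code_ujb.sh <mode> <gen_mode> <bench_name> <model_path> <model_id>
#   mode:       api_gen | local_gen | tgi_gen | eval
#   gen_mode:   chat | complete
#   bench_name: e.g. codeujbcomplete (or multiplepython, as in the examples above)
./scripts/run_code_ujb.sh local_gen chat codeujbcomplete $HOME/models/your-model your-model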
