Comments (7)
Hi @maowayne123, in our paper we use greedy decoding for all HumanEval experiments, which means the temperature is 0. The exact hyperparameters are:
--temperature 0.0
--top_p 1.0
--max_new_tokens 512
--n_problems_per_batch 16
--n_samples_per_problem 1
--n_batches 1
The choice of n_problems_per_batch will only slightly affect the results due to floating-point round-off (related discussion: https://discuss.pytorch.org/t/results-of-forward-pass-are-different-with-different-batch-size/162277).
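For reference, here is a minimal greedy-decoding sketch with Hugging Face transformers that mirrors these flags (a sketch only: the toy prompt is made up, and the actual experiments/text2code.py does more prompt handling and post-processing):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-coder-6.7b-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "def fibonacci(n):\n"  # hypothetical example prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# temperature=0.0 with top_p=1.0 is pure argmax decoding, i.e. do_sample=False.
out = model.generate(**inputs, do_sample=False, max_new_tokens=512)
print(tokenizer.decode(out[0], skip_special_tokens=True))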
Hi, thanks for your reply. I tried your hyperparameters and this time I got a much better result, but it still does not align with the paper's number (75.6% vs. 76.8%). Do you have any idea why? The script I used this time is:
python experiments/text2code.py \
--model_key deepseek-ai/deepseek-coder-6.7b-base \
--dataset humaneval \
--save_path output_dir/mc_6_7_ds_flash_attn.jsonl \
--n_batches 1 \
--n_problems_per_batch 2 \
--n_samples_per_problem 1 \
--model_name_or_path ~/weight/magicoders-s-ds-6.7/ \
--top_p 1.0 \
--max_new_tokens 512 \
--temperature 0.0
BTW, BFloat16 is currently enabled, which may cause some precision loss. Moreover, enabling flash attention yields a slightly better result (75.6% vs. 75.0%). Thank you!
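To make the BFloat16 point concrete, a tiny illustration (my own, not from the repo) of how bf16's reduced mantissa rounds values that float32 stores exactly:

import torch

x = torch.tensor([1.0001, 100.01, 10001.0])   # fine in float32
print(x.to(torch.bfloat16).to(torch.float32)) # ~ tensor([1., 100., 9984.])
# bf16 keeps only ~2-3 significant decimal digits; this per-value rounding
# accumulates across a forward pass and can flip borderline token choices.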
I found something even more embarrassing: the plain DeepSeek-Coder base model can reach 81.7% accuracy, so there must be something wrong.
The script I used is:
python experiments/text2code.py \
--model_key deepseek-ai/deepseek-coder-6.7b-base \
--dataset humaneval \
--save_path output_dir/ds_6_7_ds.jsonl \
--n_batches 1 \
--n_problems_per_batch 2 \
--n_samples_per_problem 1 \
--model_name_or_path ~/weight/deepseek-coder-6.7b/ \
--top_p 1.0 \
--max_new_tokens 512 \
--temperature 0.0
The test command I used is:
docker run -v $(pwd):/app ganler/evalplus:latest --dataset humaneval --samples output_dir/ds_6_7_ds.jsonl
The result I got is:
Base
{'pass@1': 0.8170731707317073}
Base + Extra
{'pass@1': 0.7621951219512195}
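For context, EvalPlus's pass@1 here is the standard unbiased pass@k estimator from the Codex paper (Chen et al., 2021); with --n_samples_per_problem 1 it reduces to the plain pass rate, e.g. 134 of 164 HumanEval tasks passing gives 134/164 ≈ 0.817. A minimal sketch:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator: n samples per task, c of them correct.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(1, 1, 1), pass_at_k(1, 0, 1))  # 1.0, 0.0 -- one sample per task
print(134 / 164)                               # 0.8170731707317073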
@maowayne123 Thanks for reporting this interesting finding. I suppose you were running the DeepSeek-Coder Instruct version? We didn't run it with this script but instead report the EvalPlus Leaderboard results. If your finding holds, the prompt could be the factor. Also, since DeepSeek-Coder does not release its instruction data, we cannot analyze whether there are contamination issues.
And the results discrepancy can be due to your choice of n_problems_per_batch, which influences cuBLAS's optimization strategy. If you experiment with different values, you may see different results. We chose 16 for faster evaluation.
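To see why the batch size matters at all, a small illustration (not the actual cuBLAS code path): floating-point addition is not associative, so regrouping the same reduction, which is effectively what a different batch size does to the GEMM kernels, can change the result bits:

import torch

torch.manual_seed(0)
x = torch.randn(4096)
a = x.sum()                           # one grouping of the same 4096 values
b = x.view(16, 256).sum(dim=1).sum()  # another grouping
print(a.item(), b.item(), (a - b).item())  # typically a tiny nonzero difference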
Hello, thank you for such impressive work.
I'm trying to reproduce the scores reported in your paper. Using the script you provided in the 'experiments' folder, I got
Base
{'pass@1': 0.6402439024390244}
Base + Extra
{'pass@1': 0.573170731707317}
for Magicoder-DS-6.7B and
Base
{'pass@1': 0.7439024390243902}
Base + Extra
{'pass@1': 0.6890243902439024}
for Magicoder-S-DS-6.7B, which does not align with the results in the paper.
The command I used is:
python experiments/text2code.py \
--model_key deepseek-ai/deepseek-coder-6.7b-base \
--dataset humaneval \
--save_path ds_6_7_ds.jsonl \
--n_batches 1 \
--n_problems_per_batch 16 \
--n_samples_per_problem 1 \
--model_name_or_path ise-uiuc/Magicoder-DS-6.7B \
--top_p 1.0 \
--max_new_tokens 512 \
--temperature 0.0
I'm confused about this. I would be grateful if you could provide some extra information. Thank you.
Hi @maowayne123 and @answers111, I deeply appreciate your efforts in running the Magicoder evaluations and raising these issues. I have written documentation on how to 100% reproduce the paper results on HumanEval(+) and MBPP(+) here.
I am also attaching the generated samples here: magicoder-evalplus-results.tar.gz.
Upon investigation, the reasons for the discrepancy are twofold: package versions and batch size. During the experiments, we simply chose the batch size that best utilized the available GPUs (including 16, 24, and 28); we did not pay much attention to the randomness incurred by this choice, and choosing a different value can slightly improve or worsen the results. I have also drafted a separate Limitations section in the README based on your helpful comments.
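Since package versions are one of the two factors, it can help to record the environment alongside the generated samples; a minimal sketch (my own suggestion, not from the repo's docs):

import torch, transformers

print("torch", torch.__version__,
      "| transformers", transformers.__version__,
      "| cuda", torch.version.cuda)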
Anyway, many thanks for your participation in the discussion. I will close this issue for now, but feel free to reopen it in case of any questions. We will further improve Magicoder in the near future and provide more comprehensive evaluations!