

UniverseFly commented on September 20, 2024

Hi @maowayne123, in our paper we use greedy decoding for all HumanEval experiments, which means the temperature is 0. The exact hyperparameters are

    --temperature 0.0
    --top_p 1.0
    --max_new_tokens 512
    --n_problems_per_batch 16
    --n_samples_per_problem 1
    --n_batches 1
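
For reference, these flags map onto ordinary Hugging Face `generate` arguments. A minimal, self-contained sketch of the assumed correspondence (the repo's text2code.py may wire things differently):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-base")
    model = AutoModelForCausalLM.from_pretrained(
        "deepseek-ai/deepseek-coder-6.7b-base", torch_dtype=torch.bfloat16
    ).cuda()

    inputs = tok("def fib(n):", return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        do_sample=False,      # temperature 0.0 => greedy decoding
        max_new_tokens=512,   # --max_new_tokens 512
    )
    print(tok.decode(out[0], skip_special_tokens=True))

With `do_sample=False`, `top_p` has no effect, and `--n_samples_per_problem 1` simply means one greedy completion per problem.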

The choice of n_problems_per_batch will only slightly affect the results due to floating-point round-off (related discussion: https://discuss.pytorch.org/t/results-of-forward-pass-are-different-with-different-batch-size/162277).
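
To see why, here is a minimal PyTorch sketch (assuming a CUDA GPU; not from the magicoder repo) showing that splitting the same inputs into different batch sizes can change the low-order bits of the output:

    import torch

    torch.manual_seed(0)
    layer = torch.nn.Linear(4096, 4096).cuda().bfloat16()
    x = torch.randn(16, 4096).cuda().bfloat16()

    with torch.no_grad():
        out_full = layer(x)                                  # one batch of 16
        out_split = torch.cat([layer(x[:2]), layer(x[2:])])  # batches of 2 and 14

    # cuBLAS may pick different kernels/reduction orders for different shapes,
    # so the outputs are close but not guaranteed bit-identical.
    print(torch.equal(out_full, out_split))            # may be False
    print((out_full - out_split).abs().max().item())   # tiny difference, if any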


maowayne123 commented on September 20, 2024

> Hi @maowayne123, in our paper we use greedy decoding for all HumanEval experiments, which means the temperature is 0. The exact hyperparameters are
>
>     --temperature 0.0
>     --top_p 1.0
>     --max_new_tokens 512
>     --n_problems_per_batch 16
>     --n_samples_per_problem 1
>     --n_batches 1
>
> The choice of n_problems_per_batch will only slightly affect the results due to floating-point round-off (related discussion: https://discuss.pytorch.org/t/results-of-forward-pass-are-different-with-different-batch-size/162277).

Hi, thanks for your reply. I tried your hyperparameters and this time got a much better result, but it still does not align with the paper (75.6% vs. 76.8%). Do you have any idea why? The script I used this time is:

python experiments/text2code.py \
    --model_key deepseek-ai/deepseek-coder-6.7b-base \
    --dataset humaneval \
    --save_path output_dir/mc_6_7_ds_flash_attn.jsonl \
    --n_batches 1 \
    --n_problems_per_batch 2 \
    --n_samples_per_problem 1 \
    --model_name_or_path ~/weight/magicoders-s-ds-6.7/ \
    --top_p 1.0 \
    --max_new_tokens 512 \
    --temperature 0.0

BTW, BFloat16 is currently enabled, which may cause precision loss. Moreover, FlashAttention gives a slightly better result (75.6% vs. 75.0%). Thank you!
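
For anyone who wants to isolate these two factors, precision and the attention backend are both set at model load time. A hedged sketch using the standard Hugging Face transformers API (assuming a version with FlashAttention-2 support; text2code.py may differ):

    import torch
    from transformers import AutoModelForCausalLM

    # bfloat16 + FlashAttention-2; swap torch_dtype for torch.float32 (and drop
    # attn_implementation) to measure the effect of precision alone.
    model = AutoModelForCausalLM.from_pretrained(
        "ise-uiuc/Magicoder-S-DS-6.7B",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
        device_map="auto",
    )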


maowayne123 commented on September 20, 2024

I found something even more embarrassing: the DeepSeek-Coder base model can reach 81.7% accuracy, so there must be something wrong.

The script I used is:

python experiments/text2code.py \
    --model_key deepseek-ai/deepseek-coder-6.7b-base \
    --dataset humaneval \
    --save_path output_dir/ds_6_7_ds.jsonl \
    --n_batches 1 \
    --n_problems_per_batch 2 \
    --n_samples_per_problem 1 \
    --model_name_or_path ~/weight/deepseek-coder-6.7b/ \
    --top_p 1.0 \
    --max_new_tokens 512 \
    --temperature 0.0

The test script I used is:

    docker run -v $(pwd):/app ganler/evalplus:latest --dataset humaneval --samples output_dir/ds_6_7_ds.jsonl

The result I got is:

    Base
    {'pass@1': 0.8170731707317073}
    Base + Extra
    {'pass@1': 0.7621951219512195}
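
For reference, pass@1 numbers like those above follow the standard unbiased pass@k estimator from the Codex paper; with a single greedy sample per problem it reduces to the plain fraction of problems solved. A small sketch:

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k = 1 - C(n-c, k) / C(n, k), computed stably,
        given n samples per problem of which c pass."""
        if n - c < k:
            return 1.0
        return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

    # With n = 1 greedy sample per problem, pass@1 is just solved / total,
    # e.g. 134 of 164 HumanEval problems:
    print(134 / 164)  # ~0.817, matching the Base score above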


UniverseFly commented on September 20, 2024

@maowayne123 Thanks for reporting this interesting finding. I suppose you were running the DeepSeek-Coder Instruct version? We didn't run it with this script but instead reported the EvalPlus Leaderboard results. If your finding holds, the prompt could be a factor. Also, since DeepSeek-Coder does not release its instruction data, we cannot analyze whether there are contamination issues.


UniverseFly commented on September 20, 2024

Also, the discrepancy in results can be due to your choice of n_problems_per_batch, which influences cuBLAS's optimization strategy. If you experiment with different values, you may see different results. We chose 16 for faster evaluation.
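
As a caveat, PyTorch's determinism switches make repeated runs with identical shapes reproducible, but they do not make outputs identical across different batch sizes. The usual settings (standard PyTorch API) look like this:

    import os
    import torch

    # Must be set before cuBLAS initializes; restricts cuBLAS to
    # deterministic workspace configurations.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

    # Raise an error on nondeterministic ops instead of silently using them.
    torch.use_deterministic_algorithms(True)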


answers111 commented on September 20, 2024

Hello, thank you for such impressive work.
I'm trying to reproduce the scores reported in your paper. Using the script you provided in the 'experiments' folder, I got

    Base
    {'pass@1': 0.6402439024390244}
    Base + Extra
    {'pass@1': 0.573170731707317}

for Magicoder-DS-6.7B and

    Base
    {'pass@1': 0.7439024390243902}
    Base + Extra
    {'pass@1': 0.6890243902439024}

for Magicoder-S-DS-6.7B, which is not aligned with the results in the paper.

The command I used is:

python experiments/text2code.py \
    --model_key deepseek-ai/deepseek-coder-6.7b-base \
    --dataset humaneval \
    --save_path ds_6_7_ds.jsonl \
    --n_batches 1 \
    --n_problems_per_batch 16 \
    --n_samples_per_problem 1 \
    --model_name_or_path ise-uiuc/Magicoder-DS-6.7B \
    --top_p 1.0 \
    --max_new_tokens 512 \
    --temperature 0.0 

I'm confused about this. I would be grateful if you could provide some extra information. Thank you.


UniverseFly commented on September 20, 2024

Hi @maowayne123 and @answers111, I deeply appreciate your efforts in running the Magicoder evaluations and raising these issues. I wrote documentation on how to exactly reproduce the paper results on HumanEval(+) and MBPP(+) here.
I am also attaching the generated samples here: magicoder-evalplus-results.tar.gz.

Upon investigation, the reasons for the discrepancy are twofold: package versions and batch size. During the experiments, we simply chose whichever batch size best utilized the available GPUs (including 16, 24, and 28) and didn't pay much attention to the randomness incurred by this choice; a different value can slightly improve or worsen the results. I also drafted a separate Limitations section in the README based on your helpful comments.
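
Since package versions are one of the two factors, it can help to record the exact environment next to each run's samples. A small sketch (the package list is illustrative):

    import importlib.metadata as md

    for pkg in ["torch", "transformers", "evalplus", "flash-attn"]:
        try:
            print(pkg, md.version(pkg))
        except md.PackageNotFoundError:
            print(pkg, "not installed")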

Anyway, many thanks for participating in the discussion. I will close this issue for now, but feel free to reopen it in case of any questions. We will further improve Magicoder in the near future and provide more comprehensive evaluations!

