Comments (7)
Hi @maowayne123, in our paper we use greedy decoding for all HumanEval experiments, which means the temperature is 0. The exact hyperparameters are:
--temperature 0.0
--top_p 1.0
--max_new_tokens 512
--n_problems_per_batch 16
--n_samples_per_problem 1
--n_batches 1
The choice of n_problems_per_batch will only slightly affect the results due to floating-point round-off (related discussion: https://discuss.pytorch.org/t/results-of-forward-pass-are-different-with-different-batch-size/162277).
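For reference, here is a minimal greedy-decoding sketch with Hugging Face transformers that mirrors these flags (a sketch only: the toy prompt is made up, and the actual experiments/text2code.py does more prompt handling and post-processing):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-coder-6.7b-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "def fibonacci(n):\n"  # hypothetical example prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# temperature=0.0 with top_p=1.0 is pure argmax decoding, i.e. do_sample=False.
out = model.generate(**inputs, do_sample=False, max_new_tokens=512)
print(tokenizer.decode(out[0], skip_special_tokens=True))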
Hi, thanks for your reply. I tried your hyperparameters and this time I got a much better result, but it still does not align with the paper's number (75.6% vs. 76.8%). Do you have any idea why? The script I used this time is:
python experiments/text2code.py \
--model_key deepseek-ai/deepseek-coder-6.7b-base \
--dataset humaneval \
--save_path output_dir/mc_6_7_ds_flash_attn.jsonl \
--n_batches 1 \
--n_problems_per_batch 2 \
--n_samples_per_problem 1 \
--model_name_or_path ~/weight/magicoders-s-ds-6.7/ \
--top_p 1.0 \
--max_new_tokens 512 \
--temperature 0.0
BTW, BFloat16 is currently enabled, which may cause some precision loss. Moreover, enabling flash attention yields a slightly better result (75.6% vs. 75.0%). Thank you!
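To make the BFloat16 point concrete, a tiny illustration (my own, not from the repo) of how bf16's reduced mantissa rounds values that float32 stores exactly:

import torch

x = torch.tensor([1.0001, 100.01, 10001.0])   # fine in float32
print(x.to(torch.bfloat16).to(torch.float32)) # ~ tensor([1., 100., 9984.])
# bf16 keeps only ~2-3 significant decimal digits; this per-value rounding
# accumulates across a forward pass and can flip borderline token choices.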
I found something even more embarrassing: the plain DeepSeek-Coder base model can reach 81.7% accuracy, so there must be something wrong.
The script I used is:
python experiments/text2code.py \
--model_key deepseek-ai/deepseek-coder-6.7b-base \
--dataset humaneval \
--save_path output_dir/ds_6_7_ds.jsonl \
--n_batches 1 \
--n_problems_per_batch 2 \
--n_samples_per_problem 1 \
--model_name_or_path ~/weight/deepseek-coder-6.7b/ \
--top_p 1.0 \
--max_new_tokens 512 \
--temperature 0.0
The test command I used is:
docker run -v $(pwd):/app ganler/evalplus:latest --dataset humaneval --samples output_dir/ds_6_7_ds.jsonl
The result I got is:
Base
{'pass@1': 0.8170731707317073}
Base + Extra
{'pass@1': 0.7621951219512195}
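For context, EvalPlus's pass@1 here is the standard unbiased pass@k estimator from the Codex paper (Chen et al., 2021); with --n_samples_per_problem 1 it reduces to the plain pass rate, e.g. 134 of 164 HumanEval tasks passing gives 134/164 ≈ 0.817. A minimal sketch:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator: n samples per task, c of them correct.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(1, 1, 1), pass_at_k(1, 0, 1))  # 1.0, 0.0 -- one sample per task
print(134 / 164)                               # 0.8170731707317073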
@maowayne123 Thanks for reporting this interesting finding. I suppose you were running the DeepSeek-Coder Instruct version? We didn't run it with this script but instead report the EvalPlus Leaderboard results. If your finding holds, the prompt could be the factor. Also, since DeepSeek-Coder does not release its instruction data, we cannot analyze whether there are contamination issues.
And the results discrepancy can be due to your choice of n_problems_per_batch, which influences cuBLAS's optimization strategy. If you experiment with different values, you may see different results. We chose 16 for faster evaluation.
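To see why the batch size matters at all, a small illustration (not the actual cuBLAS code path): floating-point addition is not associative, so regrouping the same reduction, which is effectively what a different batch size does to the GEMM kernels, can change the result bits:

import torch

torch.manual_seed(0)
x = torch.randn(4096)
a = x.sum()                           # one grouping of the same 4096 values
b = x.view(16, 256).sum(dim=1).sum()  # another grouping
print(a.item(), b.item(), (a - b).item())  # typically a tiny nonzero difference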
Hello, thank you for such impressive work.
I'm trying to reproduce the scores reported in your paper. Using the script you provided in the 'experiments' folder, I got
Base
{'pass@1': 0.6402439024390244}
Base + Extra
{'pass@1': 0.573170731707317}
for Magicoder-DS-6.7B and
Base
{'pass@1': 0.7439024390243902}
Base + Extra
{'pass@1': 0.6890243902439024}
for Magicoder-S-DS-6.7B, which does not align with the results in the paper.
The command I used is:
python experiments/text2code.py \
--model_key deepseek-ai/deepseek-coder-6.7b-base \
--dataset humaneval \
--save_path ds_6_7_ds.jsonl \
--n_batches 1 \
--n_problems_per_batch 16 \
--n_samples_per_problem 1 \
--model_name_or_path ise-uiuc/Magicoder-DS-6.7B \
--top_p 1.0 \
--max_new_tokens 512 \
--temperature 0.0
I'm confused about this. I would be grateful if you could provide some extra information. Thank you.
Hi @maowayne123 and @answers111, I deeply appreciate your efforts in running the Magicoder evaluations and raising these issues. I have written documentation on how to 100% reproduce the paper results on HumanEval(+) and MBPP(+) here.
I am also attaching the generated samples here: magicoder-evalplus-results.tar.gz.
Upon investigation, the reasons for the discrepancy are twofold: package versions and batch size. During the experiments, we simply chose the batch size that best utilized the available GPUs (including 16, 24, and 28); we did not pay much attention to the randomness incurred by this choice, and choosing a different value can slightly improve or worsen the results. I have also drafted a separate Limitations section in the README based on your helpful comments.
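Since package versions are one of the two factors, it can help to record the environment alongside the generated samples; a minimal sketch (my own suggestion, not from the repo's docs):

import torch, transformers

print("torch", torch.__version__,
      "| transformers", transformers.__version__,
      "| cuda", torch.version.cuda)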
Anyway, many thanks for your participation in the discussion. I will close this issue for now, but feel free to reopen it in case of any questions. We will further improve Magicoder in the near future and provide more comprehensive evaluations!