
magicoder's Introduction

🎩 Magicoder: Source Code Is All You Need

Paper

🎩 Models | 📚 Dataset | 🚀 Quick Start | 👀 Demo | 📝 Citation | 🙏 Acknowledgements

We are thrilled that Magicoder and OSS-Instruct have inspired many amazing projects.

Contact: Yuxiang Wei, Zhe Wang, Yifeng Ding, Jiawei Liu, Lingming Zhang.

About

  • 🎩Magicoder is a model family empowered by 🪄OSS-Instruct, a novel approach to enlightening LLMs with open-source code snippets so that they generate low-bias, high-quality instruction data for code.
  • 🪄OSS-Instruct mitigates the inherent bias of LLM-synthesized instruction data by grounding the LLM in a wealth of open-source references, producing more diverse, realistic, and controllable data (see the sketch below).
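To make the idea concrete, here is a minimal sketch of the OSS-Instruct loop, assuming the OpenAI chat API; the prompt wording is illustrative only, not the authors' exact prompt (see this repository for the real one):

```python
# Hedged OSS-Instruct sketch: seed the teacher LLM with a real open-source
# code snippet and ask it to invent a self-contained coding problem plus a
# solution. The prompt text below is illustrative, not the authors' exact prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

snippet = "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)"

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": (
            "Gain inspiration from the following code snippet and create a "
            "self-contained coding problem together with a correct solution.\n\n"
            f"```python\n{snippet}\n```"
        ),
    }],
)
print(response.choices[0].message.content)
```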

[Figures: overview of OSS-Instruct and overview of results.]

🎩 Models

| Model | Checkpoint | Size | HumanEval (+) | MBPP (+) | License |
|---|---|---|---|---|---|
| Magicoder-CL-7B | 🤗 HF Link | 7B | 60.4 (55.5) | 64.2 (52.6) | Llama2 |
| Magicoder-S-CL-7B | 🤗 HF Link | 7B | 70.7 (66.5) | 68.4 (56.6) | Llama2 |
| Magicoder-DS-6.7B | 🤗 HF Link | 6.7B | 66.5 (60.4) | 75.4 (61.9) | DeepSeek |
| Magicoder-S-DS-6.7B | 🤗 HF Link | 6.7B | 76.8 (70.7) | 75.7 (64.4) | DeepSeek |

👀 Demo

Online Gradio Demo

Quickly try out our Magicoder Playground powered by Gradio! Huge thanks to AK (@_akhaliq) and the Hugging Face team for their support!

Local Gradio Demo

We follow WizardCoder and provide a script to build a local demo server. You can launch your local Gradio demo as follows:

```bash
cd demo
CUDA_VISIBLE_DEVICES=0 python magicoder_demo.py \
   --base_model "ise-uiuc/Magicoder-S-DS-6.7B" \
   --device "cuda:0" \
   --port 8080
```
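Once the server is running, you can also query it programmatically. A small sketch using `gradio_client`, assuming the app exposes a single default endpoint (check `magicoder_demo.py` for the actual input signature):

```python
# Hypothetical client call against the local demo launched above.
# The endpoint and argument list are assumptions; inspect the Gradio app
# (or call client.view_api()) to confirm them.
from gradio_client import Client

client = Client("http://127.0.0.1:8080/")
result = client.predict("Write a quicksort function in Python.")
print(result)
```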

📚 Dataset
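The training sets referenced throughout this repo are hosted on the Hugging Face Hub. A minimal loading sketch with the `datasets` library, using the dataset IDs mentioned in the issues below:

```python
# Load the two Magicoder training sets from the Hugging Face Hub.
from datasets import load_dataset

oss_instruct = load_dataset("ise-uiuc/Magicoder-OSS-Instruct-75K", split="train")
evol_instruct = load_dataset("ise-uiuc/Magicoder-Evol-Instruct-110K", split="train")

# Column names per the issues below: OSS-Instruct-75K pairs a generated
# "problem" with its "solution".
print(oss_instruct)
print(oss_instruct[0]["problem"][:200])
```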

🚀 Quick Start

```python
from transformers import pipeline
import torch

MAGICODER_PROMPT = """You are an exceptionally intelligent coding assistant that consistently delivers accurate and reliable responses to user instructions.

@@ Instruction
{instruction}

@@ Response
"""

instruction = "Implement a high-level API for a TODO list application. The API takes as input an operation request and updates the TODO list in place. If the request is invalid, raise an exception."

# Fill the user instruction into the Magicoder prompt template.
prompt = MAGICODER_PROMPT.format(instruction=instruction)
generator = pipeline(
    model="ise-uiuc/Magicoder-S-DS-6.7B",
    task="text-generation",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Deterministic decoding (temperature=0.0, sampling disabled by default).
result = generator(prompt, max_length=2048, num_return_sequences=1, temperature=0.0)
print(result[0]["generated_text"])
```

This code snippet will generate the following output:

Here is a simple Python implementation of a TODO list API:

```python
class TodoList:
    def __init__(self):
        self.todo_list = []

    def add_task(self, task):
        if not isinstance(task, str):
            raise ValueError("Task must be a string")
        self.todo_list.append(task)

    def remove_task(self, task):
        if task not in self.todo_list:
            raise ValueError("Task not found in the list")
        self.todo_list.remove(task)

    def get_tasks(self):
        return self.todo_list

    def update_task(self, old_task, new_task):
        if old_task not in self.todo_list:
            raise ValueError("Old task not found in the list")
        if not isinstance(new_task, str):
            raise ValueError("New task must be a string")
        index = self.todo_list.index(old_task)
        self.todo_list[index] = new_task

    def clear_list(self):
        self.todo_list = []
```

This API allows you to add tasks, remove tasks, get all tasks, update tasks, and clear the list. It also raises exceptions for invalid operations.

You can use this API like this:

```python
todo = TodoList()
todo.add_task("Buy groceries")
todo.add_task("Finish project")
print(todo.get_tasks())  # Output: ['Buy groceries', 'Finish project']
todo.update_task("Buy groceries", "Buy fruits")
print(todo.get_tasks())  # Output: ['Buy fruits', 'Finish project']
todo.remove_task("Finish project")
print(todo.get_tasks())  # Output: ['Buy fruits']
todo.clear_list()
print(todo.get_tasks())  # Output: []
```

📝 Citation

```bibtex
@article{wei2023magicoder,
  title={Magicoder: Source Code Is All You Need},
  author={Wei, Yuxiang and Wang, Zhe and Liu, Jiawei and Ding, Yifeng and Zhang, Lingming},
  journal={arXiv preprint arXiv:2312.02120},
  year={2023}
}
```

🙏 Acknowledgements

We thank AK (@_akhaliq) and the Hugging Face team for their support with the Magicoder Playground! We also thank the amazing projects that truly inspired us.

⚠️ Important Note

  • Bias, Risks, and Limitations: Magicoder models may make errors, produce misleading content, or struggle with tasks unrelated to coding.

  • Usage: Magicoder models are trained on synthetic data generated by OpenAI models. Please pay attention to OpenAI's terms of use when using the models and the datasets. Magicoder models will not compete with any of OpenAI's commercial products.

⭐️ Star History

[Figure: Star History chart.]

magicoder's People

Contributors

ganler, natedingyifeng, universefly, zhewang2001


magicoder's Issues

The templates used in reproducing the eval results: why is the instruction added again after "@@ Response"?

There is an input format mismatch between the eval and training process. Do you intend to emphasize the problem before the model generates its output?

When doing the HumanEval(+) eval, the compiled inputs look like this, e.g.:


````
@@ Instruction
Write a solution to the following problem:
```python
def fib(n: int):
    """Return n-th Fibonacci number.
    >>> fib(10)
    55
    >>> fib(1)
    1
    >>> fib(8)
    21
    """

@@ Response

def fib(n: int):
    """Return n-th Fibonacci number.
    >>> fib(10)
    55
    >>> fib(1)
    1
    >>> fib(8)
    21
    """```
````

But in the data-processing and training code, the instruction data is compiled as:

```
You are an exceptionally intelligent coding assistant that consistently delivers accurate and reliable responses to user instructions.

@@ Instruction
Write a solution to the following coding problem:
{problem}

@@ Response
{response}
```

There is no such **_rephrasing/emphasizing_** in Magicoder's training data.
From the eval results, this mismatch does not seem to have obvious negative effects, but was it deliberate?

The evaluation result of Magicoder is not aligned with the result in the paper

Hi, thanks for your great work.

I tested the performance of Magicoder; however, it does not align with the result in the paper (68.9 vs. 76.8). I guess it is because I used different hyperparameters for inference, e.g., --top_p 1, --temperature 1, and so on. I would be grateful if the authors could provide the specific inference hyperparameters. Thank you. I list the script I used below:

I tried to evaluate Magicoder with the script you provided in the 'experiments' folder using the following command:

```bash
python experiments/text2code.py \
    --model_key deepseek-ai/deepseek-coder-6.7b-base \
    --dataset humaneval \
    --save_path output_dir/mc_6_7_ds \
    --n_batches 1 \
    --n_problems_per_batch 1 \
    --n_samples_per_problem 1 \
    --model_name_or_path ~/weight/magicoders-s-ds-6.7/ \
    --top_p 1 \
    --max_new_tokens 4096 \
    --temperature 1
```

Then I used the following command for EvalPlus:

```bash
docker run -v $(pwd):/app ganler/evalplus:latest --dataset humaneval --samples output_dir/mc_6_7_ds.jsonl
```

Finally, I got this result:

```
Base
{'pass@1': 0.6890243902439024}
Base + Extra
{'pass@1': 0.6158536585365854}
```

Can you consider adding my explanation of how to use Magicoder in text-generation-webui?

After experimenting with text-generation-webui by oobabooga, I found the following:

  • Magicoder models are all instruct-only models (no chat/chat-instruct).
  • You need to create a new custom template under the Parameters/Instruction template tab.
  • You also need to change these values under the Parameters/Generation tab: max_new_tokens=1024, top_p=0.9, top_k=50, repetition_penalty=1, repetition_penalty_range=1024.
  • Copy the content of the attached text file into the custom instruction template: instruction-template-magicode.txt

I took the generation parameters from deepseek-coder/demo/app.py.
The instruction template is edited from airoboros-v1.2 after comparing it with Magicoder's prompt template.

Confusion about the training code

First of all, thank you for your amazing work!
I'm attempting to replicate the training process, and I have a question about the train.py file. In your paper you mention using two A100-80G GPUs, but I couldn't find any mention of multiprocessing or distributed training in your code. Did you use DeepSpeed for training? If not, could you provide guidance on modifying the code to make it compatible with a multi-GPU setup?
Thanks once again!

The correctness of solutions

How good is the quality of the Python code in the solutions?
Can they be used directly to train the model?

Any environment requirements for the model? It doesn't work on a MacBook Air M1 (16 GB)

I tried to use Magicoder with Ollama on a MacBook Air M1 (16 GB). It works for other models, but when I run this one, I get an error:

```
...
ggml_metal_init: GPU name:   Apple M1
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 10922.67 MiB
ggml_metal_init: maxTransferRate               = built-in GPU
llama_new_context_with_model: compute buffer total size = 1083.07 MiB
llama_new_context_with_model: max tensor size =   102.54 MiB
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  3648.58 MiB, ( 3649.20 / 10922.67)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =  8192.00 MiB, offs =            0
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =     0.03 MiB, offs =   8589918208, (11841.23 / 10922.67)ggml_metal_add_buffer: warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'alloc           ' buffer, size =  1080.02 MiB, (12921.25 / 10922.67)ggml_metal_add_buffer: warning: current allocated size is greater than the recommended max working set size
ggml_metal_graph_compute: command buffer 0 failed with status 5
GGML_ASSERT: /tmp/ollama-20231213-4188-jpu97j/llm/llama.cpp/gguf/ggml-metal.m:1623: false
2023/12/23 16:46:59 llama.go:451: signal: abort trap
2023/12/23 16:46:59 llama.go:459: error starting llama runner: llama runner process has terminated
2023/12/23 16:46:59 llama.go:525: llama runner stopped successfully
```

I googled it; is it similar to ggerganov/llama.cpp#2048?

I'm not sure whether it can be tuned to work on this Mac; if not, it would be better to add the limitation (or requirement) to the README.

Are the training loss and validation loss recorded?

Hi,

Thank you very much for your code. I am reproducing your training process. I wonder what your training and validation losses were during training; I want to align my runs with your training on the Magicoder-OSS-Instruct-75K and ise-uiuc/Magicoder-Evol-Instruct-110K datasets.

Thanks!

How to write a prompt for code completion tasks

I've run a prompt for Python code completion using:

````python
prompt_template = f"""Write a solution to the following problem:
```python
{code}
```"""
````
But the LLM's output adds nothing new to the code from the input prompt; it just generates some other information.
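For comparison, the eval format quoted in the first issue above wraps the partial code inside the full Magicoder template. A sketch of that formatting, with an illustrative `code` value:

```python
# Wrap a code-completion problem in the full Magicoder template, mirroring
# the eval format quoted in the first issue above.
MAGICODER_PROMPT = """You are an exceptionally intelligent coding assistant that consistently delivers accurate and reliable responses to user instructions.

@@ Instruction
{instruction}

@@ Response
"""

code = 'def fib(n: int):\n    """Return n-th Fibonacci number."""\n'
instruction = f"Write a solution to the following problem:\n```python\n{code}```"
print(MAGICODER_PROMPT.format(instruction=instruction))
```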

Data collection and generation

Thanks for releasing your research code to everyone.
I found it a bit difficult to figure out what the variables here are for.
Can you please explain them?
And in general, could you please add a more comprehensive README about the data collection/generation parts of the codebase?
Thanks!

Using dilated attention instead of vanilla attention in the Llama model and fine-tuning

I want to ask whether I can replace the attention used in the base model with dilated attention and then do the fine-tuning. The idea is to reduce the complexity of attention and increase the context window. Does DeepSeek use Llama 2 as the base model with the same architecture? If so, can I load the checkpoint of layers such as the norm layers and feed-forward layers, or do I need to re-factor the LLM from scratch?
Or is there any method to adapt or share weights?

HuggingFace Playground has failed

Hello,
I have used the Hugging Face playground for this model in the past, but now it hits a runtime error. Should I expect that a fix is coming?

Thank You!

Possibility for a Mixture-of-Experts Model?

With the recent release of Mixtral 8x7B, there's a lot of hope and excitement around open-source MoE models.

It would be very interesting to see how a narrowly focused MoE model performs.

Text-gen prompt template?

Sorry if this was mentioned before, but is there a stock prompt template in ooba's text-gen that works with this?

Quantised finetuning on 4×22 GB GPUs

Hello
I am trying to finetune CodeLlama-Python-hf on 4 GPUs with 22 GB of memory each. Using the training process described in the Magicoder README gives a CUDA out-of-memory error.

How can I quantise the model or optimise memory usage so it fits on my machine?
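Not part of this repo's training recipe, but one common workaround is QLoRA-style 4-bit loading with LoRA adapters. A minimal sketch using `bitsandbytes` and `peft`; the model ID and hyperparameters below are illustrative assumptions:

```python
# Hypothetical 4-bit + LoRA setup to shrink memory usage on small GPUs.
# This swaps in QLoRA-style finetuning; it is not the repo's own recipe.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Python-hf",  # illustrative base model
    quantization_config=bnb_config,
    device_map="auto",
)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapters train
```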

Catastrophic forgetting problem

Hi,

Thanks for open-sourcing this, but when I fine-tuned (whether full-parameter or LoRA) on my dataset, catastrophic forgetting kept coming up (performance on HumanEval decreased). I don't know how to solve it; do you have any tips?

Training data format for Magicoder-OSS-Instruct-75K

Hi, thanks for the work!

I was wondering how you format the OSS-75K data for training. Is it in an Alpaca-like format, such as:

```
You are an exceptionally intelligent coding assistant that consistently delivers accurate and reliable responses to user instructions.

@@ Instruction
{instruction}  # problem column of the OSS-75K dataset

@@ Response
{response}  # solution column of the OSS-75K dataset
```

Thanks
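For what it's worth, a sketch of how a row could be mapped into the README's prompt format; the column names follow the issue text above, and whether this matches the authors' exact training code is an assumption:

```python
# Map an OSS-Instruct-75K row into the Magicoder prompt format from the
# Quick Start section. Column names ("problem"/"solution") per the issue above.
MAGICODER_PROMPT = """You are an exceptionally intelligent coding assistant that consistently delivers accurate and reliable responses to user instructions.

@@ Instruction
{instruction}

@@ Response
"""

def format_example(row: dict) -> str:
    """Concatenate the templated problem with its solution for training."""
    return MAGICODER_PROMPT.format(instruction=row["problem"]) + row["solution"]

print(format_example({"problem": "Write hello world.", "solution": "print('Hello, world!')"}))
```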

Evaluation codes not found

Hello,

Thanks for providing such an amazing repo for code-generating LLMs.
I am impressed that Magicoder got a great result on HumanEval; however, I can't find the evaluation code for it.
It would be great if the evaluation code were made available.

Optimizer selection

Hi, thanks for the brilliant work!

I am curious about the decision to use Adafactor as the optimizer for Magicoder. Have other options been explored or tried in this context? 🤔

Reproducing Magicoder-S-DS-6.7B on 8 A40 GPUs

Because running the README-DEV.md script directly with accelerate reported insufficient training memory, I switched to launching with DeepSpeed stage 1, keeping all other parameters at their defaults. Since this is an 8-GPU setup, the iteration step size was reduced to 1/4.

After experimenting, I found:

  1. Training speed dropped significantly.
  2. Neither stage-1 nor stage-2 training could reach 60%.

I'd like to ask: can different hardware, plus adding DeepSpeed, really make the results this much worse?

Achieved performance close to MagicoderS by finetuning only with `evol-codealpaca-v1`

Thanks to your amazing tutorial, we reproduced the training process and the experiments in the paper. The models we finetuned ourselves achieved performance close to yours: for HumanEval(+), we got 57.32% / 52.44% pass@1 for Magicoder and 70.12% / 67.07% for MagicoderS.
Moreover, we conducted ablation studies to clarify the contribution of OSS-Instruct relative to Evol-Instruct in the training of MagicoderS.

  • We got performance close to MagicoderS (70.12% / 65.24% on HumanEval(+), and similar results on DS-1000) by finetuning ONLY with the evol-codealpaca-v1 dataset under the same training settings mentioned in the tutorial.
  • We got even worse results (66.46% / 62.20% on HumanEval(+)) by swapping the training order of oss-instruct-75k and evol-codealpaca-v1.

We noticed that oss-instruct-75k was generated by the base model gpt-3.5-turbo-1106, whereas evol-codealpaca-v1 was generated by GPT-4, so the MagicoderS experiments may be unfair. I think additional evidence is needed to show the contribution of OSS-Instruct when it is combined with other data generation methods.

Any plans for a 33B fine-tune?

Awesome models! Great job, guys! :)
I am wondering if you plan to also finetune on top of deepseek-coder-33b-instruct. I wonder how high the evaluations would go with that model :)

Max tokens = ?

After reading through the page, I didn't see any mention of a max token limit for Magicoder-S-DS-6.7B and Magicoder-DS-6.7B. Is it safe to assume it's the same as DeepSeek Coder (16k)?
