
pycodegpt's Introduction

PyCodeGPT

A pre-trained GPT model for Python code completion and generation

What is it?

PyCodeGPT is an efficient and effective GPT-Neo-based model for Python code generation, similar in spirit to OpenAI Codex, GitHub Copilot, CodeParrot, and AlphaCode.

Training Data

Due to the small size of publicly released datasets, we collected data from GitHub from scratch. We first crawled 1.2M Python-related repositories hosted on GitHub, then used these repository URLs to download all contents of each repository. This yielded 60M raw Python files under 1 MB, with a total size of 330GB. Finally, we carefully designed various data-cleaning strategies to obtain about 96GB of training data. Please refer to the following table for details.

| Model      | Repositories | Size and file count after filtering |
|------------|--------------|-------------------------------------|
| CodeParrot | 0.56M        | 12GB (compressed), 5.4M files       |
| Codex      | 54M          | 159GB                               |
| PyCodeGPT  | 1.2M         | 96GB, 13M files                     |
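
To make the first-pass filter concrete, here is a minimal sketch that keeps raw Python files under 1 MB. The root path is a placeholder, and the further cleaning strategies are the ones described in the paper, not reproduced here:

```python
import os

# First-pass filter sketch: keep .py files under 1 MB.
# ROOT is a placeholder for the directory of downloaded repositories.
MAX_BYTES = 1 * 1024 * 1024
ROOT = "downloaded_repos/"

def iter_raw_python_files(root: str = ROOT):
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) < MAX_BYTES:
                yield path
```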

Pretrained models

We aim to train medium-sized pre-trained models (about 110M parameters) based on GPT-Neo:

  • PyCodeGPT-110M: derived from GPT-Neo 125M with a vocabulary size of 32K.

PyCodeGPT-110M is available on HuggingFace.
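
As a quick sanity check before running the full evaluation, the model can be loaded with transformers. A minimal completion sketch, assuming a local PyCodeGPT-110M/ checkpoint directory (substitute the Hugging Face repo id to load remotely):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Path assumption: a local checkpoint directory; a HF repo id also works here.
model_name_or_path = "PyCodeGPT-110M/"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path)

prompt = "def fibonacci(n):\n    "
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```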

Evaluation

  1. Install requirements (Python 3.7)

     $ pip install -r requirements.txt
  2. Install HumanEval
  • Note: you can only successfully evaluate your model after uncommenting line 58 of human-eval/human_eval/execution.py

     $ git clone https://github.com/openai/human-eval
     $ pip install -e human-eval
  3. Run eval_human_eval.py to generate programs

    • Arguments

      • model_name_or_path : Path to the model checkpoint to be evaluated.
      • output_dir : Path to save the generated programs.
      • num_completions : Number of programs to generate.
      • temperature : Temperature for sampling.
      • top_p : p value for nucleus sampling.
      • max_new_tokens : Maximum number of generated tokens.
    • Example usage

      $ python eval_human_eval.py \
      	--model_name_or_path PyCodeGPT-110M/ \
      	--output_dir results/ \
      	--num_completions 100 \
      	--temperature 0.2 \
      	--top_p 0.95 \
      	--max_new_tokens 100 \
      	--gpu_device 0
  4. Evaluate functional correctness

    $ evaluate_functional_correctness <samples_path>
    # Example
    $ evaluate_functional_correctness results/human_eval.t0.2.p0.95.l100.n100.samples.jsonl
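
evaluate_functional_correctness reports pass@k using the unbiased estimator from the Codex paper, where n completions are sampled per task and c of them pass the unit tests. A minimal sketch of that estimator, mirroring the human-eval implementation:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k = 1 - C(n-c, k) / C(n, k), computed stably.

    n: completions sampled per task, c: completions passing the tests.
    """
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```

For k=1 this reduces to c/n, so 8 passing samples out of 100 gives pass@1 = 8%.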

Here's our evaluation result on HumanEval dataset:

Note: our model achieves accuracy comparable to Codex models of similar size.

| Model | Pass@1 | Pass@10 | Pass@100 |
|-------|--------|---------|----------|
| PyCodeGPT-110M | 8.32% | 13.53% | 18.3% |
| GPT-Neo 125M | 0.75% | 1.88% | 2.97% |
| GPT-Neo 1.3B | 4.97% | 7.47% | 16.3% |
| GPT-Neo 2.7B | 6.41% | 11.27% | 21.37% |
| GPT-J 6B | 11.62% | 15.74% | 27.74% |
| TabNine | 2.58% | 4.35% | 7.59% |
| CodeParrot 110M | 3.80% | 6.57% | 12.78% |
| CodeParrot 1.5B | 3.58% | 8.03% | 14.96% |
| Codex 12M | 2.00% | 3.62% | 8.58% |
| Codex 25M | 3.21% | 7.1% | 12.89% |
| Codex 42M | 5.06% | 8.8% | 15.55% |
| Codex 85M | 8.22% | 12.81% | 22.4% |
| Codex 300M | 13.17% | 20.37% | 36.27% |
| Codex 679M | 16.22% | 25.7% | 40.95% |
| Codex 2.5B | 21.36% | 35.42% | 59.5% |
| Codex 12B | 28.81% | 46.81% | 72.31% |
| Pretrained Decoder-only 13M (AlphaCode) | 1.5% | 3.6% | 8.6% |
| Pretrained Decoder-only 29M (AlphaCode) | 3.4% | 5.8% | 11.2% |
| Pretrained Decoder-only 55M (AlphaCode) | 4.2% | 8.2% | 16.9% |
| Pretrained Decoder-only 89M (AlphaCode) | 4.3% | 12.2% | 20.0% |
| Pretrained Decoder-only 302M (AlphaCode) | 11.6% | 18.8% | 31.8% |
| Pretrained Decoder-only 685M (AlphaCode) | 14.2% | 24.4% | 38.8% |
| Pretrained Decoder-only 1.1B (AlphaCode) | 17.1% | 28.2% | 45.3% |
| PolyCoder 160M | 2.13% | 3.35% | 4.88% |
| PolyCoder 400M | 2.96% | 5.29% | 11.59% |
| PolyCoder 2.7B | 5.59% | 9.84% | 17.68% |

Reference

If you use our models, please cite the following paper:

@inproceedings{CERT,
  title={{CERT}: Continual Pre-training on Sketches for Library-oriented Code Generation},
  author={Zan, Daoguang and Chen, Bei and Yang, Dejian and Lin, Zeqi and Kim, Minsu and Guan, Bei and Wang, Yongji and Chen, Weizhu and Lou, Jian-Guang},
  booktitle={The 2022 International Joint Conference on Artificial Intelligence},
  year={2022}
}

pycodegpt's People

Contributors

bitcoinnlper, dependabot[bot], microsoft-github-policy-service[bot], microsoftopensource, substill


pycodegpt's Issues

Training data

Have you released the training data that was used to train APIRetriever?

Run eval on CPU instead of GPU

Set --gpu_device to -1:

python eval_human_eval.py \
	--model_name_or_path PyCodeGPT-110M/ \
	--output_dir results/ \
	--num_completions 100 \
	--temperature 0.2 \
	--top_p 0.95 \
	--max_new_tokens 100 \
	--gpu_device -1

pass@1 = 1.0 for HumanEval, pass@1 = 0.0 for TorchDataEval

I am trying to validate the evaluation of apicoder.

I simply made a perfect evaluation file by setting the "completion" field in the evaluation file to the "canonical_solution" in the problem file.
However, every example in TorchDataEval fails with a "result": "failed: 'NoneType' object is not callable" error, while HumanEval passes all examples.
Any suggestions for solving this issue?
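
For concreteness, a minimal sketch of that construction (file names are placeholders; note that canonical_solution is a single string in HumanEval but a list of acceptable answers in TorchDataEval, as the attached examples below show):

```python
import json

# Build a "perfect" evaluation file by copying each task's canonical
# solution into the completion field. File names are placeholders.
with open("problems.jsonl") as fin, open("samples.jsonl", "w") as fout:
    for line in fin:
        problem = json.loads(line)
        solution = problem["canonical_solution"]
        if isinstance(solution, list):  # TorchDataEval: list of answers
            solution = solution[0]
        fout.write(json.dumps({"task_id": problem["task_id"],
                               "completion": solution}) + "\n")
```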

For reference, I attach two examples each from the problem and evaluation files of the HumanEval and TorchDataEval datasets.

HumanEval

Problem file

{"task_id": "HumanEval/0", "prompt": "from typing import List\n\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n    \"\"\" Check if in given list of numbers, are any two numbers closer to each other than\n    given threshold.\n    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n    False\n    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n    True\n    \"\"\"\n", "entry_point": "has_close_elements", "canonical_solution": "    for idx, elem in enumerate(numbers):\n        for idx2, elem2 in enumerate(numbers):\n            if idx != idx2:\n                distance = abs(elem - elem2)\n                if distance < threshold:\n                    return True\n\n    return False\n", "test": "\n\nMETADATA = {\n    'author': 'jt',\n    'dataset': 'test'\n}\n\n\ndef check(candidate):\n    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True\n    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False\n    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.95) == True\n    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.8) == False\n    assert candidate([1.0, 2.0, 3.0, 4.0, 5.0, 2.0], 0.1) == True\n    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 1.0) == True\n    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 0.5) == False\n\n"}
{"task_id": "HumanEval/1", "prompt": "from typing import List\n\n\ndef separate_paren_groups(paren_string: str) -> List[str]:\n    \"\"\" Input to this function is a string containing multiple groups of nested parentheses. Your goal is to\n    separate those group into separate strings and return the list of those.\n    Separate groups are balanced (each open brace is properly closed) and not nested within each other\n    Ignore any spaces in the input string.\n    >>> separate_paren_groups('( ) (( )) (( )( ))')\n    ['()', '(())', '(()())']\n    \"\"\"\n", "entry_point": "separate_paren_groups", "canonical_solution": "    result = []\n    current_string = []\n    current_depth = 0\n\n    for c in paren_string:\n        if c == '(':\n            current_depth += 1\n            current_string.append(c)\n        elif c == ')':\n            current_depth -= 1\n            current_string.append(c)\n\n            if current_depth == 0:\n                result.append(''.join(current_string))\n                current_string.clear()\n\n    return result\n", "test": "\n\nMETADATA = {\n    'author': 'jt',\n    'dataset': 'test'\n}\n\n\ndef check(candidate):\n    assert candidate('(()()) ((())) () ((())()())') == [\n        '(()())', '((()))', '()', '((())()())'\n    ]\n    assert candidate('() (()) ((())) (((())))') == [\n        '()', '(())', '((()))', '(((())))'\n    ]\n    assert candidate('(()(())((())))') == [\n        '(()(())((())))'\n    ]\n    assert candidate('( ) (( )) (( )( ))') == ['()', '(())', '(()())']\n"}
...

Evaluation file

{"task_id": "HumanEval/0", "completion": "    for idx, elem in enumerate(numbers):\n        for idx2, elem2 in enumerate(numbers):\n            if idx != idx2:\n                distance = abs(elem - elem2)\n                if distance < threshold:\n                    return True\n\n    return False\n"}
{"task_id": "HumanEval/1", "completion": "    result = []\n    current_string = []\n    current_depth = 0\n\n    for c in paren_string:\n        if c == '(':\n            current_depth += 1\n            current_string.append(c)\n        elif c == ')':\n            current_depth -= 1\n            current_string.append(c)\n\n            if current_depth == 0:\n                result.append(''.join(current_string))\n                current_string.clear()\n\n    return result\n"}
...

TorchDataEval

Problem file

{"task_id": "TorchDataEval/0", "prompt": "from torchdata.datapipes.iter import IterableWrapper\ndatapipe = IterableWrapper([1,2,3])\n# How to augument the datapipe by repeating it six times.\nnew_datapipe =", "entry_point": "none", "canonical_solution": [" Cycler(datapipe, 6)", " datapipe.cycle(6)"], "test": "\n\nMETADATA = {\n    'author': 'msra-v-dazan',\n    'dataset': 'test',\n    'type': 'Cycler'\n}\n\n\ndef check():\n    assert list(new_datapipe) == [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]\n\n"}
{"task_id": "TorchDataEval/1", "prompt": "from torchdata.datapipes.iter import IterableWrapper\n\ndp = IterableWrapper(['a', 'b', 'c'])\n# Assign indexs to the datepipe object.\nnew_dp =", "entry_point": "none", "canonical_solution": [" dp.enumerate()", " Enumerator(dp)"], "test": "\n\nMETADATA = {\n    'author': 'msra-v-dazan',\n    'dataset': 'test',\n    'type': 'Enumerator'\n}\n\n\ndef check():\n    assert list(new_dp) == [(0, 'a'), (1, 'b'), (2, 'c')]\n\n"}
...

Evaluation file

{"task_id": "TorchDataEval/0", "completion": " datapipe.cycle(6)"}
{"task_id": "TorchDataEval/1", "completion": " dp.enumerate()"}
...

No module named 'nl2code.dynamic_block_dataset' when running run_generating_codes.sh

Hello.
When I execute run_generating_codes.sh, the following error occurs:

 File "/data/gmkim/PyCodeGPT/apicoder/CodeGenAPI/eval_private.py", line 13, in <module>
    from nl2code.modeling_codegen import CodeGenForCausalLM
  File "/data/gmkim/PyCodeGPT/apicoder/CodeGenAPI/nl2code/__init__.py", line 3, in <module>
    from .code_dataset import CodeBlockDataset, CodeDatasetCallBack
  File "/data/gmkim/PyCodeGPT/apicoder/CodeGenAPI/nl2code/code_dataset.py", line 14, in <module>
    from .dynamic_block_dataset import DynamicBlockDataset
ModuleNotFoundError: No module named 'nl2code.dynamic_block_dataset'

I couldn't find dynamic_block_dataset in this repository. Could you suggest a solution?

Demo notebook

Can you provide a demo notebook showing how to use PyCodeGPT for code completion and code generation?

How can I get the trained APIRetriever model (in apicoder)?

In the apicoder part, I notice that although the API embeddings and scores are public, the weights of the trained APIRetriever are not released.
I would appreciate it if you could kindly release the weights 😍.

Could you clarify how to get `data/Cleaned-Private-Code-Files`?

From the logic diagram between data and code for apicoder, it seems that data/Cleaned-Private-Code-Files is needed before running APIRetriever/scripts/run_extract_apiretriever_corpus.sh.


However, this directory is not provided by default, and I couldn't find a script to generate these files.
Could you clarify how to get data/Cleaned-Private-Code-Files?

Hugging Face / other web app request

(Feature request) It would be great to deploy a web app using this code, as was done for Visual ChatGPT, to make it easily accessible.
