salesforce / codetf

CodeTF: One-stop Transformer Library for State-of-the-art Code LLM

License: Apache License 2.0

Python 100.00%
ai4code ai4se code-generation code-intelligence code-understanding transformers code-learning-datasets code-representation-learning human-eval tree-sitter

codetf's Introduction



Technical Report, Documentation, Examples

CodeTF - A One-stop Transformer Library for State-of-the-art Code LLM

Introduction

CodeTF is a one-stop, transformer-based Python library for code large language models (Code LLMs) and code intelligence. It provides a seamless interface for training and inference on code intelligence tasks such as code summarization, translation, and generation, and aims to make it easy to integrate SOTA Code LLMs into real-world applications.

In addition to the core LLM features for code, CodeTF offers utilities for code manipulation across various languages, including easy extraction of code attributes. Using tree-sitter as its core AST parser, it enables parsing of attributes such as function names, comments, and variable names. Pre-built libraries for numerous languages are provided, eliminating the need for complicated parser setup. CodeTF thus ensures a user-friendly and accessible environment for code intelligence tasks.

The current version of the library offers:

  • Fast Model Serving: We support an easy-to-use interface for rapid inferencing with pre-quantized models (int8, int16, float16). CodeTF handles all aspects of device management, so users do not have to worry about that aspect. If your model is large, we offer advanced features such as weight sharding across GPUs to serve the models more quickly.
  • Fine-Tuning Your Own Models: We provide an API for quickly fine-tuning your own LLMs for code using SOTA techniques for parameter-efficient fine-tuning (HuggingFace PEFT) on distributed environments.
  • Supported Tasks: nl2code, code summarization, code completion, code translation, code refinement, clone detection, defect prediction.
  • Datasets+: We have preprocessed well-known benchmarks (Human-Eval, MBPP, CodeXGLUE, APPS, etc.) and offer an easy-to-load feature for these datasets.
  • Model Evaluator: We provide an interface to evaluate models on well-known benchmarks (e.g., Human-Eval) with popular metrics (e.g., pass@k) with little effort (~15 LOC).
  • Pretrained Models: We supply pretrained checkpoints of state-of-the-art foundational language models of code (CodeBERT, CodeT5, CodeGen, CodeT5+, Incoder, StarCoder, etc.).
  • Fine-Tuned Models: We furnish fine-tuned checkpoints for 8+ downstream tasks.
  • Utility to Manipulate Source Code: We provide utilities to easily manipulate source code, such as user-friendly AST parsers (based on tree-sitter) in 15+ programming languages, to extract important code features, such as function name, identifiers, etc.

The following table shows the supported models with sizes and the tasks that the models support. This is a continuing effort and we are working on further growing the list.

Model | Size | Tasks
CodeT5 | Base, Base-multi-sum, Base-translate-cs, Base-translate-java, Base-sum, Base-clone, Base-defect | Pretrained, NL to Code, Refine, Translation (CS to Java, Java to CS), Summarization (Python, Go, PHP, JavaScript, Java, Ruby), Clone detection, Defect prediction
CodeT5+ | Plus-instruct-16B, Plus-16B, Plus-6B, Plus-2B, Plus-770M-python, Plus-770M, Plus-220M | Pretrained, NL to Code, Refine, Defect prediction
CodeGen | Mono: 350M, 2B, 6B, 1B, 3.7B, 7B, 16B; Multi: 350M, 2B, 6B; NL: 350M, 2B | Pretrained
StarCoder | 15.5B | Pretrained
SantaCoder | 1.1B | Pretrained
GPT-NeoX | 20B | Pretrained
GPT-Neo | 1.3B | Pretrained
GPT-J | 6B | Pretrained
Incoder | 6B | Pretrained
CodeParrot | Small-python (110M), Small-multi (110M), 1.5B | Pretrained
CodeBERT | CodeBERT-base, UnixCoder-base, CodeBERTa-small | Pretrained

Installation Guide

  1. (Optional) Create a conda environment:
conda create -n codetf python=3.8
conda activate codetf
  2. Install from PyPI:
pip install salesforce-codetf
  3. Alternatively, build CodeTF from source:
git clone https://github.com/salesforce/CodeTF.git
cd CodeTF
pip install -e .

Additionally, to make sure the quantization feature works well, also install these dependencies:

pip install -q -U git+https://github.com/huggingface/transformers.git
pip install -q -U git+https://github.com/huggingface/peft.git
pip install -q -U git+https://github.com/huggingface/accelerate.git

Some models, such as StarCoder, require you to log in to HuggingFace. Obtain a HuggingFace token and log in:

huggingface-cli login

Getting Started

Inferencing Pipeline

Getting started with CodeTF is simple and quick with our model loading pipeline function load_model_pipeline(). Here's an example showing how to load a CodeT5+ model and perform inference on a code generation task:

from codetf.models import load_model_pipeline

code_generation_model = load_model_pipeline(model_name="codet5", task="pretrained",
            model_type="plus-770M-python", is_eval=True,
            load_in_8bit=True, load_in_4bit=False, weight_sharding=False)
            
result = code_generation_model.predict(["def print_hello_world():"])
print(result)

There are a few notable arguments that need to be considered:

  • model_name: the name of the model; currently supports codet5 and causal-lm.
  • model_type: the type of model for each model name, e.g. base, codegen-350M-mono, j-6B, etc.
  • load_in_8bit and load_in_4bit: inherit the dynamic quantization feature from HuggingFace Quantization.
  • weight_sharding: our advanced feature that leverages HuggingFace Sharded Checkpoint to split a large model into several smaller shards across different GPUs. Consider using this if you are dealing with large models.
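
To make the idea behind sharding concrete, here is a minimal sketch of splitting a set of parameters into shards that each stay under a size budget. It is an illustration only, not the CodeTF or HuggingFace implementation; the function name shard_by_size and the parameter names and sizes are hypothetical.

```python
def shard_by_size(param_sizes, max_shard_size):
    """Greedily group parameters into shards of at most max_shard_size.

    param_sizes maps a parameter name to its size (any unit).
    A single parameter larger than the budget gets a shard of its own.
    """
    shards, current, current_size = [], {}, 0
    for name, size in param_sizes.items():
        # Close the current shard if adding this parameter would exceed the budget.
        if current and current_size + size > max_shard_size:
            shards.append(current)
            current, current_size = {}, 0
        current[name] = size
        current_size += size
    if current:
        shards.append(current)
    return shards


# Hypothetical parameter names and sizes, for illustration only.
sizes = {"embed": 4, "layer0": 4, "layer1": 4, "lm_head": 10}
shards = shard_by_size(sizes, max_shard_size=8)
```

Each resulting shard could then be placed on a different device, which is the effect weight_sharding aims for with large models.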

Model Zoo

You might want to view all of the supported models. To do this, print model_zoo:

from codetf.models import model_zoo
print(model_zoo)
# ============================================================================================================
# Architectures                  Types                           Tasks
# ============================================================================================================
# causallm                       codegen-350M-mono              pretrained
#                                codegen-350M-multi             pretrained
#                                codegen-350M-nl                pretrained
#                                codegen-2B-mono                pretrained
#                                codegen-2B-multi               pretrained
#                                codegen-2B-nl                  pretrained
#                                codegen-6B-mono                pretrained
#                                codegen-6B-nl                  pretrained
#                                codegen-6B-multi               pretrained
#                                starcoder-15.5B                pretrained
#                                gpt-neox-20B                   pretrained
#                                gpt-neo-1.3B                   pretrained
#                                gpt-j-6B                       pretrained
#                                incoder-6B                     pretrained
#                                codegen2-1B                    pretrained
#                                codegen2-3.7B                  pretrained
#                                codegen2-7B                    pretrained
#                                codegen2-16B                   pretrained
# codet5                         base-multi-sum                 pretrained
#                                base                           nl2code
#                                base                           refine
#                                base                           translate_cs_java
#                                base                           translate_java_cs
#                                base                           sum_python
#                                base                           sum_go
#                                base                           sum_php
#                                base                           sum_javascript
#                                base                           sum_java
#                                base                           sum_ruby
#                                base                           clone
#                                base                           defect
#                                plus-instruct-16B              pretrained
#                                plus-16B                       pretrained
#                                plus-6B                        pretrained
#                                plus-2B                        pretrained
#                                plus-770M-python               pretrained
#                                plus-770M                      pretrained
#                                plus-220M                      pretrained
# bert                           codebert-base                  pretrained
#                                unixcoder-base                 pretrained
#                                codeberta-small                pretrained

Fine-Tuning Pipeline

Want to train a custom LLM for code? We've got you covered. The example below uses the CodeT5Seq2SeqTrainer, together with our dataset utilities, to fine-tune a pretrained CodeT5+ model on the CodeXGLUE dataset:

from codetf.trainer.codet5_trainer import CodeT5Seq2SeqTrainer
from codetf.data_utility.codexglue_dataset import CodeXGLUEDataset
from codetf.models import load_model_pipeline
from codetf.performance.evaluation_metric import EvaluationMetric
from codetf.data_utility.base_dataset import CustomDataset

model_class = load_model_pipeline(model_name="codet5", task="pretrained",
            model_type="plus-220M", is_eval=True)

dataset = CodeXGLUEDataset(tokenizer=model_class.get_tokenizer())
train, test, validation = dataset.load(subset="text-to-code")

train_dataset = CustomDataset(train[0], train[1])
test_dataset = CustomDataset(test[0], test[1])
val_dataset = CustomDataset(validation[0], validation[1])

evaluator = EvaluationMetric(metric="bleu", tokenizer=model_class.tokenizer)

# peft can be in ["lora", "prefixtuning"]
trainer = CodeT5Seq2SeqTrainer(train_dataset=train_dataset, 
                                validation_dataset=val_dataset, 
                                peft="lora",
                                pretrained_model_or_path=model_class.get_model(),
                                tokenizer=model_class.tokenizer)
trainer.train()

Compared to this script from StarCoder, which requires ~300 LOC to fine-tune a model, we need only 14 LOC to do the same!

Evaluate on Well-Known Benchmarks

Planning to reproduce the results of well-known benchmarks like Human-Eval, but struggling with not achieving the same numbers as reported in the original papers? Worried about the complicated evaluation process? Don't worry, we've got you covered with an intuitive, easy-to-use interface. Here's a sample snippet demonstrating how to evaluate Human Eval using pass@k (k=[1,10,100]) as the metric:

import os

from torch.utils.data import TensorDataset

from codetf.models import load_model_pipeline
from codetf.data_utility.human_eval_dataset import HumanEvalDataset
from codetf.performance.model_evaluator import ModelEvaluator

os.environ["HF_ALLOW_CODE_EVAL"] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "true"

model_class = load_model_pipeline(model_name="causal-lm", task="pretrained",
            model_type="codegen-350M-mono", is_eval=True,
            load_in_8bit=True, weight_sharding=False)

dataset = HumanEvalDataset(tokenizer=model_class.get_tokenizer())
prompt_token_ids, prompt_attention_masks, references= dataset.load()

problems = TensorDataset(prompt_token_ids, prompt_attention_masks)

evaluator = ModelEvaluator(model_class)
avg_pass_at_k = evaluator.evaluate_pass_k(problems=problems, unit_tests=references)
print("Pass@k: ", avg_pass_at_k)

Compared to this script from HuggingFace, which requires ~230 LOC to evaluate on pass@k, we need only 14 LOC to do the same!
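
For readers unfamiliar with the metric, pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval: for a problem with n generated samples of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. The sketch below implements that formula directly. It is not the code inside ModelEvaluator, and the function names are our own.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is 1.0 whenever any k-subset must contain a correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)
```

For example, with two problems that each have 10 samples, one with 3 correct samples and one with none, mean_pass_at_k([(10, 3), (10, 0)], k=1) averages 0.3 and 0.0.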

Loading Preprocessed Data

CodeTF provides the Dataset utility for several well-known datasets, such as CodeXGLUE, Human Eval, MBPP, and APPS. The following is an example of how to load the CodeXGLUE dataset:

from codetf.data_utility.codexglue_dataset import CodeXGLUEDataset
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base", use_fast=True)
dataset = CodeXGLUEDataset(tokenizer=tokenizer)
train, test, validation = dataset.load(subset="text-to-code")

The train, test, and validation splits are returned as PyTorch tensors, giving users the flexibility to wrap them in a higher-level wrapper for their own use cases.
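
As a rough sketch of such a wrapper, the class below pairs two indexable sequences by position and yields simple batches. It assumes each split is an (inputs, labels) pair, as the train[0] / train[1] usage elsewhere in this README suggests. PairDataset and batches are hypothetical names, not part of CodeTF, and plain lists stand in for tensors so the snippet runs without PyTorch.

```python
class PairDataset:
    """Hypothetical wrapper pairing inputs and labels by index."""

    def __init__(self, inputs, labels):
        if len(inputs) != len(labels):
            raise ValueError("inputs and labels must have the same length")
        self.inputs = inputs
        self.labels = labels

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return {"input_ids": self.inputs[idx], "labels": self.labels[idx]}


def batches(dataset, batch_size):
    """Yield consecutive lists of examples of at most batch_size items."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]


# Lists of token ids stand in for the tensors returned by dataset.load().
demo = PairDataset([[1, 2], [3, 4], [5, 6]], [[7], [8], [9]])
demo_batches = list(batches(demo, batch_size=2))
```

Because the class only relies on len() and indexing, the same wrapper should also accept tensors in place of the lists.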

Code Utilities

In addition to providing utilities for LLMs, CodeTF also equips users with tools for effective source code manipulation. This is crucial in the code intelligence pipeline, where operations like parsing code into an Abstract Syntax Tree (AST) or extracting code attributes (such as function names or identifiers) are often required (CodeT5). These tasks can be challenging to execute, especially when setup and multi-language support are needed. Our code utility interface offers a streamlined solution, facilitating easy parsing and attribute extraction from code across 15+ languages.

AST Parser in Multiple Languages

CodeTF includes AST parsers compatible with numerous programming languages. Here's an example showcasing the parsing of Apex code into an AST:

from codetf.code_utility.apex.apex_code_utility import ApexCodeUtility

apex_code_utility = ApexCodeUtility()

sample_code = """
    public class SampleClass {    
        public Integer myNumber;
        
        /**
        * This is a method that returns the value of myNumber.
        * @return An integer value
        */
        public Integer getMyNumber() {
            // Return the current value of myNumber
            return this.myNumber;
        }
    }
"""
ast = apex_code_utility.parse(sample_code)

# This will print the tree-sitter AST object
print(ast)

Then you can traverse the tree using the interface from py-tree-sitter:

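root_node = ast.root_node
# Example checks; the actual root node type and positions depend on the grammar and the parsed code
assert root_node.type == 'module'
assert root_node.start_point == (1, 0)
assert root_node.end_point == (3, 13)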
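
Beyond inspecting the root, a common pattern is to walk the whole tree. The sketch below does a depth-first, pre-order traversal using only the type and children attributes that py-tree-sitter nodes expose. StubNode is a hypothetical stand-in so the snippet runs without a parser installed, and the node type names are illustrative rather than taken from the Apex grammar.

```python
from collections import Counter


class StubNode:
    """Stand-in for a tree-sitter node; only `type` and `children` are used."""

    def __init__(self, type, children=()):
        self.type = type
        self.children = list(children)


def walk(node):
    """Yield node and all of its descendants depth-first, in pre-order."""
    stack = [node]
    while stack:
        current = stack.pop()
        yield current
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(current.children))


def count_node_types(root):
    """Count how often each node type occurs in the tree."""
    return Counter(node.type for node in walk(root))


tree = StubNode("class_declaration", [
    StubNode("identifier"),
    StubNode("method_declaration", [StubNode("identifier"), StubNode("block")]),
])
order = [node.type for node in walk(tree)]
type_counts = count_node_types(tree)
```

With a real parse result, passing ast.root_node to walk should work the same way, since the traversal does not depend on anything grammar-specific.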

There are also other utilities for Java, Python, etc, that can perform the same operations.

Extract Code Attributes

CodeTF provides an interface to easily extract code attributes. The following sample extracts the attributes of an Apex code snippet:

code_attributes = apex_code_utility.get_code_attributes(sample_code)
print(code_attributes)

This will print a dictionary of the extracted attributes, for example: {'class_names': ['AccountWithContacts'], 'method_names': ['getAccountsWithContacts'], 'comments': [], 'variable_names': ['acc', 'accounts', 'con', 'System', 'debug', 'Contacts', 'Id', 'Name', 'Account', 'Email', 'LastName']}

Remove Comments

There are other existing utilities, such as removing comments from code:

new_code_snippet = apex_code_utility.remove_comments(sample_code)
print(new_code_snippet)

This will print:

public class SampleClass {    
        public Integer myNumber;
        public Integer getMyNumber() {
            return this.myNumber;
        }
    }

Note that this is an ongoing process; we will add more features to extract complicated code attributes in the future. More examples can be found here.

More Examples

You can find more examples for each use case:

Notes

  • CodeTF is designed to complement and enhance the capabilities of HuggingFace Transformers, rather than replace it. It serves as a specialized layer specifically tailored for code intelligence tasks, such as fine-tuning language models with code-specific features and evaluating on well-known code intelligence benchmarks. If users require more customization, they are encouraged to write their own training code from scratch.
  • CodeTF leverages the powerful functionality provided by Accelerate for both inference and training. With Accelerate, users do not need to manually manage GPUs or CPU devices for most operations, allowing for a streamlined and efficient workflow.

Ethical and Responsible Use

CodeTF, while powerful, does not guarantee infallible code intelligence capabilities. Users may encounter inaccuracies or biases, possibly leading to misinterpretations or undesired behaviors. Risks include the generation of insecure code, propagation of poor coding practices, or inadvertent revelation of sensitive data. We strongly advise users to examine the pretrained models and system before practical adoption. CodeTF facilitates effective code analysis, prediction, and debugging, promoting reproducible research and development. We encourage its responsible use for enhancing software quality and developer productivity.

However, misuse can lead to unethical outcomes such as unauthorized code manipulation, privacy breaches, or insecure coding practices. Users should familiarize themselves with guidelines for responsible AI before using CodeTF. Our commitment is to continually refine the library by identifying and mitigating potential biases and inappropriate behaviors. Users should review the models and system before practical implementation, and contribute towards refining the library to ensure ethical usage.

Technical Report and Citing CodeTF

You can find more details in our technical report.

If you're using CodeTF in your research or applications, please cite using this BibTeX:

@misc{nghi2023codetf,
      title={CodeTF: A Transformer-based Library for CodeLLM \& Code Intelligence},
      author={Nghi D. Q. Bui and Henry Le and Yue Wang and Akhilesh Deepak Gotmare and Junnan Li and Steven Hoi},
      year={2023},
      eprint={2209.09019},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Contact us

If you have any questions, comments or suggestions, please do not hesitate to contact us at [email protected].

License

Apache License Version 2.0

codetf's People

Contributors

bdqnghi, dependabot[bot], paul-b98, v-i-s-h

codetf's Issues

[FileNotFoundError] Missing config.yaml

Code:

#filename: file.py
from codetf.models import load_model_pipeline

code_generation_model = load_model_pipeline(model_name="codet5", task="pretrained",
            model_type="plus-220M", is_eval=True,
            load_in_8bit=True, weight_sharding=False)
            
result = code_generation_model.predict(["Generate a complete C program to add two numbers and display the result."])
print(result)

Python3 file.py
FileNotFoundError: [Errno 2] No such file or directory: '/home/user/.local/lib/python3.10/site-packages/codetf/configs/default.yaml

I am using Linux kernel 5.15.0-46-generic on Ubuntu 22.04.
Installation mode: pip install salesforce-codetf

GGML

Just wondering how GGML fits into this project, if at all.

MPS support

It would be nice to support the Metal Performance Shaders (MPS) backend for PyTorch on macOS.

The main changes for inference seem to be:

  • load_in_8bit must be False
  • device_map set to {"":"mps"}

see:

https://github.com/Birch-san/falcon-play/blob/cbf9b2aebe7eef3eea305a511d6cdda17282ca8a/scripts/chat_play.py#L154-L157

https://github.com/Birch-san/falcon-play/blob/cbf9b2aebe7eef3eea305a511d6cdda17282ca8a/scripts/chat_play.py#L175

cc @Birch-san are there any other good docs or resources for adding mps support?
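
The two changes listed above can be summarized as a small helper that picks loading arguments per backend. This is only a sketch: the function name inference_load_kwargs is hypothetical, the "mps" branch mirrors the two bullet points from this issue, and the "cuda" and "cpu" branches are assumptions added for contrast. In practice the backend string could be chosen with torch.backends.mps.is_available() and torch.cuda.is_available().

```python
def inference_load_kwargs(backend: str) -> dict:
    """Return loading keyword arguments for a backend name."""
    if backend == "mps":
        # From the issue: 8-bit loading must be off, and the whole model goes to "mps".
        return {"load_in_8bit": False, "device_map": {"": "mps"}}
    if backend == "cuda":
        # Assumption: 8-bit loading with automatic device placement on GPU.
        return {"load_in_8bit": True, "device_map": "auto"}
    if backend == "cpu":
        # Assumption: full-precision weights on CPU.
        return {"load_in_8bit": False, "device_map": {"": "cpu"}}
    raise ValueError(f"Unknown backend {backend!r}; expected 'mps', 'cuda', or 'cpu'")


mps_kwargs = inference_load_kwargs("mps")
```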

ValueError: Config name is missing.

Hi! I am trying to use the CodeXGLUE refinement dataset, but I encounter the error below:

Traceback (most recent call last):
  File "/Users/zhengyi/Desktop/code/python/llm-projects/CodeT5-fine-tune/lora.py", line 37, in <module>
    main()
  File "/Users/zhengyi/Desktop/code/python/llm-projects/CodeT5-fine-tune/lora.py", line 17, in main
    train, test, validation = dataset.load(subset="code-refinement", config_name="small")
  File "/Users/zhengyi/Desktop/code/python/framework-projects/CodeTF/codetf/data_utility/codexglue_dataset.py", line 19, in load
    return self.load_funcs[subset](*args, **kwargs)
  File "/Users/zhengyi/Desktop/code/python/framework-projects/CodeTF/codetf/data_utility/codexglue_dataset.py", line 80, in load_codexglue_code_refinement_dataset
    dataset = load_dataset(dataset)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/CodeTF/lib/python3.9/site-packages/datasets/load.py", line 1785, in load_dataset
    builder_instance = load_dataset_builder(
  File "/opt/homebrew/Caskroom/miniconda/base/envs/CodeTF/lib/python3.9/site-packages/datasets/load.py", line 1540, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File "/opt/homebrew/Caskroom/miniconda/base/envs/CodeTF/lib/python3.9/site-packages/datasets/builder.py", line 355, in __init__
    self.config, self.config_id = self._create_builder_config(
  File "/opt/homebrew/Caskroom/miniconda/base/envs/CodeTF/lib/python3.9/site-packages/datasets/builder.py", line 487, in _create_builder_config
    raise ValueError(
ValueError: Config name is missing.
Please pick one among the available configs: ['medium', 'small']
Example of usage:
	`load_dataset('code_x_glue_cc_code_refinement', 'medium')`

It seems that the function load_codexglue_code_refinement_dataset in codexglue_dataset.py doesn't pass the extra value on to load_dataset. Hope this can be fixed like #32. Thanks a lot!
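
A minimal sketch of the kind of fix being asked for: the dispatcher forwards extra arguments to the subset loader, and the loader passes the config name as the second argument of load_dataset, matching the usage hint in the error message. ForwardingLoader and fake_load_dataset are hypothetical stand-ins so the snippet runs without the datasets package; this is not the actual CodeTF patch.

```python
class ForwardingLoader:
    """Hypothetical loader that forwards a config name to load_dataset."""

    def __init__(self, load_dataset_fn):
        # The loading function is injected so a stub can replace datasets.load_dataset.
        self._load_dataset = load_dataset_fn
        self.load_funcs = {"code-refinement": self.load_code_refinement}

    def load(self, subset, *args, **kwargs):
        if subset not in self.load_funcs:
            raise ValueError(
                f"Invalid subset {subset}. Available subsets are: {list(self.load_funcs)}"
            )
        return self.load_funcs[subset](*args, **kwargs)

    def load_code_refinement(self, config_name="small"):
        # Pass the config name through instead of calling load_dataset(path) alone.
        return self._load_dataset("code_x_glue_cc_code_refinement", config_name)


calls = []


def fake_load_dataset(path, name=None):
    """Stub that records its arguments instead of downloading anything."""
    calls.append((path, name))
    return {"train": [], "validation": [], "test": []}


loader = ForwardingLoader(fake_load_dataset)
splits = loader.load("code-refinement", config_name="medium")
```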

Extract Code Embeddings

I'm interested in performing embeddings of source code files for measuring the semantic similarity of content, but I'm not sure which model and task are better suited for my case as there is no 'representation-learning' or 'feature-extraction' task mentioned.

Furthermore, in the positive case, would it be possible to use the model with the HuggingFace "feature-extraction" pipeline?

Cannot load "starcoder-15.5B" with weight_sharding=True

To reproduce:

model = load_model_pipeline(model_name="causallm", task="pretrained", model_type="starcoder-15.5B", is_eval=True, weight_sharding=True)

The error I get:
Entry Not Found for url: https://huggingface.co/bigcode/starcoder/resolve/main/pytorch_model.bin.

I believe the line that causes the problem is

weights_location = hf_hub_download(checkpoint, "pytorch_model.bin")
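
One possible direction, sketched under assumptions: instead of hard-coding "pytorch_model.bin", look at which weight files the repository actually contains and fall back to the index file of a sharded checkpoint. The candidate file names follow the usual Transformers naming conventions; pick_weights_file is a hypothetical helper, and the file listing below is made up for illustration (a real listing could come from huggingface_hub.list_repo_files).

```python
def pick_weights_file(repo_files):
    """Return the first known weights (or shard index) file present in repo_files."""
    candidates = [
        "pytorch_model.bin",             # single-file PyTorch checkpoint
        "pytorch_model.bin.index.json",  # index of a sharded PyTorch checkpoint
        "model.safetensors",             # single-file safetensors checkpoint
        "model.safetensors.index.json",  # index of a sharded safetensors checkpoint
    ]
    available = set(repo_files)
    for name in candidates:
        if name in available:
            return name
    raise FileNotFoundError("no known weights file found in repository listing")


# Hypothetical listing of a sharded repository, for illustration only.
listing = [
    "config.json",
    "pytorch_model-00001-of-00002.bin",
    "pytorch_model-00002-of-00002.bin",
    "pytorch_model.bin.index.json",
]
chosen = pick_weights_file(listing)
```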

ValueError: Config name is missing.

I tried to run the demo example for fine-tuning the CodeT5+ model in the README, but changed the CodeXGLUE subset from text-to-code to code-to-text. It would be helpful to have the option to set this variable (the dataset config name).

def load_codexglue_code_to_text_dataset(self):
    dataset = self.dataset_config["codexglue_code_to_text"]
    dataset = load_dataset(dataset)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[3], line 11
      7 model_class = load_model_pipeline(model_name="codet5", task="pretrained",
      8             model_type="plus-220M", is_eval=True)
     10 dataset = CodeXGLUEDataset(tokenizer=model_class.get_tokenizer())
---> 11 train, test, validation = dataset.load(subset="code-to-text")
     13 train_dataset= CustomDataset(train[0], train[1])
     14 test_dataset= CustomDataset(test[0], test[1])

File ~/projects/edu/master/CodeTF/codetf/data_utility/codexglue_dataset.py:19, in CodeXGLUEDataset.load(self, subset)
     17 def load(self, subset):
     18     if subset in self.load_funcs:
---> 19         return self.load_funcs[subset]()
     20     else:
     21         raise ValueError(f'Invalid subset {subset}. Available subsets are: {list(self.load_funcs.keys())}')

File ~/projects/edu/master/CodeTF/codetf/data_utility/codexglue_dataset.py:43, in CodeXGLUEDataset.load_codexglue_code_to_text_dataset(self)
     41 def load_codexglue_code_to_text_dataset(self):
     42     dataset = self.dataset_config["codexglue_code_to_text"]
---> 43     dataset = load_dataset(dataset)
     45     train = dataset["train"]
     46     train_code_tensors, _ = self.process_data(train["code"])

File ~/.conda/envs/codetf/lib/python3.8/site-packages/datasets/load.py:1773, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, num_proc, storage_options, **config_kwargs)
   1768 verification_mode = VerificationMode(
   1769     (verification_mode or VerificationMode.BASIC_CHECKS) if not save_infos else VerificationMode.ALL_CHECKS
   1770 )
   1772 # Create a dataset builder
-> 1773 builder_instance = load_dataset_builder(
   1774     path=path,
   1775     name=name,
   1776     data_dir=data_dir,
   1777     data_files=data_files,
   1778     cache_dir=cache_dir,
   1779     features=features,
   1780     download_config=download_config,
   1781     download_mode=download_mode,
   1782     revision=revision,
   1783     use_auth_token=use_auth_token,
   1784     storage_options=storage_options,
   1785     **config_kwargs,
   1786 )
   1788 # Return iterable dataset in case of streaming
   1789 if streaming:

File ~/.conda/envs/codetf/lib/python3.8/site-packages/datasets/load.py:1528, in load_dataset_builder(path, name, data_dir, data_files, cache_dir, features, download_config, download_mode, revision, use_auth_token, storage_options, **config_kwargs)
   1525     raise ValueError(error_msg)
   1527 # Instantiate the dataset builder
-> 1528 builder_instance: DatasetBuilder = builder_cls(
   1529     cache_dir=cache_dir,
   1530     config_name=config_name,
   1531     data_dir=data_dir,
   1532     data_files=data_files,
   1533     hash=hash,
   1534     features=features,
   1535     use_auth_token=use_auth_token,
   1536     storage_options=storage_options,
   1537     **builder_kwargs,
   1538     **config_kwargs,
   1539 )
   1541 return builder_instance

File ~/.conda/envs/codetf/lib/python3.8/site-packages/datasets/builder.py:340, in DatasetBuilder.__init__(self, cache_dir, config_name, hash, base_path, info, features, use_auth_token, repo_id, data_files, data_dir, storage_options, writer_batch_size, name, **config_kwargs)
    338 if data_dir is not None:
    339     config_kwargs["data_dir"] = data_dir
--> 340 self.config, self.config_id = self._create_builder_config(
    341     config_name=config_name,
    342     custom_features=features,
    343     **config_kwargs,
    344 )
    346 # prepare info: DatasetInfo are a standardized dataclass across all datasets
    347 # Prefill datasetinfo
    348 if info is None:

File ~/.conda/envs/codetf/lib/python3.8/site-packages/datasets/builder.py:469, in DatasetBuilder._create_builder_config(self, config_name, custom_features, **config_kwargs)
    467 if len(self.BUILDER_CONFIGS) > 1:
    468     example_of_usage = f"load_dataset('{self.name}', '{self.BUILDER_CONFIGS[0].name}')"
--> 469     raise ValueError(
    470         "Config name is missing."
    471         f"\nPlease pick one among the available configs: {list(self.builder_configs.keys())}"
    472         + f"\nExample of usage:\n\t`{example_of_usage}`"
    473     )
    474 builder_config = self.BUILDER_CONFIGS[0]
    475 logger.info(f"No config specified, defaulting to the single config: {self.name}/{builder_config.name}")

ValueError: Config name is missing.
Please pick one among the available configs: ['go', 'java', 'javascript', 'php', 'python', 'ruby']
Example of usage:
	`load_dataset('code_x_glue_ct_code_to_text', 'go')`

UnboundLocalError: local variable 'peft_config' referenced before assignment

I tried to run the demo example for fine-tuning the CodeT5+ model in the README with peft changed to prefixtuning:

from codetf.trainer.codet5_trainer import CodeT5Seq2SeqTrainer
from codetf.data_utility.codexglue_dataset import CodeXGLUEDataset
from codetf.models import load_model_pipeline
from codetf.performance.evaluation_metric import EvaluationMetric
from codetf.data_utility.base_dataset import CustomDataset

model_class = load_model_pipeline(model_name="codet5", task="pretrained",
            model_type="plus-220M", is_eval=True)

dataset = CodeXGLUEDataset(tokenizer=model_class.get_tokenizer())
train, test, validation = dataset.load(subset="text-to-code")

train_dataset= CustomDataset(train[0], train[1])
test_dataset= CustomDataset(test[0], test[1])
val_dataset= CustomDataset(validation[0], validation[1])

evaluator = EvaluationMetric(metric="bleu", tokenizer=model_class.tokenizer)

# peft can be in ["lora", "prefixtuning"]
trainer = CodeT5Seq2SeqTrainer(train_dataset=train_dataset, 
                                validation_dataset=val_dataset, 
                                peft="prefixtuning",
                                pretrained_model_or_path=model_class.get_model(),
                                tokenizer=model_class.tokenizer)
trainer.train()

However, I got the following error:

---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
Cell In[1], line 20
     17 evaluator = EvaluationMetric(metric="bleu", tokenizer=model_class.tokenizer)
     19 # peft can be in ["lora", "prefixtuning"]
---> 20 trainer = CodeT5Seq2SeqTrainer(train_dataset=train_dataset, 
     21                                 validation_dataset=val_dataset, 
     22                                 peft="prefixtuning",
     23                                 pretrained_model_or_path=model_class.get_model(),
     24                                 tokenizer=model_class.tokenizer)
     25 trainer.train()

File ~/.conda/envs/codetf/lib/python3.8/site-packages/codetf/trainer/codet5_trainer.py:45, in CodeT5Seq2SeqTrainer.__init__(self, train_dataset, validation_dataset, tokenizer, checkpoints_path, pretrained_model_or_path, training_args, evaluator, evaluation_fn, peft)
     43     peft_config = self.get_default_lora_config_for_codet5()
     44 self.model.enable_input_require_grads()
---> 45 self.model = get_peft_model(self.model, peft_config)
     46 self.model.print_trainable_parameters()

UnboundLocalError: local variable 'peft_config' referenced before assignment

The logging output and dependencies are the same as in #29.
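
The traceback suggests that `peft_config` is only assigned on the LoRA branch, so `peft="prefixtuning"` reaches `get_peft_model` with the variable still unset. The sketch below reproduces that pattern in isolation and shows one way to guard against it. The function names and the returned dictionaries are hypothetical stand-ins, not CodeTF code.

```python
def choose_peft_config_buggy(peft):
    # Mirrors the failing pattern: only one branch assigns the variable.
    if peft == "lora":
        peft_config = {"method": "lora"}
    return peft_config  # raises UnboundLocalError for any other value


def choose_peft_config(peft):
    # Hypothetical fix: cover every supported value and fail loudly otherwise.
    configs = {
        "lora": {"method": "lora"},
        "prefixtuning": {"method": "prefixtuning"},
    }
    if peft not in configs:
        raise ValueError(f"unsupported peft option: {peft!r}")
    return configs[peft]
```

Raising a `ValueError` for unknown options would at least turn the confusing `UnboundLocalError` into a message that names the bad argument.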

ValueError: The batch received was empty

I run the demo in README:

# -*- encoding: utf-8 -*-

from codetf.trainer.causal_lm_trainer import CausalLMTrainer
from codetf.data_utility.codexglue_dataset import CodeXGLUEDataset
from codetf.models import load_model_pipeline
from codetf.performance.evaluation_metric import EvaluationMetric


model_class = load_model_pipeline(
    model_name="causal-lm",
    # model_name="codet5",
    task="pretrained",
    # model_type="starcoder-15.5B",
    model_type="codegen-350M-mono",
    # model_type="base-multi-sum",
    is_eval=False,
    load_in_8bit=False,
    weight_sharding=False,
)


dataloader = CodeXGLUEDataset(tokenizer=model_class.get_tokenizer())
train_dataset, test_dataset, val_dataset = dataloader.load(subset="text-to-code")

# peft can be in ["lora", "prefixtuning"]
trainer = CausalLMTrainer(
    train_dataset=train_dataset,
    validation_dataset=val_dataset,
    peft=None,
    pretrained_model_or_path=model_class.get_model(),
    tokenizer=model_class.get_tokenizer(),
)
trainer.train()


evaluator = EvaluationMetric(metric="bleu", tokenizer=model_class.tokenizer)
# trainer.evaluate(test_dataset=test_dataset)

However, I get an error:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /data/test//test_finetune.py:36 in <module>                        │
│                                                                                                  │
│   33 │   pretrained_model_or_path=model_class.get_model(),                                       │
│   34 │   tokenizer=model_class.get_tokenizer(),                                                  │
│   35 )                                                                                           │
│ ❱ 36 trainer.train()                                                                             │
│   37                                                                                             │
│   38                                                                                             │
│   39 evaluator = EvaluationMetric(metric="bleu", tokenizer=model_class.tokenizer)                │
│                                                                                                  │
│ /root/.pyenv/versions/3.10.0/lib/python3.10/site-packages/codetf/trainer/base_trainer.py:54 in train                          │
│                                                                                                  │
│    51 │   │   )                                                                                  │
│    52 │                                                                                          │
│    53 │   def train(self):                                                                       │
│ ❱  54 │   │   self.trainer.train()                                                               │
│    55 │                                                                                          │
│    56 │   def evaluate(self, dataset=None):                                                      │
│    57 │   │   self.trainer.evaluate(dataset)                                                     │
│                                                                                                  │
│ /root/.pyenv/versions/3.10.0/lib/python3.10/site-packages/transformers/trainer.py:1664 in train  │
│                                                                                                  │
│   1661 │   │   inner_training_loop = find_executable_batch_size(                                 │
│   1662 │   │   │   self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size  │
│   1663 │   │   )                                                                                 │
│ ❱ 1664 │   │   return inner_training_loop(                                                       │
│   1665 │   │   │   args=args,                                                                    │
│   1666 │   │   │   resume_from_checkpoint=resume_from_checkpoint,                                │
│   1667 │   │   │   trial=trial,                                                                  │
│                                                                                                  │
│ /root/.pyenv/versions/3.10.0/lib/python3.10/site-packages/accelerate/utils/memory.py:124 in      │
│ decorator                                                                                        │
│                                                                                                  │
│   121 │   │   │   if batch_size == 0:                                                            │
│   122 │   │   │   │   raise RuntimeError("No executable batch size found, reached zero.")        │
│   123 │   │   │   try:                                                                           │
│ ❱ 124 │   │   │   │   return function(batch_size, *args, **kwargs)                               │
│   125 │   │   │   except Exception as e:                                                         │
│   126 │   │   │   │   if should_reduce_batch_size(e):                                            │
│   127 │   │   │   │   │   gc.collect()                                                           │
│                                                                                                  │
│ /root/.pyenv/versions/3.10.0/lib/python3.10/site-packages/transformers/trainer.py:1940 in        │
│ _inner_training_loop                                                                             │
│                                                                                                  │
│   1937 │   │   │   │   │   with model.no_sync():                                                 │
│   1938 │   │   │   │   │   │   tr_loss_step = self.training_step(model, inputs)                  │
│   1939 │   │   │   │   else:                                                                     │
│ ❱ 1940 │   │   │   │   │   tr_loss_step = self.training_step(model, inputs)                      │
│   1941 │   │   │   │                                                                             │
│   1942 │   │   │   │   if (                                                                      │
│   1943 │   │   │   │   │   args.logging_nan_inf_filter                                           │
│                                                                                                  │
│ /root/.pyenv/versions/3.10.0/lib/python3.10/site-packages/transformers/trainer.py:2728 in        │
│ training_step                                                                                    │
│                                                                                                  │
│   2725 │   │   │   `torch.Tensor`: The tensor with training loss on this batch.                  │
│   2726 │   │   """                                                                               │
│   2727 │   │   model.train()                                                                     │
│ ❱ 2728 │   │   inputs = self._prepare_inputs(inputs)                                             │
│   2729 │   │                                                                                     │
│   2730 │   │   if is_sagemaker_mp_enabled():                                                     │
│   2731 │   │   │   loss_mb = smp_forward_backward(model, inputs, self.args.gradient_accumulatio  │
│                                                                                                  │
│ /root/.pyenv/versions/3.10.0/lib/python3.10/site-packages/transformers/trainer.py:2675 in        │
│ _prepare_inputs                                                                                  │
│                                                                                                  │
│   2672 │   │   """                                                                               │
│   2673 │   │   inputs = self._prepare_input(inputs)                                              │
│   2674 │   │   if len(inputs) == 0:                                                              │
│ ❱ 2675 │   │   │   raise ValueError(                                                             │
│   2676 │   │   │   │   "The batch received was empty, your model won't be able to train on it.   │
│   2677 │   │   │   │   f"training dataset contains keys expected by the model: {','.join(self._  │
│   2678 │   │   │   )                                                                             │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: The batch received was empty, your model won't be able to train on it. Double-check that 
your training dataset contains keys expected by the model: 
input_ids,past_key_values,attention_mask,token_type_ids,position_ids,head_mask,inputs_embeds,labels,us
e_cache,output_attentions,output_hidden_states,return_dict,labels,label,label_ids.
  0%|                                                                          | 0/10 [00:01<?, ?it/s]
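
The error message lists the keys the model expects (`input_ids`, `attention_mask`, `labels`, and so on). The CodeT5 example elsewhere in this thread wraps each split in `CustomDataset` before handing it to the trainer, whereas this script passes the raw output of `load()` directly, which may be why no batch keys are found. Below is a minimal sketch of a wrapper that returns dictionaries, assuming each split is a pair of equal-length sequences (inputs, labels). `DictDataset` is a hypothetical name, not part of CodeTF.

```python
class DictDataset:
    """Map-style dataset that yields the dict keys a Hugging Face Trainer looks for."""

    def __init__(self, input_ids, labels):
        if len(input_ids) != len(labels):
            raise ValueError("input_ids and labels must have the same length")
        self.input_ids = input_ids
        self.labels = labels

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        # Returning a dict (not a tuple) lets the collator build named batches.
        return {"input_ids": self.input_ids[idx], "labels": self.labels[idx]}


# Toy token ids stand in for real tokenizer output.
toy = DictDataset([[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]])
```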

Loading codet5p-6b or other large models requires the trust_remote_code=True parameter

When trying to execute the following code:

from codetf.models import load_model_pipeline

code_generation_model = load_model_pipeline(model_name="codet5", task="pretrained",
            model_type="plus-6B", is_eval=True,
            load_in_8bit=True, load_in_4bit=False, weight_sharding=False)
            
result = code_generation_model.predict(["def find_largest_number_in_list():"])
print(result)

I get the following error:

ValueError: Loading Salesforce/codet5p-6b requires you to execute the configuration file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option trust_remote_code=True to remove this error.

The README example works perfectly fine.
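
Whether `load_model_pipeline` forwards this flag is not clear from the report. As a possible workaround, the checkpoint can be loaded with transformers directly, passing the option named in the error message. This is a sketch that bypasses CodeTF entirely. The function name is hypothetical, and the flag should only be enabled after reading the code in the model repository.

```python
def load_codet5p(checkpoint="Salesforce/codet5p-6b", trust_remote_code=True):
    """Load tokenizer and model, opting in to the repository's custom model code."""
    # Imported lazily so the function can be defined without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(
        checkpoint,
        trust_remote_code=trust_remote_code,  # the flag named in the error message
    )
    return tokenizer, model
```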

Error loading the MBPP dataset

There appears to be an error in the file name that contains the base class used for the MBPPDataset dataset. It seems that the correct name should be "base_dataset" instead of "base_dataloader."

╭─────────────────────────────── Traceback (most recent call last)────────────────────────────────╮
│ /fs04/qe26/PEFT/finetuning.py:3 in <module>                                                      │
│                                                                                                  │
│    1 from codetf.trainer.causal_lm_trainer import CausalLMTrainer                                │
│    2 # from codetf.data_utility.human_eval_dataset import HumanDataset                           │
│ ❱  3 from codetf.data_utility.mpp_dataset import MBPPDataset                                     │
│    4 # from codetf.data_utility.codexglue_dataset import CodeXGLUEDataset                        │
│    5 from codetf.models import load_model_pipeline                                               │
│    6 from codetf.performance.evaluation_metric import EvaluationMetric                           │
│                                                                                                  │
│ /scratch/qe26/crojasca/miniconda/conda/envs/jupyterlab/lib/python3.10/site-packages/codetf/data_ │
│ utility/mpp_dataset.py:6 in <module>                                                             │
│                                                                                                  │
│    3 import torch                                                                                │
│    4 import torch.nn.functional as F                                                             │
│    5 from datasets import load_dataset                                                           │
│ ❱  6 from codetf.data_utility.base_dataloader import BaseDataset                                 │
│    7 # from torch.utils.data import TensorDataset                                                │
│    8                                                                                             │
│    9 class MBPPDataset(BaseDataset):                                                             │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ModuleNotFoundError: No module named 'codetf.data_utility.base_dataloader'
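
Until the import is fixed in the package, one stopgap is to register the existing module under the missing name before importing `mpp_dataset`. The sketch below shows the aliasing technique in general form. It assumes, as the report suggests, that `codetf.data_utility.base_dataset` exists and defines `BaseDataset`. `alias_module` is a hypothetical helper.

```python
import importlib
import sys


def alias_module(existing_name, missing_name):
    """Register an importable module under a second dotted name."""
    module = importlib.import_module(existing_name)
    sys.modules[missing_name] = module
    return module


# Intended use (requires codetf to be installed), before importing mpp_dataset:
# alias_module("codetf.data_utility.base_dataset",
#              "codetf.data_utility.base_dataloader")
```

This only papers over the problem for the current process; the real fix is to correct the import line in `mpp_dataset.py`.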

Unable to load models `plus-16B` and `plus-6B`

Hi, thank you for your work. I'm trying to use CodeT5+ model types plus-16B and plus-6B. However, when running, I get an error:

ValueError: CodeT5pEncoderDecoderModel does not support "device_map='auto'". To implement support, the modelclass needs to implement the "_no_split_modules" attribute.

The code I'm using is the same as provided in the examples:

from codetf.models import load_model_pipeline

code_generation_model = load_model_pipeline(model_name="codet5", task="pretrained",
            model_type="plus-6B", is_eval=True,
            load_in_8bit=True, load_in_4bit=False, weight_sharding=False)

result = code_generation_model.predict(["def print_hello_world():"])
print(result)

Any ideas on how the issue could be resolved?

I want to fine-tune CodeTF to generate JSON rule code.

How can I customize my dataset? This is a snippet of the JSON:

{
"type": "page",
"body": {
"type": "collapse-group",
"activeKey": [
"1"
],
"body": [
{
"type": "collapse",
"key": "1",
"header": "title 1",
"body": "this is content 1"
},
{
"type": "collapse",
"key": "2",
"header": "title 2",
"body": "this is content 2"
},
{
"type": "collapse",
"key": "3",
"header": "title 3",
"body": "this is content 3"
}
]
}
}
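
One way to approach this is to treat the task as text-to-code: pair a natural-language description with the serialized JSON, then tokenize the pairs the same way the README examples tokenize CodeXGLUE. The sketch below covers only the pairing step. `build_pairs` and the description text are invented for illustration, and nothing here is a CodeTF API.

```python
import json


def build_pairs(records):
    """Turn (description, json_object) records into (prompt, target) strings."""
    pairs = []
    for description, rule in records:
        # Compact, key-sorted JSON keeps targets short and deterministic.
        target = json.dumps(rule, sort_keys=True, separators=(",", ":"))
        pairs.append((description.strip(), target))
    return pairs


# Hypothetical example record; the description is made up for this sketch.
records = [
    ("A page with one collapse panel titled 'title 1'",
     {"type": "page",
      "body": {"type": "collapse", "key": "1", "header": "title 1"}}),
]
pairs = build_pairs(records)
```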

Fail to install using poetry

I'm using the following pyproject.toml:

[tool.poetry]
name = "code-tf-test"
version = "0.1.0"
description = ""
authors = ["Alex Levin <[email protected]>"]
readme = "README.md"
packages = [{include = "code_tf_test"}]

[tool.poetry.dependencies]
python = "^3.10"


[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

but when I run poetry add salesforce-codetf==1.0.1.1, I get

Updating dependencies
Resolving dependencies... (1.0s)

The current project's Python requirement (>=3.10,<4.0) is not compatible with some of the required packages Python requirement:
  - numpy requires Python >=3.7,<3.11, so it will not be satisfied for Python >=3.11,<4.0

Because salesforce-codetf (1.0.1.1) depends on numpy (1.21.6) which requires Python >=3.7,<3.11, salesforce-codetf is forbidden.
So, because code-tf-test depends on salesforce-codetf (1.0.1.1), version solving failed.

  • Check your dependencies Python requirement: The Python requirement can be specified via the `python` or `markers` properties

    For numpy, a possible solution would be to set the `python` property to ">=3.10,<3.11"

    https://python-poetry.org/docs/dependency-specification/#python-restricted-dependencies,
    https://python-poetry.org/docs/dependency-specification/#using-environment-markers

I'm using Python 3.10.1
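
Poetry checks the project's whole declared Python range, not just the interpreter in use, so running 3.10.1 is not enough: `^3.10` also admits 3.11 and later, which numpy 1.21.6 rejects. The sketch below illustrates that subset check with simplified (major, minor) tuples. It is not how Poetry is implemented.

```python
def is_subrange(inner, outer):
    """True if the half-open version range inner=[lo, hi) lies within outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


project = ((3, 10), (4, 0))       # python = "^3.10" means >=3.10,<4.0
numpy_1_21_6 = ((3, 7), (3, 11))  # numpy 1.21.6 requires >=3.7,<3.11
narrowed = ((3, 10), (3, 11))     # the range Poetry suggests: >=3.10,<3.11

print(is_subrange(project, numpy_1_21_6))   # False: 3.11+ is not covered
print(is_subrange(narrowed, numpy_1_21_6))  # True
```

Following the hint in the Poetry output and setting the `python` property to ">=3.10,<3.11" should let the resolver proceed.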

Unable to run example humaneval code

!pip install sentencepiece
from codetf.models import load_model_pipeline
from codetf.data_utility.human_eval_dataset import HumanEvalDataset
from codetf.performance.model_evaluator import ModelEvaluator
import os

os.environ["HF_ALLOW_CODE_EVAL"] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "true"

model_class = load_model_pipeline(model_name="causallm", task="pretrained",
            model_type="codegen-350M-mono", is_eval=True,
            load_in_8bit=True, weight_sharding=False)

dataset = HumanEvalDataset(tokenizer=model_class.get_tokenizer())
prompt_token_ids, prompt_attention_masks, references = dataset.load()

problems = TensorDataset(prompt_token_ids, prompt_attention_masks)

evaluator = ModelEvaluator(model_class)
avg_pass_at_k = evaluator.evaluate_pass_k(problems=problems, unit_tests=references)
print("Pass@k: ", avg_pass_at_k)

Above is the code that was used. During execution in Google Colab, I received the following error:
in <cell line: 15>:15 │
│ │
│ /usr/local/lib/python3.10/dist-packages/codetf/data_utility/human_eval_dataset.py:29 in load │
│ │
│ 26 │ │ │ unit_test = re.sub(r'METADATA = {[^}]*}', '', unit_test, flags=re.MULTILINE) │
│ 27 │ │ │ references.append(unit_test) │
│ 28 │ │ │
│ ❱ 29 │ │ prompt_token_ids, prompt_attention_masks = self.process_data(prompts, use_max_le │
│ 30 │ │ │
│ 31 │ │ return prompt_token_ids, prompt_attention_masks, references │
│ 32 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: BaseDataset.process_data() got an unexpected keyword argument 'use_max_length'

After looking through the source code, I can't find this keyword argument; I only see max_length. Could anyone shed some light on the issue?
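
The mismatch suggests that the installed `human_eval_dataset.py` and `base_dataset.py` come from different revisions. A quick way to confirm what the installed method accepts is to inspect its signature. The sketch below uses a stand-in class so that it runs without CodeTF. On a real install, pass `BaseDataset.process_data` instead.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if func can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    if name in params:
        return params[name].kind is not inspect.Parameter.POSITIONAL_ONLY
    # A **kwargs parameter accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


class BaseDatasetStandIn:
    # Hypothetical stand-in; the real signature may differ.
    def process_data(self, data, max_length=512):
        return data
```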

`defect` and `refine` tasks for `codet5p-16b`

Hi
Nice project!

Do you have checkpoints and evaluation results for the defect and refine tasks with codet5p-16b?

I tested the codet5 defect and refine task examples in codetf and found that the results are not very good.

pass@k seems to be incorrectly computed on my end.

Hi, folks, thanks a lot for releasing this amazing project!!!

I was trying to run the evaluation module, with the following code,

import sys
from pathlib import Path
sys.path.append(str(Path(".").absolute().parent))
from codetf.models import load_model_pipeline
from codetf.data_utility.util import EOF_STRINGS, EndOfFunctionCriteria, remove_last_block
from torch.utils.data.dataloader import DataLoader
from transformers import StoppingCriteriaList
from codetf.data_utility.human_eval_dataset import HumanEvalDataset
import torch
import os
from accelerate import Accelerator
from collections import defaultdict
from tqdm import tqdm
from evaluate import load
from codetf.performance.model_evaluator import ModelEvaluator
from torch.utils.data import TensorDataset

def main():
    os.environ["HF_ALLOW_CODE_EVAL"] = "1"
    os.environ["TOKENIZERS_PARALLELISM"] = "true"

    model_class = load_model_pipeline(model_name="codet5", task="pretrained",
                model_type="plus-220M", is_eval=True,
                load_in_8bit=True, weight_sharding=False)

    dataset = HumanEvalDataset(tokenizer=model_class.get_tokenizer())
    prompt_token_ids, prompt_attention_masks, references= dataset.load()

    problems = TensorDataset(prompt_token_ids, prompt_attention_masks)
    
    evaluator = ModelEvaluator(model_class)
    avg_pass_at_k = evaluator.evaluate_pass_k(problems=problems, unit_tests=references, \
                                              num_workers=1, k=[1,10,100], batch_size=256, \
                                              num_return_sequences=200, sequences_per_chunk=10)
    print("Pass@k: ", avg_pass_at_k)

if __name__ == "__main__":
    main()

where I changed the evaluation parameters, and also the model to T5p-220M.

Surprisingly, the pass@k was too good to be true. I got:

Pass@k:  {'pass@1': 0.09500000000000008, 'pass@10': 0.6403557771996593, 'pass@100': 0.9999992577464678}

which is much higher than the 220M results in the CodeT5+ paper.

Is there anything wrong in the evaluation code above? It looks fine to me, but I am sorry if I made a silly mistake.

BTW, could you share some sample code for evaluating on MBPP/APPS? Thanks a lot!! :)
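
As a cross-check, the commonly used unbiased estimator (introduced with HumanEval) is 1 - C(n-c, k) / C(n, k) for each problem, averaged over problems, where n is the number of samples and c the number that pass. Whether `ModelEvaluator` computes it exactly this way is not confirmed here. The sketch below is only a reference to compare against.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: n samples, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Plugging n=200 and c=19 into this function gives values close to all three reported numbers. That would be consistent with every problem being scored with the same count of passing samples, but it is only an observation, not a confirmed diagnosis.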

Support new models

Hi,

Is it possible to have the library dynamically support other models, such as GraphCodeBERT and PLBART?

Thanks.

Functionality to use a different cache directory for storing models

Hello, I am using your library to run inference on the StarCoder model. I cannot change the cache directory, and I want the model to download to a different mounted disk that has more space. I tried setting the TRANSFORMERS_CACHE environment variable from Python code, but that didn't work either.
Could load_model_pipeline accept a cache directory path when loading models?

Thank You!
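
One possible reason the environment variable had no effect is ordering: as far as I know, the Hugging Face libraries read their cache settings when they are first imported, so the variable has to be set before `codetf` or `transformers` is imported. Below is a minimal sketch of that ordering. The mount point is hypothetical and `configure_hf_cache` is not a CodeTF function.

```python
import os


def configure_hf_cache(cache_dir):
    """Point Hugging Face caches at cache_dir; call before importing transformers."""
    cache_dir = os.path.abspath(cache_dir)
    os.environ["HF_HOME"] = cache_dir
    os.environ["TRANSFORMERS_CACHE"] = os.path.join(cache_dir, "transformers")
    return cache_dir


# "/mnt/bigdisk/hf-cache" is a hypothetical mount point.
configure_hf_cache("/mnt/bigdisk/hf-cache")
# Only import codetf / transformers after this line, so they see the new values.
```

Exporting the same variables in the shell before starting Python avoids the ordering question altogether.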

Errors running inference examples

$ python test_inference/test_starcoder_nl2code.py
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/bigcode/starcoder/resolve/main/config.json

huggingface_hub.utils._errors.RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-647a1d8b-7884a0702c5cae05744608f7)

Repository Not Found for url: https://huggingface.co/bigcode/starcoder/resolve/main/config.json.
Please make sure you specified the correct `repo_id` and `repo_type`.

OSError: bigcode/starcoder is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
$ python test_inference/test_codet5_multitask.py
omegaconf.errors.ConfigKeyError: Missing key codet5-base-translate-cs-java
    full_key: codet5-base-translate-cs-java
    object_type=dict
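
Regarding the first error: a 401 on `bigcode/starcoder` usually means the request was unauthenticated. The model is gated, so the license has to be accepted on its model page and the client has to be logged in. The sketch below logs in when a token is present in the environment. The variable name `HF_TOKEN` and the helper name are assumptions. The missing `codet5-base-translate-cs-java` key looks like a separate problem and is not addressed here.

```python
import os


def login_if_token_available(env_var="HF_TOKEN"):
    """Log in to the Hugging Face Hub if a token is set; return True on login."""
    token = os.environ.get(env_var)
    if not token:
        return False
    # Imported lazily so the check itself needs no extra packages.
    from huggingface_hub import login

    login(token=token)
    return True
```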

directory tree-sitter-prebuilts not included in pip package

The pip package for salesforce-codetf==1.0.1 doesn't seem to include tree-sitter-prebuilts where the ast parser is searching for Darwin/python.so.

Environment:

Operating System: MacOS Ventura 13.2.1 (22D68)
Python Version: 3.10.9
tree-sitter Version: 0.20.1

Steps to Reproduce:

1. Perform a fresh install of salesforce-codetf==1.0.1
2. Use PythonCodeUtility
3. Encounter an error because the .so file for Python is not included in the package

Expected Result:

The necessary .so file for Python should be included at python3.10/site-packages/codetf/tree-sitter-prebuilts/Darwin/python.so, or there should be clear instructions on how to generate it.

Actual Result:

The .so file for Python is missing, causing errors when attempting to parse Python code. There is no tree-sitter-prebuilts folder at all.

I would appreciate it if you could provide any advice on how to resolve this issue, or update the package to include the necessary files. Thank you for your time and help!

Output for running initialization of PythonCodeUtility:
OSError: dlopen(/Users/username/Projects/projname/CodeTF/codetf/tree-sitter-prebuilts/Darwin/python.so, 0x0006): tried: '/Users/username/Projects/projname/CodeTF/projname/tree-sitter-prebuilts/Darwin/python.so' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64')), '/System/Volumes/Preboot/Cryptexes/OS/Users/username/Projects/projname/CodeTF/codetf/tree-sitter-prebuilts/Darwin/python.so' (no such file), '/Users/username/Projects/projname/CodeTF/codetf/tree-sitter-prebuilts/Darwin/python.so' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64'))
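
The `dlopen` message shows an x86_64 binary being loaded on an arm64 machine, so even when the prebuilt file is present it may need rebuilding locally. The sketch below compiles a grammar with the `Language.build_library` call from py-tree-sitter 0.20.x, the version listed above. The directory layout is taken from the paths in this report, the assumption that CodeTF keys the folder on `platform.system()` is unconfirmed, and both helper names are hypothetical.

```python
import os
import platform


def prebuilt_grammar_path(package_dir, language="python"):
    """Path where, per this report, CodeTF looks for a prebuilt grammar."""
    return os.path.join(
        package_dir, "tree-sitter-prebuilts", platform.system(), f"{language}.so"
    )


def build_grammar(output_path, grammar_repo):
    """Compile a grammar for the local architecture (py-tree-sitter 0.20.x API)."""
    from tree_sitter import Language  # lazy import; needs a C compiler at call time

    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    Language.build_library(output_path, [grammar_repo])
    return output_path


# Hypothetical usage: grammar_repo is a local clone of tree-sitter-python.
# build_grammar(prebuilt_grammar_path("/path/to/site-packages/codetf"),
#               "/path/to/tree-sitter-python")
```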

AttributeError: 'function' object has no attribute '__func__'

I tried to run the demo example for fine-tuning the CodeT5+ model in the README:

from codetf.trainer.codet5_trainer import CodeT5Seq2SeqTrainer
from codetf.data_utility.codexglue_dataset import CodeXGLUEDataset
from codetf.models import load_model_pipeline
from codetf.performance.evaluation_metric import EvaluationMetric
from codetf.data_utility.base_dataset import CustomDataset

model_class = load_model_pipeline(model_name="codet5", task="pretrained",
            model_type="plus-220M", is_eval=True)

dataset = CodeXGLUEDataset(tokenizer=model_class.get_tokenizer())
train, test, validation = dataset.load(subset="text-to-code")

train_dataset= CustomDataset(train[0], train[1])
test_dataset= CustomDataset(test[0], test[1])
val_dataset= CustomDataset(validation[0], validation[1])

evaluator = EvaluationMetric(metric="bleu", tokenizer=model_class.tokenizer)

# peft can be in ["lora", "prefixtuning"]
trainer = CodeT5Seq2SeqTrainer(train_dataset=train_dataset, 
                                validation_dataset=val_dataset, 
                                peft="lora",
                                pretrained_model_or_path=model_class.get_model(),
                                tokenizer=model_class.tokenizer)
trainer.train()

However, I got the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[1], line 25
     19 # peft can be in ["lora", "prefixtuning"]
     20 trainer = CodeT5Seq2SeqTrainer(train_dataset=train_dataset, 
     21                                 validation_dataset=val_dataset, 
     22                                 peft="lora",
     23                                 pretrained_model_or_path=model_class.get_model(),
     24                                 tokenizer=model_class.tokenizer)
---> 25 trainer.train()

File ~/.conda/envs/codetf/lib/python3.8/site-packages/codetf/trainer/base_trainer.py:54, in BaseTrainer.train(self)
     53 def train(self):
---> 54     self.trainer.train()

File ~/.conda/envs/codetf/lib/python3.8/site-packages/transformers/trainer.py:1645, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1640     self.model_wrapped = self.model
   1642 inner_training_loop = find_executable_batch_size(
   1643     self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
   1644 )
-> 1645 return inner_training_loop(
   1646     args=args,
   1647     resume_from_checkpoint=resume_from_checkpoint,
   1648     trial=trial,
   1649     ignore_keys_for_eval=ignore_keys_for_eval,
   1650 )

File ~/.conda/envs/codetf/lib/python3.8/site-packages/accelerate/utils/memory.py:132, in find_executable_batch_size.<locals>.decorator(*args, **kwargs)
    130     raise RuntimeError("No executable batch size found, reached zero.")
    131 try:
--> 132     return function(batch_size, *args, **kwargs)
    133 except Exception as e:
    134     if should_reduce_batch_size(e):

File ~/.conda/envs/codetf/lib/python3.8/site-packages/transformers/trainer.py:1756, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1754         model = self.accelerator.prepare(self.model)
   1755     else:
-> 1756         model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
   1757 else:
   1758     # to handle cases wherein we pass "DummyScheduler" such as when it is specified in DeepSpeed config.
   1759     model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
   1760         self.model, self.optimizer, self.lr_scheduler
   1761     )

File ~/.conda/envs/codetf/lib/python3.8/site-packages/accelerate/accelerator.py:1182, in Accelerator.prepare(self, device_placement, *args)
   1180     result = self._prepare_megatron_lm(*args)
   1181 else:
-> 1182     result = tuple(
   1183         self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
   1184     )
   1185     result = tuple(self._prepare_one(obj, device_placement=d) for obj, d in zip(result, device_placement))
   1187 if tpu_should_fix_optimizer or self.mixed_precision == "fp8":
   1188     # 2. grabbing new model parameters

File ~/.conda/envs/codetf/lib/python3.8/site-packages/accelerate/accelerator.py:1183, in <genexpr>(.0)
   1180     result = self._prepare_megatron_lm(*args)
   1181 else:
   1182     result = tuple(
-> 1183         self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
   1184     )
   1185     result = tuple(self._prepare_one(obj, device_placement=d) for obj, d in zip(result, device_placement))
   1187 if tpu_should_fix_optimizer or self.mixed_precision == "fp8":
   1188     # 2. grabbing new model parameters

File ~/.conda/envs/codetf/lib/python3.8/site-packages/accelerate/accelerator.py:1022, in Accelerator._prepare_one(self, obj, first_pass, device_placement)
   1020     return self.prepare_data_loader(obj, device_placement=device_placement)
   1021 elif isinstance(obj, torch.nn.Module):
-> 1022     return self.prepare_model(obj, device_placement=device_placement)
   1023 elif isinstance(obj, torch.optim.Optimizer):
   1024     optimizer = self.prepare_optimizer(obj, device_placement=device_placement)

File ~/.conda/envs/codetf/lib/python3.8/site-packages/accelerate/accelerator.py:1308, in Accelerator.prepare_model(self, model, device_placement, evaluation_mode)
   1306 model._original_forward = model.forward
   1307 if self.mixed_precision == "fp16" and is_torch_version(">=", "1.10"):
-> 1308     model.forward = MethodType(torch.cuda.amp.autocast(dtype=torch.float16)(model.forward.__func__), model)
   1309 elif self.mixed_precision == "bf16" and self.distributed_type != DistributedType.TPU:
   1310     model.forward = MethodType(
   1311         torch.autocast(device_type=self.device.type, dtype=torch.bfloat16)(model.forward.__func__), model
   1312     )

AttributeError: 'function' object has no attribute '__func__'
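
The failing line in accelerate reads `model.forward.__func__`, an attribute that exists only on bound methods. If something earlier replaced `model.forward` with a plain function (for example a wrapper installed by a previous preparation step, which is unconfirmed here), the lookup fails exactly as shown. The sketch below reproduces the mechanism with a toy class and involves no CodeTF, PEFT, or accelerate code.

```python
class TinyModel:
    def forward(self, x):
        return x + 1


model = TinyModel()
print(hasattr(model.forward, "__func__"))  # True: forward is a bound method

original = model.forward


def plain_forward(x):
    # A plain function stored on the instance; it is not bound to `model`.
    return original(x)


model.forward = plain_forward
print(hasattr(model.forward, "__func__"))  # False: same failure as in the traceback
print(model.forward(1))                    # 2
```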

Logging Output:

/home/paul/.conda/envs/codetf/lib/python3.8/site-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /home/paul/.conda/envs/codetf/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /home/paul/.conda/envs/tf/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.9
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/paul/.conda/envs/codetf/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
/home/paul/.conda/envs/codetf/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /home/paul/.conda/envs/codetf did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
Found cached dataset code_x_glue_tc_text_to_code (/home/paul/.cache/huggingface/datasets/code_x_glue_tc_text_to_code/default/0.0.0/059898ce5bb35e72c699c69af37020002b38b251734ddaeedef30ae7e6292717)
100%|██████████| 3/3 [00:00<00:00, 13.97it/s]
trainable params: 884736 || all params: 223766784 || trainable%: 0.3953830788397978
/home/paul/.conda/envs/codetf/lib/python3.8/site-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(

Deps:

Package                  Version
------------------------ ----------
absl-py                  1.4.0
accelerate               0.20.3
aiohttp                  3.8.4
aiosignal                1.3.1
antlr4-python3-runtime   4.9.3
anyio                    3.7.0
argon2-cffi              21.3.0
argon2-cffi-bindings     21.2.0
arrow                    1.2.3
asttokens                2.2.1
async-lru                2.0.2
async-timeout            4.0.2
attrs                    23.1.0
Babel                    2.12.1
backcall                 0.2.0
beautifulsoup4           4.12.2
bitsandbytes             0.39.0
bleach                   6.0.0
certifi                  2023.5.7
cffi                     1.15.1
charset-normalizer       3.1.0
click                    8.1.3
colorama                 0.4.6
comm                     0.1.3
datasets                 2.12.0
debugpy                  1.6.7
decorator                5.1.1
defusedxml               0.7.1
dill                     0.3.6
evaluate                 0.4.0
exceptiongroup           1.1.1
executing                1.2.0
fastjsonschema           2.17.1
filelock                 3.12.1
fqdn                     1.5.1
frozenlist               1.3.3
fsspec                   2023.6.0
huggingface-hub          0.14.1
idna                     3.4
importlib-metadata       6.6.0
importlib-resources      5.12.0
iopath                   0.1.10
ipykernel                6.23.1
ipython                  8.12.2
isoduration              20.11.0
jedi                     0.18.2
Jinja2                   3.1.2
joblib                   1.2.0
json5                    0.9.14
jsonpointer              2.3
jsonschema               4.17.3
jupyter_client           8.2.0
jupyter_core             5.3.0
jupyter-events           0.6.3
jupyter-lsp              2.2.0
jupyter_server           2.6.0
jupyter_server_terminals 0.4.4
jupyterlab               4.0.2
jupyterlab-pygments      0.2.2
jupyterlab_server        2.22.1
lxml                     4.9.2
MarkupSafe               2.1.3
matplotlib-inline        0.1.6
mistune                  2.0.5
multidict                6.0.4
multiprocess             0.70.14
nbclient                 0.8.0
nbconvert                7.4.0
nbformat                 5.9.0
nest-asyncio             1.5.6
nltk                     3.8.1
notebook_shim            0.2.3
numpy                    1.21.6
nvidia-cublas-cu11       11.10.3.66
nvidia-cuda-nvrtc-cu11   11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11        8.5.0.96
omegaconf                2.3.0
overrides                7.3.1
packaging                23.1
pandas                   1.3.5
pandocfilters            1.5.0
parso                    0.8.3
peft                     0.3.0
pexpect                  4.8.0
pickleshare              0.7.5
Pillow                   9.5.0
pip                      23.0.1
pkgutil_resolve_name     1.3.10
platformdirs             3.5.3
portalocker              2.7.0
prometheus-client        0.17.0
prompt-toolkit           3.0.38
psutil                   5.9.5
ptyprocess               0.7.0
pure-eval                0.2.2
pyarrow                  12.0.0
pycparser                2.21
Pygments                 2.15.1
pyparsing                3.0.7
pyrsistent               0.19.3
python-dateutil          2.8.2
python-json-logger       2.0.7
pytz                     2023.3
PyYAML                   6.0
pyzmq                    25.1.0
regex                    2023.6.3
requests                 2.31.0
responses                0.18.0
rfc3339-validator        0.1.4
rfc3986-validator        0.1.1
rouge-score              0.1.2
sacrebleu                2.3.1
safetensors              0.3.1
salesforce-codetf        1.0.1.1
scikit-learn             1.0.2
scipy                    1.10.1
Send2Trash               1.8.2
setuptools               67.8.0
six                      1.16.0
sniffio                  1.3.0
soupsieve                2.4.1
stack-data               0.6.2
tabulate                 0.9.0
terminado                0.17.1
threadpoolctl            3.1.0
tinycss2                 1.2.1
tokenizers               0.13.3
tomli                    2.0.1
torch                    1.13.1
torchvision              0.14.1
tornado                  6.3.2
tqdm                     4.63.0
traitlets                5.9.0
transformers             4.30.1
tree-sitter              0.20.1
typing_extensions        4.6.3
uri-template             1.2.0
urllib3                  2.0.3
wcwidth                  0.2.6
webcolors                1.13
webencodings             0.5.1
websocket-client         1.5.3
wheel                    0.38.4
xxhash                   3.2.0
yarl                     1.9.2
zipp                     3.15.0

System:

OS: Ubuntu 22.04.2 LTS (WSL)
GPU: RTX 4070 Ti
CUDA 11.8

If you need any further information to assist, feel free to ask!
