
codebert's Introduction

Code Pretraining Models

This repo contains code pretraining models in the CodeBERT series from Microsoft, including six models as of June 2023.

  • CodeBERT (EMNLP 2020)
  • GraphCodeBERT (ICLR 2021)
  • UniXcoder (ACL 2022)
  • CodeReviewer (ESEC/FSE 2022)
  • CodeExecutor (ACL 2023)
  • LongCoder (ICML 2023)

CodeBERT

This repo provides the code for reproducing the experiments in CodeBERT: A Pre-Trained Model for Programming and Natural Languages. CodeBERT is a pre-trained model for programming languages: a multi-lingual model pre-trained on NL-PL pairs across six programming languages (Python, Java, JavaScript, PHP, Ruby, Go).

Dependency

  • pip install torch
  • pip install transformers

Quick Tour

We use the huggingface/transformers framework to train the model, and you can use it in the same way as the pre-trained RoBERTa base model. Below is an example of how to load the model.

import torch
from transformers import RobertaTokenizer, RobertaConfig, RobertaModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
model.to(device)

NL-PL Embeddings

Here is an example of obtaining NL-PL embeddings from CodeBERT.

>>> from transformers import AutoTokenizer, AutoModel
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
>>> model = AutoModel.from_pretrained("microsoft/codebert-base")
>>> nl_tokens=tokenizer.tokenize("return maximum value")
['return', 'Ġmaximum', 'Ġvalue']
>>> code_tokens=tokenizer.tokenize("def max(a,b): if a>b: return a else return b")
['def', 'Ġmax', '(', 'a', ',', 'b', '):', 'Ġif', 'Ġa', '>', 'b', ':', 'Ġreturn', 'Ġa', 'Ġelse', 'Ġreturn', 'Ġb']
>>> tokens=[tokenizer.cls_token]+nl_tokens+[tokenizer.sep_token]+code_tokens+[tokenizer.eos_token]
['<s>', 'return', 'Ġmaximum', 'Ġvalue', '</s>', 'def', 'Ġmax', '(', 'a', ',', 'b', '):', 'Ġif', 'Ġa', '>', 'b', ':', 'Ġreturn', 'Ġa', 'Ġelse', 'Ġreturn', 'Ġb', '</s>']
>>> tokens_ids=tokenizer.convert_tokens_to_ids(tokens)
[0, 30921, 4532, 923, 2, 9232, 19220, 1640, 102, 6, 428, 3256, 114, 10, 15698, 428, 35, 671, 10, 1493, 671, 741, 2]
>>> context_embeddings=model(torch.tensor(tokens_ids)[None,:])[0]
torch.Size([1, 23, 768])
tensor([[-0.1423,  0.3766,  0.0443,  ..., -0.2513, -0.3099,  0.3183],
        [-0.5739,  0.1333,  0.2314,  ..., -0.1240, -0.1219,  0.2033],
        [-0.1579,  0.1335,  0.0291,  ...,  0.2340, -0.8801,  0.6216],
        ...,
        [-0.4042,  0.2284,  0.5241,  ..., -0.2046, -0.2419,  0.7031],
        [-0.3894,  0.4603,  0.4797,  ..., -0.3335, -0.6049,  0.4730],
        [-0.1433,  0.3785,  0.0450,  ..., -0.2527, -0.3121,  0.3207]],
       grad_fn=<SelectBackward>)
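
If a single fixed-size vector per NL-PL pair is needed (for example, for retrieval or clustering), one common choice (not the only one) is the hidden state at the first (<s>) position, which always has dimension 768 regardless of input length:

>>> cls_embedding = context_embeddings[:, 0, :]
>>> cls_embedding.shape
torch.Size([1, 768])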

Probing

As stated in the paper, CodeBERT is not suitable for the mask prediction task, while CodeBERT (MLM) is.

We give an example of how to use CodeBERT (MLM) for the mask prediction task.

from transformers import RobertaConfig, RobertaTokenizer, RobertaForMaskedLM, pipeline

model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base-mlm")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base-mlm")

CODE = "if (x is not None) <mask> (x>1)"
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)

outputs = fill_mask(CODE)
print(outputs)

Results

'and', 'or', 'if', 'then', 'AND'

The detailed outputs are as follows:

{'sequence': '<s> if (x is not None) and (x>1)</s>', 'score': 0.6049249172210693, 'token': 8}
{'sequence': '<s> if (x is not None) or (x>1)</s>', 'score': 0.30680200457572937, 'token': 50}
{'sequence': '<s> if (x is not None) if (x>1)</s>', 'score': 0.02133703976869583, 'token': 114}
{'sequence': '<s> if (x is not None) then (x>1)</s>', 'score': 0.018607674166560173, 'token': 172}
{'sequence': '<s> if (x is not None) AND (x>1)</s>', 'score': 0.007619690150022507, 'token': 4248}

Downstream Tasks

For Code Search and Code Documentation Generation tasks, please refer to the CodeBERT folder.

GraphCodeBERT

This repo also provides the code for reproducing the experiments in GraphCodeBERT: Pre-training Code Representations with Data Flow. GraphCodeBERT is a pre-trained model for programming languages that considers the inherent structure of code, i.e. data flow: a multi-lingual model pre-trained on NL-PL pairs across six programming languages (Python, Java, JavaScript, PHP, Ruby, Go).

For downstream tasks like code search, clone detection, code refinement and code translation, please refer to the GraphCodeBERT folder.

UniXcoder

This repo will provide the code for reproducing the experiments in UniXcoder: Unified Cross-Modal Pre-training for Code Representation. UniXcoder is a unified cross-modal pre-trained model for programming languages to support both code-related understanding and generation tasks.

Please refer to the UniXcoder folder for tutorials and downstream tasks.

CodeReviewer

This repo also provides the code for reproducing the experiments in CodeReviewer: Pre-Training for Automating Code Review Activities. CodeReviewer is a model pre-trained with code change and code review data to support code review tasks.

Please refer to the CodeReviewer folder for tutorials and downstream tasks.

CodeExecutor

This repo provides the code for reproducing the experiments in Code Execution with Pre-trained Language Models. CodeExecutor is a pre-trained model that learns to predict the execution traces using a code execution pre-training task and curriculum learning.

Please refer to the CodeExecutor folder for details.

LongCoder

This repo will provide the code for reproducing the experiments on LCC datasets in LongCoder: A Long-Range Pre-trained Language Model for Code Completion. LongCoder is a sparse and efficient pre-trained Transformer model for long code modeling.

Please refer to the LongCoder folder for details.

Contact

Feel free to contact Daya Guo ([email protected]), Shuai Lu ([email protected]) and Nan Duan ([email protected]) if you have any further questions.

Contributing

We appreciate all contributions and thank all the contributors!

codebert's People

Contributors

adocquin, agoyal0512, aniyaz, aravind-gk, bminaiev, boraelci, celbree, dongs0104, ekramasif, fengzhangyin, guoday, guody5, jadecxliu, lizhmq, michaelfu1998-create, microsoft-github-operations[bot], microsoftopensource, nanduan, nashid, tangduyu, thepurpleowl, yichuli


codebert's Issues

Facing issue while running “Inference and Evaluation” script of the “Code Search” project

I am facing the error below while running the "Inference and Evaluation" script of the "Code Search" project.

I am using the following machine configuration and library versions:

Ubuntu 20.04
Python 3.7
torch==1.4.0
transformers==2.5.0
filelock more_itertools

09/21/2020 16:16:09 - INFO - main - Training/evaluation parameters Namespace(adam_epsilon=1e-08, cache_dir='', config_name='', data_dir='../data/codesearch/test/java', dev_file='shared_task_dev_top10_concat.tsv', device=device(type='cpu'), do_eval=False, do_lower_case=False, do_predict=True, do_train=False, eval_all_checkpoints=False, evaluate_during_training=False, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=1, learning_rate=1e-05, local_rank=-1, logging_steps=50, max_grad_norm=1.0, max_seq_length=200, max_steps=-1, model_name_or_path='microsoft/codebert-base', model_type='roberta', n_gpu=0, no_cuda=False, num_train_epochs=8.0, output_dir='../data/codesearch/test/java', output_mode='classification', overwrite_cache=False, overwrite_output_dir=False, per_gpu_eval_batch_size=32, per_gpu_train_batch_size=32, pred_model_dir='./models/java/checkpoint-best/pytorch_model.bin', save_steps=50, seed=42, server_ip='', server_port='', start_epoch=0, start_step=0, task_name='codesearch', test_file='batch_0.txt', test_result_dir='./results/java/0_batch_result.txt', tokenizer_name='', train_file='train_top10_concat.tsv', warmup_steps=0, weight_decay=0.0)
testing
Traceback (most recent call last):
File "run_classifier.py", line 596, in
main()
File "run_classifier.py", line 589, in main
model = model_class.from_pretrained(args.pred_model_dir)
File "/Users/arun/opt/anacondaDist/anaconda3/lib/python3.8/site-packages/transformers/modeling_utils.py", line 383, in from_pretrained
config, model_kwargs = cls.config_class.from_pretrained(
File "/Users/arun/opt/anacondaDist/anaconda3/lib/python3.8/site-packages/transformers/configuration_utils.py", line 176, in from_pretrained
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/Users/arun/opt/anacondaDist/anaconda3/lib/python3.8/site-packages/transformers/configuration_utils.py", line 226, in get_config_dict
config_dict = cls._dict_from_json_file(resolved_config_file)
File "/Users/arun/opt/anacondaDist/anaconda3/lib/python3.8/site-packages/transformers/configuration_utils.py", line 315, in _dict_from_json_file
text = reader.read()
File "/Users/arun/opt/anacondaDist/anaconda3/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

Zero Division Error when n_gpu=0

When run.py is executed, it throws a ZeroDivisionError and the program halts without completing. I think line #352 in run.py of clonedetection (under GraphCodeBERT) is the cause:

logger.info(" Instantaneous batch size per GPU = %d", args.train_batch_size//args.n_gpu)

Kindly look into it.

CodeDocumentation Inference and Evaluation

When running the script run.py in code2nl and using the model downloaded from the drive link,

I am getting this error:

Missing key(s) in state_dict: "bias", "encoder.embeddings.word_embeddings.weight", "encoder.embeddings.position_embeddings.weight", "encoder.embeddings.token_type_embeddings.weight", "encoder.embeddings.LayerNorm.weight", "encoder.embeddings.LayerNorm.bias", "encoder.encoder.layer.0.attention.self.query.weight", "encoder.encoder.layer.0.attention.self.query.bias", "encoder.encoder.layer.0.attention.self.key.weight", "encoder.encoder.layer.0.attention.self.key.bias", "encoder.encoder.layer.0.attention.self.value.weight", "encoder.encoder.layer.0.attention.self.value.bias", "encoder.encoder.layer.0.attention.output.dense.weight", "encoder.encoder.layer.0.attention.output.dense.bias", "encoder.encoder.layer.0.attention.output.LayerNorm.weight", "encoder.encoder.layer.0.attention.output.LayerNorm.bias", "encoder.encoder.layer.0.intermediate.dense.weight", "encoder.encoder.layer.0.intermediate.dense.bias", "encoder.encoder.layer.0.output.dense.weight", "encoder.encoder.layer.0.output.dense.bias", "encoder.encoder.layer.0.output.LayerNorm.weight", "encoder.encoder.layer.0.output.LayerNorm.bias", "encoder.encoder.layer.1.attention.self.query.weight", "encoder.encoder.layer.1.attention.self.query.bias", "encoder.encoder.layer.1.attention.self.key.weight", "encoder.encoder.layer.1.attention.self.key.bias", "encoder.encoder.layer.1.attention.self.value.weight", "encoder.encoder.layer.1.attention.self.value.bias", "encoder.encoder.layer.1.attention.output.dense.weight", "encoder.encoder.layer.1.attention.output.dense.bias", "encoder.encoder.layer.1.attention.output.LayerNorm.weight", "encoder.encoder.layer.1.attention.output.LayerNorm.bias", "encoder.encoder.layer.1.intermediate.dense.weight", "encoder.encoder.layer.1.intermediate.dense.bias", "encoder.encoder.layer.1.output.dense.weight", "encoder.encoder.layer.1.output.dense.bias", "encoder.encoder.layer.1.output.LayerNorm.weight", "encoder.encoder.layer.1.output.LayerNorm.bias", "encoder.encoder.layer.2.attention.self.query.weight", "encoder.encoder.layer.2.attention.self.query.bias", "encoder.encoder.layer.2.attention.self.key.weight", "encoder.encoder.layer.2.attention.self.key.bias", "encoder.encoder.layer.2.attention.self.value.weight", "encoder.encoder.layer.2.attention.self.value.bias", "encoder.encoder.layer.2.attention.output.dense.weight", "encoder.encoder.layer.2.attention.output.dense.bias", "encoder.encoder.layer.2.attention.output.LayerNorm.weight", "encoder.encoder.layer.2.attention.output.LayerNorm.bias", "encoder.encoder.layer.2.intermediate.dense.weight", "encoder.encoder.layer.2.intermediate.dense.bias", "encoder.encoder.layer.2.output.dense.weight", "encoder.encoder.layer.2.output.dense.bias", "encoder.encoder.layer.2.output.LayerNorm.weight", "encoder.encoder.layer.2.output.LayerNorm.bias", "encoder.encoder.layer.3.attention.self.query.weight", "encoder.encoder.layer.3.attention.self.query.bias", "encoder.encoder.layer.3.attention.self.key.weight", "encoder.encoder.layer.3.attention.self.key.bias", "encoder.encoder.layer.3.attention.self.value.weight", "encoder.encoder.layer.3.attention.self.value.bias", "encoder.encoder.layer.3.attention.output.dense.weight", "encoder.encoder.layer.3.attention.output.dense.bias", "encoder.encoder.layer.3.attention.output.LayerNorm.weight", "encoder.encoder.layer.3.attention.output.LayerNorm.bias", "encoder.encoder.layer.3.intermediate.dense.weight", "encoder.encoder.layer.3.intermediate.dense.bias", "encoder.encoder.layer.3.output.dense.weight", "encoder.encoder.layer.3.output.dense.bias", 
"encoder.encoder.layer.3.output.LayerNorm.weight", "encoder.encoder.layer.3.output.LayerNorm.bias", "encoder.encoder.layer.4.attention.self.query.weight", "encoder.encoder.layer.4.attention.self.query.bias", "encoder.encoder.layer.4.attention.self.key.weight", "encoder.encoder.layer.4.attention.self.key.bias", "encoder.encoder.layer.4.attention.self.value.weight", "encoder.encoder.layer.4.attention.self.value.bias", "encoder.encoder.layer.4.attention.output.dense.weight", "encoder.encoder.layer.4.attention.output.dense.bias", "encoder.encoder.layer.4.attention.output.LayerNorm.weight", "encoder.encoder.layer.4.attention.output.LayerNorm.bias", "encoder.encoder.layer.4.intermediate.dense.weight", "encoder.encoder.layer.4.intermediate.dense.bias", "encoder.encoder.layer.4.output.dense.weight",

Token Embedding from CodeBERT

Hi.

I want to obtain source code token embedding and I was wondering if I can use the CodeBERT pre-trained model for this purpose. If so, would you please give me some hints on how I can do it?

How to deal with long code data?

I am using CodeBERT for classifying malicious code written in PHP. Some code samples in the dataset are really long, far beyond a typical MAX_LEN such as 256, and setting MAX_LEN to a large value quickly exhausts GPU memory. So I wonder if there are good strategies to deal with this.
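
One commonly used workaround (not something prescribed by this repo) is to split long code into overlapping token windows, encode each window separately, and pool the per-window vectors. A minimal sketch, assuming microsoft/codebert-base, CLS pooling per window, and averaging over windows:

import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")

def embed_long_code(code, max_len=256, stride=128):
    # Tokenize once without special tokens, then slide an overlapping window over the ids.
    ids = tokenizer.encode(code, add_special_tokens=False)
    chunks = [ids[i:i + max_len - 2] for i in range(0, max(1, len(ids)), stride)]
    vecs = []
    with torch.no_grad():
        for chunk in chunks:
            input_ids = torch.tensor([[tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]])
            hidden = model(input_ids)[0]          # [1, window_len, 768]
            vecs.append(hidden[:, 0, :])          # CLS vector of this window
    return torch.cat(vecs, dim=0).mean(dim=0)     # average over windows -> [768]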

Code stuck infinitely when performing Fine-Tuning

When running the fine-tune operation, the script gets stuck at the following warning.

Epoch: 0%| | 0/8 [00:00<?, ?it/s]/home/akash/.local/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:224: UserWarning: To get the last learning rate computed by the scheduler, please use get_last_lr().
warnings.warn("To get the last learning rate computed by the scheduler, "

How to finetune CodeBERT to do a 4 class classification task.

Hi,

Recently I have been looking at and experimenting with the clone detection variant of CodeBERT to perform a 4-class classification problem. But the model only predicts 2 classes, despite the task I am training it on having 4 classes in data.jsonl, train.txt, valid.txt, etc. Is it possible to use the provided examples to do multi-class classification with CodeBERT, or is it only able to solve a binary classification problem (using the clonedetection folder) out of the box?

Thanks a lot
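
For reference, a minimal sketch of giving CodeBERT a 4-way classification head through transformers (this is not the repo's clonedetection code, which defines its own binary head in model.py; that head's output size would need the analogous change):

from transformers import RobertaConfig, RobertaTokenizer, RobertaForSequenceClassification

config = RobertaConfig.from_pretrained("microsoft/codebert-base", num_labels=4)
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaForSequenceClassification.from_pretrained("microsoft/codebert-base", config=config)
# The model now outputs 4 logits per example and can be fine-tuned with a standard
# cross-entropy loss over the 4 labels.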

About Data Format

Hi,

I have data in this format: (docstring, code).

I figured out that the data you have used is in the following format:

BooleanValue<CODESPLIT>URL<CODESPLIT>returnType.methodName<CODESPLIT>docString<CODESPLIT>code

In order to use the code for code search, I have the following concerns:

  1. How can I generate the BooleanValue for my dataset? Does it provide any benefit for code search?
  2. I can work out how to generate 'returnType.methodName', but is there any use for it in code search?
  3. Is there any use of the URL in training and evaluating the model?

Kindly let me know about it.
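
For illustration only, a minimal sketch of writing one line in this format (all field values below are placeholders; the convention of 1 for a matching docstring-code pair and 0 for a mismatched pair is an assumption, not taken from the repo's docs):

fields = [
    "1",                                          # hypothetical label: 1 = docstring matches code
    "https://example.com/repo",                   # placeholder URL
    "ReturnType.methodName",                      # placeholder identifier
    "return maximum value",                       # docstring
    "def max(a, b): return a if a > b else b",    # code
]
with open("train.txt", "a", encoding="utf-8") as f:
    f.write("<CODESPLIT>".join(fields) + "\n")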

Using With Transformers + Embedding Quality Questions.

I tried to run the following with CodeBERT, but from these examples I was unable to obtain good embeddings.
I was also curious how you are using the embeddings at the moment: are you taking the pooler output or the mean of the output embeddings? I seem to get slightly better results taking the mean of the embeddings, although the paper suggests (but does not explicitly state) that you are using the CLS token.

I tried both, but don't seem to be getting particularly good results with either. I do, however, seem to get some semantic signal from the mean of the embeddings, while the pooler-output embeddings are quite bad. I have built a reproducible script below so you can check whether these results are reproducible on your end when the right model is loaded.

# !pip install simpleneighbors[annoy]
# !pip install vectorhub==1.0.8
# !pip install torch==1.7.1
# !pip install transformers==4.1.1

from vectorhub.encoders.text import BaseText2Vec
import torch
from transformers import RobertaTokenizer, RobertaConfig, RobertaModel
class Code2Vec(BaseText2Vec):
    def __init__(self):
        model_name = "microsoft/codebert-base"
        self.tokenizer = RobertaTokenizer.from_pretrained(model_name)
        self.model = RobertaModel.from_pretrained(model_name)

    def encode(self, description, code=None, pooling_method='mean', truncation=True):
        """
        Pooling method is either pooler_output or mean.
        Notes: if it is mean, we can take the last hidden state and add it to the
        model.
        """
        if pooling_method == 'pooler_output':
            return self.model.forward(**self.tokenizer.encode_plus(
                description, code, return_tensors='pt', truncation=truncation
            ))[pooling_method].detach().numpy().tolist()[0]
        elif pooling_method == 'mean':
            return self._vector_operation(self.model.forward(**self.tokenizer.encode_plus(
                description, code, return_tensors='pt', truncation=truncation
            ))['last_hidden_state'].detach().numpy().tolist(), 'mean', axis=1)[0]

model = Code2Vec()
query = "hello world"
code_1 = 'print("hello")'
vec_1 = model.encode(query, code_1)
query = "show all cells"
code_2 = """from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
"""
vec_2 = model.encode(query, code_2)

query = "Install private package in Colab"
code_3 = """import os
from getpass import getpass
import urllib

user = input('User name: ')
password = getpass('Password: ')
password = urllib.parse.quote(password) # your password is converted into url format
https_repo_link = input('Https Repo Link: ') 
end_string = https_repo_link.split('@github.com/')[1]
cmd_string = 'git clone https://{0}:{1}@github.com/{2}'.format(user, password, end_string)
os.system(cmd_string)"""
vec_3 = model.encode(query, code_3)

query = "Download an image"
code_4 = """
def download_image(image_url, output_dir):
    import requests
    r = requests.get(image_url)
    with open(output_dir, 'wb') as f:
        f.write(r.content)
"""
vec_4 = model.encode(query, code_4)

from simpleneighbors import SimpleNeighbors
colors = [
    (code_1, vec_1),
    (code_2, vec_2),
    (code_3, vec_3),
    (code_4, vec_4)
]
sim = SimpleNeighbors(768)
sim.feed(colors)
sim.build()

Now when I started testing it:

# The only good result:
query_vec = model.encode("view all cells in jupyter notebook")
print(list(sim.nearest_matching(query_vec, 1))[0])
# Returns  (only for mean output)
# from IPython.core.interactiveshell import InteractiveShell
# InteractiveShell.ast_node_interactivity = "all"
# Bad result
query_vec = model.encode("download an image")
print(list(sim.nearest_matching(query_vec, 1))[0])
# Returns print("hello")
# Expected download_image function
# Bad result
query_vec = model.encode("installing packages")
print(list(sim.nearest_matching(query_vec, 1))[0])
# returns print("hello")
# Expected os installation

I couldn't find an example of how to use this in the Transformers package - so please let me know if there is a detail I am missing.

From the above, I expected the download_image function to be the top match, but it wasn't.
Is this the expected behaviour, or am I preprocessing something incorrectly?

Pretraining Instructions

Can you share instructions on how to train the GraphCodeBERT model from scratch? I would like to add more data from other programming languages such as C++, and may use a parallel corpus (say, from GeeksforGeeks).

Error running inference with code search

Hello, I ran process.py and the fine-tuning script, and both seemed to go well, with everything executing without error. However, when I run inference I get the following error.

What I do

python3 run_classifier.py \
--model_type roberta \
--model_name_or_path microsoft/codebert-base \
--task_name codesearch \
--do_predict \
--output_dir ./data/codesearch/test/python \
--data_dir ./data/codesearch/test/python \
--max_seq_length 200 \
--per_gpu_train_batch_size 32 \
--per_gpu_eval_batch_size 32 \
--learning_rate 1e-5 \
--num_train_epochs 8 \
--test_file batch_0.txt \
--pred_model_dir ./models/python/checkpoint-best/ \
--test_result_dir ./results/python/0_batch_result.txt

What I get

04/24/2021 04:45:40 - WARNING - __main__ -   Process rank: -1, device: cpu, n_gpu: 0, distributed training: False, 16-bits training: False
Some weights of the model checkpoint at microsoft/codebert-base were not used when initializing RobertaForSequenceClassification: ['pooler.dense.weight', 'pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at microsoft/codebert-base and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
04/24/2021 04:45:49 - INFO - __main__ -   Training/evaluation parameters Namespace(adam_epsilon=1e-08, cache_dir='', config_name='', data_dir='./data/codesearch/test/python', dev_file='shared_task_dev_top10_concat.tsv', device=device(type='cpu'), do_eval=False, do_lower_case=False, do_predict=True, do_train=False, eval_all_checkpoints=False, evaluate_during_training=False, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=1, learning_rate=1e-05, local_rank=-1, logging_steps=50, max_grad_norm=1.0, max_seq_length=200, max_steps=-1, model_name_or_path='microsoft/codebert-base', model_type='roberta', n_gpu=0, no_cuda=False, num_train_epochs=8.0, output_dir='./data/codesearch/test/python', output_mode='classification', overwrite_cache=False, overwrite_output_dir=False, per_gpu_eval_batch_size=32, per_gpu_train_batch_size=32, pred_model_dir='./models/python/checkpoint-best/', save_steps=50, seed=42, server_ip='', server_port='', start_epoch=0, start_step=0, task_name='codesearch', test_file='batch_0.txt', test_result_dir='./results/python/0_batch_result.txt', tokenizer_name='', train_file='train_top10_concat.tsv', warmup_steps=0, weight_decay=0.0)
testing
404 Client Error: Not Found for url: https://huggingface.co/models/python/checkpoint-best//resolve/main/config.json
Traceback (most recent call last):
  File "/home/aton/.local/lib/python3.8/site-packages/transformers/configuration_utils.py", line 458, in get_config_dict
    resolved_config_file = cached_path(
  File "/home/aton/.local/lib/python3.8/site-packages/transformers/file_utils.py", line 1165, in cached_path
    output_path = get_from_cache(
  File "/home/aton/.local/lib/python3.8/site-packages/transformers/file_utils.py", line 1336, in get_from_cache
    r.raise_for_status()
  File "/usr/lib/python3/dist-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/models/python/checkpoint-best//resolve/main/config.json

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_classifier.py", line 596, in <module>
    main()
  File "run_classifier.py", line 589, in main
    model = model_class.from_pretrained(args.pred_model_dir)
  File "/home/aton/.local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 975, in from_pretrained
    config, model_kwargs = cls.config_class.from_pretrained(
  File "/home/aton/.local/lib/python3.8/site-packages/transformers/configuration_utils.py", line 401, in from_pretrained
    config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/aton/.local/lib/python3.8/site-packages/transformers/configuration_utils.py", line 478, in get_config_dict
    raise EnvironmentError(msg)
OSError: Can't load config for './models/python/checkpoint-best/'. Make sure that:

- './models/python/checkpoint-best/' is a correct model identifier listed on 'https://huggingface.co/models'

- or './models/python/checkpoint-best/' is the correct path to a directory containing a config.json file


Any idea what I could do to fix this? Thank you very much.

The embedding dimension of CodeBert

Dear authors

Thanks for your fantastic work. The embedding dimension seems to differ between code inputs:
the size of the vector in Readme.md is torch.Size([1, 23, 768]), while with other code we get torch.Size([1, 32, 768]) and torch.Size([1, 52, 768]). How can we get a fixed-dimension embedding vector? Thank you.

Best.
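
For what it's worth, the middle dimension (23, 32, 52) is just the number of tokens in each input; the hidden size is always 768. A fixed-size vector is usually obtained by pooling over the token dimension; a minimal sketch (the pooling choice is a modeling decision, not something prescribed by the repo):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

inputs = tokenizer("def max(a,b): return a if a>b else b", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs)[0]                    # [1, seq_len, 768]
cls_vec = hidden[:, 0, :]                          # option 1: CLS vector -> [1, 768]
mask = inputs["attention_mask"].unsqueeze(-1)      # option 2: mean over non-padding tokens
mean_vec = (hidden * mask).sum(1) / mask.sum(1)    # -> [1, 768]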

Using codesearch for my own queries (how to instantiate the model.py class)

Hello, I managed to fine-tune CodeBERT for the codesearch task, and I was wondering how I can use the model.bin I just created to perform code search on my own natural language queries.

The model.bin is a state dictionary containing all the weights, but I don't understand how to go from that to a working model.

I tried to instantiate it with

model = RobertaModel.from_pretrained("microsoft/codebert-base")

but it returns the following error

RuntimeError: Error(s) in loading state_dict for RobertaModel:
Missing key(s) in state_dict: "embeddings.position_ids", "embeddings.word_embeddings.weight", "embeddings.position_embeddings.weight", "

etc,etc

I also noticed that there is a Model class in model.py, so that is probably the one I must use, but I don't understand how I am supposed to instantiate it in order for the model to work. Can you show me an example or explain it a little?

Thank you very much
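
For reference, a minimal sketch of one way a fine-tuned codesearch checkpoint is commonly reloaded, assuming the checkpoint directory was written by save_pretrained (i.e. it contains both config.json and pytorch_model.bin); the path below is a placeholder:

from transformers import RobertaTokenizer, RobertaForSequenceClassification

checkpoint_dir = "./models/python/checkpoint-best"   # hypothetical path to the fine-tuned checkpoint
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
# The codesearch fine-tuning produces a sequence-classification model, not a bare RobertaModel,
# so the matching class has to be used when loading the saved weights.
model = RobertaForSequenceClassification.from_pretrained(checkpoint_dir)
model.eval()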

999 distractor snippets

Thank you for CodeBERT! A quick question: where in CodeSearchNet can I find the fixed set
of 999 distractor code snippets, say if I wish to evaluate MRR for a model built for code search?
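
For context, MRR with 1 correct snippet and 999 distractors is the mean of 1/rank of the correct snippet among the 1000 scored candidates; a minimal sketch (the convention that index 0 holds the correct candidate's score is an assumption for this illustration):

def mean_reciprocal_rank(score_lists):
    # score_lists: one list of 1000 scores per query, with the correct snippet's score at index 0.
    total = 0.0
    for scores in score_lists:
        rank = 1 + sum(1 for s in scores[1:] if s > scores[0])
        total += 1.0 / rank
    return total / len(score_lists)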

Using CodeBERT for code based semantic search / clustering

Hi,

I am interested in using CodeBERT for semantic text similarity / clustering on code but my results are rather poor. Here is my process:

Download the data:

mkdir data data/codesearch
cd data/codesearch
gdown https://drive.google.com/uc?id=1xgSR34XO8xXZg4cZScDYj2eGerBE9iGo  
unzip codesearch_data.zip
rm codesearch_data.zip

Grab some examples to embed:

from pathlib import Path

max_instances = 8

valid = Path("data/codesearch/train_valid/python/valid.txt").read_text().split("\n")
code = [ex.split("<CODESPLIT>")[-1] for ex in valid][:max_instances]

Embed the examples

import torch
from transformers import RobertaTokenizer, RobertaConfig, RobertaModel

# Load the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
model = model.to(device)

# Prepare the inputs
inputs = tokenizer(
    code, padding=True, truncation=True, return_tensors="pt"
)

# Embed the inputs
for name, tensor in inputs.items():
    inputs[name] = tensor.to(model.device)
sequence_output, _ = model(**inputs, output_hidden_states=False)
embeddings = sequence_output[:, 0, :]

Then I arbitrarily compute the cosine similarity between the first input's embedding and the remaining inputs' embeddings:

from torch.nn import CosineSimilarity

# Perform a cosine based semantic similarity search, using the first function as query 
sim = CosineSimilarity(dim=-1)
cosine = sim(embeddings[0], embeddings[1:])
scores, indices = cosine.topk(5)

print(f"Scores: {scores.tolist()}")
print()
print(f"Query:\n---\n{code[0]}")
print()
topk = '\n'.join([code[i] for i in indices])
print(f"Top K:\n---\n{topk}")

The output:

Scores: [0.9909096360206604, 0.9864522218704224, 0.9837372899055481, 0.9776582717895508, 0.9704807996749878]

Query:
---
def start_transaction ( self , sort , address , price = None , data = None , caller = None , value = 0 , gas = 2300 ) : assert self . _pending_transaction is None , "Already started tx" self . _pending_transaction = PendingTransaction ( sort , address , price , data , caller , value , gas )

Top K:
---
def remove_node ( self , id ) : if self . has_key ( id ) : n = self [ id ] self . nodes . remove ( n ) del self [ id ] # Remove all edges involving id and all links to it. for e in list ( self . edges ) : if n in ( e . node1 , e . node2 ) : if n in e . node1 . links : e . node1 . links . remove ( n ) if n in e . node2 . links : e . node2 . links . remove ( n ) self . edges . remove ( e )
def find_essential_genes ( model , threshold = None , processes = None ) : if threshold is None : threshold = model . slim_optimize ( error_value = None ) * 1E-02 deletions = single_gene_deletion ( model , method = 'fba' , processes = processes ) essential = deletions . loc [ deletions [ 'growth' ] . isna ( ) | ( deletions [ 'growth' ] < threshold ) , : ] . index return { model . genes . get_by_id ( g ) for ids in essential for g in ids }
async def play_now ( self , requester : int , track : dict ) : self . add_next ( requester , track ) await self . play ( ignore_shuffle = True )
def _handleAuth ( fn ) : @ functools . wraps ( fn ) def wrapped ( * args , * * kwargs ) : # auth, , authenticate users, internal from yotta . lib import auth # if yotta is being run noninteractively, then we never retry, but we # do call auth.authorizeUser, so that a login URL can be displayed: interactive = globalconf . get ( 'interactive' ) try : return fn ( * args , * * kwargs ) except requests . exceptions . HTTPError as e : if e . response . status_code == requests . codes . unauthorized : #pylint: disable=no-member logger . debug ( '%s unauthorised' , fn ) # any provider is sufficient for registry auth auth . authorizeUser ( provider = None , interactive = interactive ) if interactive : logger . debug ( 'retrying after authentication...' ) return fn ( * args , * * kwargs ) raise return wrapped
def write_log ( log_path , data , allow_append = True ) : append = os . path . isfile ( log_path ) islist = isinstance ( data , list ) if append and not allow_append : raise Exception ( 'Appending has been disabled' ' and file %s exists' % log_path ) if not ( islist or isinstance ( data , Args ) ) : raise Exception ( 'Can only write Args objects or dictionary' ' lists to log file.' ) specs = data if islist else data . specs if not all ( isinstance ( el , dict ) for el in specs ) : raise Exception ( 'List elements must be dictionaries.' ) log_file = open ( log_path , 'r+' ) if append else open ( log_path , 'w' ) start = int ( log_file . readlines ( ) [ - 1 ] . split ( ) [ 0 ] ) + 1 if append else 0 ascending_indices = range ( start , start + len ( data ) ) log_str = '\n' . join ( [ '%d %s' % ( tid , json . dumps ( el ) ) for ( tid , el ) in zip ( ascending_indices , specs ) ] ) log_file . write ( "\n" + log_str if append else log_str ) log_file . close ( )

Notice that the cosine is very high for the top-5 examples, which is unexpected as these examples are chosen randomly. Manually inspecting them, they don't appear to be very relevant to the query.

My questions:

  • Am I doing something wrong?
  • Is there a better way to do semantic similarity searching/clustering with CodeBERT? Here I am following the canonical pipeline for sentence embeddings.
  • One possible source of error is the tokenization. Am I supposed to use the CodeBERT tokenizer on code, or just text?

Hard-coded language to extract data flow graph for Code Translation use case

Hi, first of all, thank you for providing the code for your experiments. It made it very easy for me to reuse and experiment with it.

I'm trying to fine-tune the GraphCodeBERT model for the code-to-code translation use case. While going through the run.py script, I noticed that although parsers are initialized for the individual languages, during tokenization the language is hard-coded to Java and only its parser is used:

code_tokens,dfg=extract_dataflow(example.source,parsers['java'],'java')

Shouldn't this be the language of the source code examples, or am I missing something?

Could I download a pre-trained model myself?

The example shows that we can load the pre-trained model like
RobertaModel.from_pretrained("microsoft/codebert-base")

but could I download it first and then load it like this:

config_path = 'pretrain_model/bert_config.json'
checkpoint_path = 'pretrain_model/bert_model.ckpt'
bert_model = load_trained_model_from_checkpoint(config_path, checkpoint_path,
                                                    seq_len=None)
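
For what it's worth, with the Hugging Face checkpoints the usual pattern is to download once and then load from a local directory, rather than a TF-style config/checkpoint pair; a minimal sketch (the local path is arbitrary):

from transformers import RobertaTokenizer, RobertaModel

# First run: download from the hub and save a local copy.
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
tokenizer.save_pretrained("./local_codebert")
model.save_pretrained("./local_codebert")

# Later runs: load entirely from disk, no network access needed.
tokenizer = RobertaTokenizer.from_pretrained("./local_codebert")
model = RobertaModel.from_pretrained("./local_codebert")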

Issue "UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte"

Hi,

I have run the following script to perform inference and evaluation, but I am getting the error "UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte".

python CodeBERT/codesearch/run_classifier.py \
--model_type roberta \
--model_name_or_path CodeBERT/pretrained_models \
--task_name codesearch \
--do_predict \
--output_dir CodeBERT/data/train_valid/java \
--data_dir CodeBERT/data/train_valid/java \
--max_seq_length 200 \
--per_gpu_train_batch_size 32 \
--per_gpu_eval_batch_size 32 \
--learning_rate 1e-5 \
--num_train_epochs 8 \
--test_file CodeBERT/data/train_valid/java/test.txt \
--pred_model_dir CodeBERT/models/java/checkpoint-best/pytorch_model.bin \
--test_result_dir results/result.txt

I have taken a small subset of provided files. You can access them as

train.txt
test.txt
valid.txt

I am using transformer library 2.5....

FYI, the following script is what I used for fine-tuning, and it works fine:

python CodeBERT/codesearch/run_classifier.py \
  --model_type roberta \
  --task_name codesearch \
  --do_train \
  --do_eval \
  --eval_all_checkpoints \
  --train_file train.txt \
  --dev_file valid.txt \
  --max_seq_length 200 \
  --per_gpu_train_batch_size 32 \
  --per_gpu_eval_batch_size 32 \
  --learning_rate 1e-5 \
  --num_train_epochs 8 \
  --gradient_accumulation_steps 1 \
  --overwrite_output_dir \
  --data_dir CodeBERT/data/train_valid/java \
  --output_dir CodeBERT/models/java  \
  --model_name_or_path CodeBERT/pretrained_models

Kindly guide me on how I can resolve this issue.

What is the provided config for?

The pretrained model comes with a config file, but in all the examples in the readme, this config isn't used and the roberta-base config is used instead. Which of these should actually be used?

Pretraining codeBERT

Was CodeBERT trained on all of the 6 programming languages at the same time?
If yes, how? Shuffling everything? Training on one programming language after the other?
Thank you in advance.

Convert clonedetection example to multitask/multilabel

Hi, right now the GraphCodeBERT clone detection performs binary classification to decide whether 2 pieces of code are semantically equivalent or not.

The problem I am trying to solve is: Given a natural language utterance and two code pieces (A and B) as input to my model, determine whether:

  • both pieces are correct
  • piece A is correct and piece B is wrong
  • piece B is correct and piece A is wrong
  • both pieces are wrong

I tried solving this as a 4-class classification task in #53, but the results were not very good, so now I am trying to turn it into a multi-label/multi-task problem, classifying each input twice:

[0,1] -> Whether A is right or wrong.
[0,1] -> Whether B is right or wrong.

Does anyone have any idea how to accomplish this?

Thanks a lot
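
For reference, one way to frame this as two independent binary decisions (a sketch, not something the repo provides; it assumes a recent transformers version that supports problem_type):

import torch
from transformers import RobertaConfig, RobertaForSequenceClassification

# Two independent binary labels: [is piece A correct?, is piece B correct?]
config = RobertaConfig.from_pretrained(
    "microsoft/codebert-base", num_labels=2, problem_type="multi_label_classification"
)
model = RobertaForSequenceClassification.from_pretrained("microsoft/codebert-base", config=config)

input_ids = torch.randint(0, config.vocab_size, (1, 16))   # dummy input for illustration
labels = torch.tensor([[1.0, 0.0]])                        # A correct, B wrong (multi-hot, float)
out = model(input_ids=input_ids, labels=labels)            # BCEWithLogitsLoss is used internally
print(out.loss)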

About using CodeBERT

If I want to run the code search task with CodeBERT, may I load the weights from Microsoft and skip preprocessing and fine-tuning? I'm just confused because training seems to take a lot of time...

Tf?

Do you have the model in TensorFlow?

About CodeBERT code2nl task

I have a doubt:
The NL-Code Discriminator (essentially RoBERTa) in your paper is trained with bimodal data, but I found that when performing the code2nl task in the code you provide, you directly use this NL-Code Discriminator as the encoder. I don't understand why this works, because in the code2nl task the input is only code, which is unimodal rather than bimodal data. Can you help answer my doubts? Thank you very much!

Stuck after training

Hi,

I was running the example described in section CodeBERT/GraphCodeBERT/translation to convert Java to C#.

However, after the training epochs conclude, the program does not seem to terminate.

My terminal is stuck in this state without exiting.

Do you have any idea of what might be going on ?

Could this be caused by a wrong package version (transformers, torch), or does my machine not have enough memory to save the trained model? (Or, if I Ctrl+C, will the model be saved without any kind of corruption?) Even if it is one of the two causes mentioned above, I think the program should throw an error instead of hanging forever after training.

At the moment I am using:
transformers 4.8.1
torch 1.5.0

Thanks a lot

Number of epochs parameter is not used in Code2NL

The following is declared in the run.py script for Code2NL,

parser.add_argument("--num_train_epochs", default=3.0, type=float,
                        help="Total number of training epochs to perform.")

However, it is never used in the script.
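
For context, when the epoch count is honored, it is normally converted into a total optimization step count along these lines (a sketch of the usual bookkeeping, not the repo's code; the numbers are placeholders):

import math

num_train_epochs = 3.0
num_train_examples = 824342        # placeholder dataset size
train_batch_size = 32
gradient_accumulation_steps = 1

steps_per_epoch = math.ceil(num_train_examples / (train_batch_size * gradient_accumulation_steps))
max_steps = int(steps_per_epoch * num_train_epochs)
print(max_steps)                   # value that would drive the scheduler and the training loop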

Appendix D (Late Fusion) Replication Code

I couldn't find the replication code for Appendix D: Late Fusion and Table 7 of the paper in this repo. Would I be able to get those results using a particular set of parameters to run_classifier.py?

Giving error on Electra model when used for code2nl task

I was using the provided fine-tuning script for the code2nl task with an ELECTRA model instead of RoBERTa, and I am getting this error:

File "run.py", line 513, in <module>
    main()
  File "run.py", line 320, in main
    loss,_,_ = model(source_ids=source_ids,source_mask=source_mask,target_ids=target_ids,target_mask=target_mask)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/content/model.py", line 57, in forward
    out = self.decoder(tgt_embeddings,encoder_output,tgt_mask=attn_mask,memory_key_padding_mask=(1-source_mask).bool())
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/transformer.py", line 234, in forward
    memory_key_padding_mask=memory_key_padding_mask)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/transformer.py", line 364, in forward
    key_padding_mask=tgt_key_padding_mask)[0]
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/activation.py", line 987, in forward
    attn_mask=attn_mask)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 4625, in multi_head_attention_forward
    assert embed_dim == embed_dim_to_check
AssertionError

About the paradigm of code-BERT training

The paper mentions that the model is inspired by ELECTRA, which combines a contrastive-style (replaced token detection) objective into the programming language model.

However, in run_classifier::train I cannot find the generator and discriminator parts.

These four files only describe how to train a BERT-style model (in this context, RoBERTa) on the source code corpus.

I'm confused about the training process; please help.

NL-PL (word/token) embedding

Hi.

I want to obtain NL-PL pair (word and token) embedding and I was wondering if I can use the CodeBERT pre-trained model for this purpose. If so, would you please give me some examples of how I can do it?

error: Segmentation fault (core dumped)

01/10/2021 09:20:19 - INFO - transformers.tokenization_utils - loading file https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json from cache at /home/minchen/.cache/torch/transformers/d0c5776499adc1ded22493fae699da0971c1ee4c2587111707a4d177d20257a2.ef00af9e673c7160b4d41cfda1f48c5f4cba57d5142754525572a846a1ab1b9b
01/10/2021 09:20:19 - INFO - transformers.tokenization_utils - loading file https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-merges.txt from cache at /home/minchen/.cache/torch/transformers/b35e7cd126cd4229a746b5d5c29a749e8e84438b14bcdb575950584fe33207e8.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
01/10/2021 09:20:20 - INFO - transformers.modeling_utils - loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/codebert-base/pytorch_model.bin from cache at /home/minchen/.cache/torch/transformers/3416309b564f60f87c1bc2ce8d8a82bb7c1e825b241c816482f750b48a5cdc26.96251fe4478bac0cff9de8ae3201e5847cee59aebbcafdfe6b2c361f9398b349
Segmentation fault (core dumped)

The error is raised by 'tokenizer = tokenizer_class.from_pretrained(tokenizer_name, do_lower_case=args.do_lower_case)'.
How can I solve it?

Code stuck infinitely when performing Fine-Tuning on CodeSearch even with suggested fixes

Hello, I face exactly the same problem as in #25; I tried every fix you proposed without success.

Let me clarify the situation: I ran the data processing script with only Python, which was supposed to produce 17 batch txt files. I deleted 8 of them to free disk space, so only 9 are left, but I don't think they are the reason the script gets stuck (correct me if I'm wrong).

My torch and transformers packages are up to date.

When I run

python3 run_classifier.py \
--model_type roberta \
--task_name codesearch \
--do_train \
--do_eval \
--eval_all_checkpoints \
--train_file train.txt \
--dev_file valid.txt \
--max_seq_length 200 \
--per_gpu_train_batch_size 1 \
--per_gpu_eval_batch_size 1 \
--learning_rate 1e-5 \
--num_train_epochs 8 \
--gradient_accumulation_steps 1 \
--overwrite_output_dir \
--data_dir ./data/codesearch/train_valid/python \
--output_dir ./models/python  \
--model_name_or_path microsoft/codebert-base

So, with batch size 1 as you suggested, I get the following:

- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at microsoft/codebert-base and are newly initialized: ['classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.dense.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
04/30/2021 14:41:43 - INFO - __main__ -   Training/evaluation parameters Namespace(adam_epsilon=1e-08, cache_dir='', config_name='', data_dir='./data/codesearch/train_valid/python', dev_file='valid.txt', device=device(type='cuda'), do_eval=True, do_lower_case=False, do_predict=False, do_train=True, eval_all_checkpoints=True, evaluate_during_training=False, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=1, learning_rate=1e-05, local_rank=-1, logging_steps=50, max_grad_norm=1.0, max_seq_length=200, max_steps=-1, model_name_or_path='microsoft/codebert-base', model_type='roberta', n_gpu=1, no_cuda=False, num_train_epochs=8.0, output_dir='./models/python', output_mode='classification', overwrite_cache=False, overwrite_output_dir=True, per_gpu_eval_batch_size=1, per_gpu_train_batch_size=1, pred_model_dir=None, save_steps=50, seed=42, server_ip='', server_port='', start_epoch=0, start_step=0, task_name='codesearch', test_file='shared_task_dev_top10_concat.tsv', test_result_dir='test_results.tsv', tokenizer_name='', train_file='train.txt', warmup_steps=0, weight_decay=0.0)
04/30/2021 14:41:43 - INFO - __main__ -   Loading features from cached file ./data/codesearch/train_valid/python/cached_train_train_codebert-base_200_codesearch
04/30/2021 14:43:07 - INFO - __main__ -   ***** Running training *****
04/30/2021 14:43:07 - INFO - __main__ -     Num examples = 824342
04/30/2021 14:43:07 - INFO - __main__ -     Num Epochs = 8
04/30/2021 14:43:07 - INFO - __main__ -     Instantaneous batch size per GPU = 1
04/30/2021 14:43:07 - INFO - __main__ -     Total train batch size (w. parallel, distributed & accumulation) = 1
04/30/2021 14:43:07 - INFO - __main__ -     Gradient Accumulation steps = 1
04/30/2021 14:43:07 - INFO - __main__ -     Total optimization steps = 6594736
Epoch:   0%|                                              | 0/8 [00:00<?, ?it/s]/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py:247: UserWarning: To get the last learning rate computed by the scheduler, please use `get_last_lr()`.
  warnings.warn("To get the last learning rate computed by the scheduler, "
Epoch:   0%|                                              | 0/8 [05:51<?, ?it/s]
Traceback (most recent call last):
  File "run_classifier.py", line 596, in <module>
    main()
  File "run_classifier.py", line 544, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer, optimizer)
  File "run_classifier.py", line 127, in train
    optimizer.step()
  File "/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/optim/optimizer.py", line 89, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/optimization.py", line 345, in step
    exp_avg.mul_(beta1).add_(grad, alpha=1.0 - beta1)
KeyboardInterrupt
[email protected]:~/CodeBERT/CodeBERT/codesearch$ 

It stays stuck on Epoch: 0%| | 0/8 [00:00<?, ?it/s] and prints the traceback above when I interrupt it from the keyboard.

The GPU is busy while this happens (I don't know whether that means it is training but just not giving any feedback, as I don't wait very long before interrupting, and the previous poster suggested it isn't); the hardware configuration is attached as an image.

The GPU runs at 66% with batch size = 1 and was at 99% with batch size = 32.

I will try again later with a better GPU and more disk space to keep all 17 batch files, but I would be glad if you could help me, as nothing guarantees that will fix it.

Let me know if I can give you more information to help you figure it out; I check the issues every day.

(If the codesearch script is intrinsically buggy, may I ask you to share a link to an already trained codesearch model so we can download it?)

Thank you, and keep me updated.

Process the dataset

Thanks for opening this repository and providing the dataset to download!
I have the following three questions:

  1. In your paper, you mentioned that you used unimodal code to pretrain the replaced token detection task. Is it possible to download the unimodal code dataset? So far, the data I found in the CodeSearchNet repo is bimodal.
  2. You also provide here the cleaned CodeSearchNet data. Did you use the original CodeSearchNet data for pre-training, and only the cleaned data for the Code Documentation Generation task?
  3. In your paper, you also carried out the task on C# using the CodeNN dataset. How do you tokenize the C# code? In their repository, they replaced parts of the code with tokens like CODE_STRING and CODE_INTEGER. Did you also do such token replacement for C#?

Thanks a lot in advance for answering these questions!

Finetuned models

Do you plan to release the models after fine-tuning?
If yes, could you provide a download link?

If you're unwilling to share them, could you at least specify how long it takes to train on a machine with 4 P40 GPUs?

Some question about the vocabulary size on CodeSearchNet corpus

Hi,
Thanks for your excellent work! It is really interesting.
Recently I have been trying to run some experiments on the CodeSearchNet dataset. After preprocessing, I found the vocabulary is SO LARGE!!
I don't know how to deal with it.
Is there any chance you could let me know how you handled such a large vocabulary?
Thanks a lot.

Issues related to experimentation results

Hi,

I have constructed a new dataset [train.txt, test.txt, valid.txt] with the following format:

1<CODESPLIT>URL<CODESPLIT>returnType.methodName<CODESPLIT>[docString]<CODESPLIT>[code]

I have placed constant values such as "1", "URL", and "returnType.methodName" across the whole dataset.
When I run the following script, I get results such as [acc = 1.0, acc_and_f1 = 1.0, and f1 = 1.0]:

python CodeBERT/codesearch/run_classifier.py \
  --model_type roberta \
  --task_name codesearch \
  --do_train \
  --do_eval \
  --eval_all_checkpoints \
  --train_file train.txt \
  --dev_file valid.txt \
  --max_seq_length 200 \
  --per_gpu_train_batch_size 32 \
  --per_gpu_eval_batch_size 32 \
  --learning_rate 1e-5 \
  --num_train_epochs 8 \
  --gradient_accumulation_steps 1 \
  --overwrite_output_dir \
  --data_dir CodeBERT/data/train_valid\
  --output_dir CodeBERT/models  \
  --model_name_or_path CodeBERT/pretrained_models/pretrained_codebert

The learning rate and loss graphs were attached as images (searchLoss, searchLR).

However, when I run the following two scripts, I get an MRR of 0.0031. I am not sure why. Why is the MRR value so low?

python CodeBERT/codesearch/run_classifier.py \
--model_type roberta \
--model_name_or_path CodeBERT/models \
--task_name codesearch \
--do_predict \
--output_dir CodeBERT/data/train_valid \
--data_dir CodeBERT/data/train_valid \
--max_seq_length 200 \
--per_gpu_train_batch_size 32 \
--per_gpu_eval_batch_size 32 \
--learning_rate 1e-5 \
--num_train_epochs 8 \
--test_file test.txt \
--pred_model_dir CodeBERT/models \
--test_result_dir CodeBERT/results/result.txt

python CodeBERT/codesearch/mrr.py

Secondly, does Table 2 in the paper report the MRR values generated by the above scripts?

Finally, what is the difference between the jsonl and text file format data? I guess the jsonl files are used in the document generation experiments? For this purpose, I constructed jsonl files containing the same data in jsonl format, as follows; only code_tokens and docstring_tokens contain the token lists of the code snippet and the natural language description. Is this the right approach?

`{"repo": "", "path": "", "func_name": "", "original_string": "", "language": "lang", "code": "", "code_tokens": [], "docstring": "", "docstring_tokens": [], "sha": "", "url": "", "partition": ""}

Kindly, let me know about my concerns.
`
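
For illustration, one way to write such a jsonl record in Python (a sketch; only code_tokens and docstring_tokens carry real content here, matching the approach described above, and nothing has been verified against the repo's data loaders):

import json

record = {
    "repo": "", "path": "", "func_name": "", "original_string": "", "language": "python",
    "code": "def max(a, b): return a if a > b else b",
    "code_tokens": ["def", "max", "(", "a", ",", "b", ")", ":", "return", "a", "if", "a", ">", "b", "else", "b"],
    "docstring": "return maximum value",
    "docstring_tokens": ["return", "maximum", "value"],
    "sha": "", "url": "", "partition": "train",
}
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")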

How could I further pre-train CodeBERT?

Hi,

How do I go about getting started with further pre-training CodeBERT on my domain-specific code?

Is there anything I should look out for?

Can I use the traditional pre-training pipeline for CodeBERT?

Is pre-training with MLM enough, or should I also do Replaced Token Detection as part of the further pre-training?

Any links or suggested reading materials would be hugely appreciated. :D
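
For what it's worth, a minimal sketch of continued MLM pre-training with the transformers Trainer, starting from the public MLM checkpoint (file paths and hyperparameters are placeholders, and whether RTD is also needed is a separate question):

from transformers import (RobertaTokenizer, RobertaForMaskedLM, Trainer, TrainingArguments,
                          DataCollatorForLanguageModeling, LineByLineTextDataset)

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base-mlm")
model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base-mlm")

# One code snippet (or NL-PL pair) per line in a plain-text file; the path is hypothetical.
dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path="domain_code.txt", block_size=256)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(output_dir="./codebert-domain", num_train_epochs=1,
                         per_device_train_batch_size=8, save_steps=1000)
Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset).train()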

Training and validation datasets

Do you plan to publish the generator code for the training and validation datasets?
It seems that you created 1 positive record for even records in CodeSearchNet, and 2 negative and 1 positive for odd records in CodeSearchNet. Am I right?

Model parameters

I would like to ask about the parameter settings of GraphCodeBERT, such as the number of Transformer layers, the hidden size and the number of self-attention heads.

Model Parameters

What are the parameter settings of CodeBERT? (the number of layers, the hidden size, and the number of self-attention heads)
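
Both can be read directly from the published configs; a quick way to check (the printed values are simply whatever each checkpoint's config.json contains):

from transformers import RobertaConfig

config = RobertaConfig.from_pretrained("microsoft/codebert-base")
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)

# GraphCodeBERT can be inspected the same way:
config = RobertaConfig.from_pretrained("microsoft/graphcodebert-base")
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)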
