Comments (5)
@yuewang-cuhk
The original dataset for SourceSum is from https://github.com/sriniiyer/codenn/tree/master/data/stackoverflow
For the training data, "_silvia.tsv" is the correct one after our preprocessing.
For the test data, the CodeNN group provided human annotations for around 100 records each for C# and SQL. For example, you can find the SQL ones here: https://github.com/sriniiyer/codenn/tree/master/data/stackoverflow/sql/eval. As described in their paper, the BLEU score is calculated over the records that have both the human-annotated summaries and the text from Stack Overflow. We followed their steps and evaluated only these ~100 records, all of which are contained in our test tsv files. This is mentioned on page 4 of our paper:
> Iyer et al. (2016) asked human annotators to provide two additional titles for 200 randomly chosen code snippets from the validation and test set for SQL and CSharp code. We followed their preprocessing methods and evaluation using the test dataset annotated by human annotators.
from codetrans.
Hi @yuewang-cuhk ,
Thanks for your interest in our work.
In our research, we used the original T5 inference function to make predictions:
https://github.com/google-research/text-to-text-transfer-transformer#decode
Afterward, we used the CodeBERT smoothed BLEU score function to calculate the results:
https://github.com/microsoft/CodeBERT/tree/master/CodeBERT/code2nl
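The CodeBERT evaluation script computes a smoothed sentence-level BLEU. As a rough illustration of the idea only (not the exact script), here is a minimal pure-Python sketch that applies Lin & Och style add-one smoothing to the higher-order n-gram precisions; the function names are ours:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def smoothed_bleu(reference, hypothesis, max_n=4):
    """Sentence-level BLEU with add-one smoothing for n >= 2 (sketch)."""
    ref, hyp = reference.split(), hypothesis.split()
    precisions = []
    for n in range(1, max_n + 1):
        ref_counts, hyp_counts = ngrams(ref, n), ngrams(hyp, n)
        match = sum((ref_counts & hyp_counts).values())  # clipped matches
        total = sum(hyp_counts.values())
        if n == 1:
            p = match / total if total else 0.0
        else:
            p = (match + 1) / (total + 1)  # Lin & Och add-one smoothing
        precisions.append(p)
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # brevity penalty for hypotheses shorter than the reference
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(log_avg)
```

For matching reference and hypothesis this returns 1.0; shorter or partially matching hypotheses score lower.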
Due to the complexity of T5 and its slow inference speed, we decided to convert all our models to the Hugging Face library, which is much faster and easier for researchers.
The difference in the smoothed BLEU results comes from the beam search configuration used in the T5 library compared to the Hugging Face library.
In T5, they used a beam size of 4 and a decode alpha of 0.6:
https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/models/gin/beam_search.gin
To approximately match the same configuration in Hugging Face, you have to adjust the beam search settings as follows:
```python
preds = pipeline(tokenized_input,
                 min_length=1,
                 max_length=1024,
                 num_beams=4,
                 temperature=0,
                 length_penalty=0.6)
```
Here are the expected results of T5 vs. Hugging Face for JavaScript Code Documentation Generation:
Library/Model | Small | Base | Large |
---|---|---|---|
T5 with beam search | 17.23 | 18.25 | 18.98 |
HuggingFace without beam search | 15.8 | 16.96 | 17.67 |
HuggingFace with beam search | 17.1 | 18.13 | 18.94 |
T5 - HuggingFace difference (using beam search) | 0.13 | 0.12 | 0.04 |
As you can see, with the correct beam search configuration you can approximately match the T5 results on Hugging Face.
The small, insignificant remaining difference between the T5 and Hugging Face results is due to their different beam search implementations.
For example, Hugging Face calculates the length penalty differently than T5 (which is based on Mesh TensorFlow):
https://github.com/huggingface/transformers/blob/996a315e76f6c972c854990e6114226a91bc0a90/src/transformers/generation_beam_search.py#L368
https://github.com/tensorflow/mesh/blob/985151bc4e787be3c99174d0d0eee743a4cb8561/mesh_tensorflow/beam_search.py#L261
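To make the difference concrete, here is a small sketch of the two normalization formulas as I read them from the linked code (the helper names are ours, and the formulas are paraphrased, not copied): transformers divides the summed log-probabilities by `length ** length_penalty`, while Mesh TensorFlow uses a GNMT-style penalty `((5 + length) / 6) ** alpha`.

```python
def hf_length_penalty(length, length_penalty):
    """Hugging Face style: beam score = sum_logprobs / length ** length_penalty."""
    return length ** length_penalty

def mesh_tf_length_penalty(length, alpha):
    """Mesh TensorFlow / GNMT style: beam score = sum_logprobs / ((5 + len) / 6) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha
```

For the same `length_penalty = alpha = 0.6`, the two divisors diverge as sequences get longer, which is why the scores only approximately match even with identical beam sizes.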
I have created three Colab examples that should reproduce the above results:
https://colab.research.google.com/drive/10PwFRsY8P2uMc3SGr7WRgqQXFxjzbj83?usp=sharing
https://colab.research.google.com/drive/1vc84NthgeLNLxOH6eUqbh_5UIuD-Mh4s?usp=sharing
https://colab.research.google.com/drive/1YvXt5vYL6HJDPW37tWv9f_r-p-TJfqLs?usp=sharing
By simply following the above examples for the rest of the languages/models, you should be able to reproduce our results.
Regarding preprocessing, you don't need to tokenize the source code with tree_sitter for the CodeBERT dataset because it is already preprocessed. You only need to do so for a new example that you want to predict on.
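For a new, unpreprocessed snippet, the tokenization step just turns raw source code into a space-separated token sequence before it is fed to the model. CodeTrans uses tree_sitter parsers for this step; purely as a stand-in illustration, and for Python code only, the standard library `tokenize` module produces a similar space-separated form:

```python
import io
import tokenize

def tokenize_python_snippet(code):
    """Split a Python snippet into space-separated tokens.

    Illustration only: CodeTrans uses tree_sitter parsers for this step
    across languages; this stdlib-based version works for Python code.
    """
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        # drop layout/bookkeeping tokens and comments, keep code tokens
        if tok.type in (tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
                        tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT):
            continue
        tokens.append(tok.string)
    return " ".join(tokens)
```

For example, `tokenize_python_snippet("def add(a, b): return a + b")` yields `"def add ( a , b ) : return a + b"`, which matches the space-separated style of the preprocessed dataset.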
I hope the above explanation answers your questions.
Out of curiosity, why are you reproducing our results?
Are you planning to use them internally at Salesforce, preparing a new publication, or something else?
Hi @agemagician, many thanks for your quick and detailed response! We have been able to reproduce the code documentation generation tasks following your instructions. We are planning to compare against CodeTrans in our new publication.
By the way, we can only find the provided training set, but not the dev and test sets. Could you also kindly share the tokenized dev and test datasets to facilitate an easy comparison with CodeTrans on all downstream tasks? Thanks in advance!
You are welcome!
Sure, we have updated the README with the dataset links:
https://www.dropbox.com/sh/mzxa2dq30gnot29/AABIf7wPxH5Oe0PZHJ5jPV22a?dl=0
Feel free to send me an email or a LinkedIn message if you want to discuss the new publication.
My co-author @matchlesswei and I will be happy to discuss it.
Hi @agemagician, thanks for sharing these datasets. I've checked them and confirmed that most of them match the data statistics in the paper, except for the "SourceSum" task, where only the training set (files ending with "_silvia.tsv") has the matching size. Could you help check that? The line counts for all files in the "SourceSum" folder are listed below:
```
 6252 testC#
 6629 testCS_silvia.tsv
 2662 testPython
 2659 testPython.txt
 2783 testPython_silvia.tsv
 2932 testSQL
 3340 testSQL_silvia.tsv
49801 trainC#
52943 trainCS_silvia.tsv
11461 trainPython
11458 trainPython.txt
12004 trainPython_silvia.tsv
22492 trainSQL
25671 trainSQL_silvia.tsv
 6241 valC#
 2647 valPython
 2651 valPython.txt
 2858 valSQL
```