
Comments (5)

matchlesswei commented on September 24, 2024

@yuewang-cuhk
The original dataset for SourceSum is from https://github.com/sriniiyer/codenn/tree/master/data/stackoverflow
For the training data, the files ending with "_silvia.tsv" are the correct ones after our preprocessing.

For the testing data, the CodeNN group provided human annotations for around 100 records each for C# and SQL. For example, you can find the SQL ones here: https://github.com/sriniiyer/codenn/tree/master/data/stackoverflow/sql/eval. As described in their paper, the BLEU score is calculated only for records that have both the human-annotated summaries and the text from StackOverflow. We followed their steps and evaluated only these roughly 100 records, all of which are contained in our test tsv files. This is mentioned in our paper on page 4:

Iyer et al. (2016) asked human annotators to provide two additional titles for 200 randomly chosen code snippets from the validation and test set for SQL and CSharp code. We followed their preprocessing methods and evaluation using the test dataset annotated by human annotators.


agemagician commented on September 24, 2024

Hi @yuewang-cuhk ,

Thanks for your interest in our work.

In our research, we used the original T5 inference function to make predictions:
https://github.com/google-research/text-to-text-transfer-transformer#decode
Afterwards, we used the CodeBERT smoothed BLEU scoring function to calculate the results:
https://github.com/microsoft/CodeBERT/tree/master/CodeBERT/code2nl
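
For anyone wanting to sanity-check the metric itself, here is a minimal sketch of a smoothed sentence-level BLEU-4 using NLTK; this is an assumption on my part for illustration only, and the CodeBERT code2nl script linked above should be used for the exact numbers.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Illustration only -- use the CodeBERT code2nl script for the official numbers.
reference = "returns true if the browser is a standard environment".split()
hypothesis = "return true if the browser is standard".split()

# method4 smoothing avoids zero scores when higher-order n-grams have no matches
smooth = SmoothingFunction().method4
score = sentence_bleu([reference], hypothesis, smoothing_function=smooth)
print(f"smoothed BLEU-4: {score:.4f}")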

Due to the complexity of T5 and its slow inference speed, we decided to convert all our models to the Hugging Face library, which is much faster and easier for researchers to use.

The difference in the smoothed BLEU results is due to the beam search configuration used in the T5 library compared to the Hugging Face library.

In T5, they used a beam size of 4 and a decode alpha of 0.6:
https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/models/gin/beam_search.gin

To approximately match the same configuration in Hugging Face, you have to adjust the beam search configuration as follows:

# Beam search settings that approximate the T5 defaults (num_beams=4, alpha=0.6)
preds = pipeline(tokenized_input,
                 min_length=1,
                 max_length=1024,
                 num_beams=4,
                 temperature=0,
                 length_penalty=0.6)
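
For completeness, a minimal sketch of how `pipeline` and `tokenized_input` above could be set up with the transformers library is shown below; the checkpoint name is only an assumption for illustration, so check the Hugging Face hub for the exact CodeTrans model you want to evaluate.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, SummarizationPipeline

# Assumed checkpoint name for illustration -- verify the exact CodeTrans model id on the hub.
model_name = "SEBIS/code_trans_t5_small_code_documentation_generation_javascript"

pipeline = SummarizationPipeline(
    model=AutoModelForSeq2SeqLM.from_pretrained(model_name),
    tokenizer=AutoTokenizer.from_pretrained(model_name),
    device=-1,  # -1 = CPU; set to a GPU index for faster inference
)

# CodeTrans expects the source code pre-tokenized and joined with spaces.
tokenized_input = "function isStandardBrowserEnv ( ) { return true ; }"

With these definitions, the call above decodes with approximately the same beam search behaviour as the T5 library.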

Here are the expected results for T5 vs. Hugging Face on JavaScript Code Documentation Generation:

Library/Model                                    Small   Base    Large
T5 with beam search                              17.23   18.25   18.98
HuggingFace without beam search                  15.8    16.96   17.67
HuggingFace with beam search                     17.1    18.13   18.94
T5 - HuggingFace difference (using beam search)  0.13    0.12    0.04

As you can see, with the correct beam search configuration you can approximately match the T5 results on HuggingFace.
The small, insignificant remaining difference between the T5 and HuggingFace results is due to their different beam search implementations.
For example, HuggingFace calculates the length penalty differently than T5 (which is based on Mesh TensorFlow):
https://github.com/huggingface/transformers/blob/996a315e76f6c972c854990e6114226a91bc0a90/src/transformers/generation_beam_search.py#L368
https://github.com/tensorflow/mesh/blob/985151bc4e787be3c99174d0d0eee743a4cb8561/mesh_tensorflow/beam_search.py#L261
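
Roughly speaking (this is my own sketch of the two formulas, so double-check against the sources linked above), HuggingFace normalizes the sum of log-probabilities by the raw hypothesis length, while Mesh TensorFlow / T5 uses a GNMT-style penalty:

# Rough sketch of the two length penalties; see the linked sources for the authoritative code.
def hf_beam_score(sum_logprobs, length, length_penalty=0.6):
    # HuggingFace BeamHypotheses: divide by length ** length_penalty
    return sum_logprobs / (length ** length_penalty)

def mesh_tf_beam_score(sum_logprobs, length, alpha=0.6):
    # Mesh TensorFlow / T5: divide by the GNMT-style penalty ((5 + length) / 6) ** alpha
    return sum_logprobs / (((5.0 + length) / 6.0) ** alpha)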

I have created three Colab notebooks that should reproduce the above results:
https://colab.research.google.com/drive/10PwFRsY8P2uMc3SGr7WRgqQXFxjzbj83?usp=sharing
https://colab.research.google.com/drive/1vc84NthgeLNLxOH6eUqbh_5UIuD-Mh4s?usp=sharing
https://colab.research.google.com/drive/1YvXt5vYL6HJDPW37tWv9f_r-p-TJfqLs?usp=sharing

By following the above examples for the rest of the languages/models, you should be able to reproduce our results.

Regarding preprocessing, you don't need to tokenize the source code with tree_sitter for the CodeBERT dataset because it is already preprocessed. You only need to do so if you have a new example you want to predict on, as sketched below.
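
If you do need to tokenize a new snippet yourself, a minimal sketch with the py-tree-sitter bindings could look like the following; the grammar path, the language, and the exact API depend on your tree_sitter version, so treat this as an outline rather than our exact preprocessing script.

from tree_sitter import Language, Parser

# Assumes the JavaScript grammar was built into a shared library beforehand, e.g. with
# Language.build_library('build/langs.so', ['vendor/tree-sitter-javascript']).
JS_LANGUAGE = Language('build/langs.so', 'javascript')

parser = Parser()
parser.set_language(JS_LANGUAGE)

code = b"function isStandardBrowserEnv() { return true; }"
tree = parser.parse(code)

def leaf_tokens(node):
    # Collect the leaf tokens of the parse tree in source order.
    if node.child_count == 0:
        return [code[node.start_byte:node.end_byte].decode()]
    tokens = []
    for child in node.children:
        tokens.extend(leaf_tokens(child))
    return tokens

# Join the tokens with spaces to get the same format as the pipeline input above.
tokenized_input = " ".join(leaf_tokens(tree.root_node))
print(tokenized_input)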

I hope the above explanation answers your questions.

Out of curiosity, why are you reproducing our results?
Are you planning to use them internally at Salesforce, preparing for a new publication, or something else?


yuewang-cuhk commented on September 24, 2024

Hi @agemagician, many thanks for your quick and detailed response! We have been able to reproduce the code documentation generation results following your instructions. We are planning to compare against CodeTrans in our new publication.

By the way, we can only find the provided training sets but not the dev and test sets. Could you also kindly share the tokenized dev and test datasets to facilitate an easy comparison with CodeTrans on all downstream tasks? Thanks in advance!


agemagician commented on September 24, 2024

You are welcome 😃
Sure, we have updated the README with the dataset links:
https://www.dropbox.com/sh/mzxa2dq30gnot29/AABIf7wPxH5Oe0PZHJ5jPV22a?dl=0

Feel free to send me an email or a LinkedIn message if you want to discuss the new publication.
My co-author @matchlesswei and I will be happy to discuss it.


yuewang-cuhk commented on September 24, 2024

Hi @agemagician, thanks for sharing these datasets. I've checked them and confirmed that most of them match the data statistics in the paper, except for the "SourceSum" task, where only the training set (ending with "_silvia.tsv") matches the reported size. Could you help check that? I print the record counts for all files in the "SourceSum" folder below (a snippet to reproduce the counts follows the listing):

6252 testC#
6629 testCS_silvia.tsv
2662 testPython
2659 testPython.txt
2783 testPython_silvia.tsv
2932 testSQL
3340 testSQL_silvia.tsv
49801 trainC#
52943 trainCS_silvia.tsv
11461 trainPython
11458 trainPython.txt
12004 trainPython_silvia.tsv
22492 trainSQL
25671 trainSQL_silvia.tsv
6241 valC#
2647 valPython
2651 valPython.txt
2858 valSQL

