Giter Site home page Giter Site logo

clonedetection's Introduction

Clone Detection

Task Definition

Given two codes as the input, the task is to do binary classification (0/1), where 1 stands for semantic equivalence and 0 for others. Models are evaluated by F1 score.

Updates

2021-9-13: We have update the evaluater script. Since it's a binary classification, we use binary F1 score instead of "macro" F1 score.

Dataset

The dataset we use is BigCloneBench and filtered following the paper Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree.

Data Format

  1. dataset/data.jsonl is stored in jsonlines format. Each line in the uncompressed file represents one function. One row is illustrated below.

    • func: the function

    • idx: index of the example

  2. train.txt/valid.txt/test.txt provide examples, stored in the following format: idx1 idx2 label

Data Statistics

Data statistics of the dataset are shown in the below table:

#Examples
Train 901,028
Dev 415,416
Test 415,416

You can get data using the following command.

unzip dataset.zip

Evaluator

We provide a script to evaluate predictions for this task, and report F1 score

Example

python evaluator/evaluator.py -a evaluator/answers.txt -p evaluator/predictions.txt

{'Recall': 0.25, 'Prediction': 0.5, 'F1': 0.3333333333333333}

Input predictions

A predications file that has predictions in TXT format, such as evaluator/predictions.txt. For example:

13653451	21955002	0
1188160	8831513	1
1141235	14322332	0
16765164	17526811	1

Pipeline-GraphCodeBERT

We also provide a pipeline that fine-tunes GraphCodeBERT on this task.

Dependency

  • pip install torch
  • pip install transformers
  • pip install tree_sitter
  • pip sklearn

Tree-sitter (optional)

If the built file "parser/my-languages.so" doesn't work for you, please rebuild as the following command:

cd parser
bash build.sh
cd ..

Fine-tune

We use 4*V100-16G to fine-tune and 10% valid data to evaluate.

mkdir saved_models
python run.py \
    --output_dir=saved_models \
    --config_name=microsoft/graphcodebert-base \
    --model_name_or_path=microsoft/graphcodebert-base \
    --tokenizer_name=microsoft/graphcodebert-base \
    --do_train \
    --train_data_file=dataset/train.txt \
    --eval_data_file=dataset/valid.txt \
    --test_data_file=dataset/test.txt \
    --epoch 1 \
    --code_length 512 \
    --data_flow_length 128 \
    --train_batch_size 16 \
    --eval_batch_size 32 \
    --learning_rate 2e-5 \
    --max_grad_norm 1.0 \
    --evaluate_during_training \
    --seed 123456 2>&1| tee saved_models/train.log

Inference

We use full test data for inference.

python run.py \
    --output_dir=saved_models \
    --config_name=microsoft/graphcodebert-base \
    --model_name_or_path=microsoft/graphcodebert-base \
    --tokenizer_name=microsoft/graphcodebert-base \
    --do_eval \
    --do_test \
    --train_data_file=dataset/train.txt \
    --eval_data_file=dataset/valid.txt \
    --test_data_file=dataset/test.txt \
    --epoch 1 \
    --code_length 512 \
    --data_flow_length 128 \
    --train_batch_size 16 \
    --eval_batch_size 32 \
    --learning_rate 2e-5 \
    --max_grad_norm 1.0 \
    --evaluate_during_training \
    --seed 123456 2>&1| tee saved_models/test.log

Evaluation

python evaluator/evaluator.py -a dataset/test.txt -p saved_models/predictions.txt 2>&1| tee saved_models/score.log

Result

The results on the test set are shown as below:

Method Precision Recall F1
Deckard 0.93 0.02 0.03
RtvNN 0.95 0.01 0.01
CDLH 0.92 0.74 0.82
ASTNN 0.92 0.94 0.93
FA-AST-GMN 0.96 0.94 0.95
CodeBERT 0.947 0.934 0.941
GraphCodeBERT 0.948 0.952 0.950

clonedetection's People

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.