Program classification assigns newly added source files to categories according to their functionality in the software development process.
We use the OJ datasets: OJ-Data-1, OJ-Data-2, and OJ-All.
- resources/dataset/OJ-Data-*/programs.pkl is stored in pickle format. Each row in this file represents one function and its label. A row has the following fields:
  - id: index of the example
  - Code: the code fragment
  - label: the label index of the example

Data statistics of the datasets are shown in the table below:
Dataset | #Examples | #Program Tasks |
---|---|---|
OJ-Data-1 | 52,000 | 104 |
OJ-Data-2 | 52,000 | 104 |
OJ-All | 104,000 | 208 |
You can load the data with the following code.
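import os
import pandas as pd
# Path to the pickled dataset, relative to the current working directory
data_path = '../resources/dataset/OJ-All/programs.pkl'
# Load the data only if the file is present
if os.path.exists(data_path):
    data = pd.read_pickle(data_path)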
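Once loaded, the data can be inspected like any pandas DataFrame. The sketch below is a minimal illustration, assuming that `programs.pkl` unpickles to a DataFrame with the columns `id`, `Code`, and `label`. It uses a small hand-made stand-in (the rows are hypothetical) so that it runs without the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the real data. The column names follow the
# fields described for programs.pkl and are an assumption about the
# pickled DataFrame.
data = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "Code": [
        "int main() { return 0; }",
        "int main() { int a = 0; return a; }",
        "int main() { int b = 1; return b; }",
        "int main() { return 1; }",
    ],
    "label": [1, 1, 2, 2],
})

num_examples = len(data)                 # number of rows (programs)
num_tasks = data["label"].nunique()      # number of distinct labels
per_task = data["label"].value_counts()  # examples per label

print(f"{num_examples} examples, {num_tasks} program tasks")
print(per_task.sort_index())
```

Running the same three computations on the real `data` object should make it possible to compare the counts against the statistics table.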
We also provide a pipeline that generates inputs for our model on this task. The following Python packages are required:
- gensim
- networkx
- dgl
- nltk
- numpy
- pandas
- scikit_learn
- torch
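Before running the pipeline, it can help to check which of the listed packages are installed. The following is a minimal sketch using only the standard library. The mapping from package name to import name is an assumption (in particular, that `scikit_learn` is imported as `sklearn`), and `missing_packages` is a hypothetical helper, not part of this repository.

```python
import importlib.util

# Map each listed package to the module name used in `import` statements.
# Assumption: scikit_learn is imported as `sklearn`; the other packages
# are imported under their own names.
REQUIRED = {
    "gensim": "gensim",
    "networkx": "networkx",
    "dgl": "dgl",
    "nltk": "nltk",
    "numpy": "numpy",
    "pandas": "pandas",
    "scikit_learn": "sklearn",
    "torch": "torch",
}


def missing_packages(required):
    """Return the package names whose import module cannot be found."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

The check only confirms that each module can be located. It does not verify versions or that the import succeeds.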
If the prebuilt file "parser/my-languages.so" does not work for you, rebuild it with the following commands:
cd parser1
bash build.sh
cd ..
Then run the pipeline:
python DataProcess/Pipline.py
We provide a script that trains and evaluates our model on this task and reports the accuracy score:
python Entry/train_graph_lstm.py
Result on OJ-All:
[Epoch: 100/100] Train Loss: 0.0526, Val Loss: 0.0540, Train result: 0.9773022361144506, Test result: 0.9660623692625808
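Accuracy is taken here to mean the fraction of examples whose predicted label equals the true label. The sketch below illustrates that definition on plain Python lists. `accuracy` is a hypothetical helper written for this illustration and is not taken from the training script.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("accuracy is undefined for empty inputs")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Three of the four hypothetical predictions match the true labels.
print(accuracy([3, 1, 2, 2], [3, 1, 0, 2]))
```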