zhangj111 / astnn
License: MIT License
Dear @zhangj111,
I tried to run the Code-Clone-Detection task using python pipeline.py --lang c, but an error occurred: FileNotFoundError: [WinError 3] The system cannot find the path specified: 'datac/train/'.
The correct path should be data/c/train/ instead of datac/train/, which is caused by line 86 of the file astnn/clone/pipeline.py:
Line 86 in 4e86c90
data_path = self.root+'/'+self.language+'/'
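For illustration, a hedged sketch (not the repository's code) of building this path with os.path.join, so the separator cannot be dropped regardless of whether root carries a trailing slash:

import os

# Hypothetical values standing in for self.root and self.language.
root, language = 'data', 'c'

# 'data' + 'c' would silently yield 'datac/...'; os.path.join keeps the
# separator explicit.
data_path = os.path.join(root, language, 'train')
print(data_path)  # data/c/train (data\c\train on Windows)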
It executed fine the first time, but when I ran the command again while training the model from scratch, I got the above error.
Hello, when I execute your code, the line 'from tree import ASTNode, BlockNode' in util.py is marked in red, and there is no BlockNode in the tree module.
I previously had PyTorch 0.3.1 installed, and after cloning the code it ran fine. Do I now need to upgrade PyTorch to 1.0.0? Is that right?
What is the meaning of label in this file? Thank you.
Hi,
I would like to decouple the encoder (BatchTreeEncoder and the GRU) part of the code from the classifier. Is this possible? I tried doing it but could not figure out what the output should be so that the decoder part of the neural network decodes properly. The loss function might also need to change. I flattened the input to a list of integers, created a hash, and converted the input to a fixed-length list of integers to serve as a comparison between the output and the input. I also changed the neural network architecture to reflect this instead of the labels, and used MSE as the loss function, but I am not getting a good score. Any ideas to help me here? I would then like to use the encoded output for clustering.
Thanks and Regards,
Jyothi
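A minimal sketch of one way to obtain fixed-size code vectors for clustering, assuming the pooled bi-GRU output is the representation of interest (ToyEncoder here is a hypothetical stand-in, not the repository's class):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
    # Stand-in for BatchTreeEncoder + bi-GRU: maps a padded sequence of
    # statement vectors to one fixed-size code vector via max pooling.
    def __init__(self, in_dim=128, hidden=100):
        super().__init__()
        self.bigru = nn.GRU(in_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):                 # x: (batch, seq_len, in_dim)
        out, _ = self.bigru(x)            # (batch, seq_len, 2*hidden)
        out = out.transpose(1, 2)         # (batch, 2*hidden, seq_len)
        return F.max_pool1d(out, out.size(2)).squeeze(2)  # (batch, 2*hidden)

enc = ToyEncoder()
vecs = enc(torch.randn(4, 10, 128))       # four code samples -> four 200-d vectors
print(vecs.shape)                         # torch.Size([4, 200])
# vecs can then be passed to a clustering algorithm such as k-means.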
Dear writer,
I am trying to use my own dataset with your clone code, and I see from your bcb_pair_ids.pkl that the labels are not only 0 and 1 but also 2, 3, 4, and 5. Could you please explain why there are more than 2 categories? From my perspective, the label should stand only for "cloned" and "not cloned". Your help would be highly appreciated.
Hello,
I wonder whether this work uses character-level or token-level word embeddings from Word2Vec.
I may have found an error on line 76 in the file 'pipeline.py': this operation makes Word2Vec learn character-level embeddings instead of token-based embeddings.
This then causes an error on line 95 in the function 'tree_to_index'. If I understand correctly, this should turn the token in each AST node into its embedding index. However, all tokens are word-level and are not in w2v.wv.vocab at all, so all tokens on AST nodes except some leaf tokens are treated as 'max_token' in the else branch.
I may have misunderstood this file, or the problem is caused by gensim versions.
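To illustrate the suspicion, a small sketch (assuming gensim < 4.0, where Word2Vec takes size and exposes wv.vocab, matching the repo's usage): passing raw strings makes gensim iterate over characters, while passing token lists learns token-level vectors:

from gensim.models.word2vec import Word2Vec

code = "int a = b + 1 ;"

# A string is itself an iterable of characters, so each "word" is one char.
char_model = Word2Vec([code], size=16, min_count=1)
print(sorted(char_model.wv.vocab))   # single characters: ' ', '+', '1', ...

# A list of token lists yields one vector per code token, as intended.
tok_model = Word2Vec([code.split()], size=16, min_count=1)
print(sorted(tok_model.wv.vocab))    # ['+', '1', ';', '=', 'a', 'b', 'int']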
import javalang
from clone.utils import get_blocks_v1

func2 = """
public int test2(int a){
    if(a>3){
        try{
            if(a>10){
                a = 9;
            }
        }catch(Exception e){
            a = 10;
        }
    }
    return a;
}
"""

def tree_to_index(node):
    token = node.token
    # result = [vocab[token].index if token in vocab else max_token]
    result = [token]
    children = node.children
    for child in children:
        result.append(tree_to_index(child))
    return result

def trans2seq(r):
    blocks = []
    get_blocks_v1(r, blocks)
    tree = []
    for b in blocks:
        btree = tree_to_index(b)
        tree.append(btree)
    return tree

tokens = javalang.tokenizer.tokenize(func2)
parser = javalang.parser.Parser(tokens)
tree = parser.parse_member_declaration()
seq = trans2seq(tree)
print(seq)
The output is:
[['MethodDeclaration', ['Modifier', ['public']], ['BasicType', ['int']], ['test2'], ['FormalParameter', ['BasicType', ['int']], ['a']]], ['IfStatement', ['BinaryOperation', ['>'], ['MemberReference', ['a']], ['Literal', ['3']]]], ['BlockStatement'], ['TryStatement', ['CatchClause', ['CatchClauseParameter', ['Exception'], ['e']], ['StatementExpression', ['Assignment', ['MemberReference', ['a']], ['Literal', ['10']], ['=']]]]], ['End'], ['ReturnStatement', ['MemberReference', ['a']]]]
The statement if(a>10){ a = 9; } has disappeared from the output.
Hi,
Thanks for your great paper and for open-sourcing this repo. The paper is really inspiring, and I've learnt a lot about how to handle RvNN-style structures in torch from your nice implementation.
I am recently playing with the 'poj104' dataset, the one you use for the code classification task. In your workflow it is loaded directly from a pickle file, and each code sample is parsed into an AST by parser.parse (lines 20-25 in pipeline.py):
from pycparser import c_parser
parser = c_parser.CParser()
source = pd.read_pickle(self.root+'programs.pkl')
source.columns = ['id', 'code', 'label']
source['code'] = source['code'].apply(parser.parse)
However, I find that pycparser CAN'T DIRECTLY parse the code text read from poj104. Could you please tell me how you preprocessed the code text into the programs.pkl file? (Yes, I'm a newbie to both DL and pycparser...)
Thanks in advance!
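Not the authors' actual preprocessing, but one common workaround (a sketch under that assumption): pycparser only accepts preprocessed C, so stripping preprocessor directives such as #include before parsing often gets simple programs through:

import re
from pycparser import c_parser

raw = """#include <stdio.h>
int main() { int a = 1; return a; }
"""

# Drop preprocessor lines; real POJ-104 code may need further cleanup.
cleaned = re.sub(r'^\s*#.*$', '', raw, flags=re.MULTILINE)
ast = c_parser.CParser().parse(cleaned)
ast.show()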
It seems the pandas version is wrong; pinning pandas==0.24.0 fixed this problem for me. Hope this helps others.
Hello, could you send me a copy of the slides (PPT) for this paper? Thank you.
Hi, I am using a custom dataset where the ASTs are significantly larger and get the following message (this is on a V100 with 32GB memory!)
RuntimeError: CUDA out of memory. Tried to allocate 11.69 GiB (GPU 0; 31.75 GiB total capacity; 11.88 GiB already allocated; 7.13 GiB free; 11.71 GiB cached)
Is there anything I can do after already having set BATCH_SIZE = 1?
Hello,
I am trying to use ASTNN for our research purposes, but we are facing issues when training the model on our own dataset. Since we are trying to train it on an entire repository's code, we get errors when feeding the code in, because the data needs to be preprocessed to meet pycparser's requirements, as mentioned on their page.
We can't preprocess it by hand, file by file, so we are unable to run this on a repository.
Can you please explain step by step how we can run the astnn tool on our dataset (especially when running on a repository, with lots of header files and so on)?
Looking forward to your reply.
I have noticed that the EMBEDDING_DIM variable is passed through BatchProgramClassifier() and BatchTreeEncoder() but is not used when defining the dimensions of the matrix (in this case batch_current) that is passed as input to self.W_c (line 59 of model.py). Instead, the ENCODE_DIM value is used to define the shape of batch_current, and this happens to work because in the configuration you used both are 128. However, were you to change EMBEDDING_DIM, it would throw an error.
A potential solution would be to use a separate matrix in place of batch_current in the first part of traverse_mul() up to line 59, sized with self.embedding_dim (you'd need to create this variable too) instead of self.encode_dim, and then initialise batch_current at line 59, as sketched below.
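A runnable sketch of the mismatch and the proposed fix, with hypothetical dimensions chosen so the two differ (the real code uses self.embedding, self.W_c, and index_copy inside traverse_mul):

import torch
import torch.nn as nn

EMBEDDING_DIM, ENCODE_DIM, VOCAB_SIZE, NUM_NODES = 64, 128, 1000, 5

embedding = nn.Embedding(VOCAB_SIZE, EMBEDDING_DIM)
W_c = nn.Linear(EMBEDDING_DIM, ENCODE_DIM)
node_ids = torch.randint(0, VOCAB_SIZE, (NUM_NODES,))

# Buggy pattern: a buffer sized with ENCODE_DIM receives EMBEDDING_DIM rows,
# which only works when the two dimensions happen to be equal (both 128):
# buf = torch.zeros(NUM_NODES, ENCODE_DIM)
# buf.index_copy_(0, torch.arange(NUM_NODES), embedding(node_ids))  # RuntimeError

# Proposed fix: size the pre-projection buffer with EMBEDDING_DIM, then project.
buf = torch.zeros(NUM_NODES, EMBEDDING_DIM)
buf.index_copy_(0, torch.arange(NUM_NODES), embedding(node_ids))
batch_current = W_c(buf)
print(batch_current.shape)  # torch.Size([5, 128])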
ASTNN didn't work when I used it on erroneous Java code.
I request that you modify ASTNN to also work on code containing errors.
Also, would you add functionality to ASTNN that takes a code file and creates a mapping from each code statement to its corresponding vector? That would be a great help, and more people would be able to use your library for their research, rather than only the small segment who are aiming at source code classification and code clone detection.
We are trying to apply this model to our own dataset, but how to process our data to fit this model has been a big problem for us.
I would like to know how to process our data into the correct format.
We would appreciate it greatly if you are willing to offer some help.
Best wishes.
Hello author, due to my limited ability I could not find the code that splits the AST into statement subtrees; please advise. Also, you ran comparison experiments in the paper with AST-Full, AST-Block, and AST-Node; if convenient, could you share the code for that part? Looking forward to your answer, thanks!
Start training...
[Epoch: 0/ 15] Training Loss: 2.3469, Validation Loss: 0.8280, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 107.420 s
[Epoch: 1/ 15] Training Loss: 0.4703, Validation Loss: 0.3033, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 107.467 s
[Epoch: 2/ 15] Training Loss: 0.1991, Validation Loss: 0.1727, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 107.485 s
[Epoch: 3/ 15] Training Loss: 0.1193, Validation Loss: 0.1204, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 107.439 s
[Epoch: 4/ 15] Training Loss: 0.0837, Validation Loss: 0.1022, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 107.594 s
[Epoch: 5/ 15] Training Loss: 0.0635, Validation Loss: 0.0876, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 107.626 s
[Epoch: 6/ 15] Training Loss: 0.0508, Validation Loss: 0.0778, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 108.131 s
[Epoch: 7/ 15] Training Loss: 0.0422, Validation Loss: 0.0720, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 108.008 s
[Epoch: 8/ 15] Training Loss: 0.0356, Validation Loss: 0.0659, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 107.627 s
[Epoch: 9/ 15] Training Loss: 0.0304, Validation Loss: 0.0636, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 107.660 s
[Epoch: 10/ 15] Training Loss: 0.0262, Validation Loss: 0.0610, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 107.398 s
[Epoch: 11/ 15] Training Loss: 0.0228, Validation Loss: 0.0577, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 106.979 s
[Epoch: 12/ 15] Training Loss: 0.0200, Validation Loss: 0.0586, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 106.910 s
[Epoch: 13/ 15] Training Loss: 0.0178, Validation Loss: 0.0554, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 107.089 s
[Epoch: 14/ 15] Training Loss: 0.0158, Validation Loss: 0.0549, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 107.778 s
Testing results(Acc): tensor(0, device='cuda:0')
What is the reason for this? Does it have to be torch 1.0.0?
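Not a confirmed diagnosis of this log, but one hypothetical pitfall worth checking: on older PyTorch, dividing an integer tensor by an int truncates to zero, which can make a correctly-trained model report 0.000 accuracy:

import torch

correct = torch.tensor(57)      # e.g. (predicted == labels).sum()
total = 64

print(correct / total)          # older torch: tensor(0) (integer division)
print(correct.item() / total)   # 0.890625 -- convert before dividing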
Hi,
As you explained, I have installed your model and trained it with the "ast.pkl" file.
I was wondering if you could help me understand how to pass a code snippet to your trained model and get its embedding vector?
(The code snippet is stored in a string variable.)
Thank you so much for your help in advance.
I could be wrong, but it looks like you're not performing a deep copy when you copy model to best_model and vice versa (train.py: lines 58, 117, and 127), meaning they both point at the same memory location and both get updated during optimization. In that case you would never return to an earlier model state that might have had a higher accuracy, only ever the latest. I'm not sure if this is what you intended?
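A small sketch of the aliasing this describes, using a toy module (not the repo's model): plain assignment shares the parameters, while copy.deepcopy snapshots them:

import copy
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
alias = model                    # plain assignment: both names -> one module
snapshot = copy.deepcopy(model)  # independent copy of the parameters

with torch.no_grad():
    model.weight.add_(1.0)       # simulate an optimization step

print(torch.equal(alias.weight, model.weight))     # True: alias tracked the update
print(torch.equal(snapshot.weight, model.weight))  # False: snapshot kept old state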
Hello @zhangj111, you did incredible work with this model. I ran it on a free GPU and the results were really satisfactory. Having read the paper, I want to understand how the model works in depth. I am having a hard time figuring out how the encoder module works, especially the traverse_mul method.
Could you please explain what this piece of code does exactly?
Lines 39 to 54 in edd14c9
Thanks.
In your astnn library I am trying to generate vectors for the statements of each code file, but the number of statements in a code file does not equal the number of vectors generated.
I used your astnn code in Colab, doing exactly what the file does up to the vector-generation part (up to encodes = encodes.view(self.batch_size, max_len, -1) only):
https://colab.research.google.com/drive/15FC9I4D0MRTjhV4hlDpgrZrNGC_WyzeM?usp=sharing
When I reached clone/pipeline.py: from prepare_data import get_sequences as func, I got this error.
What can I do?
In model.py, I notice that the method 'pack_padded_sequence' requires its input to be sorted by length in decreasing order ("The sequences should be sorted by length in a decreasing order"); otherwise the code raises RuntimeError: 'lengths' array has to be sorted in decreasing order.
I revised the forward function to solve this issue as below:
def forward(self, x):
    lens = [len(item) for item in x]
    max_len = max(lens)
    encodes = []
    for i in range(self.batch_size):
        for j in range(lens[i]):
            encodes.append(x[i][j])
    encodes = self.encoder(encodes, sum(lens))
    seq, start, end = [], 0, 0
    for i in range(self.batch_size):
        end += lens[i]
        seq.append(encodes[start:end])
        if max_len - lens[i]:
            seq.append(self.get_zeros(max_len - lens[i]))
        start = end
    encodes = torch.cat(seq)
    encodes = encodes.view(self.batch_size, max_len, -1)
    lens = torch.LongTensor(lens)
    lens, perm_idx = lens.sort(0, descending=True)
    encodes = encodes[perm_idx]
    _, unperm_idx = perm_idx.sort(0, descending=False)
    encodes = nn.utils.rnn.pack_padded_sequence(encodes, lens, True)
    # gru
    gru_out, _ = self.bigru(encodes, self.hidden)
    gru_out, _ = nn.utils.rnn.pad_packed_sequence(gru_out, batch_first=True, padding_value=-1e9)
    gru_out = gru_out[unperm_idx]
    # print(gru_out.shape)
    gru_out = torch.transpose(gru_out, 1, 2)
    # pooling
    gru_out = F.max_pool1d(gru_out, gru_out.size(2)).squeeze(2)
    # gru_out = gru_out[:,-1]
    # linear
    y = self.hidden2label(gru_out)
    return y
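For reference, a standalone sketch of the sort/unsort idiom used above (hypothetical shapes; the padding and GRU sizes are illustrative only):

import torch
import torch.nn as nn

lens = torch.LongTensor([2, 5, 3])
x = torch.randn(3, 5, 4)                    # (batch, max_len, features), padded

lens_sorted, perm = lens.sort(0, descending=True)
_, unperm = perm.sort(0)

packed = nn.utils.rnn.pack_padded_sequence(x[perm], lens_sorted, batch_first=True)
out, _ = nn.GRU(4, 8, batch_first=True)(packed)
out, _ = nn.utils.rnn.pad_packed_sequence(out, batch_first=True)
out = out[unperm]                           # restore the original batch order
print(out.shape)                            # torch.Size([3, 5, 8])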