zhangj111 / astnn
License: MIT License
Dear @zhangj111,
I tried to run the Code-Clone-Detection task using python pipeline.py --lang c, but an error occurred: FileNotFoundError: [WinError 3] The system cannot find the path specified: 'datac/train/'.
The correct path should be data/c/train/ instead of datac/train/, which is caused by line 86 of the file astnn/clone/pipeline.py:
Line 86 in 4e86c90
data_path = self.root+'/'+self.language+'/'
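For illustration, a hedged sketch (not the repository's code) of building this path with os.path.join, so the separator cannot be dropped regardless of whether root carries a trailing slash:

import os

# Hypothetical values standing in for self.root and self.language.
root, language = 'data', 'c'

# 'data' + 'c' would silently yield 'datac/...'; os.path.join keeps the
# separator explicit.
data_path = os.path.join(root, language, 'train')
print(data_path)  # data/c/train (data\c\train on Windows)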
It executed fine the first time, but when I ran the command again while training the model from scratch, I got the above error.
Hello, when I execute your code, the line 'from tree import ASTNode, BlockNode' in util.py is marked in red, and there is no BlockNode in the tree module.
I previously had PyTorch 0.3.1 installed, and after cloning the code it ran fine. Do I now need to upgrade PyTorch to 1.0.0? Is that right?
What is the meaning of label in this file? Thank you.
Hi,
I would like to decouple the encoder (BatchTreeEncoder and the GRU) part of the code from the classifier. Is this possible? I tried doing it but could not figure out what the output should be so that the decoder part of the neural network decodes properly. The loss function might also need to change. I flattened the input to a list of integers, created a hash, and converted the input to a fixed-length list of integers to serve as a comparison between the output and the input. I also changed the neural network architecture to reflect this instead of the labels, and used MSE as the loss function, but I am not getting a good score. Any ideas to help me here? I would then like to use the encoded output for clustering.
Thanks and Regards,
Jyothi
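A minimal sketch of one way to obtain fixed-size code vectors for clustering, assuming the pooled bi-GRU output is the representation of interest (ToyEncoder here is a hypothetical stand-in, not the repository's class):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
    # Stand-in for BatchTreeEncoder + bi-GRU: maps a padded sequence of
    # statement vectors to one fixed-size code vector via max pooling.
    def __init__(self, in_dim=128, hidden=100):
        super().__init__()
        self.bigru = nn.GRU(in_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):                 # x: (batch, seq_len, in_dim)
        out, _ = self.bigru(x)            # (batch, seq_len, 2*hidden)
        out = out.transpose(1, 2)         # (batch, 2*hidden, seq_len)
        return F.max_pool1d(out, out.size(2)).squeeze(2)  # (batch, 2*hidden)

enc = ToyEncoder()
vecs = enc(torch.randn(4, 10, 128))       # four code samples -> four 200-d vectors
print(vecs.shape)                         # torch.Size([4, 200])
# vecs can then be passed to a clustering algorithm such as k-means.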
Dear writer,
I am trying to use my own dataset with your clone code, and I see from your bcb_pair_ids.pkl that the labels are not only 0 and 1 but also 2, 3, 4, and 5. Could you please explain why there are more than 2 categories? From my perspective, the label should stand only for "cloned" and "not cloned". Your help would be highly appreciated.
Hello,
I wonder whether this work uses character-level or token-level word embeddings from Word2Vec.
I may have found an error on line 76 in the file 'pipeline.py': this operation makes Word2Vec learn character-level embeddings instead of token-based embeddings.
This then causes an error on line 95 in the function 'tree_to_index'. If I understand correctly, this should turn the token in each AST node into its embedding index. However, all tokens are word-level and are not in w2v.wv.vocab at all, so all tokens on AST nodes except some leaf tokens are treated as 'max_token' in the else branch.
I may have misunderstood this file, or the problem is caused by gensim versions.
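To illustrate the suspicion, a small sketch (assuming gensim < 4.0, where Word2Vec takes size and exposes wv.vocab, matching the repo's usage): passing raw strings makes gensim iterate over characters, while passing token lists learns token-level vectors:

from gensim.models.word2vec import Word2Vec

code = "int a = b + 1 ;"

# A string is itself an iterable of characters, so each "word" is one char.
char_model = Word2Vec([code], size=16, min_count=1)
print(sorted(char_model.wv.vocab))   # single characters: ' ', '+', '1', ...

# A list of token lists yields one vector per code token, as intended.
tok_model = Word2Vec([code.split()], size=16, min_count=1)
print(sorted(tok_model.wv.vocab))    # ['+', '1', ';', '=', 'a', 'b', 'int']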
import javalang
from clone.utils import get_blocks_v1

func2 = """
public int test2(int a){
    if(a>3){
        try{
            if(a>10){
                a = 9;
            }
        }catch(Exception e){
            a = 10;
        }
    }
    return a;
}
"""

def tree_to_index(node):
    token = node.token
    # result = [vocab[token].index if token in vocab else max_token]
    result = [token]
    children = node.children
    for child in children:
        result.append(tree_to_index(child))
    return result

def trans2seq(r):
    blocks = []
    get_blocks_v1(r, blocks)
    tree = []
    for b in blocks:
        btree = tree_to_index(b)
        tree.append(btree)
    return tree

tokens = javalang.tokenizer.tokenize(func2)
parser = javalang.parser.Parser(tokens)
tree = parser.parse_member_declaration()
seq = trans2seq(tree)
print(seq)
The output is:
[['MethodDeclaration', ['Modifier', ['public']], ['BasicType', ['int']], ['test2'], ['FormalParameter', ['BasicType', ['int']], ['a']]], ['IfStatement', ['BinaryOperation', ['>'], ['MemberReference', ['a']], ['Literal', ['3']]]], ['BlockStatement'], ['TryStatement', ['CatchClause', ['CatchClauseParameter', ['Exception'], ['e']], ['StatementExpression', ['Assignment', ['MemberReference', ['a']], ['Literal', ['10']], ['=']]]]], ['End'], ['ReturnStatement', ['MemberReference', ['a']]]]
The statement if(a>10){ a = 9; } has disappeared from the output.
Hi,
Thanks for your great paper and for open-sourcing this repo. The paper is really inspiring, and I've learnt a lot about how to handle RvNN-style structures in torch from your nice implementation.
I am recently playing with the 'poj104' dataset, the one you use for the code classification task. In your workflow it is loaded directly from a pickle file, and each code sample is parsed into an AST by parser.parse (lines 20-25 in pipeline.py):
from pycparser import c_parser
parser = c_parser.CParser()
source = pd.read_pickle(self.root+'programs.pkl')
source.columns = ['id', 'code', 'label']
source['code'] = source['code'].apply(parser.parse)
However, I find that pycparser CAN'T DIRECTLY parse the code text read from poj104. Could you please tell me how you preprocessed the code text into the programs.pkl file? (Yes, I'm a newbie to both DL and pycparser...)
Thanks in advance!
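Not the authors' actual preprocessing, but one common workaround (a sketch under that assumption): pycparser only accepts preprocessed C, so stripping preprocessor directives such as #include before parsing often gets simple programs through:

import re
from pycparser import c_parser

raw = """#include <stdio.h>
int main() { int a = 1; return a; }
"""

# Drop preprocessor lines; real POJ-104 code may need further cleanup.
cleaned = re.sub(r'^\s*#.*$', '', raw, flags=re.MULTILINE)
ast = c_parser.CParser().parse(cleaned)
ast.show()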
It seems the pandas version is wrong; pinning pandas==0.24.0 fixed this problem for me. Hope this helps others.
Hello, could you send me a copy of the slides (PPT) for this paper? Thank you.
Hi, I am using a custom dataset where the ASTs are significantly larger and get the following message (this is on a V100 with 32GB memory!)
RuntimeError: CUDA out of memory. Tried to allocate 11.69 GiB (GPU 0; 31.75 GiB total capacity; 11.88 GiB already allocated; 7.13 GiB free; 11.71 GiB cached)
Is there anything I can do after already having set BATCH_SIZE = 1?
Hello,
I am trying to use ASTNN for our research purposes, but we are facing issues when training the model on our own dataset. Since we are trying to train it on an entire repository's code, we get errors when feeding the code in, because the data needs to be preprocessed to meet pycparser's requirements, as mentioned on their page.
We can't preprocess it by hand, file by file, so we are unable to run this on a repository.
Can you please explain step by step how we can run the astnn tool on our dataset (especially when running on a repository, with lots of header files and so on)?
Looking forward to your reply.
I have noticed that the EMBEDDING_DIM variable is passed through BatchProgramClassifier() and BatchTreeEncoder() but is not used when defining the dimensions of the matrix (in this case batch_current) that is passed as input to self.W_c (line 59 of model.py). Instead, the ENCODE_DIM value is used to define the shape of batch_current, and this happens to work because in the configuration you used both are 128. However, were you to change EMBEDDING_DIM, it would throw an error.
A potential solution would be to use a separate matrix in place of batch_current in the first part of traverse_mul() up to line 59, sized with self.embedding_dim (you'd need to create this variable too) instead of self.encode_dim, and then initialise batch_current at line 59, as sketched below.
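A runnable sketch of the mismatch and the proposed fix, with hypothetical dimensions chosen so the two differ (the real code uses self.embedding, self.W_c, and index_copy inside traverse_mul):

import torch
import torch.nn as nn

EMBEDDING_DIM, ENCODE_DIM, VOCAB_SIZE, NUM_NODES = 64, 128, 1000, 5

embedding = nn.Embedding(VOCAB_SIZE, EMBEDDING_DIM)
W_c = nn.Linear(EMBEDDING_DIM, ENCODE_DIM)
node_ids = torch.randint(0, VOCAB_SIZE, (NUM_NODES,))

# Buggy pattern: a buffer sized with ENCODE_DIM receives EMBEDDING_DIM rows,
# which only works when the two dimensions happen to be equal (both 128):
# buf = torch.zeros(NUM_NODES, ENCODE_DIM)
# buf.index_copy_(0, torch.arange(NUM_NODES), embedding(node_ids))  # RuntimeError

# Proposed fix: size the pre-projection buffer with EMBEDDING_DIM, then project.
buf = torch.zeros(NUM_NODES, EMBEDDING_DIM)
buf.index_copy_(0, torch.arange(NUM_NODES), embedding(node_ids))
batch_current = W_c(buf)
print(batch_current.shape)  # torch.Size([5, 128])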
ASTNN didn't work when I used it on erroneous Java code.
I request that you modify ASTNN to also work on code containing errors.
Also, would you add functionality to ASTNN that takes a code file and creates a mapping from each code statement to its corresponding vector? That would be a great help, and more people would be able to use your library for their research, rather than only the small segment who are aiming at source code classification and code clone detection.
We are trying to apply this model to our own dataset, but how to process our data to fit this model has been a big problem for us.
I would like to know how to process our data into the correct format.
We would appreciate it greatly if you are willing to offer some help.
Best wishes.
Hello author, due to my limited ability I could not find the code that splits the AST into statement subtrees; please advise. Also, you ran comparison experiments in the paper with AST-Full, AST-Block, and AST-Node; if convenient, could you share the code for that part? Looking forward to your answer, thanks!
Start training...
[Epoch: 0/ 15] Training Loss: 2.3469, Validation Loss: 0.8280, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 107.420 s
[Epoch: 1/ 15] Training Loss: 0.4703, Validation Loss: 0.3033, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 107.467 s
[Epoch: 2/ 15] Training Loss: 0.1991, Validation Loss: 0.1727, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 107.485 s
[Epoch: 3/ 15] Training Loss: 0.1193, Validation Loss: 0.1204, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 107.439 s
[Epoch: 4/ 15] Training Loss: 0.0837, Validation Loss: 0.1022, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 107.594 s
[Epoch: 5/ 15] Training Loss: 0.0635, Validation Loss: 0.0876, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 107.626 s
[Epoch: 6/ 15] Training Loss: 0.0508, Validation Loss: 0.0778, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 108.131 s
[Epoch: 7/ 15] Training Loss: 0.0422, Validation Loss: 0.0720, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 108.008 s
[Epoch: 8/ 15] Training Loss: 0.0356, Validation Loss: 0.0659, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 107.627 s
[Epoch: 9/ 15] Training Loss: 0.0304, Validation Loss: 0.0636, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 107.660 s
[Epoch: 10/ 15] Training Loss: 0.0262, Validation Loss: 0.0610, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 107.398 s
[Epoch: 11/ 15] Training Loss: 0.0228, Validation Loss: 0.0577, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 106.979 s
[Epoch: 12/ 15] Training Loss: 0.0200, Validation Loss: 0.0586, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 106.910 s
[Epoch: 13/ 15] Training Loss: 0.0178, Validation Loss: 0.0554, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 107.089 s
[Epoch: 14/ 15] Training Loss: 0.0158, Validation Loss: 0.0549, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 107.778 s
Testing results(Acc): tensor(0, device='cuda:0')
What is the reason for this? Does it have to be torch 1.0.0?
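Not a confirmed diagnosis of this log, but one hypothetical pitfall worth checking: on older PyTorch, dividing an integer tensor by an int truncates to zero, which can make a correctly-trained model report 0.000 accuracy:

import torch

correct = torch.tensor(57)      # e.g. (predicted == labels).sum()
total = 64

print(correct / total)          # older torch: tensor(0) (integer division)
print(correct.item() / total)   # 0.890625 -- convert before dividing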
Hi,
As you explained, I have installed your model and trained it with the "ast.pkl" file.
I was wondering if you could help me understand how to pass a code snippet to your trained model and get its embedding vector?
(The code snippet is stored in a string variable.)
Thank you so much for your help in advance.
I could be wrong, but it looks like you're not performing a deep copy when you copy model to best_model and vice versa (train.py: lines 58, 117, and 127), meaning they both point at the same memory location and both get updated during optimization. In that case you would never return to an earlier model state that might have had a higher accuracy, only ever the latest. I'm not sure if this is what you intended?
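A small sketch of the aliasing this describes, using a toy module (not the repo's model): plain assignment shares the parameters, while copy.deepcopy snapshots them:

import copy
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
alias = model                    # plain assignment: both names -> one module
snapshot = copy.deepcopy(model)  # independent copy of the parameters

with torch.no_grad():
    model.weight.add_(1.0)       # simulate an optimization step

print(torch.equal(alias.weight, model.weight))     # True: alias tracked the update
print(torch.equal(snapshot.weight, model.weight))  # False: snapshot kept old state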
Hello @zhangj111, you did incredible work with this model. I ran it on a free GPU and the results were really satisfactory. Having read the paper, I want to understand how the model works in depth. I am having a hard time figuring out how the encoder module works, especially the traverse_mul method.
Could you please explain what this piece of code does exactly?
Lines 39 to 54 in edd14c9
Thanks.
In your astnn library I am trying to generate vectors for the statements of each code file, but the number of statements in a code file does not equal the number of vectors generated.
I used your astnn code in Colab, doing exactly what the file does up to the vector-generation part (up to encodes = encodes.view(self.batch_size, max_len, -1) only):
https://colab.research.google.com/drive/15FC9I4D0MRTjhV4hlDpgrZrNGC_WyzeM?usp=sharing
When I reached clone/pipeline.py: from prepare_data import get_sequences as func, I got this error.
What can I do?
In model.py, I notice that the method 'pack_padded_sequence' requires its input to be sorted by length in decreasing order ("The sequences should be sorted by length in a decreasing order"); otherwise the code raises RuntimeError: 'lengths' array has to be sorted in decreasing order.
I revised the forward function to solve this issue as below:
def forward(self, x):
    lens = [len(item) for item in x]
    max_len = max(lens)
    encodes = []
    for i in range(self.batch_size):
        for j in range(lens[i]):
            encodes.append(x[i][j])
    encodes = self.encoder(encodes, sum(lens))
    seq, start, end = [], 0, 0
    for i in range(self.batch_size):
        end += lens[i]
        seq.append(encodes[start:end])
        if max_len - lens[i]:
            seq.append(self.get_zeros(max_len - lens[i]))
        start = end
    encodes = torch.cat(seq)
    encodes = encodes.view(self.batch_size, max_len, -1)
    lens = torch.LongTensor(lens)
    lens, perm_idx = lens.sort(0, descending=True)
    encodes = encodes[perm_idx]
    _, unperm_idx = perm_idx.sort(0, descending=False)
    encodes = nn.utils.rnn.pack_padded_sequence(encodes, lens, True)
    # gru
    gru_out, _ = self.bigru(encodes, self.hidden)
    gru_out, _ = nn.utils.rnn.pad_packed_sequence(gru_out, batch_first=True, padding_value=-1e9)
    gru_out = gru_out[unperm_idx]
    # print(gru_out.shape)
    gru_out = torch.transpose(gru_out, 1, 2)
    # pooling
    gru_out = F.max_pool1d(gru_out, gru_out.size(2)).squeeze(2)
    # gru_out = gru_out[:,-1]
    # linear
    y = self.hidden2label(gru_out)
    return y
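For reference, a standalone sketch of the sort/unsort idiom used above (hypothetical shapes; the padding and GRU sizes are illustrative only):

import torch
import torch.nn as nn

lens = torch.LongTensor([2, 5, 3])
x = torch.randn(3, 5, 4)                    # (batch, max_len, features), padded

lens_sorted, perm = lens.sort(0, descending=True)
_, unperm = perm.sort(0)

packed = nn.utils.rnn.pack_padded_sequence(x[perm], lens_sorted, batch_first=True)
out, _ = nn.GRU(4, 8, batch_first=True)(packed)
out, _ = nn.utils.rnn.pad_packed_sequence(out, batch_first=True)
out = out[unperm]                           # restore the original batch order
print(out.shape)                            # torch.Size([3, 5, 8])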