Invalid Argument Error? (code2seq issue, CLOSED)

tech-srl commented on June 6, 2024
Invalid Argument Error?

from code2seq.

Comments (7)

urialon commented on June 6, 2024

My understanding of the extraction step is that I specify the target as, say, the method name or a caption, and in the list of contexts I can specify any type of component suitable for my problem

That's right! There are several things to notice:

  1. The words in target should be split by |, i.e.: print|bmp|to|file
  2. The 3-tuple type_of_statement|token_1|token_2 should be split by comma (,) rather than |, and each of them internally should be split by |.
  3. The network reads the 1st and 3rd fields as a set of subtokens, and the 2nd field as a sequence (using an LSTM). So I would suggest switching the order so that type_of_statement is the middle field, and setting config.MAX_PATH_LENGTH = 1. The final result will look like:
    print|bmp|to|file subtoken1|subtoken2|subtoken3,type_of_statement,subtoken4|subtoken5|subtoken6

Where subtoken1|subtoken2|subtoken3 are the components of token_1 in your example,
and subtoken4|subtoken5|subtoken6 are the components of token_2 from your example.
Since type_of_statement is a single value (rather than a sequence of symbols), you can set config.MAX_PATH_LENGTH = 1, and training will be faster because the LSTM will not be used.
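
The format described above can be sketched in a few lines of Python. This is only an illustration and is not taken from the code2seq repository: the helper name `format_example` and its argument layout are hypothetical. It joins the target subtokens with |, joins the three fields of each context with commas, and separates the contexts with spaces.

```python
def format_example(target_subtokens, contexts):
    """Build one raw line: 'target ctx1 ctx2 ...'.

    target_subtokens: list of str, e.g. ["print", "bmp", "to", "file"]
    contexts: list of (left_subtokens, middle, right_subtokens) tuples,
              where the middle field is a single value such as the
              statement type (hypothetical layout for this example).
    """
    target = "|".join(target_subtokens)
    fields = []
    for left, middle, right in contexts:
        # Subtokens inside a field use '|'; the three fields use ','.
        fields.append(",".join(["|".join(left), middle, "|".join(right)]))
    # The target and the contexts are separated by single spaces.
    return " ".join([target] + fields)


line = format_example(
    ["print", "bmp", "to", "file"],
    [(
        ["subtoken1", "subtoken2", "subtoken3"],
        "type_of_statement",
        ["subtoken4", "subtoken5", "subtoken6"],
    )],
)
print(line)
```

Running it prints the same line as the example above, so the output of a custom extractor can be compared against it before running preprocess.sh.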

urialon commented on June 6, 2024

Basically yes. See also Section 2 of the code2vec paper, where it is explained more thoroughly:
https://arxiv.org/abs/1803.09473

PankajB1997 commented on June 6, 2024

On a related note, could you please explain the role of config.MAX_PATH_LENGTH in a bit more detail? I am not familiar with the model, so I am still trying to figure out this error, which seems to be related to this constant.

urialon commented on June 6, 2024

Hi Pankaj,
Did you run the model on a dataset that you preprocessed yourself, i.e., not our preprocessed dataset? Did you preprocess your dataset with a non-default max_path_length value? Or did you decrease the default value in config.MAX_PATH_LENGTH?
In general, config.MAX_PATH_LENGTH in the model should be greater by 1 than the max_path_length value of the preprocessing. This is indeed confusing.

config.MAX_PATH_LENGTH is the number of nodes in each "path".
For legacy reasons, in the JavaExtractor, max_path_length is the number of edges and is set to 8 by default. This is the reason that the default value for config.MAX_PATH_LENGTH is 8+1.
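
The off-by-one relation can be written down as a tiny sketch. The function names are hypothetical and not part of code2seq, and the sketch assumes that the nodes of a path are joined with |, as the fields of a context are in this thread. A path with n nodes has n - 1 edges, so an edge limit of 8 in the extractor corresponds to a node limit of 9 in the model.

```python
def model_max_path_length(extractor_max_path_length):
    """Convert the extractor's edge limit into the model's node limit."""
    return extractor_max_path_length + 1


def count_path_nodes(path, sep="|"):
    """Count the nodes in a path string (assumes nodes are joined by sep)."""
    return len(path.split(sep))


# Hypothetical path with 4 nodes, i.e. 3 edges.
example_path = "NodeA|NodeB|NodeC|NodeD"
nodes = count_path_nodes(example_path)
edges = nodes - 1

limit = model_max_path_length(8)  # extractor default of 8 edges
print(nodes, edges, limit, nodes <= limit)
```

A check like `nodes <= limit` over the extracted rows is one way to spot paths that are longer than the model's configured value before training.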

PankajB1997 commented on June 6, 2024

Hello, thank you for the response!

Yes, I'm using another dataset for which I wrote another extractor, and then I ran preprocess.sh on just the extracted result (i.e. my self-created train.raw.txt, val.raw.txt, test.raw.txt). I guess my mistake is that I did not take into account the max_path_length property in my extraction code.

My understanding of the extraction step is that I specify the target as, say, the method name or a caption, and in the list of contexts I can specify any type of component suitable for my problem. My extracted rows deal with code lines individually and are of the form target type_of_statement|token_1|token_2 ..., where type_of_statement is chosen from a set of 25 possible values indicating the type of code statement, and the tokens are similar to your example.

So just to clarify, how would you suggest I account for max_path_length in my extraction code?

PankajB1997 commented on June 6, 2024

Thank you for your help, this clarified a lot!! :)

PankajB1997 commented on June 6, 2024

By the way, I wanted to understand how you use the Abstract Syntax Tree in the extraction step. Quoting from the paper:

Given the AST of a code snippet, we consider all pairwise paths between terminals, and represent them as sequences of terminal and nonterminal nodes. We then use these paths with their terminals’ values to represent the code snippet itself.

Does this mean that given the AST, you are extracting all possible terminal-to-terminal paths from the tree and extracting contexts in the form terminal node token, path of intermediate non-terminal nodes, terminal node token?
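
To make the question concrete, here is a minimal sketch of pairwise terminal-to-terminal path extraction on a toy tree. It is an illustration under stated assumptions, not the JavaExtractor's implementation: the `Node` class, the function names, and the node labels are all hypothetical, and details such as length or width limits are left out. For each pair of terminals it walks up from the first terminal to the lowest common ancestor and then down to the second.

```python
from itertools import combinations


class Node:
    """Toy AST node; 'token' is set only for terminals (leaves)."""

    def __init__(self, label, children=None, token=None):
        self.label = label
        self.children = children or []
        self.token = token


def leaves_with_ancestors(root):
    """Return [(leaf, [root, ..., leaf])] for every terminal, left to right."""
    result = []

    def walk(node, trail):
        trail = trail + [node]
        if not node.children:
            result.append((node, trail))
        for child in node.children:
            walk(child, trail)

    walk(root, [])
    return result


def pairwise_contexts(root):
    """Return (token, path_labels, token) for every pair of terminals."""
    contexts = []
    for (a, trail_a), (b, trail_b) in combinations(leaves_with_ancestors(root), 2):
        # Count the shared prefix; its last node is the lowest common ancestor.
        shared = 0
        limit = min(len(trail_a), len(trail_b))
        while shared < limit and trail_a[shared] is trail_b[shared]:
            shared += 1
        up = list(reversed(trail_a[shared:]))   # from terminal a upward
        lca = trail_a[shared - 1]
        down = trail_b[shared:]                 # downward to terminal b
        path = [n.label for n in up + [lca] + down]
        contexts.append((a.token, path, b.token))
    return contexts


# Hypothetical tree for something like `x = y`.
tree = Node("Assign", [Node("Name", token="x"), Node("Name", token="y")])
print(pairwise_contexts(tree))
```

In this sketch the path includes the terminal nodes' own labels as well as the nonterminals between them; whether a real extractor includes them is a detail to confirm against the paper and the extractor source.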
