tech-srl / code2seq

Code for the model presented in the paper: "code2seq: Generating Sequences from Structured Representations of Code"

Home Page: http://code2seq.org

License: MIT License

Java 11.49% Python 73.83% Shell 3.98% C# 10.69%

code2seq's Introduction

code2seq

This is an official implementation of the model described in:

Uri Alon, Shaked Brody, Omer Levy and Eran Yahav, "code2seq: Generating Sequences from Structured Representations of Code" [PDF]

Appeared in ICLR'2019 (poster available here)

An online demo is available at https://code2seq.org.

This is a TensorFlow implementation of the network, with Java and C# extractors for preprocessing the input code. It can be easily extended to other languages, since the TensorFlow network is agnostic to the input programming language (see Extending to other languages). Contributions are welcome.

Requirements

python3
TensorFlow 1.12. To check the installed TensorFlow version:

python3 -c 'import tensorflow as tf; print(tf.__version__)'

For a TensorFlow 2.1 implementation by @Kolkir, see: https://github.com/Kolkir/code2seq

For creating a new Java dataset or manually examining a trained model (any operation that requires parsing of a new code example): JDK
For creating a C# dataset: dotnet-core version 2.2 or newer.
pip install rouge, for computing ROUGE scores.

Quickstart

Step 0: Cloning this repository

git clone https://github.com/tech-srl/code2seq
cd code2seq

Step 1: Creating a new dataset from Java sources

To obtain a preprocessed dataset to train a network on, you can either download our preprocessed dataset, or create a new dataset from Java source files.

Download our preprocessed Java-large dataset (~16M examples, compressed: 11 GB, extracted: 125 GB)

mkdir data
cd data
wget https://s3.amazonaws.com/code2seq/datasets/java-large-preprocessed.tar.gz
tar -xvzf java-large-preprocessed.tar.gz

This will create a data/java-large/ sub-directory, containing the files that hold training, test and validation sets, and a dict file for various dataset properties.

Creating and preprocessing a new Java dataset

To create and preprocess a new dataset (for example, to compare code2seq to another model on another dataset):

Edit the file preprocess.sh using the instructions there, pointing it to the correct training, validation and test directories.
Run the preprocess.sh file:

bash preprocess.sh

Step 2: Training a model

You can either download an already trained model, or train a new model using a preprocessed dataset.

Downloading a trained model (137 MB)

We already trained a model for 52 epochs on the data that was preprocessed in the previous step. This model is the same model that was used in the paper and the same model that serves the demo at code2seq.org.

wget https://s3.amazonaws.com/code2seq/model/java-large/java-large-model.tar.gz
tar -xvzf java-large-model.tar.gz

Note:

This trained model is in a "released" state, which means that we stripped it of its training parameters.

Training a model from scratch

To train a model from scratch:

Edit the file train.sh to point it to the right preprocessed data. By default, it points to our "java-large" dataset that was preprocessed in the previous step.
Before training, you can edit the configuration hyper-parameters in the file config.py, as explained in Configuration.
Run the train.sh script:

bash train.sh

Step 3: Evaluating a trained model

After config.PATIENCE iterations of no improvement on the validation set, training stops by itself.
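The stopping rule above can be sketched as follows. This is a minimal illustration, not the repository's training loop: the helper name `train_with_patience` and the idea of passing in a ready-made list of per-epoch validation scores are assumptions made to keep the example self-contained.

```python
def train_with_patience(validation_scores, patience):
    """Return (best_epoch, stop_epoch) for a list of per-epoch validation scores.

    Hypothetical helper: real training would compute each score after an
    epoch instead of reading it from a list.
    """
    best_score = float("-inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, score in enumerate(validation_scores, start=1):
        if score > best_score:
            # New best model: remember it and reset the patience counter.
            best_score = score
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                # No improvement for `patience` epochs in a row: stop.
                return best_epoch, epoch
    return best_epoch, len(validation_scores)


best_epoch, stop_epoch = train_with_patience(
    [0.10, 0.20, 0.30, 0.25, 0.28, 0.29], patience=3
)
```

In this toy run the best score is reached at epoch 3 and training stops at epoch 6, so the model saved at epoch 3 would be the one to evaluate.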

Suppose that iteration #52 is our chosen model. To evaluate it, run:

python3 code2seq.py --load models/java-large-model/model_iter52.release --test data/java-large/java-large.test.c2s

While evaluating, a file named "log.txt" is written to the same dir as the saved models, with each test example name and the model's prediction.

Step 4: Manual examination of a trained model

To manually examine a trained model, run:

python3 code2seq.py --load models/java-large-model/model_iter52.release --predict

After the model loads, follow the instructions: edit the file Input.java to contain a Java method or code snippet, then examine the model's predictions and attention scores.

Note:

Due to TensorFlow's limitations, if using beam search (config.BEAM_WIDTH > 0), then BEAM_WIDTH hypotheses will be printed, but without attention weights. If not using beam search (config.BEAM_WIDTH == 0), then a single hypothesis will be printed with the attention weights in every decoding timestep.

Configuration

Changing hyper-parameters is possible by editing the file config.py.

Here are some of the parameters and their description:

config.NUM_EPOCHS = 3000

The max number of epochs to train the model.

config.SAVE_EVERY_EPOCHS = 1

The frequency, in epochs, of saving a model and evaluating on the validation set during training.

config.PATIENCE = 10

Controls early stopping: the number of epochs without improvement on the validation set after which training stops.

config.BATCH_SIZE = 512

Batch size during training.

config.TEST_BATCH_SIZE = 256

Batch size during evaluation. Affects only the evaluation speed and memory consumption, does not affect the results.

config.SHUFFLE_BUFFER_SIZE = 10000

The buffer size that the reader uses for shuffling the training data. Controls the randomness of the data. Increasing this value might hurt training throughput.

config.CSV_BUFFER_SIZE = 100 * 1024 * 1024

The buffer size (in bytes) of the CSV dataset reader.

config.MAX_CONTEXTS = 200

The number of contexts to sample in each example during training (resampling a different subset of this size every training iteration).

config.SUBTOKENS_VOCAB_MAX_SIZE = 190000

The max size of the subtoken vocabulary.

config.TARGET_VOCAB_MAX_SIZE = 27000

The max size of the target words vocabulary.

config.EMBEDDINGS_SIZE = 128

Embedding size for subtokens, AST nodes and target symbols.

config.RNN_SIZE = 128 * 2

The total size of the two LSTMs that are used to embed the paths if config.BIRNN is True, or the size of the single LSTM if config.BIRNN is False.

config.DECODER_SIZE = 320

Size of each LSTM layer in the decoder.

config.NUM_DECODER_LAYERS = 1

Number of decoder LSTM layers. Can be increased to support long target sequences.

config.MAX_PATH_LENGTH = 8 + 1

The max number of nodes in a path.

config.MAX_NAME_PARTS = 5

The max number of subtokens in an input token. If the token is longer, only the first subtokens will be read.

config.MAX_TARGET_PARTS = 6

The max number of symbols in the target sequence. Set to 6 by default for method names, but can be increased for learning datasets with longer sequences.

config.BIRNN = True

If True, use a bidirectional LSTM to encode each path. If False, use a unidirectional LSTM only.

config.RANDOM_CONTEXTS = True

When True, sample MAX_CONTEXTS contexts from every example in every training iteration. When False, take the first MAX_CONTEXTS only.

config.BEAM_WIDTH = 0

Beam width in beam search. Inactive when 0.

config.USE_MOMENTUM = True

If True, use the Momentum optimizer with Nesterov momentum. If False, use Adam (Adam converges in fewer epochs; Momentum leads to slightly better results).
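To make the interaction of MAX_CONTEXTS, RANDOM_CONTEXTS and MAX_NAME_PARTS more concrete, here is a small sketch in plain Python. The function names `select_contexts` and `truncate_subtokens` are hypothetical and the repository's reader does this work inside TensorFlow, so treat this as an illustration of the described behavior only.

```python
import random


def select_contexts(contexts, max_contexts, random_contexts, rng=random):
    """Pick at most `max_contexts` contexts from one example."""
    contexts = list(contexts)
    if len(contexts) <= max_contexts:
        return contexts
    if random_contexts:
        # A different random subset can be drawn on every call.
        return rng.sample(contexts, max_contexts)
    # Deterministic variant: keep only the first `max_contexts`.
    return contexts[:max_contexts]


def truncate_subtokens(token, max_name_parts):
    """Split a "|"-delimited token and keep only the first subtokens."""
    return token.split("|")[:max_name_parts]


example_contexts = ["ctx%d" % i for i in range(10)]
first_three = select_contexts(example_contexts, 3, random_contexts=False)
sampled_three = select_contexts(example_contexts, 3, random_contexts=True)
long_token = truncate_subtokens("a|b|c|d|e|f|g", 5)
```

With RANDOM_CONTEXTS enabled, repeated passes over the same example see different subsets, which is why the description of MAX_CONTEXTS speaks of resampling every training iteration.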

Releasing a trained model

If you wish to keep a trained model for inference only (without the ability to continue training it) you can release the model using:

python3 code2seq.py --load models/java-large-model/model_iter52 --release

This will save a copy of the trained model with the '.release' suffix. A "released" model usually takes ~3x less disk space.

Extending to other languages

This project currently supports Java and C# as the input languages.

March 2020 - a code2seq extractor for C++ based on LLVM was developed by @Kolkir and is available here: https://github.com/Kolkir/cppminer.

January 2020 - a code2seq extractor for Python (specifically targeting the Python150k dataset) was contributed by @stasbel. See: https://github.com/tech-srl/code2seq/tree/master/Python150kExtractor.

January 2020 - an extractor for predicting TypeScript type annotations for JavaScript input using code2vec was developed by @izosak and Noa Cohen, and is available here: https://github.com/tech-srl/id2vec.

June 2019 - an extractor for C that is compatible with our model was developed by the CMU SEI team (it has since been removed by the CMU SEI team).

June 2019 - a code2vec extractor for Python, Java, C, C++ by JetBrains Research is available here: PathMiner.

To extend code2seq to languages other than Java and C#, a new extractor (similar to the JavaExtractor) should be implemented and called by preprocess.sh. Basically, for each directory containing source files, an extractor should be able to output:

A single text file, where each row is an example.
Each example is a space-delimited list of fields, where:

The first field is the target label, internally delimited by the "|" character (for example: compare|ignore|case)
Each of the following fields is a context, where each context has three components separated by commas (","). None of these components can include spaces or commas.

We refer to these three components as a token, a path, and another token, but in general other types of ternary contexts can be considered.

Each "token" component is a token in the code, split to subtokens using the "|" character.

Each path is a path between two tokens, split to path nodes (or other kinds of building blocks) using the "|" character. Example for a context:

my|key,StringExression|MethodCall|Name,get|value

Here my|key and get|value are tokens, and StringExression|MethodCall|Name is the syntactic path that connects them.
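The row format described above can be read back with a few lines of Python. This is a sketch for checking the output of a new extractor; the names `Context` and `parse_example` are hypothetical, and the label `compare|ignore|case` is combined with the example context purely for illustration.

```python
from typing import List, NamedTuple


class Context(NamedTuple):
    source_subtokens: List[str]
    path_nodes: List[str]
    target_subtokens: List[str]


def parse_example(line):
    """Parse one space-delimited row into (label subtokens, contexts)."""
    fields = line.strip().split(" ")
    label = fields[0].split("|")
    contexts = []
    for field in fields[1:]:
        if not field:
            continue  # ignore empty fields produced by repeated spaces
        # Each context is "token,path,token"; components hold no commas.
        source, path, target = field.split(",")
        contexts.append(
            Context(source.split("|"), path.split("|"), target.split("|"))
        )
    return label, contexts


label, contexts = parse_example(
    "compare|ignore|case my|key,StringExression|MethodCall|Name,get|value"
)
```

A context with anything other than exactly three comma-separated components raises a ValueError here, which is a convenient way to catch a malformed extractor output early.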

Datasets

Java

To download the Java-small, Java-med and Java-large datasets used in the Code Summarization task as raw *.java files, use:

To download the preprocessed datasets, use:

C#

The C# dataset used in the Code Captioning task can be downloaded from the CodeNN repository.

Baselines

Using the trained model

For the NMT baselines (BiLSTM, Transformer) we used the implementation of OpenNMT-py. The trained BiLSTM model is available here: https://code2seq.s3.amazonaws.com/lstm_baseline/model_acc_62.88_ppl_12.03_e16.pt

Test+validation sources and targets:

https://code2seq.s3.amazonaws.com/lstm_baseline/test_expected_actual.txt
https://code2seq.s3.amazonaws.com/lstm_baseline/test_source.txt
https://code2seq.s3.amazonaws.com/lstm_baseline/test_target.txt
https://code2seq.s3.amazonaws.com/lstm_baseline/val_source.txt
https://code2seq.s3.amazonaws.com/lstm_baseline/val_target.txt

The command line for "translating" a "source" file to a "target" is:

python3 translate.py -model model_acc_62.88_ppl_12.03_e16.pt -src test_source.txt -output translation_epoch16.txt -gpu 0

This results in a translation_epoch16.txt which we compare to test_target.txt to compute the score. The file test_expected_actual.txt is a line-by-line concatenation of the true reference ("expected") with the corresponding prediction (the "actual").
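The comparison script itself is not included here. As a rough sketch, assuming that the score is a subtoken-level precision, recall and F1 and that each line holds space-separated subtokens, the two files could be compared like this. The function name `subtoken_f1` and the micro-averaged, multiset-based counting are assumptions, so the numbers may differ from those produced by the original evaluation.

```python
from collections import Counter


def subtoken_f1(predicted_lines, reference_lines):
    """Micro-averaged subtoken precision, recall and F1 over paired lines."""
    true_positive = false_positive = false_negative = 0
    for predicted, reference in zip(predicted_lines, reference_lines):
        predicted_counts = Counter(predicted.split())
        reference_counts = Counter(reference.split())
        # Multiset intersection: subtokens present in both lines.
        overlap = sum((predicted_counts & reference_counts).values())
        true_positive += overlap
        false_positive += sum(predicted_counts.values()) - overlap
        false_negative += sum(reference_counts.values()) - overlap
    predicted_total = true_positive + false_positive
    reference_total = true_positive + false_negative
    precision = true_positive / predicted_total if predicted_total else 0.0
    recall = true_positive / reference_total if reference_total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


precision, recall, f1 = subtoken_f1(
    ["get value", "set name"],
    ["get value", "set user name"],
)
```

To apply it to the files above, the lines of translation_epoch16.txt would be passed as `predicted_lines` and the lines of test_target.txt as `reference_lines`.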

Creating data for the baseline

We first modified the JavaExtractor (the same one as in this repository) to locate the methods to train on and print them to a file where each method is a single line. This modification is currently not checked in; instead of extracting paths, it just prints node.toString() and replaces "\n" with a space, where node is the object holding the AST node of type MethodDeclaration.

Then, we tokenized (including sub-tokenization of identifiers, i.e., "ArrayList"-> ["Array","List"]) each method body using javalang, using this script (which can be run on this input example). So a program of:

void methodName(String fooBar) {
    System.out.println("hello world");
}

should be printed by the modified JavaExtractor as:

method name|void (String fooBar){ System.out.println("hello world");}

and the tokenization script would turn it into:

void ( String foo Bar ) { System . out . println ( " hello world " ) ; }

and the label to be predicted, i.e., "method name", into a separate file.

OpenNMT-py can then be trained over these training source and target files.
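The sub-tokenization of identifiers mentioned above can be approximated with a regular expression. Note that this is a stand-in and not the javalang-based script that was actually used: the function name `split_identifier` and the exact splitting rules (camelCase boundaries, runs of capitals, digits) are assumptions, and it covers only identifier splitting, not full Java tokenization.

```python
import re

# Acronyms (runs of capitals not followed by a lowercase letter),
# capitalized or lowercase words, and digit runs.
_SUBTOKEN_PATTERN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_identifier(identifier):
    """Split a Java identifier into subtokens, keeping the original case."""
    return _SUBTOKEN_PATTERN.findall(identifier)


array_list = split_identifier("ArrayList")
foo_bar = split_identifier("fooBar")
acronym = split_identifier("parseHTTPResponse")
```

Identifiers without internal capital letters, such as println, come back as a single subtoken, which matches the tokenized example above.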

Citation

code2seq: Generating Sequences from Structured Representations of Code

@inproceedings{
    alon2018codeseq,
    title={code2seq: Generating Sequences from Structured Representations of Code},
    author={Uri Alon and Shaked Brody and Omer Levy and Eran Yahav},
    booktitle={International Conference on Learning Representations},
    year={2019},
    url={https://openreview.net/forum?id=H1gKYo09tX},
}

code2seq's Issues

I can't preprocess java-small correctly~

When I preprocessed the java-small dataset, I encountered this problem.
(I use just one thread because I want to map my preprocessed dataset; using multiple threads puts the mapping out of order.)

b'java.util.concurrent.ExecutionException: com.github.javaparser.ParseProblemException: Encountered unexpected token: "..." "..."\n at line 1, column 54.\n\nWas expecting one of:\n\n "!"\n "("\n "+"\n "++"\n "-"\n "--"\n "boolean"\n "byte"\n "char"\n "double"\n "false"\n "float"\n "int"\n "long"\n "new"\n "null"\n "short"\n "super"\n "this"\n "true"\n "void"\n "{"\n ""\n <CHARACTER_LITERAL>\n <FLOATING_POINT_LITERAL>\n \n <INTEGER_LITERAL>\n <LONG_LITERAL>\n <STRING_LITERAL>\n\n\n\tat java.util.concurrent.FutureTask.report(FutureTask.java:122)\n\tat java.util.concurrent.FutureTask.get(FutureTask.java:192)\n\tat JavaExtractor.App.lambda$extractDir$3(App.java:59)\n\tat java.util.ArrayList.forEach(ArrayList.java:1257)\n\tat JavaExtractor.App.extractDir(App.java:57)\n\tat JavaExtractor.App.main(App.java:32)\nCaused by: com.github.javaparser.ParseProblemException: Encountered unexpected token: "..." "..."\n at line 1, column 54.\n\nWas expecting one of:\n\n "!"\n "("\n "+"\n "++"\n "-"\n "--"\n "boolean"\n "byte"\n "char"\n "double"\n "false"\n "float"\n "int"\n "long"\n "new"\n "null"\n "short"\n "super"\n "this"\n "true"\n "void"\n "{"\n ""\n <CHARACTER_LITERAL>\n <FLOATING_POINT_LITERAL>\n \n <INTEGER_LITERAL>\n <LONG_LITERAL>\n <STRING_LITERAL>\n\n\n\tat com.github.javaparser.JavaParser.simplifiedParse(JavaParser.java:242)\n\tat com.github.javaparser.JavaParser.parse(JavaParser.java:210)\n\tat JavaExtractor.FeatureExtractor.parseFileWithRetries(FeatureExtractor.java:73)\n\tat JavaExtractor.FeatureExtractor.extractFeatures(FeatureExtractor.java:45)\n\tat JavaExtractor.ExtractFeaturesTask.extractSingleFile(ExtractFeaturesTask.java:64)\n\tat JavaExtractor.ExtractFeaturesTask.processFile(ExtractFeaturesTask.java:34)\n\tat JavaExtractor.ExtractFeaturesTask.call(ExtractFeaturesTask.java:27)\n\tat JavaExtractor.ExtractFeaturesTask.call(ExtractFeaturesTask.java:16)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\njava.util.concurrent.ExecutionException: com.github.javaparser.ParseProblemException: Encountered unexpected token: "..." "..."\n at line 7, column 2.\n\nWas expecting one of:\n\n ";"\n "<"\n "@"\n "abstract"\n "boolean"\n "byte"\n "char"\n "class"\n "default"\n "double"\n "enum"\n "final"\n "float"\n "int"\n "interface"\n "long"\n "native"\n "private"\n "protected"\n "public"\n "short"\n "static"\n "strictfp"\n "synchronized"\n "transient"\n "void"\n "volatile"\n "{"\n "}"\n \n\n\n\tat java.util.concurrent.FutureTask.report(FutureTask.java:122)\n\tat java.util.concurrent.FutureTask.get(FutureTask.java:192)\n\tat JavaExtractor.App.lambda$extractDir$3(App.java:59)\n\tat java.util.ArrayList.forEach(ArrayList.java:1257)\n\tat JavaExtractor.App.extractDir(App.java:57)\n\tat JavaExtractor.App.main(App.java:32)\nCaused by: com.github.javaparser.ParseProblemException: Encountered unexpected token: "..." 
"..."\n at line 7, column 2.\n\nWas expecting one of:\n\n ";"\n "<"\n "@"\n "abstract"\n "boolean"\n "byte"\n "char"\n "class"\n "default"\n "double"\n "enum"\n "final"\n "float"\n "int"\n "interface"\n "long"\n "native"\n "private"\n "protected"\n "public"\n "short"\n "static"\n "strictfp"\n "synchronized"\n "transient"\n "void"\n "volatile"\n "{"\n "}"\n \n\n\n\tat com.github.javaparser.JavaParser.simplifiedParse(JavaParser.java:242)\n\tat com.github.javaparser.JavaParser.parse(JavaParser.java:210)\n\tat JavaExtractor.FeatureExtractor.parseFileWithRetries(FeatureExtractor.java:73)\n\tat JavaExtractor.FeatureExtractor.extractFeatures(FeatureExtractor.java:45)\n\tat JavaExtractor.ExtractFeaturesTask.extractSingleFile(ExtractFeaturesTask.java:64)\n\tat JavaExtractor.ExtractFeaturesTask.processFile(ExtractFeaturesTask.java:34)\n\tat JavaExtractor.ExtractFeaturesTask.call(ExtractFeaturesTask.java:27)\n\tat JavaExtractor.ExtractFeaturesTask.call(ExtractFeaturesTask.java:16)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\njava.util.concurrent.ExecutionException: com.github.javaparser.ParseProblemException: Encountered unexpected token: "..." 
"..."\n at line 53, column 2.\n\nWas expecting one of:\n\n ";"\n "<"\n "@"\n "abstract"\n "boolean"\n "byte"\n "char"\n "class"\n "default"\n "double"\n "enum"\n "final"\n "float"\n "int"\n "interface"\n "long"\n "native"\n "private"\n "protected"\n "public"\n "short"\n "static"\n "strictfp"\n "synchronized"\n "transient"\n "void"\n "volatile"\n "{"\n "}"\n \n\n\n\tat java.util.concurrent.FutureTask.report(FutureTask.java:122)\n\tat java.util.concurrent.FutureTask.get(FutureTask.java:192)\n\tat JavaExtractor.App.lambda$extractDir$3(App.java:59)\n\tat java.util.ArrayList.forEach(ArrayList.java:1257)\n\tat JavaExtractor.App.extractDir(App.java:57)\n\tat JavaExtractor.App.main(App.java:32)\nCaused by: com.github.javaparser.ParseProblemException: Encountered unexpected token: "..." "..."\n at line 53, column 2.\n\nWas expecting one of:\n\n ";"\n "<"\n "@"\n "abstract"\n "boolean"\n "byte"\n "char"\n "class"\n "default"\n "double"\n "enum"\n "final"\n "float"\n "int"\n "interface"\n "long"\n "native"\n "private"\n "protected"\n "public"\n "short"\n "static"\n "strictfp"\n "synchronized"\n "transient"\n "void"\n "volatile"\n "{"\n "}"\n \n\n\n\tat com.github.javaparser.JavaParser.simplifiedParse(JavaParser.java:242)\n\tat com.github.javaparser.JavaParser.parse(JavaParser.java:210)\n\tat JavaExtractor.FeatureExtractor.parseFileWithRetries(FeatureExtractor.java:73)\n\tat JavaExtractor.FeatureExtractor.extractFeatures(FeatureExtractor.java:45)\n\tat JavaExtractor.ExtractFeaturesTask.extractSingleFile(ExtractFeaturesTask.java:64)\n\tat JavaExtractor.ExtractFeaturesTask.processFile(ExtractFeaturesTask.java:34)\n\tat JavaExtractor.ExtractFeaturesTask.call(ExtractFeaturesTask.java:27)\n\tat JavaExtractor.ExtractFeaturesTask.call(ExtractFeaturesTask.java:16)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\njava.util.concurrent.ExecutionException: com.github.javaparser.ParseProblemException: Encountered unexpected token: "..." "..."\n at line 7, column 2.\n\nWas expecting one of:\n\n ";"\n "<"\n "@"\n "abstract"\n "boolean"\n "byte"\n "char"\n "class"\n "default"\n "double"\n "enum"\n "final"\n "float"\n "int"\n "interface"\n "long"\n "native"\n "private"\n "protected"\n "public"\n "short"\n "static"\n "strictfp"\n

How to make decoder step?

Hello @urialon!
Thanks for your work on code2seq, it's amazing!

I have a question about integrating attention into the decoder. As far as I understand after reading the paper, a decoder step can be described by this algorithm:

Embedding lookup for a token from the previous step;
Pass this embedded token into LSTM with previous hidden and memory states (result is h_t);
Calculate the attention vector for all paths with respect to h_t;
Apply attention to paths -- c_t;
Concatenate c_t with h_t
Use 2-layer MLP with the tanh activation to get a projection on vocabulary.

Is this correct, or am I missing something?

After this, I printed the parameters of your model, and for the decoder I got this:

model/memory_layer/kernel:0 | [320, 320]
model/decoder/attention_wrapper/multi_rnn_cell/cell_0/lstm_cell/kernel:0 | [768, 1280]
model/decoder/attention_wrapper/multi_rnn_cell/cell_0/lstm_cell/bias:0 | [1280]
model/decoder/attention_wrapper/attention_layer/kernel:0 | [640, 320]
model/decoder/dense/kernel:0 | [320, 11319]

So the LSTM cell takes an input of size 320 + 128 = 448 (the other 320 rows of the kernel are for the hidden state). I don't understand how this happens: if we pass only the embedded vector, then the input size should be 128.

Do you have any idea where I am wrong? What is the proper way to implement the decoder step?
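One way to reconcile these shapes is to assume that the decoder is built with TensorFlow 1.x's AttentionWrapper and its default input feeding, which concatenates the previous step's attention vector to the embedded input token. The arithmetic below is a sketch under that assumption; the sizes 128 and 320 come from the README defaults (EMBEDDINGS_SIZE and DECODER_SIZE), and whether the repository's code matches this exactly is not confirmed here.

```python
def lstm_kernel_shape(input_size, hidden_size):
    """Kernel shape of a standard LSTM cell: [input + hidden, 4 gates * hidden]."""
    return [input_size + hidden_size, 4 * hidden_size]


embedding_size = 128  # config.EMBEDDINGS_SIZE
decoder_size = 320    # config.DECODER_SIZE
attention_size = 320  # assumed size of the attention vector fed back as input
context_size = 320    # assumed size of the attention context over the paths

# Input feeding: [embedded previous token ; previous attention vector].
cell_input_size = embedding_size + attention_size
lstm_kernel = lstm_kernel_shape(cell_input_size, decoder_size)

# The attention layer projects [cell output ; context] to the attention vector.
attention_layer_kernel = [decoder_size + context_size, attention_size]
```

Under these assumptions the LSTM kernel comes out as [768, 1280] and the attention layer kernel as [640, 320], matching the printed parameters. The extra 320 input dimensions would then be the previous attention vector, not part of the embedding.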

MAX_CONTEXTS tied to preprocessing?

Hi @urialon ,

I am trying to do some hyper-parameter tuning, but it seems it is a bit trickier than I thought.

In order to change MAX_CONTEXTS in config.py, do I have to preprocess the data with the same MAX_CONTEXTS value as well?
Does this also apply to WORD_VOCAB_SIZE and PATH_VOCAB_SIZE? I see a correspondence between preprocessing and running:
WORD_VOCAB_SIZE == config.MAX_TOKEN_VOCAB_SIZE
PATH_VOCAB_SIZE == config.MAX_PATH_VOCAB_SIZE

model performance on long word sequences

Hi,
I have tested your method on translating long sequences based on AST paths, but the performance is much lower than I expected. Do you have any suggestions for this problem, or have you tested the model on translating long sequences?

Code Captioning Task

Hello!

May I ask you about applying your approach to the code captioning task?

You wrote in the article that you used CodeNN's dataset and achieved a BLEU score of ~23%.
Am I right that the training part of the dataset contains 52812 examples of code snippets?
How many epochs passed before this score was obtained?

Trainable Model? (Similar to Code2Vec)

Hey,

I am a master's student in Computer Science at the HTWG Konstanz University of Applied Sciences.

Currently, I am working on my master thesis, where I am investigating various machine learning methods that deal with the naming of variables and methods.

My idea is to train a model even further, for example to adapt it for a certain domain, which uses other technical terms for the method names.

I saw that you provided a trainable model for code2vec.
Would it be possible for you to provide something like this for code2seq?

Thanks,
Marcel

Error when running the program

Hi,
When I run the command "bash train.sh", errors occur.
I thought it was because of the length of but I found they were less than 10.
I cannot find the reason. Could you tell me what might be causing this bug?

Thx!

2020-01-08 23:11:19.264751: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-01-08 23:11:20.352762: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:09:00.0
totalMemory: 23.62GiB freeMemory: 14.38GiB
2020-01-08 23:11:20.352827: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2020-01-08 23:11:21.006092: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-01-08 23:11:21.006192: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2020-01-08 23:11:21.006203: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2020-01-08 23:11:21.007392: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13914 MB memory) -> physical GPU (device: 0, name: TITAN RTX, pci bus id:
0000:09:00.0, compute capability: 7.5)
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/sparse_ops.py:1165: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a tf.sparse.SparseTensor and use tf.sparse.to_dense instead.
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence
[[{{node IteratorGetNext}} = IteratorGetNextoutput_shapes=[[?,160,11], [?,160], [?,160,5], [?,160], [?,160,1], [?,160,1], [?,160,5], [?,160], [?,160,1], [?,?], [?], [?], [?,160]], output_types=[DT_INT32, DT_INT32, DT_INT32, DT
_INT32, DT_STRING, DT_STRING, DT_INT32, DT_INT32, DT_STRING, DT_INT32, DT_INT64, DT_STRING, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
[[{{node model/dense/Tensordot/GatherV2/_183}} = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1,
tensor_name="edge_1720_model/dense/Tensordot/GatherV2", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/lxr/baseline/code2seq/code2seq/model.py", line 95, in train
_, batch_loss = self.sess.run([optimizer, train_loss])
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence
[[node IteratorGetNext (defined at /home/lxr/baseline/code2seq/code2seq/reader.py:192) = IteratorGetNextoutput_shapes=[[?,160,11], [?,160], [?,160,5], [?,160], [?,160,1], [?,160,1], [?,160,5], [?,160], [?,160,1], [?,?], [?],
[?], [?,160]], output_types=[DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_STRING, DT_STRING, DT_INT32, DT_INT32, DT_STRING, DT_INT32, DT_INT64, DT_STRING, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
[[{{node model/dense/Tensordot/GatherV2/_183}} = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1,
tensor_name="edge_1720_model/dense/Tensordot/GatherV2", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

Caused by op 'IteratorGetNext', defined at:
File "code2seq.py", line 36, in
model.train()
File "/home/lxr/baseline/code2seq/code2seq/model.py", line 76, in train
config=self.config)
File "/home/lxr/baseline/code2seq/code2seq/reader.py", line 43, in init
self.output_tensors = self.compute_output()
File "/home/lxr/baseline/code2seq/code2seq/reader.py", line 192, in compute_output
return self.iterator.get_next()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 421, in get_next
name=name)), self._output_types,
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_dataset_ops.py", line 2069, in iterator_get_next
output_shapes=output_shapes, name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1770, in init
self._traceback = tf_stack.extract_stack()

OutOfRangeError (see above for traceback): End of sequence
[[node IteratorGetNext (defined at /home/lxr/baseline/code2seq/code2seq/reader.py:192) = IteratorGetNextoutput_shapes=[[?,160,11], [?,160], [?,160,5], [?,160], [?,160,1], [?,160,1], [?,160,5], [?,160], [?,160,1], [?,?], [?],
[?], [?,160]], output_types=[DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_STRING, DT_STRING, DT_INT32, DT_INT32, DT_STRING, DT_INT32, DT_INT64, DT_STRING, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
[[{{node model/dense/Tensordot/GatherV2/_183}} = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1,
tensor_name="edge_1720_model/dense/Tensordot/GatherV2", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Expect 1001 fields but have more in record
[[{{node IteratorGetNext_1}} = IteratorGetNextoutput_shapes=[[?,160,11], [?,160], [?,160,5], [?,160], [?,160,1], [?,160,1], [?,160,5], [?,160], [?,160,1], [?,?], [?], [?], [?,160]], output_types=[DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_STRING, DT_STRING, DT_INT32, DT_INT32, DT_STRING, DT_INT32, DT_INT64, DT_STRING, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
[[{{node model_1/LuongAttention/memory_layer/Tensordot/GatherV2_1/_389}} = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_487_model_1/LuongAttention/memory_layer/Tensordot/GatherV2_1", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "code2seq.py", line 36, in <module>
model.train()
File "/home/lxr/baseline/code2seq/code2seq/model.py", line 106, in train
results, precision, recall, f1 = self.evaluate()
File "/home/lxr/baseline/code2seq/code2seq/model.py", line 184, in evaluate
[self.eval_predicted_indices_op, self.eval_true_target_strings_op, self.eval_topk_values],
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Expect 1001 fields but have more in record
[[node IteratorGetNext_1 (defined at /home/lxr/baseline/code2seq/code2seq/reader.py:192) = IteratorGetNextoutput_shapes=[[?,160,11], [?,160], [?,160,5], [?,160], [?,160,1], [?,160,1], [?,160,5], [?,160], [?,160,1], [?,?], [?], [?], [?,160]], output_types=[DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_STRING, DT_STRING, DT_INT32, DT_INT32, DT_STRING, DT_INT32, DT_INT64, DT_STRING, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
[[{{node model_1/LuongAttention/memory_layer/Tensordot/GatherV2_1/_389}} = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_487_model_1/LuongAttention/memory_layer/Tensordot/GatherV2_1", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

Caused by op 'IteratorGetNext_1', defined at:
File "code2seq.py", line 36, in <module>
model.train()
File "/home/lxr/baseline/code2seq/code2seq/model.py", line 106, in train
results, precision, recall, f1 = self.evaluate()
File "/home/lxr/baseline/code2seq/code2seq/model.py", line 148, in evaluate
config=self.config, is_evaluating=True)
File "/home/lxr/baseline/code2seq/code2seq/reader.py", line 43, in __init__
self.output_tensors = self.compute_output()
File "/home/lxr/baseline/code2seq/code2seq/reader.py", line 192, in compute_output
return self.iterator.get_next()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 421, in get_next
name=name)), self._output_types,
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_dataset_ops.py", line 2069, in iterator_get_next
output_shapes=output_shapes, name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Expect 1001 fields but have more in record
[[node IteratorGetNext_1 (defined at /home/lxr/baseline/code2seq/code2seq/reader.py:192) = IteratorGetNextoutput_shapes=[[?,160,11], [?,160], [?,160,5], [?,160], [?,160,1], [?,160,1], [?,160,5], [?,160], [?,160,1], [?,?], [?], [?], [?,160]], output_types=[DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_STRING, DT_STRING, DT_INT32, DT_INT32, DT_STRING, DT_INT32, DT_INT64, DT_STRING, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
[[{{node model_1/LuongAttention/memory_layer/Tensordot/GatherV2_1/_389}} = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_487_model_1/LuongAttention/memory_layer/Tensordot/GatherV2_1", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

I can't preprocess java-small

Hi, I had trouble preprocessing the java-small dataset.

Extracting paths from validation set...
Finished extracting paths from validation set
Extracting paths from test set...
Finished extracting paths from test set
Extracting paths from training set...
dir: /home/kf/code2seq/java-small/training/liferay-portal was not completed in time
dir: /home/kf/code2seq/java-small/training/cassandra was not completed in time
dir: /home/kf/code2seq/java-small/training/intellij-community was not completed in time
dir: /home/kf/code2seq/java-small/training/presto was not completed in time
dir: /home/kf/code2seq/java-small/training/spring-framework was not completed in time
dir: /home/kf/code2seq/java-small/training/wildfly was not completed in time
dir: /home/kf/code2seq/java-small/training/elasticsearch was not completed in time
dir: /home/kf/code2seq/java-small/training/hibernate-orm was not completed in time
dir: /home/kf/code2seq/java-small/training/gradle was not completed in time
Finished extracting paths from training set
Creating histograms from the training data
subtoken vocab size:  0
node vocab size:  0
target vocab size:  0
File: my_dataset.test.raw.txt
Traceback (most recent call last):
  File "preprocess.py", line 115, in <module>
    max_contexts=int(args.max_contexts), max_data_contexts=int(args.max_data_contexts))
  File "preprocess.py", line 53, in process_file
    print('Average total contexts: ' + str(float(sum_total) / total))
ZeroDivisionError: float division by zero

It happened right after I ran bash preprocess.sh. I would appreciate your kind help.
My English is not good. If I offend you, please forgive me.
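
The traceback above ends in `float(sum_total) / total` with `total` equal to zero, right after all three vocabularies were reported as empty. A minimal sketch, in Python, of a guard that reports this case instead of crashing. The helper name `average_contexts` is hypothetical and is not part of preprocess.py; only the variable names are taken from the traceback.

```python
def average_contexts(sum_total, total):
    """Return the mean number of contexts, or None when `total` is zero.

    Hypothetical helper: `sum_total` and `total` mirror the names in the
    traceback above, but this function is not part of preprocess.py.
    """
    if total == 0:
        # Nothing was counted, so there is no meaningful average to report.
        return None
    return float(sum_total) / total


if __name__ == "__main__":
    avg = average_contexts(0, 0)
    if avg is None:
        print("Nothing was counted; check that path extraction finished.")
    else:
        print("Average total contexts: " + str(avg))
```

A guard like this only makes the failure readable. The "was not completed in time" messages earlier in the log would still need to be resolved for the dataset to be usable.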

Zero division error and Thread Pool Executor timeout

When trying to train a model from scratch, I consistently get the following two errors on the small dataset. Any suggestions as to why this is happening?

Extraction API giving 403 Forbidden

I loaded the pre-trained model and tried to run prediction in interactive mode on the sample function given in Input.java. However, the Extraction API (https://ff655m4ut8.execute-api.us-east-1.amazonaws.com/production/extractmethods) used in the code returns 403 Forbidden. Can you please tell me whether I am missing something?

Potential bug: when calculating rouge

When the model evaluates and calculates the ROUGE score, I came across the following exception:

Traceback (most recent call last):
File "code2seq.py", line 41, in <module>
model.train()
File "model.py", line 101, in train
results, precision, recall, f1, rouge = self.evaluate()
File "model.py", line 241, in evaluate
rouge = files_rouge.get_scores(hyp_path=predicted_file_name, ref_path=ref_file_name, avg=True, ignore_empty=True)
File "lib/python3.7/site-packages/rouge/rouge.py", line 47, in get_scores
ignore_empty=ignore_empty)
File "lib/python3.7/site-packages/rouge/rouge.py", line 98, in get_scores
hyps, refs = zip(*hyps_and_refs)

After checking the prediction file and the reference file, I found that all of the model's outputs are 'PAD' tokens, which I think are written as spaces. If the prediction file contains only blank lines, files_rouge.get_scores throws the exception above.
I am not sure I am correct.
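
The last frame of the traceback unpacks `zip(*hyps_and_refs)` into two names. In plain Python that unpacking raises a ValueError whenever the list of pairs is empty, which would be the case if every prediction line were blank and empty pairs were dropped first. A minimal sketch of the failure and of a guard, using only the standard library. `split_pairs` and `drop_empty` are hypothetical helpers written for illustration and do not reproduce the rouge package's code.

```python
def drop_empty(pairs):
    """Keep only pairs whose hypothesis and reference both contain non-space text."""
    return [(h, r) for h, r in pairs if h.strip() and r.strip()]


def split_pairs(hyps_and_refs):
    """Split (hypothesis, reference) pairs into two tuples.

    Hypothetical helper: returns two empty tuples for an empty input
    instead of raising, which a bare `zip(*pairs)` unpacking would do.
    """
    if not hyps_and_refs:
        return (), ()
    hyps, refs = zip(*hyps_and_refs)
    return hyps, refs


if __name__ == "__main__":
    # Every hypothesis is blank, so filtering leaves no pairs at all.
    pairs = drop_empty([("  ", "get name"), ("", "set value")])
    try:
        hyps, refs = zip(*pairs)
    except ValueError as err:
        print("bare unpacking failed:", err)
    print(split_pairs(pairs))
```

With a guard of this kind the caller can skip the ROUGE computation for that evaluation round and log that the model produced no non-empty predictions.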

Reproducing numbers from the paper on java-med dataset

Hi,

Thanks for this great work. I'm trying to reproduce the results from the paper for java-med, and I was wondering what values were used for config.SUBTOKENS_VOCAB_MAX_SIZE and config.TARGET_VOCAB_MAX_SIZE. I couldn't find them in the paper or in any existing issue.

Thank you in advance.

Best,
Claudio

Extraction API giving 403 Forbidden

Hi, I loaded the pre-trained model and am trying to run prediction in interactive mode on the sample function given in Input.java. However, the Extraction API (https://ff655m4ut8.execute-api.us-east-1.amazonaws.com/production/extractmethods) used in the code is giving a 403 Forbidden error, as in issue #7.

I'd be happy to modify the JavaExtractor to output the required data as in the JSON normally returned by the API. It would be easier to do this if some examples of the response JSON were available.

Thank you!

Could this code support TensorFlow 2.0?

I know this code is based on graphs and sessions, and that the code2vec project supports TensorFlow 2.0. Could you give some advice on how to make this code support TensorFlow 2.0?
Thank you.

Error while running train.sh on custom dataset

Hi, when I run train.sh on a custom dataset preprocessed with preprocess.sh, I encounter the following error. Can you suggest a possible reason for this behavior?

Initalized variables
Started reader...
2020-05-03 22:34:30.310277: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
Average loss at batch 100: 25.005198, throughput: 343 samples/sec
Finished 1 epochs
WARNING:tensorflow:Entity <bound method Reader.process_dataset of <reader.Reader object at 0x7fb71eb9d7b8>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: module 'gast' has no attribute 'Num'
Done testing, epoch reached
Traceback (most recent call last):
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.OutOfRangeError: 2 root error(s) found.
(0) Out of range: End of sequence
[[{{node IteratorGetNext}}]]
[[model/gradients/model/embedding_lookup_2_grad/Size/_35]]
(1) Out of range: End of sequence
[[{{node IteratorGetNext}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/content/code2seq/model.py", line 96, in train
_, batch_loss = self.sess.run([optimizer, train_loss])
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.OutOfRangeError: 2 root error(s) found.
(0) Out of range: End of sequence
[[node IteratorGetNext (defined at /tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py:1748) ]]
[[model/gradients/model/embedding_lookup_2_grad/Size/_35]]
(1) Out of range: End of sequence
[[node IteratorGetNext (defined at /tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'IteratorGetNext':
File "code2seq.py", line 39, in <module>
model.train()
File "/content/code2seq/model.py", line 77, in train
config=self.config)
File "/content/code2seq/reader.py", line 43, in __init__
self.output_tensors = self.compute_output()
File "/content/code2seq/reader.py", line 192, in compute_output
return self.iterator.get_next()
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/data/ops/iterator_ops.py", line 426, in get_next
name=name)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/ops/gen_dataset_ops.py", line 2518, in iterator_get_next
output_shapes=output_shapes, name=name)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "code2seq.py", line 39, in <module>
model.train()
File "/content/code2seq/model.py", line 108, in train
results, precision, recall, f1, rouge = self.evaluate()
File "/content/code2seq/model.py", line 228, in evaluate
files_rouge = FilesRouge(predicted_file_name, ref_file_name)
File "/content/code2seq/rouge/rouge.py", line 13, in __init__
self.rouge = Rouge(*args, **kwargs)
File "/content/code2seq/rouge/rouge.py", line 72, in __init__
raise ValueError("Unknown metric '%s'" % m)
ValueError: Unknown metric 'm'

Unexpected TimeoutError

Traceback (most recent call last):
File "code2seq.py", line 40, in <module>
predictor.predict()
File "/Users/wujianwei/Desktop/code2seq/interactive_predict.py", line 35, in predict
predict_lines, pc_info_dict = self.path_extractor.extract_paths(user_input)
File "/Users/wujianwei/Desktop/code2seq/extractor.py", line 26, in extract_paths
raise TimeoutError(response.text)
TimeoutError: {"errorMessage":"2019-10-22T15:45:25.088Z 6b98c48f-5456-4690-9e5f-2b6e84894521 Task timed out after 19.02 seconds"}
-Error message.

Hi, researchers from tech-srl:
I tried to use the default model to predict names for a long program, but instead of giving me predicted names, the tool gave me the error message above. I'm not sure how to fix it, even after checking the code. My guess is that the JSON parser cannot handle large programs, or did I do something wrong?

Best,
Jianwei Wu

Errors when training a model from scratch on my preprocessed Python data

How can I deal with this?
Does it result from my preprocessed Python data?

WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/sparse_ops.py:1165: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a tf.sparse.SparseTensor and use tf.sparse.to_dense instead.
2019-12-14 19:30:28.721583: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[12] = [3,9] is out of bounds: need 0 <= index < [200,9]
2019-12-14 19:30:28.724205: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[9] = [0,9] is out of bounds: need 0 <= index < [200,9]
2019-12-14 19:30:28.724281: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[11] = [2,9] is out of bounds: need 0 <= index < [200,9]
2019-12-14 19:30:28.730303: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[9] = [0,9] is out of bounds: need 0 <= index < [200,9]
2019-12-14 19:30:28.730911: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[9] = [0,9] is out of bounds: need 0 <= index < [200,9]
2019-12-14 19:30:28.731179: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[15] = [1,9] is out of bounds: need 0 <= index < [200,9]
2019-12-14 19:30:28.731549: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[9] = [0,9] is out of bounds: need 0 <= index < [200,9]
2019-12-14 19:30:29.089994: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[10] = [1,9] is out of bounds: need 0 <= index < [200,9]
2019-12-14 19:30:29.095160: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[15] = [1,9] is out of bounds: need 0 <= index < [200,9]
2019-12-14 19:30:29.099648: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[9] = [0,9] is out of bounds: need 0 <= index < [200,9]
2019-12-14 19:30:33.144993: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[25] = [4,9] is out of bounds: need 0 <= index < [200,9]
2019-12-14 19:30:33.145895: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[75] = [13,9] is out of bounds: need 0 <= index < [200,9]
2019-12-14 19:30:33.345387: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[17] = [2,9] is out of bounds: need 0 <= index < [200,9]
2019-12-14 19:30:33.345840: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[9] = [0,9] is out of bounds: need 0 <= index < [200,9]

Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence
[[{{node IteratorGetNext}} = IteratorGetNextoutput_shapes=[[?,200,9], [?,200], [?,200,5], [?,200], [?,200,1], [?,200,1], [?,200,5], [?,200], [?,200,1], [?,?], [?], [?], [?,200]], output_types=[DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_STRING, DT_STRING, DT_INT32, DT_INT32, DT_STRING, DT_INT32, DT_INT64, DT_STRING, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
[[{{node model/dense/Tensordot/GatherV2/_183}} = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_1720_model/dense/Tensordot/GatherV2", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/lxr/baseline/code2seq/code2seq/model.py", line 95, in train
_, batch_loss = self.sess.run([optimizer, train_loss])
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence
[[node IteratorGetNext (defined at /home/lxr/baseline/code2seq/code2seq/reader.py:192) = IteratorGetNextoutput_shapes=[[?,200,9], [?,200], [?,200,5], [?,200], [?,200,1], [?,200,1], [?,200,5], [?,200], [?,200,1], [?,?], [?], [?], [?,200]], output_types=[DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_STRING, DT_STRING, DT_INT32, DT_INT32, DT_STRING, DT_INT32, DT_INT64, DT_STRING, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
[[{{node model/dense/Tensordot/GatherV2/_183}} = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_1720_model/dense/Tensordot/GatherV2", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

Caused by op 'IteratorGetNext', defined at:
File "code2seq.py", line 33, in <module>
model.train()
File "/home/lxr/baseline/code2seq/code2seq/model.py", line 76, in train
config=self.config)
File "/home/lxr/baseline/code2seq/code2seq/reader.py", line 43, in __init__
self.output_tensors = self.compute_output()
File "/home/lxr/baseline/code2seq/code2seq/reader.py", line 192, in compute_output
return self.iterator.get_next()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 421, in get_next
name=name)), self._output_types,
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_dataset_ops.py", line 2069, in iterator_get_next
output_shapes=output_shapes, name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()

OutOfRangeError (see above for traceback): End of sequence
[[node IteratorGetNext (defined at /home/lxr/baseline/code2seq/code2seq/reader.py:192) = IteratorGetNextoutput_shapes=[[?,200,9], [?,200], [?,200,5], [?,200], [?,200,1], [?,200,1], [?,200,5], [?,200], [?,200,1], [?,?], [?], [?], [?,200]], output_types=[DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_STRING, DT_STRING, DT_INT32, DT_INT32, DT_STRING, DT_INT32, DT_INT64, DT_STRING, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
[[{{node model/dense/Tensordot/GatherV2/_183}} = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_1720_model/dense/Tensordot/GatherV2", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "code2seq.py", line 33, in <module>
model.train()
File "/home/lxr/baseline/code2seq/code2seq/model.py", line 106, in train
results, precision, recall, f1 = self.evaluate()
File "/home/lxr/baseline/code2seq/code2seq/model.py", line 220, in evaluate
output_file.write(str(num_correct_predictions / total_predictions) + '\n')
ZeroDivisionError: division by zero

Questions about input data

Hello,
I want to know more about the structure of the input data, so I ran reader.py with the java-small preprocessed data, but there is a bug. Could you give me some advice?
Thank you.

The exception occurs when we run sess.run at line 249 of reader.py. Our platform is Ubuntu 18.04 with a GTX 2080, and the TensorFlow version is 1.15.0:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Expect 5 fields but have more in record
[[{{node IteratorGetNext}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/paullee/.vscode/extensions/ms-python.python-2019.10.44104/pythonFiles/ptvsd_launcher.py", line 43, in <module>
main(ptvsdArgs)
File "/home/paullee/.vscode/extensions/ms-python.python-2019.10.44104/pythonFiles/lib/python/old_ptvsd/ptvsd/main.py", line 432, in main
run()
File "/home/paullee/.vscode/extensions/ms-python.python-2019.10.44104/pythonFiles/lib/python/old_ptvsd/ptvsd/main.py", line 316, in run_file
runpy.run_path(target, run_name='__main__')
File "/usr/lib/python3.6/runpy.py", line 263, in run_path
pkg_name=pkg_name, script_name=fname)
File "/usr/lib/python3.6/runpy.py", line 96, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/paullee/ls/code2seq-master/reader.py", line 271, in <module>
target_indices = sess.run(target_index_op)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Expect 5 fields but have more in record
[[node IteratorGetNext (defined at usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Original stack trace for 'IteratorGetNext':
File "home/paullee/.vscode/extensions/ms-python.python-2019.10.44104/pythonFiles/ptvsd_launcher.py", line 43, in <module>
main(ptvsdArgs)
File "home/paullee/.vscode/extensions/ms-python.python-2019.10.44104/pythonFiles/lib/python/old_ptvsd/ptvsd/main.py", line 432, in main
run()
File "home/paullee/.vscode/extensions/ms-python.python-2019.10.44104/pythonFiles/lib/python/old_ptvsd/ptvsd/main.py", line 316, in run_file
runpy.run_path(target, run_name='__main__')
File "usr/lib/python3.6/runpy.py", line 263, in run_path
pkg_name=pkg_name, script_name=fname)
File "usr/lib/python3.6/runpy.py", line 96, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "home/paullee/ls/code2seq-master/reader.py", line 222, in <module>
reader = Reader(subtoken_to_index, target_word_to_index, node_to_index, config, False)
File "home/paullee/ls/code2seq-master/reader.py", line 43, in __init__
self.output_tensors = self.compute_output()
File "home/paullee/ls/code2seq-master/reader.py", line 192, in compute_output
return self.iterator.get_next()
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/iterator_ops.py", line 426, in get_next
name=name)
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_dataset_ops.py", line 2518, in iterator_get_next
output_shapes=output_shapes, name=name)
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()
Terminated

Training on Python data leads to nan loss

Hello! I built a Python extractor that I used to train code2vec models on Python source code. However, when I try to train a code2seq model, the loss ends up being nan, so the model cannot converge.

From the logs:
Average loss at batch 100: nan, throughput: 571 samples/sec

I am preprocessing the data using code2seq's preprocess.sh script, with the Java extractor swapped out for my Python one. Can you provide any insight into what may be going wrong?

Training for Custom Multi Class Classification

Hi, @urialon. Firstly, thank you for the wonderful work! I think the reproducibility is exceptional here, I don't see this very often.

As suggested in Code2Vec issues tech-srl/code2vec#42 and tech-srl/code2vec#26, it is recommended to use the Code2Seq model instead of Code2Vec for binary classification. I am planning to do multi-class classification, but I imagine the required modifications should not be too different from those for binary classification. Could I have some general guidelines for this? The questions I have right now are:

What is the best way to approach this? What I have in mind right now is:

Change config.TARGET_VOCAB_MAX_SIZE to the number of target classes I have.
Change config.MAX_TARGET_PARTS to 1, as every prediction is supposed to have only one class.
Change the first field in JavaExtractor to output my desired label.
Train the model as specified in the README.md.
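
The configuration part of this plan can be sketched as follows. This only illustrates the two settings named above: `classification_overrides` is a hypothetical helper, the stand-in `config` object is not the repository's config class, and the source does not confirm that these two settings are sufficient for classification.

```python
from types import SimpleNamespace


def classification_overrides(num_classes):
    """Return the two settings proposed above for single-label classification.

    Hypothetical helper; the attribute names are the ones quoted in the
    issue, and the values follow the author's plan, not a verified recipe.
    """
    if num_classes < 2:
        raise ValueError("need at least two target classes")
    return {
        "TARGET_VOCAB_MAX_SIZE": num_classes,  # one vocabulary entry per class
        "MAX_TARGET_PARTS": 1,                 # each prediction is a single label
    }


if __name__ == "__main__":
    # Stand-in for the real config object, with placeholder starting values.
    config = SimpleNamespace(TARGET_VOCAB_MAX_SIZE=0, MAX_TARGET_PARTS=0)
    for name, value in classification_overrides(num_classes=5).items():
        setattr(config, name, value)
    print(config)
```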

Finally, I am planning to update the JavaParser version in my fork (since I am planning some more complex modifications that are easier with a newer JavaParser version), so the AST paths/hashes will be a little different, but still consistent. However, this should not affect the overall performance of the model when I am training from scratch, right?

How can I run the "code documentation" task reported in the paper?

Hi, I want to use the model for a "code documentation" task, which requires pairs of source code and a brief sentence (instead of the subtokens of a method name). But it is hard to modify the preprocessing pipeline. Can you help me achieve this? I think that just replacing the method name with the sentence would be enough.

Hi, how can I reproduce the results for code documentation as described in the paper?

Sorry to bother you again. I want to know how I can run code2seq for code documentation. I am trying to test code2seq on my dataset.

Parallelized Training?

Hey!

I'm a master's student training a model on java-large (I'm just starting by trying to reproduce the results), and, unsurprisingly, it's taking forever on a single GPU. When you trained the model for 52 epochs on java-large, did you distribute training? If not, approximately how long did it take you? And if you did, is there any chance you'd be willing to release the code that does it?

I would really appreciate it! I'm planning to train the model on a Python dataset and would love to be able to train more quickly later on.

Thanks,
Alex

How do I process my own data?

As we can see at line 36 of preprocess.sh:
TRAIN_DATA_FILE=${DATASET_NAME}.train.raw.txt
VAL_DATA_FILE=${DATASET_NAME}.val.raw.txt
TEST_DATA_FILE=${DATASET_NAME}.test.raw.txt

Can you tell me the format of train.raw.txt?

And after processing, is the dataset generated in TRAIN_DIR=my_training_dir?

error of preprocess

Extracting paths from validation set...
Finished extracting paths from validation set
Extracting paths from test set...
Finished extracting paths from test set
Extracting paths from training set...
dir: data/train was not completed in time
Finished extracting paths from training set
Creating histograms from the training data
subtoken vocab size: 0
node vocab size: 0
target vocab size: 0
File: 1.test.raw.txt
Traceback (most recent call last):
File "preprocess.py", line 115, in <module>
max_contexts=int(args.max_contexts), max_data_contexts=int(args.max_data_contexts))
File "preprocess.py", line 53, in process_file
print('Average total contexts: ' + str(float(sum_total) / total))
ZeroDivisionError: float division by zero

here is my preprocess.sh:
TRAIN_DIR=data/train
VAL_DIR=data/validation
TEST_DIR=data/tes
DATASET_NAME=1
MAX_DATA_CONTEXTS=1000
MAX_CONTEXTS=200
SUBTOKEN_VOCAB_SIZE=186277
TARGET_VOCAB_SIZE=26347
NUM_THREADS=1
PYTHON=python3.7
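
A ZeroDivisionError in process_file together with vocabulary sizes of 0 is consistent with one or more of the *.raw.txt files containing no examples, for instance after the extractor reports that a directory "was not completed in time". The following is a minimal sketch of a check that could be run before preprocess.py; the helper names are hypothetical and the temporary files only stand in for the real raw files.

```python
import os
import tempfile


def count_examples(path: str) -> int:
    """Return the number of non-blank lines in a raw extractor output file."""
    if not os.path.exists(path):
        return 0
    with open(path, encoding="utf-8", errors="replace") as f:
        return sum(1 for line in f if line.strip())


def find_empty_raw_files(paths):
    """Return the files that contain no examples at all."""
    return [p for p in paths if count_examples(p) == 0]


# Demo on temporary files standing in for the real raw files.
tmp_dir = tempfile.mkdtemp()
train_path = os.path.join(tmp_dir, "1.train.raw.txt")
test_path = os.path.join(tmp_dir, "1.test.raw.txt")
open(train_path, "w").close()  # empty, as after a timed-out extraction
with open(test_path, "w", encoding="utf-8") as f:
    f.write("get|name a,B|C,d\n")

empty_files = find_empty_raw_files([train_path, test_path])
print(empty_files)
```

If any file is reported as empty, the extraction step for that split is the place to look, rather than the histogram or preprocessing step.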

Scalability

After training, the model does not carry over well to other code. I tried using it to generate a summary for a method that adds an element to a LinkedList, but it failed terribly. Is anybody else facing the same issue?

Bleu=30.38 not 23.04

Hi, first of all, thanks for making your amazing work easy to reproduce.

I am reproducing the model and found that the BLEU score on java-large-test is 30.38, which is much better than the 23.04 reported in the paper.
How do I reproduce the 23.04? Am I doing something wrong here?

I used common.compute_bleu and configured the Perl script:

but I get a better score:

Any hints on this, please?

issue about data preprocess

Hi,
While reading reader.py, I had a question about the function process_dataset, specifically the sampling of the maximum number of contexts:

safe_limit = tf.cast(tf.maximum(num_contexts_per_example, self.config.MAX_CONTEXTS), tf.int32)
rand_indices = tf.random_shuffle(tf.range(safe_limit))[:self.config.MAX_CONTEXTS]
contexts = tf.gather(all_contexts, rand_indices) # (max_contexts,)

It seems this could index out of bounds.
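
Whether the gather can go out of bounds depends on the length of all_contexts, which the quoted lines do not show. A plain-Python sketch of the same sampling logic is given below, under the assumption that each record may or may not be padded to a fixed number of fields; the function names and the sizes used are illustrative only.

```python
import random


def sample_context_indices(num_valid: int, max_contexts: int) -> list:
    """Mirror the quoted lines: shuffle range(safe_limit), keep max_contexts."""
    safe_limit = max(num_valid, max_contexts)
    indices = list(range(safe_limit))
    random.shuffle(indices)
    return indices[:max_contexts]


def gather_is_in_bounds(row_length: int, num_valid: int, max_contexts: int) -> bool:
    """True if every index in range(safe_limit) addresses an existing row."""
    safe_limit = max(num_valid, max_contexts)
    return safe_limit <= row_length


# 50 real contexts, 200 requested.
sampled = sample_context_indices(num_valid=50, max_contexts=200)

# Row padded to 1000 fields (assumed): all indices 0..199 exist.
padded_ok = gather_is_in_bounds(row_length=1000, num_valid=50, max_contexts=200)

# Row holding only the 50 real contexts: indices 50..199 would not exist.
unpadded_ok = gather_is_in_bounds(row_length=50, num_valid=50, max_contexts=200)
print(padded_ok, unpadded_ok)
```

In other words, the sampling is safe as long as all_contexts always has at least MAX_CONTEXTS entries, with the extra entries being padding.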

translation on code2seq

Hi,
When I use the code2seq model to train a translation task on a Java dataset, the BLEU score is not promising. Do you have any suggestions for this task?

thanks

Csharp data preprocessing

Hi, I'm trying to use csharp extractor to generate a new preprocessed dataset, but encountered a problem.

I only changed target directories in the original preprocess_csharp.sh.

TRAIN_DIR=csharp-projects/TShock/TShockAPI/CLI
VAL_DIR=csharp-projects/TShock/TShockAPI/CLI
TEST_DIR=csharp-projects/TShock/TShockAPI/CLI
DATASET_NAME=csharp-tshockapi
MAX_DATA_CONTEXTS=1000
MAX_CONTEXTS=200
SUBTOKEN_VOCAB_SIZE=186277
TARGET_VOCAB_SIZE=26347
NUM_THREADS=64
PYTHON=python

However, with every directory containing *.cs files that I tried, I always get the "not completed in time" error.

I use Ubuntu 18.04 as the experiment environment, with the dotnet SDK (version 3.1) installed. By the way, the preprocessing script for Java works well. Is there anything I have to configure additionally for the C# dataset?

It would be much appreciated if you could provide any suggestions :)

Invalid Argument Error?

Hi, thanks for this great work! Can you clarify the below issue? This is the Stack trace for an error I am facing when I run train.sh. Any ideas on why this is happening?

Number of trainable params: 63956672
Initalized variables
Started reader...
2019-03-13 17:19:44.582113: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[305] = [240,9] is out of bounds: need 0 <= index < [300,9]
2019-03-13 17:19:44.585951: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[21] = [12,9] is out of bounds: need 0 <= index < [300,9]
2019-03-13 17:19:44.590843: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[55] = [24,9] is out of bounds: need 0 <= index < [300,9]
2019-03-13 17:19:45.009627: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[125] = [90,9] is out of bounds: need 0 <= index < [300,9]
2019-03-13 17:19:45.012939: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[16] = [7,9] is out of bounds: need 0 <= index < [300,9]
2019-03-13 17:19:45.014349: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[10] = [1,9] is out of bounds: need 0 <= index < [300,9]
2019-03-13 17:19:45.017185: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[296] = [199,9] is out of bounds: need 0 <= index < [300,9]
Traceback (most recent call last):
  File "/home/pankaj/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/pankaj/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/pankaj/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[305] = [240,9] is out of bounds: need 0 <= index < [300,9]
         [[{{node SparseToDense_3}} = SparseToDense[T=DT_STRING, Tindices=DT_INT64, validate_indices=true](StringSplit_3, SparseTensor_7/dense_shape, StringSplit_3:1, NotEqual/y)]]
         [[{{node IteratorGetNext}} = IteratorGetNext[output_shapes=[[?,300,9], [?,300], [?,300,5], [?,300], [?,300,1], [?,300,1], [?,300,5], [?,300], [?,300,1], [?,?], [?], [?], [?,300]], output_types=[DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_STRING, DT_STRING, DT_INT32, DT_INT32, DT_STRING, DT_INT32, DT_INT64, DT_STRING, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](IteratorV2)]]
         [[{{node model/dense/Tensordot/GatherV2/_183}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_1720_model/dense/Tensordot/GatherV2", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "code2seq.py", line 33, in <module>
    model.train()
  File "/home/pankaj/pankaj/repos/code2seq/model.py", line 95, in train
    _, batch_loss = self.sess.run([optimizer, train_loss])
  File "/home/pankaj/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/pankaj/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/pankaj/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/pankaj/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[305] = [240,9] is out of bounds: need 0 <= index < [300,9]
         [[{{node SparseToDense_3}} = SparseToDense[T=DT_STRING, Tindices=DT_INT64, validate_indices=true](StringSplit_3, SparseTensor_7/dense_shape, StringSplit_3:1, NotEqual/y)]]
         [[node IteratorGetNext (defined at /home/pankaj/pankaj/repos/code2seq/reader.py:192)  = IteratorGetNext[output_shapes=[[?,300,9], [?,300], [?,300,5], [?,300], [?,300,1], [?,300,1], [?,300,5], [?,300], [?,300,1], [?,?], [?], [?], [?,300]], output_types=[DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_STRING, DT_STRING, DT_INT32, DT_INT32, DT_STRING, DT_INT32, DT_INT64, DT_STRING, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](IteratorV2)]]
         [[{{node model/dense/Tensordot/GatherV2/_183}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_1720_model/dense/Tensordot/GatherV2", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
2019-03-13 17:19:45.455778: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[10] = [1,9] is out of bounds: need 0 <= index < [300,9]
2019-03-13 17:19:45.456818: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[86] = [30,9] is out of bounds: need 0 <= index < [300,9]
2019-03-13 17:19:45.461185: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[9] = [0,9] is out of bounds: need 0 <= index < [300,9]
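
The message says that an index whose second coordinate is 9 falls outside a tensor of shape [300, 9], which would happen if some '|'-separated component of a context has more than 9 parts. That reading is an assumption, as is the idea that the limit of 9 comes from the configured path length. Under that assumption, a small script like the following (helper names are hypothetical) can locate offending contexts in a line of the data file:

```python
def find_long_components(line: str, max_parts: int):
    """Return (context_index, component_index, part_count) for every
    comma-separated component that splits on '|' into more than max_parts."""
    hits = []
    fields = line.strip().split(" ")
    for ctx_idx, field in enumerate(fields[1:]):  # fields[0] is the target
        for comp_idx, component in enumerate(field.split(",")):
            part_count = len(component.split("|"))
            if part_count > max_parts:
                hits.append((ctx_idx, comp_idx, part_count))
    return hits


# Hypothetical line: the first context has a path with 10 nodes.
long_path = "|".join("N%d" % i for i in range(10))
sample_line = "do|it a," + long_path + ",b c,N0|N1,d"
hits = find_long_components(sample_line, max_parts=9)
print(hits)
```

If such contexts exist, the data was probably extracted with a longer maximum path length than the one the model is configured for.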

Reproducing numbers from the paper on java-small dataset

First of all, thank you for sharing the code of the model and the detailed reproduction instructions!

I tried to reproduce the results from the paper on the java-small dataset using default hyper-parameters from config.py, only changing the batch size to 256 to fit it into the GPU memory, and was able to fetch, preprocess data and train the model.

On the validation set, the best model achieved Precision: 36.24, Recall: 26.89, F1: 30.88.
In the paper's Table 1, the results on java-small are Precision: 50.64, Recall: 37.40, F1: 43.02.
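
As a sanity check on these figures, F1 is the harmonic mean of precision and recall. The short sketch below shows that an F1 of 43.02 with precision 50.64 corresponds to a recall of 37.40, and recomputes the validation F1 from the reported precision and recall (small differences in the last digit are expected because the inputs are already rounded).

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


paper_f1 = f1_score(50.64, 37.40)
validation_f1 = f1_score(36.24, 26.89)
print(round(paper_f1, 2), round(validation_f1, 2))
```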

Here is a notebook with all the steps and the output.

Most probably I have just missed something obvious here, and I would be very grateful if you could point me in the right direction to reproduce the paper's results.

Thanks in advance!

Question about batch dimension in build_training_graph function

Hi @urialon, I had a quick question about the batch dimension used in the build_training_graph method. I'm new to ML/DL and Tensorflow, but was interested in seeing what research is like, and this seemed like a really cool project. I'm currently annotating the code for the entire project so I can understand how everything fits together.

I understand the concept of batches as used in training, but I'm confused about the batch dimension used in the code here:

# (batch, max_contexts, decoder_size)
batched_contexts = self.compute_contexts(subtoken_vocab=subtoken_vocab, nodes_vocab=nodes_vocab,
                                                     source_input=path_source_indices, nodes_input=node_indices,
                                                     target_input=path_target_indices,
                                                     valid_mask=valid_context_mask,
                                                     path_source_lengths=path_source_lengths,
                                                     path_lengths=path_lengths, path_target_lengths=path_target_lengths)

The reason I'm confused is because the input to this function, input_tensors, represents (based on what I understood) a single processed example from the dataset. So, I don't understand if the shape-related comments you added here represent an implicit batch dimension, meaning that when following the execution of one example during training, I shouldn't really think about that dimension and instead focus on the others OR if this is an explicit dimension. In the latter case, I'm confused as to how this is possible, due to my assumptions about the shape of the input_tensors parameter.

I'm sure you are busy with research, but I was hoping you might be able to explain. I'm sure I must be overlooking something simple.

preprocess error

When I run preprocess.sh, I always get this message in error_log.txt: dir: data/val/libgdx was not completed in time, and my_dataset.val.raw.txt is empty.

I only ran one command from preprocess.sh: "${PYTHON} JavaExtractor/extract.py --dir ${VAL_DIR} --max_path_length 8 --max_path_width 2 --num_threads ${NUM_THREADS} --jar ${EXTRACTOR_JAR} > ${VAL_DATA_FILE} 2>> error_log.txt".

There are only a little over 200 *.java files in the data/val folder; I just want to try the bash script on some Java data.

My Java version is:

java version "1.8.0_91"
Java(TM) SE Runtime Environment (build 1.8.0_91-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.91-b15, mixed mode)

My old computer is configured as follows:

CPU: Intel(R) Core(TM) i5-6200U CPU @ 2.3GHz 2.40 GHz
RAM: 8 GB
OS: win10

I can get some results by running java -cp JavaExtractor/JPredict/target/JavaExtractor-0.0.1-SNAPSHOT.jar JavaExtractor.App --max_path_length 8 --max_path_width 2 --dir JavaExtractor/JPredict/src/main

I'm not sure whether this is a configuration error on my side or something else.
Thank you very much indeed!

Timeout from EXTRACTION_API for large Input.Java files

Hi,

While running interactive_predict, I get a timeout error (20s) from the extraction API if Input.Java contains multiple functions and happens to be reasonably big. Would you recommend using the JavaExtractor available in code2seq repo for such cases? Also, could you please share details on the json format used by EXTRACTION_API for encoding the output?

Thank you and very much appreciate the work that you are doing!

Output probability

Hi @urialon !

I am trying to use this model for bug prediction. Right now I am testing around with binary classification, before moving on to multi-class classification.

However, it would be nice to get a probability for bug/nobug labels that I have defined in JavaExtractor. I understand, that the model (as most seq2seq models) calculates the conditional probability for output tokens.

...the decoder then generates a sequence of output tokens y = (y1, ..., ym) one token at a
time, hence modeling the conditional probability: p (y1, ..., ym|x1, ..., xn).

Basically, this means that I would like to have the (conditional) probability for the very first output token. Tried to look around in model.py, but could not come up with a way to do it.
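
A minimal sketch of the computation being asked for, assuming the decoder's logits for the first output step can be fetched for one example (how to fetch them from model.py is not shown here, and the logit values and label names below are made up): applying a softmax to those logits gives the conditional probability of each candidate first token.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def first_token_probabilities(first_step_logits, vocab):
    """Map each candidate first token to its probability."""
    return dict(zip(vocab, softmax(first_step_logits)))


# Hypothetical logits for a two-label target vocabulary.
vocab = ["bug", "nobug"]
probs = first_token_probabilities([2.0, 0.5], vocab)
print(probs)
```

A real target vocabulary usually also contains special tokens such as padding or unknown; restricting the softmax to the two label entries, as done here, renormalizes the probability mass over just those labels.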

How to decode?

Hello!
I would like to know whether, in your experiments, the decoding phase also generates words one by one.
In your paper, you use AST paths for encoding. However, the decoding stage is not described in detail.
During decoding, is a word generated directly, or is a certain path generated first and then used to select the word?

Please replace the uptree and downtree delimiters in the CSharp extractor

code2seq requires the tokens in the path connecting the left and right context to be delimited by '|', but the CSharpExtractor uses '^' and '_' as the uptree and the downtree delimiters, respectively. Please fix this.

Clarifications on Combined Representation

Hi @urialon ,

Sorry for bothering again. I have some quick questions which I tried to make as clear as possible.

Looking at the Code Representation part we see that:

I assume that it is a fully connected layer similar to the one in the code2vec model, where the dense layer "combines" the terminals and paths into a certain dimension. My questions:

Assuming that W_in are the weights of the dense layer, why is the dimension (2d_path + 2d_token) × d_hidden? Shouldn't it be something like new dimension d_hidden* × (1d_path + 2d_token)?

*d_hidden being size for the combined representation which is the same as the decoder size - self.config.DECODER_SIZE

I am a bit confused by the notation. Namely, the terminals are marked as v_1 and v_l, but the path to be encoded is also marked as v_1, ..., v_l. Does that mean that the path to be encoded with the LSTM also contains the terminals? Would that mean the terminals are also in the embedding matrix E^nodes?
Is there any difference between "encoding" and "embedding" here?
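
One reading of the (2d_path + 2d_token) × d_hidden shape, offered as a sketch rather than as the authors' implementation: if the path is encoded by a bidirectional LSTM and the final forward and backward states are concatenated, the path encoding has size 2·d_path, and the two terminal tokens add 2·d_token. With row vectors multiplied from the left, the weight matrix then has (2d_path + 2d_token) rows and d_hidden columns. The sizes and random values below are toy placeholders.

```python
import math
import random

random.seed(0)
d_path, d_token, d_hidden = 4, 3, 5  # toy sizes, not the model's defaults


def rand_vec(n):
    return [random.uniform(-1.0, 1.0) for _ in range(n)]


# Stand-ins for the final states of the forward and backward LSTM passes.
forward_final = rand_vec(d_path)
backward_final = rand_vec(d_path)
path_encoding = forward_final + backward_final  # length 2 * d_path

# Stand-ins for the encodings of the two terminal tokens.
source_token = rand_vec(d_token)
target_token = rand_vec(d_token)

combined_input = path_encoding + source_token + target_token

# W_in has (2*d_path + 2*d_token) rows and d_hidden columns.
W_in = [rand_vec(d_hidden) for _ in range(len(combined_input))]

# z = tanh(x W_in), computed column by column.
z = [
    math.tanh(sum(combined_input[i] * W_in[i][j] for i in range(len(combined_input))))
    for j in range(d_hidden)
]
print(len(combined_input), len(W_in), len(W_in[0]), len(z))
```

Writing the product with column vectors instead would transpose the matrix to d_hidden × (2d_path + 2d_token); the two conventions describe the same layer.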

Right now I imagine the model being something like this (without the attention):
code2seq.pdf

Timeout Error while Preprocessing Java-Small

Hello researchers from tech-srl,
I'm a student about to graduate, and I'm trying to understand how to preprocess data so that I can later try code2seq with a new dataset that I would like to build on my own.
I would like to ask you a few questions:
How can I prevent all the RAM of my PC from being saturated during preprocessing?
I have noticed that during the extraction of the training set (which in my case is currently only a subfolder of the whole java-small training folder), the memory is fully used, and after that the following error appears.
My PC specs are:

CPU: AMD® Ryzen 5 3600 6-core processor × 12
GPU: NVIDIA GeForce GTX 1660/PCIe/SSE2
RAM: 16 GB
OS: Ubuntu 18.04

Extracting paths from validation set...
Finished extracting paths from validation set
Extracting paths from test set...
Finished extracting paths from test set
Extracting paths from training set...
dir: /home/francesco/Scrivania/Thesis/code2seq/data/noProcessed/java-small/training/elasticsearch was not completed in time
Finished extracting paths from training set
Creating histograms from the training data
subtoken vocab size:  0
node vocab size:  0
target vocab size:  0
File: my_dataset.test.raw.txt
Average total contexts: 220.83893135123765
Average final (after sampling) contexts: 89.03593717130636
Total examples: 57044
Max number of contexts per word: 13272
File: my_dataset.val.raw.txt
Average total contexts: 164.20042778057373
Average final (after sampling) contexts: 70.02675725549405
Total examples: 23844
Max number of contexts per word: 47830
File: my_dataset.train.raw.txt
Traceback (most recent call last):
  File "preprocess.py", line 118, in <module>
    max_contexts=int(args.max_contexts), max_data_contexts=int(args.max_data_contexts))
  File "preprocess.py", line 55, in process_file
    print('Average total contexts: ' + str(float(sum_total) / total))
ZeroDivisionError: float division by zero

Thank you in advance for your help.

translate.py

Hi, I wanted to try the baseline models; however, I could not find 'translate.py' to run the model. Could you please tell me where I can find it? Thank you so much.

Using the trained model
For the NMT baselines (BiLSTM, Transformer) we used the implementation of OpenNMT-py. The trained BiLSTM model is available here: https://code2seq.s3.amazonaws.com/lstm_baseline/model_acc_62.88_ppl_12.03_e16.pt
...

The command line for "translating" a "source" file to a "target" is: python3 translate.py -model model_acc_62.88_ppl_12.03_e16.pt -src test_source.txt -output translation_epoch16.txt -gpu 0

...

error of preprocess train data

code2seq$ head -c 2500 nohup.out

Extracting paths from validation set...
Finished extracting paths from validation set
Extracting paths from test set...
Finished extracting paths from test set
Extracting paths from training set...
b'java.util.concurrent.ExecutionException: com.github.javaparser.ParseProblemException: Encountered unexpected token: "-" "-"\n at line 90, column 21.\n\nWas expecting:\n\n "("\n\n\n\tat java.util.concurrent.FutureTask.report(FutureTask.java:122)\n\tat java.util.concurrent.FutureTask.get(FutureTask.java:192)\n\tat JavaExtractor.App.lambda$extractDir$3(App.java:59)\n\tat java.util.ArrayList.forEach(ArrayList.java:1257)\n\tat JavaExtractor.App.extractDir(App.java:57)\n\tat JavaExtractor.App.main(App.java:32)\nCaused by: com.github.javaparser.ParseProblemException: Encountered unexpected token: "-" "-"\n at line 90, column 21.\n\nWas expecting:\n\n "("\n\n\n\tat com.github.javaparser.JavaParser.simplifiedParse(JavaParser.java:242)\n\tat com.github.javaparser.JavaParser.parse(JavaParser.java:210)\n\tat JavaExtractor.FeatureExtractor.parseFileWithRetries(FeatureExtractor.java:73)\n\tat JavaExtractor.FeatureExtractor.extractFeatures(FeatureExtractor.java:45)\n\tat JavaExtractor.ExtractFeaturesTask.extractSingleFile(ExtractFeaturesTask.java:64)\n\tat JavaExtractor.ExtractFeaturesTask.processFile(ExtractFeaturesTask.java:34)\n\tat JavaExtractor.ExtractFeaturesTask.call(ExtractFeaturesTask.java:27)\n\tat JavaExtractor.ExtractFeaturesTask.call(ExtractFeaturesTask.java:16)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\njava.util.concurrent.ExecutionException: com.github.javaparser.ParseProblemException: Lexical error at line 13, column 24. 
Encountered: "l" (108), after : "\'p"\n\n\tat java.util.concurrent.FutureTask.report(FutureTask.java:122)\n\tat java.util.concurrent.FutureTask.get(FutureTask.java:192)\n\tat JavaExtractor.App.lambda$extractDir$3(App.java:59)\n\tat java.util.ArrayList.forEach(ArrayList.java:1257)\n\tat JavaExtractor.App.extractDir(App.java:57)\n\tat JavaExtractor.App.main(App.java:32)\nCaused by: com.github.javaparser.ParseProblemException: Lexical error at line 13, column 24. Encountered: "l" (108), after : "\'p"\n\n\tat com.github.javaparser.JavaParser.simplifiedParse(JavaParser.java:242)\n\tat com.github....

followed by a very long text full of this kind of information.

code2seq$ tail -22 nohup.out
Finished extracting paths from training set
Creating histograms from the training data
subtoken vocab size: 0
node vocab size: 0
target vocab size: 0
File: 1.test.raw.txt
Average total contexts: 874.889587565501
Average final (after sampling) contexts: 97.56637393088936
Total examples: 479955
Max number of contexts per word: 6200535
File: 1.val.raw.txt
Average total contexts: 627.8482423720924
Average final (after sampling) contexts: 92.95679282278428
Total examples: 283564
Max number of contexts per word: 834114
File: 1.train.raw.txt
Traceback (most recent call last):
  File "preprocess.py", line 115, in <module>
    max_contexts=int(args.max_contexts), max_data_contexts=int(args.max_data_contexts))
  File "preprocess.py", line 53, in process_file
    print('Average total contexts: ' + str(float(sum_total) / total))
ZeroDivisionError: float division by zero

code2seq$ ll data/1/
total 4604380
drwxrwxr-x 2 malei malei 4096 Jun 2 23:08 ./
drwxrwxrwx 6 malei malei 4096 May 29 23:55 ../
-rw-rw-r-- 1 malei malei 3047633598 Jun 2 23:06 1.test.c2s
-rw-rw-r-- 1 malei malei 0 Jun 2 23:08 1.train.c2s
-rw-rw-r-- 1 malei malei 1667230637 Jun 2 23:08 1.val.c2s

Here is the final result of preprocessing my data. It works well for the test and validation data, but for the training data it fails every time. [sad]

Actually, the extraction might be OK, but the histogram step failed.

thanks for help!

Questions about tf.errors.OutOfRangeError

Hello, I have tried your code and now I have a new question.
I see that your code uses "tf.errors.OutOfRangeError" to exit the loop in the train and evaluate functions. When is it raised? After every example in the train or test file has been read once, or in some other case?
Thank you.
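
In TensorFlow 1.x, running the get_next() op of a tf.data iterator raises tf.errors.OutOfRangeError once the dataset has no more elements, so catching it is a common way to end a loop. How many passes over the file that covers depends on how the dataset was built (for example, whether it was repeated for several epochs). The sketch below mimics the pattern in plain Python; EndOfData and OnePassReader are hypothetical stand-ins, not TensorFlow or code2seq classes.

```python
class EndOfData(Exception):
    """Stand-in for tf.errors.OutOfRangeError in this sketch."""


class OnePassReader:
    """Serves batches from a list once, then signals exhaustion."""

    def __init__(self, examples, batch_size):
        self._examples = examples
        self._batch_size = batch_size
        self._pos = 0

    def next_batch(self):
        if self._pos >= len(self._examples):
            raise EndOfData()
        batch = self._examples[self._pos:self._pos + self._batch_size]
        self._pos += self._batch_size
        return batch


def run_until_exhausted(reader):
    """Loop 'forever' and rely on the exception to leave the loop."""
    batches_seen = 0
    try:
        while True:
            reader.next_batch()
            batches_seen += 1
    except EndOfData:
        pass  # every example has been served; the loop ends here
    return batches_seen


batches = run_until_exhausted(OnePassReader(list(range(10)), batch_size=4))
print(batches)
```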

Empty hypothesis when periods are included in dataset

Hello Uri,

I am trying to train the Code2Seq model on the Funcom dataset. I tokenized the dataset by removing all special characters except for periods and commas. When I train a Code2Seq model on this dataset, I get the following error :

Saved after 1 epochs in: models/funcom-test/model_iter1
Finished 1 epochs
Done testing, epoch reached
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.OutOfRangeError: 2 root error(s) found.
  (0) Out of range: End of sequence
	 [[{{node IteratorGetNext}}]]
  (1) Out of range: End of sequence
	 [[{{node IteratorGetNext}}]]
	 [[IteratorGetNext/_27]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/vijayantajain/code/experiments/code2seq/model.py", line 96, in train
    _, batch_loss = self.sess.run([optimizer, train_loss])
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.OutOfRangeError: 2 root error(s) found.
  (0) Out of range: End of sequence
	 [[node IteratorGetNext (defined at /opt/conda/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
  (1) Out of range: End of sequence
	 [[node IteratorGetNext (defined at /opt/conda/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
	 [[IteratorGetNext/_27]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'IteratorGetNext':
  File "code2seq.py", line 39, in <module>
    model.train()
  File "/home/vijayantajain/code/experiments/code2seq/model.py", line 77, in train
    config=self.config)
  File "/home/vijayantajain/code/experiments/code2seq/reader.py", line 43, in __init__
    self.output_tensors = self.compute_output()
  File "/home/vijayantajain/code/experiments/code2seq/reader.py", line 192, in compute_output
    return self.iterator.get_next()
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/data/ops/iterator_ops.py", line 426, in get_next
    name=name)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_dataset_ops.py", line 2518, in iterator_get_next
    output_shapes=output_shapes, name=name)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "code2seq.py", line 39, in <module>
    model.train()
  File "/home/vijayantajain/code/experiments/code2seq/model.py", line 108, in train
    results, precision, recall, f1, rouge = self.evaluate()
  File "/home/vijayantajain/code/experiments/code2seq/model.py", line 230, in evaluate
    hyp_path=predicted_file_name, ref_path=ref_file_name, avg=True, ignore_empty=True)
  File "/opt/conda/lib/python3.7/site-packages/rouge/rouge.py", line 47, in get_scores
    ignore_empty=ignore_empty)
  File "/opt/conda/lib/python3.7/site-packages/rouge/rouge.py", line 105, in get_scores
    return self._get_avg_scores(hyps, refs)
  File "/opt/conda/lib/python3.7/site-packages/rouge/rouge.py", line 145, in _get_avg_scores
    sc = fn(hyp, ref, exclusive=self.exclusive)
  File "/opt/conda/lib/python3.7/site-packages/rouge/rouge.py", line 53, in <lambda>
    "rouge-1": lambda hyp, ref, **k: rouge_score.rouge_n(hyp, ref, 1, **k),
  File "/opt/conda/lib/python3.7/site-packages/rouge/rouge_score.py", line 253, in rouge_n
    raise ValueError("Hypothesis is empty.")
ValueError: Hypothesis is empty.

I have tried this a couple of times, changing model configurations and batch size, but I still get this error when the comments contain periods.

When I check pred.txt in the models directory I see that some lines are empty which is most likely causing the error.

When I remove all special characters in the Funcom dataset, including periods and comma, and train again I do not get this error.

Any idea on why the model would not predict anything for some examples if there are periods and commas in the dataset?

Thanks!
VJ
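
A possible workaround sketch, not the repository's own fix: drop prediction/reference pairs whose hypothesis is blank before scoring. It also treats hypotheses made only of periods as blank, on the assumption (not verified here) that the ROUGE package splits sentences on '.' and would find nothing left to score. File handling is omitted; the function works on lists of lines, and the sample sentences are invented.

```python
def is_blank(hypothesis: str) -> bool:
    """True if nothing but whitespace and periods is present."""
    return not hypothesis.replace(".", " ").strip()


def drop_blank_hypotheses(hyps, refs):
    """Return aligned (hyps, refs) without blank hypotheses, plus dropped indices."""
    if len(hyps) != len(refs):
        raise ValueError("hypotheses and references must be aligned line by line")
    kept_hyps, kept_refs, dropped = [], [], []
    for index, (hyp, ref) in enumerate(zip(hyps, refs)):
        if is_blank(hyp):
            dropped.append(index)
        else:
            kept_hyps.append(hyp)
            kept_refs.append(ref)
    return kept_hyps, kept_refs, dropped


hyps = ["returns the name .", "", " . ", "sets the value"]
refs = ["gets the name .", "closes the stream .", "opens a file .", "sets a value"]
kept_hyps, kept_refs, dropped = drop_blank_hypotheses(hyps, refs)
print(dropped)
```

Dropping examples changes what the reported score means, so the number of dropped pairs should be reported alongside the ROUGE value.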

Java parser invoked by default...

We have already trained a model using Python150kExtractor, but prediction keeps invoking the Java parser and throws this error.
We are using:
predictor=InteractivePredictor(config, model)
predictor.predict()
and got this error:
**'{"errorMessage":"Encountered unexpected token: \"import\" \"import\"\n at line 1, column 20.\n\nWas expecting one of:\n\n \";\"\n \"<\"\n \"@\"\n \"abstract\"\n \"boolean\"\n \"byte\"\n \"char\"\n \"class\"\n \"default\"\n \"double\"\n \"enum\"\n \"final\"\n \"float\"\n \"int\"\n \"interface\"\n \"long\"\n \"native\"\n \"private\"\n \"protected\"\n \"public\"\n \"short\"\n \"static\"\n \"strictfp\"\n \"synchronized\"\n \"transient\"\n \"void\"\n \"volatile\"\n \"{\"\n \"}\"\n \n\n","errorType":"com.github.javaparser.ParseProblemException","stackTrace":["com.github.javaparser.JavaParser.simplifiedParse(JavaParser.java:242)","com.github.javaparser.JavaParser.parse(JavaParser.java:210)","MethodPaths.FeatureExtractor.parseFileWithRetries(FeatureExtractor.java:86)","MethodPaths.FeatureExtractor.extractFeatures(FeatureExtractor.java:49)","MethodPaths.ExtractFeaturesTask.extractSingleFile(ExtractFeaturesTask.java:140)","MethodPaths.ExtractionHandler.handleRequest(ExtractionHandler.java:49)","sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)","sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)","sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)","java.lang.reflect.Method.invoke(Method.java:498)"]}'**

Errors when creating new dataset.

Hi, thanks for your code. I used your script to create a new dataset. I first wanted to run the model on a small set of 10 methods, but training fails with the following error. I found that if I use the dict you provided, this error does not appear. Is it not recommended to run the model on a very small dataset?

errors:

Started reader...
2020-05-05 15:52:51.484324: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
Traceback (most recent call last):
  File "/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Expect 201 fields but have more in record
	 [[{{node IteratorGetNext}}]]
	 [[model/gradients/model/embedding_lookup_2_grad/Size/_33]]
  (1) Invalid argument: Expect 201 fields but have more in record
	 [[{{node IteratorGetNext}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "code2seq.py", line 39, in <module>
    model.train()
  File "/code2seq/model.py", line 96, in train
    _, batch_loss = self.sess.run([optimizer, train_loss])
  File "/hlib/python3.6/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "//lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Expect 201 fields but have more in record
	 [[node IteratorGetNext (defined at /home/jiyang/code/code2seq/reader.py:192) ]]
	 [[model/gradients/model/embedding_lookup_2_grad/Size/_33]]
  (1) Invalid argument: Expect 201 fields but have more in record
	 [[node IteratorGetNext (defined at /home/jiyang/code/code2seq/reader.py:192) ]]
0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node IteratorGetNext:
 IteratorV2 (defined at /code2seq/reader.py:190)

Input Source operations connected to node IteratorGetNext:
 IteratorV2 (defined at /code2seq/reader.py:190)

Original stack trace for 'IteratorGetNext':
  File "code2seq.py", line 39, in <module>
    model.train()
  File "code2seq/model.py", line 77, in train
    config=self.config)
  File "//code2seq/reader.py", line 43, in __init__
    self.output_tensors = self.compute_output()
  File "/code2seq/reader.py", line 192, in compute_output
    return self.iterator.get_next()
  File "//lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 426, in get_next
    output_shapes=self._structure._flat_shapes, name=name)
  File "//lib/python3.6/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 1947, in iterator_get_next
    output_shapes=output_shapes, name=name)
  File "/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()
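
The message "Expect 201 fields but have more in record" suggests that at least one line of the preprocessed data holds more fields than the reader expects. A minimal sketch for locating such lines follows. It assumes each record is a single space-separated line consisting of one target plus at most config.MAX_CONTEXTS = 200 contexts (hence 201 fields); `find_bad_records` and the sample data are hypothetical names used only for illustration and are not part of the repository.

```python
import io

# Assumption: 1 target field + MAX_CONTEXTS (200) context fields per record.
EXPECTED_FIELDS = 201


def find_bad_records(lines, expected_fields=EXPECTED_FIELDS):
    """Yield (line_number, field_count) for records with too many fields."""
    for line_number, line in enumerate(lines, start=1):
        # Split on single spaces only, the way a space-delimited CSV reader
        # would: consecutive spaces therefore count as empty fields.
        field_count = len(line.rstrip("\n").split(" "))
        if field_count > expected_fields:
            yield line_number, field_count


if __name__ == "__main__":
    # Tiny in-memory stand-in for a preprocessed training file.
    sample = io.StringIO(
        "get|name a,P1,b c,P2,d\n"
        "set|name " + " ".join(["x,P,y"] * 201) + "\n"
    )
    for line_number, field_count in find_bad_records(sample):
        print(f"line {line_number}: {field_count} fields")
```

To check a real file, pass an open file handle instead of the in-memory sample. Any line it reports is a candidate for the record that triggers the error.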

target file

@yahave @urialon

Hi Uri,

I need some direction with code2seq. I see that preprocess.sh creates its own target labels from the training dataset. I am trying to use code2seq for a project, but I already have a target file. How can I make the target variable read from my target file?

If that is possible, what modifications do I need to make to preprocess.sh?

Thanks
Anurag
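
One possible approach, sketched here under stated assumptions rather than taken from the repository: leave preprocess.sh unchanged and rewrite the extractor's output before it is preprocessed. The sketch assumes that each extracted example is one line whose first space-separated field is the target label (subtokens joined by `|`), and that the separate target file holds one label per line in the same order as the examples. `replace_targets` is a hypothetical helper name.

```python
def replace_targets(example_lines, target_labels):
    """Swap the first field of each example line for the matching label."""
    example_lines = list(example_lines)
    target_labels = list(target_labels)
    if len(example_lines) != len(target_labels):
        raise ValueError("examples and targets must have the same length")

    rewritten = []
    for line, label in zip(example_lines, target_labels):
        # Everything after the first space is the list of path contexts.
        _, _, contexts = line.rstrip("\n").partition(" ")
        rewritten.append(f"{label.strip()} {contexts}".rstrip())
    return rewritten


if __name__ == "__main__":
    examples = ["get|name a,P1,b c,P2,d\n"]
    labels = ["fetch|label\n"]
    print(replace_targets(examples, labels))
```

The length check is deliberate: a silent mismatch between examples and labels would pair code with the wrong target for every later line.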

Question about resources in code2seq

Hello, I have a new question.
How much GPU memory is needed for training the network in code2seq with the default configurations?
I use TensorFlow 1.15.0 with a GTX 2080 (8 GB of GPU memory) on Ubuntu 18.04, but I still get a resource exhausted error.

2019-11-01 11:08:10.487546: W tensorflow/core/common_runtime/bfc_allocator.cc:424] ****************************************************************************************************
2019-11-01 11:08:10.487564: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at matmul_op.cc:480 : Resource exhausted: OOM when allocating tensor with shape[102400,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[102400,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node model/bidirectional_rnn/fw/fw/while/lstm_cell/MatMul}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[model/Momentum/update/_224]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[102400,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node model/bidirectional_rnn/fw/fw/while/lstm_cell/MatMul}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
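
A worked calculation may help in reading the log above. The first dimension of the failing tensor, 102400, is consistent with the default config.BATCH_SIZE = 512 multiplied by config.MAX_CONTEXTS = 200 from the README. That interpretation is an assumption rather than something the log states. Under it, the sketch below shows how large one such float32 tensor is and how the size shrinks with a smaller batch; `tensor_mib` is a hypothetical helper.

```python
BATCH_SIZE = 512       # README default
MAX_CONTEXTS = 200     # README default
BYTES_PER_FLOAT32 = 4


def tensor_mib(rows, cols, bytes_per_value=BYTES_PER_FLOAT32):
    """Size in MiB of a dense rows x cols tensor."""
    return rows * cols * bytes_per_value / (1024 ** 2)


rows = BATCH_SIZE * MAX_CONTEXTS
print(rows, tensor_mib(rows, 512))          # 102400 200.0

# Same tensor with a batch size of 128 (an illustrative value).
print(tensor_mib(128 * MAX_CONTEXTS, 512))  # 50.0
```

A single tensor of this shape takes 200 MiB, and training keeps many such activations and gradients alive at once, so lowering config.BATCH_SIZE is a reasonable first thing to try on an 8 GB card.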

Input for training code2seq on new dataset

Hi, I am using code2seq on a custom Java dataset. What input format does code2seq expect? My code and summaries are currently in .txt files (all the code sequences are in a single text file, and the same for the predictions).
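
As a partial illustration only: the code side still has to go through the Java extractor mentioned in the README, so plain text files cannot be fed to the network directly. For the summary side, the sketch below assumes the target label is written as lowercase subtokens joined by `|`, and that truncating to config.MAX_TARGET_PARTS (6 by default) is acceptable. `summary_to_target` is a hypothetical helper, not a function from the repository.

```python
import re


def summary_to_target(summary, max_parts=None):
    """Turn a free-text summary into a '|'-joined, lowercase token label."""
    # Keep only alphanumeric runs, so punctuation and extra spaces vanish.
    tokens = re.findall(r"[a-z0-9]+", summary.lower())
    if max_parts is not None:
        tokens = tokens[:max_parts]
    return "|".join(tokens)


if __name__ == "__main__":
    print(summary_to_target("Read a file into a string"))
    print(summary_to_target("Read a file into a string", max_parts=3))
```

Each resulting label would then take the place of the method name as the first field of the corresponding extracted example.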

Manual examination of the pretrained model generates an error

When I try to run the pretrained model from S3 to manually examine it, the following error occurs:
