Comments (8)
I added a link to Java-med-preprocessed as well, in the README:
https://github.com/tech-srl/code2seq/blob/master/README.md#datasets
from code2seq.
Hi Alexander,
First, this table includes the final results, so they are reported on the test set.
Second, this number is still a little low. I don't remember such a difference between the test and validation sets. I suspect that something went wrong with the preprocessing step (probably a timeout), such that you got fewer examples to train on.
To investigate this direction, can you please count the number of lines in each of your java-small training, validation and test sets?
Just cd to the dataset dir and run wc -l *.
Thank you for the prompt reply!
Indeed, some of the preprocessing failed on the Colab instance, but since state is not persistent there I did not keep those stats (it will take some hours to re-run; I will post back).
From the same notebook (where some data failed to preprocess), the evaluation results on the test set with the best model are:
Evaluation time: 0h2m39s
Accuracy: 0.01070277466367713
Precision: 0.2878609709141995, recall: 0.17890890275846458, F1: 0.22066929919029954
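As a sanity check, the reported F1 is just the harmonic mean of precision and recall; a small sketch using the numbers above:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1)."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall taken from the evaluation output above
p, r = 0.2878609709141995, 0.17890890275846458
print(f1_score(p, r))  # ≈ 0.2207, matching the reported F1
```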
But just in case, I ran the preprocessing twice and the training several times on the local machine (keeping some intermediate .csv data), and in both cases the results were the same:
wc -l data/java-small/*
1727 data/java-small/java-small.dict.c2s
291 data/java-small/java-small.histo.node.c2s
9361 data/java-small/java-small.histo.ori.c2s
3160 data/java-small/java-small.histo.tgt.c2s
57019 data/java-small/java-small.test.c2s
33771 data/java-small/java-small.train.c2s
23844 data/java-small/java-small.val.c2s
Numbers on the test set from the training logs on the local machine, with all the data preprocessed and patience increased to 20:
Accuracy after 24 epochs: 0.04240
After 24 epochs: Precision: 0.22624, recall: 0.11896, F1: 0.15593
Not improved for 20 epochs, stopping training
Best scores - epoch 4:
Precision: 0.29567, recall: 0.13068, F1: 0.18125
I see: you preprocessed far fewer examples than there are in the dataset. I designed the scripts to run on a 64-core machine, not on Colab, so they timed out and fewer than 5% of the examples were extracted.
Instead of preprocessing on Colab, take the following preprocessed dataset:
https://s3.amazonaws.com/code2seq/datasets/java-small-preprocessed.tar.gz
Regarding training - the default hyperparameters should be OK. In the paper I used (for Java-small specifically):
config.SUBTOKENS_VOCAB_MAX_SIZE = 7300
and:
config.TARGET_VOCAB_MAX_SIZE = 8700
But I think the default vocab sizes will work very similarly.
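A minimal sketch of applying these overrides; `Config` here is a stand-in for code2seq's config object, and the default values below are assumptions for illustration only:

```python
# Sketch: overriding the vocab sizes used in the paper for Java-small.
# `Config` is a stand-in for code2seq's config object; the defaults
# shown are assumptions, not confirmed repository values.
class Config:
    SUBTOKENS_VOCAB_MAX_SIZE = 190000  # assumed default
    TARGET_VOCAB_MAX_SIZE = 27000      # assumed default

config = Config()
config.SUBTOKENS_VOCAB_MAX_SIZE = 7300  # value used in the paper for Java-small
config.TARGET_VOCAB_MAX_SIZE = 8700     # value used in the paper for Java-small
```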
Thank you! I'll try these out over the weekend and report back.
From a quick glance, the numbers I posted are from training on ~1/10th of the data.
> Regarding training - the default hyperparameters should be OK. In the paper I used (for Java-small specifically):
Hi Uri,
May I ask what parameters you used for the java-med set? Also 190000 and 27000 respectively?
Thank you!
Hi @claudiosv,
Sure. Can you please create a new issue and I'll answer there?
I need to check, but I'm on vacation and I'm afraid I'll lose your question if it stays in this closed thread.
@urialon
Thanks Uri, I made a new issue. Enjoy your vacation 😄