Comments (8)
No problem @urialon, thanks for the suggestions. I will try and let you know.
from code2seq.
Yes, this sounds correct!
Good luck!
Hi @Tamal-Mondal ,
When you wrote:
I updated the model config to make it suitable for predicting longer sequences.
Did you also re-train the model after updating the config?
I see that you get about F1=0.50 in training-logs-2, so where do you see the empty predictions?
Uri
Thanks @urialon for the quick reply. Yes, I started training from scratch after making the config changes. With "training-logs-2", I was still getting output like "the|the|the|the". I started getting empty predictions (see training-logs-3) from step 3, i.e. when I applied more data cleaning steps.
One more thing: after applying so many data cleaning constraints (no punctuation, no numbers, etc.), my training dataset shrank to 1.6k examples. I am not sure whether the small amount of training data is the issue (I think the result still should not be this bad).
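To see how much data a cleaning rule like this removes, it can help to measure the keep ratio directly. The following is only a minimal sketch: the exact cleaning rules used here are not given in the thread, so `is_clean`, `filter_examples`, and the toy examples are hypothetical stand-ins for a "no punctuation, no numbers" filter on the target summary.

```python
import string

# Hypothetical cleaning rule (an assumption, not the rule used in the thread):
# keep an example only if its summary has no digits and no ASCII punctuation.
def is_clean(summary: str) -> bool:
    return not any(ch.isdigit() or ch in string.punctuation for ch in summary)

def filter_examples(examples):
    """Return the (code, summary) pairs that pass is_clean, plus the keep ratio."""
    kept = [(code, summary) for code, summary in examples if is_clean(summary)]
    ratio = len(kept) / len(examples) if examples else 0.0
    return kept, ratio

# Toy data for illustration only.
examples = [
    ("def a(): ...", "Returns a description of the dataset"),
    ("def b(): ...", "Get the 2nd item, or None"),
    ("def c(): ...", "Update boost factors when local inhibition is used"),
    ("def d(): ...", "Parse the user's config file."),
]
kept, ratio = filter_examples(examples)
print(len(kept), ratio)
```

Reporting the ratio per rule (digits only, punctuation only, both) would show which constraint is responsible for most of the shrinkage.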
Regards,
Tamal Mondal
Hi @urialon, sorry to bother you again. I still haven't understood the problem with my approach and am waiting for your reply. If you could please take a look and suggest something, it would be a great help.
Thanks & Regards,
Tamal Mondal
Hey @Tamal-Mondal ,
Sorry, for some reason I replied from my email and it does not appear in this thread.
The small number of examples can definitely be the issue.
You can try training on Python150k first and, after convergence, continue training on the additional 1600 examples.
As an orthogonal idea: in another project, we recently released a multilingual model called PolyCoder (paper: https://arxiv.org/pdf/2202.13169.pdf, code: https://github.com/VHellendoorn/Code-LMs).
PolyCoder is already trained on 12 languages, such as Java, C, and Python.
In C, we even managed to get better perplexity than OpenAI's Codex.
You can either use PolyCoder as is, or continue training it ("fine-tune") on your dataset.
So you might want to check it out as well.
Best,
Hi @urialon,
Here are some updates on this issue.
- I expected the issue to be either the dataset size or the data pre-processing, so to investigate I applied the same pre-processing steps to the CodeSearchNet (Python) data for the summarization task. Even though it has some 2.2 lakh (about 220k) data points in the training set, after adding constraints like no punctuation, no numbers, etc. in both the AST and the doc_string, only 11k training data points remained. This time there were no empty predictions. Following are some samples:
Original: Get|default|session|or|create|one|with|a|given|config , predicted 1st: Get|a
Original: Update|boost|factors|when|local|inhibition|is|used , predicted 1st: Remove|the
Original: Returns|a|description|of|the|dataset , predicted 1st: Returns|a|of|of|of
Original: Returns|the|sdr|for|jth|value|at|column|i , predicted 1st: Returns|the|for|for|for
As you can see, those predictions are far too short, and this is after convergence (in just 17 epochs). I changed the config for summarization as you suggested in some previous issues. I think the problem here can still be the dataset size, the target summary length, etc. (do let me know if you have any other observations). I am attaching the logs.
- I am currently training Code2Seq on the Python150k data and will fine-tune it on my own dataset, as you suggested. My understanding is that I need to train Code2Seq on Python150k using the standard config; then, for fine-tuning, I just need to pass the saved model to the "--load" argument, and this only needs a file like "model_iter2.dict". Do let me know if I missed something.
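To put a number on how short the sample predictions above are, here is a minimal sketch of a subtoken-level precision/recall/F1 over the "|"-separated strings. It assumes a clipped multiset overlap with case-insensitive matching; the helper name `subtoken_prf` is hypothetical, and code2seq's own evaluation may count matches differently.

```python
from collections import Counter

def subtoken_prf(original: str, predicted: str):
    """Precision, recall and F1 over '|'-separated subtokens (clipped overlap)."""
    ref = Counter(t.lower() for t in original.split("|") if t)
    hyp = Counter(t.lower() for t in predicted.split("|") if t)
    # Counter & Counter keeps the minimum count per subtoken, so a repeated
    # prediction such as "of|of|of" is credited at most once per reference use.
    overlap = sum((ref & hyp).values())
    precision = overlap / sum(hyp.values()) if hyp else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# The four samples quoted in the list above.
samples = [
    ("Get|default|session|or|create|one|with|a|given|config", "Get|a"),
    ("Update|boost|factors|when|local|inhibition|is|used", "Remove|the"),
    ("Returns|a|description|of|the|dataset", "Returns|a|of|of|of"),
    ("Returns|the|sdr|for|jth|value|at|column|i", "Returns|the|for|for|for"),
]
for original, predicted in samples:
    p, r, f1 = subtoken_prf(original, predicted)
    print(f"{predicted}: P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

Under this counting, a two-token prediction like "Get|a" can have perfect precision but low recall, which is the pattern to expect when the decoder stops too early.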
Thanks & Regards,
Tamal Mondal