
pycodesuggest's People

Contributors

avishkar58


pycodesuggest's Issues

a little help required

pavement.py ..
Can you tell me where this file is actually located? I was not able to find it while running the code.

Normalisation by "find and replace"

Use the AST to obtain a list of identifier names used in a file - generate a lookup table to replace these names (use a scheme per identifier type). Do a find and replace in the source code.
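A minimal sketch of the lookup-table step using Python's built-in ast module (the walker and the naming scheme here are assumptions, not the repo's actual code):

    import ast
    from collections import defaultdict

    def build_rename_table(source):
        """Map each identifier to an anonymised name, one scheme per type."""
        tree = ast.parse(source)
        table, counters = {}, defaultdict(int)
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                kind, name = "function", node.name
            elif isinstance(node, ast.Name):
                kind, name = "variable", node.id
            else:
                continue
            if name not in table:
                counters[kind] += 1
                table[name] = "%s%d" % (kind, counters[kind])
        return table

One caveat of the plain find-and-replace step: it can also rewrite occurrences inside string literals or attribute names, which the AST-regeneration approach below avoids.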

Normalisation by regenerating code from AST

Replace identifiers in the AST and use it to regenerate the code, in effect also normalising formatting.
Use a different scheme for different types of identifiers (e.g. variable1, variable2 for variables; function1, function2 for functions).
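A hedged sketch of this approach with ast.NodeTransformer; note that ast.unparse requires Python 3.9+ (older interpreters would need a third-party code generator such as astor), and the rename table is the one assumed above:

    import ast

    class Renamer(ast.NodeTransformer):
        def __init__(self, table):
            self.table = table

        def visit_Name(self, node):
            # Rewrite variable identifiers via the lookup table.
            node.id = self.table.get(node.id, node.id)
            return node

        def visit_FunctionDef(self, node):
            node.name = self.table.get(node.name, node.name)
            self.generic_visit(node)  # recurse into the function body
            return node

    def normalise(source, table):
        tree = Renamer(table).visit(ast.parse(source))
        return ast.unparse(tree)  # regenerated code has uniform formatting

Because the text is regenerated from the tree, comments disappear and formatting becomes uniform as a side effect, which is exactly the normalisation wanted here.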

Change model to work on a per-file basis

The model previously concatenated all files and then split them into batches and sequences. Change this to work on a per-file basis (resetting state at the beginning of each file). This requires a mechanism to handle variable-length sequences - see dynamic_rnn.
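A minimal sketch of the per-file setup against the TF 1.x-era API (the batch shapes and sizes below are illustrative assumptions):

    import tensorflow as tf

    batch_size, max_len, embed_dim, hidden_size = 32, 100, 200, 200

    inputs = tf.placeholder(tf.float32, [batch_size, max_len, embed_dim])
    # Actual (unpadded) token count of each file in the batch.
    lengths = tf.placeholder(tf.int32, [batch_size])

    cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_size)
    # zero_state gives every file a fresh state; sequence_length stops
    # the unrolling at each file's true length instead of the padded one.
    outputs, final_state = tf.nn.dynamic_rnn(
        cell, inputs,
        sequence_length=lengths,
        initial_state=cell.zero_state(batch_size, tf.float32))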

Establish Evaluation Metrics

The standard seems to be top-K accuracy: in the test corpus, what percentage of the time does the target token occur in the top K suggestions offered by the model?

The Mean Reciprocal Rank (MRR) score is computed by averaging the reciprocal of the rank of the correct suggestion over all suggestion tasks in the test data. If the correct suggestion is not returned, a score of 0 is assigned. Hence, an MRR score of 0.5 implies that the correct suggestion is, on average, found at rank 2 of the suggestion list.
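Both metrics are straightforward to compute from ranked suggestion lists; a small sketch (the list-of-lists data layout is an assumption):

    def top_k_accuracy(suggestions, targets, k):
        """Fraction of tasks whose target appears in the top k suggestions."""
        hits = sum(t in s[:k] for s, t in zip(suggestions, targets))
        return hits / len(targets)

    def mean_reciprocal_rank(suggestions, targets):
        """Average of 1/rank of the target; 0 when it is absent."""
        total = 0.0
        for s, t in zip(suggestions, targets):
            if t in s:
                total += 1.0 / (s.index(t) + 1)  # ranks are 1-based
        return total / len(targets)

For example, mean_reciprocal_rank([['a', 'b'], ['c', 'd']], ['b', 'c']) gives (1/2 + 1/1) / 2 = 0.75.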

Fix memory issue when running on full dataset

This may be alleviated by reducing the vocabulary size (e.g. replacing numbers, or a more aggressive OOV cut-off).
Can also look into a more efficient softmax, and into pre-processing the data into a binary stream (protobufs).
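A hedged sketch of the vocabulary-reduction side (the <num>/<unk> token names and the threshold are assumptions; the softmax side would be something like TensorFlow's sampled softmax):

    import re
    from collections import Counter

    def reduce_vocab(tokens, oov_threshold=10):
        """Collapse number literals and rare tokens to shared placeholders."""
        counts = Counter(tokens)
        reduced = []
        for tok in tokens:
            if re.fullmatch(r"\d+(\.\d+)?", tok):
                reduced.append("<num>")      # all numbers share one token
            elif counts[tok] < oov_threshold:
                reduced.append("<unk>")      # aggressive OOV cut-off
            else:
                reduced.append(tok)
        return reduced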

Investigate feasibility of running RNNG

Spend 2 days determining whether it will be feasible to run RNNG on this code base.
Need to reverse-engineer their clusters file.
Need to determine whether it is easy/possible to generate, from an AST, the sequence of algorithm operations that forms part of the training data.

Running pythonLanguageModel.py

Can you provide a sample runtime command for train, test, and pre-process? I am having trouble running the program.
This is my progress so far:
data_path=/home/suhag/Desktop/Sick-Beard_normalised
train=True
list_file=train_files.txt
vocab_file=mapping.map
output_file=/home/suhag/Desktop/output_file
seq_length=100
batch_size=100
num_partitions=1
oov_threshold=10
epochs=50
attention=None
attention_variant=None
init_scale=0.1
max_grad_norm=5
num_layers=1
hidden_size=200
keep_prob=0.9
num_samples=100
status_iterations=1000
max_attention=10
learning_rate=1.0
lr_decay=0.9
model_path=None
lambda_type=state
save_path=./out/model
checkpoint_path=./out/checkpoint
events_path=./out/save
data_pattern=all_{-type-}_data.dat
embedding_path=None
embedding_trainable=True

Tensorflow version: 0.12.1
Running train

Vocab size: 34
No partitions found for train data, exiting...
An exception has occurred, use %tb to see the full traceback.

SystemExit

Can't get attribute 'UTC'

When I tried to recreate the corpus, I got the following issue:

Traceback (most recent call last):
File "github-scraper/scraper.py", line 143, in
main(sys.argv[1:])
File "github-scraper/scraper.py", line 130, in main
repos = create_repos(dbFile)
File "github-scraper/scraper.py", line 59, in create_repos
repos = pickle.load(infile)
AttributeError: Can't get attribute 'UTC' on <module 'github3.utils' from '/Users/apple/anaconda3/lib/python3.7/site-packages/github3/utils.py'>

I narrowed the error down to:
pickle.load('/data/cloned_repos.dat')
It seems this error is related to loading the cloned_repos.dat file.
Does anyone have the same issue? How can I solve it?
Many thanks!

Documentation for reconstructing corpus

Hi,

Upon trying to recreate your dataset, I found both the documentation and code for recreating the corpus from scratch to have some issues, particularly related to the create_repos file.

Firstly, passing --dbFile=data/cloned_repos.dat as described throws an error since this causes scraper.py to look for a local directory "data", which does not exist. I suppose this was meant to be --dbFile=../data/cloned_repos.dat

Having fixed that, the code fails on pickle.load(infile) (line 60) with the message "TypeError: a bytes-like object is required, not 'str'", which seems to refer to the file cloned_repos.dat being opened as a plain-text file when it actually is in some other format, presumably binary data. However, changing line 59 to ... = open(dbFile, 'rb') (which opens the file as bytes) results in a UnicodeDecodeError, perhaps due to it having been pickled under Python 2 and this being incompatible with Python 3.
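If the file was indeed pickled under Python 2, one thing that may be worth trying is Python 3's encoding argument to pickle.load, which decodes old str objects byte-for-byte (untested against this particular file):

    import pickle

    with open("../data/cloned_repos.dat", "rb") as infile:
        # latin-1 maps Python 2 str bytes 1:1 onto code points
        repos = pickle.load(infile, encoding="latin-1")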

Please let me know if you can recreate this problem and, if possible, upload a new cloned_repos.dat file which is compatible with Python 3 (provided that is the issue). I would love to work with your dataset :)

Kind regards,

Vincent

Clean up code

Code has become a bit messy (particularly pythonLanguageModel.py). Clean it up!

Packages required for running pythonLanguageModel.py

Can you provide information on where to get astwalker, one of the required packages?
I couldn't find it on Google; it looks like it was included in astroid.utils,
but I cannot find it in the latest astroid version.
It would be great if you included the other requirements for running the code.
Thanks~

Preprocessing

  • Remove comments
  • Anonymise variable names
  • Replace number literals with a placeholder token

How to handle string literals?

Removal of comments can be done with the Python tokenizer, as sketched below.
Variable names and literals need to be identified with Python language services.
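A minimal sketch of the comment-removal step with the standard tokenize module (string literals are left untouched here):

    import io
    import tokenize

    def strip_comments(source):
        """Drop COMMENT tokens and rebuild the source from the rest."""
        tokens = tokenize.generate_tokens(io.StringIO(source).readline)
        kept = [tok for tok in tokens if tok.type != tokenize.COMMENT]
        return tokenize.untokenize(kept)

Anonymising variable names and replacing literals need more than the tokenizer, since the token stream alone cannot tell a local variable from an attribute; that is where the AST-based normalisation described above comes in.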
