
pycodesuggest's People

Contributors

avishkar58


pycodesuggest's Issues

a little help required

pavement.py ..
Can you tell me where this file is actually located? I was not able to find it while running the code.

Normalisation by "find and replace"

Use the AST to obtain a list of identifier names used in a file - generate a lookup table to replace these names (use a scheme per identifier type). Do a find and replace in the source code.
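A minimal sketch of the lookup-table step using Python's built-in ast module (the walker and the naming scheme here are assumptions, not the repo's actual code):

    import ast
    from collections import defaultdict

    def build_rename_table(source):
        """Map each identifier to an anonymised name, one scheme per type."""
        tree = ast.parse(source)
        table, counters = {}, defaultdict(int)
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                kind, name = "function", node.name
            elif isinstance(node, ast.Name):
                kind, name = "variable", node.id
            else:
                continue
            if name not in table:
                counters[kind] += 1
                table[name] = "%s%d" % (kind, counters[kind])
        return table

One caveat of the plain find-and-replace step: it can also rewrite occurrences inside string literals or attribute names, which the AST-regeneration approach below avoids.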

Normalisation by regenerating code from AST

Replace identifiers in the AST and use it to regenerate the code, in effect also normalising formatting.
Use a different scheme for different types of identifiers (e.g. variable1, variable2 for variables; function1, function2 for functions).
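A hedged sketch of this approach with ast.NodeTransformer; note that ast.unparse requires Python 3.9+ (older interpreters would need a third-party code generator such as astor), and the rename table is the one assumed above:

    import ast

    class Renamer(ast.NodeTransformer):
        def __init__(self, table):
            self.table = table

        def visit_Name(self, node):
            # Rewrite variable identifiers via the lookup table.
            node.id = self.table.get(node.id, node.id)
            return node

        def visit_FunctionDef(self, node):
            node.name = self.table.get(node.name, node.name)
            self.generic_visit(node)  # recurse into the function body
            return node

    def normalise(source, table):
        tree = Renamer(table).visit(ast.parse(source))
        return ast.unparse(tree)  # regenerated code has uniform formatting

Because the text is regenerated from the tree, comments disappear and formatting becomes uniform as a side effect, which is exactly the normalisation wanted here.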

Change model to work on a per-file basis

The model previously concatenated all files and then split them into batches and sequences. Change this to work on a per-file basis (resetting state at the beginning of each file). This requires a mechanism to handle variable-length sequences - see dynamic_rnn.
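A minimal sketch of the per-file setup against the TF 1.x-era API (the batch shapes and sizes below are illustrative assumptions):

    import tensorflow as tf

    batch_size, max_len, embed_dim, hidden_size = 32, 100, 200, 200

    inputs = tf.placeholder(tf.float32, [batch_size, max_len, embed_dim])
    # Actual (unpadded) token count of each file in the batch.
    lengths = tf.placeholder(tf.int32, [batch_size])

    cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_size)
    # zero_state gives every file a fresh state; sequence_length stops
    # the unrolling at each file's true length instead of the padded one.
    outputs, final_state = tf.nn.dynamic_rnn(
        cell, inputs,
        sequence_length=lengths,
        initial_state=cell.zero_state(batch_size, tf.float32))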

Establish Evaluation Metrics

The standard seems to be top-K accuracy: in the test corpus, what percentage of the time does the target token occur in the top K suggestions offered by the model?

The Mean Reciprocal Rank (MRR) score is computed by averaging the reciprocal of the rank of the correct suggestion over all suggestion tasks in the test data. If the correct suggestion is not returned, a score of 0 is assigned. Hence, an MRR score of 0.5 implies that the correct suggestion is, on average, found at rank 2 of the suggestion list.
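Both metrics are straightforward to compute from ranked suggestion lists; a small sketch (the list-of-lists data layout is an assumption):

    def top_k_accuracy(suggestions, targets, k):
        """Fraction of tasks whose target appears in the top k suggestions."""
        hits = sum(t in s[:k] for s, t in zip(suggestions, targets))
        return hits / len(targets)

    def mean_reciprocal_rank(suggestions, targets):
        """Average of 1/rank of the target; 0 when it is absent."""
        total = 0.0
        for s, t in zip(suggestions, targets):
            if t in s:
                total += 1.0 / (s.index(t) + 1)  # ranks are 1-based
        return total / len(targets)

For example, mean_reciprocal_rank([['a', 'b'], ['c', 'd']], ['b', 'c']) gives (1/2 + 1/1) / 2 = 0.75.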

Fix memory issue when running on full dataset

This may be alleviated by reducing the vocabulary size (e.g. replacing numbers, or a more aggressive OOV cut-off).
Can also look into a more efficient softmax, and into pre-processing the data into a binary stream (protobufs).
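A hedged sketch of the vocabulary-reduction side (the <num>/<unk> token names and the threshold are assumptions; the softmax side would be something like TensorFlow's sampled softmax):

    import re
    from collections import Counter

    def reduce_vocab(tokens, oov_threshold=10):
        """Collapse number literals and rare tokens to shared placeholders."""
        counts = Counter(tokens)
        reduced = []
        for tok in tokens:
            if re.fullmatch(r"\d+(\.\d+)?", tok):
                reduced.append("<num>")      # all numbers share one token
            elif counts[tok] < oov_threshold:
                reduced.append("<unk>")      # aggressive OOV cut-off
            else:
                reduced.append(tok)
        return reduced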

Investigate feasibility of running RNNG

Spend 2 days determining whether it will be feasible to run RNNG on this code base.
Need to reverse-engineer their clusters file.
Need to determine whether it is easy/possible to generate, from an AST, the sequence of algorithm operations that forms part of the training data.

Running pythonLanguageModel.py

Can you provide a sample runtime command for train, test, and pre-process? I am having trouble running the program.
This is my progress so far:
data_path=/home/suhag/Desktop/Sick-Beard_normalised
train=True
list_file=train_files.txt
vocab_file=mapping.map
output_file=/home/suhag/Desktop/output_file
seq_length=100
batch_size=100
num_partitions=1
oov_threshold=10
epochs=50
attention=None
attention_variant=None
init_scale=0.1
max_grad_norm=5
num_layers=1
hidden_size=200
keep_prob=0.9
num_samples=100
status_iterations=1000
max_attention=10
learning_rate=1.0
lr_decay=0.9
model_path=None
lambda_type=state
save_path=./out/model
checkpoint_path=./out/checkpoint
events_path=./out/save
data_pattern=all_{-type-}_data.dat
embedding_path=None
embedding_trainable=True

Tensorflow version: 0.12.1
Running train

Vocab size: 34
No partitions found for train data, exiting...
An exception has occurred, use %tb to see the full traceback.

SystemExit

Can't get attribute 'UTC'

When I tried to recreate the corpus, I got the following issue:

Traceback (most recent call last):
File "github-scraper/scraper.py", line 143, in
main(sys.argv[1:])
File "github-scraper/scraper.py", line 130, in main
repos = create_repos(dbFile)
File "github-scraper/scraper.py", line 59, in create_repos
repos = pickle.load(infile)
AttributeError: Can't get attribute 'UTC' on <module 'github3.utils' from '/Users/apple/anaconda3/lib/python3.7/site-packages/github3/utils.py'>

I narrowed the error down to:
pickle.load('/data/cloned_repos.dat')
It seems this error is related to loading the cloned_repos.dat file.
Does anyone have the same issue? How can I solve it?
Many thanks!

Documentation for reconstructing corpus

Hi,

Upon trying to recreate your dataset, I found both the documentation and code for recreating the corpus from scratch to have some issues, particularly related to the create_repos file.

Firstly, passing --dbFile=data/cloned_repos.dat as described throws an error since this causes scraper.py to look for a local directory "data", which does not exist. I suppose this was meant to be --dbFile=../data/cloned_repos.dat

Having fixed that, the code fails on pickle.load(infile) (line 60) with the message "TypeError: a bytes-like object is required, not 'str'", which seems to refer to the file cloned_repos.dat being opened as a plain-text file when it actually is in some other format, presumably binary data. However, changing line 59 to ... = open(dbFile, 'rb') (which opens the file as bytes) results in a UnicodeDecodeError, perhaps due to it having been pickled under Python 2 and this being incompatible with Python 3.
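If the file was indeed pickled under Python 2, one thing that may be worth trying is Python 3's encoding argument to pickle.load, which decodes old str objects byte-for-byte (untested against this particular file):

    import pickle

    with open("../data/cloned_repos.dat", "rb") as infile:
        # latin-1 maps Python 2 str bytes 1:1 onto code points
        repos = pickle.load(infile, encoding="latin-1")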

Please let me know if you can recreate this problem and, if possible, upload a new cloned_repos.dat file which is compatible with Python 3 (provided that is the issue). I would love to work with your dataset :)

Kind regards,

Vincent

Clean up code

Code has become a bit messy (particularly pythonLanguageModel.py). Clean it up!

Packages required for running pythonLanguageModel.py

Can you provide information on where to get astwalker, one of the required packages?
I couldn't find it on Google; it looks like it was included in astroid.utils,
but I cannot find it in the latest astroid version.
It would be great if you included the other requirements for running the code.
Thanks~

Preprocessing

  • Remove comments
  • Anonymise variable names
  • Replace number literals with a placeholder token

How to handle string literals?

Removal of comments can be done with the Python tokenizer, as sketched below.
Variable names and literals need to be identified with Python language services.
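A minimal sketch of the comment-removal step with the standard tokenize module (string literals are left untouched here):

    import io
    import tokenize

    def strip_comments(source):
        """Drop COMMENT tokens and rebuild the source from the rest."""
        tokens = tokenize.generate_tokens(io.StringIO(source).readline)
        kept = [tok for tok in tokens if tok.type != tokenize.COMMENT]
        return tokenize.untokenize(kept)

Anonymising variable names and replacing literals need more than the tokenizer, since the token stream alone cannot tell a local variable from an attribute; that is where the AST-based normalisation described above comes in.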
