vovak / codestyle


TeX 9.50% Python 11.46% Java 4.89% Kotlin 3.45% Jupyter Notebook 70.69%

codestyle's People

Contributors

Watchers

Forkers

jetbrains-research

codestyle's Issues

Compute programmers' coding style representation

Compute the coding style of the top-10 and top-50 contributors to IDEA.

Compute statistics and clean extracted data

Compute stats on the extracted data (https://seafile.ifi.uzh.ch/d/808e216262174188b653/). We are interested in the distribution of changes across programmers, the distribution of paths across changes, and some insights on additions, deletions, and changes of methods.
Then we should clean the data based on the insights we get. This includes removing low-populated classes from the dataset, de-noising, etc.
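A minimal sketch of the first two steps, assuming the extracted data can be loaded as (author, change kind) pairs. The record layout, the function names, and the `min_changes` threshold are hypothetical and only illustrate the idea of counting changes per programmer and dropping low-populated classes.

```python
from collections import Counter


def changes_per_author(records):
    """Count how many method changes each author contributed."""
    return Counter(author for author, _ in records)


def drop_low_populated(records, min_changes):
    """Keep only records whose author has at least `min_changes` changes."""
    counts = changes_per_author(records)
    return [(author, kind) for author, kind in records if counts[author] >= min_changes]


# Hypothetical records: (author, change kind) pairs, not real mined data.
records = [
    ("alice", "add"), ("alice", "modify"), ("alice", "delete"),
    ("bob", "modify"), ("bob", "add"),
    ("carol", "add"),
]

counts = changes_per_author(records)
cleaned = drop_low_populated(records, min_changes=2)
```

The threshold would have to be picked after looking at the actual distribution; the value used here is arbitrary.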

Train a model for classification

Training includes several sub-problems:

Choose split of dataset
Choose a metric for evaluation
Choose hyperparameters
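As one candidate for the first two sub-problems, the sketch below shows a per-author stratified split together with plain accuracy. Both are assumptions made for illustration, not decisions recorded in this issue; the sample layout (features, author) and the function names are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(samples, test_fraction=0.2, seed=0):
    """Split (features, author) samples so that each author's share is preserved."""
    by_author = defaultdict(list)
    for sample in samples:
        by_author[sample[1]].append(sample)

    rng = random.Random(seed)
    train, test = [], []
    for author in sorted(by_author):
        group = by_author[author]
        rng.shuffle(group)
        # At least one test sample per author, but never the whole group;
        # an author with a single sample goes entirely to the training part.
        n_test = min(len(group) - 1, max(1, round(len(group) * test_fraction)))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def accuracy(predicted, actual):
    """Fraction of samples whose predicted author matches the actual one."""
    if not actual:
        raise ValueError("cannot compute accuracy on an empty set")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical samples: integer stand-ins for features, author name as label.
samples = [(i, "alice") for i in range(10)] + [(i, "bob") for i in range(5)]
train, test = stratified_split(samples)
```

With imbalanced authors, plain accuracy can be dominated by the most active contributors, so a per-class metric may turn out to be a better fit.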

Review of code2vec implementations

Since we want to use code2vec, we need an implementation of it. Quite a few implementations have been published by now, so we need a review listing the pros and cons of each.

Mine data from other projects

Currently we're only working with the IDEA Community repo.

We need to mine the data from several other projects to be able to validate the results.

Add method packs to the model

Create a new dataset with packs of several methods instead of single methods.
Change the data loading pipeline.
Update the model to give each method in a pack an attention weight.
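The last step could look roughly like the sketch below: each method vector in a pack gets a score from a dot product with an attention vector, the scores are normalized with softmax, and the pack is represented by the weighted sum. This is an illustrative assumption about how the weights might be computed, not the model's actual code; `attention_pool` and the toy vectors are hypothetical, and in a real model the attention vector would be learned.

```python
import math


def attention_pool(method_vectors, attention_vector):
    """Combine method vectors of one pack into a single vector via softmax attention."""
    # Score each method by its dot product with the attention vector.
    scores = [
        sum(x * a for x, a in zip(vector, attention_vector))
        for vector in method_vectors
    ]
    # Numerically stable softmax over the scores.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the method vectors, one coordinate at a time.
    dim = len(attention_vector)
    pooled = [
        sum(w * vector[i] for w, vector in zip(weights, method_vectors))
        for i in range(dim)
    ]
    return pooled, weights


# Toy pack of three 2-dimensional method vectors.
pack = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(pack, attention_vector=[1.0, 1.0])
```

In this toy pack the third method has the highest score, so it receives the largest weight.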

Research Questions

1. Get a distributed representation of programmers' coding style
   a. Stylometry papers
   b. To analyze team dynamics based on code
2. The representation should be interpretable
3. Analyze characteristics of the representation:
   a. Clustering: which programmers are considered similar
   b. How it changes over time
   c. Whether it is robust
4. Future work:
   a. Code review, other SE practices
   b. Make the representation context-independent

Clean up the mining code

Currently our mining tool is far from nice: the implementation is obscure, undocumented, and hardcoded for the IntelliJ repo in places.

Required to make #10 and #11 possible:

Remove hardcoded fragments

Also (preferably done by the FSE deadline, as we are going to share the code for reproducibility):

Refactor
Add comments
Switch to JGit and get rid of the Python backend

Analyze several large GitHub projects and their commit histories
Collect information and statistics on the commits
Select one project
Formalize the way to represent commit diffs

AST parsing

Try out parsing ASTs with Babelfish for C++ and Java
- Pay attention to parsing of incomplete code snippets
Compare to possible alternatives (JavaParser, IntelliJ PSI trees)
