codestyle's Issues
Compute programmers' coding style representation
Compute the coding style representation for the top-10 and top-50 contributors to IDEA.
Compute statistics and clean extracted data
Compute stats on the [extracted data](https://seafile.ifi.uzh.ch/d/808e216262174188b653/). We are interested in the distribution of changes across programmers, the distribution of paths across changes, and some insights on additions/deletions/changes of methods.
Then we should clean the data based on the insights we get. This includes removing low-populated classes from the dataset, de-noising, etc.
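A minimal sketch of the first two steps, assuming each extracted change can be reduced to an `(author, change_id)` pair. The field layout, the toy data, and the `min_changes` threshold are hypothetical and not taken from the actual dataset:

```python
from collections import Counter


def changes_per_author(changes):
    """Count how many changes each author contributed."""
    return Counter(author for author, _ in changes)


def drop_low_populated(changes, min_changes):
    """Keep only changes whose author has at least `min_changes` samples."""
    counts = changes_per_author(changes)
    return [(author, cid) for author, cid in changes if counts[author] >= min_changes]


# Hypothetical toy data: (author, change_id) pairs.
changes = [
    ("alice", 1), ("alice", 2), ("alice", 3),
    ("bob", 4),
    ("carol", 5), ("carol", 6),
]
counts = changes_per_author(changes)
filtered = drop_low_populated(changes, min_changes=2)
```

The same counting pattern would apply to paths per change; the right threshold depends on the distribution the stats reveal.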
Train a model for classification
Training includes several sub-problems:
- Choose split of dataset
- Choose a metric for evaluation
- Choose hyperparameters
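One possible starting point for the first two sub-problems, sketched with the standard library only. A stratified split and macro-averaged F1 are assumptions here (they are common choices when classes are imbalanced), not decisions recorded in this issue:

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), splitting every class in the same proportion.

    Assumes each class has at least two samples; a single-sample class
    would end up in the test set only.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_test = max(1, round(len(idxs) * test_fraction))
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels: one entry per sample, the label is the author.
labels = ["alice"] * 10 + ["bob"] * 5
train_idx, test_idx = stratified_split(labels)
```

Macro averaging gives every programmer the same weight regardless of how many samples they have, which may or may not be what we want for the top-50 setting.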
Review of code2vec implementations
Since we want to use code2vec, we need an implementation of it. There are quite a few published implementations at the moment, and we need a review with pros and cons for each one.
Mine data from other projects
Currently we're only working with the IDEA Community repo.
We need to mine the data from several other projects to be able to validate the results.
Add method packs to the model
- Create new dataset with packs of several methods instead of single methods.
- Change data loading pipeline.
- Update model, give each method in a pack attention weight.
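The last item could look roughly like the sketch below: each method vector in a pack is scored against an attention vector, the scores are normalized with a softmax, and the pack vector is the weighted sum. Dot-product scoring with a single attention vector is an assumption (it mirrors how code2vec weights path contexts); `pack_vector` and the toy numbers are hypothetical, and in a real model the attention vector would be learned:

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pack_vector(method_vectors, attention_vector):
    """Aggregate a pack of method vectors into one vector via attention."""
    # Score each method by its dot product with the attention vector.
    scores = [
        sum(x * a for x, a in zip(vec, attention_vector))
        for vec in method_vectors
    ]
    weights = softmax(scores)
    dim = len(method_vectors[0])
    # Weighted sum of the method vectors, one coordinate at a time.
    pooled = [
        sum(w * vec[i] for w, vec in zip(weights, method_vectors))
        for i in range(dim)
    ]
    return pooled, weights


# Hypothetical pack of three 2-dimensional method vectors.
pack = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attention = [0.5, -0.5]
pooled, weights = pack_vector(pack, attention)
```

Returning the weights alongside the pooled vector makes it possible to inspect which methods in a pack the model considers most characteristic.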
Research Questions
- Get a distributed representation for programmers' coding style
  a. Stylometry papers
  b. To analyze team dynamics based on code
- Representation should be interpretable
- Analyze characteristics of the representation:
  a. Clusterization, which programmers are considered similar
  b. How it changes through time
  c. Is it robust
- Future work:
  a. Code review, other SE practices
  b. Make representation context-independent
Clean up the mining code
Currently our mining tool is far from nice: the implementation is obscure, undocumented, and hardcoded for the IntelliJ repo in places.
Required to make #10 and #11 possible:
- Remove hardcoded fragments
Also (preferably before the FSE deadline, since we are going to share the code for reproducibility):
- Refactor
- Add comments
- Switch to jgit and get rid of the Python backend
Write the introduction and research questions
Begin writing introduction and formulate research questions.
Implement a standalone AST paths mining tool
With the recent success of the code2vec paper, we can expect a certain demand for a standalone tool capable of mining the AST-path-based representations of code.
It looks feasible to convert a part of our mining pipeline into such a tool and submit a short tool paper to MSR 2019.
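To illustrate what such a tool would output, here is a small sketch that extracts code2vec-style path contexts (start token, path through the AST, end token) from a snippet. It uses Python's built-in `ast` module purely for illustration; our pipeline targets Java, and the path string format (`^` for going up, `_` for going down) and the function names are assumptions, not the format used by our mining code:

```python
import ast
from itertools import combinations


def leaves_with_ancestry(tree):
    """Return [(token, [nodes from root to leaf])] for Name and Constant leaves."""
    result = []

    def visit(node, ancestry):
        ancestry = ancestry + [node]
        if isinstance(node, ast.Name):
            result.append((node.id, ancestry))
        elif isinstance(node, ast.Constant):
            result.append((repr(node.value), ancestry))
        else:
            for child in ast.iter_child_nodes(node):
                visit(child, ancestry)

    visit(tree, [])
    return result


def path_contexts(source):
    """Return (start_token, path, end_token) triples for every pair of leaves."""
    leaves = leaves_with_ancestry(ast.parse(source))
    contexts = []
    for (tok_a, anc_a), (tok_b, anc_b) in combinations(leaves, 2):
        # Count shared ancestors by node identity; the last one is the
        # lowest common ancestor of the two leaves.
        shared = 0
        while (shared < min(len(anc_a), len(anc_b))
               and anc_a[shared] is anc_b[shared]):
            shared += 1
        up = [type(n).__name__ for n in reversed(anc_a[shared:])]
        top = type(anc_a[shared - 1]).__name__
        down = [type(n).__name__ for n in anc_b[shared:]]
        path = "^".join(up + [top]) + "".join("_" + d for d in down)
        contexts.append((tok_a, path, tok_b))
    return contexts


contexts = path_contexts("x + 1")
```

A standalone tool would additionally need limits on path length and width, plus a language-agnostic parser front end, both of which are left out here.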
Evaluation of the research
A possible approach to evaluation is predicting/projecting information about programmers from the representations, but we need to know what to predict. To decide, we can look at the PAN competitions and the literature in this domain.
Explore code2seq
Commit analysis
- Analyze several large GitHub projects and their commit histories
- Collect information and statistics on the commits
- Select one project
- Formalize the way to represent commit diffs
AST parsing
- Try out parsing ASTs with Babelfish for C++ and Java
- Pay attention to the parsing of incomplete code snippets
- Compare to possible alternatives (JavaParser, IntelliJ PSI trees)