codestyle's People

Contributors

egor-bogomolov, vovak

codestyle's Issues

Review of code2vec implementations

Since we want to use code2vec, we need an implementation of it. Quite a few implementations have been published so far, and we need a review listing the pros and cons of each one.

Mine data from other projects

Currently we're only working with the IDEA Community repo.

We need to mine the data from several other projects to be able to validate the results.

Add method packs to the model

  1. Create a new dataset with packs of several methods instead of single methods.
  2. Change the data loading pipeline accordingly.
  3. Update the model so that each method in a pack gets an attention weight.
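
As a rough illustration of step 3, the sketch below pools a pack of method vectors into a single vector using softmax attention. It is a minimal pure-Python sketch, not the project's model: the function name `attention_pool`, the dot-product scoring, and the single `attention_vector` parameter are all assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple


def attention_pool(
    method_vectors: Sequence[Sequence[float]],
    attention_vector: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Aggregate a pack of method vectors into one vector.

    Hypothetical helper: scores each method by a dot product with
    `attention_vector`, turns the scores into softmax weights, and
    returns (pack_vector, weights).
    """
    if not method_vectors:
        raise ValueError("a pack must contain at least one method")

    # One scalar score per method in the pack.
    scores = [
        sum(x * a for x, a in zip(vec, attention_vector))
        for vec in method_vectors
    ]

    # Numerically stable softmax over the scores.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Attention-weighted sum of the method vectors.
    dim = len(method_vectors[0])
    pack_vector = [
        sum(w * vec[d] for w, vec in zip(weights, method_vectors))
        for d in range(dim)
    ]
    return pack_vector, weights
```

The returned weights sum to one, so they can be inspected directly to see which methods in a pack dominate the aggregated vector.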

Research Questions

  1. Get a distributed representation of programmers' coding style
    a. Stylometry papers
    b. Analysis of team dynamics based on code
  2. The representation should be interpretable
  3. Analyze characteristics of the representation:
    a. Clustering: which programmers are considered similar
    b. How it changes over time
    c. Whether it is robust
  4. Future work:
    a. Code review, other SE practices
    b. Make representation context-independent

Clean up the mining code

Currently our mining tool is far from nice: the implementation is obscure, undocumented, and hardcoded for the IntelliJ repo in places.

Required to make #10 and #11 possible:

  • Remove hardcoded fragments

Also (preferably done before the FSE deadline, since we are going to share the code for reproducibility):

  • Refactor
  • Add comments
  • Switch to JGit and get rid of the Python backend

Implement a standalone AST paths mining tool

With the recent success of the code2vec paper, we can expect a certain demand for a standalone tool capable of mining the AST-path-based representations of code.

It looks feasible to convert a part of our mining pipeline into such a tool and submit a short tool paper to MSR 2019.
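
To make the idea of an AST-path-based representation concrete, here is a minimal sketch of extracting path contexts in the spirit of code2vec: for every pair of leaves, it records the start token, the chain of node labels through their lowest common ancestor, and the end token. The `Node` class, the `^` (up) and `_` (down) separators, and the `max_length` cut-off are assumptions for illustration, and the tree is built by hand rather than produced by a real parser.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical, parser-independent AST node."""
    label: str                    # node type, e.g. "BinaryExpr"
    token: Optional[str] = None   # set only for leaves (terminals)
    children: List["Node"] = field(default_factory=list)


def leaf_chains(root: Node) -> List[List[Node]]:
    """Return the root-to-leaf node chain for every leaf, left to right."""
    chains: List[List[Node]] = []

    def walk(node: Node, chain: List[Node]) -> None:
        chain = chain + [node]
        if not node.children:
            chains.append(chain)
        for child in node.children:
            walk(child, chain)

    walk(root, [])
    return chains


def path_contexts(root: Node, max_length: int = 8) -> List[Tuple[str, str, str]]:
    """Return (start_token, path, end_token) for every pair of leaves."""
    contexts = []
    for a, b in combinations(leaf_chains(root), 2):
        # Length of the shared prefix; a[common - 1] is the lowest common ancestor.
        common = 0
        while common < min(len(a), len(b)) and a[common] is b[common]:
            common += 1
        up = [n.label for n in reversed(a[common:])]   # leaf -> below the ancestor
        top = a[common - 1].label                      # lowest common ancestor
        down = [n.label for n in b[common:]]           # below the ancestor -> leaf
        if len(up) + 1 + len(down) > max_length:
            continue  # skip overly long paths
        path = "^".join(up + [top]) + "".join("_" + label for label in down)
        contexts.append((a[-1].token, path, b[-1].token))
    return contexts


# Hand-built tree for the statement `x = y + 1`.
tree = Node("Assign", children=[
    Node("Name", token="x"),
    Node("BinaryExpr", children=[
        Node("Name", token="y"),
        Node("IntLit", token="1"),
    ]),
])
contexts = path_contexts(tree)
```

A standalone tool would replace the hand-built tree with the output of an actual parser and would typically also limit path width and hash or index the paths before writing them out.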

Evaluation of the research

A possible approach to evaluation is to predict information about programmers from their representations, but we first need to decide what to predict. To do so, we can look at the PAN competitions and the literature domain.

Explore code2seq

There is another paper by the authors of code2vec, called code2seq. It is based on the same idea as code2vec but differs in network architecture and problem statement.
We need to explore those differences and discuss them at the next talk.

Commit analysis

  • Analyze several large GitHub projects and their commit histories
  • Collect information and statistics on the commits
  • Select one project
  • Formalize the way to represent commit diffs
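
One possible starting point for collecting commit statistics is sketched below. It assumes the history has been dumped with `git log --numstat --pretty=format:"@%H|%an"`, so that each commit starts with a header line of the form `@<hash>|<author>` followed by tab-separated `added`, `deleted`, `path` lines. The `@` and `|` markers, the class names, and the helper functions are choices made for this example, not part of the project.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FileChange:
    path: str
    added: int    # lines added (0 for binary files, which git reports as "-")
    deleted: int  # lines deleted (0 for binary files)


@dataclass
class Commit:
    sha: str
    author: str
    changes: List[FileChange] = field(default_factory=list)


def parse_numstat_log(text: str) -> List[Commit]:
    """Parse the assumed `git log --numstat` dump into Commit records."""
    commits: List[Commit] = []
    for line in text.splitlines():
        if not line.strip():
            continue  # blank separator between commits
        if line.startswith("@"):
            # Header line: "@<hash>|<author>".
            sha, _, author = line[1:].partition("|")
            commits.append(Commit(sha, author))
        elif commits:
            # Stat line: "<added>\t<deleted>\t<path>".
            added, deleted, path = line.split("\t", 2)
            commits[-1].changes.append(FileChange(
                path,
                0 if added == "-" else int(added),
                0 if deleted == "-" else int(deleted),
            ))
    return commits


def commits_per_author(commits: List[Commit]) -> Dict[str, int]:
    """Count how many commits each author made."""
    counts: Dict[str, int] = {}
    for commit in commits:
        counts[commit.author] = counts.get(commit.author, 0) + 1
    return counts
```

This only captures per-file line counts; a formal representation of commit diffs would still need to describe what changed inside each file.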

AST parsing

  • Try out parsing ASTs with Babelfish for C++, Java
    • Pay attention to parsing of incomplete code snippets
  • Compare to possible alternatives (JavaParser, IntelliJ PSI trees)
