codescholar's People

Contributors

manishshettym

codescholar's Issues

Add an ancestors field to the AST object instead of a separate datastructure

Instead of adding the ancestors to the walk, might be worth considering adding to the AST object in one of two ways:

  1. parent field on every node that points to its parent node (and then can be navigated to root in a linked list fashion)
  2. ancestors field on every node that has an ordered list of ancestors (which can then be navigated by reading through the list)

Whichever one is more efficient is a design decision you're probably better equipped to make. A NodeTransformer from the ast library would be able to achieve this.

Originally posted by @somaniarushi in #5 (comment)
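
A minimal sketch of option 1, assuming the standard-library `ast` module is the AST in question. It uses `ast.walk` and `ast.iter_child_nodes` rather than a NodeTransformer, since no nodes are rewritten. The names `attach_parents` and `ancestors` are hypothetical and not part of codescholar. Option 2 falls out of option 1 by following the parent links.

```python
import ast


def attach_parents(tree: ast.AST) -> ast.AST:
    """Set a `parent` attribute on every node (None for the root)."""
    tree.parent = None
    for node in ast.walk(tree):
        for child in ast.iter_child_nodes(node):
            # Caveat: context nodes such as ast.Load may be shared objects,
            # so their `parent` is simply the last node that referenced them.
            child.parent = node
    return tree


def ancestors(node: ast.AST) -> list:
    """Return the ancestors of `node`, from its parent up to the root."""
    chain = []
    current = getattr(node, "parent", None)
    while current is not None:
        chain.append(current)
        current = current.parent
    return chain


tree = attach_parents(ast.parse("x = foo(1)"))
call = next(n for n in ast.walk(tree) if isinstance(n, ast.Call))
names = [type(a).__name__ for a in ancestors(call)]
print(names)  # ['Assign', 'Module']
```

Storing only `parent` costs one reference per node, while a precomputed `ancestors` list costs memory proportional to depth per node but avoids repeated walks.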

Idiomatic Code + Codex

This experiment will perform the following tasks:

  1. Pick a dataset of idiomatic code examples.
  2. Identify programming exercises that require a set of idiomatic examples S = {P_1, P_2, ..., P_k}, where each P_i is an idiomatic code snippet that performs a specific task.
  3. Next, the exercise will be presented to Codex via the Codex API in 3 forms:
    a. w/o set S in the prompt
    b. w/ set S but w/o P_i descriptions in the prompt
    c. w/ set S and P_i descriptions in the prompt
  4. The precision of the program generated by Codex will be analyzed.
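
The three prompt forms in step 3 could be assembled roughly as follows. This is a sketch only: the `Idiom` class, the `build_prompts` function, the dictionary keys, and the comment-style formatting of descriptions are all assumptions, and no Codex API call is shown.

```python
from dataclasses import dataclass


@dataclass
class Idiom:
    code: str          # the idiomatic snippet P_i
    description: str   # natural-language description of P_i


def build_prompts(exercise: str, idioms: list[Idiom]) -> dict[str, str]:
    """Return the three prompt variants (a, b, c) for one exercise."""
    bare = "\n\n".join(i.code for i in idioms)
    described = "\n\n".join(f"# {i.description}\n{i.code}" for i in idioms)
    return {
        "a_no_examples": exercise,
        "b_examples_only": f"{bare}\n\n{exercise}",
        "c_examples_with_descriptions": f"{described}\n\n{exercise}",
    }


exercise = "# Task: total sales per region"
prompts = build_prompts(
    exercise,
    [Idiom("df.groupby('region')['sales'].sum()", "aggregate a column by key")],
)
```

Keeping the exercise text identical across the three variants means any difference in the generated programs can be attributed to the presence of S and its descriptions.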

CodeScholar Github Miner

Implement a GitHub miner that pulls projects with the following specifications:

  1. Source language: Python 3
  2. > 10 GitHub stars
  3. > 10 GitHub commits
  4. Must have a licence
  5. Version: latest
  6. Exclude forks
  7. Size < 70708 bytes
  8. isClientOf(library L) -- check if the project uses a specific library; e.g., tensorflow, pytorch, numpy, etc.
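
A sketch of how these specifications might be checked once repository metadata has been fetched. The `RepoInfo` fields and the `meets_criteria` and `is_client_of` names are hypothetical, and fetching from GitHub is deliberately left out. The library check assumes that "uses a library" means "imports it at the top level of some module", which is only one possible reading of isClientOf.

```python
import ast
from dataclasses import dataclass


@dataclass
class RepoInfo:
    # Hypothetical metadata record; field names are illustrative.
    language: str
    stars: int
    commits: int
    has_license: bool
    is_fork: bool
    size_bytes: int


def meets_criteria(repo: RepoInfo, max_size: int = 70708) -> bool:
    """Apply the star, commit, licence, fork, and size filters."""
    return (
        repo.language.lower() == "python"
        and repo.stars > 10
        and repo.commits > 10
        and repo.has_license
        and not repo.is_fork
        and repo.size_bytes < max_size
    )


def is_client_of(source: str, library: str) -> bool:
    """True if `source` imports `library` or one of its submodules."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0:
            modules = [node.module or ""]
        else:
            continue  # not an import, or a relative import
        if any(m.split(".")[0] == library for m in modules):
            return True
    return False


repo = RepoInfo("Python", stars=42, commits=120, has_license=True,
                is_fork=False, size_bytes=50_000)
print(meets_criteria(repo), is_client_of("import numpy as np", "numpy"))
```

Note that the import name can differ from the package name: PyTorch, for example, is imported as `torch`.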

CI/CD Pipeline for CodeScholar & Logging

The CI/CD pipeline needs to build CodeScholar for the latest macOS and Ubuntu platforms. The build should install all dependencies and run the test cases.

To start off, we need to create a generic logger for CodeScholar to use. Loguru is the package of choice.

CodeScholar's Subgraph Representation

CodeScholar will represent code snippets in a graph data structure that captures the structure and semantics of the method. Ideally this should be some form of a graph [1] that captures data flow, control flow, and/or other semantic information.

Then, each program p will be passed through a GNN to learn program embeddings that are aware of subgraph semantics. The GNN training approach and loss function will be adapted from NeuroMatch [2].
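
For orientation, NeuroMatch [2] trains order embeddings with a max-margin objective: a query graph is treated as a subgraph of a target when its embedding is elementwise no larger than the target's. The sketch below shows that penalty on plain Python lists. The function names and the margin value are assumptions, and how CodeScholar adapts the loss is not specified here.

```python
def order_violation(z_query, z_target):
    """Sum of squared amounts by which the query exceeds the target."""
    return sum(max(0.0, q - t) ** 2 for q, t in zip(z_query, z_target))


def pair_loss(z_query, z_target, is_subgraph, margin=1.0):
    """Max-margin loss for one (query, target) pair.

    Positive pairs are pushed toward zero violation; negative pairs are
    pushed to violate the order constraint by at least `margin`.
    """
    violation = order_violation(z_query, z_target)
    return violation if is_subgraph else max(0.0, margin - violation)


ok = order_violation([1.0, 2.0], [2.0, 3.0])    # 0.0: constraint satisfied
bad = order_violation([3.0, 1.0], [1.0, 1.0])   # 4.0: first dimension violates
```

At query time the same violation score can act as the subgraph test: a score below some threshold predicts that the query is contained in the target.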

Milestones:

  1. [Week 1] Finalize program representation -- AST vs CST vs CFG vs DFG
  2. [Week 1] Translate program representation to NetworkX format
  3. [Week 2] Implement NeuroMatch modules
  4. [Week 3] Training and evaluation on the pandas-idiom dataset
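
To make the first two milestones concrete, here is a sketch that takes the AST option and flattens it into node and edge lists. The `ast_to_graph` name and the `label` attribute are assumptions. The output shapes, `(id, attrs)` pairs and `(parent, child)` pairs, are chosen because they are the forms that NetworkX's `add_nodes_from` and `add_edges_from` accept, but NetworkX itself is not imported here.

```python
import ast


def ast_to_graph(source: str):
    """Return (nodes, edges) for the AST of `source`.

    nodes: list of (node_id, {"label": <AST class name>})
    edges: list of (parent_id, child_id)
    """
    nodes, edges = [], []

    def visit(node, parent_id):
        node_id = len(nodes)
        nodes.append((node_id, {"label": type(node).__name__}))
        if parent_id is not None:
            edges.append((parent_id, node_id))
        for child in ast.iter_child_nodes(node):
            # Skip Load/Store/Del markers; they carry no structure.
            if not isinstance(child, ast.expr_context):
                visit(child, node_id)

    visit(ast.parse(source), None)
    return nodes, edges


nodes, edges = ast_to_graph("x = 1")
labels = [attrs["label"] for _, attrs in nodes]
```

A CFG or DFG would reuse the same node list and add further edge types on top of these parent-child edges.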

Experiments:

  1. <GNN Type> + NeuroMatch
  2. <node features> + GNN + NeuroMatch

References:
[1] https://arxiv.org/abs/2208.07461
[2] https://arxiv.org/abs/2007.03092

CodeScholar Concept Farming: Mining Algorithm

Given a dataset of programs P, using standard program analysis techniques and graph-based induction, CodeScholar will reduce P to a set of idiomatic and reusable code snippets S called CodeScholar APIs. Each snippet in S is representative of a code concept such as “search”, “sort”, “join”, etc. Lastly, CodeScholar will refactor these snippets into a python function that can take an arbitrary number of parameters, and return a python object.

Instead of breaking programs down into smaller, frequent fragments, CodeScholar will tackle the problem by "growing" idiomatic code fragments. It should start at single-node programs (1 stmt) and perform greedy graph composition and pruning to farm idiomatic code patterns.

Here is a brief pseudocode for the algorithm:
[Image: pseudocode for the mining algorithm]
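
Since the pseudocode image is not reproduced here, the following is a deliberately simplified stand-in, not the author's algorithm. It swaps graphs for flat statement sequences: each program is a list of statement labels, patterns grow by appending one statement at a time, and pruning keeps only the `beam` most frequent patterns that meet `min_support`. All names and parameter values are hypothetical.

```python
def grow_idioms(programs, max_size=3, beam=2, min_support=2):
    """Grow frequent contiguous statement sequences greedily.

    Returns a dict mapping each kept pattern (a tuple of statement
    labels) to the number of programs that contain it.
    """
    def support(pattern):
        n = len(pattern)
        return sum(
            any(tuple(p[i:i + n]) == pattern for i in range(len(p) - n + 1))
            for p in programs
        )

    vocab = sorted({stmt for p in programs for stmt in p})
    # Seed with single-statement patterns.
    frontier = [(s,) for s in vocab if support((s,)) >= min_support]
    results = {}
    while frontier:
        # Prune: keep only the `beam` most frequent patterns of this size.
        frontier.sort(key=lambda pat: (-support(pat), pat))
        frontier = frontier[:beam]
        for pat in frontier:
            results[pat] = support(pat)
        if len(frontier[0]) >= max_size:
            break
        # Compose: extend every surviving pattern by one statement.
        candidates = {pat + (s,) for pat in frontier for s in vocab}
        frontier = [c for c in candidates if support(c) >= min_support]
    return results


programs = [
    ["read", "filter", "sort", "write"],
    ["read", "filter", "sort"],
    ["open", "read", "filter"],
]
idioms = grow_idioms(programs)
```

A graph-based version would replace the contiguous-sequence check in `support` with a subgraph match and the append step with adding a neighboring node.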

CodeScholar's Search

CodeScholar will represent the search space of programs in a latent space. Each program is transformed into a set of neighborhoods (radial) anchored at every node in the program graph.

Each neighborhood is then passed through the trained model to extract a node embedding. As a result, each program is transformed into a set of node embeddings S, where |S| = #nodes in the program graph.

Finally, in this latent search space, explore (1) clustering, (2) walks, and (3) program farming + search methods to identify potential idioms.
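
The neighborhood step described above can be sketched as a breadth-first search that collects every node within a fixed number of hops of the anchor. The adjacency-dict input, the function names, and the `radius` parameter are assumptions. The embedding model is not shown.

```python
from collections import deque


def radial_neighborhood(adjacency, anchor, radius):
    """Return the set of nodes within `radius` hops of `anchor`."""
    seen = {anchor}
    queue = deque([(anchor, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == radius:
            continue  # do not expand beyond the radius
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return seen


def all_neighborhoods(adjacency, radius):
    """One neighborhood per node, so the result has #nodes entries."""
    return {n: radial_neighborhood(adjacency, n, radius) for n in adjacency}


# Path graph 0 - 1 - 2 - 3, stored with edges in both directions.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
hoods = all_neighborhoods(graph, radius=2)
```

Each entry of `hoods` would then be embedded once, giving the set S with one embedding per node.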

Milestones:

  • [Week 1] Embed the search space of programs into a set of node embeddings
  • [Week 1] Cluster embeddings in the search space. Does this work?
  • [Week 2] Write a greedy/MCTS based frequent subgraph explorer in the search space. Does this work?
  • [Week 2+] Use custom program growth algorithm + Neural Subgraph Searcher to extract idioms. Does this work?

Improve the presentation of an idiom

https://github.com/tart-proj/codescholar/blob/9914b8f1c2079ef7a709d6e6d0f6b06132006e8e/codescholar/search/search.py#L42

This function currently takes the generated idioms and saves them as partial programs (transformed partial SASTs) in Python files. This is not ideal:

  1. the presentation lacks clarity -- a partial program, especially for a small idiom, carries very little information.
  2. the presentation lacks context -- a partial program is concise but loses much of the context of the problem itself (the surrounding code).
  3. the presentation lacks provenance -- a partial program (presented as an idiom) is separated from the original function; hence the promise of provenance (to a GitHub program) is lost.
