manishshettym / codescholar

codescholar: growing programs graphs idiomatically for API usage examples

Python 97.54% Shell 2.46%

graphs code-search idioms neural-guided-search

codescholar's Issues

Add an ancestors field to the AST object instead of a separate datastructure

Instead of adding the ancestors to the walk, might be worth considering adding to the AST object in one of two ways:

parent field on every node that points to its parent node (and then can be navigated to root in a linked list fashion)
ancestors field on every node that has an ordered list of ancestors (which can then be navigated by reading through the list)

Whichever one is more efficient is a design decision you're probably better equipped to make. A NodeTransformer from the ast library would be able to achieve this.
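A minimal sketch of the first option, assuming a plain walk over the tree is acceptable in place of a NodeTransformer. The helper names `attach_parents` and `ancestors` are hypothetical and not part of CodeScholar; the ordered ancestors list of the second option is derived on demand by following `parent` links.

```python
import ast


def attach_parents(tree: ast.AST) -> ast.AST:
    """Set a `parent` attribute on every node reachable from `tree`."""
    tree.parent = None
    for node in ast.walk(tree):
        for child in ast.iter_child_nodes(node):
            # Note: context nodes such as ast.Load may be shared between
            # expressions, so their `parent` is not meaningful.
            child.parent = node
    return tree


def ancestors(node: ast.AST) -> list:
    """Return the ancestors of `node`, nearest first, ending at the root."""
    chain = []
    current = getattr(node, "parent", None)
    while current is not None:
        chain.append(current)
        current = current.parent
    return chain


tree = attach_parents(ast.parse("x = f(1)"))
call = next(n for n in ast.walk(tree) if isinstance(n, ast.Call))
print([type(a).__name__ for a in ancestors(call)])  # ['Assign', 'Module']
```

Storing only `parent` costs one reference per node, while a precomputed `ancestors` list trades memory for faster repeated lookups.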

Originally posted by @somaniarushi in #5 (comment)

Idiomatic Code + Codex

This experiment will perform the following tasks:

Pick a dataset of idiomatic code examples.
Identify programming exercises that require a set of idiomatic examples S = {P_1, P_2, ..., P_k}, where each P_i is an idiomatic code snippet that performs a specific task.
Present each exercise to Codex via the Codex API in 3 forms:
a. w/o set S in the prompt
b. w/ set S but w/o P_i descriptions in the prompt
c. w/ set S and P_i descriptions in the prompt
Analyze the precision of the programs generated by Codex.
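The three prompt forms could be assembled roughly as follows. This is only a sketch: `build_prompts`, the comment-style layout of the prompt, and the dictionary keys are assumptions, and no Codex API call is shown.

```python
def build_prompts(exercise, snippets, descriptions):
    """Return the three prompt variants (a), (b), (c) for one exercise."""
    # (b): idiomatic snippets only, without their descriptions.
    bare = "\n\n".join(snippets)
    # (c): each snippet preceded by its description as a comment.
    described = "\n\n".join(
        f"# {desc}\n{code}" for desc, code in zip(descriptions, snippets)
    )
    task = f"# Task: {exercise}\n"
    return {
        "a_no_idioms": task,
        "b_idioms_only": f"{bare}\n\n{task}",
        "c_idioms_with_descriptions": f"{described}\n\n{task}",
    }


prompts = build_prompts(
    "drop rows with missing values and reset the index",
    ["df = df.dropna()", "df = df.reset_index(drop=True)"],
    ["remove rows containing NaN", "renumber rows from zero"],
)
print(prompts["c_idioms_with_descriptions"])
```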

CodeScholar Github Miner

Implement a GitHub miner that pulls projects with the following specifications:

Source language: python 3
> 10 GitHub stars
> 10 GitHub commits
Must have a license
version: latest
Exclude forks
Size < 70708 bytes
isClientOf(library L) -- check if the project uses a specific library; e.g., tensorflow, pytorch, numpy, etc.
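One way the isClientOf check could work is to parse each source file and look at its import statements. The sketch below assumes that approach; the function name `is_client_of` is hypothetical, and the library is identified by its import name.

```python
import ast


def is_client_of(source: str, library: str) -> bool:
    """True if `source` imports `library` or one of its submodules."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False  # not valid Python 3, so it cannot be analysed
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0:
            # level == 0 means an absolute import, not `from . import x`.
            names = [node.module or ""]
        else:
            continue
        if any(n == library or n.startswith(library + ".") for n in names):
            return True
    return False


print(is_client_of("import numpy as np", "numpy"))          # True
print(is_client_of("from torch.nn import Linear", "torch"))  # True
print(is_client_of("import numpyro", "numpy"))               # False
```

Note that the import name can differ from the project name: code that uses pytorch imports `torch`. A project would count as a client if any of its files passes the check.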

Pass max value for count (gamma) to stop subgraph matching beyond threshold

It might be worth passing in the max value for count so that the counter can stop as soon as it reaches that value. For very large datasets, iterating through the whole lookup every time might not be worth it.
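A sketch of the early exit, with hypothetical names: `count_matches` stands in for the counting routine, `is_match` for the subgraph-matching test, and `max_count` for the threshold (gamma).

```python
def count_matches(candidates, is_match, max_count=None):
    """Count matching candidates, stopping once `max_count` is reached."""
    count = 0
    for candidate in candidates:
        if is_match(candidate):
            count += 1
            if max_count is not None and count >= max_count:
                break  # threshold reached; skip the rest of the lookup
    return count


checked = []


def is_even(n):
    checked.append(n)  # record how many candidates were actually tested
    return n % 2 == 0


print(count_matches(range(1_000_000), is_even, max_count=3))  # 3
print(len(checked))  # 5
```

With the cap, only the candidates 0 through 4 are tested instead of the full million.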

Originally posted by @somaniarushi in #5 (comment)

change return value of node summary to an object/dictionary

The return value of node summary could probably be an object/dictionary for increased readability (node_summary[2] is very unobvious).
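For instance, a NamedTuple would give the fields names while keeping existing index-based callers working. The field names below are placeholders, since the actual contents of the summary are not listed here.

```python
from typing import NamedTuple


class NodeSummary(NamedTuple):
    # Hypothetical fields; the real summary may hold different values.
    node_type: str
    label: str
    depth: int


summary = NodeSummary(node_type="Call", label="f", depth=2)
print(summary.depth)       # 2, readable access by name
print(summary[2])          # 2, old positional access still works
print(summary._asdict())   # plain dictionary view
```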

Originally posted by @somaniarushi in #5 (comment)

CI/CD Pipeline for CodeScholar & Logging

The CI/CD pipeline needs to build CodeScholar for the latest macos and ubuntu platforms. The build should install all dependencies and run test cases.

To start off, we need to create a generic logger for codescholar to use. Loguru is the package of choice.

Bug: osp.dirname(path) should not be used when checking if a directory exists.

codescholar/codescholar/search/embed_space.py

Line 169 in b3293f4

if not osp.exists(osp.dirname(args.graphs_dir)):

and the lines that follow use osp.dirname to check and create missing directories.

However, osp.dirname returns the parent directory of the given path, not the name of the directory the path is referring to.

Replace with:

  if not osp.exists(args.graphs_dir):
      os.makedirs(args.graphs_dir)
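The difference is easy to reproduce in a temporary directory. This sketch assumes the argument is written without a trailing slash; the `out/graphs` layout is made up for the demonstration.

```python
import os
import os.path as osp
import tempfile

with tempfile.TemporaryDirectory() as root:
    graphs_dir = osp.join(root, "out", "graphs")

    # dirname gives the parent ("<root>/out"), not the directory itself.
    parent = osp.dirname(graphs_dir)

    # Creating only the parent leaves graphs_dir missing.
    os.makedirs(parent, exist_ok=True)
    parent_only = osp.exists(graphs_dir)

    # Creating the path itself also creates any missing parents.
    os.makedirs(graphs_dir, exist_ok=True)
    created = osp.exists(graphs_dir)

print(parent_only, created)  # False True
```

Passing `exist_ok=True` to `os.makedirs` also makes the separate `osp.exists` check unnecessary.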

CodeScholar's Subgraph Representation

CodeScholar will represent code snippets in a graph data-structure that explains the structure and semantics of the method. Ideally this should be some form of a graph [1] that captures data flow, control flow, and/or other semantic information.

Then, each program p will be passed through a GNN to learn program embeddings that are aware of subgraph semantics. The GNN training approach and loss function will be adapted from NeuroMatch [2].
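As a rough illustration of the kind of objective NeuroMatch [2] uses, the sketch below computes an order-embedding penalty on plain Python lists. It assumes the convention that a subgraph's embedding should be no larger than its supergraph's embedding in every dimension; the function names and the margin value are placeholders, and the actual CodeScholar loss may differ.

```python
def order_violation(z_query, z_target):
    """Squared amount by which z_query exceeds z_target, per dimension."""
    return sum(max(0.0, q - t) ** 2 for q, t in zip(z_query, z_target))


def pair_loss(z_query, z_target, is_subgraph, margin=1.0):
    """Max-margin loss for one (query, target) pair."""
    violation = order_violation(z_query, z_target)
    if is_subgraph:
        return violation  # positives: push the violation to zero
    return max(0.0, margin - violation)  # negatives: violation >= margin


# Query lies below the target in every dimension: no violation.
print(order_violation([1.0, 2.0], [2.0, 3.0]))  # 0.0
# First dimension exceeds the target by 2: violation is 2 ** 2.
print(order_violation([3.0, 1.0], [1.0, 1.0]))  # 4.0
```

In training, the embeddings would come from the GNN and the loss would be summed over sampled positive and negative pairs.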

Milestones:

[Week 1] Finalize program representation -- AST vs CST vs CFG vs DFG
[Week 1] Translate program repr to networkX format
[Week 2] Implement NeuroMatch modules
[Week 3] Training and Evaluation on pandas-idiom dataset

Experiments:

<GNN Type> + NeuroMatch
<node features> + GNN + NeuroMatch

References:
[1] https://arxiv.org/abs/2208.07461
[2] https://arxiv.org/abs/2007.03092

CodeScholar Concept Farming: Mining Algorithm

Given a dataset of programs P, using standard program analysis techniques and graph-based induction, CodeScholar will reduce P to a set of idiomatic and reusable code snippets S called CodeScholar APIs. Each snippet in S is representative of a code concept such as “search”, “sort”, “join”, etc. Lastly, CodeScholar will refactor these snippets into a python function that can take an arbitrary number of parameters, and return a python object.

Instead of breaking down programs into smaller and frequent fragments, CodeScholar will tackle the problem by "growing" idiomatic code fragments. It should start at single-node programs (1 stmt) and perform a greedy graph composition and pruning to farm idiomatic code patterns.

Here is a brief pseudocode for the algorithm:
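The pseudocode itself is not reproduced in this text. As a stand-in, here is a rough sketch of the grow-and-prune idea described above, under strong simplifying assumptions: programs are flat sequences of statement labels rather than graphs, a pattern grows only by appending the statement that follows it, and frequency is the number of programs containing the pattern. All names and parameters are hypothetical; this is not the CodeScholar algorithm.

```python
from collections import Counter


def support(pattern, programs):
    """Number of programs that contain `pattern` as a contiguous run."""
    n = len(pattern)
    return sum(
        any(tuple(p[i:i + n]) == pattern for i in range(len(p) - n + 1))
        for p in programs
    )


def grow_idioms(programs, min_support=2, beam_width=3, max_size=4):
    # Seeds: every single statement that is frequent enough.
    counts = Counter(stmt for p in programs for stmt in set(p))
    beam = [(stmt,) for stmt, c in counts.items() if c >= min_support]
    idioms = list(beam)
    for _ in range(max_size - 1):
        # Grow: extend each pattern by a statement that follows it somewhere.
        candidates = set()
        for pattern in beam:
            n = len(pattern)
            for p in programs:
                for i in range(len(p) - n):
                    if tuple(p[i:i + n]) == pattern:
                        candidates.add(pattern + (p[i + n],))
        # Prune: drop infrequent candidates, keep the best `beam_width`.
        scored = [(support(c, programs), c) for c in candidates]
        scored = [sc for sc in scored if sc[0] >= min_support]
        scored.sort(key=lambda sc: (-sc[0], sc[1]))
        beam = [c for _, c in scored[:beam_width]]
        if not beam:
            break
        idioms.extend(beam)
    return idioms


programs = [
    ["read", "clean", "plot"],
    ["read", "clean", "save"],
    ["load", "plot"],
]
print(max(grow_idioms(programs), key=len))  # ('read', 'clean')
```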

Init search does not care about the quality of the initial seeds

https://github.com/tart-proj/codescholar/blob/4df46919b13be4ca36f8f09b3f0fda087491396d/codescholar/search/init_search.py#L65C1-L66C21

In the lines above, search initialization just picks the "first" max_init_beams examples that have the seed in them.
This can affect the quality of idioms we get. Can we do better?
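One simple alternative, offered only as a starting point, is to sample the initial beams uniformly with a fixed random seed, so the choice no longer depends on dataset order but stays reproducible. `pick_initial_beams` is a hypothetical helper; whether sampling actually improves idiom quality is untested.

```python
import random


def pick_initial_beams(candidates, max_init_beams, seed=0):
    """Sample up to `max_init_beams` candidates uniformly, reproducibly."""
    candidates = list(candidates)
    if len(candidates) <= max_init_beams:
        return candidates
    # A private Random instance avoids touching the global RNG state.
    return random.Random(seed).sample(candidates, max_init_beams)


examples = [f"prog_{i}" for i in range(100)]
beams = pick_initial_beams(examples, max_init_beams=5)
print(len(beams))  # 5
```

A ranking-based choice, for example preferring examples from highly starred projects, would fit the same interface.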

CodeScholar's Search

CodeScholar will represent the search space of programs in a latent space. Each program is transformed into a set of neighborhoods (radial) anchored at every node in the program graph.

Each neighborhood is then passed through the trained model to extract a node embedding. As a result, each program is transformed into a set of node embeddings S; where |S| = #nodes in the program graph.
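The neighborhood extraction step could look like the sketch below, which assumes the program graph is given as an adjacency dictionary and that "radial" means all nodes within a fixed number of hops of the anchor. The function names and the toy graph are illustrative.

```python
from collections import deque


def radial_neighborhood(adjacency, anchor, radius):
    """Nodes within `radius` hops of `anchor` (breadth-first search)."""
    seen = {anchor}
    frontier = deque([(anchor, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == radius:
            continue  # do not expand past the radius
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, dist + 1))
    return seen


def all_neighborhoods(adjacency, radius):
    """One neighborhood per node, so |S| equals the number of nodes."""
    return {n: radial_neighborhood(adjacency, n, radius) for n in adjacency}


# A chain a - b - c - d.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(sorted(radial_neighborhood(graph, "a", 2)))  # ['a', 'b', 'c']
print(len(all_neighborhoods(graph, 1)))            # 4
```

Each neighborhood would then be passed to the trained model to produce the embedding for its anchor node.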

Finally, in this latent search space, explore (1) clustering, (2) walks, and (3) program farming + search methods to identify potential idioms.

Milestones:

[Week 1] Embed the search space of programs into a set of node embeddings
[Week 1] Cluster embeddings in the search space. Does this work?
[Week 2] Write a greedy/MCTS based frequent subgraph explorer in the search space. Does this work?
[Week 2+] Use custom program growth algorithm + Neural Subgraph Searcher to extract idioms. Does this work?

Improve the presentation of an idiom

https://github.com/tart-proj/codescholar/blob/9914b8f1c2079ef7a709d6e6d0f6b06132006e8e/codescholar/search/search.py#L42

This function currently takes the generated idioms and saves the partial programs (transformed partial sasts) as python files. This is not ideal:

the presentation lacks clarity -- a partial prog, esp. for a small-sized idiom, carries very little information.
the presentation lacks context -- a partial prog is concise but loses a lot of the context of the problem itself (surrounding code).
the presentation lacks provenance -- a partial prog (presented as an idiom) is separated from the original function; hence the promise of provenance (to a github program) is lost.
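One possible direction is to keep, for every idiom, a pointer back to where it occurred and to print the idiom lines highlighted inside their surrounding code. The record type, its fields, and the output layout below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IdiomOccurrence:
    repo: str        # hypothetical: e.g. "owner/project"
    path: str        # file path inside the repository
    start_line: int  # 1-indexed, inclusive
    end_line: int    # 1-indexed, inclusive


def render_with_context(source, occ, context=2):
    """Show the idiom lines marked with '>' plus surrounding lines."""
    lines = source.splitlines()
    first = max(1, occ.start_line - context)
    last = min(len(lines), occ.end_line + context)
    header = f"# {occ.repo}:{occ.path} lines {occ.start_line}-{occ.end_line}"
    body = []
    for number in range(first, last + 1):
        marker = ">" if occ.start_line <= number <= occ.end_line else " "
        body.append(f"{marker} {number:4d} {lines[number - 1]}")
    return "\n".join([header] + body)


source = "import pandas as pd\ndf = pd.read_csv(p)\ndf = df.dropna()\nprint(df)"
occ = IdiomOccurrence("owner/project", "clean.py", 3, 3)
print(render_with_context(source, occ, context=1))
```

The header line restores provenance, and the unmarked lines restore the context that a bare partial program loses.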
