manishshettym / codescholar
codescholar: growing programs graphs idiomatically for API usage examples
Instead of adding the ancestors to the walk, it might be worth considering adding them to the AST object in one of two ways:
- a parent field on every node that points to its parent node (and can then be navigated to the root in linked-list fashion)
- an ancestors field on every node that holds an ordered list of ancestors (which can then be navigated by reading through the list)
Whichever one is more efficient is a design decision you're probably better equipped to make. A NodeTransformer from the ast library would be able to achieve this.
Originally posted by @somaniarushi in #5 (comment)
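A minimal sketch of the first option (a parent field), assuming only the standard-library ast module. ParentAnnotator and ancestors are hypothetical names used for illustration, not part of CodeScholar. The ancestors helper shows that the second option can be derived from the first on demand.

```python
import ast


class ParentAnnotator(ast.NodeTransformer):
    """Attach a `parent` attribute to every node (None for the root)."""

    def __init__(self):
        self._stack = []  # nodes on the path from the root to the current node

    def visit(self, node):
        node.parent = self._stack[-1] if self._stack else None
        self._stack.append(node)
        self.generic_visit(node)  # recurses into children via self.visit
        self._stack.pop()
        return node


def ancestors(node):
    """Return the ancestors of `node`, nearest first, by following `parent`."""
    chain = []
    current = getattr(node, "parent", None)
    while current is not None:
        chain.append(current)
        current = current.parent
    return chain


tree = ParentAnnotator().visit(ast.parse("x = foo(1)"))
call = next(n for n in ast.walk(tree) if isinstance(n, ast.Call))
print([type(a).__name__ for a in ancestors(call)])  # ['Assign', 'Module']
```

One caveat: context nodes such as ast.Load may be shared between expressions, so their parent attribute is not meaningful.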
This experiment will perform the following tasks:
- S = {P1, P2, .. Pk}, where Pi is an idiomatic code snippet that performs a specific task.
- S in the prompt
- S but w/o P_i descriptions in the prompt
- S and P_i descriptions in the prompt
Implement a GitHub miner that pulls projects with the following specifications:
Might be worth passing in the max value for count so that we can stop the counter at that — for very large datasets, iterating through the lookup every time might not be worth it
Originally posted by @somaniarushi in #5 (comment)
The return value of node summary could probably be an object/dictionary for increased readability (node_summary[2] is very unobvious).
Originally posted by @somaniarushi in #5 (comment)
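One way to get named access without breaking existing index-based callers is a NamedTuple. This is only a sketch: the field names below are hypothetical placeholders, since the actual contents of the node summary are not described here.

```python
from typing import NamedTuple


class NodeSummary(NamedTuple):
    # Hypothetical fields; substitute whatever the summary really holds.
    node_type: str
    lineno: int
    source: str


summary = NodeSummary(node_type="Call", lineno=3, source="df.groupby('a')")

# Named access reads better, and positional access keeps working.
print(summary.source)
print(summary[2] == summary.source)  # True
```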
The CI/CD pipeline needs to build CodeScholar for the latest macOS and Ubuntu platforms. The build should install all dependencies and run test cases.
To start off, we need to create a generic logger for CodeScholar to use. Loguru is the package of choice.
codescholar/codescholar/search/embed_space.py
Line 169 in b3293f4
uses osp.dirname to check and create missing directories.
However, osp.dirname returns the parent directory of the given path, not the name of the directory the path is referring to.
Replace with:
if not osp.exists(args.graph_dir):
    os.makedirs(args.graph_dir)
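To illustrate the difference, here is a small self-contained sketch. ensure_dir is a hypothetical helper, and the paths are made up. It uses os.makedirs with exist_ok=True, which folds the existence check and the creation into one call.

```python
import os
import os.path as osp
import tempfile


def ensure_dir(path: str) -> str:
    """Create `path` (and any missing parents) if it does not exist yet."""
    os.makedirs(path, exist_ok=True)
    return path


# osp.dirname strips the last path component, so it names the parent directory.
print(osp.dirname("data/graphs"))  # data

with tempfile.TemporaryDirectory() as root:
    graph_dir = ensure_dir(osp.join(root, "data", "graphs"))
    print(osp.isdir(graph_dir))  # True
    ensure_dir(graph_dir)  # calling it again is a no-op, not an error
```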
CodeScholar will represent code snippets in a graph data-structure that explains the structure and semantics of the method. Ideally this should be some form of a graph [1] that captures data flow, control flow, and/or other semantic information.
Then, each program p will be passed through a GNN to learn program embeddings that are aware of subgraph semantics. The GNN training approach and loss function will be adapted from NeuroMatch [2].
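For orientation, the order-embedding objective described in [2] can be sketched in plain Python as follows. This is a simplified reading of that paper, not CodeScholar's training code: the function names and the default margin are assumptions, and a real implementation would operate on batched tensors.

```python
from typing import Sequence


def order_violation(z_query: Sequence[float], z_target: Sequence[float]) -> float:
    """Squared amount by which z_query exceeds z_target, summed over dimensions.

    It is zero when z_query <= z_target elementwise, which is the property
    an order embedding uses to encode "query is a subgraph of target".
    """
    return sum(max(0.0, q - t) ** 2 for q, t in zip(z_query, z_target))


def max_margin_loss(z_query: Sequence[float], z_target: Sequence[float],
                    is_subgraph: bool, margin: float = 1.0) -> float:
    """Positive pairs are pushed toward zero violation; negative pairs
    are pushed to a violation of at least `margin`."""
    violation = order_violation(z_query, z_target)
    return violation if is_subgraph else max(0.0, margin - violation)


print(order_violation([0.1, 0.2], [0.5, 0.9]))            # 0.0
print(max_margin_loss([0.0, 0.0], [0.0, 0.0], False))     # 1.0
```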
Milestones:
- pandas-idiom dataset
Experiments:
- <GNN Type> + NeuroMatch
- <node features> + GNN + NeuroMatch
References:
[1] https://arxiv.org/abs/2208.07461
[2] https://arxiv.org/abs/2007.03092
Given a dataset of programs P, using standard program analysis techniques and graph-based induction, CodeScholar will reduce P to a set of idiomatic and reusable code snippets S called CodeScholar APIs. Each snippet in S is representative of a code concept such as “search”, “sort”, “join”, etc. Lastly, CodeScholar will refactor these snippets into a Python function that can take an arbitrary number of parameters and return a Python object.
Instead of breaking down programs into smaller and frequent fragments, CodeScholar will tackle the problem by "growing" idiomatic code fragments. It should start at single-node programs (1 stmt) and perform a greedy graph composition and pruning to farm idiomatic code patterns.
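The growing idea can be illustrated with a toy sketch. Everything here is an assumption made for illustration: programs are tiny undirected graphs keyed by statement label, support only counts programs that contain all of a candidate's nodes (a real system would need proper subgraph matching), and grow, beam_width, min_support, and max_size are hypothetical names.

```python
from typing import Dict, FrozenSet, List, Set

Graph = Dict[str, Set[str]]  # node label -> labels of adjacent nodes


def support(candidate: FrozenSet[str], programs: List[Graph]) -> int:
    """Number of programs that contain every node of the candidate."""
    return sum(1 for g in programs if candidate.issubset(g))


def grow(seed: str, programs: List[Graph], beam_width: int = 2,
         min_support: int = 2, max_size: int = 3) -> List[FrozenSet[str]]:
    beam = [frozenset([seed])]  # start from a single-node candidate
    for _ in range(max_size - 1):
        expansions = set()
        for cand in beam:
            for g in programs:
                if not cand.issubset(g):
                    continue
                # Compose: add one node adjacent to the candidate.
                for node in cand:
                    for nbr in g[node]:
                        if nbr not in cand:
                            expansions.add(cand | {nbr})
        # Prune: drop rare candidates, keep the most frequent ones.
        scored = [(support(c, programs), sorted(c), c) for c in expansions]
        kept = [s for s in scored if s[0] >= min_support]
        if not kept:
            break
        kept.sort(key=lambda s: (-s[0], s[1]))
        beam = [c for _, _, c in kept[:beam_width]]
    return beam


p1 = {"read_csv": {"groupby"}, "groupby": {"read_csv", "mean"}, "mean": {"groupby"}}
p2 = {"read_csv": {"groupby"}, "groupby": {"read_csv", "mean"},
      "mean": {"groupby", "plot"}, "plot": {"mean"}}
p3 = {"read_csv": {"head"}, "head": {"read_csv"}}
print(grow("read_csv", [p1, p2, p3]))
```

In this toy run the candidate containing head is pruned because it appears in only one program, while the read_csv, groupby, mean pattern survives.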
Search initialization currently just picks the first max_init_beams examples that contain the seed.
This can affect the quality of idioms we get. Can we do better?
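One simple alternative, offered only as a starting point for discussion, is to sample the initial beams uniformly at random with a fixed seed so that runs stay reproducible. sample_initial_beams is a hypothetical helper; only max_init_beams comes from the description above.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_initial_beams(candidates: Sequence[T], max_init_beams: int,
                         seed: Optional[int] = 0) -> List[T]:
    """Pick up to max_init_beams candidates uniformly at random, without replacement."""
    if len(candidates) <= max_init_beams:
        return list(candidates)
    rng = random.Random(seed)  # local RNG, leaves the global random state alone
    return rng.sample(candidates, max_init_beams)


examples = list(range(100))  # stand-ins for examples that contain the seed
picked = sample_initial_beams(examples, max_init_beams=5)
print(len(picked))  # 5
```

Uniform sampling removes the bias toward whatever happens to come first in the dataset, but it does not by itself favor diverse or high-quality starting points.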
CodeScholar will represent the search space of programs in a latent space. Each program is transformed into a set of neighborhoods (radial) anchored at every node in the program graph.
Each neighborhood is then passed through the trained model to extract a node embedding. As a result, each program is transformed into a set of node embeddings S; where |S| = #nodes in the program graph.
Finally, in this latent search space, explore (1) clustering, (2) walks, and (3) program farming + search methods to identify potential idioms.
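The neighborhood step can be sketched with a breadth-first search. This is a minimal illustration that assumes the program graph is an adjacency list keyed by integer node ids; radial_neighborhood, all_neighborhoods, and the radius parameter are hypothetical names. The embedding step is omitted because it depends on the trained model.

```python
from collections import deque
from typing import Dict, List, Set


def radial_neighborhood(graph: Dict[int, List[int]], anchor: int, radius: int) -> Set[int]:
    """Nodes within `radius` hops of `anchor` (the anchor itself included)."""
    seen = {anchor}
    frontier = deque([(anchor, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == radius:
            continue  # do not expand beyond the radius
        for nbr in graph.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, dist + 1))
    return seen


def all_neighborhoods(graph: Dict[int, List[int]], radius: int) -> Dict[int, Set[int]]:
    """One neighborhood per node, so the result has as many entries as the graph has nodes."""
    return {anchor: radial_neighborhood(graph, anchor, radius) for anchor in graph}


path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}  # toy graph: 0 - 1 - 2 - 3
print(radial_neighborhood(path, 0, 2))  # {0, 1, 2}
print(len(all_neighborhoods(path, 1)))  # 4
```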
Milestones:
This function currently takes a bunch of generated idioms and saves the partial programs (transformed partial sasts) as Python files. This is not ideal:
- clarity -- a partial program, especially for a small idiom, carries very little information.
- context -- a partial program is concise but loses a lot of the context of the problem itself (the surrounding code).
- provenance -- a partial program (presented as an idiom) has been separated from its original function; hence the promise of provenance (to a GitHub program) is lost.