microsoft / python-program-analysis Goto Github PK

A Typescript library for parsing Python 3 and doing basic program analysis, like forming control-flow graphs and def-use chains.

License: MIT License

TypeScript 77.35% Yacc 21.61% JavaScript 1.04%

python-program-analysis's Introduction

python-program-analysis

A Typescript library for parsing Python 3 and doing basic program analysis, like forming control-flow graphs and def-use chains.

Parsing

To parse Python 3 code, pass a string containing the code to the parse method.

const code = [
    'x, y = 0, 0',
    'while x < 10:',
    '   y += x * 2',
    '   x += 1',
    'print(y)'
];
const tree = parse(code.join('\n'));

This method returns a tree of SyntaxNode objects, discriminated with a type field. The library also provides a function walk for pre-order tree traversal. With no arguments, it returns a list of the syntax nodes in the tree.

walk(tree).map(node => node.type)
// produces ["module", "assign", "literal", "literal", "name", "name", "while", "binop", …]

Optionally, walk takes a visitor object with methods onEnterNode (for pre-order traversal) and onExitNode (for post-order traversal).

Syntax nodes can be turned back into code with the printNode function, which produces a string. There is no guarantee of round-tripping. That is printNode(parse(code)) could be syntactically different than code, but will be semantically the same. For example, there may be extra parentheses around expressions, when compared with the original code. The printNode function is primarily for debugging.

Control flow

A control flow graph organizes a parse tree into a graph where the nodes are "basic blocks" (sequences of statements that run together) and the edges reflect the order of block execution.

const cfg = new ControlFlowGraph(tree);

cfg.blocks is an array of the blocks in the control flow graph, with cfg.entry pointing to the entry block and cfg.exit pointing to the exit block. The control flow graph for the parse tree above looks like this.

Each block has a list of its statements.

printNode(cfg.blocks[0].statements[0])

prints x, y = 0, 0.

The methods cfg.getSuccessors and cfg.getPredecessors allow the edges to be followed forward or backward.

const cond = cfg.getSuccessors(cfg.entry)[0];
printNode(cond)

prints x < 10.

Data flow

The library also provides basic def-use program analysis, namely, tracking where the values assigned to variables are read. For example, the 0 assigned to x in the entry block is read in the conditional x < 10, in the assignments y = x * 2 and x += 1.

const analyzer = new DataflowAnalyzer();
const flows = analyzer.analyze(cfg).flows;
for (let flow of flows.items) 
    console.log(printNode(flow.fromNode) + 
        " -----> " + printNode(flow.toNode))

prints

x, y = 0, 0 -----> x < 10
x, y = 0, 0 -----> print(y)
x, y = 0, 0 -----> y = x * 2
x += 1 -----> x < 10
y = x * 2 -----> print(y)
x += 1 -----> y = x * 2

Program Slicing

Program slicing removes lines from a program that are unnecessary to see the effect of a chosen line of code. For example, if we only care about the print statement in this program:

sum = 0
diff_sum = 0
for i in range(min(len(A), len(B))):
    sum += A[i] + B[i]
    diff_sum += A[i] - B[i]
print(sum)

then we can simplify the code to this:

sum = 0
for i in range(min(len(A), len(B))):
    sum += A[i] + B[i]
print(sum)

The function call slice(P,loc) takes a program P (a parse tree) and a program location loc and returns the program locations that are necessary for loc. For example, to do the slicing example above, we call slice(ast, {first_line: 6, first_column: 0, last_line, 6: last_column: 10}) which returns a LocationSet whose members have first_line values of 1, 3, 4, and 6 (but not 2 or 5).

API Specs

When deciding whether an API call needs to appear in a program slice, the slicing algorithm needs to know whether the call has a side-effect on the variables that are passed to it (including the self parameter for method calls). The call f(x) has a side-effect on x if f updates a field (x.m = y), updates an element (x[i] = y), updates a global variable, or transitively calls another function that has a side effect. Rather than analyzing the code of a called function (which may not even be available), we rely on having specifications, recorded in JSON files. Here is the specification pandas.json (with some lines omitted):

{
  "pandas": {
    "functions": [
      "array",
      "bdate_range",
      ...
      { "name": "read_clipboard", "returns": "DataFrame" },
      { "name": "read_csv", "returns": "DataFrame" },
      { "name": "read_excel", "returns": "DataFrame" },
      ...
    ],
    "types": {
      "DataFrame": {
        "methods": [
          "abs",
          "add",
          "add_prefix",
          ...
          { "name": "pop", "updates": [0] },
          ...
        ]
      }
    }
  }
}

A module's spec provides a list of the module's functions, types, and submodules. A type spec provides a list of the type's methods. In the function/method list, if a function/method appears just as a name (for example, "array" or "abs"), then it has no side-effects and doesn't return any objects with specifications. Otherwise, the function/method appears as a dictionary with its name in a name field and any of the following:

updates is an array of strings that lists the parameters that experience side-effects. 0 refers the self parameter, a number k >= 1 refers to the k th parameter, and any non-numeric string is the name of an updated global variable.
reads is an array of strings with the global variables that the method reads. (If a slice includes a call to a function that reads a global, then it must also include any calls that update that global.)
returns is the type of object the call returns, which is only necessary if that type has a spec.

The specs allow slice to analyze code like the following:

import pandas as pd
d = pd.read_csv("some_path")
d.pop("Column")
d.memory_usage()
d.count()            // ← slice on this line

Looking at the spec above, we can see that read_csv return a DataFrame object, so d is a DataFrame. The call to pop on d has a side-effect on the self parameter (d), because DataFrame's pop spec has an updates of [0]. Therefore, this call to pop must appear in the slice. On the other hand, the call to memory_usage has no side-effects, so it can be left out. So, the final slice includes lines 1, 2, 3, and 5, but not 4.

If there is no spec for an API call, then the data-flow and slicing algorithms conservatively assume that any passed parameter could experience a side-effect.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

python-program-analysis's People

Stargazers

Watchers

python-program-analysis's Issues

Odd behavior with analysis 0.4.1

I’m having trouble getting a minimal repro for this issue.

Starting with this repo: https://github.com/ronglums/PyCon2019Booth then fix cell 10 to say “from sklearn.model_selection import train_test_split”.

If I execute the first 6 code cells and then gather a slice for the 6th cell, only get one line of code: “del df[‘skin’]. Shouldn’t I get cells 1 & 2 as well?
If I execute to code cell 15 and then gather that, the code I get includes code cell 5 (“check(df)”), but otherwise seems good. In particular, cell 11 (which is just two print statements) isn’t included.
If I execute the rest of the notebook and gather the final cell (#27) I end up with the print-only cells that were not included in step 2.

If there’s an easier way to demonstrate this to you, let me know.

Analysis generates invalid python code.

To repro with gather extension:

Open a notebook with these two cells:

#%%
A = [0,1,2,3]
B = [4,5,6,7]
sum = 0
diff_sum = 0
for i in range(min(len(A), len(B))):
    sum += A[i] + B[i]
    diff_sum += A[i] - B[i]

#%%
print(sum)

Execute the first cell twice.
Execute the second cell.
Gather the second cell to a new notebook.

Expected: 2 cells

#%%
A = [0,1,2,3]
B = [4,5,6,7]
sum = 0
for i in range(min(len(A), len(B))):
    sum += A[i] + B[i]

#%%
print(sum)

Actual: 3 cells

#%%
A = [0,1,2,3]
B = [4,5,6,7]
for i in range(min(len(A), len(B))):

#%%
A = [0,1,2,3]
B = [4,5,6,7]
sum = 0
for i in range(min(len(A), len(B))):
    sum += A[i] + B[i]

#%%
print(sum)

The first cell is invalid Python code and shouldn't be included in the final gather.

Don't gather print statements (or similar) by default

This issue's repro is related to issue #5.

Given this script/notebook:

#%%
import pandas as pd # pandas is a dataframe library

#%%
Cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4'],
        'Price': [22000,25000,27000,35000]
        }
df = pd.DataFrame(Cars,columns= ['Brand', 'Price'])

#%%
def check(df, size=11):
    print(df)

#%%
check(df)

#%%
print(df)

#%%
x = df['Brand'].values
y = df['Price'].values

Expected

# This file contains the minimal amount of code required to produce the code cell you gathered.
#%%
import pandas as pd # pandas is a dataframe library

#%%
Cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4'],
        'Price': [22000,25000,27000,35000]
        }
df = pd.DataFrame(Cars,columns= ['Brand', 'Price'])

#%%
def check(df, size=11):
    print(df)

#%%
check(df)

#%%
x = df['Brand'].values
y = df['Price'].values

No need to gather the stand-alone print(df) call (cell 5). Actually, I would expect the function "check" and it's call to not be there either, but that's the topic of issue #5. I'd think print calls should not be gathered by default.

Actual

# This file contains the minimal amount of code required to produce the code cell you gathered.
#%%
import pandas as pd # pandas is a dataframe library

#%%
Cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4'],
        'Price': [22000,25000,27000,35000]
        }
df = pd.DataFrame(Cars,columns= ['Brand', 'Price'])

#%%
def check(df, size=11):
    print(df)

#%%
check(df)

#%%
print(df) ***Why is this gathered?***

#%%
x = df['Brand'].values
y = df['Price'].values

rewriteCellMagic is over aggressive

The regex being used to match and comment out lines that contain cell magics is too aggressive. I think this one will do what's expected.

/^[^#\s]\s%%/gm

I actually created a PR but I can't submit because I'm not a contributor. :)

Because of this issue, any cell that contains code lines that have any %%'s in it will be entirely commented out. This is a problem if the user has a line in the cell like "# this is not a cell magic %%" or (in my case) "# To add a new cell, type '#%%'"

Function and it's call site gathered when it shouldn't

Using this notebook/script:

#%%
import pandas as pd # pandas is a dataframe library

#%%
Cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4'],
        'Price': [22000,25000,27000,35000]
        }
df = pd.DataFrame(Cars,columns= ['Brand', 'Price'])

#%%
def check(df, size=11):
    print(df)

#%%
check(df)

#%%
x = df['Brand'].values
y = df['Price'].values

Gather the last cell and generate a new notebook

Expected

# This file contains the minimal amount of code required to produce the code cell you gathered.
#%%
import pandas as pd # pandas is a dataframe library

#%%
Cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4'],
        'Price': [22000,25000,27000,35000]
        }
df = pd.DataFrame(Cars,columns= ['Brand', 'Price'])

#%%
x = df['Brand'].values 
y = df['Price'].values

Actual

# This file contains the minimal amount of code required to produce the code cell you gathered.
#%%
import pandas as pd # pandas is a dataframe library

#%%
Cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4'],
        'Price': [22000,25000,27000,35000]
        }
df = pd.DataFrame(Cars,columns= ['Brand', 'Price'])

#%%
def check(df, size=11):
    print(df)

#%%
check(df)

#%%
x = df['Brand'].values
y = df['Price'].values

Since the function call doesn't do anything to the input parameter, you'd expect that the function and the function call not be included in what is gathered.

Detect when numpy function calls don't modify parameters

As shown in microsoft/gather#23, gather will detect methods like np.sin as modifying the np object, thus including these calls in slices where they don't belong.

As an example, consider the following two cells. Try to gather the output of the second cell. You'll find nbgather will gather the line np.sin(10), even though it's not necessary in the slice.

# Cell 1
import numpy as np
np.sin(10)

# Cell 2
print(np.arange(0, 10))

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.