lisa-analyzer / lisa Goto Github PK
View Code? Open in Web Editor NEW๐ a modular easy to use Library for Static Analysis aiming at multi-language analysis
Home Page: https://lisa-analyzer.github.io/
License: MIT License
๐ a modular easy to use Library for Static Analysis aiming at multi-language analysis
Home Page: https://lisa-analyzer.github.io/
License: MIT License
Description
The smashed sum abstract domain takes n > 2
abstract domains and compose them smashing their bottom elements and creating a new single bottom element. Top elements are not smashed but, to preserve the lattice structure, a new top element is introduced (that is greater than each top element of the underlying abstract domains). Least upper bound of two elements of such domain is top when elements come from different abstract domains composing the smashed sum, the least upper bound of the underlying domain otherwise. It is important to have such a way to compose domains especially when dealing with dynamic languages or analysis that handle values of different types.
Suggested implementation
The implementation should be an instance of SemanticDomain
and should introduce singletons bottom and top elements. The implementation of the methods of SemanticDomain
(e.g., assign, lub) should be similar to the Cartesian product one, except that we need to pay attention when we get non-bottom values from different abstract domains, in such a casa top should be returned.
Description
As for #69, HeapEnvironment
is adding additional fields to the standard environment/functional lattice. Both equals()
/hashCode()
, as well as lattice operators, should take these fields into account to guarantee the correct behavior of all operations.
Description
Currently, equals
and hashCode
methods in FunctionalLattice
just consider the class of the lattice references, while they should take into accout their instances. This is important in particular for the class Environment
, where lattice
is used to check if the environment is either top or bottom: without considering the lattice instances, a top and bottom environments are wrongly equal.
Currently, base non relational domains do not reason about type casting and type conversion expressions, since they just return top when these expressions are met. These domains should be able instead to reason about these symbolic expressions: for instance, interval domain should be able to cast or convert an interval to the type tracked by the domain, i.e., integer.
Description
To properly assess the effectiveness of the InferenceSystem
and InferredValue
classes, we need a proper analysis based on an inference system that needs a program counter/execution state. Non-interference, in its type system definition, is the chosen candidate.
Description
In branch program-structure
(commit 680e53a), a NullPointerException
is thrown when LiSA tries to dump the analysis or typing results for a non reachable node of a CFG. In particular, the exceptions are raised at line 399
dumpCFG("typing___", result, st -> result.getAnalysisStateAt(st).toString());
and line 425
dumpCFG("analysis___", result, st -> result.getAnalysisStateAt(st).toString());
since, if st
is an unreachable node in the CFG (i.e., it has no incoming edges), result.getAnalysisStateAt(st)
returns null
. In order to fix this bug, it should be enough to check if result.getAnalysisStateAt(st)
is null
, for instance:
dumpCFG("analysis___", result, st -> result.getAnalysisStateAt(st) == null ? "Dead code" : result.getAnalysisStateAt(st));
Description
We should implement a general analysis infrastructure for LiSA, in order to provide a starting framework for the definition of new semantic-driven analyses.
Motivation
This feature implementation corresponds to the LiSA analysis core and it should contain the starting point for the definition of new semantic analyses. In particular, we should provide:
Description
Variables that appear in one statement and are pushed in the abstract state are never popped out. Having a variable table that defines the scope of each of them enables the automatic removal through forgetIdentifier()
of out-of-scope variables.
Motivation
Not removing variables leads to noise when inspecting the abstract state, but also to errors: two different weak identifiers in two different scopes but with same name should be treated separately (i.e. the first assignment to each of them should be strong), but the first assignment to the second will be lubbed with the last assigned value of the first.
Description
Interval
domain does not override the satifies methods from the BaseNonRelationalValueDomain
class. We need to implement them in order to gain more precision when the domain checks the satisfiability of a symbolic expression.
Non relational domains should implement the greatest lower bound operation in order to improve precision on assume
method on environments, in particular for filtering the values that should survive after the assumption.
Description
We should ensure that the code location of each cfg element (Parameter, CFGDescriptor, CompilationUnit, etc) is not null. If the code location of an element is unknown, we should put the unknown location of the corresponding CodeLocation
concrete implementation (e.g., for SourceCodeLocation
we should put new SourceCodeLocation(null, -1, -1)
).
Description
There is a mistake in the current implementation of interval multiplication: it calculates x2 * y2
in the intervals multiplication equations, instead of x1 * y2
.
Source: https://en.wikipedia.org/wiki/Interval_arithmetic#Interval_operators
Reproducibility information
This bug was introduced in commit ae11414 , on master branch inside the file lisa/src/test/java/it/unive/lisa/test/imp/tutorial/intervals/Intv.java
.
You can reproduce this bug by doing operations in your programs like: [5,5] * [10,10]
, that returns: [50, 100]
instead of the correct answer: [50, 50]
.
Currently, weak updates are handled by the analysis state. Given an assignment id = exp, if id is a weak identifier, the analysis state lubs the previous analysis state with the current one (the one tracking the assignment). This leads to a terrible loss of precision when the previous state is the TOP abstract state: considering again the weak assignment id = exp, we would lose the information that id -> TOP. In order to get more precise results, we need to move the handling of the weak updates on the domains. For instance, in the case of non-relational domains, weak updates should be handled by the environment.
Description
Execution of syntactic checks should exploit the low resource consumption that is typical of these checks to reduce the overall execution time.
Motivation
It is reasonable to assume that syntactic checks would far outnumber semantic ones in a realistic analysis scenario. It is possible to take advantage of the isolation of these checks (typically considering information from one statement/expression at a time) to parallelize their execution. This would also represent the first parallelization effort, that would provide the framework for a library-wide parallelization framework.
Suggested implementation
Standard executors relying on thread pools would be the preferred way to go. Once that framework is in place, one can either keep the execution of single checks sequential while parallelizing different checks, or execute one check at a time but iterating over the cfgs concurrently. A more intelligent approach, deciding a tradeoff between the two approaches relying on the number of cores available would probably be best.
Further content
We should also investigate the possibility of parallelizing semantic checks too.
Description
A call to a constructor, using the Intraprocedural call graph abstraction, returns the top abstract state, causing the loss of any previously gathered analysis information. Informally speaking, when a constructor is called, it is as the analysis restarted from the point the constructor is called.
Reproducibility information
In branch program-structure
(commit: 680e53a). Let's consider the following IMP program
class example {
f(e) {
def obj = new example();
}
}
and let's suppose to analyze it using the default abstractions for the CallGraph (IntraproceduralCallGraph
) and HeapDomain (MonohiliticHeap
), and Sign
for the ValueDomain. The result of the analysis is reported in the following, by means of the corresponding dumped dot file.
Note that, the abstract value regarding the parameter e
(namely TOP
, the top of the Sign domain) is lost.
The expected result is the same dot file, except that in the first node the value analysis also contains the information about the parameter e
, i.e., contains the entry vid$e: TOP
.
Description
SetLatticeLattice
and InverseSetLattice
concrete implementations in LiSA should override equals
and hashCode
methods, considering their own logics. Specifically, any concrete implementation of these domains we have currently in LiSA relies on the property isTop
in order to check, together with the elements
property of the superclass, if a lattice instance is either top or bottom. Currently, concrete implementations rely on the equals
and hashCode
methods of the superclass, that just check the equality between elements. In this way, top and bottom elements of the concrete implementations are wrongly seen as equal.
Description
Support for frontend definition of conditional structures and heuristics to identify them if not provided by frontends.
Motivation
Domains might need syntactic information about where the program point being analyzed is inside the cfg in order to apply additional logic.
Suggested implementation
The most accurate information possible comes from the original program: frontends, where possible, should add conditional structures corresponding to loops and if-elses. Where those are not present (compiled code analysis), a set of best-effort heuristics should be in place, in order to extract as much information as possible.
Description
The abstract state provided by LiSA is not expressive enough. For instance, techniques like trace partitioning would have to be applied on value state and heap state independently. We should create a general AbstractState
interface, such that users can define their own abstract state and combine them, having a direct effect on both the heap state and value state at once.
Description
The CFG graph structure should be abstracted away, since more components could benefit from the same structure. One example of these are the callgraphs, that could be implemented as a graph backed from an adjacency matrix and that will surely need a fixpoint algorithm.
Motivation
Reducing code duplication and increase encapsulation and modularity.
Description
Depending on the language, calls might be resolved in different ways:
The callgraph should be aware of the resolution strategy of the call to be aware of how to resolve the target signature.
We should centralize the visit of symbolic expressions (by introducing a visitor interface) in order to avoid duplicated code when we recursively inspect symbolic expressions, for example in non-relational domains and base inferred value.
Description
Add support for semantic checks that can access the results of the analysis.
Motivation
The most interesting part of static analysis is being able to issue warnings exploiting semantic information. Moreover, it gives a quick chance to verify the results of an analysis without having to thoroughly inspect its results.
Suggested implementation
Extend the structure of SyntacticCheck
and of CheckTool
, so that the same executor can be used to spawn the various checks.
Description
The collection of cfgs is not enough to express a program: there is no information about fields, global variables, cfg overriding, ...
We should introduce a form of program structure where cfgs are grouped into units, and the general model is enriched by adding all the capabilities that we are missing.
Description
Following the work of #12, we should provide basic components for first analyses implementations (callgraph, heap abstraction), runtime type information, and an architecture for selecting and executing analyses.
Motivation
We want to provide basic features for running analyses for testing/educational purpose. Ideally we want one simple, imprecise and fast implementation of callgraph, heapdomain and valuedomain. Type inference is also needed for providing domains a way to check if a variable/expression/constant is of interest (e.g. numerical domains want to track properties only on numerical values). A way of specifiying the analyses to execute is also needed.
Suggested implementation
CallGraph
A first implementation could be a callgraph that (i) computes the fixpoint by iterating over all cfgs in no particular order, (ii) resolves call targets to all possible cfgs that are compatible with the target's signature, (iii) when queried for the abstract result of a cfg call, returns top (that is, it treats all calls as open).
HeapDomain
A first implementation could be one that abstracts every heap location to a unique identifier, causing all reads and writes to refer to that one. Extremely imprecise but very fast.
TypeInference
This should be a separate fixpoint iteration that computes type runtime information and then stores them directly inside Statement
s, that can then use this information to determine what to do during the semantics()
computation.
Architecture for analysis selection
A nice feature would be to just specify some analyses through addAnalysis(analysis.class)
, and then to have LiSA selecting the required analyses to run before and constructing a combination of domains to run.
Right now, most of the syntactic components have a reference to the file, line and column where they are defined in the source code, but not all programs might have source code: reading from bytecode or compiled code, the location in the original program can be determined through other means.
We should introduce a generic 'code location' interface, implemented by a 'source code location', a 'bytecode location', ... that provides a unique string representation of the location where that element is defined, and that domains can use.
Description
SetLattice
and InverseSetLattice
should provide the greatest lower bound operation in order to better manipulate set elements in abstract domains using them, such as upper bounds or pentagons.
Description
Graph comparisons use string equality on the contents of the nodes. If the same set of information can be dumped in different order due to non-deterministic iteration, such comparisons will fail even when performing on the same data.
Affected classes are all the ones that hold sets of information: DataflowDomain
s, Environment
s, ...
Currently, LiSA supports only type casting in symbolic expressions, but there is no support for type conversion. We need to add two separate symbolic expressions handling type conversion and type casting. Specifically,
Description
Currently, LiSA does not support annotations, limiting its internal representation's expressiveness. We need to provide annotations in LiSA internal representation that are as generic as possible in order to model the different programming languages annotations (e.g., Java vs C#) and an efficient way to inject the annotations to the proper elements of the control-flow graph.
Description
Submitting new analyses to LiSA is now achieved by invoking lisa.addXXX(domain)
. This implies that for each new "construct" (call graphs, abstract states, ...) a new method should be implemented. Having a factory that loads and compose the bits of an analysis would simplify the process.
Implement the first instances of heap analyses:
When a method with an empty body is analyzed using the IMP frontend, the analysis crashes during the fixpoint computation.
Branch InterproceduralAnalysis. Testcase ContextSensitiveInterproceduralAnalysisTest.testRTAContextSensitive
When it resolves the various method calls, the dynamic types of the receivers are always empty, thus it fails to resolve the calls.
In particular, when debugging the aforementioned test case with a breakpoint at line 40 of class CallGraphBasedInterproceduralAnalysis it firsts stop for call [unresolved]A(this) (that is the first line of method subtyping). When stepping into, this is resolved to dynamic type A (this looks odd to me, this should be of type subtyping since this is the class where the code is contained, however the type is correct - but the receiver is not :)), but when it tries to match the signature of the called methods with the possible candidates (line 83 AbstractBaseCallGraph) it fails since the call has a parameter (this) while the constructor has no parameter.
Support inteprocedural analysis, and in particular:
Description
All instances of abstract domains (SemanticDomain
, NonRelationalDomain
, ...) should have knowledge of the program point that generated the expressions they are evaluating.
Motivation
Information about the program point (i.e. if it is inside a loop) can be required by abstract domains for them to be able to apply their reasoning. A simple example of this is a heap abstraction that creates weak identifiers only for program points that are inside a loop.
Suggested implementation
Statement
should implement a ProgramPoint
interface that provides access to the cfg. The cfg can then provide methods to know if a program point is inside loops, or any other logic required by domains. All methods of abstract domains should take the program point being evaluated as parameter.
Collections of symbolic expressions that we use around the analysis (for example, rewritten expressions) behave as instances of a SetLattice abstract domain. We should create a set lattice domain of symbolic expression (ExpressionSetLattice) and replace the collections around the analysis with instances with this domain.
Description
The CFG structure should have an Collection<Statement> entrypoints
instead of the the Statement first
field to support multi-entrypoints procedures.
Motivation
In some languages it is possible to jump in the middle of the code of a procedure. Thus, those procedure will have more than one entry point (or "first" instruction). To model those, a more flexible structure is needed.
Currently, non-relational abstract domains are able to process any identifier, independently of the identifier types. For instance, numerical abstract domains can only track identifiers, and not, for example, pointer identifiers. This lead to unexpected analysis results (e.g., pointer identifiers abstracted into interval values). We need to define which identifiers can be tracked by a non-relational abstract domain: if a non-relational domain cannot track an identifier, it should assign top to it.
Description
Currently, InferenceSystem
inherits equals
and hashCode
methods from Environment
, without taking into account the inferred value of the instance of inference system. It should override the methods taking into account also the inferred value.
Reproducibility information
On master branch (commit d7e12f7) for instance the inference systems
x: [int] y: [int] [inferred: _|_, state: _|_]
and x: [int] y: [int] [inferred: [int], state: _|_]
are equal even if they infer different values. This lead to a wrong behavior on lub
: in particular, the expected result is x: [int] y: [int] [inferred: [int], state: _|_]
, but the actual one is x: [int] y: [int] [inferred: _|_, state: _|_]
Description
Heap domains might produce weak identifiers. On those, the semantics of assignment should be modified to perform the least upper bound between the entry state and the exist state of the assignment.
Suggested implementation
Add an isWeak
property in Identifier
s, and have the AnalysisState
check for that flag and perform a strong or weak assignment.
The results of the type analysis in the test cases seem to be wrong. For instance, in https://raw.githubusercontent.com/UniVE-SSV/lisa/master/lisa/imp-testcases/type-inference/typing___untyped_typing.test2(untyped_i).dot it infers that the type of the expression after a Boolean condition (node 3) is TOP instead of Boolean.
In addition, it exposes non-deterministic behaviors in branch inteprocedural. For instance, in the example above sometimes it computer TOP, other times integer.
I tried to use HashSet instead of ExternalSet in field elements of InferredTypes but it does not help (the nondeterministic behaviors are fewer/less frequent, but still there).
Description
At the moment, the call resolution algorithm inside call graph throws a CallResolutionException
if a call that has both CFG
s and NativeCFG
s as targets is found. We need to support this case.
Description
All inputs (CFGs for now) and outputs (collections of warnings for now) should be serializable, such that we can read them from - and write them to - the file system.
Motivation
Having serialization and deserialization feature would ease the creation of tests as well as keeping a clean reference result for regression tests. This would also enable external tools to produce inputs for LiSA and read her outputs.
Suggested implementation
JAXB is probably the best choice, since schemas can be automatically generated for other tools to generate data structures modeling LiSA's.
Description
It should be possible to easily define dataflow analysis by just defining the join operator and the gen
/kill
functions.
Description
The current simplification algorithm is mishandling the simplification of NoOp
s that are entry nodes or exit nodes. This leads to exceptions at various stages of the analysis (fixpoint, graph dumping, ...).
In short:
NoOp
s that are entrypoints get removed, but their outgoing edges are not; the set of entrypoints is not modifiedNoOp
s that are exitpoints get removed, but their ingoing edges are notReproducibility information
Happening on v0.1a2. Tested on three CFGs:
CFG first = new CFG(new CFGDescriptor("foo"));
NoOp start = new NoOp(first);
Assignment assign = new Assignment(first, new Variable(first, "x"), new Literal(first, 5, Untyped.INSTANCE));
Return ret = new Return(first, new Variable(first, "x"));
first.addNode(start);
first.addNode(assign);
first.addNode(ret);
first.addEdge(new SequentialEdge(assign, ret));
first.addEdge(new SequentialEdge(start, assign));
first.simplify();
CFG second= new CFG(new CFGDescriptor("foo"));
Assignment assign1 = new Assignment(first, new Variable(first, "x"), new Literal(first, 5, Untyped.INSTANCE));
Assignment assign2 = new Assignment(first, new Variable(first, "y"), new Literal(first, 50, Untyped.INSTANCE));
NoOp end = new NoOp(first);
first.addNode(end);
first.addNode(assign1);
first.addNode(assign2);
first.addEdge(new SequentialEdge(assign1, assign2));
first.addEdge(new SequentialEdge(assign2, end));
second.simplify();
CFG third = new CFG(new CFGDescriptor("foo"));
Assignment assign1 = new Assignment(first, new Variable(first, "b"),
new Literal(first, true, Untyped.INSTANCE));
Assignment assign2 = new Assignment(first, new Variable(first, "x"), new Literal(first, 5, Untyped.INSTANCE));
Assignment assign3 = new Assignment(first, new Variable(first, "y"), new Literal(first, 50, Untyped.INSTANCE));
NoOp end = new NoOp(first);
first.addNode(end);
first.addNode(assign1);
first.addNode(assign2);
first.addNode(assign3);
first.addEdge(new TrueEdge(assign1, assign2));
first.addEdge(new FalseEdge(assign1, assign3));
first.addEdge(new SequentialEdge(assign2, end));
first.addEdge(new SequentialEdge(assign3, end));
third.simplify();
Expected behavior
toString()
of the adjacency matrix of the three cfgs:
first:
"x = 5" -> [
ingoing:
outgoing: [ x = 5 ] ---> [ return x ]
]
"return x" -> [
ingoing: [ x = 5 ] ---> [ return x ]
outgoing:
]
second:
"x = 5" -> [
ingoing:
outgoing: [ x = 5 ] ---> [ y = 50 ]
]
"y = 50" -> [
ingoing: [ x = 5 ] ---> [ y = 50 ]
outgoing:
]
third:
"b = true" -> [
ingoing:
outgoing: [ b = true ] -T-> [ x = 5 ], [ b = true ] -F-> [ y = 50 ]
]
"y = 50" -> [
ingoing: [ b = true ] -F-> [ y = 50 ]
outgoing: [ y = 50 ] ---> [ no-op ]
]
"x = 5" -> [
ingoing: [ b = true ] -T-> [ x = 5 ]
outgoing: [ x = 5 ] ---> [ no-op ]
]
"no-op" -> [
ingoing: [ x = 5 ] ---> [ no-op ], [ y = 50 ] ---> [ no-op ]
outgoing:
]
Actual behavior
toString()
of the adjacency matrix of the three cfgs:
first:
"x = 5" -> [
ingoing: [ no-op ] ---> [ x = 5 ]
outgoing: [ x = 5 ] ---> [ return x ]
]
"return x" -> [
ingoing: [ x = 5 ] ---> [ return x ]
outgoing:
]
second:
"x = 5" -> [
ingoing:
outgoing: [ x = 5 ] ---> [ y = 50 ]
]
"y = 50" -> [
ingoing: [ x = 5 ] ---> [ y = 50 ]
outgoing: [ y = 50 ] ---> [ no-op ]
]
third:
"b = true" -> [
ingoing:
outgoing: [ b = true ] -T-> [ x = 5 ], [ b = true ] -F-> [ y = 50 ]
]
"y = 50" -> [
ingoing: [ b = true ] -F-> [ y = 50 ]
outgoing: [ y = 50 ] ---> [ no-op ]
]
"x = 5" -> [
ingoing: [ b = true ] -T-> [ x = 5 ]
outgoing: [ x = 5 ] ---> [ no-op ]
]
Description
The Interval
greatest lower bound does not correctly work when the intersection between the interval is empty.
Reproducibility information
On branch master (commit 9eea32e), consider the interval [0,0] and [1,1]. The expected result is bottom, while glbAux
computes 1 as low bound and 0 as high bound. We should check if the new low bound is greater than the new high bound: in this case, we should return bottom.
Description
With the generic graph structure being introduced, a generic visitor pattern should be introduced to uniformly inspect all types of graph (call graphs or cfgs). Checks (syntactic and all future types) would directly benefit from this, but also other constructs might exploit this pattern (e.g. a call graph that wants to find all calling statements in a cfg).
Description
LiSA should be aware of (static) typing coming from the original program, and should then be able to infer runtime types.
Motivation
Typing is essential for determining the appropriate target of instance method calls in object-oriented programs, and are also necessary to determine the result of casting operations. LiSA should not enforce typing where not needed, since some languages are not statically typed and frontends should not be forced to add typing information on them.
Suggested implementation
We would go for a generic Type interface with some sub-interfaces for more specific types. A partial hierarchy would be:
interface Type {
default public boolean isNumeric() {
return this instanceof NumericType;
}
default public NumericType asNumeric() {
return isNumeric() ? (NumericType) this : null;
}
}
interface NumericType extends Type {
boolean isFloatingPoint();
boolean is64Bits();
}
Together with NumericType
, other specific interfaces can be provided (e.g. StringType
). A concrete instance Untyped
should also be created, serving as a default type for non-statically typed programs.
Information about static typing should be added to CFG (return value and parameters) and to each expression instance.
Right now, there is no way to define what are the entry points of the applications. When we run the interprocedural analysis, we take all methods and we consider them as entry points. We should:
@lucaneg I can take care of the second and third point, but I am not sure on how to proceed for the first one.
Description
We need a special ValueEnvironment
that can model inference systems, like type systems or non interference.
Suggested implementation
An extension of ValueEnvironment
that gives access to (i) the last inferred value, and (ii) an execution state/program counter that keeps track of the traversed conditions.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.