souffle-lang / souffle Goto Github PK

Soufflé is a variant of Datalog for tool designers crafting analyses in Horn clauses. Soufflé synthesizes a native parallel C++ program from a logic specification.

Home Page: http://souffle-lang.github.io/

License: Universal Permissive License v1.0

Shell 0.18% C++ 93.79% C 0.35% Yacc 1.39% CMake 2.87% Java 0.11% Python 0.56% SWIG 0.03% Lex 0.72%

datalog logic-programming souffle static-code-analysis translator

souffle's People

Contributors

Stargazers

Watchers

Forkers

herbertjordan b-scholz cotsog sinthetix juliawong seanmcl copumpkin codetaro limasierra psubotic2 0xdomrom pnappa mmcgr alexjordan tiros-dev ktrianta gdmclellan azreika davidwzhao littlelightlittlefire lememta loadedice brofranks 65linesofcode greedy cfallin iltae95 superr4y chan0415 chubbymaggie ricelang ramyshahin tigerqiu712 jiewmurphy shamrock-frost andrewbuss refaqtor olligobber knowscieng alexdura maweki aeflores qxg2 gfour taneb tomasklapka ubuntu-repo ssallinger mrnbayoh nystrom bpd0018 langston-barrett detljh brianfairservice chadgavin leyu2048 yuli6313 bpot7558 nubkel kjezek luc-tielen longjohncoder tytus-metrycki dwuid marklturner ohamel-softwaresecure kylestanfield emessiha rdowavic samarch27 moqingyan joshuakarp bragaigor qcorporation reshift-projects binrats iamsiddhant1005 escapingbug qunqunqun zeta1999 hyunsukimsokcho thomas-seed clouddatalab yejeen aziemchawdhary-gs yihozhang lwy0518 wdduncan laszlo-kiss c0de3 zenhumany yannramusat wenhuizhang julienhenry arvindsastha paulineb30 f0829 csabahruska patricklam happinessyeah

souffle's Issues

Alternatives for OpenMP in Mac OS X

Default OSX compilers do not provide openmp support. A potential solution is to grab an openmp implementation (for instance llvm openmp library) and build it before Soufflé is compiled.

MAC OS X / CLANG-OpenMP

CLANG++/OpenMP on MAC OS X does not work. It seems that the parallelisation macros in ParalleUtils.h do not work properly for Clang++/OpenMP (NB: for G++/OpenMP on MAC OS X seems to work without issues).

Provide a code-generation only flag to integrate souffle in external tool build system.

For embedding souffle analysis into tools, only source code needs to be generated. An external build script then build the analysis as part of the enclosing application.

Suggestion: add an option -g to the compiler, which stops the compiler after generating the output code.

Usefull test script feedback

It is very frustrating if, after setting up a test cases, the test case is failing with the message

/evaluation.at:82: exit code was 127, expected 0

and running all the steps manually does not produce any errors. There is no more detailed log, nor any information that helps you narrowing down what the actual issue is -- which might be as little as an additional newline at the end of an input file or the wrong order of lines in the file.

While not only being super annoying and time consuming, it makes creating test cases a painful task that someone would wish to avoid or circumvent as much as possible.

The test script has to be improved in terms of usability.

Renaming wave to souffle-wave

Renaming wave to souffle-wave to avoid name clashes (there might be other tools around named wave).

Managing CPP

Provide options to disable souffle-wave if boost is not installed and use the standard cpp or SOUFFLECPP instead.

Allow repeated, identical declarations of relations or use-before-declare.

When structuring specifications in multiple files, there are situations in which declarations of the same relations would be allowed to be repeated several times -- just the have the declaration and definition of relations in the same file.

Performance Optimization for Indicator Relations

Example:

R(x) :- A(_), B(x).

Transfer tuples from relation B to relation R if relation is not empty. Semantically, this can be interpreted as

R(x) :- size(A)>0, B(x).

Souffle currently does not provide this optimization.

Liberal Identifiers for Soufflé

Synopsis: characters are quite restrictive for relation and variable names

Example:

.decl _a(x:number) 
.decl b?b(x:number)

would not compile.

For this feature, the code generator has to be changed. In the emitted C++ code, Souffle's identifiers are translated to valid and legible C++ identifiers. The translation must be unique to avoid name clashes with the temporary table names which have the format _temp1_<name>. In addition, replacement strings are necessary for characters that are not permitted for C++ identifiers such as ?.

Build engineering vv issue #67

We need to look at the deb/pkg building process now that were using travis-ci for automatic deployment.
For the time being ive made it:

bootstrap; configure; make check; make deb

This is not ideal:

debian packages are built even for non-release commits
the compilation of souffle we make check against is not the version we package
- otherwise, I re-enable auto_test for the deb packages, then we run make check twice
both the gcc-compiled and clang-compiled souffles create the same artifact

I think the build process should go:

bootstrap; configure; make deb; dpkg <package>; make distclean; make deb-check

Where it does:

compiles souffle once, into a debian package
installs the debian-package it just made
cleans all traces of the compilation (to prevent build artifacts from assisting the test run)
labels with gcc or llvm depending on the build infrastructure
runs the tests
- test runs must be written to allow the user to test against their installed souffle (not the one built locally)
- This is necessary, otherwise we aren't actually testing the package we built (even deb-helper only tests the source in the package, not that it works as installed)

edit: explicitly tagging issue #67

edit 2:
We should not rely on multiple definitions in source locations/files to define the version.
Github treats tags as versions, so we should use the most recent tag as the version number for a given commit.
The latest tag below your current commit is printed by the command

git describe --tags --abbrev=0

edit 3:

AC_INIT(souffle, m4_esyscmd([git describe --tags --abbrev=0 | tr -d '\n']), [[email protected]])

Optimize symbol table in SymbolTable.h

The symbol table does not use a hash-map and hence is inefficient. In addition, strings are copied rather than referred to. This causes another inefficiency. Hence, when loading large data from files, the symbol table becomes a bottle neck.

Aggregation

Currently, aggregation is sub-optimally translated.

multiple occurrences of the same aggregate computation are computed multiple times
for complex rules the aggregation is computed in a side-table and the side-table is computed exhaustively; compute the side-table on demand rather exhaustively
for min/max aggregate the whole table is traversed to find the min/max; better to let the index do the job.

Fixing the type of relations used for the Code Generator

Souffle has Trie and B-Tree implementations for relations. However, the Souffle compiler decides when to use B-Trees or Tries. In some cases, it would be great to fix the representation. E.g.,

.decl A(x:number, y:number) trie 
.decl B(x:number, y:number) trie

C-Preprocessor

At the moment we rely on cpp to convert Datalog input programs with macros to programs without macros. The conversion is done in main.cpp. The popen() function invokes the cpp and returns a FILE handle on the converted program.

Unfortunately, various systems use different cpp versions/implementations. As a consequence, the behaviour of the macro expansion is not consistent.

My suggestion is to replace cpp with boost's wave library. This library implements a C-Preprocessor and can be added to the Souffle source code. Souffle has a UPL license. Hence I feel that using BOOST with a BSD style license should work (i.e. in general, I believe we cannot add code that has a GPL style license).

Implicit declarations of relations

would be nice to be able to infer types for IDBs, i.e., so one does not have to declare an IDB table if it isn't an output table.

Improve schedule algorithm for executing a rule

The query scheduling algorithm does not deliver high quality schedules. A better algorithm has to be deployed - perhaps even auto-tuning to improve the quality of the schedules.

Java API

Would be nice to wrap the c++ api with JNI for scala/java users.

Travis builds the distribution twice

Looking at the travis logs, seems we build souffle twice.

This is about 10 minutes every pull-request we can save

Packaging for distributons

I will be soon trying to push souffle to different Linux distributions. I suppose we can start with Debian and then address others (such as Fudora, ArchLinux, Gentoo etc). There are a couple of minor issues left.

We need a gpg key to sign the package. I can probably create one for myself and use it. Any objections or suggestions?
The source files have a souffle version hard-coded. Presently it is 0.0.0 (or 0.0.0-25-g8be1361 in some test files). So I suppose before doing anything this needs to be changed to 1.0.0. Do we want to maintain the version in the source files? This would need a change before every release/version change.
I suggest that each version becomes a protected branch. Then if there is a fix in a specific version then it is propagated to the master branch (if applicable) but recent developments from master do not go to previous versions normally. Any objections to such a model?
I suppose we need to create a new tag.

Enclose generated code into a user-provided namespace to avoid name collisions

The code generated by souffle should be moved into a user-definable namespace (including all the header includes) to avoid name collisions with symbols defined in a hoste application.

Souffle Licensing

Miscelaneous licensing issues for Souffle.
Enhances issue #77 for the next version, and addresses limitations of #99 .
Not all of this needs to be done

create a licenses/ directory
Move the LICENSE to licenses/souffle-UPL
Add full copies of all the licenses for all the souffle dependencies to licenses/<DEPENDENCY>-<LICENSE>
Fix some sources currently license to Oracle (without UPL)
- e.g. src/souffle-complie.in
Fix the headers of all souffle sources to contain the license:

Souffle version <VERSION>
Copyright (c) <YEARS...> Oracle and/or it's affiliates.  All rights reserved.
Licensed under the Universal Permissive License v 1.0 as shown at:
 - https://opensource.org/licenses/UPL
 - souffle/licenses/souffle-UPL

include above license in licenses/souffle-UPL.header

Managing licenses

I have noticed that the Oracle license uses UTF-8 quotes around “Larger Work”, which caused an issue with some Java files in OSX (can be fixed with -encoding option to javac). I am wondering is these quotes are intentional? Also, I think that for managing licences it would be good to use something like a headache (available via apt in Debian/Ubuntu). Any thoughts?

Correct reporting of line numbers

After the latest additions, I get errors and warnings of the form:
Variable ?lastFld only occurs once in file analysis.dl at line 2966
on a file of only 2000 lines. I'm not sure if this is due to wave (suspect not, since I don't use preprocessor directives in that file, but it could be, since I've run into other wave issues without using preprocessor directives) or if it's an issue with the support of multiple heads of rules.

CXX flag not picked up during deb build

Seems that make deb on travis is not picking up the LLVM compiler, wven when that build is active (look at the logs to confirm)

We need make deb to actually make its internal version of souffle with clang if that is the CXX executable

Disjoint namespaces for Types, Relations, and Components

Types and relations may have the same name which is confusing in real code:

.type A
.decl A(r:A)
.comp A {

}

A name clash should be reported.

Improve memory efficiency

Fix-points are generated using the precedence graph. Strongly connected components in the precedence graph become fixed-point computations and a topological order gives a computation order among the fix-points / non-recursive relations. However, there may be a multitude of topological orders and some of the orders are more memory efficient than others assuming that a relation is deleted from memory as soon as it is not needed in future.

The task is to find a greedy algorithm that finds the most beneficial orders ahead of time.

Issues with souffle-compile.in script

if additional CXX flags are passed during configuration (e.g., -fsanitize=address) there are test failures.
it is a bash script (uses /bin/bash path), this needs to be changed to /bin/sh as on some systems bash may be missing or located elsewhere.
improve quoting of variables
compile flags (for instance -fopenmp) should be dependent on the system running the script rather than on the system configuring the build, otherwise if souffle is compiled on a system with openmp support is moved to the system without it there will be compile-time failures of translated datalog programs

Eliminating Identical Rules / Relations

Redundant rules are not eliminated at compile time,

A(x,y) :- B(x,z), C(z,y).

A(x,y) :- B(x,z), C(z,y).

and produces two rules in the RAM program.

Three features to speed up development and make it more fun

The following features would ease the development process

support for out-of-source builts so that release, debug and gcc/clang versions can co-exist without the need of constant re-configuration and re-building
a faster way to run individual integration tests selectively to investigate bugs; running all of them and waiting for 20 minutes until your failing test is processed the first time so that you get the test runner you can then run manually is not much fun
run tests in parallel so that developers have not to wait for lots of tests to be processed while (n-1) of their n cores are idle

Get versioning of packages from git tags

The semantic versioning should be picked up automatically via the git tags.

Currently the version number is hard coded in configure.ac and in the changelog of the debian package.

Nic suggested:

AC_INIT(souffle, m4_esyscmd([git describe --tags --abbrev=0 | tr -d '\n']), [[email protected]])

Change Oracle's old Datalog group email

configure.ac has an old oracle group email that is unreachable from outside the Oracle network.

I would suggest to replace it by a google mailing list email address.

Unable to import multiple analysis in same TU

Due to the utilization of global management structures, including the

ProgramFactory *ProgramFactory::base __attribute__ ((weak)) = nullptr;

and

static factory_Sf_varpointsto_example factory;

it is not possible to import more than one analysis into a single C++ translation unit. We should try to get rid of those.

Type definitions in components

At the moment type definitions are not permitted in components, e.g.,

.comp X {
.type A = number
}

would result in a compile error.

Disjunctions in the Body of a Clause

Synopsis: extend the syntax of Souffle such that disjunctions in the body of a clause are allowed.

Example:

A(x) :- B(x); C(x).

The relation A(x) receives either the values from B(x) or C(x). Above relation relation can be rewritten to:

A(x) :- B(x).
A(x) :- C(x).

As a first step, a simple expansion mechanism in the AST is sufficient that translated bodies with disjunction to DNF. The DNF is obtained by repetitive applications of the laws of distributivity and DeMorgan. In a later optimization step, we would need to consider which loop nests can be fused to optimize the execution.

Unary Operations: Enumeration instead of sublcassing

At the moment we have for every single unary operator an own subclass in the RAM engine.

I propose to switch to an enumeration type as we did for binary operations. This is for sake of consistency and permits an easy extension to add new unary operators.

Compressed fact files

Fact files can be very large, contain a text-based representation of data. This data can be effectively compressed (by a reasonable factor, example data showed a factor of 6). Storing fact files in a compressed format would not only reduce disc usage, it may also increase the speed of loading facts into relations, since less data needs to be retrieved from secondary storage.

Suggestion A: support XY.facts as either being clear-text or gzip-compressed input

Suggestion B: pre-processed facts; we provide an extra tool converting facts into a binary format that can be directly loaded as a relation;

Suggestion A would requires less effort and it reuses common tools while suggestion B may lead to higher performance, in particular when the fact files contain unordered data, causing lots of cache misses during load operations. However, they are not mutual exclusive. Both may be added independently.

Re-design loading and storing of relations

Synopsis: extend mechanism to load and store relations

Relations have qualifiers such as input and output to load relations from tab-separated files and store relations to tab-separated files. Example,

.decl A(x:number, y:number) input
.decl B(x:number) output

Provide an input/output statement that specifies a set of relations to be loaded from files and stored to files, respectively.

Stable release

I think now is the time to think about an actual release of Soufflé. Without a stable release it would be difficult to get into various repositories we have talked about. I suggest that we create a protected branch for the first release and start working on making it bullet-proof: no new functionality or features, just bugfixes and/or maybe some additional tests. If some new functionality comes in, then it can go to the master branch. Of course we need to discuss which features should actually go into the release. Any suggestions?

Broken Build Infrastructure

After merging the last commits of this afternoon (at 15:00 it was still working) the build system is broken on my system. When running ../configure I get the message:

checking for boostlib >= 1.46... yes
checking whether the Boost::System library is available... yes
checking for exit in -lboost_system... no
checking for exit in -lboost_system... (cached) no
configure: error: Could not link against boost_system !

What should I do? The boost system lib is installed on my system and is also tryed to ste the LD_LIBRARY_PATH environment variable. No change.

Logical Propositions

boolean expressions in single clauses to make more succinct code

Interactive query processing

While working with souffle on a big data set, unknowing of its internal structure, it turned out to be quite annoying that it takes very long to load input data, just to run a single query, which will then be modified and processed again.

Dismissing the obvious technical issues, would it be desirable to provide an interactive shell for submitting queries? How should it look like and/or would it have value for a wider range of important use cases?

Aggregate Semantic Check Missing

Program crashes while writing data to a file.

segfault_count_sccs.zip

Boolean type and operators

It would be great to add boolean values in addition to numbers, symbols and strings to the set of basic types.

Currently, when modeling booleans, the corresponding operators have to be modeled via extra relations. This extra overhead and memory access is slow.

What would be needed:

new bool primitive type
true and false constants
and, or, xor, negation, equality, and inequality operations

Options A: full integration

Options B: map it to numbers and encode it as follows:

true = 1
false = 0
neg = 1-x
and = x * y
or = neg(and(neg(x),neg(y)))
equality and inequality remains the same

Option B does not actually require any change since it can be realized by macros in a DL file. For our current work we will need this feature -- the question is whether we should go for the macros or think about a full integration.

Race condition using records

The new test case tests/evaluation/nqueens causes a race condition for a problem size that is equal to 6 or above. The test case uses records for representing the state of the board game.

Parameter Handling of Option -j

When compiled with -o -j<nr> the number of cores <nr> is ignored in the executable and the default number of processors is used instead.

Assertion failure--evaluation related?

I'm getting an assertion failure in the latest version:

souffle: RamTranslator.cpp:146: void souffle::{anonymous}::ValueIndex::addVarReference(const souffle::AstVariable&, const souffle::{anonymous}::Location&): Assertion `(locs.empty() || locs.back() < l) && "Unordered location insertion!"' failed.
Aborted (core dumped)

I have no clue what may be the cause and it will take me hours to construct a redacted example of the code that does the same computation. But here is the single-rule addition that precipitates the failure:

// If some access path is to be rebased, its prefixes are as well.
AccessPathShouldBeRebased( [ ?from, ?rest ], ?from, ?to) :-
AccessPathShouldBeRebased(?ap, ?from, ?to),
?ap = [ ?from, ?suffix ],
?suffix = [ ?rest, ?lastFld ].

Without this, the program performs meaningful computations so I have high confidence. The relation AccessPathShouldBeRebased is not populated anywhere else (though I tried that as well, and the failure persists) nor is it consumed anywhere at this point.

Function Predicates

Synopsis: extend Soufflé with function predicates to detect functional dependency violations for debugging purposes.

A set of function predicates can be defined for a relation. They are enforced at tuple-creation time, i.e., when a new tuple for the relation is generated, the function predicates of a relation form an assertion that is checking for duplicate key entries.

Example:

// define set of weighted edges of a tree
.decl Edge(parent:Node,child:Node, id:Label) 

// The following function predicates ensure that for each child
// there exists at most one parent node, and each edge has a 
// unique label.  
.function Edge(child) -> [Edge(parent), Edge(id)]
.function Edge(id) -> [Edge(parent), Edge(child)]

The implementation for this extension will contain two parts:

Extend the AST with function predicates, and provide necessary semantic checks
Extend the RAM machine and the code generation with the notion of duplicate checks for relations

Note that the proposed mechanism above is only concerned about functional properties of a single relation. No complex query that involves various other relations can be expressed with the aforementioned construct.

MDM5 Computations for Generated Packages

Compute md5 sums for packages after building them and upload them with the released packages.
Change the Download page of souffle-lang.github.io so that we can download the releases and the latest packages (making the md5 sums visible).

Efficient Record Access

Example:

A(x) :- B(x), C([x, _]).

A full scan is performed for table C, i.e., all tuples of C are scrutinised to check whether the first element of the record is x. To overcome this performance bottleneck, construct a helper relation that relates a record value with the record id. E.g., we can rewrite the example to

  C_helper(x) :- C([x,_]), B(x).
  A(x) :- B(x), C_helper(x).

Add datatypes for network addresses

We need some datatypes and associated operations for network addresses. I anticipate we will need
some bit-vector logic here or compile it into some underlying data structure e.g., tries.

souffle-lang / souffle Goto Github PK

souffle's People

Contributors

Stargazers

Watchers

Forkers

souffle's Issues

Recommend Projects

Recommend Topics

Recommend Org