Giter Site home page Giter Site logo

souffle-lang / souffle Goto Github PK

View Code? Open in Web Editor NEW
865.0 865.0 192.0 528.8 MB

Soufflé is a variant of Datalog for tool designers crafting analyses in Horn clauses. Soufflé synthesizes a native parallel C++ program from a logic specification.

Home Page: http://souffle-lang.github.io/

License: Universal Permissive License v1.0

Shell 0.18% C++ 93.79% C 0.35% Yacc 1.39% CMake 2.87% Java 0.11% Python 0.56% SWIG 0.03% Lex 0.72%
datalog logic-programming souffle static-code-analysis translator

souffle's People

Contributors

0xdomrom avatar 65linesofcode avatar aeflores avatar azreika avatar b-scholz avatar bfairservice-gt avatar brofranks avatar chadgavin avatar davidwzhao avatar dcol97 avatar detljh avatar gueckmooh avatar herbertjordan avatar julienhenry avatar langston-barrett avatar mmcgr avatar ohamel-softwaresecure avatar olligobber avatar pnappa avatar psubotic avatar ptrk8 avatar quentin avatar rdowavic avatar samarch27 avatar sarah-sallinger-sonarsource avatar scyptnex avatar tomaspuverle avatar tytus-metrycki avatar xiaowenhu96 avatar yuli6313 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

souffle's Issues

MAC OS X / CLANG-OpenMP

CLANG++/OpenMP on MAC OS X does not work. It seems that the parallelisation macros in ParalleUtils.h do not work properly for Clang++/OpenMP (NB: for G++/OpenMP on MAC OS X seems to work without issues).

Usefull test script feedback

It is very frustrating if, after setting up a test cases, the test case is failing with the message

/evaluation.at:82: exit code was 127, expected 0

and running all the steps manually does not produce any errors. There is no more detailed log, nor any information that helps you narrowing down what the actual issue is -- which might be as little as an additional newline at the end of an input file or the wrong order of lines in the file.

While not only being super annoying and time consuming, it makes creating test cases a painful task that someone would wish to avoid or circumvent as much as possible.

The test script has to be improved in terms of usability.

Managing CPP

Provide options to disable souffle-wave if boost is not installed and use the standard cpp or SOUFFLECPP instead.

Performance Optimization for Indicator Relations

Example:

R(x) :- A(_), B(x). 

Transfer tuples from relation B to relation R if relation is not empty. Semantically, this can be interpreted as

R(x) :- size(A)>0, B(x). 

Souffle currently does not provide this optimization.

Liberal Identifiers for Soufflé

Synopsis: characters are quite restrictive for relation and variable names

Example:

.decl _a(x:number) 
.decl b?b(x:number) 

would not compile.

For this feature, the code generator has to be changed. In the emitted C++ code, Souffle's identifiers are translated to valid and legible C++ identifiers. The translation must be unique to avoid name clashes with the temporary table names which have the format _temp1_<name>. In addition, replacement strings are necessary for characters that are not permitted for C++ identifiers such as ?.

Build engineering vv issue #67

We need to look at the deb/pkg building process now that were using travis-ci for automatic deployment.
For the time being ive made it:

bootstrap; configure; make check; make deb

This is not ideal:

  • debian packages are built even for non-release commits
  • the compilation of souffle we make check against is not the version we package
    • otherwise, I re-enable auto_test for the deb packages, then we run make check twice
  • both the gcc-compiled and clang-compiled souffles create the same artifact

I think the build process should go:

bootstrap; configure; make deb; dpkg <package>; make distclean; make deb-check

Where it does:

  • compiles souffle once, into a debian package
  • installs the debian-package it just made
  • cleans all traces of the compilation (to prevent build artifacts from assisting the test run)
  • labels with gcc or llvm depending on the build infrastructure
  • runs the tests
    • test runs must be written to allow the user to test against their installed souffle (not the one built locally)
    • This is necessary, otherwise we aren't actually testing the package we built (even deb-helper only tests the source in the package, not that it works as installed)

edit: explicitly tagging issue #67

edit 2:
We should not rely on multiple definitions in source locations/files to define the version.
Github treats tags as versions, so we should use the most recent tag as the version number for a given commit.
The latest tag below your current commit is printed by the command

git describe --tags --abbrev=0

edit 3:

AC_INIT(souffle, m4_esyscmd([git describe --tags --abbrev=0 | tr -d '\n']), [[email protected]])

Optimize symbol table in SymbolTable.h

The symbol table does not use a hash-map and hence is inefficient. In addition, strings are copied rather than referred to. This causes another inefficiency. Hence, when loading large data from files, the symbol table becomes a bottle neck.

Aggregation

Currently, aggregation is sub-optimally translated.

  • multiple occurrences of the same aggregate computation are computed multiple times
  • for complex rules the aggregation is computed in a side-table and the side-table is computed exhaustively; compute the side-table on demand rather exhaustively
  • for min/max aggregate the whole table is traversed to find the min/max; better to let the index do the job.

Fixing the type of relations used for the Code Generator

Souffle has Trie and B-Tree implementations for relations. However, the Souffle compiler decides when to use B-Trees or Tries. In some cases, it would be great to fix the representation. E.g.,

.decl A(x:number, y:number) trie 
.decl B(x:number, y:number) trie 

C-Preprocessor

At the moment we rely on cpp to convert Datalog input programs with macros to programs without macros. The conversion is done in main.cpp. The popen() function invokes the cpp and returns a FILE handle on the converted program.

Unfortunately, various systems use different cpp versions/implementations. As a consequence, the behaviour of the macro expansion is not consistent.

My suggestion is to replace cpp with boost's wave library. This library implements a C-Preprocessor and can be added to the Souffle source code. Souffle has a UPL license. Hence I feel that using BOOST with a BSD style license should work (i.e. in general, I believe we cannot add code that has a GPL style license).

Java API

Would be nice to wrap the c++ api with JNI for scala/java users.

Packaging for distributons

I will be soon trying to push souffle to different Linux distributions. I suppose we can start with Debian and then address others (such as Fudora, ArchLinux, Gentoo etc). There are a couple of minor issues left.

  1. We need a gpg key to sign the package. I can probably create one for myself and use it. Any objections or suggestions?
  2. The source files have a souffle version hard-coded. Presently it is 0.0.0 (or 0.0.0-25-g8be1361 in some test files). So I suppose before doing anything this needs to be changed to 1.0.0. Do we want to maintain the version in the source files? This would need a change before every release/version change.
  3. I suggest that each version becomes a protected branch. Then if there is a fix in a specific version then it is propagated to the master branch (if applicable) but recent developments from master do not go to previous versions normally. Any objections to such a model?
  4. I suppose we need to create a new tag.

Souffle Licensing

Miscelaneous licensing issues for Souffle.
Enhances issue #77 for the next version, and addresses limitations of #99 .
Not all of this needs to be done

  • create a licenses/ directory
  • Move the LICENSE to licenses/souffle-UPL
  • Add full copies of all the licenses for all the souffle dependencies to licenses/<DEPENDENCY>-<LICENSE>
  • Fix some sources currently license to Oracle (without UPL)
    • e.g. src/souffle-complie.in
  • Fix the headers of all souffle sources to contain the license:
Souffle version <VERSION>
Copyright (c) <YEARS...> Oracle and/or it's affiliates.  All rights reserved.
Licensed under the Universal Permissive License v 1.0 as shown at:
 - https://opensource.org/licenses/UPL
 - souffle/licenses/souffle-UPL
  • include above license in licenses/souffle-UPL.header

Managing licenses

I have noticed that the Oracle license uses UTF-8 quotes around “Larger Work”, which caused an issue with some Java files in OSX (can be fixed with -encoding option to javac). I am wondering is these quotes are intentional? Also, I think that for managing licences it would be good to use something like a headache (available via apt in Debian/Ubuntu). Any thoughts?

Correct reporting of line numbers

After the latest additions, I get errors and warnings of the form:
Variable ?lastFld only occurs once in file analysis.dl at line 2966
on a file of only 2000 lines. I'm not sure if this is due to wave (suspect not, since I don't use preprocessor directives in that file, but it could be, since I've run into other wave issues without using preprocessor directives) or if it's an issue with the support of multiple heads of rules.

CXX flag not picked up during deb build

Seems that make deb on travis is not picking up the LLVM compiler, wven when that build is active (look at the logs to confirm)

We need make deb to actually make its internal version of souffle with clang if that is the CXX executable

Improve memory efficiency

Fix-points are generated using the precedence graph. Strongly connected components in the precedence graph become fixed-point computations and a topological order gives a computation order among the fix-points / non-recursive relations. However, there may be a multitude of topological orders and some of the orders are more memory efficient than others assuming that a relation is deleted from memory as soon as it is not needed in future.

The task is to find a greedy algorithm that finds the most beneficial orders ahead of time.

Issues with souffle-compile.in script

  • if additional CXX flags are passed during configuration (e.g., -fsanitize=address) there are test failures.
  • it is a bash script (uses /bin/bash path), this needs to be changed to /bin/sh as on some systems bash may be missing or located elsewhere.
  • improve quoting of variables
  • compile flags (for instance -fopenmp) should be dependent on the system running the script rather than on the system configuring the build, otherwise if souffle is compiled on a system with openmp support is moved to the system without it there will be compile-time failures of translated datalog programs

Three features to speed up development and make it more fun

The following features would ease the development process

  • support for out-of-source builts so that release, debug and gcc/clang versions can co-exist without the need of constant re-configuration and re-building
  • a faster way to run individual integration tests selectively to investigate bugs; running all of them and waiting for 20 minutes until your failing test is processed the first time so that you get the test runner you can then run manually is not much fun
  • run tests in parallel so that developers have not to wait for lots of tests to be processed while (n-1) of their n cores are idle

Get versioning of packages from git tags

The semantic versioning should be picked up automatically via the git tags.

Currently the version number is hard coded in configure.ac and in the changelog of the debian package.

Nic suggested:

AC_INIT(souffle, m4_esyscmd([git describe --tags --abbrev=0 | tr -d '\n']), [[email protected]])

Change Oracle's old Datalog group email

configure.ac has an old oracle group email that is unreachable from outside the Oracle network.

I would suggest to replace it by a google mailing list email address.

Unable to import multiple analysis in same TU

Due to the utilization of global management structures, including the

ProgramFactory *ProgramFactory::base __attribute__ ((weak)) = nullptr;

and

static factory_Sf_varpointsto_example factory;

it is not possible to import more than one analysis into a single C++ translation unit. We should try to get rid of those.

Type definitions in components

At the moment type definitions are not permitted in components, e.g.,

.comp X {
.type A = number
}

would result in a compile error.

Disjunctions in the Body of a Clause

Synopsis: extend the syntax of Souffle such that disjunctions in the body of a clause are allowed.

Example:

A(x) :- B(x); C(x).

The relation A(x) receives either the values from B(x) or C(x). Above relation relation can be rewritten to:

A(x) :- B(x).
A(x) :- C(x). 

As a first step, a simple expansion mechanism in the AST is sufficient that translated bodies with disjunction to DNF. The DNF is obtained by repetitive applications of the laws of distributivity and DeMorgan. In a later optimization step, we would need to consider which loop nests can be fused to optimize the execution.

Unary Operations: Enumeration instead of sublcassing

At the moment we have for every single unary operator an own subclass in the RAM engine.

I propose to switch to an enumeration type as we did for binary operations. This is for sake of consistency and permits an easy extension to add new unary operators.

Compressed fact files

Fact files can be very large, contain a text-based representation of data. This data can be effectively compressed (by a reasonable factor, example data showed a factor of 6). Storing fact files in a compressed format would not only reduce disc usage, it may also increase the speed of loading facts into relations, since less data needs to be retrieved from secondary storage.

Suggestion A: support XY.facts as either being clear-text or gzip-compressed input

Suggestion B: pre-processed facts; we provide an extra tool converting facts into a binary format that can be directly loaded as a relation;

Suggestion A would requires less effort and it reuses common tools while suggestion B may lead to higher performance, in particular when the fact files contain unordered data, causing lots of cache misses during load operations. However, they are not mutual exclusive. Both may be added independently.

Re-design loading and storing of relations

Synopsis: extend mechanism to load and store relations

Relations have qualifiers such as input and output to load relations from tab-separated files and store relations to tab-separated files. Example,

.decl A(x:number, y:number) input
.decl B(x:number) output

Provide an input/output statement that specifies a set of relations to be loaded from files and stored to files, respectively.

Stable release

I think now is the time to think about an actual release of Soufflé. Without a stable release it would be difficult to get into various repositories we have talked about. I suggest that we create a protected branch for the first release and start working on making it bullet-proof: no new functionality or features, just bugfixes and/or maybe some additional tests. If some new functionality comes in, then it can go to the master branch. Of course we need to discuss which features should actually go into the release. Any suggestions?

Broken Build Infrastructure

After merging the last commits of this afternoon (at 15:00 it was still working) the build system is broken on my system. When running ../configure I get the message:

checking for boostlib >= 1.46... yes
checking whether the Boost::System library is available... yes
checking for exit in -lboost_system... no
checking for exit in -lboost_system... (cached) no
configure: error: Could not link against boost_system !

What should I do? The boost system lib is installed on my system and is also tryed to ste the LD_LIBRARY_PATH environment variable. No change.

Interactive query processing

While working with souffle on a big data set, unknowing of its internal structure, it turned out to be quite annoying that it takes very long to load input data, just to run a single query, which will then be modified and processed again.

Dismissing the obvious technical issues, would it be desirable to provide an interactive shell for submitting queries? How should it look like and/or would it have value for a wider range of important use cases?

Boolean type and operators

It would be great to add boolean values in addition to numbers, symbols and strings to the set of basic types.

Currently, when modeling booleans, the corresponding operators have to be modeled via extra relations. This extra overhead and memory access is slow.

What would be needed:

  • new bool primitive type
  • true and false constants
  • and, or, xor, negation, equality, and inequality operations

Options A: full integration

Options B: map it to numbers and encode it as follows:

  • true = 1
  • false = 0
  • neg = 1-x
  • and = x * y
  • or = neg(and(neg(x),neg(y)))
  • equality and inequality remains the same

Option B does not actually require any change since it can be realized by macros in a DL file. For our current work we will need this feature -- the question is whether we should go for the macros or think about a full integration.

Race condition using records

The new test case tests/evaluation/nqueens causes a race condition for a problem size that is equal to 6 or above. The test case uses records for representing the state of the board game.

Parameter Handling of Option -j

When compiled with -o -j<nr> the number of cores <nr> is ignored in the executable and the default number of processors is used instead.

Assertion failure--evaluation related?

I'm getting an assertion failure in the latest version:

souffle: RamTranslator.cpp:146: void souffle::{anonymous}::ValueIndex::addVarReference(const souffle::AstVariable&, const souffle::{anonymous}::Location&): Assertion `(locs.empty() || locs.back() < l) && "Unordered location insertion!"' failed.
Aborted (core dumped)

I have no clue what may be the cause and it will take me hours to construct a redacted example of the code that does the same computation. But here is the single-rule addition that precipitates the failure:

// If some access path is to be rebased, its prefixes are as well.
AccessPathShouldBeRebased( [ ?from, ?rest ], ?from, ?to) :-
AccessPathShouldBeRebased(?ap, ?from, ?to),
?ap = [ ?from, ?suffix ],
?suffix = [ ?rest, ?lastFld ].

Without this, the program performs meaningful computations so I have high confidence. The relation AccessPathShouldBeRebased is not populated anywhere else (though I tried that as well, and the failure persists) nor is it consumed anywhere at this point.

Function Predicates

Synopsis: extend Soufflé with function predicates to detect functional dependency violations for debugging purposes.

A set of function predicates can be defined for a relation. They are enforced at tuple-creation time, i.e., when a new tuple for the relation is generated, the function predicates of a relation form an assertion that is checking for duplicate key entries.

Example:

// define set of weighted edges of a tree
.decl Edge(parent:Node,child:Node, id:Label) 

// The following function predicates ensure that for each child
// there exists at most one parent node, and each edge has a 
// unique label.  
.function Edge(child) -> [Edge(parent), Edge(id)]
.function Edge(id) -> [Edge(parent), Edge(child)]

The implementation for this extension will contain two parts:

  • Extend the AST with function predicates, and provide necessary semantic checks
  • Extend the RAM machine and the code generation with the notion of duplicate checks for relations

Note that the proposed mechanism above is only concerned about functional properties of a single relation. No complex query that involves various other relations can be expressed with the aforementioned construct.

MDM5 Computations for Generated Packages

  • Compute md5 sums for packages after building them and upload them with the released packages.
  • Change the Download page of souffle-lang.github.io so that we can download the releases and the latest packages (making the md5 sums visible).

Efficient Record Access

Example:

A(x) :- B(x), C([x, _]).

A full scan is performed for table C, i.e., all tuples of C are scrutinised to check whether the first element of the record is x. To overcome this performance bottleneck, construct a helper relation that relates a record value with the record id. E.g., we can rewrite the example to

  C_helper(x) :- C([x,_]), B(x).
  A(x) :- B(x), C_helper(x).

Add datatypes for network addresses

We need some datatypes and associated operations for network addresses. I anticipate we will need
some bit-vector logic here or compile it into some underlying data structure e.g., tries.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.