A Specialized B-Tree for Concurrent Datalog Evaluation

This is the A Specialized B-Tree for Concurrent Datalog Evaluation. It was published in the Symposium on Principles and Practice of Parallel Programming, February 16–20, 2019, Washington, DC, USA (PPoPP’19).


Modern Datalog engines are employed in industrial applications such as graph-databases, networks, and static program analysis. To cope with vast amount of data, Datalog engines must employ parallel execution strategies, for which specialized concurrent data structures are of paramount importance.

In this paper, we introduce a specialized B-tree data structure for an open-source Datalog compiler written in C++. Our data structure has been specialized for Datalog workloads running on shared-memory multi-core computers. It features (1) an optimistic locking protocol for scalability, (2) is highly tuned, and (3) uses the notion of ``hints’’ to re-use the results of previously performed tree traversals to exploit data ordering properties exhibited by Datalog evaluation. In parallel micro-benchmarks, the new data structure achieves up to 59x higher performance than state-of-the-art industrial standards, while integrated into a Datalog engine it accounts for 3x higher, overall system performance.


Brie: A Specialized Trie for Concurrent Datalog

This is the Brie: A Specialized Trie for Concurrent Datalog. It was published in the 10th International Workshop on Programming Models and Applications for Multicores and Manycores workshop (PMAM’19).


Modern Datalog engines are employed in industrial applications such as graph databases, networks, and static program analysis. To cope with the vast amount of data in these applications, Datalog engines must employ specialized parallel data structures. In this paper, we introduce the Brie, a specialized data structure for high-density relations storing large data volumes. It effectively compresses dense data in a lock-free fashion and obtains up to 15× higher performance in parallel insertion benchmarks compared to state-of-the-art alternatives. Furthermore, when integrated into a Datalog engine running an industrial points-to analysis, runtime improves by a factor of 4× with a compression ratio of up to 3.6× are obtained.


Soufflé 1.5.1 Released

Hello! We have a new release of Soufflé for you. Key features are

  1. User defined functors (b-scholz)
  2. Rewritten code generation (taipan-snake)
  3. Improved Provenance via generated data structures (taipan-snake)
  4. Profile cpu & memory usage (mmcgr)
  5. Enhanced profiler graphs (mmcgr)
  6. Productivity measures in profiler (mmcgr)
  7. General profiler enhancements (mmcgr)
  8. Added support for non-x86 platforms (mmcgr)
  9. Improve compilation speed (mmcgr)
  10. Bash autocompletion (mmcgr)
  11. Extended verbose mode (azreika/mmcgr)
  12. Reorder atoms to optimise evaluation speed (azreika)
  13. Profile-guided atom reordering (azreika)
  14. Various AST optimisations (azreika)

Equivalence Relations in Soufflé

Patrick Nappa has implemented a fast & low overhead method of representing equivalence relations in Soufflé. An Honours thesis has been submitted, available here, where corresponding slides can be found here.

The improved version is currently undergoing maintenance pending inclusion into the main branch of Soufflé, however an older (albeit slightly slower) implementation currently exists in Soufflé. The following demo demonstrates the special relation eqrel_fast which is equivalent to the eqrel_slow relation, yet the former carries an upper bounded quadratic speedup & memory improvement over the latter.

.decl eqrel_fast(x : number, y : number) eqrel
eqrel_fast(a,b) :- rel1(a), rel2(b).

.decl eqrel_slow(x : number, y : number)
eqrel_slow(a,b) :- rel1(a), rel2(b).
eqrel_slow(a,a) :- rel1(a).             // reflexivity
eqrel_slow(b,a) :- eqrel_slow(a,b).     // symmetry
eqrel_slow(a,c) :- eqrel_slow(a,b), eqrel_slow(b, c).   // transitivity

Soufflé 1.4.0 Released

Hello! We have a new release of Soufflé for you. Key features are

  1. improved parallel performance (HerbertJordan)
  2. improved operators hints in btree (HerbertJordan)
  3. extended progress logging in verbose mode (mmcgr)
  4. added to_string and to_number functors (b-scholz)
  5. added live profiler (mmcgr)
  6. changed profile output format to json (mmcgr)
  7. profile output is determinate for easier parsing (mmcgr)
  8. profiler tracks memory and cpu usage during execution (mmcgr)
  9. profiler tracks load/store times (mmcgr)
  10. multiple input directives may be declared in one step (mmcgr)
  11. simplify scripting by handling missing files (mmcgr)
  12. MPI engine (lyndonhenry)

Soufflé 1.3.1 Released

Hello! We have a new release of Soufflé for you. Key features are

  1. Added more feedback in verbose mode (azreika/mmcgr)
  2. Fixed and enhanced 64 bit domain support (mmcgr/cfallin)
  3. Improved debug report (azreika)
  4. Enhanced profiler (atom frequency, bug fixes) (mmcgr)
  5. Hashmap support (HerbertJordan,65linesofcode)
  6. Enhanced provenance tools (taipan-snake)
  7. Performance enhancements (azreika)
  8. Fix parallel profiler logging (mmcgr)

High performance datastructures in Soufflé

Jonathan Chung has extended Soufflé to utilize various data-structures. Souffle supports different default-structures including hash-table, red-black trees (STL), HTM for B-Trees, etc. Jonathan present his finding on data-structures in Souffle at the University of Sydney’s Programming Languages and Compiler research group meeting. Slides for the talk can be found here.

Inlining in Soufflé

Soufflé now supports the ability to manually select relations to inline. A talk was presented on the process at the University of Sydney’s Programming Languages and Compiler research group meeting. Slides for the talk can be found here.

Provenance in Soufflé

David Zhao has extended Soufflé to allow provenance tracking. A produced tuple can be explained via a proof tree. An Honours thesis has been submitted on this topic, available here. Slides for the corresponding talk can be found here

Soufflé 1.2.0 Released

Hello! We have the third open-source release of Soufflé for you. Key features are

  1. Interactive provenance browser (taipan-snake)
  2. Compilation to subprograms for separate execution (lyndonhenry)
  3. Magic Sets (azreika)
  4. Sum aggregator fixed
  5. gcc7 optimisations
  6. File IO optimisations
  7. Pragmas to enable argument options (e.g., .pragma "-mtest")
  8. Various minor bug fixes and usability enhancements

PLDI Tutorial

We ran our second tutorial Engineering Static Analyzers with Soufflé on the 23rd of June, 2017, at PLDI’17. We had approximately 25 participants. The tutorial was given by Bernhard Scholz, Pavle Subotic, and Alexander Jordan. You find some of the slides here.

An efficient tunable selective points-to analysis for large codebases

This is the paper that uses Souffle. It was published in SOAP@PLDI’17.


Points-to analysis is a fundamental static program analysis technique for tools including compilers and bug-checkers. Although object-based context sensitivity is known to improve precision of points-to analysis, scaling it for large Java codebases remains a challenge.

In this work, we develop a tunable, client-independent, object-sensitive points-to analysis framework where heap cloning is applied selectively. This approach is aimed at large codebases where standard analysis is typically expensive. Our design includes a pre-analysis that determines program points that contribute to the cost of an object-sensitive points-to analysis. A subsequent analysis then determines the context depth for each allocation site. While our framework can run standalone, it is also possible to tune it – the user of the framework can use the knowledge of the codebase being analysed to influence the selection of expensive program points as well as the process to differentiate the required context-depth. Overall, the approach determines where the cloning is beneficial and where the cloning is unlikely to be beneficial.

We have implemented our approach using Souffle (a Datalog compiler) and an extension of the DOOP framework. Our experiments on large programs, including OpenJDK, show that our technique is efficient and precise. For the OpenJDK, our analysis reduces 27% of runtime and 18% of memory usage in comparison with 2O1H points-to analysis for a negligible loss of precision, while for Jython from the DaCapo benchmark suite, the same analysis reduces 91% of runtime for no loss of precision.

Soufflé 1.1.0 Released

Hello! We have the second open-source release of Soufflé for you. Key features are

  1. Configurable I/O System with more options and support for new language extensions (mmcgr).
  2. Support of equivalence relation using union/find data-structures (pnappa) New profiling tool rewritten in C++ with HTML/Javascript output (DominicRomanowski).
  3. Replacing the Boost C-prepocessor wave by mcpp (pnappa)
  4. Adding ternary functors (b-scholz)
  5. JNI interface (psubotic)
  6. Memory optimisations (lyndonhenry)
  7. Numerous bug fixes.

First conference tutorial

We ran our first tutorial on the 5th of December, 2016, at APSEC’16. We had 11 participants. The tutorial was run by Bernhard Scholz, Raghavendra K. R., and Alexander Jordan. You find some of the slides here.

Apsec'16 Tutorial

Soufflé: On Synthesis of Program Analyzers

This is the tool paper about Souffle. It was published in CAV’16.


Soufflé is an open source programming framework that performs static program analysis expressed in Datalog on very large code bases, including points-to analysis on OpenJDK7 (1.4M program variables, 350K objects, 160K methods) in under a minute. Soufflé is being successfully used for Java security analyses at Oracle Labs due to (1) its high-performance, (2) support for rapid program analysis development, and (3) customizability. Soufflé incorporates the highly flexible Datalog-based program analysis paradigm while exhibiting performance results that are on-par with manually developed state-of-the-art tools. In this tool paper, we introduce the Soufflé architecture, usage and demonstrate its applicability for large-scale code analysis on the OpenJDK7 library as a use case.

Soufflé 1.0.0 Released

Hello! We have the first open-source release of Soufflé for you. Key features are

  1. Continuous Integration / Travis support for Souffle

  2. Automatic packaging for Debian and MAC OS X platform.

  3. Multiple Header clauses, and disjunctions in bodies of clauses.

  4. BOOST’s C-preprocessor called wave adapted so that MAC OS X port is functional without a GCC installation.

  5. Nullary relations (i.e., relations with no attributes become attributes).

  6. Liberal identifiers in Souffle programs, e.g., A(?x,?y) :- B(?y,?x).

  7. Enable type declarations in Souffle’s components.

  8. Added bitwise and logical functors, and binary, and hexadecimal constants.

  9. Configuration files for Doxygen documentation.

  10. Numerous bug fixes in all parts of the system.

Soufflé is open-source!

Oracle Labs made Soufflé open-source on the 8th of March, 2016. Soufflé is a Datalog-like language that compiles programs to efficient parallel C++ code. Currently, Soufflé is used as a domain-specific language for static program analysis, over large code bases with millions of lines of code.

Features of Soufflé

  • Efficient translation to parallel C++ of Datalog programs
  • Extended semantics of Safe Datalog, e.g., permitting unbounded recursions with numbers
  • Simple component model for Datalog specifications
  • Recursively defined record types for tuples
  • Static type system for attributes

On fast large-scale program analysis in Datalog

This is the system paper about the architecture of Souffle. It was published in CC’16.


Designing and crafting a static program analysis is challenging due to the complexity of the task at hand. Among the challenges are modelling the semantics of the input language, finding suitable abstractions for the analysis, and handwriting efficient code for the analysis in a traditional imperative language such as C++. Hence, the development of static program analysis tools is costly in terms of development time and resources for real world languages. To overcome, or at least alleviate the costs of developing a static program analysis, Datalog has been proposed as a domain specific language (DSL). With Datalog, a designer expresses a static program analysis in the form of a logical specification. While a domain specific language approach aids in the ease of development of program analyses, it is commonly accepted that such an approach has worse runtime performance than handcrafted static analysis tools. In this work, we introduce a new program synthesis methodology for Datalog specifications to produce highly efficient monolithic C++ analyzers. The synthesis technique requires the re-interpretation of the semi-naïve evaluation as a scaffolding for translation using partial evaluation. To achieve high-performance, we employ staged- compilation techniques and specialize the underlying relational data structures for a given Datalog specification. Experimentation on benchmarks for large-scale program analysis validates the superior performance of our approach over available Datalog tools and demonstrates our competitiveness with state-of-the-art handcrafted tools.