## Introduction to Datalog

### Overview

Datalog is a (declarative) logic-based query language, allowing the user to perform recursive queries. It adopts syntax in the style of Prolog. In its pure form, it is based on a decidable fragment of first-order logic (FIL). Here, the universe – the colleciton of elements by which computation can be performed within – is finite, and functors are not permitted. Applications of Datalog include program analysis, security, graph databases, and declarative networking.

## Soufflé: The Language

### Motivation

The syntax of Soufflé is inspired by implementations of Datalog, namely bddbddb and muZ in Z3.
There is no unified standard for the specification of Datalog syntax. Thus, each implementation of Datalog may differ.
A principle goal of the Soufflé project is speed, tailoring program execution to multi-core servers with large amounts of memory.
With this in mind, Soufflé provides software engineering features (components, for example) for large-scale logic-oriented programming.
For practical usage, Soufflé extends Datalog to make it Turing-equivalent through arithmetic functors.
This results in the ability of the programmer to write programs that may never terminate.
An example of non-termination is a program where the fact `A(0).`

and rule `A(i + 1) :- A(i).`

exist without additional constraints. This causes Soufflé to attempt to output an infinite number of relations `A(n)`

where `n >= 0`

.
This is in some way analogous to an infinite while loop in an imperative programming language like C.
However, the increased expressiveness afforded by arithmetic functors is very convenient for programming.

### First example

Soufflé supports UNIX, FreeBSD, Mac OS X and Windows, and can be installed as per these instructions.

Soufflé is invoked by command line in the format `souffle <flags> <program.dl>`

, where `<program.dl>`

is the input program that is to be evaluated.
The input fact directory is set with flag `-F <dir>`

, specifying the input directory for relations. The default directory is the current directory.
The output directory for relations is set with flag `-D <dir>`

, and also defaults to the current directory. Note that if `<dir>`

is set to `-`

, the output will be written to stdout.

#### Transitive closure

A relation R on a set X is transitive if, for all x, y, z in X, whenever x R y and y R z then x R z. In the example below, we consider a directed graph, where edges define relations, and a tuple is in the transitive closure (the `reachable`

relation) if it satisfies either of the two rules below.

```
.decl edge(n: symbol, m: symbol)
edge("a", "b"). /* facts of edge */
edge("b", "c").
edge("c", "b").
edge("c", "d").
.decl reachable (n: symbol, m: symbol)
.output reachable // output relation reachable
reachable(x, y):- edge(x, y). // base rule
reachable(x, z):- edge(x, y), reachable(y, z). // inductive rule
```

Indeed, all elements in `edge`

are in `reachable`

(by the base rule), and the inductive rule captures the transitive property, including tuples like `reachable("a", "d")`

.

- Extend the previous code to add a new relation SCC(x, y), that is defined as: if node x reaches node y and node y reaches node x, then (x, y) is in SCC.
- Check whether a node is cyclic.
- Check whether the graph is acyclic.
- Omit the flag `-D-` - where is the output?

- TODO
- TODO
- TODO
- TODO

#### Same generation example

Given a tree (a directed acyclic graph with a particular root node), the goal is to find which nodes are on the same level.

In the above graph, nodes `b`

and `c`

are on the same level, as are nodes `e`

and `g`

.

```
.decl Parent(n: symbol, m: symbol)
Parent("d", "b"). Parent("e", "b"). Parent("f","c").
Parent("g", "c"). Parent("b", "a"). Parent("c","a").
.decl Person(n: symbol)
Person(x) :- Parent(x, _).
Person(x) :- Parent(_, x).
.decl SameGeneration (n: symbol, m: symbol)
SameGeneration(x, x):- Person(x).
SameGeneration(x, y):- Parent(x,p), SameGeneration(p,q), Parent(y,q).
.output SameGeneration
```

#### Data-flow analysis example

Data-flow analysis (DFA) aims to determine *static* properties of programs.
DFA is a unified theory, providing information for global analysis about a program.
DFA is based on control flow graphs (CFGs), where analysis of the program is derived from properties of nodes and the graph.

An example is a reaching definition - that is, whether a definition of a variable reaches a given point in the program. Note that an assignment of a variable can directly affect the value at another point in the program.
To cope, we consider an unambiguous definition `d`

of a variable `v`

as defined as follows:

```
d: v = <expression>;
```

Then, a definition `d`

of a variable `v`

is considered to reach a statement `u`

if all paths from `d`

to `u`

do not contain any unambiguous definition of `v`

. We note that functions can have side effects to variables even in the absence of explicit unambiguous definitions.

Consider the following control flow graph:

The above example contains unambiguous definitions d_{1} and d_{2} of the variable v. By inspection, we see that B3 can only be reached when v is defined as per d_{2}. The following code outputs all stages of the CFG where v is possible captured by one of these definitions.

```
// define control flow graph
// via the Edge relation
.decl Edge(n: symbol, m: symbol)
Edge("start", "b1").
Edge("b1", "b2").
Edge("b1", "b3").
Edge("b2", "b4").
Edge("b3", "b4").
Edge("b4", "b1").
Edge("b4", "end").
// Generating Definitions
.decl GenDef(n: symbol, d:symbol)
GenDef("b2", "d1").
GenDef("b4", "d2").
// Killing Definitions
.decl KillDef(n: symbol, d:symbol)
KillDef("b4", "d1").
KillDef("b2", "d2").
// Reachable
.decl Reachable(n: symbol, d:symbol)
Reachable(u,d) :- GenDef(u,d).
Reachable(v,d) :- Edge(u,v), Reachable(u,d), !KillDef(u,d).
.output Reachable
```

#### Remarks on Input and C-Preprocessor.

Like in C++, Soufflé uses two types of comments:

- Type 1:
`// this is a remark`

- Type 2:
`/* this is a remark as well */`

C preprocessor processes input from Soufflé. For example:

```
#include "myprog.dl"
#define MYPLUS(a,b) (a+b)
```

### Relations

#### Relation declaration

A relation must be declared in order to be able to be used. We briefly note that the order in which lines of code are written in Soufflé code does not impact on a program’s correctness. The statement:

```
.decl edge(a:symbol, b:symbol)
```

Declares the relation `edge`

with two fields, `a`

and `b`

, each of which are symbols (strings).

#### I/O Directives

Users can specify directives for loading input and writing output to a file:

- The input directive
`.input <relation-name>`

reads from the`<relation-name>.facts`

, which is assumed to be tab-separated by default. - The output directive
`.output <relation-name>`

writes to a file, usually`<relation-name>.csv`

(by default) or stdout (depending on options). - The print relation size directive
`.printsize <relation-name>`

prints the cardinality of a given relation to stdout.

The following example illustrates the aforementioned functionality:

```
.decl A(n: symbol)
.input A // facts are read from file A.facts
.decl B(n: symbol)
B(n) :- A(n).
.decl C(n: symbol)
.output C // output appears in C.csv
C(n) :- B(n).
.decl D(n: symbol)
.printsize D // the number of facts in D is printed
D(n) :- C(n).
```

Relations can be loaded from/stored to:

- Arbitrary CSV files
- Compressed text files
- SQLite3 databases

For example, to store a relation after evaluation into a SQLite3 database, the user can specify something like this:

```
.decl A(a:number, b:number)
.output A(IO=sqlite, dbname="path/to/sqlite3db")
```

More information about the possibilites for I/O can be found in the documentation here.

#### Lack of goals in Soufflé

A goal in Datalog is a logical relation of the form `false <= p`

, where `p`

is a logical relation.
In the case of Soufflé, goals are simulated by output directives.
The advantage is that several independent goals can be evaluated in a single execution of a Soufflé program.
TODO: provenance and query processor and interactive processing

#### Syntactic conveniences:

Rules with multiple heads can be written. This is syntactic sugar to minimise coding effort. Here is a code snippet taking advantage of this feature and the equivalent code without the transformation applied:

Multiple heads | Single head |
---|---|

Similarly, disjunctions in rule bodies are permitted as syntactic sugar, such as in the following example:

Disjunction in rule bodies | No disjunction in rule bodies |
---|---|

### Type system for attributes

Soufflé’s type system is static, like languages like C and unlike languages like Python. The attributes of a relation must be defined before compilation (or interpretation), and types are checked at compile-time. This supports programmers having a clear idea of the definition of relations and their usage. To minimise evaluation time, no dynamic checks are made at runtime.

Types correspond to the subdivision of elements (relations) and sets of elements over the universe. In particular, a type refers to either a subset of a universe or the universe itself. Elements of subsets are not defined explicitly. As we shall see with union types, subsets can be composed of other subsets.

#### Primitive types

Soufflé has two primitive types, namely the symbol type `symbol`

and the number type `number`

.

The symbol type comprises the universe of all strings.
Internally, it is represented by an ordinal number.
The value `ord("hello")`

corresponds to this ordinal number for a given program, in this case for the string “hello”.

The number type comprises the universe of all numbers.

### Arithmetic expressions

Among others functors, Soufflé permits arithmetic functors, which extend traditional Datalog semantics.
Variables in functors must be grounded for use. That is, they may not contain any free variables.
As described earlier, termination may be a problem as in the case of an imperative `while`

loop.

A basic example is as follows:

```
.decl A(n: number)
.output A
A(1).
A(x+1) :- A(x), x < 9.
```

In particular, the second condition in the conjunction on the last line invokes the arithmetic operator <.

The following arithmetic functors are allowed in Soufflé:

- Addition:
`x + y`

- Subtraction:
`x - y`

- Division:
`x / y`

- Multiplication:
`x * y`

- Modulo:
`a % b`

- Power:
`a ^ b`

- Counter:
`$`

- Bit operations:
`x band y`

,`x bor y`

,`x bxor y`

, and`bnot x`

- Logical operations:
`x land y`

,`x lor y`

, and`lnot x`

The following arithmetic constraints are allowed in Soufflé:

- Less than:
`a < b`

- Less than or equal to:
`a <= b`

- Equal to:
`a = b`

- Not equal to:
`a != b`

- Greater than or equal to:
`a >= b`

- Greater than:
`a > b`

In source code, numbers can be written in decimal, binary and hexadecimal. Below is an example illustrating such:

```
.decl A(x:number)
A(4711).
A(0b101).
A(0xaffe).
```

Note that in fact files only decimal numbers are permitted.

#### Number encoding

Numbers can be used as logical values, as in C:

- 0 represents false
- != 0 represents true

As such, they can be used for logical operations: `x land y`

, `x lor y`

and `lnot x`

An example is as follows:

```
.decl A(x:number)
.output A
A(0 lor 1).
```

#### The functor $

The functor `$`

issues a new number every time it is evaluated.
However, it is not permitted in recursive relations. It can be used to create unique numbers (acting as identifiers) for symbols, such as in the following example:

```
.decl A(x: symbol)
A(“a”). A(“b”). A(“c”). A(“d”).
.decl B(x: symbol, y: number)
.output B
B(x, $) :- A(x).
```

#### String functors

The usage and characterisation of the string (symbol) functors supported by Soufflé can be found here.

### Aggregation

Aggregation in Soufflé refers to the use of particular functors to summarise information about queries.
Aggregates can only be applied on stable relations in Soufflé.
Types of aggregates include counting, finding the minimum/maximum of a set of numbers, and summation.
Generally in Soufflé, information cannot flow from the sub-goal (the argument/s of the aggregate functor) to the outer scope.
For example, if one wishes to find the minimum value of the relation `Cost(x)`

, they cannot find a specific value of `x`

that minimises `Cost`

.
Indeed, such a value of `x`

may not be unique.

#### Counting

The counting functor allows the user to count the set size of a sub-goal.
The syntax is `count{<sub-goal>}`

.
The following example outputs the number of “blue” cars - that is, the number of elements in `Car`

with second argument “blue”:

```
.decl Car(name: symbol, colour:symbol)
Car(“Audi”, ”blue”).
Car(“VW”, “red”).
Car(“BMW”, “blue”).
.decl BlueCarCount(x: number)
BlueCarCount(c) :- c = count:{Car(_,”blue”)}.
.output BlueCarCount
```

#### Maximum

The `max`

functor outputs the maximum value of a set.
The syntax is `max <var>:{<sub-goal(<var>)>}`

.
An example:

```
.decl A(n:number)
A(1). A(10). A(100).
.decl MaxA(x: number)
MaxA(y) :- y = max x:{A(x)}.
.output MaxA
```

#### Minimum & Sum

Similarly to the above, these functors compute the minimum/sum of a sub-goal.
The `min`

syntax is `min <var>:{<sub-goal(<var>)>}`

, and the `sum`

syntax is `sum <var>:{<sub-goal>(<var>)>}`

.

#### Witnesses are not permitted

Consider the following example which will not compile:

```
.decl A(n:number, w:symbol)
A(1, ”a”). A(10, ”b”). A(100, ”c”).
.decl MaxA(x: number,w:symbol)
MaxA(y, w) :- y = max x:{A(x, w)}.
```

Here, the user wishes to produce a witness for the maximum value of the first argument of `A`

.
This causes semantic and performance issues.
Semantically, it is not clear what it means to find a witness for the `count`

and `sum`

functors.
In terms of performance, there is overhead.
As such, this kind of code is forbidden by the type-checker.

### Records

Relations are two-dimensional structures in Datalog. With a large code-base and/or a complex problem, it can be convenient to consider relations with more complex structure (recursion/hierarchy, etc.). Records provide such an abstraction, breaking out of the flat world of Datalog at the price of performance, due to an additional table lookup required when invoking records. Their semantics are comparable to those in Pascal/C. In future, unions of records will be allowed to simulate polymorphism. The syntax of a Record Type definition is as follows:

```
.type <name> = [ <name_1>: <type_1>, ..., <name_k>: <type_k> ]
```

Currently, there is no output facility, but it is planned that Soufflé 2.0 will support this.
In the meantime, this can be simulated by mapping records to relations, where `.output`

can then be called.
The following example creates a record corresponding to a pair of numbers, in which the `Flatten`

relation can be then be printed.

```
// Pair of numbers
.type Pair = [a:number, b:number]
.decl A(p: Pair) // declare a set of pairs
A([1,2]).
A([3,4]).
A([4,5]).
.decl Flatten(a:number, b:number) output
Flatten(a,b) :- A([a,b]).
```

#### Overview of record internals

Each record type has a hidden type relation. There, elements of a record are translated to a number. During evaluation, if a record does not exist, it is created on the fly. Considering the example as above:

```
.type Pair = [a: number, b: number]
.decl A(p: Pair)
A([1,2]).
A([3,4]).
A([4,5]).
```

Internally, these records are represented and stored as per this diagram:

#### Recursive records

Recursively-defined records are permitted in Soufflé.
The recursion is terminated by the existence of a `nil`

record.
Consider the following:

```
.type IntList = [next: IntList, x: number]
.decl L(l: IntList)
L([nil,10]).
L([r1,x+10]) :- L(r1), r1=[r2,x], x < 30.
.decl Flatten(x: number)
Flatten(x) :- L([_,x]).
.output Flatten
```

Here, an `IntList`

contains a reference to the next element, which is an `IntList`

itself.
Internally, this is represented as follows:

As before, the `Flatten`

relation allows for output.

The semantics of recursive records are tricky.
Records are relations and sets of recursive elements, which monotonically grow in size over time.
They are equivalent to relations with the use of the `nil`

record.
In the future, polymorphism may be possible at the expense of execution time and space.

### Components

Components are defined and showcased in this article.

An interesting exercise for the reader is to develop a library of components for graphs. The library should be extensible and contain various items of functionality, including:

- A component for a directed graph
- A component for an undirected graph
- A component that checks for a cycle in a graph
- A component that checks whether a graph is acyclic.

### Performance and profiling facilities

To gain performance in the execution of Datalog progrsm, we can:

- Compile, rather than interpret code
- Schedule queries, through user annotations or automation
- Find faster queries that perform the same desired task
- Find faster data models

To achieve these aims, profiling a given program is paramount. An overview of Soufflé’s profiler, which provides a textual and graphical UI, can be found here.

To compile and immediately execute a Soufflé program, the option `-c`

can be used, e.g. `souffle -c test.dl`

executes the binary produced from the `.cpp`

file produced by `test.dl`

. To just generate the executable, the option `-o`

can be used.

To achieve high performance, the programmer can manually re-order the atoms in the body of a rule. The auto-scheduler can be disabled for a rule by the strict qualifier, with syntax `<rule>. .strict`

. One can also provide their own query schedule, which for a given rule has syntax `<rule>. .plan{ <#version> : (idx_1, ..., idx_k) }`

.

An exercise for the reader is to execute the following code, varying the choice of the last three lines each time, and benchmarking using the profiler:

```
.decl Edge(x:number, y:number)
Edge(1,2).
Edge(500,1).
Edge(i+1,i+2) :- Edge(i,i+1), i < 499.
.decl Path(x:number, y:number)
.printsize Path
Path(x,y) :- Edge(x,y).
// Path(x,z) :- Path(x,y), Path(y,z). .strict
// Path(x,z) :- Path(x,y), Edge(y,z). .strict
// Path(x,z) :- Edge(x,y), Path(y,z). .strict
```