Getting started with Datalog & Rust for program analysis

Prolog, Datalog and somewhat related CodeQL aka Semmle have been gaining attention in the program analysis / bug hunting / variant detection community. While those “logical” languages are really weird if one has never used them before, they are excellent at implementing various program analysis tasks.

This tutorial will give you an introduction to Datalog and get you started with a setup to experiment with Differentiable Datalog (DDlog) & Rust. While this post only discusses some snippets of the code we use, the whole code is available at github, and can be used for experiments.

As an example, we define our own little Intermediate Language (IL) and implement a simple backward symbolic execution engine on top of it.

Introduction to Datalog

Datalog is a very high-level programming language. However, Datalog is more like SQL than Python. Similar to SQL, Datalog operates on a database. However, in contrast to SQL, Datalog’s database stores and infers “facts”. The main use case for Datalog is to take a database of such “facts” and infer additional interesting facts via given rules. Datalog and related programming languages are commonly called “logic programming”. Their main idea is to iteratively prove new facts from the current knowledge base. To this end, Datalog features a powerful query language to obtain information on the current state of the database. Datalog programs consist of rules that state “The new fact A is true if B, C and D are already known to be true”, notated as A :- B,C,D . The :- reads as “holds if”. We can use arbitrary queries in place of A,B and C. The Datalog engine will then iteratively apply all rules to obtain a full set of facts.

While somewhat unintuitive at first, this approach maps very nicely to many problems encountered during program analysis. It offers two big advantages over other approaches: First of all, we have a powerful query language. Secondly, fixed point iterations are built into the language itself.

For example, we could imagine a database containing information on dataflow between various variables in the program and a rule: Warning("dangerous use of malloc", path) :- FunctionCall("input",[], input), FunctionCall("malloc",[size],_), DataFlow(input, size, path).

This rule reads: Add a Warning with text “dangerous use of malloc” and the set of lines path IF there is a call to the function “input” without arguments that returns a variable input AND there is a call to the function “malloc” that has some first argument size and an irrelevant return value AND there is a dataflow between the return value input and the argument size along the lines path.

Here we assume that each function call in the program is stored in a fact FunctionCall(name, arguments, return_value), and that there is a factDataFlow(src, dst, path) for each possible data flow in the program. Using this rule we would obtain a list of all paths that connect the return value of calls to the function “input” to the first argument of “malloc”.

While this example has many problems (in particular around scalability), it serves to illustrate how easily we can represent, query and recombine analysis results in Datalog.

To illustrate the advantage of easily expressing fixed point iterations, lets consider the problem of disassembly: To properly disassemble a binary, we need to resolve indirect branches. To resolve indirect branches, we need to know about possible values. To know about possible values, we need to know what pointers point to and which pointers are aliasing and of course the disassembly itself. Consequently, we have multiple algorithms all requiring the results of the other algorithms. This can be very neatly expressed as a fixed point iteration: “rerun each algorithm on all results, until no algorithm produces new information”. And each individual algorithm is also easily expressed as a fixed point iteration. For example recursive disassembly can be neatly expressed as: For a given set of entry points, disassemble to the next branch instruction, resolve the targets and add them to the set of entry points until no new entry points are added.

Datalog will repeatedly apply all available rules until no new facts are derived. This allows us to express such fixed point iterations as naturally as we would express a simple for loop in C. It should be noted that storing all facts can sometimes cause a significant amount memory consumption, and pure Datalog has limited abilities to discard intermediate values.

Getting started with DDlog

DDlog is a pragmatic, high performance Datalog implementation that compiles Datalog programs to a Rust crate that one can use to perform inference on a set of facts provided by other Rust code. Similarly, we can easily call into Rust code to use functions that are not readily expressed in Datalog or would have unwanted performance implications. Lastly, as the name implies, DDlog allows to update the database incrementally at runtime. We can add new facts, and DDlog only recomputes the updates based on the changes. If we need to add apply small changes (e.g. add user provided information), this drastically increases the performance.

To get started with DDlog, we first download release v0.28.0 from their GitHub and unpack it somewhere. Then, we create a project folder and set the following environment variables to point to the installation folder:

export PATH=$PATH:../../source/ddlog/bin
export DDLOG_HOME=../../source/ddlog

Next we create a simple DDlog file. You can use VSCode with the souffle extension for syntax highlighting. If the highlighting doesn’t show up, make sure the file is interpreted as Datalog file.

Now it’s time to build a simple Datalog program (bse.dl). For starters, we will compute the set of all reachable nodes in a graph.

// In this program we implement transitive closure on a graph
// As input, we provide an edge relation that connects nodes.
// Each edge is represented as a fact Edge(src,dst), where 
// src is the ID of the source node, and dst is the ID of the 
// target node.
input relation Edge(i: u32, j: u32)

// The program computes the whole set of pairs of nodes where
// there is a path going from i to j
output relation Reaches(i: u32, j: u32)

// Any node is able to reach its successors
Reaches(i,j) :- Edge(i,j).
// i Reaches j if there is an Edge from i to some other node that
// already reaches j
Reaches(i,j) :- Edge(i, p), Reaches(p, j).

First, we define the input and output relations (think: Tables in SQL). Input relations are supplied by the environment, output relations can be queried after the inference has finished. Note how in contrast to other Datalog engines, DDlog uses strong types, including fixed bit width integers. Then we define rules that describe how to compute reachability from the known facts about edges. While this is a common and very intuitive example, you probably wouldn’t want to use this for a real program, as the resulting set of facts grows with O(n²).

Lastly, we create a test input bse.dat:

start;
# 1 -> 2 -> 3 --> 4
#            \
#             `-> 5
insert Edge(1,2),
insert Edge(2,3),
insert Edge(3,4),
insert Edge(3,5);

commit;
dump Reaches;

First, we start a transaction, than we insert facts for four edges. Then we commit the transaction which starts the execution, and lastly we dump all facts that we inferred on reachability.

To compile the Datalog program into a Rust crate and run the test cli, we execute:

ddlog -i bse.dl &&
(cd bse_ddlog && cargo build --release) &&
./bse_ddlog/target/release/bse_cli < bse.dat

# Prints:
# Reaches{.i = 1, .j = 2}
# Reaches{.i = 1, .j = 3}
# ....
# Reaches{.i = 3, .j = 4}
# Reaches{.i = 3, .j = 5}

The first step compiles the Datalog file into a Rust library, the second step compiles the library and the last step runs the test cli with the test input.

Unfortunately, there is almost no material out there to learn the basics of DDlog. However, you can find an excellent tutorial on learning the basics of Datalog at learndatalogtoday.org. We will now continue exploring some of the more advanced topics required to perform program analysis by integrating DDLog with Rust.

Data Structures in DDLog

Pure Datalog requires you to store all data in the rules itself. On the other hand, in most modern Datalogs various types of “atomic” data are available. DDlog has a particular powerful set of data types. We already saw u32 (numeric types are available in most Datalogs), but there is also Vec<T>, Set<T> and Map<T>. Additionally, we can design our own data types, including Rust style enums (aka Algebraic Data Types). This is extremely useful to express more complex ideas. For example, we can express a set of registers as such an enum similar to C enums:

typedef Reg64T = Rax | Rbx | Rcx 

However, we are also able to express more complex structs, such as instructions in an intermediate Language:

typedef InstrT = Malloc{arg: Expr64T}
                 | Cmp{cop1: Expr64T, cop2: Expr64T}
                 | Jump{target: u32, cond: Expr1T}
                 | Set{dst: Reg64T, src: Expr64T}
                 | Store{ptr: Expr64T, a2: Expr64T}

This structure represents a (very simple) intermediate representation that we will use to build a little backwards symbolic execution engine. It contains a conditional branch (Jump), a compare instruction, a Store instruction (No load, as it won’t be needed in this example) and a general purpose Set instruction that evaluates an expression and stores it in a register.

To represent expressions, many Datalog based analysis use individual facts with IDs as references to their subexpressions. While in most Datalog implementations this is the only way to store expressions, DDLog allows us to create a datatype for expressions as well:

typedef Expr64T = Reg64{r: Reg64T} 
    | Const64{v: u32} 
    | BinOp64{ op: Op64T, a1: Ref<Expr64T>, a2: Ref<Expr64T>} 
    | Alloc{a1: Ref<Expr64T>} 

We need to use Ref<Expr64T> to embed a nested sub-expression. As all these data types compile directly to Rust types, recursive structfields without a layer of indirection are forbidden. Otherwise, recursive structures would have an infinite size.

To make the output of DDlog more readable, we can define pretty print functions for our custom types.

function to_string(a: Ref<Expr64T>) : string{
    deref(a).to_string()
}
function to_string(a: Expr64T) : string{
    match(a) {
        Reg64{r} -> r.to_string(),
        Const64{v} -> v.to_string(),
        BinOp64{op, a1, a2} -> "(${a1.to_string()}${op.to_string()}${a2.to_string()})",
        Alloc{a1} ->  "Alloc(${a1.to_string()})"
    }
}

Once we have defined most of the basic types that we are going to use, we can start building the actual analysis: First of all, we need to define our input facts. In our case we will have each intermediate representation instruction and it’s bytecode address stored in one fact. We define two additional inferred relations Fallthrough and JumpsTo that represent edges in the control flow of the program. Lastly, we add the relation that we actually care about: EventOn. It represents: “if all formulas in state_t hold, we also know something interesting is going to happen later on”. From an initial set of candidate positions (such as memory accesses), we will later trace backwards to find preconditions that guarantee that the memory access will fail.

    input relation Instr(i: u32, op: InstrT)

    output relation Fallthrough(i: u32, op: InstrT, j: u32)
    output relation JumpsTo(i: u32, op: InstrT, j: u32)
    output relation EventOn(i: u32, s: state_t)

Constructing the control flow graph from our IL is rather simple:

    Fallthrough(i, op, j) :- Instr(i, op), var j=i+1, Instr(j,_).
    JumpsTo(i, Jump{j,c}, j) :- Instr(i,Jump{j,c}), Instr(j, _).

Here we can nicely see how powerful Datalog can be to transform data in interesting ways. The relation Fallthrough generally holds if there are instructions both at offset i and i+1. The relation JumpsTo holds if the instruction at position i is a jump to position j. We also store the instruction at the given position in the CFG relation to ease look-ups later on.

Programming in DDlog

Now for the most interesting part: We will build a simple backward symbolic execution engine. It can be used to start from a candidate point of failure, and go backwards to find statements earlier in the program that already guarantee that the failure will be triggered eventually. Like normal forward symbolic execution, this approach is used to generate testcases, and can — in certain cases — avoid creating very large state formulas, as we don’t have to track everything, but only changes affecting the outcome we care about.

We start with a simple rule that states: If we access a pointer at instruction i, it has to be in bounds, otherwise and interesting event (aka bug) occurs .

    EventOn(i, State{[ref_new(Boundcheck{ref_new(dst)})]} ) :- 
        Instr(i, MovPtr{dst,_}).

Next we propagate any events upwards through Set instructions:

    EventOn(i, new) :- 
    Fallthrough(i, Set{dst, expr}, j ),   
        EventOn(j,old), 
        var new = old.repl(Subst64{ref_new(Reg64{dst}), ref_new(expr)}).

Here we state that we can prove that at instruction i an event will occur if new is satisfied if:

  • The instruction at address i is a Set instruction where j is the next instruction.

  • We know that after instruction j, we will observe an interesting event if the constraint old is satisfied,

  • and new is obtained from old by replacing all occurrences of the target register dst by the right hand side of the Set instruction (expr).

For example, lets assume, we know that something interesting will happen if rax==1337 at instruction j, and instruction i is Set{rax, rbx+1336} then we obtain the new constraints rbx+1336==1337. This constraint guarantees that, after executing the Set instruction we will eventually reach the event we care about.

Note that we use a function old.repl that performs the replacement in the expressions that we use. This function is implemented in Rust for performance reasons.

Calling Rust Code from DDlog

Now the astute reader may have noticed how we use old.repl() in various places to substitute a register by a given value. So far, we ignored how to implement this function. With common Datalog implementations, substituting sub expressions would be somewhat annoying to implement, as we do not want to store all intermediate steps in the database. However, DDlog is a rather pragmatic implementation and allows us to freely call Rust functions. The types declared in our Datalog program directly map to Rust types, making this process butter smooth. First we create a file bse.rs and implement our functions. As you can see in this example, Expr64T directly maps to a Rust enum with the same layout. Similarly, Vec<T>, Set<T> and Map<K,V> map to DDlogs types. The function maybe_substitute_expr_64 is a bit more convoluted than one might expect at first. This is not due to using DDlog data structures. Instead, it is returning None if no modification was performed, so that we can reuse the old expressions instead of creating new ones, reducing the memory required to store our symbolic states significantly.

use crate::ddlog_std::Vec;
use crate::ddlog_std::Ref;

pub fn sub_state(val: & Vec<Ref<Expr1T>>, 
                 subst: &ExprSubst) 
                 -> Vec<Ref<Expr1T>>  {
    let res = val.iter()
                .map(|v| maybe_substitute_expr_1(v, subst)
                .unwrap_or(v.clone()) ).collect::<std::vec::Vec<_>>();
    return Vec{x: res}
}

pub fn maybe_substitute_expr_64(
                                val: & Ref<Expr64T>, 
                                subst: &ExprSubst) 
                                -> Option<Ref<Expr64T>>  {
    if let ExprSubst::Subst64{pat64, repl64} = subst {
        if pat64 == val {
            return Some(repl64.clone())
        }
    }
    match &**val {
        Expr64T::Reg64{..} => None,
        Expr64T::Const64{..} => None,
        Expr64T::Alloc{a1} => maybe_substitute_expr_64(&a1, subst)
                                    .map(|a1| Ref::from( Expr64T::Alloc{a1} ) ),
        Expr64T::BinOp64{op, a1, a2} => {
            let na1 = maybe_substitute_expr_64(&a1, subst);
            let na2 = maybe_substitute_expr_64(&a2, subst);
            if na1.is_none() && na2.is_none() {
                None
            } else {
                let op = Expr64T::BinOp64{
                    op: op.clone(), 
                    a1: na1.unwrap_or(a1.clone()), 
                    a2: na2.unwrap_or(a2.clone())
                };
                Some(std_Ref::from(op))
            }
        },
    }
}

We can than declare our Rust functions in the Datalog program (bse.dl) by using the extern keyword and use it in our Datalog program.

extern function sub_state(val: Vec<Ref<Expr1T>>, sub: ExprSubst) : Vec<Ref<Expr1T>>

Calling DDlog from Rust

Now that we have a working example of a simple DDlog program analysis pass, we would like to embed our DDLog program into an existing rust project. For this purpose, we create a new library crate:

cargo init --lib bse_ddlog_lib

Then we add the following dependencies to the crates Cargo.toml

[dependencies]
differential_datalog = {path = "../bse_ddlog/bse_ddlog/differential_datalog"}
bse = {path = "../bse_ddlog/bse_ddlog/"}
types = {path = "../bse_ddlog/bse_ddlog/types"}
value = {path = "../bse_ddlog/bse_ddlog/value"}

In lib.rs, we need to import various types that were auto-generated by DDlog. Most of these types are defined by the engine itself. The second chunk of imports includes the types that are specific to our DDLog program.

// import types for the engine
use bse_ddlog::api::HDDlog; //The DDLog Database engine itself
use differential_datalog::DDlog; // A helper trait
use differential_datalog::DeltaMap; // A trait representing the changes resulting from a given update.
use differential_datalog::ddval::DDValue; // A generic DLog value type
use differential_datalog::ddval::DDValConvert; //Another helper trair
use differential_datalog::program::RelId; // Numeric relations id
use differential_datalog::program::Update; // A type representing updates to the database
use differential_datalog::record::Record; // A type representing individual facts

// import all types defined by the datalog program itself
use types::*;
use types::ddlog_std::Ref;

// import some helpers to access the database
use value::relid2name; //maps the relation id to a string 
use value::Relations; //Enum of available relations 
use value::Value; // wrapper type for the input/output relations

Using these types, we can create a struct in lib.rs that allows us to create and interact with the DDlog database engine. Note that the API is still under heavy development and contains some vestigial features such as the unused callback cb that will be removed soon.

pub struct DDLogBSE{
    hddlog: HDDlog,
}

impl DDLogBSE{
    pub fn new()  -> Result<DDLogBSE, String> {
        // callback is useless  
        fn cb(_rel: usize, _rec: &Record, _w: isize) {}
        let number_threads = 1;
        let track_complet_snapshot = false;
        let (hddlog, _init_state) = HDDlog::run(number_threads, track_complet_snapshot, cb)?;

        return Ok(Self{hddlog});
    }
}

After creating such an object we need to insert instructions into the database. In the test case we create a vector of instructions, and add it to the database:

#[cfg(test)]
mod tests {
    use super::*;
    #[test]
    fn it_works() {

        let malloc = InstrT::Malloc{arg: Expr64T::Const64{v: 10} };
        // Some more instructions ...
        let store = InstrT::MovPtr{ptr: Expr64T::Reg64{r: Reg64T::Rax},a2: Expr64T::Const64{v: 0}};

        let mut bse = DDLogBSE::new().unwrap();
        let mut delta = bse.add_code(vec!(malloc, cmp, jmp, set, store)).unwrap();
        //DDLogBSE::dump_delta(&delta);
        DDLogBSE::enum_events(&mut delta);
        assert!(false);
    }
}

Lastly, we need to create a function that inserts the instructions into the database itself. To do this, we need to convert our vector of instructions into a vector of Update::Insert Objects that contain the whole fact that should be inserted. In our case, we want to add Instr facts, so we need to get the relation id of that type and turn our instructions in DDValues. Lastly, we apply the update, and return the delta for further inspection.

    pub fn add_code(&mut self, code: Vec<InstrT>) 
                    -> Result<DeltaMap<DDValue>, String> {
        self.hddlog.transaction_start()?;
        let updates = code.into_iter().enumerate().map(|(i,inst)| 
            Update::Insert {
                relid: Relations::Instr as RelId,
                v: Value::Instr(types::Instr{i: i as u32, op: inst}).into_ddvalue(),
            }
        ).collect::<Vec<_>>();
        self.hddlog.apply_valupdates(updates.into_iter())?;   
        let delta = self.hddlog.transaction_commit_dump_changes()?;
        return Ok(delta); 
    }

Wrapping up

While this is still far from building your own CodeQL, it gives you a solid base to get started. What remains to be done is to actually load your target program into the database in some form, and to build a reasonable analysis on top of it (aka: Everything). However, implementing this kind of analysis can be very straight forward, as soon as we have the basics in place. After considering generic code for creating the bytecode and substitution of expressions, building a simple backwards symbolic execution engine only took only around 10 lines of code. If we wanted to build some taint static tracking engine, we would only have to add a few more lines.

Recently, it became apparent that having the ability to experiment and customize static analysis passes easily is a gigantic advantage. The success of recent publications such as Sys: a Static/Symbolic Tool for Finding Good Bugs in Good (Browser) Code and tools like Semmle demonstrate how adding domain knowledge makes static analysis significantly more powerful. Using Datalog, adding a special cases for complex metaprogramming specific to your codebase can be as simple as adding two lines. Datalog is a programming language that allows you to implement program analysis at a much higher level than other languages. In combination with a few libraries to parse programs or bytecode and a bunch of helper functions such as expression substitution, I believe this will be an incredible powerful tool in the toolbox of everyone working on analyzing code. And I hope this example will help you to get started exploring the options that DDlog provides to program analysts.