Verification of B+-Trees with Interactive Theorem Proving and Shape Analysis

This page describes proofs for the correctness of the insert and delete algorithms on B+Trees. The proofs have three parts:

The toplevel proof of the main recursion using well-founded induction in KIV.
The proofs of all subroutines were done using the shape analysis tool TVLA.
Constraints used in TVLA were verified based on algebraic specifications of the data types using KIV.

The KIV proofs are organized in two projects. A generic library that specifies untyped heaps, explained next. Types, predicates and algorithms specific for B+-Trees are in the second project. Finally, the TVLA proofs are explained.

KIV-Projects always link to an overview which gives the dependency graph of specifications. Clicking on one specification gives its text, and links to a theorem base summary (which lists names of theorems) and a theorem base listing (which shows the full sequent for each theorem). From there individual theorems and the full details of each proof are accessible.

The Heap-Library

The Heap library project specifies generic heaps with predicative types. It defines various predicates (like reachability, tree shape etc.) fo these and proves elementary facts about them. The project contains

Elementary specifications of generic references r (type ref has a =null predicate which specifies "is a null pointer") and generic objects d (type data).
A generic heap is a partial function that maps finitely many non-null references to objects. Only mappings that obey a predicative type discipline can be constructed. The data type of generic heaps is specified as a non-free algebraic data type with two constructors: ∅ creates an empty heap. H[k,d] adds or overwrites key k to point to object d, if the type of key k is compatible with the type of object d (specified as predicate comp). k ∈ H tests whether key k is allocated, i.e. mapped to some object. In this case H[r] returns the object to which k is mapped. comp(r,H[r]) is always true for allocated references. # H gives the number of allocated keys.
Objects are assumed to have some fields which store references. The available fields are assumed to be from a specification of selector names. Selectors are classified into horizontal (hsel), vertical (vsel) and other selectors. This classification is used in the definition of height for tree shaped pointer structures below.
The set of selector names is assumed to be finite: a list of all selectors must exist.
Again a predicative type system is used on selectors and updates: comp(d,sn) and comp(d,sn,r) check, whether selector name sn is applicable on object d, and whether the field type of field sn of object d is compatible with the type of reference r to allow an update. get(d,sn) returns the result of field access in this case, and set(d,sn,r) overwrites field sn of object d with reference r.
Generic heaps allow the definitions of a number of properties. These are used to verify consistency rules used by TVLA:
- Predicate ok specifies, that a heap is consistent, i.e. that it contains no dangling pointers.
- A subset relation on heaps and a domain function that returns the set of allocated nodes can be defined.
- paths through a pointer structure and reachability can be defined.
- Predicate root specifies that all allocated objects are reachable from a root reference.
- Predicate tree specifies tree-shaped structures. For those height can be defined as well as size (number of nodes). Note that only compatible selectors must be followed to compute height and size.
- The height of a tree can be used to define balanced trees, where all leaves are at the same depth.
- Finally, unchanged(H0,H1,r) ensures that modifications done to get from heap H0 to heap H1 affect nodes only, that were reachable from a node r (for tree-shaped structures: which affected the subtree below r only).
- The library finally specifies untyped heaps, which have a single null pointer, no selectors on objects, and no type constraints. These are an the instance , of generic heaps, where all compatibility predicates are true or just disallow allocation of the null pointer. The two parameters ref and data of untyped heaps can be actualized to define instances of heaps which contain one individual type of references and objects.

The B+Tree Project

Project B+Tree specifies the data types required for B+Trees, the algorithms. it contains the proofs of the main theorems as well as proofs for TVLA constraints that are specific to B+Trees. The project is organized as follows:

Data Type node specifies the type of nodes of a B+Tree. A node can be either a branch or a leaf. A branch stores a sequence branch entries, a leaf stores a sequence of leaf entries. All three data types are specified as free data types.
Each of these objects contain typed references to other objects. A node reference (of type nref) points to a node, a branch resp. leaf entry reference (types beref and neref) point to entries. All three reference types are given (together with constants for their null pointers) here.
nodeheap, bentryheap, lentryheap are three instances of the heap specification from the library. Each defines a separate heap for nodes, leaf and branch entries.
To adapt the generic definitions of reachability, paths etc. that the untyped setting of TVLA uses to a typed setting, we also define an instance of generic heaps with paths etc. from the library. This instance needs a the sum type of all references , the sum type of all objects , an enumeration type of the two selector names next and down , a specification of a generic get function with a compatibility function (comp), that checks, when get is applicable. Finally, the three typed heaps are collected in a tuple, and accessor functions are defined. With this tuple, we get an instance of the generic definitions of the library.
Based on these specifications the main algorithms for insert_top#, that adds a key to a B+Tree and delete_top#, that removes a key from a B+Tree can be defined. The algorithms are recursive, and based on various subroutines for splitting or merging leafs and branches, and moving a node to the left or right sibling node (for delete). These subroutines are verified with TVLA. In KIV, their contracts are specified.
All contracts are based on definitions of predicates that specify, when a pointer structure is a well-formed B+Tree (predicate btree). Auxiliary predicates are partly inherited from the generic library (predicate tree that specifies tree shape), partly defined here: sorted (based on sorted1, and ok) specifies the sorting constraints, balanced specifies the constraint that all leaves must be at the same depth, and nzsels specifies the requirement, that all branches and leaves must have at least one entry (a weaker constraint than that each node must have between n and 2n entries). okpath(H,r,k) specifies that the key k should be inserted or deleted from the subtree below reference r.
The B+Tree properties are extended to cover node sizes as well, given by the function nsize and the function nset, that computes horizontally reachable nodes. The rank of the B+Tree is given by a constant N with 2 ≤ N.
Over the specifications of insert and delete, the two main inductive theorems proved in KIV are for the mutually recursive routines insert# and insert_be# and for the mutually recursive routines delete# and delete_be# . Based on these and the definition of an abstraction function, the two main refinement theorems for insert_top# and delete_top# are proved in the toplevel specification .
The project also contains various proofs of constraints that were used in TVLA. Proofs that are independent of the instantiation with B+trees can be found in the specifications named tvla_* and are partitioned into consistency rules that follow from invariants, tvla-updates that proves update formulas, tvla-guards that proves preservation of invariants and pre-/post that proves equivalence of global and local characterizations of invariants.
For example:
- The consistency rules for predicates eqh, olh which serve as the definition for balancedness (balance-sa) for shape analysis are proved here.
- Here is the proof (in the generic setting), that checks the guard for preserving tree shape, when some field in an object is updated. The predicate uforest specifies unshared forests using the TVLA encoding.
- Here and here are the proofs (in the generic setting) that check that the precondition regarding tree shape for a subroutine established in KIV indeed implies the global invariant specified by consistency rules in TVLA and vice versa.
- Here is similar proof for the equivalence of the different characterizations of sorting.
The proofs that are specific for B+trees are in the following specifications
- btree-consistency
- btree-update , that proves preservation of the ≤next predicate, called ok in the kiv specifications and facts about changes to the size of a node
- btree-guards , that proves preservation of sortedness under the respective guards

TVLA-Proofs for subroutines

The TVLA input files consist of

The specification of instrumentation predicates and consistency rules for B+Tree invariants in the files {tree,balance,sorting,size,create}.spec.tvp
The implementation of subroutines is in the files ending with .imp3. They are written in a simple imperative language, the translator is available here . The generated .tvp files are included in the archive. The translator requires python 2.x and depends on the zapps parser module - installation instructions are included.
Proofs for the different properties are executed independently. There are folders tree,balance,sorting,size which contain ready-to-run TVLA input files and a script run.sh that runs all proofs in that folder. Additionally, the folder "create" contains the proof for root_create.
If you wish to generate the TVLA inputs, you can run make SPEC={tree,balance,sorting,size,create} [prog.ps] , where SPEC is the property to prove and the optional [prog.ps] specifies the subroutine to check. Please run make clean whenever you switch between different values of SPEC. The pregenerated files in the subfolders are not overwritten by these commands.
Note: TVLA generates the resulting postscript documents by itself per default. There is, however, a tool bin/fmtdot.py that rewrites the dot files to omit some of the edges and node labels and colors the output. It is called automatically from the Makefile (not by run.sh), so you might want to remove the following lines from your TVLA activation script. if [ -e $1.dt ] then echo "Converting output to PostScript..." dot -Tps -o$1.ps < $1.dt fi Also, increase the maximum memory of the Java virtual machine in this script, e.g. -mx2000m.
The proofs are successful if no messages are generated. You can check this by grep -i message *.dt after running the analysis or by inspecting the generated postscript documents.
actions0.tvp defines statement semantics, actions.tvp defines further actions.
File $SPEC.spec.tvp specifies the predicates and consistency rules.
File $SPEC.include.tvp defines property-specific statement semantics (e.g., update formulas and guards).
File $SPEC.tvs specifies the initial structure. It is refined in code from $SPEC.imp3 which contains initialization and contracts for the respective property
The current version of TVLA is available here . We used version 3 for the verification.

Used basic libraries:

The basic KIV library with basic data types

Here you can find all our publications.

Back to the overview of all KIV projects.
For general informations on KIV and a download site for the whole system refer to this page.

[Imprint] [Data Privacy Statement] [E-Mail]