I’m excited to announce I will be joining Facebook next month as part of the Infer team in London. Using FP and static analysis to find bugs across the Facebook estate.
Painless NP-complete problems: an embedded DSL for SMT solving
SMT solvers are automated tools to solve the “Satisfiability Modulo Theories” problem — that is, determining whether a given logical formula can be satisfied. However, unlike SAT solvers, SMT solvers generalize to solving such NP-complete problems that contain not just boolean variables, but more useful types, such as lists, tuples, arrays, integers and reals. And they do this while retaining very good performance (with yearly shootouts for the best solvers). As a result, SMT solvers provide a big hammer for solving many difficult problems automatically.
They are so interesting they’ve been described as a disruptive technology in formal methods. They are more automated than interactive theorem provers, while still providing high-end expressive capability.
And they’re growing in use in industrial applications. In the Cryptol toolset, for example, SMT solvers are used to determine if two implementations of a cryptographic function are functionally identical; Microsoft uses SMT solvers for device driver verification; and they’ve been used for large scale test generation, planning and routing problems.
The challenge now is to make such power easily available to programmers without a formal methods background.
To do this, my approach is to embed the SMT solver into a good programming language, as an embedded DSL. Embedded languages let us hide significant complexity, using the host language’s combining forms to build up complex structures from simple primitives — instead of starting with a complex API in the first place.
If you already know the host language, you can program in the EDSL. If you know Haskell, you can program an SMT solver.
The code is available:
- cabal install yices-painless (depends on an install of Yices from SRI)
- darcs get http://code.haskell.org/~dons/code/yices-painless/
- Haddock documentation (since Hackage won’t generate the docs without Yices headers)
Scripting a Solver
SMT solvers are typically scripted in an imperative fashion: variables are declared, and then facts about those variables are asserted, step by step. The resulting set of declarations is passed to the solver to determine satisifiability (and to find any satisfying values to assign to the variables, if possible). From a programming language perspective, these declarations and assertions look like (typed) (pure) functions with free variables.
The standard SMT-LIB format (accepted by most SMT solvers) looks something like:
(assert (=> (and (=> p q) (=> q r)) (=> p r)))
Which we’ll write in our EDSL in Haskell as:
\p q r -> (p → q && q → r) → (p → r)
Haskell’s type inference can fill in the types in the EDSL, and infix notation also helps clean things up.
Such a proposition may be passed to a solver, like so:
> solve $ \p q r -> (p → q && q → r) → (p → r)
r => False
q => True
p => True
We start with something that looks like a Haskell function (but is actually a function in the EDSL), which can be interpreted by a solver, with Haskell’s language machinery filling in types, and checking safety.
In this post I’ll talk about how I designed yices-painless, an EDSL that embeds SMT scripting support deeply into Haskell (deeply, in that the Haskell type system will check the types of the SMT propositions). The work is based on the language embedding approach taken by Chakravarty et al’s, accelerate arrays language (for GPU programming), using higher-order abstract syntax to support elegant variable binding, with a type-preserving translation to a form that can be executed by the underlying solver.
Designing the EDSL
For a client problem at work, I was interested in trying an SMT solver. The Yices SMT solver from SRI is a high-performance SMT solver with a well-documented C interface, which is a good place to start experimenting. The design challenge was to get from the imperative C interface we have to use, to the typed, pure Haskell embedding that we want to write.
To do this, we build up several layers of abstraction, each one removing redundancy and hiding unnecessary complexity:
- The base layer, Yices itself, a big blob of compiled C or C++ code, exporting 150+ functions.
- Yices.Painless.Base.C: Stitching Yices to Haskell via a full (and straightforward) FFI bindings (150+ Haskell functions, and a few types)
- Yices.Painless.Base: The FFI bindings are just as unsafe as the C layer, so in the next abstraction layer we automate the resource managment, use native Haskell types, and native Haskell abstractions (e.g. lazy lists instead of iteratees). The number of functions exposed is reduced as a result, making the API simpler. This layer is still imperative (all functions are in IO), but relatively useful.
- Yices.Painless.Language: The AST layers are the big step up in expressive power: we design an abstract syntax tree to represent the embedded language, along with a set of primitive functions and types. This pure data structure will be interpreted (or compiled to SMT-LIB format), generating calls to the imperative layer. We thus have very strong separation between the messy, error-prone imperative model, and the correct-by-construction pure AST. We have just simple constants, variables, conditions and a handful of (polymorphic) primitive functions, relying on the language to build up complex expressions from a few simple forms.
- Surface Syntax: Finally, we expose the EDSL language via smart surface syntax, overloaded via Haskell’s type classes: Bits, Num, String. And now programmers need only use Haskell’s native math and bit functions, syntax and types, to build propositions. Type inference and checking works, as well as proper scoping of variables in the propositions. This user-facing layer looks and feels exactly like Haskell, yet it is translated to a form that drives an imperative C library.
The surface layer reveals the power of a good EDSL: hiding unnecessary complexity, increasing expressivity (compare the few functions (and combinators) in the EDSL layer at the top, with the 150+ functions in the C API).
The EDSL reduces the cognitive load on the programmer by an order of magnitude.
The Base Layers
The base layers are the most straightforward part of the system. First, we write a natural FFI binding to every function exposed by the Yices C API. Each C function gets an FFI import; value types are represented by types defined in the Haskell FFI report (e.g. CInt, CString), while Yices types allocated and controlled by Yices are typed in Haskell with empty data types (e.g. YExpr), giving us increased type safety (compared to using void pointers).
The FFI layer is then wrapped in resource handling code, hiding a set of functions from an C++ iterator behind a single function returning a lazy list). The FFI bindings are about 350 lines of code; the wrapper layer another 500. Most of this can be generated.
The Typed AST
Following Chakravarty’s approach, we represent the embedded language with a surface syntax (supporting numerical and scalar polymorphic operations), which builds a nameless AST (variables in the EDSL are represented as Haskell variables). The typed embedding gives us a type checker for the DSL for free. The AST is decorated with type information, so we can resolve overloading (in the EDSL), and do other type-based operations. The EDSL itself is just over 1000 lines of Haskell.
The language is relatively simple: variables (of any scalar type), literals (of any scalar type), conditionals, and a range of built-in functions. Variables are bound/quantified at the top level of the EDSL.
data Exp t where Tag :: (IsScalar t) => Int -> Exp t Const :: (IsScalar t) => t -> Exp t Cond :: Exp Bool -> Exp t -> Exp t -> Exp t PrimApp :: PrimFun (a -> r) -> Exp a -> Exp r data PrimFun t where PrimEq :: ScalarType a -> PrimFun ((a, a) -> Bool) PrimNEq :: ScalarType a -> PrimFun ((a, a) -> Bool) PrimAdd :: NumType a -> PrimFun ((a, a) -> a) PrimSub :: NumType a -> PrimFun ((a, a) -> a) PrimMul :: NumType a -> PrimFun ((a, a) -> a) ...
We can expose these via a type class interface for Exp things, yielding nice surface syntax: (+), (*), (==*), and (/=*). It isn’t possible to write sensible instance of Eq, sadly, as those Haskell type classes wire in return types of Bool (instead of a polymorphic Boolean class). We have to redefine (==) instead.
instance (IsNum t) => Num (Exp t) where (+) = mkAdd (-) = mkSub (*) = mkMul infix 4 ==*, /=*, <*, <=*, >*, >=* -- | Equality lifted into Yices expressions. (==*) :: (IsScalar t) => Exp t -> Exp t -> Exp Bool (==*) = mkEq -- |Inequality lifted into Yices expressions. (/=*) :: (IsScalar t) => Exp t -> Exp t -> Exp Bool (/=*) = mkNEq
The smart constructors build AST nodes, with type information added:
mkAdd :: (IsNum t) => Exp t -> Exp t -> Exp t mkAdd x y = PrimAdd numType `PrimApp` tup2 (x, y) mkMul :: (IsNum t) => Exp t -> Exp t -> Exp t mkMul x y = PrimMul numType `PrimApp` tup2 (x, y) mkEq :: (IsScalar t) => Exp t -> Exp t -> Exp Bool mkEq x y = PrimEq scalarType `PrimApp` tup2 (x, y) mkNEq :: (IsScalar t) => Exp t -> Exp t -> Exp Bool mkNEq x y = PrimNEq scalarType `PrimApp` tup2 (x, y)
Converting to de Bruijn form
This nameless form (via surface syntax) is convenient to write, and solves all the issues with binders for us, but harder to optimize (as propositions are reprsented as opaque functions). So this form is then converted to a form with explicit variables (represented as de Bruijn indices into a variable environment). The approach is identical to the one taken in the accelerate EDSL, with the change that at the top level we support n-ary functions (to represent propositions with n free variables).
Thanks to Trevor Elliott, and for historical reference, we can convert n-ary functions to their de Bruijn form via a type class, yielding a common type for propositions with arbitrary variables:
:: Yices f r => f -> Prop r
Running the Solver
Once we’ve built the AST for the proposition, and replaced all variables with their de Bruijn tags, we can now execute the proposition, recursively calling Yices operations to reconstruct the AST on the Yices side, before asking to solve, finding a satisfying assignment of variables (or not), and returning a model of the bindings.
solve :: (Yices f r) => f -> IO Result solve q = do c <- Yices.mkContext Yices.setTypeChecker True let t = convertYices q -- convert to de Bruijn form (g,e) <- execY c t -- build a Yices expression representing the AST Yices.assert c e -- register it with Yices Yices.check c -- check for satisfiability, returning the result
There’s quite a few bindings to Yices (in particular) in Haskell, but none on Hackage (at the time of writing) done as a deep EDSL (ensuring type safety and lexical scoping). The original inspiration for the work was yices-easy, which exposes a monadic, shallow DSL — and has a good tutorial.
Significant related work is by my colleague at Galois, Levent Erkok, who simultaneously developed “SBV”, a deep EDSL that does support sharing and SMT-LIB access, via heavy type class use. We’ll be sharing code.
The approach to embedding taken is based on the one found in accelerate, which shares similarities with Atkey, Lindley and Yallop‘s approach. The general idea of de Bruijn/HOAS translations was introduced by McBride and McKinna.
At the moment, this is basically a proof of concept. Simple propositions can be solved, but its not ready for large scale problems. A few things need to be done:
- Support for sharing in the AST. The EDSL currently loses sharing information, duplicating terms. This isn’t feasible for large problems.
- Operations that require certain sizes of bit vector should check those sizes (e.g. via Iavor Diatchki’s type level natural support for Haskell).
- Support deriving encodings from new Haskell types into Yices types.
- Support for lists, tuples and arrays of variables (some problems have very large numbers of variables)
- Supporting function types.
- Compile the AST to SMT-LIB statements (instead of Yices FFI calls), making it independent of a particular SMT solver (or solver’s license).
- Use it on some big problems.
I’ll be continuing work on this in 2011, and would welcome patches, feedback and use cases.