Haskell is a strict language

A common misconception among non-Haskellers is that compiled Haskell pays an ongoing penalty for supporting laziness by default. The idea is that in a lazy language every expression is suspended as a heap-allocated, garbage-collected thunk, even if the value is used immediately.

That sounds scarily expensive.

In the real world, however, Haskell is as often as not a strict language. Compiled Haskell undergoes strictness analysis, which separates the code that genuinely needs to be lazy from the code that can simply be evaluated eagerly. The end result is purely functional code, written in a lazy style, that gets compiled down to raw, strict assembly, without thunks, heap allocation or GC getting in the way of all-out speed.

Like type inference, this means we usually don’t have to manually annotate code with its strictness (or laziness) properties; instead, the compiler figures it out from how we use the values.
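For instance (a small sketch of my own, not a claim about any particular GHC version), a hand-written counting loop needs no annotations at all; with optimisations on, strictness analysis can see the accumulator is always demanded and keep it in a register:

```haskell
-- No strictness annotations anywhere: as far as the source is
-- concerned, the accumulator is passed along lazily.
sumTo :: Int -> Int
sumTo n = go 0 1
  where
    go acc i
      | i > n     = acc
      | otherwise = go (acc + i) (i + 1)

main :: IO ()
main = print (sumTo 1000000)
-- Compiled with -O2, the accumulator is demanded on every path,
-- so GHC unboxes it: no thunks pile up behind (acc + i).
```

Without the analysis, each `acc + i` would be suspended as a thunk and only forced at the end; with it, the loop runs strictly.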

Here’s a nice example, from a new arrays library, based on the larger data parallel arrays project.

We’ll write some code to generate 3 arrays, add them together, then bit shift all the elements, and finally, take the sum.

In a language like C, this would mean first allocating a bunch of arrays, writing loops to fill them, then separate loops to do the rest of the pipeline. If we were looking for speed, we’d perhaps try to manually fuse some of the loop bodies together, avoiding extra traversals and deforesting the intermediate data structures.

In Haskell, though, we tend to build up loops by composing separate combinators, as if building up a pipeline in the shell. For list operations, these then get fused away (using build/foldr fusion), meaning only a single traversal need be done.
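As a small illustration (my sketch, not from the libraries discussed here), a list pipeline with three logical traversals compiles, under -O2, to one accumulating loop, because `sum`, `map`, `filter` and the enumeration all participate in GHC’s fusion rules:

```haskell
-- Three traversals in the source: enumerate, filter, map, then sum.
-- After fusion, there is just one loop and no intermediate lists.
pipeline :: Int -> Int
pipeline n = sum (map (* 2) (filter even [1 .. n]))

main :: IO ()
main = print (pipeline 10)  -- 2*(2+4+6+8+10) = 60
```

The shell-pipeline analogy is apt: each combinator is written independently, and the compiler splices their bodies together.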

Similarly, for array programming, we can use stream fusion to eliminate traversals over arrays, making array loops written in a compositional style as efficient as manually fused loops — assuming all the mystical laziness doesn’t get in the way.
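The core idea behind stream fusion, roughly as presented in the stream fusion papers, is to rewrite each traversal as a non-recursive step function over an explicit stream, so general-purpose optimisations can merge adjacent steps. A minimal sketch of the machinery:

```haskell
{-# LANGUAGE ExistentialQuantification #-}

-- A stream is a step function plus a seed state.
data Stream a = forall s. Stream (s -> Step s a) s

data Step s a
    = Done         -- no more elements
    | Yield a s    -- an element, and the next state
    | Skip s       -- no element this step (e.g. filtered out)

-- map over streams is non-recursive: a single case analysis,
-- which the inliner can merge with neighbouring steps.
mapS :: (a -> b) -> Stream a -> Stream b
mapS f (Stream next s0) = Stream next' s0
  where
    next' s = case next s of
        Done       -> Done
        Skip s'    -> Skip s'
        Yield x s' -> Yield (f x) s'

-- Consumers such as folds hold the one recursive loop.
foldlS :: (b -> a -> b) -> b -> Stream a -> b
foldlS f z (Stream next s0) = go z s0
  where
    go acc s = case next s of
        Done       -> acc
        Skip s'    -> go acc s'
        Yield x s' -> go (f acc x) s'
```

Because only the consumer recurses, a composed pipeline inlines into one loop over the original state, with no intermediate arrays or lists.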

The data parallel and uvector array libraries are implemented using stream fusion for strict arrays, so we can try this out.

Here’s an encoding of the task (take the sum of the shifts of the sums of the arrays), written compositionally:

    import Data.Array.Vector
    import Data.Bits

    main = print . sumU
                 . mapU ((2::Int) `shift`)
                 . zipWith3U plus3
                        (replicateU 100000000 (2::Int))
                        (replicateU 100000001 (3::Int)) $
                        (replicateU 100000002 (5::Int))
        where
            plus3 x y z = x + y + z

The nice aspect of this is the reduction in cognitive load: you build the loop up piecewise, and let the compiler do the (aggressive) thinking for you. As an exercise, try writing a C program that allocates these 3 arrays, and then uses a single loop to initialise them, add their elements, bit shift them, and sum the result.

And how well does it work? Our program above undergoes aggressive inlining by GHC (safe, since the code is pure), and because we’ve used combinators rather than manual loops, GHC can see how to combine them, yielding this single loop at the end:

    fold :: Int# -> Int# -> Int# -> Int# -> Int#
    fold x y z acc =
        case z of _
            DEFAULT ->
                case y of _
                    DEFAULT ->
                        case x of _
                            DEFAULT ->
                                fold (x +# 1)
                                     (y +# 1)
                                     (z +# 1)
                                     (acc +# 2048)
                            100000000 -> acc
                    100000001 -> acc
            100000002 -> acc

Look at that constant folding! The compiler saw that the arrays were initialised with all elements 2, 3 and 5, which were then added (giving 10) and used as the shift amount. The compiler spotted this and realised we’re just describing a loop that adds 2 << 10 = 2048 on each iteration.

The other thing it noticed was that the values were always going to be demanded. There’s no laziness in this program, so GHC removed all the lazy Int allocations and replaced them with unboxed Int# values: raw machine integers living in registers.

There are no `let`s left in this code, so *nothing* is going to end up on the heap, and the garbage collector isn’t going to get a chance to run!

With strictness analysis and aggressive loop fusion, we end up with the code we’d write in a low level, non-GC’d language like C — we just got to write it in a very declarative, descriptive way.

This tight loop gets passed off to the code generator, yielding the final assembly:

    fold:
          cmpq        $100000000, %r8
          je  .L11
        .L5:
          cmpq        $100000001, %rdi
          je  .L11
        .L8:
          cmpq        $100000002, %rsi
          je  .L11
        .L10:
          leaq        1(%rsi), %rsi
          leaq        1(%rdi), %rdi
          leaq        1(%r8), %r8
          addq        $2048, %r9
          jmp fold

        .L11:
          movq        %r9, %rbx
          jmp *(%rbp)

Which looks pretty good. Just a tight loop, no allocation, everything in registers.

    $ time ./zipwith3
    204800000000
    ./zipwith3  0.20s user 0.00s system 95% cpu 0.209 total

It fairly zips along.

This kind of very aggressive optimisation only makes sense if you’ve already statically typed your program, inferred that the loops are referentially transparent, and decided what algorithm each loop is actually doing. And GHC is doing this by default — which is pretty sweet.

Strictness analysis and loop fusion should make the new array libraries a competitive solution.

Two new papers!

Warning: if you’re on the ICFP program committee and want to preserve double-blind reviewing, please don’t read this post.

Phew! The ICFP deadline is past, and two papers have been submitted.

The first paper covers the full story of stream fusion, greatly extending work initially done for the ByteString library: in particular, how to fuse zips, concatMaps and list comprehensions. We’ve implemented the entire list library to be stream-fusible. The point: faster Haskell code.

The second paper describes how, using Haskell, we can write Monte Carlo simulators that outperform optimised C implementations, using a generative approach to simulator specialisation. In particular, we generate multicore and cluster-based simulators for polymer chemistry, outperforming the existing (commercial) software in this area.

So here they are, for your functional programming satisfaction:

Stream Fusion: From Lists to Streams to Nothing at All.
Duncan Coutts, Roman Leshchinskiy and Don Stewart.

Abstract:

This paper presents an automatic fusion system, stream fusion, based on equational transformations, that fuses a wider range of functions than existing short-cut fusion systems. In particular, stream fusion is able to fuse zips, left folds and functions over nested lists, including list comprehensions. A distinguishing feature of the stream fusion framework is its simplicity: by transforming list functions to expose their structure, intermediate values are eliminated by general purpose compiler optimisations.

We have reimplemented the entire Haskell standard list library on top of our framework, providing stream fusion for Haskell lists. By allowing a wider range of functions to fuse, we see an increase in the number of occurrences of fusion in typical Haskell programs. We present benchmarks documenting time and space improvements.

And the second paper:

Generative Code Specialisation for High-Performance Monte-Carlo Simulations.
Don Stewart, Hugh Chaffey-Millar, Gabriele Keller, Manuel M. T. Chakravarty and Christopher Barner-Kowollik.

Abstract:

We address the tension between software generality and performance in the domain of scientific and financial simulations based on Monte-Carlo methods. To this end, we present a novel software architecture, centred around the concept of a specialising simulator generator, that combines and extends methods from generative programming, partial evaluation, runtime code generation, and dynamic code loading. The core tenet is that, given a fixed simulator configuration, a generator in a functional language can produce low-level code that is more highly optimised than a manually implemented generic simulator. We also introduce a skeleton, or template, capturing a wide range of Monte-Carlo methods and use it to explain how to design specialising simulator generators and how to generate parallelised simulators for multi-core and distributed-memory multiprocessors.

We evaluated the practical benefits and limitations of our approach by applying it to a highly relevant problem in computational chemistry. More precisely, we used a Markov-chain Monte-Carlo method for the study of advanced forms of polymerisation kinetics. The resulting implementation executes faster than all competing software products, while at the same time also being more general. The generative architecture allows us to cover a wider range of chemical reactions and to target a wider range of high-performance architectures (such as PC clusters and SMP multiprocessors).

We show that it is possible to outperform low-level languages with functional programming in domains with very stringent performance requirements if the domain also demands generality.