Playing with GHC’s parallel runtime

In this post you’ll get a bit of an idea how to:

  • make a Haskell program much faster by parallelising it
  • analyse and use the SMP runtime flags GHC provides
  • mess with the parallel garbage collector

Ultimately we’ll make a program more than 3x faster on 4 cores by changing one line of code, using parallelism, and tuning the garbage collector.

Update: since I began writing this, GHC HQ (aka Simon, Simon and Satnam) have released “Runtime Support for Multicore Haskell”, which finally puts on paper a lot of information that was previously just rumour. As a result, I’ve rewritten this article from scratch to use GHC 6.11 (today’s snapshot), since it is just so much faster and easier to use than 6.10.x.

Ready?

The new GHC garbage collector

The GHC 6.10 release notes contain the following text on runtime system changes:

The garbage collector can now use multiple threads in parallel. The new -gn RTS flag controls it, e.g. run your program with +RTS -g2 -RTS to use 2 threads. The -g option is implied by the usual -N option, so normally there will be no need to specify it separately, although occasionally it is useful to turn it off with -g1. Do let us know if you experience strange effects, especially an increase in GC time when using the parallel GC (use +RTS -s -RTS to measure GC time). See Section 5.14.3, “RTS options to control the garbage collector” for more details.

Interesting. Maybe this will have some impact on the shootout benchmarks.

Binary trees: single threaded

There’s one program that’s been bugging me for a while, where the garbage collector is a bottleneck: parallel binary-trees on the quad core Computer Language Benchmarks Game. This is a pretty straightforward program for testing the memory management of non-flat data types in a language runtime – and FP languages should do very well here, with their bump-and-allocate heaps. Really, all you have to do is allocate and traverse a bunch of binary trees. This kind of data:

data Tree = Nil | Node !Int !Tree !Tree

Note that the rules state we can’t use laziness to avoid making O(n) allocations at a time, so the benchmark will use a strict tree type – that’s fine – laziness only helps on a single core anyway. GHC will unbox those Int fields into the constructor too, with -funbox-strict-fields (which should be implied by -O, in my opinion). The benchmark itself is really quite easy to implement. Pattern matching makes allocating and walking the trees trivial:

-- traverse the tree, computing a checksum of the nodes
check :: Tree -> Int
check Nil          = 0
check (Node i l r) = i + check l - check r

-- build a tree
make :: Int -> Int -> Tree
make i 0 = Node i Nil Nil
make i d = Node i (make (i2-1) d2) (make i2 d2)
  where i2 = 2*i
        d2 = d-1

The full code is here. So quite naive code, and fast… if we just look at this code running on the single core benchmark machine:

  1. 1.0 Java 6 -Xms64m #2
  2. 1.1 SML MLton #2
  3. 1.2 Haskell GHC
  4. 1.2 Erlang HiPE
  5. 1.3 Lisp SBCL #2
  6. 1.3 GNU gcc

Functional language implementations take up 4 of the top 6 slots, edging out C (it’s even faster with lazy trees). You can try this for yourself:

whirlpool$ ghc -O2 --make A.hs
[1 of 1] Compiling Main             ( A.hs, A.o )
Linking A ...

whirlpool$ time ./A 16
stretch tree of depth 17	 check: -1
131072	 trees of depth 4	 check: -131072
32768	 trees of depth 6	 check: -32768
8192	 trees of depth 8	 check: -8192
2048	 trees of depth 10	 check: -2048
512	 trees of depth 12	 check: -512
128	 trees of depth 14	 check: -128
32	 trees of depth 16	 check: -32
long lived tree of depth 16	 check: -1
./A 16  1.26s user 0.03s system 100% cpu 1.291 total

I’m on a quad core Linux 2.6.26-1-amd64 x86_64 box, with:

whirlpool$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 6.11.20090302

If we take N up to 20, it takes a while longer to run:

whirlpool$ time ./A 20
stretch tree of depth 21	 check: -1
2097152	 trees of depth 4	 check: -2097152
524288	 trees of depth 6	 check: -524288
131072	 trees of depth 8	 check: -131072
32768	 trees of depth 10	 check: -32768
8192	 trees of depth 12	 check: -8192
2048	 trees of depth 14	 check: -2048
512	 trees of depth 16	 check: -512
128	 trees of depth 18	 check: -128
32	 trees of depth 20	 check: -32
long lived tree of depth 20	 check: -1
./A 20  40.21s user 0.16s system 99% cpu 40.382 total

And of course we get no speedup from the extra cores on the system yet. We’re only using 1/4 of the machine’s processing resources, since the implementation contains no parallelisation strategy for GHC to exploit.

Binary trees in parallel

Since Haskell (especially pure Haskell like this) is easy to parallelise, and in general GHC Haskell is pretty zippy on multicore :-), let’s see what we can do to make this faster via parallelisation. It turns out that teaching this program to use multicore is ridiculously easy: all we have to change is one line! Previously, we computed the depth of all the trees between minN and maxN sequentially,

let vs = depth minN maxN

...

depth :: Int -> Int -> [(Int,Int,Int)]
depth d m
    | d <= m    = (2*n,d,sumT d n 0) : depth (d+2) m
    | otherwise = []
  where n = 1 `shiftL` (m - d + minN)

which yields a list of tree results sequentially. Instead, we step back and compute the separate trees in parallel using parMap:

let vs = parMap rnf id $ depth minN maxN

From Control.Parallel.Strategies, parMap forks sparks for each (expensive) computation in the list, evaluating them in parallel to normal form. This technique uses sparks – lazy futures – to hint to the runtime that it might be a good idea to evaluate each subcomputation in parallel. When the runtime spots that there are spare threads, it’ll pick up the sparks, and run them. With +RTS -N4, those sparks (in this case, 9 of them) will get scheduled over 4 cores. You can find out more about this style of parallel programming in ch24 of Real World Haskell,  in Algorithm + Strategy = Parallelism and now in the new GHC HQ runtime paper.
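
If you haven’t used strategies before, here is a minimal standalone sketch of the same pattern. It assumes the old Control.Parallel.Strategies API used in this post, where rnf is a Strategy (newer versions of the parallel package call it rdeepseq), and the expensive function is just a hypothetical stand-in for any costly pure computation:

import Control.Parallel.Strategies (parMap, rnf)

-- a hypothetical stand-in for any costly pure computation
expensive :: Int -> Int
expensive n = sum [1 .. n * 1000000]

-- parMap rnf sparks one fully-evaluated computation per list element;
-- compile with -threaded, run with +RTS -N4 to spread them over 4 cores
main :: IO ()
main = print (sum (parMap rnf expensive [1 .. 8]))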

Running parallel binary trees

Now that we’ve modified the implementation to contain a parallel evaluation strategy, all we have to do is compile it against the threaded GHC runtime, and those sparks will be picked up by the scheduler and dropped into real threads distributed across the cores. We can try it using 2, then 4 cores:

whirlpool$ ghc -O2 -threaded A.hs --make -fforce-recomp

whirlpool$ time ./A 16 +RTS -N2
stretch tree of depth 17	 check: -1
131072	 trees of depth 4	 check: -131072
32768	 trees of depth 6	 check: -32768
8192	 trees of depth 8	 check: -8192
2048	 trees of depth 10	 check: -2048
512	 trees of depth 12	 check: -512
128	 trees of depth 14	 check: -128
32	 trees of depth 16	 check: -32
long lived tree of depth 16	 check: -1
./A 16 +RTS -N2  1.34s user 0.02s system 124% cpu 1.094 total

Hmm, a little faster at N=16, and > 100% cpu. Trying again with 4 cores:

whirlpool$ time ./A 16 +RTS -N4
stretch tree of depth 17	 check: -1
131072	 trees of depth 4	 check: -131072
32768	 trees of depth 6	 check: -32768
8192	 trees of depth 8	 check: -8192
2048	 trees of depth 10	 check: -2048
512	 trees of depth 12	 check: -512
128	 trees of depth 14	 check: -128
32	 trees of depth 16	 check: -32
long lived tree of depth 16	 check: -1
./A 16 +RTS -N4  2.89s user 0.06s system 239% cpu 1.229 total

Hmm… so it got only a little faster with 2 cores at N=16, but about the same with 4 cores. At N=20 we see similar results:

whirlpool$ time ./A 20 +RTS -N4
stretch tree of depth 21	 check: -1
2097152	 trees of depth 4	 check: -2097152
524288	 trees of depth 6	 check: -524288
131072	 trees of depth 8	 check: -131072
32768	 trees of depth 10	 check: -32768
8192	 trees of depth 12	 check: -8192
2048	 trees of depth 14	 check: -2048
512	 trees of depth 16	 check: -512
128	 trees of depth 18	 check: -128
32	 trees of depth 20	 check: -32
long lived tree of depth 20	 check: -1
./A 20 +RTS -N4  96.61s user 0.93s system 239% cpu 40.778 total

So still 40s, at 239% cpu: we made something hot, but not much faster. You can see a similar result at N=20 in the current quad core shootout binary-trees entry. Jobs are distributed across the cores, but the runtime is not much better – a little better than the single core entry, but only a little. It sits in the middle of the pack, 2x slower than C!

Meanwhile, on the single core, it’s in 3rd place, ahead of C and C++. So what’s going on?

Listening to the garbage collector

We’ve parallelised this logically well, so I’m not prepared to abandon the top-level parMap strategy. Instead, let’s look deeper. One clue about what is going on is the cpu utilisation in the shootout program:

6.8× Haskell GHC #2: 53.36s CPU, 403,944 KB memory, 544 B gzipped source, 40.18s elapsed, per-core load 21% 45% 21% 41%

Those aren’t very good numbers – we’re using all the cores, but not very well. So the program’s doing something other than just number crunching. A good suspect is that there’s lots of GC traffic happening (after all, a lot of trees are being allocated!). We can confirm this hunch with +RTS -sstderr which prints lots of interesting statistics about what the program did:

whirlpool$ time ./A 16 +RTS -N4 -sstderr

./A 16 +RTS -N4 -sstderr
     946,644,112 bytes allocated in the heap
     484,565,352 bytes copied during GC
       8,767,512 bytes maximum residency (23 sample(s))
          95,720 bytes maximum slop
              27 MB total memory in use (1 MB lost due to fragmentation)

  Generation 0:   674 collections,     0 parallel,  0.54s,  0.55s elapsed
  Generation 1:    23 collections,    22 parallel,  0.57s,  0.16s elapsed

  Parallel GC work balance: 1.56 (17151829 / 10999322, ideal 4)

  Task  0 (worker) :  MUT time:   0.36s  (  0.39s elapsed)
                      GC  time:   0.28s  (  0.13s elapsed)
  Task  1 (worker) :  MUT time:   0.67s  (  0.43s elapsed)
                      GC  time:   0.14s  (  0.14s elapsed)
  Task  2 (worker) :  MUT time:   0.01s  (  0.43s elapsed)
                      GC  time:   0.09s  (  0.08s elapsed)
  Task  3 (worker) :  MUT time:   0.00s  (  0.43s elapsed)
                      GC  time:   0.00s  (  0.00s elapsed)
  Task  4 (worker) :  MUT time:   0.31s  (  0.43s elapsed)
                      GC  time:   0.00s  (  0.00s elapsed)
  Task  5 (worker) :  MUT time:   0.22s  (  0.43s elapsed)
                      GC  time:   0.60s  (  0.37s elapsed)

  SPARKS: 7 (7 converted, 0 pruned)

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    1.02s  (  0.43s elapsed)
  GC    time    1.12s  (  0.71s elapsed)
  EXIT  time    0.32s  (  0.03s elapsed)
  Total time    2.45s  (  1.17s elapsed)

  %GC time      45.5%  (60.7% elapsed)

  Alloc rate    708,520,343 bytes per MUT second

  Productivity  54.5% of total user, 114.1% of total elapsed

gc_alloc_block_sync: 35082
whitehole_spin: 0
gen[0].steps[0].sync_todo: 0
gen[0].steps[0].sync_large_objects: 0
gen[0].steps[1].sync_todo: 1123
gen[0].steps[1].sync_large_objects: 0
gen[1].steps[0].sync_todo: 6318
gen[1].steps[0].sync_large_objects: 0
./A 16 +RTS -N4 -sstderr  2.76s user 0.08s system 241% cpu 1.176 total

At N=16, the program is spending 45% of its time doing garbage collection. That’s a problem. We can also see some other things:

  • 7 sparks are being created by our parMap, all of which are turned into real threads
  • The parallel GC does get a chance to run in parallel 22 times.

And at N=20, the benchmark number, things aren’t any better:

  19,439,350,240 bytes allocated in the heap
  21,891,579,896 bytes copied during GC
     134,688,800 bytes maximum residency (89 sample(s))
         940,344 bytes maximum slop
             376 MB total memory in use (6 MB lost due to fragmentation)

  Generation 0: 14576 collections,     0 parallel, 20.67s, 20.62s elapsed
  Generation 1:    89 collections,    88 parallel, 36.33s,  9.20s elapsed

  SPARKS: 9 (9 converted, 0 pruned)

  %GC time      64.0%  (74.8% elapsed)

So yikes, we’re wasting a lot of time cleaning up after ourselves (though happily our par strategy isn’t wasting any fizzled sparks). Diving into the GC docs, we see:

Bigger heaps work better with parallel GC, so set your -H value high (3 or more times the maximum residency)

Ok. Let’s try to get that number down.

Helping out the GC

We can make a guess at how much by looking at the maximum residency statistics: at N=20 the maximum residency was about 135 MB, and three times that is about 400 MB. So a good start might be 400M:

whirlpool$ time ./A 20 +RTS -N4 -H400M
stretch tree of depth 21	 check: -1
2097152	 trees of depth 4	 check: -2097152
524288	 trees of depth 6	 check: -524288
131072	 trees of depth 8	 check: -131072
32768	 trees of depth 10	 check: -32768
8192	 trees of depth 12	 check: -8192
2048	 trees of depth 14	 check: -2048
512	 trees of depth 16	 check: -512
128	 trees of depth 18	 check: -128
32	 trees of depth 20	 check: -32
long lived tree of depth 20	 check: -1
./A 20 +RTS -N4 -H400M  35.25s user 0.42s system 281% cpu 12.652 total

Ok, so that was pretty easy. Runtime has gone from 40s to 12s. Why? Looking at +RTS -sstderr:

  %GC time       6.8%  (18.6% elapsed)
  Generation 0:    86 collections,     0 parallel,  2.07s,  2.23s elapsed
  Generation 1:     3 collections,     2 parallel,  0.34s,  0.10s elapsed

GC time is down under 10% too, which is a good rule of thumb. And the original N=16 run, with its smaller number of trees, which was taking 1.29s, is now down to:

whirlpool$ time ./A 16 +RTS -N4 -H400M
stretch tree of depth 17     check: -1
131072     trees of depth 4     check: -131072
32768     trees of depth 6     check: -32768
8192     trees of depth 8     check: -8192
2048     trees of depth 10     check: -2048
512     trees of depth 12     check: -512
128     trees of depth 14     check: -128
32     trees of depth 16     check: -32
long lived tree of depth 16     check: -1
./A 16 +RTS -N4 -H400M  1.26s user 0.38s system 285% cpu 0.575 total

So this is a reasonable stopping point.

The lessons

  • parMap can be quite effective and easy as a parallelisation strategy
  • if you have a reasonable parallelisation strategy but aren’t getting the performance, check what the GC is doing.

And as a final remark, we can look forward to what’s around the corner for GHC:

12.1 Independent GC

… We fully intend to pursue CPU-independent GC in the future … moving to more explicitly-separate heap regions is a more honest reflection of the underlying memory architecture …

So hopefully soon each core will be collecting its own binary trees.

References

Complete details of the new GC are in the GC paper, and particularly in the new RTS paper, “Runtime Support for Multicore Haskell”.

And as a final teaser, more on the multicore Haskell story this week:

Haskell as fast as C: working at a high altitude for low level performance

After the last post about high performance, high level programming, Slava Pestov, of Factor fame, wondered whether it was generally true that “if you want good performance you have to write C in your language”. It’s a good question to ask of a high level language.

In this post I want to show that, often, we can answer “No”: by working at a higher abstraction level we can get the same low level performance, exploiting the fact that the compiler knows a lot more about what our code does. We can teach the compiler to better understand the problem domain, and in doing so enable it to optimise the code all the way down to the raw assembly we want.

Specifically, we’ll exploit stream fusion — a new compiler optimisation that removes intermediate data structures produced in function pipelines, yielding very good code while encouraging a very high level coding style. The best of high level programming, combined with the best of raw low level throughput.

The micro-benchmark: big list mean

This post is based around a simple microbenchmark developed in the last post of this series: computing the mean of a list of floating point values. It’s a tiny program, but one with interesting properties: the list is enormous (10^9 values) — so dealing with it effectively requires laziness of some kind, and a naive implementation is far too inefficient, so we have to be smarter with our loops.

The list is generated via an enumeration — a key property the compiler can later exploit — and we have two reference implementations on hand. One in Haskell, using a manually fused, tail recursive style, with a worker/wrapper transformation applied:

    import System.Environment
    import Text.Printf

    mean :: Double -> Double -> Double
    mean n m = go 0 0 n
        where
            go :: Double -> Int -> Double -> Double
            go s l x | x > m      = s / fromIntegral l
                     | otherwise  = go (s+x) (l+1) (x+1)

    main = do
        [d] <- map read `fmap` getArgs
        printf "%f\n" (mean 1 d)

And a straightforward translation to C, with all data structures fused away:

    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {

            double d = atof(argv[1]);

            double n;
            long long a; // 64 bit machine
            double b;

            // go_s17J :: Double# -> Int# -> Double# -> Double#
            for (n = 1,
                 a = 0,
                 b = 0; n <= d; b+=n,
                                n++,
                                a++)
                ;

            printf("%f\n", b / a);

            return 0;
    }

The C will serve as a control — a lower bound we want to reach. Both reference programs compute the same thing, in the same time, as they have pretty much identical runtime representations:

    $ gcc -O2 t.c
    $ time ./a.out 1e9
    500000000.067109
    ./a.out 1e9  1.75s user 0.00s system 99% cpu 1.757 total

and

    $ ghc -O2 A.hs -optc-O2 -fvia-C --make 
    $ time ./A 1e9
    500000000.067109
    ./A 1e9  1.76s user 0.00s system 100% cpu 1.760 total

In the previous post we looked at how and why the Haskell version is able to compete directly with C here, when working at a low level (explicit recursion). Now let’s see how to fight the battle at high altitude.

Left folds and deforestation

Clearly, writing in a directly recursive style is a reliable path to fast, tight loops, but there’s a programmer burden: we had to sit and think about how to combine the list generator with its consumer. For cheaper cognitive load we’d really like to keep the logically separate loops … literally separate, by composing a list generator with a list consumer. Something like this:

    import Data.List (foldl')

    mean n m = s / fromIntegral l
      where
        (s, l) = foldl' k (0, 0) [n .. m]
        k (s, l) a = (s+a, l+1)

The fold is the key. This is a nice decoupling of the original recursive loop — the generator is abstracted out again into an enumeration, and the process of traversing the list once, accumulating two values (the sum and the length), is made clear. This is the kind of code we want to write. However, there are some issues at play that make this not quite the required implementation (as it doesn’t yield the same runtime representation as the control implementations):

  • The tuple accumulator is too lazy for a tail recursive loop: foldl' forces the pair only to weak head normal form, so the sum and length inside it build up as thunks
  • The accumulator type for the length defaults to an overly expensive unbounded Integer

The fix is straightforward: just use a strict pair type for nested accumulators:

    import Data.List (foldl')

    data P = P !Double !Int

    mean :: Double -> Double -> Double
    mean n m = s / fromIntegral l
      where
        P s l = foldl' k (P 0 0) [n .. m]
        k (P s l) a = P (s+a) (l+1)

Strict products of atomic types have a great property: when used like this they can be represented as sets of register variables (compile with -funbox-strict-fields). The P data type is essentially an abstract mapping from P’s fields into two registers for use as loop accumulators. We can be fairly certain the fold will now compile to a loop whose accumulating parameters are passed and returned in registers (thanks to the “constructed product result” optimisation).

This is easy to confirm via ghc-core. Compiling the code, we see the result below. The ‘#’ suffix indicates the Int and Double types are unlifted — they won’t be stored on the heap, and will most likely be kept in registers. The return type, an unboxed pair (# Double#, Int# #), is also excellent: it means the result returned from the function will be kept in registers as well (we don’t need to allocate at all to get the fold’s state in or out of the function!). So now we’re getting closer to the optimal representation. The core we get:

    go :: Double# -> Int# -> [Double] -> (# Double#, Int# #)
    go a b c = case c of
          []     -> (# a, b #)
          x : xs -> case x of
                        D# y -> go (+## a y) (+# b 1) xs

This will run in constant space, as the lazy list is demanded (so the first part of the problem is solved), and the result will be accumulated in registers (so that part is fine). But we still see a performance penalty here — the list of Doubles is still concretely, lazily constructed. GHC wasn’t able to spot that the generator of sequential values could be combined with the fold that consumes it, avoiding the list construction entirely. This will cost us quite a bit in memory traffic in such a tight loop:

    $ time ./B 1e9
    500000000.067109
    ./B 1e9  66.92s user 0.38s system 99% cpu 1:07.45 total

Right, so we got an answer, but dropping out to the heap to get the next list node is an expensive bottleneck in such a tight loop, as expected.

The deeper problem here is that the compiler isn’t able to combine left folds with list generators into a single loop. GHC can do this for right folds, as described in the original paper for this optimisation, A Short Cut to Deforestation. That is, if you compose certain functions written as right folds, GHC will automatically remove intermediate lists between them, yielding a single loop that does the work of both, without any list being constructed. This will work nicely for pipelines of the following functions: head, filter, iterate, repeat, take, and, or, any, all, concat, foldr, concatMap, zip, zipWith, ++, map, and list comprehensions.
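
For reference, the heart of that optimisation is tiny. Here’s a sketch of the build combinator and the foldr/build rewrite rule (essentially what GHC.Base defines, minus the inlining-phase control), together with a hypothetical enumerator, upto, written in build form so the rule has something to cancel:

    {-# LANGUAGE RankNTypes #-}

    -- a list producer, abstracted over cons and nil
    build :: (forall b. (a -> b -> b) -> b -> b) -> [a]
    build g = g (:) []

    -- the shortcut rule: a foldr consuming a built list
    -- cancels the intermediate list entirely
    {-# RULES
    "foldr/build" forall k z (g :: forall b. (a -> b -> b) -> b -> b).
        foldr k z (build g) = g k z
      #-}

    -- a hypothetical enumerator in build form;
    -- foldr (+) 0 (upto 1 n) now fuses into a single loop, no list built
    upto :: Int -> Int -> [Int]
    upto lo hi = build (\c n -> let go i | i > hi    = n
                                         | otherwise = i `c` go (i+1)
                                in go lo)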

But it won’t work for some key left folds such as foldl, foldl', foldl1, length, minimum, maximum, sum and product. Another downside is that the enumFromTo list generator only fuses for Int, Char and Integer types — an oversight in the current implementation.

What we need is a different deforestation strategy — one that teaches the compiler how to see through the left fold.

Stream fusion

An alternative (and new) deforestation optimisation for sequence types, such as lists or arrays, is stream fusion, which overcomes the foldr bias of GHC’s old method. As GHC allows custom optimisations to be written by users in libraries, this is cheap to try out — the new optimisation comes bundled with the libraries of various new data structures: the fusible list package, and a new fusible arrays library. It is also available for distributed parallel arrays, as part of the data parallel Haskell project, from which the non-parallel fusible arrays are derived.

There are two main enabling technologies that have brought about this abundance of data structure deforestation optimisations in Haskell: user-specified term rewriting rules, and ubiquitous, pure higher order functions. If you don’t have the former the burden of adding new optimisations by hacking the compiler is too high for mortals, and if you don’t have the latter — HOFs with guaranteed purity — there simply won’t be enough fusion opportunities to make the transformation worthwhile. Happily, GHC Haskell has both — in particular, it can be more aggressive as it knows the loops will have no side effects when we radically rearrange the code. Fun, fun, fun.
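
To make the first of those concrete: a user-specified optimisation is just a pragma in library code. The classic (pre-stream-fusion) example is the map/map rule, which collapses two traversals into one:

    {-# RULES
    "map/map" forall f g xs. map f (map g xs) = map (f . g) xs
      #-}

The rule is only meaning-preserving because f and g are guaranteed pure, which is exactly what the second ingredient buys us.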

So, taking the fusible arrays library, we can rewrite our fold example precisely as:

    import System.Environment
    import Data.Array.Vector
    import Text.Printf

    data P = P !Double !Int

    mean :: Double -> Double -> Double
    mean n m = s / fromIntegral l
      where
        P s l = foldlU k (P 0 0) (enumFromToFracU n m)
        k (P s l) a = P (s+a) (l+1)

    main = do
        [d] <- map read `fmap` getArgs
        printf "%f\n" (mean 1 d)

This is practically identical to the naive list version: we simply replaced foldl' with foldlU, and [n .. m] with an explicit enumeration (as enumeration syntax isn’t available for arrays). GHC compiles the fold down to:

  RuleFired
    1 streamU/unstreamU

  go :: Double# -> Int# -> Double# -> (# Double#, Int# #)
  go= \ a b c ->
     case >## c limit of
        False -> go (+## a c) (+# b 1) (+## c 1.0)
        True -> (# a, b #)

Look how the list generator and consumer have been combined into a single loop with 3 register variables, corresponding to the sum, the length, and the next element to generate in the list. The rule firing indicates that an intermediate array was removed from the program. Our “array program” allocates no arrays — as the compiler is able to see through the problem all the way from array generation to consumption! The final assembly looks really good (ghc -funbox-strict-fields -O2 -fvia-C -optc-O2):

    s1rD_info:
      ucomisd     5(%rbx), %xmm6
      ja  .L87
      addsd       %xmm6, %xmm5
      addq        $1, %rsi
      addsd       .LC2(%rip), %xmm6
      jmp s1rD_info

And the key test: the assembly from the high level approach is practically identical to the hand fused, worker/wrapper applied low level implementation:

    s1bm_info:
      ucomisd     8(%rbp), %xmm6
      ja  .L73
      addsd       %xmm6, %xmm5
      addq        $1, %rbx
      addsd       .LC0(%rip), %xmm6
      jmp s1bm_info

And to the original C implementation:

    .L5:
        addsd   %xmm1, %xmm3
        addq    $1, %rax
        addsd   %xmm2, %xmm1
        ucomisd %xmm1, %xmm0
        jae .L5

As a result the performance should be excellent:

    $ time ./C 1e9
    500000000.067109
    ./C 1e9  1.77s user 0.01s system 100% cpu 1.778 total

If we look at the runtime statistics (+RTS -sstderr) we see:

     16,232 bytes allocated in the heap
      %GC time       0.0%  (0.0% elapsed)

So 16k allocated in total, and the garbage collector was never invoked. There was simply no abstraction penalty in this program! In fact, the abstraction level made possible the required optimisations.

How it works

Here’s a quick review of how this works. For the full treatment, see the paper.

The first thing to do is to represent arrays as abstract unfolds over a sequence type, with some hidden state component and a final length:

    {-# LANGUAGE ExistentialQuantification #-}

    -- the state type s is hidden: the paper's "exists s." is written
    -- with "forall" in source Haskell
    data Stream a = forall s. Stream (s -> Step s a) !s Int

    data Step s a = Done
                  | Yield !a !s
                  | Skip     !s

This type lets us represent traversals and producers of lists as simple non-recursive stepper functions — abstract loop bodies — which the compiler already knows how to optimise really well. A stepper, given the current state, either signals that the stream is done, or yields an element together with the state required to produce future elements, or skips an element (which lets us implement control flow between steppers by switching states around).
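
To get a feel for the stepper style, here’s a sketch of how a map over streams might look (the library’s own combinators follow the same shape): no recursion at all, just a new stepper wrapping the old one:

    -- sketch: map over a stream, by wrapping the old stepper
    -- in a new non-recursive one
    mapS :: (a -> b) -> Stream a -> Stream b
    mapS f (Stream next s n) = Stream next' s n
      where
        next' st = case next st of
                     Done       -> Done
                     Skip s'    -> Skip s'
                     Yield x s' -> Yield (f x) s'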

We’ll then need a way to turn arrays into streams:

    streamU :: UA a => UArr a -> Stream a
    streamU !arr = Stream next 0 n
      where
        n = lengthU arr

        next i | i == n    = Done
               | otherwise = Yield (arr `indexU` i) (i+1)

And a way to turn streams back into (unlifted, ST-encapsulated) arrays:

    --
    -- Fill a mutable array and freeze it
    --
    unstreamU :: UA a => Stream a -> UArr a
    unstreamU st@(Stream next s n) = newDynU n (\marr -> unstreamMU marr st)

    --
    -- Fill a chunk of mutable memory by unfolding a stream
    --
    unstreamMU :: UA a => MUArr a s -> Stream a -> ST s Int
    unstreamMU marr (Stream next s n) = fill s 0
      where
        fill s i = case next s of
             Done       -> return i
             Skip s'    -> fill s' i
             Yield x s' -> writeMU marr i x >> fill s' (i+1)

This fills a chunk of memory by unfolding the yielded elements of the stream. Now, the key optimisation to teach the compiler:

    {-# RULES

    "streamU/unstreamU" forall s.
      streamU (unstreamU s) = s

      #-}

And that’s it. GHC now knows how to remove any occurrence of an array construction followed by its immediate consumption.
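
So, for instance, if the library defines its array map by round-tripping through streams (a sketch, reusing the mapS stepper from above), then composed operations fuse with no further work:

    -- sketch: array map as a stream map between conversions
    mapU :: (UA a, UA b) => (a -> b) -> UArr a -> UArr b
    mapU f = unstreamU . mapS f . streamU

    -- in  mapU f (mapU g xs)  inlining exposes
    --   unstreamU (mapS f (streamU (unstreamU (mapS g (streamU xs)))))
    -- and the streamU/unstreamU rule deletes the middle array:
    --   unstreamU (mapS f (mapS g (streamU xs)))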

We can write an enumerator and foldl for streams:

    enumFromToFracS :: Double -> Double -> Stream Double
    enumFromToFracS n m = Stream next n (truncate (m - n))
      where
        lim = m + 1/2

        next s | s >  lim     = Done
               | otherwise    = Yield s (s+1)

    foldS :: (b -> a -> b) -> b -> Stream a -> b
    foldS f z (Stream next s _) = fold z s
      where
        fold !z s = case next s of
                     Yield x !s' -> fold (f z x) s'
                     Skip    !s' -> fold z s'
                     Done        -> z

And we’re almost done. Writing the concrete implementations of foldlU and enumFromToFracU is now just a matter of converting arrays to streams, and folding or enumerating those instead:

    enumFromToU :: (UA a, Integral a) => a -> a -> UArr a
    enumFromToU start end = unstreamU (enumFromToS start end)

    foldlU :: UA a => (b -> a -> b) -> b -> UArr a -> b
    foldlU f z = foldS f z . streamU

So, if we write a program that composes foldlU and enumFromToFracU, such as our big mean problem:

    = foldlU k (P 0 0) (enumFromToFracU n m)

The compiler will immediately inline the definitions to:

    = foldS k (P 0 0) $ streamU $ unstreamU $ enumFromToFracS n m

The fusion rule kicks in at this point, and we’ve killed off the intermediate array:

    = foldS k (P 0 0) $ enumFromToFracS n m

Now we have a single loop, the foldS, which takes a non-recursive list generator as an argument. GHC then squooshes away all the intermediate Step constructors, leaving a final loop with just the list generator index, and the pair of state values for the sum and length:

    go :: Double# -> Int# -> Double# -> (# Double#, Int# #)
    go= \ a b c ->
        case >## c limit of
            False -> go (+## a c) (+# b 1) (+## c 1.0)
            True  -> (# a, b #)

Exactly the low level code we wanted — but written for us by the compiler. And the final performance tells the story:

    $ time ./C 1e9
    500000000.067109
    ./C 1e9  1.77s user 0.01s system 100% cpu 1.778 total

A more complicated version

We can try a more complicated variant, numerically stable mean:

    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {

        double d = atof(argv[1]);

        double n;
        long long a; // 64 bit machine
        double b;

        // go :: Double# -> Int# -> Double# -> Double#
        for (n = 1,
             a = 1,
             b = 0; n <= d; b+= (n - b) / a,
                            n++,
                            a++)
            ;

            printf("%f\n", b);

            return 0;
    }

Implemented in Haskell as a single fold, which computes the stable average each time around:

    import Text.Printf
    import System.Environment
    import Data.Array.Vector

    mean :: UArr Double -> Double
    mean = fstT . foldlU k (T 0 1)
        where
            k (T b a) n = T b' (a+1)
                where b' = b + (n - b) / fromIntegral a

    data T = T !Double !Int

    fstT (T a _) = a

    main = do
        [d] <- map read `fmap` getArgs
        printf "%f\n" (mean (enumFromToFracU 1 d))

We can work at a higher level than the C code, which already manually fuses away the list generation and consumption. The end result is the same thing though:

    $ ghc-core F.hs -O2 -optc-O2 -fvia-C -funbox-strict-fields
    $ time ./F 1e9                      
    500000000.500000
    ./F 1e9  19.28s user 0.00s system 99% cpu 19.279 total

    $ gcc -O2 u.c
    $ time ./u 1e9
    500000000.500000
    ./u 1e9  19.09s user 0.00s system 100% cpu 19.081 total

So the fast inner loops are fine, and we get to use a high level language for all the surrounding glue. That’s just super.

So, to answer Slava’s question: yes, we can often get the performance of C while working at a higher altitude. And to do so we really want to exploit all the additional information we have at hand when working more abstractly. The user doesn’t have to know this stuff — the library already does all the thinking. Knowledge is (optimisation) power. Thinking at a higher level enables the compiler to do more powerful things, leaving less work for the programmer. It’s all win.