Evolving faster Haskell programs

March 9, 2009March 9, 2009 ~ Don Stewart ~ 16 Comments

This post is about using a genetic algorithm (GA) search technique to search for faster Haskell programs. We’ll use the GA to automatically evolve better:

and briefly consider automatic optimizing for:

So this is about how to make your already-fast Haskell programs faster without doing the hard work yourself. I’ll walk through the approach of using a GA library to breed solutions, and show performance improvements in already hand-optimized programs submitted to the Computer Language Benchmarks Game found by a GA search.

As a taste, the GA found an inlining hint combination resulting in an 18% reduction in runtime for the parallel k-nucleotide benchmark (a program that had already had extensive hand optimization!). Sweet.

Background

A modern optimizing compiler like GHC is a complex beast, with a barrage of optimizations available to transform code from lambda calculus to assembly. GHC follows a compilation-by-transformation approach, doing as much as possible of its code improvement via “correctness-preserving, and hopefully performance-improving, program transformations”. Deciding when to perform some particular transformation is hard, so instead the compiler has many, many tunable flags, as well as allowing hints in the source in the form of pragmas to let the programmer make domain-specific optimizations that may not be generally applicable.

Since the compiler doesn’t always get the thresholds for optimization right, exposing, for example, inlining hints to the programmer can have significant benefits when the programmer knows something more about how the code is to be used. I’ve seen factors of 10x to 100x performance improvements in inner loops with careful overriding of the default inliner heuristics used by the “Illustrious Inliner” (in Data.Binary)

The problem is the complexity of it all. If we have n inlinable functions or compiler flags, there’s 2^n combinations of inlining suggestions we can give to the compiler (double that if we start disabling inlining, and even more if we start setting particular phases). Even if the programmer has a few heursitics in mind to help with pruning, the search space is still huge. For these reasons, it is hard to know precisely when a particular flag, option, or inlining hint will be of benefit (and the same goes for parallelism hints, and strictness hints).

So, let’s have the computer traverse the search space for us.

Acovea

With a large search space for our optimization problem, one classic technique is to use an evolutionary algorithm to minimize for some cost. And for breeding the best set of compiler flags for a given program, we can use an off-the-shelf solution: acovea

ACOVEA (Analysis of Compiler Options via Evolutionary Algorithm) implements a genetic algorithm to find the “best” options for compiling programs with the GNU Compiler Collection (GCC) C and C++ compilers

Given a specification of the available flags, acovea uses these as variables to fill out an optimization search space, which it then traverses, using GA techniques to hang on to useful flags, breeding them to find a semi-optimal combination.acovea is relatively simple to use:

Compile the C++ libs (libevocosm, libcoyotl and libacovea)
Develop a specification of the flags to tune for your compiler (or reuse the defaults for GCC)
Then launch the “runacovea” wrapper script on your program
Go away for a few hours

Come back and you’ll be presented with suggested optimistic and pessimistic flags, best flag combinations, generally good combinations, and measurements against any baselines you set up. I’ve used acovea in the past (for optimising the GCC flags used in a polymer chemistry simulation) and in this post we’ll see if we can adapt it to solve other kinds of optimization problems.

An example: optimising a C program

Acovea comes with some benchmark programs to test out how it works.

First, if it doesn’t have a specification for your compiler, you’ll need to make one. A compiler specification is just an xml file with a list of all the flags you want to try permuting. Here’s one quick one I made for GCC 4.3 on a core 2 duo. It sets up some baseline flag combinations that tend to be good (-O, -O2, -O3) and then lists different flags to try.

The input program must only print to standard output its “fitness”. A value indicating how good this program was. This is the value the solver will try to minimize. By default we will time the program’s run, and have it print that time as its fitness. Smaller fitness numbers mean faster programs.

We can then run the evolver with a given input program and compiler spec as arguments,

$ time runacovea -input fftbench.c -config gcc43_core2.acovea -p 5 -n 5 -g 5

For a quick test like this, we’ll limit the size of the population of programs, the number of them, and the number of generations to run, to avoid the search taking too long. The result when run looks something like (when run on the fftbench.c distributed with acovea):

$ runacovea -input fftbench.c -config gcc43_core2.acovea -p 5 -n 5 -g 5
Acovea 5.1.1 (compiled Feb 28 2009 09:57:51)
Evolving Better Software

Invented by Scott Robert Ladd         (scott.ladd@coyotegulch.com)
            Coyote Gulch Productions  (http://www.coyotegulch.com)

   test application: fftbench.c
        test system: paprika
 config description: gcc 4.x Core 2 Duo (x86_64) (version 1.2.0)
 test configuration: gcc43_core2.acovea
     acovea version: 5.1.1
    evocosm version: 3.1.0
application version: gcc 4.3.3

generation 1 complete, average fitness: 0.938262
generation 2 complete, average fitness: 0.716008
generation 3 complete, average fitness: 0.642554
generation 4 complete, average fitness: 0.73271
generation 5 complete, average fitness: 0.695639

Acovea’s turning on and off GCC flags, and using the GA approach to find better solutions. The end result is a number of good and bad flags (and by how much):

Optimistic options:
                              -ftree-dce  (1.874)
                              -ftree-dse  (1.874)
                              -ftree-sra  (2.092)
                                  -fgcse  (1.656)
                       -fstrict-aliasing  (1.874)
                            -fsched-spec  (1.656)
                      -ffinite-math-only  (1.656)

Pessimistic options:
                        -fschedule-insns  (-2.263)
                           -ffloat-store  (-2.481)
                      -funroll-all-loops  (-2.263)
           -fbranch-target-load-optimize  (-1.61)
     -freschedule-modulo-scheduled-loops  (-2.045)
                            -mfpmath=387  (-1.61)
                            -mfpmath=sse  (-1.828)

As well as a graph of how these do against our baselines of -O1 -O2 and -O3:

A relative graph of fitnesses:

     Acovea's Best-of-the-Best: ************************************                  (0.537113)
       Acovea's Common Options: **********************************************        (0.694453)
                           -O1: **************************************************    (0.743317)
                           -O2: ***********************************                   (0.524936)
                           -O3: *************************************                 (0.549206)
               -O3 -ffast-math: ***********************************                   (0.522497)
                           -Os: ************************************************      (0.72606)

And the command line to use to get that best measurement (with many of these noise with such a short search):

gcc -lrt -lm -std=gnu99 -O1 -march=core2 -fno-merge-constants -fno-defer-pop -fno-if-conversion2 -floop-optimize -ftree-dce -ftree-dse -ftree-lrs -ftree-sra -fcse-follow-jumps -fcse-skip-blocks -fgcse -fexpensive-optimizations -frerun-loop-opt -fpeephole2 -fstrict-aliasing -fstrict-overflow -fdelete-null-pointer-checks -freorder-blocks -fthread-jumps -fgcse-lm -fsched-interblock -funit-at-a-time -falign-functions -falign-labels -ftree-pre -fgcse-after-reload -fomit-frame-pointer -fno-inline -ftracer -fsplit-ivs-in-unroller -funroll-loops -fgcse-sm -freschedule-modulo-scheduled-loops -ftree-loop-ivcanon -mieee-fp -minline-all-stringops -mfpmath=387 -fno-math-errno -funsafe-math-optimizations -fno-trapping-math -ffinite-math-only -fcx-limited-range -o /tmp/ACOVEA30281033 fftbench.c

So in this short 5 minute run, we found a combination of flags that was pretty close to -O3. If we let it run overnight, it might well find a good 10-20% on our best generic defaults. Fun stuff!

Evolving a faster Haskell program

We can do the same thing with a Haskell compiler too. First, we need a specification of the GHC’s optimisation flags.To start with, let’s just use the tool to answer a couple of simple questions when developing production Haskell code:

should I use -O1 or -O2?
should I use the C or native backend?

To answer these questions all we’ll start with a simple (incomplete) GHC specification file with just those flags available, like so:

    <prime command="ghc"
           flags="--make -v0 -fforce-recomp ACOVEA_OPTIONS -o ACOVEA_OUTPUT ACOVEA_INPUT" />
    <baseline description="ghc -O2 -funbox-strict-fields -fvia-C -optc-O3"
              command="ghc"
              flags="--make -fforce-recomp -O2 -funbox-strict-fields -fvia-C
                       -optc-O3 -optc-march=core2 -o ACOVEA_OUTPUT ACOVEA_INPUT" />
    <baseline description="ghc -O2 -funbox-strict-fields -fasm"
              command="ghc"
              flags="--make -fforce-recomp -O2 -funbox-strict-fields -fasm -o ACOVEA_OUTPUT ACOVEA_INPUT" />
    <flags>
        <flag type="simple" value="-O" />
        <flag type="simple" value="-O2" />
        <flag type="simple" value="-fasm" />
        <flag type="simple" value="-fvia-C" />
        <flag type="simple" value="-optc-O1" />
        <flag type="simple" value="-optc-O2" />
        <flag type="simple" value="-optc-O3" />
        <flag type="simple" value="-fexcess-precision" />
    </flags

A couple of things to notice here:

by default, we’ll use –make -fforce-recomp -v0 to allow full recompilation and linking
we leave all optimisations off by default

as a baseline, we’ll use the known “good” flags

-O2 -funbox-strict-fields -fvia-C -optc-O3 -optc-march=core2

-O2 -funbox-strict-fields -fasm

Letting us crank up all the optimisations, and pick between the GHC backends to use. With more time, we can traverse a larger search space, and start including more speculative GHC flags (like -funliberate-case-threshold). Also, if we really have time on our hands, we can include all the GCC flags as well! (-optc-…).

Optimizations in GHC can have a huge impact, which is good when we’re searching for them, but GHC is also somewhat problematic, as (afaik) there are optimizations baked into the -O and -O2 levels that we can’t turn on or off via flags. As a result we must always include -O and -O2 as available optimisations in our spec.

Some sample GHC flag specifications are here:

Note the last one defines a monster search space of 2^120 flag combinations.

Timing a Haskell program

So now we’ve got a spec for GHC, let’s try to see if it can find some sensible flags to optimize our program. We’ll use as input an obsolete language shootout benchmark – recursive – since it’s small. I’m hoping it will tell me that either -O2 -fasm or -O2 -fvia-C -optc-O3 is sensible.

First, we have to modify the program to emit its fitness, not some other output (but we have to be careful to also not avoid doing work … pesky lazy languages). To do this, I change the ‘main’ function to contain the following wrapper:

import Text.Printf
import Control.Exception
import System.CPUTime
import Control.Parallel.Strategies
import Control.Monad
import System.Environment
import System.Posix.Resource

n_str = "9"

main = do
    setResourceLimit ResourceCPUTime (ResourceLimits (ResourceLimit 10) (ResourceLimit 10))
    start <- getCPUTime

   ... do guts of program ...

    end   <- getCPUTime
    let diff = (fromIntegral (end - start)) / (10^12)
    printf "%0.4f" (diff :: Double)
    return ()

I’ve set an arbitrary upper limit of 10 seconds on the program (which Acovea seems to take into account as a failure), and then we measure the cpu time the program gets. I also have to be careful to replace any IO functions with code that forces the data to be evaluated, but doesn’t print it out. `rnf` comes in handy here. We also have to modify the program to parse its arguments from a string, not the command line.

So now our program prints out its fitness (in cpu itme) when run:

$ ghc -O2 --make A.hs
Linking A ...

$ time ./A
0.3366

./A  0.34s user 0.00s system 97% cpu 0.350 total

I expect -O2 -fasm to be around the best we can for this program (possibly -O2 -fvia-C -optc-O3).

Evolving a better set of GHC flags

We can now bring the two together and use the GA lib to find a good set of flags for this program. Note: GAs are slow to converge on a good solution.

$ runacovea -input A.hs -config ghc.simple.acovea -p 5 -n 10 -g 10

generation 1 complete, average fitness: 1.62956
generation 2 complete, average fitness: 1.10506
generation 3 complete, average fitness: 0.808408
generation 5 complete, average fitness: 0.548018
generation 7 complete, average fitness: 0.458966
generation 8 complete, average fitness: 0.458832
generation 9 complete, average fitness: 0.387232
generation 10 complete, average fitness: 0.376632

Acovea's Best-of-the-Best:
   ghc --make -v0 -fforce-recomp -O -O2 -optc-O2 -optc-O3 -fexcess-precision -o /tmp/ACOVEA13339961 A.hs 

Via C baseline:
   ghc --make -fforce-recomp -O2 -funbox-strict-fields -fvia-C -optc-O3 -optc-march=core2 -o /tmp/ACOVEA6310E069 A.hs 

Via native codegen baseline:
   ghc --make -fforce-recomp -O2 -funbox-strict-fields -fasm -o /tmp/ACOVEA19AF21AF A.hs 

A relative graph of fitnesses:

     Acovea's Best-of-the-Best: ****                                                  (0.3433)
       Acovea's Common Options: **************************************************    (4.1731)
                         Via C: *****                                                 (0.4433)
                     Via -fasm: ****                                                  (0.3733)

Cool. Now the results are interesting:

the difference between no optimizations and the optimized result is more than a factor of 10.
the best measurement was taken with -O2 -fexcess-precision (this uses the native code generator, not the C backend)
the final results include noise (e.g. without -fvia-C the -optc- flags have no effect)
GHC’s native code gen outperformed the GCC backend on this code.

It would be useful to have a “shrinking” phase at the end to remove noise (the way QuickCheck does). But for now we can do that by hand, meaning that acovea believes

ghc -O2 -fasm -fexcess-precision

is the way to go here. Let’s check the assembly for the best variants. For just the fibonacci function, where GHC first specialises it to Double, (where most of the benchmark’s time is spent) we get from the native code backend:

Main_zdwfib1_info:
.Lc1Iv:
  leaq -16(%rbp),%rax
  cmpq %r14,%rax
  jb .Lc1Ix
  ucomisd .Ln1IC(%rip),%xmm5
  jae .Lc1Iz
  movsd .Ln1ID(%rip),%xmm5
  jmp *(%rbp)
.Lc1Ix:
  movl $Main_zdwfib1_closure,%ebx
  jmp *-8(%r13)
.Lc1Iz:
  movsd %xmm5,%xmm0
  subsd .Ln1IE(%rip),%xmm0
  movsd %xmm5,-8(%rbp)
  movsd %xmm0,%xmm5
  movq $s1z8_info,-16(%rbp)
  addq $-16,%rbp
  jmp Main_zdwfib1_info

$ time ./A
0.3333
./A  0.33s user 0.00s system 99% cpu 0.339 total

Using my baseline -O2 -funbox-strict-fields -fvia-C -optc-O3 -optc-march=core2 options, we get, very interestingly,

Main_zdwfib1_info:
  leaq        -16(%rbp), %rax
  movq        %rbp, %rdx
  cmpq        %r14, %rax
  jb  .L54
  ucomisd     .LC1(%rip), %xmm5
  jb  .L58
.L55:
  movsd       %xmm5, -8(%rdx)
  movq        $s1z8_info, -16(%rbp)
  subsd       .LC2(%rip), %xmm5
  subq        $16, %rbp
  jmp Main_zdwfib1_info

$ time ./A
0.4366
./A  0.44s user 0.00s system 98% cpu 0.441 total

Which is certainly smaller, but also slower! And for the version suggested by Acovea, using just -O2 -fecess-precision (-fasm):

Main_zdwfib1_info:
.Lc1Iv:
  leaq -16(%rbp),%rax
  cmpq %r14,%rax
  jb .Lc1Ix
  ucomisd .Ln1IC(%rip),%xmm5
  jae .Lc1Iz
  movsd .Ln1ID(%rip),%xmm5
  jmp *(%rbp)
.Lc1Ix:
  movl $Main_zdwfib1_closure,%ebx
  jmp *-8(%r13)
.Lc1Iz:
  movsd %xmm5,%xmm0
  subsd .Ln1IE(%rip),%xmm0
  movsd %xmm5,-8(%rbp)
  movsd %xmm0,%xmm5
  movq $s1z8_info,-16(%rbp)
  addq $-16,%rbp
  jmp Main_zdwfib1_info

Which looks identical to -fasm and runs in the same time. So -fexcess-precision is noise here. Acovea declares -fasm beat the C backend here. That’s interesting: we’d already assumed -fvia-C was best for this benchmark, but that looks to be wrong. Progress!

We’ll now look at a full scale example: finding the best inlining strategy for some already highly optimised code.

Evolving the inliner

As we talked about before, the GHC inliner is a complex thing, but one with lots of optimization potential. By default, GHC tries to inline things using some magic SimonPJ heuristics, described in http://darcs.haskell.org/ghc/compiler/simplCore/Simplify.lhs. Since it’s often useful to override the inliner’s default strategy, GHC allows us to place INLINE and NOINLINE pragmas on functions, like so:

key_function :: Int -> String -> (Bool, Double)
{-# INLINE key_function #-}

Note you can also disable inlining on a function, or add a phase annotation to say in which optimization phase of the compiler the inlining should be fired. Looking at these is further work.

In high performance code, inlining can enable all sorts of interesting new optimizations to fire, and in some cases can turn GHC into a sort of whole program optimizer, by inlining entire libraries into the user code, and then specializing them precisely for that particular use. Fusion-enabled libraries like uvector work like this. The problem in general code though is knowing when an INLINE is going to help. And for this we want to get computer support.

The main trick to have acovea program our INLINE pragmas is to make sure we can turn them on and off via compiler flags. To do this, we’ll use CPP. That is, each potential INLINE point will be lifted into a CPP symbol, that acovea can then switch from its specification file.

In the source we’ll identify each inlining site with a new CPP symbol:

key_function :: Int -> String -> (Bool, Double)
INLINE_1

and then build a custom acovea xml file for the inline points in our program. By default, our command will disable all inlining points:

<prime command="ghc"
 flags="-cpp -v0 -optP-w --make -fforce-recomp -O2 -DINLINE_1= ACOVEA_OPTIONS -o ACOVEA_OUTPUT ACOVEA_INPUT" />

GHC will now call cpp first, with a bunch of acovea-determined INLINE points redefined via a flag:

<flag type="simple" value="-DINLINE_1={-# INLINE key_function #-}"          />

So now acovea can run our program, flicking switches to turn on and off inlining at each point we suggested it look at. It’ll then measure the result and try to find the best combination.

Results

Initially I tried this approach on two programs the Computer Language Benchmarks Game. The results are now available on the shootout itself at the time of writing. The faster programs (after inline optimization) are first. Note the k-nucleotide is a multicore parallel program.

First, n-body,

1.6	Haskell GHC #2	31.87	1,996	1717	31.87	0% 0% 0% 100%
1.6	Haskell GHC	32.07	2,008	1687	32.07	0% 100% 0% 0%

For k-nucleotide:

2.1	Haskell GHC #4	115.00	482,624	2751	46.48	27% 63% 73% 68%
2.6	Haskell GHC* #3*	121.83	427,224	2749	56.42	35% 57% 47% 64%

In both cases the result found by the GA was an improvement over the hand optimized (and inlined) version. In the case of k-nucleotide, it was 18% faster over the hand optimized version. For nbody, it was a marginal improvement (one better inlining site was found).

The specifications to acovea I used and both input programs are here:

nbody.hs (acovea spec)
knuc.hs (acovea spec)

For nbody, acovea decided that:

Optimistic options:
         -DINLINE_23={-# INLINE vel1 #-}  (2.973)

Pessimistic options:
      -DINLINE_24={-# INLINE advance #-}  (-3.238)

That is, there’s a free win if we inline vel1 (which was a let-bound variable used in two places). Instead of saving the result and reusing the value, we actually get marginally faster code if we just duplicate that small computation.

A relative graph of fitnesses:

     Acovea's Best-of-the-Best: *************************************************     (3.4864)
       Acovea's Common Options: *************************************************     (3.4631)
                      -optc-O3: **************************************************    (3.5031)

The improvement is marginal, but that one inlining site (vel1) is enough to make a consistent small difference. At first, acovea appeared to have found a dramatic improvement (from 3.4s to 2.1s). It quickly decided that it was a good idea to inline this:

cursor :: IORef (Ptr Double)
cursor = unsafePerformIO $ newIORef planets

Which sure makes the program faster, but means that ‘planets’ is no longer a global mutable array. The GA stumbled into the unsafePerformIO / INLINING trap!

For k-nucleotide, a more complex program, the results were more interesting, and more impressive. It suggested that:

Optimistic options:

     -DINLINE_1={-# INLINE htPayload #-}  (1.801)
    -DINLINE_2={-# INLINE allocEntry #-}  (1.556)
  -DINLINE_6={-# INLINE entryMatches #-}  (1.801)
      -DINLINE_9={-# INLINE calcHash #-}  (1.801)

Pessimistic options:

   -DINLINE_23={-# INLINE hashGenome #-}  (-2.106)

Which is interesting, since htPayload was not identified as a good inlining site in the original program (by hand), and hasGenome was! The hand optimized version was going awry. The final suggestions were:

-DINLINE_1={-# INLINE htPayload #-}
-DINLINE_2={-# INLINE allocEntry #-}
-DINLINE_6={-# INLINE entryMatches #-}
-DINLINE_8={-# INLINE totalEntriesOf #-}
-DINLINE_9={-# INLINE calcHash #-}
-DINLINE_13={-# INLINE wordSize #-}
-DINLINE_14={-# INLINE htTraverse #-}
-DINLINE_20={-# INLINE htInc #-}

only, corresponding to an improvement over the defaults of:

     Acovea's Best-of-the-Best: *******************************************           (2.8265)
       Acovea's Common Options: ****************************************              (2.6832)
                      -optc-O3: **************************************************    (3.2831)

Interestingly, the best-of-the-best was measured at the end as worse than the common options. Applying these inlining hints, and the result is now on the shootout as GHC #4.

And that’s it, we’re using a GA to evolve the inliner!

What next?

If we’re going to program the inliner, it can be beneficial to make transformations on functions that allow more inlining. In particular, manual worker/wrapper. There’s been some work already on using a genetic algorithm to evolve optimal strictness hints for GHC using a similar approach (hints that can be turned on and off). This should also be doable via acovea, using CPP once again to flick the switch.

Next steps are also to automate the construction of acovea specifications and running of timed Haskell programs, using a tool to emit the inlining, flag and runtime specs for a program. I also don’t know if using something like simulated annealing would give better results, or results sooner.

There are a number of GHC runtime flags that are interesting to mess with (heap size, number of generations, scheduling ticks, and for multicore programs, the number of cores to use, and how to tweak the parallel garbage collector. These are all runtime flags, and acovea only really lets us set things as ‘compile time’ flags – we’d have to bake them into the executable (doable with a C inclusion).

Finally, Tim Harris and Satnam Singh outlined an approach to discovering good `par` points to add paralleliism to Haskell programs, in Feedback Directed Implicit Parallelism (.pdf). They annotated the runtime to write logs of the actual costs of subexpressions in a profiling phase, and then used that information to place `par` hints on expressions in the source code. It seems to me that we could use the GA approach to breed `par` hints as well, discovering implicit parallelism in existing programs.

Coincidentally, after writing this article, I found that Tyson Whitehead had suggested the GA approach to improving inlining on the GHC mailing list only a couple of weeks ago. Also, there’s been some prior research on genetic algorithms for finding compiler heuristic values, and inliners in particular. See, for example, papers by John Cavasos
(e.g. “Automatic Tuning of Inlining Heuristics”, John Cavazos and Michael F. P. O’Boyle.). I’m sure there’s other work.

Playing with GHC’s parallel runtime

March 4, 2009March 4, 2009 ~ Don Stewart ~ 8 Comments

In this post you’ll get a bit of an idea how to:

make a Haskell program much faster by parallelising it
see how to analyse and use the SMP runtime flags GHC provides
mess with the parallel garbage collector

Ultimately we’ll make a program 4x faster on 4 cores by changing one line of code, using parallelism, and tuning the garbage collector.

Update: and since I began writing this GHC HQ (aka Simon, Simon and Satnam) have released “Runtime Support for Multicore Haskell” which finally puts on paper a lot of information that was previously just rumour. As a result, I’ve rewritten this article from scratch to use GHC 6.11 (today’s snapshot) since it is just so much faster and easier to use than 6.10.x.

Ready?

The new GHC garbage collector

The GHC 6.10 release notes contain the following text on runtime system changes:

The garbage collector can now use multiple threads in parallel. The new -gn RTS flag controls it, e.g. run your program with +RTS -g2 -RTS to use 2 threads. The -g option is implied by the usual -N option, so normally there will be no need to specify it separately, although occasionally it is useful to turn it off with -g1. Do let us know if you experience strange effects, especially an increase in GC time when using the parallel GC (use +RTS -s -RTS to measure GC time). See Section 5.14.3, “RTS options to control the garbage collector” for more details.

Interesting. Maybe this will have some impact on the shootout benchmarks.

Binary trees: single threaded

There’s one program that’s been bugging me for a while, where the garbage collector is a bottleneck: parallel binary-trees on the quad core Computer Language Benchmarks Game. This is a pretty straight forward program for testing out memory management of non-flat data types in a language runtime – and FP languages should do very well with their bump-and-allocate heaps. All you have to do is allocate and traverse a bunch of binary trees really. This kind of data:

data Tree = Nil | Node !Int !Tree !Tree

Note that the rules state we can’t use laziness to avoid making O(n) allocations at a time, so the benchmark will use a strict tree type – that’s fine – it only helps with a single core anyway. GHC will unbox those Int fields into the constructor too, with -funbox-strict-fields (should be implied by -O in my opinion). The benchmark itself is really quite easy to implement. Pattern matching makes allocating and wandering them trivial:

-- traverse the tree, counting up the nodes
check :: Tree -> Int
check Nil          = 0
check (Node i l r) = i + check l - check r

-- build a tree
make :: Int -> Int -> Tree
make i 0 = Node i Nil Nil
make i d = Node i (make (i2-1) d2) (make i2 d2)
  where i2 = 2*i
        d2 = d-1

The full code is here. So quite naive code, and fast… if we just look at this code running on the single core benchmark machine:

Functional language implementations taking up 4 of the top 6 slots, and edging out C (it’s even faster with lazy trees). You can try this for yourself:

whirlpool$ ghc -O2 --make A.hs
[1 of 1] Compiling Main             ( A.hs, A.o )
Linking A ...

whirlpool$ time ./A 16
stretch tree of depth 17	 check: -1
131072	 trees of depth 4	 check: -131072
32768	 trees of depth 6	 check: -32768
8192	 trees of depth 8	 check: -8192
2048	 trees of depth 10	 check: -2048
512	 trees of depth 12	 check: -512
128	 trees of depth 14	 check: -128
32	 trees of depth 16	 check: -32
long lived tree of depth 16	 check: -1
./A 16  1.26s user 0.03s system 100% cpu 1.291 total

I’m on a quad core Linux 2.6.26-1-amd64 x86_64 box, with:

whirlpool$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 6.11.20090302

If we take the value of N up to the N=20, it takes a while longer to run:

whirlpool$ time ./A 20
stretch tree of depth 21	 check: -1
2097152	 trees of depth 4	 check: -2097152
524288	 trees of depth 6	 check: -524288
131072	 trees of depth 8	 check: -131072
32768	 trees of depth 10	 check: -32768
8192	 trees of depth 12	 check: -8192
2048	 trees of depth 14	 check: -2048
512	 trees of depth 16	 check: -512
128	 trees of depth 18	 check: -128
32	 trees of depth 20	 check: -32
long lived tree of depth 20	 check: -1
./A 20  40.21s user 0.16s system 99% cpu 40.382 total

And of course we get no speed from the extra cores on the system yet. We’re only using 1/4 of the machine’s processing resources. The implementation contains no parallelisation strategy for GHC to use.

Binary trees in parallel

Since Haskell (especially pure Haskell like this) is easy to parallelise, and in general GHC Haskell is pretty zippy on multicore :-) let’s see what we can do to make this faster by parallelisation. It turns out, teaching this program to use multicore is ridiculously easy. All we have to change is one line! Where previously we computed the depth of all the trees between minN and maxN sequentially,

let vs = depth minN maxN

...

depth :: Int -> Int -> [(Int,Int,Int)]
depth d m
    | d <= m    = (2*n,d,sumT d n 0) : depth (d+2) m
    | otherwise = []
  where n = 1 `shiftL` (m - d + minN)

Which yields a list of tree results sequentially, we instead step back, and compute the separate trees in parallel using parMap:

let vs = parMap rnf id $ depth minN maxN

From Control.Parallel.Strategies, parMap forks sparks for each (expensive) computation in the list, evaluating them in parallel to normal form. This technique uses sparks – lazy futures – to hint to the runtime that it might be a good idea to evaluate each subcomputation in parallel. When the runtime spots that there are spare threads, it’ll pick up the sparks, and run them. With +RTS -N4, those sparks (in this case, 9 of them) will get scheduled over 4 cores. You can find out more about this style of parallel programming in ch24 of Real World Haskell, in Algorithm + Strategy = Parallelism and now in the new GHC HQ runtime paper.

Running parallel binary trees

Now that we’ve modified the implementation to contain a parallel evaluation strategy,all we have to do is compile it against the threaded GHC runtime, and those sparks will be picked up by the scheduler, and dropped into real threads distributed across the cores. We can try it using 2/4 cores:

whirlpool$ ghc -O2 -threaded A.hs --make -fforce-recomp

whirlpool$ time ./A 16 +RTS -N2
stretch tree of depth 17	 check: -1
131072	 trees of depth 4	 check: -131072
32768	 trees of depth 6	 check: -32768
8192	 trees of depth 8	 check: -8192
2048	 trees of depth 10	 check: -2048
512	 trees of depth 12	 check: -512
128	 trees of depth 14	 check: -128
32	 trees of depth 16	 check: -32
long lived tree of depth 16	 check: -1
./A 16 +RTS -N2  1.34s user 0.02s system 124% cpu 1.094 total

Hmm, a little faster at N=16, and > 100% cpu. Trying again with 4 cores:

whirlpool$ time ./A 16 +RTS -N4
stretch tree of depth 17	 check: -1
131072	 trees of depth 4	 check: -131072
32768	 trees of depth 6	 check: -32768
8192	 trees of depth 8	 check: -8192
2048	 trees of depth 10	 check: -2048
512	 trees of depth 12	 check: -512
128	 trees of depth 14	 check: -128
32	 trees of depth 16	 check: -32
long lived tree of depth 16	 check: -1
./A 16 +RTS -N4  2.89s user 0.06s system 239% cpu 1.229 total

Hmm… so it got only a little faster with 2 cores at N=16, but about the same with 4 cores. At N=20 we see similar results:

whirlpool$ time ./A 20 +RTS -N4
stretch tree of depth 21	 check: -1
2097152	 trees of depth 4	 check: -2097152
524288	 trees of depth 6	 check: -524288
131072	 trees of depth 8	 check: -131072
32768	 trees of depth 10	 check: -32768
8192	 trees of depth 12	 check: -8192
2048	 trees of depth 14	 check: -2048
512	 trees of depth 16	 check: -512
128	 trees of depth 18	 check: -128
32	 trees of depth 20	 check: -32
long lived tree of depth 20	 check: -1
./A 20 +RTS -N4  96.61s user 0.93s system 239% cpu 40.778 total

So still 40s, at 239% cpu. So we made something hot. And you can see a similar result at N=20 on the current quad core shootout binary-trees entry. Jobs distributed across the cores, but not much better runtime. A little better than the single core entry, but only a little. And in the middle of the pack, and 2x slower than C!

Meanwhile, on the single core, it’s in 3rd place, ahead of C and C++. So what’s going on?

Listening to the garbage collector

We’ve parallelised this logically well, so I’m not prepared to abandon the top-level parMap strategy. Instead, let’s look deeper. One clue about what is going on is the cpu utilisation in the shootout program:

6.8

Haskell GHC #2

53.36

403,944

544

40.18

21% 45% 21% 41%

Those aren’t very good numbers – we’re using all the cores, but not very well. So the program’s doing something other than just number crunching. A good suspect is that there’s lots of GC traffic happening (after all, a lot of trees are being allocated!). We can confirm this hunch with +RTS -sstderr which prints lots of interesting statistics about what the program did:

whirlpool$ time ./A 16 +RTS -N4 -sstderr

./A 16 +RTS -N4 -sstderr
     946,644,112 bytes allocated in the heap
     484,565,352 bytes copied during GC
       8,767,512 bytes maximum residency (23 sample(s))
          95,720 bytes maximum slop
              27 MB total memory in use (1 MB lost due to fragmentation)

  Generation 0:   674 collections,     0 parallel,  0.54s,  0.55s elapsed
  Generation 1:    23 collections,    22 parallel,  0.57s,  0.16s elapsed

  Parallel GC work balance: 1.56 (17151829 / 10999322, ideal 4)

  Task  0 (worker) :  MUT time:   0.36s  (  0.39s elapsed)
                      GC  time:   0.28s  (  0.13s elapsed)
  Task  1 (worker) :  MUT time:   0.67s  (  0.43s elapsed)
                      GC  time:   0.14s  (  0.14s elapsed)
  Task  2 (worker) :  MUT time:   0.01s  (  0.43s elapsed)
                      GC  time:   0.09s  (  0.08s elapsed)
  Task  3 (worker) :  MUT time:   0.00s  (  0.43s elapsed)
                      GC  time:   0.00s  (  0.00s elapsed)
  Task  4 (worker) :  MUT time:   0.31s  (  0.43s elapsed)
                      GC  time:   0.00s  (  0.00s elapsed)
  Task  5 (worker) :  MUT time:   0.22s  (  0.43s elapsed)
                      GC  time:   0.60s  (  0.37s elapsed)

  SPARKS: 7 (7 converted, 0 pruned)

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    1.02s  (  0.43s elapsed)
  GC    time    1.12s  (  0.71s elapsed)
  EXIT  time    0.32s  (  0.03s elapsed)
  Total time    2.45s  (  1.17s elapsed)

  %GC time      45.5%  (60.7% elapsed)

  Alloc rate    708,520,343 bytes per MUT second

  Productivity  54.5% of total user, 114.1% of total elapsed

gc_alloc_block_sync: 35082
whitehole_spin: 0
gen[0].steps[0].sync_todo: 0
gen[0].steps[0].sync_large_objects: 0
gen[0].steps[1].sync_todo: 1123
gen[0].steps[1].sync_large_objects: 0
gen[1].steps[0].sync_todo: 6318
gen[1].steps[0].sync_large_objects: 0
./A 16 +RTS -N4 -sstderr  2.76s user 0.08s system 241% cpu 1.176 total

At N=16, the program is spending 45% of its time doing garbage collection. That’s a problem. We can also see some other things:

7 sparks are being created by our parMap, all of which are turned into real threads
The parallel GC does get a chance to run in parallel 22 times.

And at N=20, the benchmark number, things aren’t any better:

  19,439,350,240 bytes allocated in the heap
  21,891,579,896 bytes copied during GC
     134,688,800 bytes maximum residency (89 sample(s))
         940,344 bytes maximum slop
             376 MB total memory in use (6 MB lost due to fragmentation)

  Generation 0: 14576 collections,     0 parallel, 20.67s, 20.62s elapsed
  Generation 1:    89 collections,    88 parallel, 36.33s,  9.20s elapsed

  SPARKS: 9 (9 converted, 0 pruned)

  %GC time      64.0%  (74.8% elapsed)

So yikes, we’re wasting a lot of time cleaning up after ourselves (though happily our par strategy isn’t wasting any fizzled sparks). Diving into the GC docs, we see:

Bigger heaps work better with parallel GC, so set your -H value high (3 or more times the maximum residency)

Ok. Let’s try to get that number down.

Helping out the GC

We can see how much to make a guess at by looking at the maximum residency stats. A good start might be 400M:

whirlpool$ time ./A 20 +RTS -N4 -H400M
stretch tree of depth 21	 check: -1
2097152	 trees of depth 4	 check: -2097152
524288	 trees of depth 6	 check: -524288
131072	 trees of depth 8	 check: -131072
32768	 trees of depth 10	 check: -32768
8192	 trees of depth 12	 check: -8192
2048	 trees of depth 14	 check: -2048
512	 trees of depth 16	 check: -512
128	 trees of depth 18	 check: -128
32	 trees of depth 20	 check: -32
long lived tree of depth 20	 check: -1
./A 20 +RTS -N4 -H400M  35.25s user 0.42s system 281% cpu 12.652 total

Ok, so that was pretty easy. Runtime has gone from 40s to 12s, and why? Looking at +RTS -sstderr:

  %GC time       6.8%  (18.6% elapsed)
  Generation 0:    86 collections,     0 parallel,  2.07s,  2.23s elapsed
  Generation 1:     3 collections,     2 parallel,  0.34s,  0.10s elapsed

GC time is down under 10% too, which is a good rule. For the original N=16, with its smaller number of trees, which was taking 1.29s, is now down to:

whirlpool$ time ./A 16 +RTS -N4 -H400M
stretch tree of depth 17     check: -1
131072     trees of depth 4     check: -131072
32768     trees of depth 6     check: -32768
8192     trees of depth 8     check: -8192
2048     trees of depth 10     check: -2048
512     trees of depth 12     check: -512
128     trees of depth 14     check: -128
32     trees of depth 16     check: -32
long lived tree of depth 16     check: -1
./A 16 +RTS -N4 -H400M  1.26s user 0.38s system 285% cpu 0.575 total

So this is a reasonable stopping point.

The lessons

parMap can be quite effective and easy as a parallelisation strategy
if you’ve a reasonable parallelisation strategy, but not getting the performance, check what the GC is doing.

And as a final remark, we can look forward to what’s around the corner for GHC:

12.1 Independent GC

… We fully intend to pursue CPU-independent GC in the future … moving to more explicitly-separate heap regions is a more honest reflection of the underlying memory architecture …

So hopefully soon each core will be collecting its own binary trees.

References

Complete details of the new GC are in the paper, and particularly the new RTS paper:

Parallel generational-copying garbage collection with a block-structured heap,Simon Marlow, Tim Harris, Roshan P. James, Simon Peyton Jones, International Symposium on Memory Management 2008.
Runtime Support for Multicore Haskell, Simon Marlow, Simon Peyton Jones and Satnam Singh. Submitted to ICFP 09.

And as a final teaser, more on the multicore Haskell story this week: