March 2009 – Page 2 – Control.Monad.Writer

In this post you’ll get a bit of an idea how to:

make a Haskell program much faster by parallelising it
analyse and use the SMP runtime flags GHC provides
mess with the parallel garbage collector

Ultimately we’ll make a program more than 3x faster on 4 cores (from about 40s down to under 13s) by changing one line of code, using parallelism, and tuning the garbage collector.

Update: and since I began writing this GHC HQ (aka Simon, Simon and Satnam) have released “Runtime Support for Multicore Haskell” which finally puts on paper a lot of information that was previously just rumour. As a result, I’ve rewritten this article from scratch to use GHC 6.11 (today’s snapshot) since it is just so much faster and easier to use than 6.10.x.

Ready?

The new GHC garbage collector

The GHC 6.10 release notes contain the following text on runtime system changes:

The garbage collector can now use multiple threads in parallel. The new -gn RTS flag controls it, e.g. run your program with +RTS -g2 -RTS to use 2 threads. The -g option is implied by the usual -N option, so normally there will be no need to specify it separately, although occasionally it is useful to turn it off with -g1. Do let us know if you experience strange effects, especially an increase in GC time when using the parallel GC (use +RTS -s -RTS to measure GC time). See Section 5.14.3, “RTS options to control the garbage collector” for more details.

Interesting. Maybe this will have some impact on the shootout benchmarks.

Binary trees: single threaded

There’s one program that’s been bugging me for a while, where the garbage collector is a bottleneck: parallel binary-trees on the quad core Computer Language Benchmarks Game. This is a pretty straightforward program for testing out memory management of non-flat data types in a language runtime – and FP languages should do very well with their bump-and-allocate heaps. All you really have to do is allocate and traverse a bunch of binary trees. This kind of data:

data Tree = Nil | Node !Int !Tree !Tree

Note that the rules state we can’t use laziness to avoid making O(n) allocations at a time, so the benchmark will use a strict tree type – that’s fine – it only helps with a single core anyway. GHC will unbox those Int fields into the constructor too, with -funbox-strict-fields (should be implied by -O in my opinion). The benchmark itself is really quite easy to implement. Pattern matching makes allocating and wandering them trivial:

-- traverse the tree, folding the node labels into a checksum
check :: Tree -> Int
check Nil          = 0
check (Node i l r) = i + check l - check r

-- build a tree
make :: Int -> Int -> Tree
make i 0 = Node i Nil Nil
make i d = Node i (make (i2-1) d2) (make i2 d2)
  where i2 = 2*i
        d2 = d-1
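To play with these two functions on their own, here is a minimal self-contained sketch. The type, make and check are as above; the size helper and the main driver are my own additions for illustration and are not part of the benchmark entry.

```haskell
-- Strict binary tree, as defined in the post.
data Tree = Nil | Node !Int !Tree !Tree

-- Build a complete tree of depth d whose root carries the label i.
make :: Int -> Int -> Tree
make i 0 = Node i Nil Nil
make i d = Node i (make (i2 - 1) d2) (make i2 d2)
  where
    i2 = 2 * i
    d2 = d - 1

-- Walk the tree, folding the node labels into a checksum.
check :: Tree -> Int
check Nil          = 0
check (Node i l r) = i + check l - check r

-- Hypothetical helper (not in the benchmark): count the nodes.
size :: Tree -> Int
size Nil          = 0
size (Node _ l r) = 1 + size l + size r

main :: IO ()
main = do
  let t = make 0 4
  putStrLn ("check: " ++ show (check t))  -- prints "check: -1"
  putStrLn ("nodes: " ++ show (size t))   -- prints "nodes: 31"
```

For any depth of at least 1, check (make i d) works out to i - 1, which is why a tree rooted at 0 always reports -1 in the benchmark output.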

The full code is here. So quite naive code, and fast… if we just look at this code running on the single core benchmark machine:

Functional language implementations taking up 4 of the top 6 slots, and edging out C (it’s even faster with lazy trees). You can try this for yourself:

whirlpool$ ghc -O2 --make A.hs
[1 of 1] Compiling Main             ( A.hs, A.o )
Linking A ...

whirlpool$ time ./A 16
stretch tree of depth 17	 check: -1
131072	 trees of depth 4	 check: -131072
32768	 trees of depth 6	 check: -32768
8192	 trees of depth 8	 check: -8192
2048	 trees of depth 10	 check: -2048
512	 trees of depth 12	 check: -512
128	 trees of depth 14	 check: -128
32	 trees of depth 16	 check: -32
long lived tree of depth 16	 check: -1
./A 16  1.26s user 0.03s system 100% cpu 1.291 total

I’m on a quad core Linux 2.6.26-1-amd64 x86_64 box, with:

whirlpool$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 6.11.20090302

If we take N up to 20, it takes a while longer to run:

whirlpool$ time ./A 20
stretch tree of depth 21	 check: -1
2097152	 trees of depth 4	 check: -2097152
524288	 trees of depth 6	 check: -524288
131072	 trees of depth 8	 check: -131072
32768	 trees of depth 10	 check: -32768
8192	 trees of depth 12	 check: -8192
2048	 trees of depth 14	 check: -2048
512	 trees of depth 16	 check: -512
128	 trees of depth 18	 check: -128
32	 trees of depth 20	 check: -32
long lived tree of depth 20	 check: -1
./A 20  40.21s user 0.16s system 99% cpu 40.382 total

And of course we get no speedup from the extra cores on the system yet. We’re only using 1/4 of the machine’s processing resources. The implementation contains no parallelisation strategy for GHC to use.
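If you want to confirm how much of the machine a given binary can use, a small sketch like the following reports what the runtime thinks it has. It assumes a reasonably recent base library (getNumCapabilities and getNumProcessors are newer than the GHC snapshot used in this post), and describeRuntime is a hypothetical helper name of my own.

```haskell
import Control.Concurrent (getNumCapabilities, rtsSupportsBoundThreads)
import GHC.Conc (getNumProcessors)

-- Hypothetical helper: summarise the runtime configuration in one line.
describeRuntime :: Bool -> Int -> Int -> String
describeRuntime threaded caps procs =
  (if threaded then "threaded" else "non-threaded")
    ++ " runtime, " ++ show caps ++ " of " ++ show procs ++ " cores in use"

main :: IO ()
main = do
  caps  <- getNumCapabilities   -- the value set by +RTS -N, 1 by default
  procs <- getNumProcessors     -- what the operating system reports
  -- rtsSupportsBoundThreads is True when the program was linked with -threaded.
  putStrLn (describeRuntime rtsSupportsBoundThreads caps procs)
```

Built without -threaded on a quad core box this should report one capability out of four, which matches the 100% cpu figure in the timings above.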

Binary trees in parallel

Since Haskell (especially pure Haskell like this) is easy to parallelise, and in general GHC Haskell is pretty zippy on multicore :-) let’s see what we can do to make this faster by parallelisation. It turns out that teaching this program to use multicore is ridiculously easy. All we have to change is one line! Previously we computed the results for every tree depth between minN and maxN sequentially:

let vs = depth minN maxN

...

depth :: Int -> Int -> [(Int,Int,Int)]
depth d m
    | d <= m    = (2*n,d,sumT d n 0) : depth (d+2) m
    | otherwise = []
  where n = 1 `shiftL` (m - d + minN)

That yields a list of tree results sequentially. Instead, we step back and compute the separate trees in parallel using parMap:

let vs = parMap rnf id $ depth minN maxN

From Control.Parallel.Strategies, parMap forks sparks for each (expensive) computation in the list, evaluating them in parallel to normal form. This technique uses sparks – lazy futures – to hint to the runtime that it might be a good idea to evaluate each subcomputation in parallel. When the runtime spots that there are spare threads, it’ll pick up the sparks, and run them. With +RTS -N4, those sparks (in this case, 9 of them) will get scheduled over 4 cores. You can find out more about this style of parallel programming in ch24 of Real World Haskell, in Algorithm + Strategy = Parallelism and now in the new GHC HQ runtime paper.
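To make the spark mechanism concrete, here is a rough sketch of the same idea using only par and pseq from GHC.Conc, so it needs nothing beyond base. It is not the code of the benchmark entry: parMapSketch and forceTriple are hypothetical names of mine, and minN = 4 and the body of sumT are assumptions chosen to be consistent with the output printed earlier. With current releases of the parallel package, I believe the one-line change would be spelled parMap rdeepseq id, since rnf is no longer a strategy there.

```haskell
import Data.Bits (shiftL)
import GHC.Conc (par, pseq)

data Tree = Nil | Node !Int !Tree !Tree

make :: Int -> Int -> Tree
make i 0 = Node i Nil Nil
make i d = Node i (make (2 * i - 1) (d - 1)) (make (2 * i) (d - 1))

check :: Tree -> Int
check Nil          = 0
check (Node i l r) = i + check l - check r

-- Assumed value: the post does not show the definition of minN.
minN :: Int
minN = 4

-- Assumed helper: allocate and check 2*i trees of depth d, summing the results.
sumT :: Int -> Int -> Int -> Int
sumT _ 0 t = t
sumT d i t = t' `seq` sumT d (i - 1) t'
  where
    t' = t + check (make i d) + check (make (negate i) d)

-- One (tree count, depth, checksum) triple per depth, as in the post.
depth :: Int -> Int -> [(Int, Int, Int)]
depth d m
  | d <= m    = (2 * n, d, sumT d n 0) : depth (d + 2) m
  | otherwise = []
  where n = 1 `shiftL` (m - d + minN)

-- Evaluating the result to weak head normal form forces all three fields.
forceTriple :: (Int, Int, Int) -> (Int, Int, Int)
forceTriple (a, b, c) = a `seq` b `seq` c `seq` (a, b, c)

-- Spark each element, then walk the rest of the list to spark the others.
parMapSketch :: (a -> b) -> [a] -> [b]
parMapSketch _ []       = []
parMapSketch f (x : xs) = y `par` (ys `pseq` (y : ys))
  where
    y  = f x
    ys = parMapSketch f xs

main :: IO ()
main = do
  let maxN = 10
      vs   = parMapSketch forceTriple (depth minN maxN)
  mapM_ (\(n, d, c) -> putStrLn (show n ++ "\t trees of depth " ++ show d
                                  ++ "\t check: " ++ show c)) vs
```

Each call to par only records a spark. Whether it runs on another core depends on building with -threaded and running with more than one capability, and the results are the same either way.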

Running parallel binary trees

Now that we’ve modified the implementation to contain a parallel evaluation strategy, all we have to do is compile it against the threaded GHC runtime, and those sparks will be picked up by the scheduler, and dropped into real threads distributed across the cores. We can try it using 2 of the 4 cores:

whirlpool$ ghc -O2 -threaded A.hs --make -fforce-recomp

whirlpool$ time ./A 16 +RTS -N2
stretch tree of depth 17	 check: -1
131072	 trees of depth 4	 check: -131072
32768	 trees of depth 6	 check: -32768
8192	 trees of depth 8	 check: -8192
2048	 trees of depth 10	 check: -2048
512	 trees of depth 12	 check: -512
128	 trees of depth 14	 check: -128
32	 trees of depth 16	 check: -32
long lived tree of depth 16	 check: -1
./A 16 +RTS -N2  1.34s user 0.02s system 124% cpu 1.094 total

Hmm, a little faster at N=16, and > 100% cpu. Trying again with 4 cores:

whirlpool$ time ./A 16 +RTS -N4
stretch tree of depth 17	 check: -1
131072	 trees of depth 4	 check: -131072
32768	 trees of depth 6	 check: -32768
8192	 trees of depth 8	 check: -8192
2048	 trees of depth 10	 check: -2048
512	 trees of depth 12	 check: -512
128	 trees of depth 14	 check: -128
32	 trees of depth 16	 check: -32
long lived tree of depth 16	 check: -1
./A 16 +RTS -N4  2.89s user 0.06s system 239% cpu 1.229 total

Hmm… so it got only a little faster with 2 cores at N=16, but about the same with 4 cores. At N=20 we see similar results:

whirlpool$ time ./A 20 +RTS -N4
stretch tree of depth 21	 check: -1
2097152	 trees of depth 4	 check: -2097152
524288	 trees of depth 6	 check: -524288
131072	 trees of depth 8	 check: -131072
32768	 trees of depth 10	 check: -32768
8192	 trees of depth 12	 check: -8192
2048	 trees of depth 14	 check: -2048
512	 trees of depth 16	 check: -512
128	 trees of depth 18	 check: -128
32	 trees of depth 20	 check: -32
long lived tree of depth 20	 check: -1
./A 20 +RTS -N4  96.61s user 0.93s system 239% cpu 40.778 total

So still 40s, at 239% cpu. So we made something hot. And you can see a similar result at N=20 on the current quad core shootout binary-trees entry. Jobs distributed across the cores, but not much better runtime. A little better than the single core entry, but only a little. And in the middle of the pack, and 2x slower than C!

Meanwhile, on the single core, it’s in 3rd place, ahead of C and C++. So what’s going on?

Listening to the garbage collector

We’ve parallelised this logically well, so I’m not prepared to abandon the top-level parMap strategy. Instead, let’s look deeper. One clue about what is going on is the cpu utilisation in the shootout program:

×     Program          CPU secs   Memory KB   Size B   Elapsed secs   ~ CPU Load
6.8   Haskell GHC #2   53.36      403,944     544      40.18          21% 45% 21% 41%

Those aren’t very good numbers – we’re using all the cores, but not very well. So the program’s doing something other than just number crunching. A good suspect is that there’s lots of GC traffic happening (after all, a lot of trees are being allocated!). We can confirm this hunch with +RTS -sstderr which prints lots of interesting statistics about what the program did:

whirlpool$ time ./A 16 +RTS -N4 -sstderr

./A 16 +RTS -N4 -sstderr
     946,644,112 bytes allocated in the heap
     484,565,352 bytes copied during GC
       8,767,512 bytes maximum residency (23 sample(s))
          95,720 bytes maximum slop
              27 MB total memory in use (1 MB lost due to fragmentation)

  Generation 0:   674 collections,     0 parallel,  0.54s,  0.55s elapsed
  Generation 1:    23 collections,    22 parallel,  0.57s,  0.16s elapsed

  Parallel GC work balance: 1.56 (17151829 / 10999322, ideal 4)

  Task  0 (worker) :  MUT time:   0.36s  (  0.39s elapsed)
                      GC  time:   0.28s  (  0.13s elapsed)
  Task  1 (worker) :  MUT time:   0.67s  (  0.43s elapsed)
                      GC  time:   0.14s  (  0.14s elapsed)
  Task  2 (worker) :  MUT time:   0.01s  (  0.43s elapsed)
                      GC  time:   0.09s  (  0.08s elapsed)
  Task  3 (worker) :  MUT time:   0.00s  (  0.43s elapsed)
                      GC  time:   0.00s  (  0.00s elapsed)
  Task  4 (worker) :  MUT time:   0.31s  (  0.43s elapsed)
                      GC  time:   0.00s  (  0.00s elapsed)
  Task  5 (worker) :  MUT time:   0.22s  (  0.43s elapsed)
                      GC  time:   0.60s  (  0.37s elapsed)

  SPARKS: 7 (7 converted, 0 pruned)

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    1.02s  (  0.43s elapsed)
  GC    time    1.12s  (  0.71s elapsed)
  EXIT  time    0.32s  (  0.03s elapsed)
  Total time    2.45s  (  1.17s elapsed)

  %GC time      45.5%  (60.7% elapsed)

  Alloc rate    708,520,343 bytes per MUT second

  Productivity  54.5% of total user, 114.1% of total elapsed

gc_alloc_block_sync: 35082
whitehole_spin: 0
gen[0].steps[0].sync_todo: 0
gen[0].steps[0].sync_large_objects: 0
gen[0].steps[1].sync_todo: 1123
gen[0].steps[1].sync_large_objects: 0
gen[1].steps[0].sync_todo: 6318
gen[1].steps[0].sync_large_objects: 0
./A 16 +RTS -N4 -sstderr  2.76s user 0.08s system 241% cpu 1.176 total

At N=16, the program is spending 45% of its time doing garbage collection. That’s a problem. We can also see some other things:

7 sparks are being created by our parMap, all of which are turned into real threads
The parallel GC does get a chance to run in parallel 22 times.
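The same kind of figures can also be read from inside a program. The sketch below assumes a much newer GHC than the one used here (the GHC.Stats module with getRTSStats, base 4.10 or later) and that the program is run with +RTS -T; on current compilers you will probably also need to link with -rtsopts before flags such as -T, -N or -H are accepted. gcPercent is a hypothetical helper, and the loop is just a stand-in workload.

```haskell
import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)

-- Hypothetical helper: share of CPU time spent in the garbage collector.
gcPercent :: Integer -> Integer -> Double
gcPercent gcNs totalNs
  | totalNs <= 0 = 0
  | otherwise    = 100 * fromIntegral gcNs / fromIntegral totalNs

main :: IO ()
main = do
  -- Stand-in workload that allocates a fair amount of short-lived data.
  print (sum [length (show i) | i <- [1 .. 200000 :: Int]])
  enabled <- getRTSStatsEnabled
  if not enabled
    then putStrLn "RTS statistics are off; rerun with +RTS -T"
    else do
      s <- getRTSStats
      putStrLn ("collections:           " ++ show (gcs s))
      putStrLn ("major collections:     " ++ show (major_gcs s))
      putStrLn ("max residency (bytes): " ++ show (max_live_bytes s))
      putStrLn ("%GC time:              "
                 ++ show (gcPercent (toInteger (gc_cpu_ns s)) (toInteger (cpu_ns s))))
```

This is handy when you want a program to log its own GC share, but for a one-off investigation the -sstderr output is less work.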

And at N=20, the benchmark number, things aren’t any better:

  19,439,350,240 bytes allocated in the heap
  21,891,579,896 bytes copied during GC
     134,688,800 bytes maximum residency (89 sample(s))
         940,344 bytes maximum slop
             376 MB total memory in use (6 MB lost due to fragmentation)

  Generation 0: 14576 collections,     0 parallel, 20.67s, 20.62s elapsed
  Generation 1:    89 collections,    88 parallel, 36.33s,  9.20s elapsed

  SPARKS: 9 (9 converted, 0 pruned)

  %GC time      64.0%  (74.8% elapsed)

So yikes, we’re wasting a lot of time cleaning up after ourselves (though happily our par strategy isn’t wasting any fizzled sparks). Diving into the GC docs, we see:

Bigger heaps work better with parallel GC, so set your -H value high (3 or more times the maximum residency)

Ok. Let’s try to get that GC time down.

Helping out the GC

We can guess how big to make the heap by looking at the maximum residency statistic. At N=20 that was about 135 MB, and three times that is roughly 400 MB, so a good start might be 400M:

whirlpool$ time ./A 20 +RTS -N4 -H400M
stretch tree of depth 21	 check: -1
2097152	 trees of depth 4	 check: -2097152
524288	 trees of depth 6	 check: -524288
131072	 trees of depth 8	 check: -131072
32768	 trees of depth 10	 check: -32768
8192	 trees of depth 12	 check: -8192
2048	 trees of depth 14	 check: -2048
512	 trees of depth 16	 check: -512
128	 trees of depth 18	 check: -128
32	 trees of depth 20	 check: -32
long lived tree of depth 20	 check: -1
./A 20 +RTS -N4 -H400M  35.25s user 0.42s system 281% cpu 12.652 total

Ok, so that was pretty easy. Runtime has gone from 40s to under 13s. Why? Looking at +RTS -sstderr:

  %GC time       6.8%  (18.6% elapsed)
  Generation 0:    86 collections,     0 parallel,  2.07s,  2.23s elapsed
  Generation 1:     3 collections,     2 parallel,  0.34s,  0.10s elapsed

GC time is now under 10% too, which is a good rule of thumb. The original N=16 run, with its smaller number of trees, took 1.29s and is now down to:

whirlpool$ time ./A 16 +RTS -N4 -H400M
stretch tree of depth 17     check: -1
131072     trees of depth 4     check: -131072
32768     trees of depth 6     check: -32768
8192     trees of depth 8     check: -8192
2048     trees of depth 10     check: -2048
512     trees of depth 12     check: -512
128     trees of depth 14     check: -128
32     trees of depth 16     check: -32
long lived tree of depth 16     check: -1
./A 16 +RTS -N4 -H400M  1.26s user 0.38s system 285% cpu 0.575 total

So this is a reasonable stopping point.
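The arithmetic behind this section is simple enough to write down. The sketch below applies the "3 or more times the maximum residency" rule to the N=20 residency figure and recomputes the speedups from the elapsed times shown above. Both function names are hypothetical, and I am assuming the M suffix on -H means 2^20 bytes.

```haskell
-- Hypothetical helper: three times the maximum residency, rounded up to whole MiB.
suggestedHeapMiB :: Integer -> Integer
suggestedHeapMiB maxResidencyBytes = (3 * maxResidencyBytes + mib - 1) `div` mib
  where mib = 1024 * 1024

-- Ratio of elapsed time before tuning to elapsed time after.
speedup :: Double -> Double -> Double
speedup before after = before / after

main :: IO ()
main = do
  -- 134,688,800 bytes was the maximum residency reported at N=20.
  putStrLn ("suggested -H (MiB): " ++ show (suggestedHeapMiB 134688800))  -- 386
  -- Elapsed seconds: single core versus -N4 -H400M.
  putStrLn ("speedup at N=20:    " ++ show (speedup 40.382 12.652))
  putStrLn ("speedup at N=16:    " ++ show (speedup 1.291 0.575))
```

So 400M is a round figure slightly above the 3x rule, and with these timings the gain is a bit over 3x at N=20 and a bit over 2x at N=16.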

The lessons

parMap can be quite effective and easy as a parallelisation strategy.
If you have a reasonable parallelisation strategy but aren’t getting the performance, check what the GC is doing.

And as a final remark, we can look forward to what’s around the corner for GHC:

12.1 Independent GC

… We fully intend to pursue CPU-independent GC in the future … moving to more explicitly-separate heap regions is a more honest reflection of the underlying memory architecture …

So hopefully soon each core will be collecting its own binary trees.

References

Complete details of the new GC are in the paper, and particularly the new RTS paper:

Parallel generational-copying garbage collection with a block-structured heap, Simon Marlow, Tim Harris, Roshan P. James, Simon Peyton Jones, International Symposium on Memory Management 2008.
Runtime Support for Multicore Haskell, Simon Marlow, Simon Peyton Jones and Satnam Singh. Submitted to ICFP 09.

And as a final teaser, more on the multicore Haskell story this week:


Playing with GHC’s parallel runtime