No monadic headaches: multi-core concurrency is easy in Haskell

Over on Good Math, Bad Math, MarkCC writes about Erlang and Haskell:

Haskell systems don’t, in general, parallelize well. They’re particularly bad for the kind of very coarse thread-based concurrency that we need to program for on multi-core computers, or distributed systems.

I suspect Mark hasn’t written much multicore concurrent Haskell code, since he resorts to the usual “monads are hard” argument rather than offering example code. Here in the real world, however, concurrent programming in Haskell looks much the same as it does in Erlang, with nicer syntax, in a more modern language.

Importantly, there are no monadic headaches involved (monads don’t even enter the picture). And the results often outperform Erlang (and most other languages), thanks to native code compilation with type erasure, and better data structure support (particularly for strings) from Haskell libraries, simply not available in Erlang.

So let’s look at the basic tools at our disposal.

First, the Erlang code. Mark begins with an introduction to the syntax:

    -module(map).
    -export([map/2, two/1]).

    map(F, []) -> [];
    map(F,[Car|Cdr]) ->
       NewCar = apply(F, [Car]),
       NewCdr = map(F,Cdr),
       [NewCar|NewCdr].

    two(X) -> 2*X.

This looks a bit like antique Haskell, with funny numbers in the export list instead of a type system. Rewriting it in Haskell:

    module Map (map, two) where

    import Prelude hiding (map)

    map _ []     = []
    map f (x:xs) = f x : map f xs

    two = (2*)
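
Loading this in GHCi, it behaves as you’d expect:

    *Map> map two [1,2,3]
    [2,4,6]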

Ok, so: simpler and cleaner. (I mean, really, using ‘apply’ for function application!?)

Now, let’s look at the Erlang for forking a simple process that receives messages in a loop:

    -module(echo).
    -export([start/0, loop/0]).

    start() ->
      spawn(echo, loop, []).

    loop() ->
       receive
          {From, Message} ->
             From ! Message,
             loop()
       end.

This translates directly to Haskell’s forkIO, with MVars for communication. The main thread forks a second Haskell thread, which sits receiving messages and printing them, while the main thread feeds it values:

    import Control.Concurrent
    import Control.Monad (forever)

    main = do
        box <- newEmptyMVar
        forkIO (echo box)
        mapM_ (putMVar box) [1..10 :: Int] -- feed it some messages
        threadDelay 100000                 -- let the echo thread drain

    echo c = forever $ do
        msg <- takeMVar c
        print msg

‘spawn’ maps to ‘forkIO’, and Erlang’s built-in message passing maps to a communication mechanism of your choice in Haskell. We chose MVars here, but you could also have gone with channels (Chan) or software transactional memory (STM).
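
For instance, a minimal sketch of the same echo process over a Chan (this variant is ours, not from Mark’s post):

    import Control.Concurrent
    import Control.Concurrent.Chan
    import Control.Monad (forever)

    -- the same echo process, reading from an unbounded channel
    echo c = forever $ do
        msg <- readChan c
        print msg

    main = do
        c <- newChan
        forkIO (echo c)
        mapM_ (writeChan c) ["hello", "world"]
        threadDelay 100000 -- give the echo thread a moment to print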

To run such a threaded Haskell program across multiple cores (the coarse-grained, multicore concurrency Mark talks about), we need do nothing more than compile it against the SMP runtime:

    $ ghc -threaded -O Concurrent.hs

Now that’s compiled to real native code, on an SMP parallel runtime! We can run this over as many cores as we have, and GHC will schedule the threads across them:

    $ ./a.out +RTS -N2

There are no monadic headaches. Concurrent and parallel multicore programming in Haskell is simple, efficient and easy!

Since it’s so easy, and has so little impact on the structure of your Haskell programs, you can start speculatively supporting multicore hardware: your Haskell program will use as many real cores as it needs, without you having to recompile or modify the code. Just change the value of ‘N’, and throw forkIO around liberally, much as you would spawn in Erlang.
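
For instance, here is a quick sketch of that style: fork one thread per task, and collect the results through MVars. Note that ‘forkMap’ is an illustrative helper of our own, not a library function; the threaded runtime spreads its threads across whatever cores ‘-N’ provides.

    import Control.Concurrent
    import Control.Monad (forM)

    -- an illustrative helper: one forked thread per task
    forkMap :: (a -> b) -> [a] -> IO [b]
    forkMap f xs = do
        vars <- forM xs $ \x -> do
            v <- newEmptyMVar
            forkIO (putMVar v $! f x) -- force the work in the child thread
            return v
        mapM takeMVar vars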

So let’s do something useful with this: how about a little program that computes primes and Fibonacci numbers? We’ll fork processes to compute the prime numbers and the Fibonacci numbers, and have the main thread lazily print the results as they’re found:

    import Control.Concurrent.Chan
    import Control.Concurrent

    -- Fork some computation processes, print their results
    main = do
        ps <- run primes
        fs <- run fibonacci
        mapM_ print $ zip ps fs

      -- fork a process, return any messages it produces as a list
      where
        run f = do
            c <- newChan
            l <- getChanContents c
            forkIO (writeList2Chan c f)
            return l

    -- A function to compute primes
    primes = sieve [2..]
        where
           sieve (p : xs) = p : sieve [x | x <- xs, x `mod` p > 0]

    -- A function to compute fibonacci numbers
    fibonacci = 0 : 1 : zipWith (+) fibonacci (tail fibonacci)

Very importantly, our computational processes are just pure Haskell code, and our main function is a thin IO skin, as usual. This is almost identical to the code you would write ignoring concurrency: nothing scary is needed to write parallel Haskell!

Compiling this with the parallel runtime:

    $ ghc -threaded -O Server.hs -o Z

And running it across 2 cores:

    $ time ./Z +RTS -N2
    (2,0)
    (3,1)
    (5,1)
    (7,2)
    (11,3)
    (13,5)
    (17,8)
    (19,13)
    (23,21)
    (29,34)
    (31,55)
    (37,89)
    (41,144)
    (43,233)
    (47,377)
    (53,610)
    (59,987)
    (61,1597)
    (67,2584)
    ^C interrupted

Easy! The key points to take away:

  • parallel and concurrent code is easy in Haskell, much as in Erlang
  • multi-core concurrency is well supported, out of the box
  • concurrent Haskell is damn fast

And most importantly, there are no monadic headaches!

Two new papers!

Warning: if you’re on the ICFP program committee and want to preserve double-blind reviewing, please don’t read this post.

Phew! The ICFP deadline is past, and two papers have been submitted.

The first paper tells the full story of stream fusion, greatly extending work initially done for the ByteString library. In particular, it shows how to fuse zips, concatMaps and list comprehensions, and we have reimplemented the entire list library to be stream-fusible. The point: faster Haskell code.
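
The core idea is easy to sketch (a simplification of what the paper develops; the Stream and Step types follow the paper, while mapS is just one illustrative combinator):

    {-# LANGUAGE ExistentialQuantification #-}

    -- a stream is a stepper function plus some initial state;
    -- Skip lets each function be written without recursion
    data Stream a = forall s. Stream (s -> Step s a) s

    data Step s a = Done
                  | Skip s
                  | Yield a s

    stream :: [a] -> Stream a
    stream xs0 = Stream next xs0
      where
        next []     = Done
        next (x:xs) = Yield x xs

    unstream :: Stream a -> [a]
    unstream (Stream next s0) = unfold s0
      where
        unfold s = case next s of
            Done       -> []
            Skip    s' -> unfold s'
            Yield x s' -> x : unfold s'

    -- map on streams: non-recursive, so general purpose compiler
    -- optimisations can see through it
    mapS :: (a -> b) -> Stream a -> Stream b
    mapS f (Stream next s0) = Stream next' s0
      where
        next' s = case next s of
            Done       -> Done
            Skip    s' -> Skip s'
            Yield x s' -> Yield (f x) s'

    -- the list map is then: map f = unstream . mapS f . stream, and a
    -- rewrite rule, stream (unstream s) = s, eliminates the intermediate
    -- list whenever two such functions meet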

The second paper describes how, using Haskell, we can write Monte Carlo simulators that outperform optimised C implementations, via a generative approach to simulator specialisation. In particular, we generate multicore and cluster-based simulators for polymer chemistry, outperforming the existing (commercial) software in this area.

So here they are, for your functional programming satisfaction:

Stream Fusion: From Lists to Streams to Nothing at All.
Duncan Coutts, Roman Leshchinskiy and Don Stewart.

Abstract:

This paper presents an automatic fusion system, stream fusion, based on equational transformations, that fuses a wider range of functions than existing short-cut fusion systems. In particular, stream fusion is able to fuse zips, left folds and functions over nested lists, including list comprehensions. A distinguishing feature of the stream fusion framework is its simplicity: by transforming list functions to expose their structure, intermediate values are eliminated by general purpose compiler optimisations.

We have reimplemented the entire Haskell standard list library on top of our framework, providing stream fusion for Haskell lists. By allowing a wider range of functions to fuse, we see an increase in the number of occurrences of fusion in typical Haskell programs. We present benchmarks documenting time and space improvements.

And the second paper:

Generative Code Specialisation for High-Performance Monte-Carlo Simulations.
Don Stewart, Hugh Chaffey-Millar, Gabriele Keller, Manuel M. T. Chakravarty and Christopher Barner-Kowollik.

Abstract:

We address the tension between software generality and performance in the domain of scientific and financial simulations based on Monte-Carlo methods. To this end, we present a novel software architecture, centred around the concept of a specialising simulator generator, that combines and extends methods from generative programming, partial evaluation, runtime code generation, and dynamic code loading. The core tenet is that, given a fixed simulator configuration, a generator in a functional language can produce low-level code that is more highly optimised than a manually implemented generic simulator. We also introduce a skeleton, or template, capturing a wide range of Monte-Carlo methods and use it to explain how to design specialising simulator generators and how to generate parallelised simulators for multi-core and distributed-memory multiprocessors.

We evaluated the practical benefits and limitations of our approach by applying it to a highly relevant problem in computational chemistry. More precisely, we used a Markov-chain Monte-Carlo method for the study of advanced forms of polymerisation kinetics. The resulting implementation executes faster than all competing software products, while at the same time also being more general. The generative architecture allows us to cover a wider range of chemical reactions and to target a wider range of high-performance architectures (such as PC clusters and SMP multiprocessors).

We show that it is possible to outperform low-level languages with functional programming in domains with very stringent performance requirements if the domain also demands generality.