Daily Haskell: Download and analyse logs, then generate sparklines

June 24, 2008 / January 21, 2009 ~ Don Stewart

This is the first post in a series of “Daily Haskell” posts, about getting those everyday tasks done in Haskell by gluing together libraries on Hackage. It is a series about Haskell as glue, with a tour of the libraries thrown in. Today, a quick reflection on what brought Haskell into this “glue” phase of its existence, and the first “daily Haskell” program: hackage-sparks, a script to produce sparklines. First, an overview of how we got here, for new Haskellers.

The practice of Haskell programming has shifted dramatically in the past 12 months. It used to be that you had to roll your own HTTP client, or log file parser, or graph generator, using libraries like Parsec or Arrows to make things clean and elegant. But by 2008 the daily practice of Haskell programming is dominated by glue: combining existing libraries in new ways, using cabal-install to gather the components, and letting you simply and quickly get your work done.

Three key developments brought about this shift, and I want to quickly go over them (all aspiring Haskellers should know these tools!).

A single, common build system: Cabal

The first key development started in 2004, when Isaac Potoczny-Jones (now my boss here at Galois) saw a chronic lack of a single Haskell build system. The few and the brave were rolling their own library archives with Makefiles and autoconf, but there was no way to check if other Haskell dependencies were around, and no agreed upon way to construct the build system. Everybody did their own thing, and all were broken in some way.

Isaac started hacking, quickly getting code out to the community. Things built from there, and now, 4 years later, we’ve a grand solution: Cabal, the common platform for building Haskell applications and libraries. By writing a simple, declarative specification for what your Haskell code provides, Cabal is able to abstract out all the nitty gritty of actually preprocessing, compiling, optimising, linking and installing your apps and tools. Now even Haskell newbies can construct perfect libraries and apps, and crucially, they can simply and correctly reuse the libraries of others, with all dependencies checked and satisfied. The type system enforces that the glue is of the appropriate strength for what we’re combining, and purity ensures libraries we import don’t monkey around in code we’ve already written. How simple, modular, scalable programming should be.

Centralised library repository: Hackage

The second big step really took off in March last year, after the Oxford Haskell Hackathon, when Hackage went live. Like CPAN before it, this provides a single, centralised repository of all the Haskell code fit to package. Dependencies are described, a standard interface is presented, and, key for developers, online, cross-referenced documentation of all the library APIs is provided. In this way, developers can contribute their libraries to the central API pool, and have them integrated and documented with others’ work.

The online documentation is crucial for spreading knowledge of library APIs, because, as this is a purely functional, strongly typed language, once you see the type signature for a library function, you’ve all the information you need to integrate it safely in your code, and start using it. Just check the function types — they tell you everything you need to know about how to use it.

Since it went live we’ve had over 550 libraries and tools uploaded to Hackage, with everything from XML parsers, GUIs and databases to 3D shooter games, bioinformatics software, MIDI synthesisers and Perl 6 implementations now easily available. Almost everything you might need for your daily work is now there in one form or another, and if it’s not, it is a small matter of rolling an FFI binding and uploading your cabalised package to add to the pile. Please do so!

One stop Haskell installation: cabal install

The final piece of the build, host and distribution puzzle fell into place in June, with the release of cabal-install, an apt- or pacman-like tool that automates all dependency resolution, downloading and building of libraries and applications. From a single command, for example:

    $ cabal install hackage-sparks

The tool will chase down, build and install everything needed for your application.

The centre of the Haskell universe is now focused on Hackage, with projects adding to the code base, combining libraries into new forms, packaging up, and downloading from, this wealth of code. If you’re not yet using it, run, don’t walk, to Hackage, get cabal-install, and get coding! If you’re a user of a distro with good Haskell support, like, say, Gentoo, or Arch Linux, you’ll already have cabal-install in native packaged form.

Daily Haskell: Sparklines, Log files and Tagsoup

Down to work. With all these uploads happening on Hackage, late Friday I wanted to summarise how active Hackage is, month-by-month and day-by-day, in a concise format suitable for presentation on the front page of haskell.org.

One nice way to do this is via sparklines: simple, concise, dense graphics that fit inside a sentence. I’d like to condense the Hackage log files into such graphs, and a library for drawing them is available on Hackage, thanks to Hitesh Jasani. [Figure: two example sparklines]

To do all this we’ll need to do three things:

Download the log file
Analyse and group the logs by date into months and days
Spit out .png files containing rendered graphs

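The libraries we’ll use for this are tagsoup (easy HTTP downloads), parsedate (parsers for time strings) and hsparklines (sparkline graphs).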

The upload logs are on http://hackage.haskell.org, and have the form:

    Mon Jun 23 09:03:05 PDT 2008 AudreyTang Pugs 6.2.13.3
    Mon Jun 23 09:34:07 PDT 2008 UweSchmidt hxt 8.1.0
    Mon Jun 23 11:50:45 PDT 2008 JeremyShaw AGI 1.1.1
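To make the shape of these records concrete, here is a minimal sketch that splits one log line into its fields using nothing but `words` from the Prelude. The `Upload` type, its field names and `parseLine` are hypothetical names for this illustration only; they are not part of hackage-sparks, and the sketch assumes every record has exactly the nine whitespace-separated fields seen in the sample above.

```haskell
-- Hypothetical record type for one upload log line (illustration only).
data Upload = Upload
  { upStamp   :: String  -- raw timestamp text, e.g. "Mon Jun 23 09:03:05 PDT 2008"
  , upUser    :: String
  , upPackage :: String
  , upVersion :: String
  } deriving (Eq, Show)

-- Split a line on whitespace: the first six words form the timestamp,
-- the remaining three are uploader, package and version.
-- Assumption: names and versions contain no spaces.
parseLine :: String -> Maybe Upload
parseLine l = case words l of
  [wd, mo, d, t, tz, y, u, p, v] ->
    Just (Upload (unwords [wd, mo, d, t, tz, y]) u p v)
  _ -> Nothing

main :: IO ()
main = mapM_ (print . fmap upPackage . parseLine)
  [ "Mon Jun 23 09:03:05 PDT 2008 AudreyTang Pugs 6.2.13.3"  -- Just "Pugs"
  , "not a log line"                                         -- Nothing
  ]
```

Returning a `Maybe` means malformed lines can simply be dropped with `catMaybes`, which is the same trick the real script uses for its date parser.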

The code is straightforward, and shouldn’t take more than 10 minutes to write. Just a quick script. First, import the libs we want.

    -- Some basics
    import Data.List
    import Data.Maybe

    -- Time and locale handling
    import System.Time
    import System.Locale

    -- Directory and filepaths
    import System.Directory
    import System.FilePath

    -- Parsers for time strings
    import System.Time.Parse

    -- Easy HTTP downloads
    import Text.HTML.Download

    -- Sparkline graphics
    import Graphics.Rendering.HSparklines

This is how Haskell as glue works. Pull in everything, and roll some list glue between components. Now, some constants:

    -- Where our log files live
    url = "http://hackage.haskell.org/packages/archive/log"

    -- Filenames of our generated graphs
    png1 = "hackage-monthly.png"
    png2 = "hackage-daily.png"

Yeah, no type declarations. Type inference for just getting the job done.

Visiting Hackage and looking at the API for hsparklines, we see it is possible to define a custom graph style for our sparklines, so let’s define a bar graph with a grey background:

    graph = barSpark { bgColor = rgb 0xEE 0xEE 0xEE }

Now, our script proper. Grab the pwd, and download the log file:

    main = do
        pwd <- getCurrentDirectory
        src <- openURL url

Yeah, that’s how you download a page of the internets. Easy.

Now, we start the glue logic. Break the log file into lines, parse them into proper dates using the parsedate API, and sort them by date, ignoring any parse failures:

        let dates = catMaybes . sort . map parse . lines $ src

Cool, we got a lot done there. Now, find today’s date (taken as the date of the most recent upload), and use the list function groupBy to cluster our individual uploads into groups of days and months:

        let today     = last dates
            permonth  = groupBy month dates
            thismonth = groupBy day . filter (month today) $ dates

We defined some helper functions here to let us compare by year and month, and year, month, day, at the bottom of ‘main’:

      where
        parse = parseCalendarTime defaultTimeLocale "%c"

        month a b = ctYear a == ctYear b && ctMonth a == ctMonth b
        day   a b = month a b && ctDay a == ctDay b

‘month’ is nice, as it can be passed to both groupBy, and filter, letting us group on months, or filter things that match today.
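One detail worth spelling out: groupBy only merges adjacent elements, so the grouping relies on the dates having been sorted first. The sketch below shows the same month/day trick on plain `(year, month, day)` triples, which stand in for CalendarTime here so that the example needs nothing beyond base. The names `sameMonth`, `sameDay` and the sample data are made up for the illustration.

```haskell
import Data.List (groupBy, sort)

-- (year, month, day): a stand-in for CalendarTime in this sketch.
type Date = (Int, Int, Int)

sameMonth :: Date -> Date -> Bool
sameMonth (y1, m1, _) (y2, m2, _) = y1 == y2 && m1 == m2

sameDay :: Date -> Date -> Bool
sameDay a b = a == b

-- Made-up upload dates, sorted so that equal months and days are adjacent.
uploads :: [Date]
uploads = sort
  [ (2008, 6, 23), (2008, 5, 30), (2008, 6, 23)
  , (2008, 6, 24), (2008, 5, 2),  (2008, 6, 24), (2008, 6, 24) ]

-- Uploads per month, oldest month first.
perMonth :: [Int]
perMonth = map length (groupBy sameMonth uploads)

-- Uploads per day, restricted to the month of the latest upload.
perDayThisMonth :: [Int]
perDayThisMonth =
  map length . groupBy sameDay . filter (sameMonth (last uploads)) $ uploads

main :: IO ()
main = do
  print perMonth          -- [2,5]
  print perDayThisMonth   -- [2,3]
```

As in the script, the partially applied predicate `sameMonth (last uploads)` doubles as a filter, while the two-argument form drives groupBy.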

Almost done. Now count the number of uploads in each month or day group, converting the lengths into Floats ready for graphing:

        let monthlies = map genericLength permonth
            dailies   = map genericLength thismonth

Now, we just use the ‘make’ function from hsparklines, which takes a graph style and a list of points for the columns. The result is an Image value, which can be immediately written to disk:

        graph1 <- make graph monthlies
        graph2 <- make (graph { limits = (0,20) }) dailies

        savePngFile png1 graph1
        savePngFile png2 graph2

Done! Now tell the user what we wrote:

        putStrLn $ "Wrote: " ++ pwd </> png1
        putStrLn $ "Wrote: " ++ pwd </> png2

And that’s it. Haskell logic gluing together network code pulling in online logs, analysing them, and graphing the results. Simple, and all strongly typed, pure glue.

Running this in ghci, or from the command line:

    $ hackagesparks
    Wrote: /home/dons/dons/src/hackage-sparks/hackage-monthly.png
    Wrote: /home/dons/dons/src/hackage-sparks/hackage-daily.png

The last step is to construct a .cabal file for the program, and upload it to hackage, specifying the program and its deps:

    name:                hackage-sparks
    version:             0.1
    homepage:            http://code.haskell.org/~dons/code/hackage-sparks
    license:             BSD3
    license-file:        LICENSE
    author:              Don Stewart
    maintainer:          dons@galois.com
    category:            Graphics
    synopsis:            Generate sparkline graphs of hackage statistics
    description:         Generate sparkline graphs of hackage statistics
    cabal-version:       >= 1.2
    build-type:          Simple

    executable hackagesparks
        main-is:         Main.hs
        build-depends:   base >= 3, old-locale, old-time, directory,
                         hsparklines, tagsoup, parsedate, filepath

Bundling up the .cabal file and the source, I uploaded this script to Hackage, so you can get it via cabal-install. You can generate .cabal files like this one using mkcabal.

And that’s it. Job done. Sparklines for the uploads are visible on the haskell.org frontpage.

I hope you get a sense for how, with the build and distribution infrastructure of cabal, and the wealth of libs on hackage, it’s cheap to roll Haskell solutions to your everyday scripting problems, yielding rock solid, native-code compiled, strongly typed scripts that just work. No fuss, no mess, just getting the job done. In the coming weeks I hope to post more of these daily Haskell scripts, covering more and more of Hackage, and giving an insight into what Haskell for the working programmer is like.

Haskell as fast as C: working at a high altitude for low level performance

June 4, 2008 / January 21, 2009 ~ Don Stewart

After the last post about high performance, high level programming, Slava Pestov, of Factor fame, wondered whether it was generally true that “if you want good performance you have to write C in your language”. It’s a good question to ask of a high level language.

In this post I want to show how, often, we can answer “No”. That by working at a higher abstraction level we can get the same low level performance, by exploiting the fact that the compiler knows a lot more about what our code does. We can teach the compiler to better understand the problem domain, and in doing so, enable the compiler to optimise the code all the way down to the raw assembly we want.

Specifically, we’ll exploit stream fusion, a new compiler optimisation that removes intermediate data structures produced in function pipelines to produce very good code, yet which encourages a very high level coding style. The best of high level programming, combined with the best of raw low level throughput.

The micro-benchmark: big list mean

This post is based around a simple microbenchmark developed in the last post of this series: computing the mean of a list of floating point values. It’s a tiny program, but one with interesting properties: the list is enormous (10^9 values) — so dealing with it effectively requires laziness of some kind, and a naive implementation is far too inefficient, so we have to be smarter with our loops.

The list is generated via an enumeration — a key property the compiler can later exploit — and we have two reference implementations on hand. One in Haskell, using a manually fused, tail recursive style, with a worker/wrapper transformation applied:

    import System.Environment
    import Text.Printf

    mean :: Double -> Double -> Double
    mean n m = go 0 0 n
        where
            go :: Double -> Int -> Double -> Double
            go s l x | x > m      = s / fromIntegral l
                     | otherwise  = go (s+x) (l+1) (x+1)

    main = do
        [d] <- map read `fmap` getArgs
        printf "%f\n" (mean 1 d)

And a straightforward translation to C, with all data structures fused away:

    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {

            double d = atof(argv[1]);

            double n;
            long long a; // 64 bit machine
            double b;

            // go_s17J :: Double# -> Int# -> Double# -> Double#
            for (n = 1,
                 a = 0,
                 b = 0; n <= d; b+=n,
                                n++,
                                a++)
                ;

            printf("%f\n", b / a);

            return 0;
    }

The C will serve as a control: a lower bound we want to reach. Both reference programs compute the same thing, in the same time, as they have pretty much identical runtime representations:

    $ gcc -O2 t.c
    $ time ./a.out 1e9
    500000000.067109
    ./a.out 1e9  1.75s user 0.00s system 99% cpu 1.757 total

and

    $ ghc -O2 A.hs -optc-O2 -fvia-C --make 
    $ time ./A 1e9
    500000000.067109
    ./A 1e9  1.76s user 0.00s system 100% cpu 1.760 total

In the previous post we looked at how and why the Haskell version is able to compete directly with C here, when working at a low level (explicit recursion). Now let’s see how to fight the battle at high altitude.

Left folds and deforestation

Clearly, writing in a directly recursive style is a reliable path to fast, tight loops, but there’s a programmer burden: we had to sit and think about how to combine the list generator with its consumer. For cheaper cognitive load we’d really like to keep the logically separate loops … literally separate, by composing a list generator with a list consumer. Something like this:

    mean n m = s / fromIntegral l
      where
        (s, l) = foldl' k (0, 0) [n .. m]
        k (s, l) a = (s+a, l+1)

The fold is the key. This is a nice decoupling of the original recursive loop — the generator is abstracted out again into an enumeration, and the process of traversing the list once, accumulating two values (the sum and the length), is made clear. This is the kind of code we want to write. However, there are some issues at play that make this not quite the required implementation (as it doesn’t yield the same runtime representation as the control implementations):

The tuple accumulator is too lazy for a tail recursive loop
The accumulator type is an overly expensive unbounded Integer

The fix is straightforward: just use a strict pair type for nested accumulators:

    data P = P !Double !Int

    mean :: Double -> Double -> Double
    mean n m = s / fromIntegral l
      where
        P s l = foldl' k (P 0 0) [n .. m]
        k (P s l) a = P (s+a) (l+1)

Strict products, of atomic types, have a great property: when used like this they can be represented as sets of register variables (compile with -funbox-strict-fields). The P data type is essentially an abstract mapping from P’s fields into two registers for use as loop accumulators. We can be fairly certain the fold will now compile to a loop whose accumulating parameters are passed and returned (thanks to the “constructed product result” optimisation), in registers.

This is easy to confirm via ghc-core. Compiling the code, we see the result below. The ‘#’ suffix indicates the Int and Double types are unlifted — they won’t be stored on the heap, and will most likely be kept in registers. The return type, an unlifted pair (#, #), is also excellent: it means the result returned from the function will be kept in registers as well (we don’t need to allocate at all to get the fold’s state in or out of the function!). So now we’re getting closer to the optimal representation. The core we get:

    go :: Double# -> Int# -> [Double] -> (# Double#, Int# #)
    go a b c = case c of
          []     -> (# a, b #)
          x : xs -> case x of
                        D# y -> go (+## a y) (+# b 1) xs

This will run in constant space, as the lazy list is demanded (so the first part of the problem is solved), and the result will be accumulated in registers (so that part is fine). But we still see a performance penalty here: the list of Doubles is still concretely, lazily constructed. GHC wasn’t able to spot that the generator of sequential values could be combined with the fold that consumes it, avoiding the list construction entirely. This will cost us quite a bit in memory traffic in such a tight loop:

    $ time ./B 1e9
    500000000.067109
    ./B 1e9  66.92s user 0.38s system 99% cpu 1:07.45 total

Right, so we got an answer, but dropping out to the heap to get the next list node is an expensive bottleneck in such a tight loop, as expected.

The deeper problem here is that the compiler isn’t able to combine left folds with list generators into a single loop. GHC can do this for right folds, as described in the original paper for this optimisation, A Short Cut to Deforestation. That is, if you compose certain functions written as right folds GHC will automatically remove intermediate lists between them, yielding a single loop that does the work of both, without any list being constructed. This will work nicely for pipelines of the following functions: head, filter, iterate, repeat, take, and, or, any, all, concat, foldr, concatMap, zip, zipWith, ++, map, and list comprehensions.

But it won’t work for some key left folds such as foldl, foldl', foldl1, length, minimum, maximum, sum and product. Another downside is that the enumFromTo list generator only fuses for Int, Char and Integer types, an oversight in the current implementation.
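For readers who haven’t met the right-fold scheme before, it rests on a single law: `foldr k z (build g) = g k z`, where `build` abstracts a list producer over cons and nil. Here is a minimal sketch of that law written out by hand. The local `build` and the producer `upTo` are defined just for this illustration (GHC’s real `build` lives in its base libraries, and the real rewriting is done by a RULES pragma rather than by hand).

```haskell
{-# LANGUAGE RankNTypes #-}

-- A list producer abstracted over how it conses and what it ends with.
-- (Local definition for this sketch.)
build :: (forall b. (a -> b -> b) -> b -> b) -> [a]
build g = g (:) []

-- Hypothetical producer: the numbers lo..hi, written against abstract cons/nil.
upTo :: Int -> Int -> (Int -> b -> b) -> b -> b
upTo lo hi cons nil = go lo
  where
    go i | i > hi    = nil
         | otherwise = i `cons` go (i + 1)

-- Builds the intermediate list, then folds it.
unfused :: Int
unfused = foldr (+) 0 (build (upTo 1 100))

-- What the foldr/build law rewrites it to: the producer is handed the
-- fold's function and seed directly, so no list is ever constructed.
fused :: Int
fused = upTo 1 100 (+) 0

main :: IO ()
main = print (unfused, fused)  -- (5050,5050)
```

The catch described above is visible here: the law only helps consumers written with foldr, which is awkward for accumulating left folds like our mean.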

What we need is a different deforestation strategy — one that teaches the compiler how to see through the left fold.

Stream fusion

An alternative (and new) deforestation optimisation for sequence types, such as lists or arrays, is stream fusion, which overcomes the foldr bias of GHC’s old method. As GHC allows custom optimisations to be written by users in libraries, this is cheap to try out: the new optimisation comes bundled in the libraries of various new data structures, namely the fusible list package and a new fusible arrays library. It is also available for distributed parallel arrays, as part of the data parallel Haskell project, from which the non-parallel fusible arrays are derived.

There are two main enabling technologies that have brought about this abundance of data structure deforestation optimisations in Haskell: user-specified term rewriting rules, and ubiquitous, pure higher order functions. If you don’t have the former the burden of adding new optimisations by hacking the compiler is too high for mortals, and if you don’t have the latter — HOFs with guaranteed purity — there simply won’t be enough fusion opportunities to make the transformation worthwhile. Happily, GHC Haskell has both — in particular, it can be more aggressive as it knows the loops will have no side effects when we radically rearrange the code. Fun, fun, fun.

So, taking the fusible arrays library, we can rewrite our fold example as, precisely:

    import System.Environment
    import Data.Array.Vector
    import Text.Printf

    data P = P !Double !Int

    mean :: Double -> Double -> Double
    mean n m = s / fromIntegral l
      where
        P s l = foldlU k (P 0 0) (enumFromToFracU n m)
        k (P s l) a = P (s+a) (l+1)

    main = do
        [d] <- map read `fmap` getArgs
        printf "%f\n" (mean 1 d)

This is practically identical to the naive list version: we simply replaced foldl' with foldlU, and [n .. m] with an explicit enumeration (as list enumerator syntax isn’t available). GHC compiles the fold down to:

    RuleFired
      1 streamU/unstreamU

    go :: Double# -> Int# -> Double# -> (# Double#, Int# #)
    go = \ a b c ->
        case >## c limit of
            False -> go (+## a c) (+# b 1) (+## c 1.0)
            True  -> (# a, b #)

Look how the list generator and consumer have been combined into a single loop with 3 register variables, corresponding to the sum, the length, and the next element to generate in the list. The rule firing indicates that an intermediate array was removed from the program. Our “array program” allocates no arrays — as the compiler is able to see through the problem all the way from array generation to consumption! The final assembly looks really good (ghc -funbox-strict-fields -O2 -fvia-C -optc-O2):

    s1rD_info:
      ucomisd     5(%rbx), %xmm6
      ja  .L87
      addsd       %xmm6, %xmm5
      addq        $1, %rsi
      addsd       .LC2(%rip), %xmm6
      jmp s1rD_info

And the key test: the assembly from the high level approach is practically identical to the hand fused, worker/wrapper applied low level implementation:

    s1bm_info:
      ucomisd     8(%rbp), %xmm6
      ja  .L73
      addsd       %xmm6, %xmm5
      addq        $1, %rbx
      addsd       .LC0(%rip), %xmm6
      jmp s1bm_info

And to the original C implementation:

    .L5:
        addsd   %xmm1, %xmm3
        addq    $1, %rax
        addsd   %xmm2, %xmm1
        ucomisd %xmm1, %xmm0
        jae .L5

As a result the performance should be excellent:

    $ time ./C 1e9
    500000000.067109
    ./UFold 1e9  1.77s user 0.01s system 100% cpu 1.778 total

If we look at the runtime statistics (+RTS -sstderr) we see:

     16,232 bytes allocated in the heap
      %GC time       0.0%  (0.0% elapsed)

So 16k allocated in total, and the garbage collector was never invoked. There was simply no abstraction penalty in this program! In fact, the abstraction level made possible the required optimisations.

How it works

Here’s a quick review of how this works. For the full treatment, see the stream fusion paper.

The first thing to do is to represent arrays as abstract unfolds over a sequence type, with some hidden state component and a final length:

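    -- 'forall s.' ahead of the constructor hides the state type s
    -- (an existential; needs GHC's ExistentialQuantification extension)
    data Stream a = forall s. Stream (s -> Step s a) !s Int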

    data Step s a = Done
                  | Yield !a !s
                  | Skip     !s

This type lets us represent traversals and producers of lists as simple non-recursive stepper functions — abstract loop bodies — which the compiler already knows how to optimise really well. Streams are basically non-recursive functions that either return an empty stream, or yield an element and the state required to get future elements, or they skip an element (this lets us implement control flow between steppers by switching state around).
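To see why Skip matters, here is a small self-contained sketch of the idea over plain values rather than unboxed arrays. It is a simplification, not the library’s code: the Stream here drops the length field, and `enumS`, `filterS` and `foldlS` are names chosen for this illustration. The point to notice is that the producer and the transformer are non-recursive steppers; only the final consumer loops.

```haskell
{-# LANGUAGE ExistentialQuantification, BangPatterns #-}

data Step s a = Done | Yield a s | Skip s

-- Simplified stream: a stepper plus its hidden state (no length hint).
data Stream a = forall s. Stream (s -> Step s a) s

-- Producer: count from lo to hi. The state is just the next number.
enumS :: Int -> Int -> Stream Int
enumS lo hi = Stream next lo
  where
    next i | i > hi    = Done
           | otherwise = Yield i (i + 1)

-- Transformer: rejected elements become Skip, so the stepper never has
-- to loop looking for the next acceptable element.
filterS :: (a -> Bool) -> Stream a -> Stream a
filterS p (Stream next s0) = Stream next' s0
  where
    next' s = case next s of
      Done       -> Done
      Skip s'    -> Skip s'
      Yield x s' | p x       -> Yield x s'
                 | otherwise -> Skip s'

-- Consumer: a strict left fold, the only recursive function here.
foldlS :: (b -> a -> b) -> b -> Stream a -> b
foldlS f z0 (Stream next s0) = go z0 s0
  where
    go !z s = case next s of
      Done       -> z
      Skip s'    -> go z s'
      Yield x s' -> go (f z x) s'

main :: IO ()
main = print (foldlS (+) 0 (filterS even (enumS 1 10)))  -- 30
```

Because every stage but the last is a plain case expression, the compiler’s ordinary inlining and case-of-case machinery has a fair chance of collapsing the whole pipeline into one loop.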

We’ll then need a way to turn arrays into streams:

    streamU :: UA a => UArr a -> Stream a
    streamU !arr = Stream next 0 n
      where
        n = lengthU arr

        next i | i == n    = Done
               | otherwise = Yield (arr `indexU` i) (i+1)

And a way to turn streams back into (unlifted, ST-encapsulated) arrays:

    --
    -- Fill a mutable array and freeze it
    --
    unstreamU :: UA a => Stream a -> UArr a
    unstreamU st@(Stream next s n) = newDynU n (\marr -> unstreamMU marr st)

    --
    -- Fill a chunk of mutable memory by unfolding a stream
    --
    unstreamMU :: UA a => MUArr a s -> Stream a -> ST s Int
    unstreamMU marr (Stream next s n) = fill s 0
      where
        fill s i = case next s of
             Done       -> return i
             Skip s'    -> fill s' i
             Yield x s' -> writeMU marr i x >> fill s' (i+1)

This fills a chunk of memory by unfolding the yielded elements of the stream. Now, the key optimisation to teach the compiler:

    {-# RULES

    "streamU/unstreamU" forall s.
      streamU (unstreamU s) = s

      #-}

And that’s it. GHC now knows how to remove any occurrence of an array construction followed by its immediate consumption.

We can write an enumerator and foldl for streams:

    enumFromToFracS :: Double -> Double -> Stream Double
    enumFromToFracS n m = Stream next n (truncate (m - n))
      where
        lim = m + 1/2

        next s | s >  lim     = Done
               | otherwise    = Yield s (s+1)

    foldS :: (b -> a -> b) -> b -> Stream a -> b
    foldS f z (Stream next s _) = fold z s
      where
        fold !z s = case next s of
                     Yield x !s' -> fold (f z x) s'
                     Skip    !s' -> fold z s'
                     Done        -> z

And we’re almost done. Concrete implementations of foldlU and the array enumerators are written by converting arrays to streams, and folding and enumerating those instead (enumFromToFracU is built on enumFromToFracS in the same way as enumFromToU below):

    enumFromToU :: (UA a, Integral a) => a -> a -> UArr a
    enumFromToU start end = unstreamU (enumFromToS start end)

    foldlU :: UA a => (b -> a -> b) -> b -> UArr a -> b
    foldlU f z = foldS f z . streamU

So, if we write a program that composes foldlU and enumFromToFracU, such as our big mean problem:

    = foldlU k (P 0 0) (enumFromToFracU n m)

The compiler will immediately inline the definitions to:

    = foldS k (P 0 0) (streamU (unstreamU (enumFromToFracS n m)))

The fusion rule kicks in at this point, and we’ve killed off the intermediate array:

    = foldS k (P 0 0) (enumFromToFracS n m)

Now we have a single loop, the foldS, which takes a non-recursive list generator as an argument. GHC then squooshes away all the intermediate Step constructors, leaving a final loop with just the list generator index, and the pair of state values for the sum and length:

    go :: Double# -> Int# -> Double# -> (# Double#, Int# #)
    go = \ a b c ->
        case >## c limit of
            False -> go (+## a c) (+# b 1) (+## c 1.0)
            True  -> (# a, b #)

Exactly the low level code we wanted — but written for us by the compiler. And the final performance tells the story:

    $ time ./C 1e9
    500000000.067109
    ./UFold 1e9  1.77s user 0.01s system 100% cpu 1.778 total

A more complicated version

We can try a more complicated variant, numerically stable mean:

    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {

        double d = atof(argv[1]);

        double n;
        long long a; // 64 bit machine
        double b;

        // go :: Double# -> Int# -> Double# -> Double#
        for (n = 1,
             a = 1,
             b = 0; n <= d; b+= (n - b) / a,
                            n++,
                            a++)
            ;

        printf("%f\n", b);

        return 0;
    }

Implemented in Haskell as a single fold, which computes the stable average each time around:

    import Text.Printf
    import System.Environment
    import Data.Array.Vector

    mean :: UArr Double -> Double
    mean = fstT . foldlU k (T 0 1)
        where
            k (T b a) n = T b' (a+1)
                where b' = b + (n - b) / fromIntegral a

    data T = T !Double !Int

    fstT (T a _) = a

    main = do
        [d] <- map read `fmap` getArgs
        printf "%f\n" (mean (enumFromToFracU 1 d))
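The step function k is the usual incremental-mean update: if b is the mean of the first a-1 values, then b + (n - b) / a is the mean once the a-th value n is included, so the running total never has to be stored. A small sketch with ordinary lists (using foldl' from base instead of foldlU, and the made-up names `stableMean` and `naiveMean`) checks it against the obvious sum-and-divide definition on a tiny input:

```haskell
import Data.List (foldl')

-- Strict pair: running mean so far, and the index of the next element.
data T = T !Double !Int

-- Running mean over a plain list (stand-in for the UArr version above).
stableMean :: [Double] -> Double
stableMean = fstT . foldl' k (T 0 1)
  where
    k (T b a) x  = T (b + (x - b) / fromIntegral a) (a + 1)
    fstT (T b _) = b

-- Reference definition: total divided by count.
naiveMean :: [Double] -> Double
naiveMean xs = sum xs / fromIntegral (length xs)

main :: IO ()
main = do
  print (stableMean [1 .. 10])  -- 5.5
  print (naiveMean  [1 .. 10])  -- 5.5
```

The price of the stable form is a floating point division on every iteration.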

We can work at a higher level than the C code, which already manually fuses away the list generation and consumption. The end result is the same thing though:

    $ ghc-core F.hs -O2 -optc-O2 -fvia-C -funbox-strict-fields
    $ time ./F 1e9                      
    500000000.500000
    ./F 1e9  19.28s user 0.00s system 99% cpu 19.279 total

    $ gcc -O2 u.c
    $ time ./u 1e9
    500000000.500000
    ./u 1e9  19.09s user 0.00s system 100% cpu 19.081 total

So the fast inner loops are fine, and we get to use a high level language for all the surrounding glue. That’s just super.

So, to answer Slava’s question: yes, we often can get the performance of C, while working at a higher altitude. And to do so we really want to exploit all the additional information we have at hand when working more abstractly. And the user doesn’t have to know this stuff — the library already does all the thinking. Knowledge is (optimisation) power. Thinking at a higher level enables the compiler to do more powerful things, leaving less work for the programmer. It’s all win.