It’s deeply satisfying when GHC Haskell’s stream fusion optimization kicks in. It is a powerful illustration of how the absence of side effects makes the compiler’s job so much easier, and lets it be far more aggressive.
Here’s a little example of the real expressive power we get when we can rely on representation-changing optimizations like fusion, and of how good GHC has become at this kind of thing: summing a sequence of numbers. I’ve kept it super minimal so you can mentally compute each transformation and see GHC at work. (Of course there are lots of other ways to write trivial computations like this, but anything more substantial makes the assembly much harder to follow, so for teaching purposes we’ll keep the source problem simple. You can also combine any of the 100+ other functions that fuse to construct all sorts of interesting programs with no closed-form solution. Get at it!)
We begin with a pipeline of multiple recursive functions, brought together with function composition. It’s as high level as it gets, and any junior Haskell programmer should be able to write stuff like this:
import qualified Data.Vector as U

main = print . U.sum $ U.enumFromTo 1 (100000000 :: Int)
Now, semantically, there is an intermediate 100-million-element array created by enumFromTo. Try doing that in your favorite strict language. However…
… fusion kicks in, and GHC turns our program, via the source-to-source optimization phase, into a single tail-recursive function using only unboxed values, with no intermediate array! No heap allocation, no stack, only one loop. A different algorithm!
loop :: Int# -> Int# -> Int#
loop x y = case y <= 100000000 of
    False -> x
    True  -> loop (x + y) (y + 1)
The compiler reordered our computations pretty wildly, and it can do this because there are no side effects by default — making large-scale transformations like this far more feasible. I used GHC 6.13 with the LLVM patches, compiling with:

ghc -Odph -fllvm --make A.hs
The programmer didn’t have to think about recursion at all in the original program, let alone manual unboxing. Thanks GHC!
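The machinery behind this is worth a sketch. Stream fusion rewrites each combinator to operate on a non-recursive stream of steps; once everything is a stream, GHC’s inliner collapses the whole pipeline into one loop. Here is a simplified model of that representation — the names are illustrative, not vector’s actual internals:

```haskell
{-# LANGUAGE BangPatterns, ExistentialQuantification #-}

-- One step of a stream: either finished, or an element plus new state.
data Step s a = Done | Yield a s

-- A stream is a step function together with an initial state.
data Stream a = forall s. Stream (s -> Step s a) s

-- Non-recursive producer: the "loop" lives in the state, not in recursion.
enumFromTo' :: Int -> Int -> Stream Int
enumFromTo' lo hi = Stream step lo
  where
    step i
      | i > hi    = Done
      | otherwise = Yield i (i + 1)

-- Recursive consumer: the single loop everything fuses into.
sumS :: Stream Int -> Int
sumS (Stream step s0) = go 0 s0
  where
    go !acc s = case step s of
      Done       -> acc
      Yield x s' -> go (acc + x) s'
```

In the real library, rewrite rules turn sum (enumFromTo 1 n) into exactly this shape, and the simplifier then inlines step into go, yielding the unboxed loop shown earlier.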
And now this tail recursive function gets compiled into an imperative loop via the codeGen phase. First, with the native code backend to GHC:
loop:
.Lc216:
        cmpq $10000000,%rsi
        jle .Lc219
        movq %r14,%rbx
        movq (%rbp),%rax
        jmp *(%rax)
.Lc219:
        addq %rsi,%r14
        incq %rsi
        jmp loop
And again, with the LLVM backend to GHC:
loop:
        leaq 1(%rsi), %rax
        addq %rsi, %r14
        cmpq $10000001, %rax
        jge .LBB1_5
        addq $2, %rsi
        addq %rax, %r14
test:                           # %tailrecurse
        cmpq $10000001, %rsi
        jl loop
And that’s it. Our kernel code is as low level and un-functional as you get, but our input program was the most classic of high level approaches: a pipeline of functions held together with composition, with intermediate data structures all over the place.
And it performs!
With -fasm: mean 92.9 ms.
With -fllvm: mean 79.3 ms (a 15% improvement with -fllvm).
Maybe our compiler is — sometimes — sufficiently smart.
To me, the fusion optimization is the epitome of what functional programming is about. I can write code at a very high level, throwing functions and composition around like candy, and the compiler will make stunning, complexity-changing transformations on my code, so it just runs damn fast. Can your language — functional or otherwise — do this?
16 thoughts on “Fusion makes functional programming fun!”
Wow, Haskell keeps getting better perf!
Don, a different area that I’m potentially interested in with Haskell is meta-programming. I know about Template Haskell and compared with Lisp macros I find it very awkward and far less expressive.
Is there a better way of doing meta-programming with Haskell?
This is deeply unsatisfying. If this is written in C with the “fusion” done by hand, GCC is able to optimise the result of the loop to a constant. Why is the Haskell compiler unable to do that?
@John While GHC performs some huge optimizations like loop fusion, it doesn’t constant fold through loops. GCC 4.x series added this optimization for constant loops in C.
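For reference, the constant GCC folds that loop to is just the closed form for the sum:

```haskell
-- Closed form for 1 + 2 + ... + n; this is the constant a
-- loop-folding optimizer can recover when n is known at compile time.
sumTo :: Int -> Int
sumTo n = n * (n + 1) `div` 2
```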
Now, that’s a low level optimization the LLVM backend *should* be able to do, though I’ve not tried that yet. And it would be beneficial, since GHC did the hard work to get the program into a single loop in the first place.
For the sake of this example, better to replace the limit in this program with an ‘n’ read from the user.
Ah, but, I would contend there is a dark side to this (well, not dark… just… tricky). It’s already been acknowledged that laziness can make it extremely difficult for programmers to reason about performance. I would contend that these kinds of optimizations, as incredibly impressive as they are, also make it more difficult to predict, a priori, exactly how a piece of code will perform, without understanding, in detail, what kinds of transformations the compiler is capable of. After all, looking at the original snippet of code naively, I would’ve never guessed that the compiler could transform it so drastically. Now imagine a far more complex algorithm where these sorts of optimizations are possible.
In a way, it’s kind of like a *far* more advanced version of strength reduction in more naive compilers. A lot of young developers were once trained to replace constant multiplies with shifts where appropriate because they were faster, not realizing that the compiler can already perform that sort of transformation for you. Fortunately, once you realize the compiler will perform strength reduction, you can go back to writing nice, clear code using multiplies, knowing that the code will be optimized appropriately, as the transformation is fairly trivial to understand.
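As a toy rendering of that advice (names are just for illustration):

```haskell
import Data.Bits (shiftL)

-- What you should write: the clear multiply.
timesEight :: Int -> Int
timesEight x = x * 8

-- What people were once told to write by hand; the compiler emits
-- the same shift for the multiply anyway, so the clear version wins.
timesEight' :: Int -> Int
timesEight' x = x `shiftL` 3
```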
Now imagine trying to predict this kind of optimization.
Thankfully fusion is remarkably easy to predict. If you see the composition of any of these functions, they will fuse into a single loop.
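For example, a longer pipeline of fusible functions — shown here with ordinary lists, which fuse through GHC’s foldr/build rules; the vector versions of the same combinators fuse the same way:

```haskell
-- sum, map, filter and the enumeration are all fusible, so GHC
-- compiles this pipeline to a single accumulating loop with no
-- intermediate lists.
pipeline :: Int -> Int
pipeline n = sum . map (*2) . filter even $ [1 .. n]
```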
But nothing competes with having a good understanding of compilers.
Could you maybe clarify what you mean by the 100 fusible functions available? The link you share points to the Vector datatype. Does that mean only operations on Vectors fuse?
Looking at the docs of Vector, they never mention the fusibility provided by the datatype. How is one supposed to know what will fuse without looking at the Core output of GHC, or waiting for another detailed blog post from yourself!
Fusion is tied to particular sequence types. The vector library is the best example, as it is built solely from fusible functions. This is mentioned on the main page: http://hackage.haskell.org/package/vector
In general, if you’re using vector you don’t need to care about fusion — since it’s all designed that way (same for the existing fusion mechanisms for lists).
Now, if you’re really interested (say you’re doing some high performance numerics), then you’d use a tool like ghc-core to watch for fusion sites. GHC will let you know what it finds.
Thanks for the clarification. That is really quite a powerful feature, I see why you are so excited about it.
I want to personally encourage you to keep writing these posts about GHC. My day job lands my thoughts square in the land of C, but whenever one of these posts makes it on to the news aggregator sites I find myself breaking out GHCi yet again.
Hopefully one of these days I won’t have to put it away.
Thanks Chris. That’s gratifying!
@Brett We are already way beyond humans being able to comprehend the performance of a piece of code. Compiler optimizations or not.
Even if the code perfectly represents the steps the CPU would take, you still have to consider the effect of cache misses and other hard-to-predict realities of the execution context.
So just let it go. Assume the code will be optimized as well as theoretically possible by the compiler and runtime, and focus on making it as readable and maintainable as possible instead.
Fusion seems nice in theory, but never seems to be applicable to situations in my programs. I often have pipelines of operations, but they are rarely combined with direct composition.
For example, I have a program that would seem to be an ideal candidate for fusion. I have lots of arrays that get added to other arrays, multiplied by constants, mapped over by transformers, etc.
However, this is a kind of EDSL where the arrays are stored in an environment (a StateT basically acting as a Reader). They get pulled out, manipulated, and put back in to dynamically scope over the called code.
— ignoring the possibility of Nothing from Map.lookup
a = local (Map.insert "sig" . Vector.map (*2) . Map.lookup "sig") b
b = local (Map.insert "sig" . Vector.map (/2) . Map.lookup "sig") c
c = …
I’m pretty sure in this case the two maps won’t fuse since GHC would have to figure out that from the point of view of ‘c’, the intermediate vector created by ‘a’ is not needed.
I have thought of storing the environment not as (Map String Vector) but as (Map String ([Vector -> Vector], Vector)). Then when someone actually wants the value of a vector, they pull it out and do a ‘foldr (.) id funcs vec’. That would avoid intermediate vectors at the cost of duplicated computation if the intermediate vector is used after all.
But I’m not sure that would fuse either, since I don’t know if rules will kick in when the functions are in a list like that.
Ideas? Advice? I’m curious about the limitations of fusion and how one can structure a program to expose more things to it.
Of course this sort of defeats the purpose of “write naturally and the compiler will make it fast”, but jumping through a few composition hoops would still be easier than manual fusion. Or would it? What if I store [Either (Int -> Int) (Vector -> Vector)], then manually fuse adjacent Lefts?
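A sketch of that delayed-pipeline idea, with Data.Map and plain lists standing in for Vector (all names here are hypothetical):

```haskell
import qualified Data.Map as Map

-- Each entry keeps its pending transformations next to the base value;
-- nothing intermediate is materialised until someone forces the entry.
type Env = Map.Map String ([[Int] -> [Int]], [Int])

-- Queue a transformation instead of applying it immediately.
defer :: String -> ([Int] -> [Int]) -> Env -> Env
defer k f = Map.adjust (\(fs, v) -> (f : fs, v)) k

-- Run the whole pending pipeline in one pass when the value is demanded.
force :: String -> Env -> Maybe [Int]
force k env = do
  (fs, v) <- Map.lookup k env
  pure (foldr (.) id fs v)
```

Note, though, that whether the deferred functions fuse with each other is exactly the open question here: GHC’s rewrite rules fire at compile time on syntactic composition, so functions composed out of a runtime list will not fuse — this only avoids materialising the environment entries between pipeline stages.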
Is that the ghc-core tool that doesn’t work on Windows?
I find fusion very exciting also. I just wish it wasn’t so hard to figure out whether it’s happening or not. ;-)
In paragraph 4 you mention “Try doing that in your favorite strict language”. The reason you can fuse is purity, not evaluation strategy. The reason pure languages allow us to fuse is because the only observable effect (non-termination) is commutative, so computations can effectively be reordered in programs without observable differences.
I meant more that allocating a strict array like this will kill performance, unless you switch to lazy sequences (or have a deforestation pass).