Roman Leshchinskiy has released version 0.5 of vector, the future standard non-parallel array library for GHC Haskell. This post covers some of the differences between it and uvector, and what to watch for when migrating code to vector.
The summary, as of Feb 15, 2010, is that you can move to vector now. In almost all cases you will get identical performance to uvector, but with a nicer interface. There are currently a few small gaps in the API, and a couple of functions still need performance tweaks, but these should not affect most people (and will likely be fixed in the coming days). Note that you should use the -Odph optimization flag for the most reliable results.
vector is one of the results of the multi-year Data Parallel Haskell project, which aims to develop high-performance parallel bulk array processing for Haskell: transparently multicore parallel array processing that outperforms C or C++ by using cores more efficiently.
While this project concentrates on data parallelism, it has also led to new approaches to flat, sequential arrays for Haskell. The code has been spun off into two packages, uvector and vector, which replace the fifteen-year-old array package with faster, more flexible types and functions.
These two libraries share a common origin, but have different engineering goals. Both libraries make heavy use of loop fusion based on streams to achieve excellent performance. (You can read more about that in a separate post.)
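As a rough illustration of what fusion is meant to buy you, the sketch below chains an enumeration, a map and a sum. With optimization on, a fusing library is expected to compile a pipeline like this into a single loop, without materializing the intermediate arrays. The name sumOfSquares is made up for this example, and the code assumes the Data.Vector.Unboxed module from vector.

```haskell
import qualified Data.Vector.Unboxed as U

-- Sum of the squares of 1..n, written as a three-stage pipeline.
-- sumOfSquares is a hypothetical name used only for illustration.
sumOfSquares :: Int -> Int
sumOfSquares n = U.sum (U.map (\x -> x * x) (U.enumFromTo 1 n))
```

Whether the stages actually fuse depends on the compiler version and flags, so it is worth inspecting the generated Core if this matters for your code.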
uvector is a conservative attempt to develop fast, unboxed arrays with a flexible interface based on fusion, intended to replace Data.Array in the short term. It has been in active service for about two years now, filling a gap while we waited for vector to mature. uvector has several users, including haskell-criterion, haskell-histogram-fill, haskell-hnn, haskell-monte-carlo, haskell-mwc-random, haskell-pqueue-mtl, haskell-safe-freeze, haskell-statistics, haskell-statistics-fusion and haskell-uvector-algorithms. In the medium term, these packages should consider moving to vector.
The vector library is far more ambitious, aiming to be the standard array library for high performance problems in all circumstances. It cleanly supports:
- mutable arrays
- immutable arrays
- storable types
As in uvector, unboxed representations are specialized at compile time via type families, and fusion is used throughout the interface. Unlike uvector, vector supports boxed arrays and provides in-place fusion of mutable array operations.
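To give a feel for mixing immutable and mutable operations, here is a minimal sketch that applies a destructive update to an immutable vector. It assumes a vector release that exports U.modify and M.swap (check the documentation of the version you have installed); swapEnds is a made-up name.

```haskell
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Unboxed.Mutable as M

-- Swap the first and last elements of a vector.
-- U.modify runs the given ST action on a mutable version of the input
-- and returns the result as an immutable vector again.
-- swapEnds is a hypothetical name used only for illustration.
swapEnds :: U.Vector Int -> U.Vector Int
swapEnds v
  | U.length v < 2 = v
  | otherwise      = U.modify (\mv -> M.swap mv 0 (M.length mv - 1)) v
```

From the caller's point of view swapEnds is an ordinary pure function; the mutation stays local to the ST action.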
If you need transparently parallel arrays, you should consider the dph package, distributed with GHC.
uvector is stable, and is now in maintenance-only mode. If you like it, you can safely continue to use it for the foreseeable future, though performance improvements to the fusion framework or array types developed in vector will not be backported to uvector.
vector, as of 0.5, has been declared “beta”. You can begin migrating code to use it.
Migrating your code
I’ve just finished porting the uvector micro benchmark suite to vector, and have the following notes on how to migrate your code to use unboxed arrays in the vector package.
The old uvector namespace was Data.Array.Vector, with a U suffix appended to function names. Both go away; for unboxed arrays you should instead use:
import qualified Data.Vector.Unboxed as U
Most function names are identical, so we have in vector the obvious counterparts to uvector. All these are basically unchanged:
U.length U.null U.empty U.cons U.snoc U.replicate U.head
U.last U.init U.tail U.take U.drop U.map U.filter U.elem U.notElem
U.product U.sum U.maximum U.minimum U.foldl1 U.foldl U.dropWhile U.break
U.find U.all U.any U.and U.or U.maximumBy U.minimumBy
Some function names have changed:
U.++ replaces appendU
U.! replaces indexU
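A small before-and-after sketch of the two renamed functions follows. The uvector lines are comments, for comparison only; the names joined and third are made up for this example.

```haskell
import qualified Data.Vector.Unboxed as U

-- uvector style (old), shown for comparison only:
--   joined = appendU xs ys
--   third  = indexU joined 2

-- vector style (new): the same operations as qualified operators.
joined :: U.Vector Int
joined = U.fromList [1, 2, 3] U.++ U.fromList [4, 5]

-- Indexing is zero-based, so index 2 is the third element.
third :: Int
third = joined U.! 2
```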
Some functions are missing:
There are many new functions on vectors, in particular, on mutable arrays, and for bulk operations (backpermute, reverse, accum).
Notably, vector supports boxed types, so you can more easily store arbitrary Haskell values in fusible arrays (for example, arrays of Integer).
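For instance, a boxed vector can hold arbitrary-precision Integer values, which an unboxed array cannot. The sketch below uses the boxed Data.Vector module; factorials is a made-up name.

```haskell
import qualified Data.Vector as V

-- A boxed vector of Integers holding 0! .. n!.
-- V.scanl yields the running products, starting from the seed 1.
-- factorials is a hypothetical name used only for illustration.
factorials :: Int -> V.Vector Integer
factorials n = V.scanl (*) 1 (V.enumFromTo 1 (fromIntegral n))
```

The later entries quickly exceed the range of a machine Int, which is exactly the case where a boxed element type is needed.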
I found only a few differences in performance compared to uvector, and have notified Roman. These shouldn’t affect many users currently, and will likely disappear in coming days.
First, compile with -Odph instead of -O2. This fixes some optimization issues with zips, and probably other things.
Functions to watch out for:
- zip, zipWith, zipWith3: uvector was a lot faster (10x) in simple programs; however, moving to -Odph fixes zips entirely.
- Different fusion results for 'empty' (kind of a corner case).
- eq seems to be about twice as slow. Unsure why.
- Bools don't seem to be bit-packed? At least, unboxed Bool arrays seem a bit slower than in uvector.
- U.last appears to be optimized differently, though this doesn't affect performance.
- Pipelines ending in 'null' (another corner case) are fused differently (slightly worse performance).
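If you want to check whether the zip issue affects your own code, a small zip-based kernel like the one below is a reasonable thing to compile with both -O2 and -Odph and compare. This is only a sketch: dotProduct is a made-up name, and no timings are implied.

```haskell
import qualified Data.Vector.Unboxed as U

-- Dot product via zipWith; the intermediate vector of products is a
-- candidate for fusion with the surrounding sum.
-- dotProduct is a hypothetical name used only for illustration.
dotProduct :: U.Vector Double -> U.Vector Double -> Double
dotProduct xs ys = U.sum (U.zipWith (*) xs ys)
```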
And that’s about it.
As of this post, I'm officially declaring uvector to be in maintenance-only mode, and will be working to improve vector.
2 thoughts on “Migrating from uvector to vector”
Some additional stuff:
Most regularly named vector functions have bounds checking, with an ‘unsafeFoo’ version that doesn’t do bounds checking. However, there are compile-time options that let you toggle bounds checking on and off, as well as toggling bounds checking for the unsafe versions. (It’s too bad this can’t be toggled at program compile time instead of library compile time.)
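A minimal sketch of the checked/unchecked pairing, using indexing as the example. It assumes U.unsafeIndex is the unchecked counterpart of U.! in the version you are using; checkedAt and uncheckedAtOr are made-up names.

```haskell
import qualified Data.Vector.Unboxed as U

-- Checked indexing: an out-of-range index raises an error.
checkedAt :: U.Vector Int -> Int -> Int
checkedAt v i = v U.! i

-- Unchecked indexing: the caller must guarantee 0 <= i < U.length v.
-- Here the guard establishes that before U.unsafeIndex is called,
-- and a default value is returned otherwise.
uncheckedAtOr :: Int -> U.Vector Int -> Int -> Int
uncheckedAtOr def v i
  | i >= 0 && i < U.length v = U.unsafeIndex v i
  | otherwise                = def
```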
Mutable arrays are parameterized over the monad type, so they can be used in both ST and IO. The parameter given to the array is the state-thread type, so in IO, types look like:
MVector mv e => mv RealWorld e -> IO whatever
while in ST they look like:
MVector mv e => mv s e -> ST s whatever
You can also be parametric in the monad, like so:
(PrimMonad m, MVector mv e) => mv (PrimState m) e -> m whatever
However, there seem to be some flukes in the 6.12 GHC optimizer with regard to using this with IO. Intermediate code generation (and thus performance) for ST seems much better, so it’s probably advisable to stick to it as much as possible, currently.
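To make the monad-parametric signature concrete, here is a sketch of one function used from both ST and IO. It is written against the API of more recent vector releases, so names such as U.thaw, U.freeze and M.replicate are an assumption for 0.5. Control.Monad.Primitive comes from the primitive package, and bump, bumpedPure and bumpedIO are made-up names.

```haskell
import Control.Monad.Primitive (PrimMonad, PrimState)
import Control.Monad.ST (runST)
import qualified Data.Vector.Generic.Mutable as GM
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Unboxed.Mutable as M

-- Add one to the element at index i, for any mutable vector type and
-- any primitive monad (ST or IO). bump is a hypothetical name.
bump :: (PrimMonad m, GM.MVector mv e, Num e) => mv (PrimState m) e -> Int -> m ()
bump mv i = do
  x <- GM.read mv i
  GM.write mv i (x + 1)

-- Running it in ST gives a pure result.
bumpedPure :: U.Vector Int
bumpedPure = runST $ do
  mv <- U.thaw (U.fromList [1, 2, 3])
  bump mv 0
  U.freeze mv

-- Running the same function in IO.
bumpedIO :: IO (U.Vector Int)
bumpedIO = do
  mv <- M.replicate 3 0
  bump mv 2
  U.freeze mv
```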
I note that the Vector API uses lazy tuples where UVector has a strict pair type (:*:). Has the benefit of strictness here been determined negligible, or was this a friendliness-of-API consideration? Does the optimizer perhaps come to our rescue here?