So you want to hack Haskell for the Google Summer of Code

The Google Summer of Code is a great chance to work on open source projects as a student, and get training from some experience hackers, wonderfully sponsored by Google. Haskell.org will be participating for its 5th year.

If you’re thinking about working on Haskell projects, you should certainly be reading:

Here are some of the things to think about before you decide to submit a proposal to help out Haskell.org this summer, to help you make stronger proposals. This is purely my opinion, and might not necessarily reflect the opinion of all other mentors.

We have limited resources as a community, and the Google Summer of Code has been instrumental in bringing new tools and libraries to our community. Some notable projects from the past few years include:

  • The GHCi debugger
  • Improvements to Haddock
  • Major work on Cabal
  • Generalizing Parsec (parsec3)
  • Shared Libraries for GHC
  • Language.C

These student projects have gone on to have wide impact, or brought new capabilities to the Haskell community. The pattern has been towards work on the most popular tools and libraries, or work on new areas that Haskell programmers are demanding work on. Rarely do we fund development of applications in Haskell, instead concentrating on general infrastructure that supports all Haskell projects.

To succeed with your project proposal, you need to propose doing work on the most important “blockers” for Haskell’s use. Work that:

  • Brings the most value to the community
  • Addresses widespread need
  • Will impact many other libraries and tools
  • Doesn’t rewrite things in Haskell for their own sake
  • Is feasible in 3 months

It can be hard to find projects of the right size — important enough the work will make a difference — and thus attract the attention of mentors (who vote on your proposal), but is still feasible to achieve in 3 months.

To help get a sense of the mood of what the community thinks is “hot” (though not necessarily important), we set up a crowd voting site for ideas on Reddit. But be wary, the crowd can get excited about silly things, and doesn’t necessarily have good business sense.

The list of projects can help you get a sense for what Haskellers are thinking about this year. Not everything here will is feasible for a summer project though, so be sure to get advice!

Some of the social factors to consider in your application:

  • You have one or two mentors with good expertise in the area. Ideally you’re already lining up the mentors to help watch your work.
  • You’re hanging out in #haskell or on the Haskell Reddit, your blog is
    on Planet Haskell, or you’re going to be at the hackathons.

The more involved you are in the community, the more likely you’ll have a sense of what the most important work to propose is.

And of course you need to have skills to pay the bills:

  • You should have demonstrated competence in Haskell (e.g. you’ve
    uploaded things to Hackage)
  • Demonstrated discipline in open source — self-motivation

As a guide to what reviewers will be thinking about when reading your proposal, you should be able to answer (at least) the following questions from the proposal alone:

  • What are you trying to accomplish?
  • How is it done in Haskell now, and with what limitations?
  • Are there similar projects in other languages? What related work is there?
  • If successful, what difference will it make? Will it enable new Haskell business cases? Performance improvements to many libraries? Better accessibility of Hackage code?
  • What are the mid-term and final results you’re hoping to achieve?
  • How will the result of your work affect other projects?

So, now you know what we’re interested in, start writing that proposal!

Fast mutable collections for Haskell: more benchmarks

http://www.haskell.org/haskellwiki/HacPDX

Today at HacPDX I did more work on the Judy bindings described yesterday. Here is judy 0.2.1 up (lunchtime release). These are bindings to the (in)famous judy arrays library, allowing Haskell keys and values to be stored in fast, mutable maps and hash structures. You can currently index via words (i.e. you have to hash keys yourself), and you can store value types or pointers to Haskell values allocated with stable pointers.

Today, I started by filling out the API some more, and benchmarking of how each API function scales. The functions I’m interested in are:

size ::  JudyL a -> IO Int
insert :: JA a => Key -> a -> JudyL a -> IO ()
lookup :: JA a => Key      -> JudyL a -> IO (Maybe a)
delete ::         Key      -> JudyL a -> IO ()

Let’s see how the performance of these functions scales with input. Benchmarking is performed via criterion (newly released to the public!) with random input keys and values generated from the mersenne twister.

Implementation methodology

My general strategy with constructing a high performance Haskell library is to:

  • Pick a candidate benchmark set.
  • Pick a good candidate data structure to make the benchmark faster.
  • Carefully implement the type, or the binding to the foreign type and its representation in Haskell
  • Carefully abstract out the unsafeties in any low level interface via type or language features.
  • Function-by-function, check by eye the optimized code that GHC produces (with ghc-core) (in particular, inlining, specialization and unboxing).
  • Function-by-function, benchmark the absolute performance as input grows (with criterion), with reference to model structures.
  • Use QuickCheck to show correctness against a model, with HPC to confirm code coverage of the test.
  • Document extensively with haddock.
  • Collect ideas on how to generalize the interface (via type classes, more flexible types, representation-changing interfaces via type families).

size

The size function tells you how many keys are in the Judy array. It has O(1) complexity, as we see from empirical measurement, here with randomized arrays with Int values and keys. We get the same cost whether it is 100k, 1M or 10 million elements in the structure. size is fast and scales.

size-100k-densities-800x200size-1M-densities-800x200size-10M-densities-800x200lookup

A common operation on associative arrays is lookup(). It better be fast, and not degrade.

lookup-100k-densities-800x200size-1M-densities-800x200

size-10M-densities-800x200If we plot the mean lookup times against N, we’ll see the lookup time grows slowly (taking twice as long when the data is 100 times bigger). (N.B. I forgot to change the titles on the above graphs).

delete

Delete has an almost identical profile to insert, unsurprisingly. Time to delete an element grows very slowly with N, and in absolute terms if very fast.

delete-100k-densities-800x200

delete-1m-densities-800x200

delete-10m-densities-800x200