Programming Haskell: argument handling and a complete cat

This is part three in a series of tutorials on programming Haskell. You can get up to speed by reading:

Today we’ll look more into how Haskell interacts with its environment, robust command line argument parsing, and writing a complete program.

Administrivia

Some clarifications on issues raised by yesterday’s article.

Chunks

One issue pointed out was that the ‘chunk’ function was missing. This was semi-intentional. Anyway, this function just splits a list into ‘n’ chunks:

    chunk n xs = chunk' i xs
      where
        chunk' _ [] = []
        chunk' n xs = a : chunk' n b where (a,b) = splitAt n xs

        i = ceiling (fromIntegral (length xs) / fromIntegral n)

You may be able to write a neater one..

Solipsistic philosophers

Also, an amusing thread appeared on reddit, regarding solipsistic philosopher programs:

Of course, a good optimizing compiler will replace your solipsistic philosopher with a no-op.

A good optimising compiler, or any Haskell compiler :-) Since results that are never required are not computed in Haskell, due to laziness, we can write high performance solipsism simulators all day long:

    main = do
        let largestNumber = last [1..]
        return ()

Running our simulation of the philosopher-mathematician pondering some thoughts on large numbers:

    $ ghc A.hs
    $ ./a.out
    ./a.out  0.00s user 0.01s system 100% cpu 0.003 total

Ok. Enough jokes. Show me the code!

Getting to work

Yesterday we implemented a few toy unix programs, including ‘cat’. Today we’ll look at writing a complete cat program, but with a focus on interacting properly with the environment and being careful about command line handling. For our running examples, we’ll consider the ‘cat’ and ‘tac’ programs. The basic spec for ‘cat’ is:

The cat utility reads files sequentially, writing them to the standard output. The file operands are processed in command-line order. If file is a single dash (`-‘) or absent, cat reads from the standard input.

It’s the ‘id’ function of the unix shell. BSD ‘cat.c’ is a 255 line C program. From the man page we can see it does more than just concatenate files. It can also:

  • Numbers the output lines, starting at 1.
  • Squeezes multiple adjacent empty lines
  • Displays non-printing characters so they are visible.

Let’s start by looking at the command line argument processing code.

Getting in arguments

The basic way to get arguments in a Haskell program is provided by the System.Environment library. We can use the getArgs function:

    Prelude> :m + System.Environment

    Prelude System.Environment> :t getArgs
    getArgs :: IO [String]

    Prelude System.Environment> do a <- getArgs; print a
    []

Which is empty, since we didn’t provide any arguments!

In a small program we can implement all argument handling using just ‘getArgs’, and some simple list functions. For example, here’s a basic ‘tac’ program, to reverse its input:

    import System.Environment
    import System.Exit

    main = getArgs >>= parse >>= putStr . tac

    tac  = unlines . reverse . lines

    parse ["-h"] = usage   >> exit
    parse ["-v"] = version >> exit
    parse []     = getContents
    parse fs     = concat `fmap` mapM readFile fs

    usage   = putStrLn "Usage: tac [-vh] [file ..]"
    version = putStrLn "Haskell tac 0.1"
    exit    = exitWith ExitSuccess
    die     = exitWith (ExitFailure 1)

This program concatenates and prints the contents of files in reverse (or reads from stdin with no arguments), along with a couple of basic command line flags for version and help strings. It’s also reasonably careful about setting exit status on finishing, using the functions from System.Exit. The actual core algorithm for ‘tac’ is a nice pure Haskell function, and really all the hard work is done processing the command line args.

Some example use:

    $ ./tac -h
    Usage: tac [-vh] [file ..]

    $ ./tac -v
    Haskell tac 0.1

    $ ./tac A.hs B.hs
    return n
    print 1
    n <- getLine
    g = do
    ...
    import System.Exit
    import System.Environment

    $ ./tac  < A.hs 
    die     = exitWith (ExitFailure 1)
    exit    = exitWith ExitSuccess
    version = putStrLn "Haskell tac 0.1"
    ...
    import System.Exit
    import System.Environment

As you can see, once compiled it behaves like a normal unix utility, properly dealing with stdin, with file arguments and the shell.

Note that getArgs doesn’t return the program name. To get that we use:

    Prelude System.Environment> :t getProgName
    getProgName :: IO String

The environment

Many programs also make use of environment variables. We can get access to the full shell ‘env’ using

    Prelude System.Environment> :t getEnvironment
    getEnvironment :: IO [(String, String)]

Which returns an association list, mapping environment variables to their values. We can stick this list straight into an efficient Map structure, for later use. Here’s an interactive example:

    Prelude System.Environment> env <- do e <- getEnvironment; return (Data.Map.fromList e)

which we could also write as:

    env <- Data.Map.fromList `fmap` getEnvironment

Once we’ve got the environment in a useful Map, we can inspect it using Map lookups:

    Prelude System.Environment> :t env
    env :: Data.Map.Map String String

    Prelude System.Environment> :t Data.Map.lookup
    Data.Map.lookup :: (Ord k, Monad m) => k -> Data.Map.Map k a -> m a

That is, the lookup function takes some key, ‘k’, and a Map from keys to elements of type ‘a’, and returns an element, if found, in some monad.

More on failure

You may recall from the first tutorial that the Map ‘lookup’ function will fail if the key is not found. The particular way you wish it to fail depends on which monad you use. You can tell this from the type of lookup. The

    lookup :: (Monad m) => ... m a

syntax indicates that lookup is polymorphic in its monad: it will work for any monad type, and its behaviour is determined by the particular instance of the monad interface you ask for. When a lookup fails, it calls the ‘fail’ function for the monad you’re using. When a lookup is successful, it calls the ‘return’ function of the same monad. Being ‘polymorphic in a monad’ really just means that it will call which particular concrete monad ‘subclass’ you happen to be using.

Looking at the various useful monads for this, we can choose which failure behaviour we would prefer. Here’s the implementation of the ‘fail’ interface for a variety of monads. It’s up to you to pick which behaviour you’d like.

For Maybes, we get the null value, Nothing, on failure:

    instance  Monad Maybe  where
        return   = Just
        fail _   = Nothing

For Eithers, we get an error string:

    instance (Error e) => Monad (Either e) where
        return   = Right
        fail s   = Left (strMsg s)

For lists, we get the empty list on failure:

    instance  Monad []  where
        return x = [x]
        fail _   = []

And for IO we get an exception thrown:

    instance  Monad IO  where
        fail    = ioError . userError
        return  = returnIO

So, depending on the type signature, the compiler will statically pick one of these ‘fail’s to use on a lookup failing at runtime. For example, to fail with a null value, we’d use the Maybe monad:

    Prelude System.Environment> Data.Map.lookup "USER" env :: Maybe String
    Just "dons"

    Prelude System.Environment> Data.Map.lookup "LUSER" env :: Maybe String
    Nothing

Using Nothing for fatal errors isn’t the best practice for large programs, since you usually need to know what failed. For a string-annotated Nothing, we can use the Either monad:

    Prelude System.Environment> :m + Control.Monad.Error

    Prelude System.Environment Control.Monad.Error> Data.Map.lookup "LUSER" env :: Either String String
    Left "Data.Map.lookup: Key not found"

Which is more useful. To fail with a proper exception we’d use the IO monad:

    Prelude System.Environment> Data.Map.lookup "LUSER" env :: IO String
    *** Exception: user error (Data.Map.lookup: Key not found)

We’ll now turn to a more flexible approach to argument parsing.

GetOpt

The base Haskell library comes with an implementation of getopt, a useful library for standardised argument handling. Let’s implement the argument handling of the unix ‘cat’ program using this lib.

A type for flags

The first thing to do is define a data type representing the valid flags. First, let’s import all the libraries I’ll use:

    import Control.Monad
    import Data.Char
    import Data.List
    import System.Console.GetOpt
    import System.Environment
    import System.Exit
    import System.IO
    import Text.Printf

Now in a new file, Cat.hs, we’ll write:

    data Flag
        = Blanks                -- -b
        | Dollar                -- -e 
        | Squeeze               -- -s
        | Tabs                  -- -t
        | Unbuffered            -- -u
        | Invisible             -- -v
        | Number                -- -n
        | Help                  -- --help
        deriving (Eq,Ord,Enum,Show,Bounded)

The ‘data’ keyword defines a new data type, ‘Flag’, which can have one of several values. Such a type is often called a sum (or union) type. So ‘Flag’ is a new user-defined type, just like other types, such as Bool or Int. The identifiers on the right hand side of the | are the types constructors. That is, values which have type ‘Flag’. We ask the compiler to also derive some instances of various common classes for us (so we don’t have to write the code ourselves).

With just this we can already start playing around with the flag data type in GHCi:

    > :reload
    > :m + Data.List
    > let s = [Number, Squeeze, Unbuffered, Squeeze]

    *Main Data.List> let s = [Number, Squeeze, Unbuffered, Squeeze]

    *Main Data.List> sort s
    [Squeeze,Squeeze,Unbuffered,Number]

    *Main Data.List> nub s
    [Number,Squeeze,Unbuffered]

    *Main Data.List> map fromEnum s
    [6,2,4,2]

    *Main Data.List> [Blanks .. ]
    [Blanks,Dollar,Squeeze,Tabs,Unbuffered,Invisible,Number,Help]

User defined data types are really first class citizens in Haskell, and behave just like the ‘inbuilt’ types.

Binding to command line flags

The next step is to associate some particular command line strings with each abstract flag. We do this by writing a list of ‘Option’s, which tie long and short argument flags to the particular abstract Flag value we need, and also associated a help string with each flag:

    flags =
       [Option ['b'] []       (NoArg Blanks)
            "Implies the -n option but doesn't count blank lines."
       ,Option ['e'] []       (NoArg Dollar)
            "Implies the -v option and also prints a dollar sign (`$') at the end of each line."
       ,Option ['n'] []       (NoArg Number)
            "Number the output lines, starting at 1."
       ,Option ['s'] []       (NoArg Squeeze)
            "Squeeze multiple adjacent empty lines, causing the output to be single spaced."
       ,Option ['t'] []       (NoArg Tabs)
            "Implies the -v option and also prints tab characters as `^I'."
       ,Option ['u'] []       (NoArg Unbuffered)
            "The output is guaranteed to be unbuffered (see setbuf(3))."
       ,Option ['v'] []       (NoArg Invisible)
            "Displays non-printing characters so they are visible."
       ,Option []    ["help"] (NoArg Help)
            "Print this help message"
       ]

Parsing the flags

To actually turn the list of command line flags getArgs gives us, into a useful list of abstract Flag values, we use the ‘getOpt’ function, which returns a triple consisting of flags that were set, a list of any non-flag arguments, and a list of error messages. First we need a couple of libraries:

And now to parse the ‘cat’ argument grammar, we would use:

    parse argv = case getOpt Permute flags argv of

        (args,fs,[]) -> do
            let files = if null fs then ["-"] else fs
            if Help `elem` args
                then do hPutStrLn stderr (usageInfo header flags)
                        exitWith ExitSuccess
                else return (nub (concatMap set args), files)

        (_,_,errs)      -> do
            hPutStrLn stderr (concat errs ++ usageInfo header flags)
            exitWith (ExitFailure 1)

        where header = "Usage: cat [-benstuv] [file ...]"

If the arguments don’t make sense, we fail with a usage message, and set the exit status to 1. The final list of flags to use, and any files to open, can be returned to main for processing now:

    main = do
        (as, fs) <- getArgs >>= parse
        putStrLn $ "Flags: " ++ show as
        putStrLn $ "Files: " ++ show fs

We can now test out the argument parsing code:

    $ ghc Cat.hs

    $ ./a.out          
    Flags: []
    Files: []

    $ ./a.out A.hs Z.hs
    Flags: []
    Files: ["A.hs","Z.hs"]

Ok, files are good. How about the flags that imply other flags?

    $ ./a.out -b A.hs Z.hs
    Flags: [Number,Blanks]
    Files: ["A.hs","Z.hs"]

    $ ./a.out -btvu A.hs Z.hs
    Flags: [Number,Blanks,Tabs,Invisible,Unbuffered]
    Files: ["A.hs","Z.hs"]

Good. And invalid flags:

    $ ./a.out -i A.hs Z.hs
    unrecognized option `-i'
    Usage: cat [-benstuv] [file ...]
      -b          Implies the -n option but doesn't count blank lines.
      -e          Implies the -v option and also prints a dollar sign (`$') at the end of each line.
      -n          Number the output lines, starting at 1.
      -s          Squeeze multiple adjacent empty lines, causing the output to be single spaced.
      -t          Implies the -v option and also prints tab characters as `^I'.
      -u          The output is guaranteed to be unbuffered (see setbuf(3)).
      -v          Displays non-printing characters so they are visible.
          --help  Print this help message

Ok, that was pretty easy. Now let’s try to implement these functions!

Implementing cat

So now we have to map those abstract flag values to real behaviour. I’ll start with the easy ones first.

If -u is set, we turn off all buffering. After that, we map a ‘cat’ function over each file. So the program’s ‘main’ is just:

    main = do
        (args, files) <- getArgs >>= parse
        when (Unbuffered `elem` args) $ hSetBuffering stdout NoBuffering
        mapM_ (cat args) files

Where ‘cat’ will process the files one at a time. ‘cat’ is where all the hard work is done.

Most of the operations the cat program does requires access to each line of the file. We also need to be able to handle the special file name, “-“. What we’d like to do is separate out any IO, from operations on each file’s content. To do this we’ll write a higher order function, ‘withFile’, which takes a filename, opens it, splits it into lines and applies a function to the contents of the file, before writing the result to stdout:

    withFile s f = putStr . unlines . f . lines =<< open s
      where
        open f = if f == "-" then getContents else readFile f

Now we can implement the pure ‘cat’ function, implementing the cat program’s functionality. Firstly, if there are no command line flags, the ‘cat’ function does nothing to the input:

    cat [] f = withFile f id

That is, it applies the ‘id’ function to the stream generated by withFile. That was easy.

Now, if there are some arguments, we’ll need to process them. This can be a little tricky, since the effect of the command line flags are cumulative, and we better process them in the right order. What is that order? Well, from experimentation :-) it seems that (if all flags are enabled) ‘cat’ proceed to:

  • first squeeze any blank lines;
  • then any visibility flags are processed;
  • then line numbering occurs;
  • then, finally, any visible newlines are printed as ‘$’.

The visibility flags transform non-printing characters into a visible representation. The key to coding this up is recognising that its just a functional pipeline. So we can write it as:

    cat as f = withFile f (newline . number . visible as)

Where ‘visible’ renders any non-printing chars. Then we number the resulting lines (if the arguments are set), and then finally make any remaining newlines visible. Note that the core of the algorithm does no IO. It’s a pure function from [String] -> [String]. Now the implementation of ‘number’:

      where
        number  s    = if Blanks `elem` as then numberSome s else ifset Number numberAll s
        newline s    = ifset Dollar (map (++"$")) s
        visible as s = foldl' (flip render) s as
        ifset a f    = if a `elem` as then f else id

Here we actually handle all the data traversal. And use a little helper function, ‘ifset’, to conditionally execute a function if the corresponding command line is set. Note that slight trickiness involving numbering: either we number all lines, or number the non blank lines, but not both.

‘render’, the function to print invisible characters, is just:

    render Squeeze   = map head . group
    render Tabs      = map $ concatMap (c -> if c == 't' then "^I" else [c])
    render Invisible = map $ concatMap visible
      where
        visible c | c == 't' || isPrint c = [c]
                  | otherwise              = init . tail . show $ c
    render _ = id

And the numbering function:

    numberLine      = printf "%6d  %s"
    numberAll s     = zipWith numberLine [(1 :: Integer)..] s
    numberSome s    = reverse . snd $ foldl' draw (1,[]) s
      where
        draw (n,acc) s
                | all isSpace s = (n,   s : acc)
                | otherwise     = (n+1, numberLine n s : acc)

And we’re done! In the end, our entire implementation is some 89 lines of code, of which 60 are to do with importing modules, or command line argument parsing. The actual heart of the program is fairly tiny in the end.

Let’s run the code.

    $ ghc -O Cat.hs -o cat

Check it actually prints its arguments:

    $ ./cat Cat.hs | head
    import System.Console.GetOpt
    import System.IO
    import System.Exit
    import System.Environment
    import Data.List
    import Data.Char
    import Control.Monad
    import Text.Printf

    main = do

Or multiple arguments:

    $ ./cat Cat.hs /usr/share/dict/words | tail
    zymotoxic
    zymurgy
    Zyrenian
    Zyrian
    Zyryan
    zythem
    Zythia
    zythum
    Zyzomys
    Zyzzogeton

Does it number lines:

    $ ./cat -n Cat.hs  | tail
        80  
        81      (_,_,errs)      -> do
        82          hPutStrLn stderr (concat errs ++ usageInfo header flags)
        83          exitWith (ExitFailure 1)
        84  
        85      where header = "Usage: cat [-benstuv] [file ...]"
        86  
        87            set Dollar = [Dollar, Invisible]
        88            set Tabs   = [Tabs,   Invisible]
        89            set f      = [f]

What about non blank lines:

    $ ./cat -b Cat.hs | tail

        72      (_,_,errs)      -> do
        73          hPutStrLn stderr (concat errs ++ usageInfo header flags)
        74          exitWith (ExitFailure 1)

        75      where header = "Usage: cat [-benstuv] [file ...]"

        76            set Dollar = [Dollar, Invisible]
        77            set Tabs   = [Tabs,   Invisible]
        78            set f      = [f]

How about some visibility flags:

    $ cat -eb Cat.hs | tail
    $
        72      (_,_,errs)      -> do$
        73          hPutStrLn stderr (concat errs ++ usageInfo header flags)$
        74          exitWith (ExitFailure 1)$
    $
        75      where header = "Usage: cat [-benstuv] [file ...]"$
    $
        76            set Dollar = [Dollar, Invisible]$
        77            set Tabs   = [Tabs,   Invisible]$
        78            set f      = [f]$

And turning on all the flags:

    $ cat -bnvste Cat.hs | tail
    $
        72      (_,_,errs)      -> do$
        73          hPutStrLn stderr (concat errs ++ usageInfo header flags)$
        74          exitWith (ExitFailure 1)$
    $
        75      where header = "Usage: cat [-benstuv] [file ...]"$
    $
        76            set Dollar = [Dollar, Invisible]$
        77            set Tabs   = [Tabs,   Invisible]$
        78            set f      = [f]$

Nice!

Summary

Well, in the end I didn’t get on to exception handling, or the use of bytestring to improve performance further. However, we have implemented (95%) of the unix ‘cat’ program, including all argument handling and functionality, in about an hour and a half.

Once it typechecked, the code just worked, except for one bug where I originally rendered newline before counting lines (simply because the spec was underspecified). Lesson: you can start writing your unix scripts in Haskell right now. They’ll be flexible, clean, and easy to maintain. And most of all, fun to write!

Hopefully next time we’ll look into using bytestrings for processing larger volumes of data, and the use of exception handling to deal with unusual errors.

The complete source

And just for reference, there’s the complete source:

    import System.Console.GetOpt
    import System.IO
    import System.Exit
    import System.Environment
    import Data.List
    import Data.Char
    import Control.Monad
    import Text.Printf

    main = do
        (args, files) <- getArgs >>= parse
        when (Unbuffered `elem` args) $ hSetBuffering stdout NoBuffering
        mapM_ (cat args) files

    withFile s f = putStr . unlines . f . lines =<< open s
      where
        open f = if f == "-" then getContents else readFile f

    cat [] f = withFile f id
    cat as f = withFile f (newline . number . visible as)
      where
        number  s    = if Blanks `elem` as then numberSome s else ifset Number numberAll s
        newline s    = ifset Dollar (map (++"$")) s
        visible as s = foldl' (flip render) s as
        ifset a f    = if a `elem` as then f else id

    render Squeeze   = map head . group
    render Tabs      = map $ concatMap (c -> if c == 't' then "^I" else [c])
    render Invisible = map $ concatMap visible
      where
        visible c | c == 't' || isPrint c = [c]
                  | otherwise              = init . tail . show $ c
    render _ = id

    numberLine      = printf "%6d  %s"
    numberAll s     = zipWith numberLine [(1 :: Integer)..] s
    numberSome s    = reverse . snd $ foldl' draw (1,[]) s
      where
        draw (n,acc) s
                | all isSpace s = (n,   s : acc)
                | otherwise     = (n+1, numberLine n s : acc)

    data Flag
        = Blanks                -- -b
        | Dollar                -- -e 
        | Squeeze               -- -s
        | Tabs                  -- -t
        | Unbuffered            -- -u
        | Invisible             -- -v
        | Number                -- -n
        | Help                  -- --help
        deriving (Eq,Ord,Enum,Show,Bounded)

    flags =
       [Option ['b'] []       (NoArg Blanks)
            "Implies the -n option but doesn't count blank lines."
       ,Option ['e'] []       (NoArg Dollar)
            "Implies the -v option and also prints a dollar sign (`$') at the end of each line."
       ,Option ['n'] []       (NoArg Number)
            "Number the output lines, starting at 1."
       ,Option ['s'] []       (NoArg Squeeze)
            "Squeeze multiple adjacent empty lines, causing the output to be single spaced."
       ,Option ['t'] []       (NoArg Tabs)
            "Implies the -v option and also prints tab characters as `^I'."
       ,Option ['u'] []       (NoArg Unbuffered)
            "The output is guaranteed to be unbuffered (see setbuf(3))."
       ,Option ['v'] []       (NoArg Invisible)
            "Displays non-printing characters so they are visible."
       ,Option []    ["help"] (NoArg Help)
            "Print this help message"
       ]

    parse argv = case getOpt Permute flags argv of
        (args,fs,[]) -> do
            let files = if null fs then ["-"] else fs
            if Help `elem` args
                then do hPutStrLn stderr (usageInfo header flags)
                        exitWith ExitSuccess
                else return (nub (concatMap set args), files)

        (_,_,errs)      -> do
            hPutStrLn stderr (concat errs ++ usageInfo header flags)
            exitWith (ExitFailure 1)

        where header = "Usage: cat [-benstuv] [file ...]"

              set Dollar = [Dollar, Invisible]
              set Tabs   = [Tabs,   Invisible]
              set f      = [f]

Programming Haskell: string processing (with a dash of concurrency)

This is part two in a series of tutorials on programming Haskell. You can get up to speed by reading yesterday’s introductory article.

Today we’ll look more into the basic tools at our disposal in the Haskell language, in particular, operations for doing IO and playing with files and strings.

Administrivia

Before we get started, I should clarify a small point raised by yesterday’s article. One issue I forgot to mention was that there are slight differences between running Haskell in ghci, the bytecode interpreter, and compiling it to native code with GHC.

Haskell programs are executed by evaluating the special ‘main’ function.

    import Data.List

    mylength = foldr (const (+1)) 0
    main = print (mylength "haskell")

To compile this to native code, we would feed the source file to the compiler:

    $ ghc A.hs
    $ ./a.out 
    7

For a faster turnaround, we can run the code directly through the bytecode interpreter, GHCi, using the ‘runhaskell’ program:

    $ runhaskell A.hs
    7

GHCi, the interactive Haskell environment, is a little bit different. As it is an interactive system, GHCi must execute your code sequentially, as you define each line. This is different to normal Haskell, where the order of definition is irrelevant. GHCi effectively executes your code inside a do-block. Therefore you can use the do-notation at the GHCi prompt to define new functions:

    $ ghci
    Prelude> :m + Data.List

    Prelude> let mylength = foldr (const (+1)) 0

    Prelude> :t mylength
    mylength :: [a] -> Integer

    Prelude> mylength "haskell"
    7

For this tutorial I will be developing code in a source file, and either compiling it as above, or loading the source file into GHCi for testing. To load a source file into GHCi, we do:

    $ ghci    
    Prelude> :load A.hs

    *Main> :t main
    main :: IO ()

    *Main> :t mylength
    mylength :: [a] -> Integer

    *Main> mylength "foo"
    3

    *Main> main
    7

Now, let’s get into the code!

IO

As the Camel Book says:

Unless you’re using artificial intelligence to model a solipsistic philosopher, your program needs some way to communicate with the outside world.

In yesterday’s tutorial, I briefly introduced ‘readFile’, for reading a String from a file on disk. Let’s consider now IO in more detail. The most common IO operations are defined in the System.IO library.

For the most basic stdin/stdout Unix-style programs in Haskell, we can use the ‘interact’ function:

    interact    ::  (String -> String) -> IO ()

This higher order function takes, as an argument, some function for processing a string (of type String -> String). It runs this function over the standard input stream, printing the result to standard output. A surprisingly large number of useful programs can be written this way. For example, we can write the ‘cat’ unix program as:

    main = interact id

Yes, that’s it! Let’s compile and run this program:

    $ ghc -O A.hs       

    $ cat A.hs | ./a.out
    main = interact id

How does this work? Firstly, ‘interact’ is defined as:

    interact f = do s <- getContents
                    putStr (f s)

So it reads a string from standard input, and writes to standard output the result of applying its argument function to that string. The ‘id’ function itself has the type:

    id :: a -> a

‘id’ is a function of one argument, of any type (the lowercase ‘a’ in the type means any type can be used in that position, i.e. it is a polymorphic function (also called a generic function in some languages)). ‘id’ takes a value of some type ‘a’, and returns a value of the same type. There’s is only one (non-trivial) function of this type:

    id a = a

So ‘interact id’ will print to the input string to standard output unmodified.

Let’s now write the ‘wc’ program:

    main    = interact count
    count s = show (length s) ++ "n"

This will print the length of the input string, that is, the number of chars:

    $ runhaskell A.hs < A.hs
    57

Line oriented IO

Only a small number of programs operate on unstructured input streams. It is far more common to treat an input stream as a list of lines. So let’s do that. To break a string up into lines, we’ll use the … ‘lines’ function, defined in the Data.List library:

    lines :: String -> [String]

The type, once again, tells the story. ‘lines’ takes a String, and breaks it up into a list of strings, splitting on newlines. To join a list of strings back into a single string, inserting newlines, we’d use the … ‘unlines’ function:

    unlines :: [String] -> String

There are also similar functions for splitting on words, namely ‘words’ and ‘unwords’. Now, an example. To count the number of lines in a file:

    main = interact (count . lines)

We can run this as:

    $ ghc -O A.hs

    $ ./a.out < A.hs
    3

Here we reuse the ‘count’ function from above, by composing it with the lines function.

On composition

This nice code reuse via composition is achieved using the (.) function, pronounced ‘compose’. Let’s look at how that works. (Feel free to skip this section, if you want to just get things done).

The (.) function is just a normal everyday Haskell function, defined as:

    (.) f g x = f (g x)

This looks a little like magic (or line noise), but its pretty easy. The (.) function simply takes two functions as arguments, along with another value. It applies the ‘g’ function to the value ‘x’, and then applies ‘f’ to the result, returning this final value. The functions may be of any type. The type of (.) is actually:

    (.) :: (b -> c) -> (a -> b) -> a -> c

which might look a bit hairy, but it essentially specifies what types of arguments make sense to compose. That is, only those where:

    f :: b -> c
    g :: a -> b
    x :: a

can be composed, yielding a new function of type:

    (f . g) :: a -> c

The nice thing is that this composition makes sense (and works) for all types a, b and c.

How does this relate to code reuse? Well, since our ‘count’ function is polymorphic, it works equally well counting the length of a string, or the length of a list of strings. Our littler ‘wc’ program is the epitome of the phrase: “higher order + polymorphic = reusable”. That is, functions which take other functions as arguments, when combined with functions that work over any type, produce great reusable ‘glue’. You only need vary the argument function to gain terrific code reuse (and the strong type checking ensures you can only reuse code in ways that work).

More on lines

Another little example, let’s reverse each line of a file (like the unix ‘rev’ command):

    main = interact (unlines . map reverse . lines)

Which when run, reverses the input lines:

    $ ./a.out < B.hs
    rahC.ataD tropmi
    ebyaM.ataD tropmi
    tsiL.ataD tropmi

So we take the input string, split it into lines, and the loop over that list of lines, reversing each of them, using the ‘map’ function. Finally, once we’ve reversed each line, we join them back into a single string with unlines, and print it out.

The ‘map’ function is a fundamental control structure of functional programming, similar to the ‘foreach’ keyword in a number of imperative languages. ‘map’ however is just a function on lists, not built in syntax, and has the type:

    map :: (a -> b) -> [a] -> [b]

That is, it takes some function, and a list, and applies that function to each element of the list, returning a new list as a result. Since loops are so common in programming, we’ll be using ‘map’ a lot. Just for reference, ‘map’ is implemented as:

    map _ []     = []
    map f (x:xs) = f x : map f xs

File IO

Operating on stdin/stdout is good for scripts (and this is how tools like sed or perl -p work), but for ‘real’ programs we’ll at least need to do some file IO. The basic operations of files are:

    readFile  :: FilePath -> IO String
    writeFile :: FilePath -> String -> IO ()

‘readFile’ takes a file name as an argument, does some IO, and returns the file’s contents as a string. ‘writeFile’ takes a file name, a string, and does some IO (writing that string to the file), before returning the void (or unit) value, ().

We could implement a ‘cp’ program on files, as:

    import System.Environment

    main = do
        [f,g] <- getArgs
        s     <- readFile f
        writeFile g s

Running this program:

    $ ghc -O A.hs

    $ ./a.out A.hs Z.hs

    $ cat Z.hs
    import System.Environment

    main = do
        [f,g] <- getArgs
        s     <- readFile f
        writeFile g s

Since we’re doing IO (the type of readFile and writeFile enforce this), the code runs inside a do-block, using the IO monad. “Using the IO monad” just means that we wish to use an imperative, sequential order of evaluation. (As an aside, a wide range of other monads exist, for programming different program evaluation strategies, such as Prolog-style backtracking, or continutation-based evaluation. All of imperative programming is just one subset of possible evaluation strategies you can use in Haskell, via monads).

In do-notation, whenever you wish to run an action, for its side effect, and save the result to a variable, you would write:

    v <- action

For example, to run the ‘readFile’ action, which has the side effect of reading a file from disk, we say:

    s <- readFile f

Finally, we can use the ‘appendFile’ function to append to an existing file.

File Handles

The most generic interface to files is provided via Handles. Sometimes we need to keep a file open, for multiple reads or writes. To do this we use Handles, an abstraction much like the underlying system’s file handles.

To open up a file, and get its Handle, we use:

    openFile :: FilePath -> IOMode -> IO Handle

So to open a file for reading only, in GHCi:

    Prelude System.IO> h <- openFile "A.hs" ReadMode
    {handle: A.hs}

Which returns a Handle onto the file “A.hs”. We can read a line from this handle:

    Prelude System.IO> hGetLine h
    "main = do"

To close a Handle, and flush the buffer:

    hClose :: Handle -> IO ()

Once a Handle is closed, we can no longer read from it:

    Prelude System.IO> hClose h
    Prelude System.IO> hGetLine h
    *** Exception: A.hs: hGetLine: illegal operation (handle is closed)

We can also flush explicitly with:

    hFlush :: Handle -> IO ()

Other useful operations for reading from Handles:

    hGetChar     :: Handle -> IO Char
    hGetLine     :: Handle -> IO [Char]
    hGetContents :: Handle -> IO [Char]

To write to Handles:

    hPutChar    :: Handle -> Char -> IO ()
    hPutStr     :: Handle -> [Char] -> IO ()
    hPutStrLn   :: Handle -> [Char] -> IO ()
    hPrint      :: Show a => Handle -> a -> IO ()

Some other useful actions:

    hSeek     :: Handle -> SeekMode -> Integer -> IO ()
    hTell     :: Handle -> IO Integer
    hFileSize :: Handle -> IO Integer
    hIsEOF    :: Handle -> IO Bool

An example: spell checking

Here is a small example of combining the Data.Set and List data structures from yesterday’s tutorial, with more IO operations. We’ll implement a little spell checker, building the dictionary in a Set data type. First, some libraries to import:

    import System.Environment
    import Control.Monad
    import Data.Set

And the complete program:

    main = do
        [s] <- getArgs
        f   <- readFile "/usr/share/dict/words"
        g   <- readFile s
        let dict = fromList (lines f)
        mapM_ (spell dict) (words g)

    spell d w = when (w `notMember` d) (putStrLn w)

Running this program, on its own source, and it reports the following words are not found in the dictionary:

    $ ghc -O Spell.hs -o spell

    $ ./spell A.hs
    Data.Char
    =
    <-
    (map
    toUpper
    n)
    =
    <-
    getLine
    1

Writing the results out

If we wanted to write the results out to a temporary file, we can do so. Let’s import a couple of other modules:

    import Data.Set
    import Data.Maybe
    import Text.Printf
    import System.IO
    import System.Environment
    import System.Posix.Temp

Refactoring the main code to separate out the reading and writing phases in to their own function, we end up with the core code:

    main = do
        (f, g) <- readFiles
        let dict = fromList (lines f)
            errs = mapMaybe (spell dict) (words g)
        write errs

    spell d w | w `notMember` d = Just w
              | otherwise       = Nothing

Where reading is now its own function:

    readFiles = do
        [s] <- getArgs
        f   <- readFile "/usr/share/dict/words"
        g   <- readFile s
        return (f,g)

And writing errors out to their own file:

    write errs = do
        (t,h) <- mkstemp "/tmp/spell.XXXXXX"
        mapM_ (hPutStrLn h) errs
        hClose h
        printf "%d spelling errors written to '%s'n" (length errs) t

Pretty simple! Running this program:

    $ ghc --make -O Spell.hs -o myspell
    [1 of 1] Compiling Main             ( Spell.hs, Spell.o )
    Linking myspell ...

    $ ./myspell Spell.hs
    67 spelling errors written to '/tmp/spell.ia8256'

Extension: using SMP parallelism

Finally, just for some bonus fun … and hold on to your hat ’cause I’m going to go fast … we’ll add some parallelism to the mix.

Haskell was designed from the start to support easy parallelisation, and since GHC 6.6, multithreaded code will run transparently on multicore systems using as many cores as you specify. Let’s look at how we’d parallelise our little program to exploit multiple cores. We’ll use an explicit threading model, via Control.Concurrent. You can also make your code implicitly parallel, using Control.Parallel.Strategies, but we’ll leave that for another tutorial.

Here’s the source, for you to ponder. First some imports:

    import Data.Set hiding (map)
    import Data.Maybe
    import Data.Char
    import Text.Printf
    import System.IO
    import System.Environment
    import Control.Concurrent
    import Control.Monad

The entry point, modified to break the word list into chunks, and then dispatching a chunk to each thread:

    main = do
        (f, g, n) <- readFiles
        let dict = fromList (lines f)
            work = chunk n (words g)
        run n dict work

The ‘run’ function sets up a channel between the main thread and all children thread (‘errs’), and prints spelling errors as they arrive on the channel from the children. It then forks off ‘n’ children threads on each piece of the work list:

    run n dict work = do
        chan <- newChan
        errs <- getChanContents chan    -- errors returned back to main thread
        mapM_ (forkIO . thread chan dict) (zip [1..n] work)
        wait n errs 0

The main thread then just waits on all the threads to finish, printing any spelling errors they pass up:

    wait n xs i = when (i < n) $ case xs of
        Nothing : ys -> wait n ys $! i+1
        Just s  : ys -> putStrLn s >> wait n ys i

Each thread spell checks its own piece of the work list. If it finds a spelling error, it passes the offending word back over the channel to the main thread.

    thread chan dict (me, xs) = do
        mapM_ spellit xs
        writeChan chan Nothing

     where
        spellit w = do
            when (spell dict w) $
                writeChan chan . Just $ printf "Thread %d: %-25s" (me::Int) w

The ‘spell’ function is simplified a little:

    spell d w = w `notMember` d

which we could also write as:

    spell = flip notMember

We modify the readFiles phase to take an additional numeric command line argument, specifying the number of threads to run:

    readFiles = do
        [s,n] <- getArgs
        f     <- readFile "/usr/share/dict/words"
        g     <- readFile s
        return (f,g, read n)

We compile this with the GHC SMP parallel runtime system:

    $ ghc -O --make -threaded Spell.hs -o spell

Now, we can run ‘n’ worker threads (lightweight Haskell threads), mapped onto ‘m’ OS threads. Since I’m using a 4 core linux server, we’ll play around with 4 OS threads. First, running everything in a single thread:

    $ time ./spell test.txt 1 +RTS -N1
    ...
    Thread 1: week:                    
    Thread 1: IO!
    ./spell test.txt 1 +RTS -N1 99% cpu 2.533 total

Ok, now we change the command line flag to run it with 4 OS threads, to try to utilise all 4 cores:

    $ time ./spell 4 +RTS -N4
    ...
    Thread 2: week:                    
    Thread 3: IO!
    ./spell test.txt 4 +RTS -N4 111% cpu 2.335 total

Ok. Good… A little bit faster, uses a little bit more cpu. It turns out however the program is bound currently by the time spent in the main thread building the initial dictionary. Actual searching time is only some 10% of the running time. Nonetheless, it was fairly painless to break up the initial simple program into a parallel version.

If the program running time was extended (as the case for a server), the parallelism would be a win. Additionally, should we buy more cores for the server, all we need to is change the +RTS -N argument to the program, to start utilising these extra cores.

Next week

In upcoming tutorials we’ll look more into implicitly parallel programs, and the use of the new high performance ByteString data type for string processing.