Tags: haskell

  • evan

your vcs sucks

I've been hacking on some Haskell stuff lately that's all managed in darcs, and it's reminded me of an observation I made over two years ago now (see last paragraph), which is roughly: git is used widely enough that to get pretty much anything done with free software today you'll end up using git somewhere. So if your project doesn't use git, that means I get to learn git and whatever it is your project uses. (This issue comes up a lot for me, so much so that a story about it is one of the few non-technical things I host on my website.)

I used to use darcs a lot. I get it: it's different, it has some cool mathematical models and UI ideas, and git's UI is terrible. But I maintain it's not different enough from git to really matter.

And now I have to dig through manpages whenever I want to figure out trivial stuff like "did I forget to add any files in this commit?" (that's darcs whatsnew --look-for-adds) or "what files are in the repo?" (I recall using some sort of "inventory" command but after a lot of poking around it appears to be darcs show files -- tricky, putting it in a subcommand).

I'm especially looking at you, bzr users: you have no excuse. At least darcs is trying a different approach to how patches are managed! As far as I can tell bzr is "like the other ones, but slower".
  • evan

ghc llvm

I read this thesis on an LLVM backend for GHC, primarily because I was curious to learn more about GHC internals. The thesis serves well as an overview of the pieces. As for the actual change, it seems to fall somewhere between compiling via C and generating machine code directly, with the sorts of tradeoffs you'd expect (generating machine code directly = more control, more performance, more work).

GHC is designed around the spineless tagless G-machine (a paper I've never gotten around to reading), which is conceptually like a VM that GHC targets and then generates code from. It has a number of values ("virtual registers") it wants to keep track of while running, like the current stack pointer and current thread object (see table 2.1 in the paper). Because they're used so frequently, GHC pins these through to machine registers. I was surprised to read that on x86 this leaves only one register free for doing computations!

That was mentioned in a section describing one place where the LLVM backend outperforms GHC's existing code generators: because LLVM doesn't support this register pinning directly, the code generator instead arranges for the relevant values to be in the appropriate registers frequently (at the entrance to each function, in fact), on the assumption that the LLVM optimizer will leave the values alone, since each function call would otherwise need to restore them to the right registers. But in fact it sometimes does make sense to spill those values to the stack, and those are exactly some of the cases where the LLVM implementation comes out ahead.

(Update: found an old thread where SPJ briefly discusses LLVM.)
  • evan

found my bug!

Not too interesting, but this has been bugging me for a week.

Been working on a toy program that proxies a TCP connection. It was working fine for some connections and then would mysteriously corrupt others; I spent a long time looking before it finally hit me, painfully obvious.

I had threads going in each way doing this in a loop:
 forward :: Socket -> Socket -> IO ByteString
 forward src dst = do
   buf <- recv src (16 * 2^10)
   send dst buf
   return buf
Network.Socket.ByteString.send of course returns the number of bytes actually sent; you need to loop if you want to be sure to send it all. (I had imagined this higher-level API was handling the looping for me, but it makes more sense that it wouldn't. It does, however, handle EINTR, so when I glanced at the source in my desperation I did notice a loop, which threw me off the scent.) Sticking a loop around the send fixed everything.
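Here's a minimal sketch of what the fixed version looks like, assuming the same network-package API as above (the helper name sendLoop is mine; Network.Socket.ByteString also provides a sendAll that does this job for you):
 import qualified Data.ByteString as B
 import Network.Socket (Socket)
 import Network.Socket.ByteString (recv, send)

 -- Keep calling send until the whole buffer has been written.
 sendLoop :: Socket -> B.ByteString -> IO ()
 sendLoop dst buf
   | B.null buf = return ()
   | otherwise  = do
       sent <- send dst buf
       sendLoop dst (B.drop sent buf)

 forward :: Socket -> Socket -> IO B.ByteString
 forward src dst = do
   buf <- recv src (16 * 2^10)
   sendLoop dst buf
   return buf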
  • evan

using control.applicative

A post in three sections, each a step of evolution beyond the last. Contrast with The Evolution of a Haskell Programmer.

Here's a situation that comes up decently frequently. Say you have some monadic operation that gives you back a basic type. Actually, let me make it concrete so it's easier to follow. Say you have a parsec parser that parses an integer:

p_int :: Parser Int

Elsewhere, deep in your code, you have some data type that you'd like to stuff the result of that parse into. E.g.:

data Expr = EInt Int | EString String  -- etc
p_expr :: Parser Expr

So the question is: how do you use p_int within p_expr, converting the output? Let's start with the obvious way:

p_expr = do i <- p_int       -- parse an int
            return (EInt i)  -- wrap it in EInt

But it's so verbose! You can shorten it by going point-free with the monad operators:

p_expr = p_int >>= return . EInt

But it always feels like golf to me when I see that in code; you may as well just write the do-expression on one line, at a cost of only seven more characters:

p_expr = do x <- p_int; return (EInt x)

On the other hand, it does feel redundant to give the integer a name just to throw it away. The more experienced Haskell hacker knows about Control.Monad, which provides some operators to work with monads. Notably, there's liftM (parens added for clarity):

liftM :: (Monad m) => (a -> b) -> (m a -> m b)

That "lifts" a plain function to a function that works on monad values, so our running example can become:

p_expr = liftM EInt p_int

Can we do better? From reading other code I picked up that the Prelude's fmap, which is map over the Functor type class, is the same as liftM for a monad. It feels a bit golfy but not so bad: conceptually you're mapping EInt over the result of the parse (but only if it's successful):

p_expr = fmap EInt p_int

But recently I gained a newfound understanding of Control.Applicative and found something better. From its docs, Applicative is "a structure intermediate between a functor and a monad: it provides pure expressions and sequencing, but no binding."

To keep it straight, here's a table with increasing abstraction as you go downwards. (Note the types of the columns aren't exactly the same, but maybe it illustrates the point.)

Typeclass     bring value in   primary operation   helpers
Monad         return           >>=                 liftM, ap
Applicative   pure             <*>                 <$>
Functor                        fmap

Conceptually, Applicative lets you bring values in and combine values, but doesn't let you get back at the values. It suits the parser example just fine:

p_expr = pure EInt <*> p_int

That is, pure brings the whole function into Applicative and then <*> is ordinary Haskell function application (normally just written with whitespace or $), but within Applicative. But there's a helper operator that's just for this sort of thing:

(<$>) :: Functor f => (a -> b) -> f a -> f b

You'll note it's the same signature as liftM, but in usage it's quite clear:

p_expr = EInt <$> p_int

The name is analogous to the $ operator, but the angle brackets indicate it works within Applicative. You can read that as "apply EInt to p_int, but within Applicative". Succinct and clear.

The nice thing about Applicative (and sort of the point of it) as compared to monads is that you don't have this continual diving in and out of the monad; you just lift everything in and then combine. Contrast these two binary expressions:

add x y = x + y
normal_expression = add int_1 int_2
applicative_expr  = add <$> p_int <*> p_int

<*> (equivalently ap for monads) is there to let you do further application on the right, all within Applicative.
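
For contrast, here's a sketch of the same two-argument parse written monadically (the name monadic_expr is just mine): each argument has to be bound out of the monad and named before add can be applied.

monadic_expr :: Parser Int
monadic_expr = do x <- p_int
                  y <- p_int
                  return (add x y)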

Finally, note that all of this isn't specific to this parser problem; you can use these functions just as well with other monads (like IO) and also with other functors:

(+1) <$> [1, 2, 3]
-- equivalent to map (+1) [1, 2, 3]
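
The same works in IO; here's a tiny sketch (lineLength is just a name I made up) that grabs a line of input and takes its length without ever naming the intermediate string:

lineLength :: IO Int
lineLength = length <$> getLine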

See also Perl 6 meta-operators, which is this sort of thing done at the syntax level.

  • evan

zippers

Zippers: there are plenty of docs, so here's some random notes instead. (The original paper is in ML and quite readable.) Zippers are a fine example of the sort of progressive generalization that people play with in Haskell; my favorite blog at the moment is filled with interesting math along those lines.

I think the first zipper I encountered, before I knew the term, was in reading a Brainfuck implementation written in O'Caml. Brainfuck's model is basically a Turing machine, where you have a focal point on a tape and you can move around on the tape and do operations based on the focal point.

The question is: how do you represent this tape data structure with immutable data structures? If you use a list and an offset, examining the focal point is O(n) (to walk down the list to the offset) and modifications are O(n) (to rebuild the list). The solution, which seemed novel enough to me at the time that its obviousness now doesn't dissuade me from describing it for you, is to represent the tape as three parts: cells to the left of you, your current cell, and cells to the right. The trick is that on both sides, the nearest cell is at the front of the list; in some sense the left cell list is backwards.

Now stepping by one cell is O(1): you just pop the top of the relevant side list off and stuff your previous focal point as the new top on the other side. What I didn't fully appreciate at the time is that this is also gentle with memory. You're only popping/adding from the front of a list, so the remainder of the cell lists are constant and shared. Each step is destructuring one cons cell and creating one new one.
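
Here's a minimal sketch of that tape as a zipper in Haskell (the names are mine, not from any particular Brainfuck implementation); the tape is kept infinite in both directions so the pattern matches below always succeed:

data Tape = Tape [Int] Int [Int]  -- cells to the left (nearest first), focus, cells to the right

blank :: Tape
blank = Tape (repeat 0) 0 (repeat 0)

left, right :: Tape -> Tape
left  (Tape (l:ls) c rs) = Tape ls l (c:rs)  -- O(1): pop one side, push the old focus on the other
right (Tape ls c (r:rs)) = Tape (c:ls) r rs

modify :: (Int -> Int) -> Tape -> Tape
modify f (Tape ls c rs) = Tape ls (f c) rs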

So whatever, you might say. You have to jump through hoops to get functional data structures and it's a waste of time. Here's where it hopefully gets interesting, or at least more complicated.

This idea, called a zipper, can be generalized: your cursor is a pair of the current point along with context representing the remainder of the data structure arranged such that nearby movements are small steps. (In the Brainfuck example the context is the cell lists.)

Most examples discuss movement on a tree. Your current point is an unmodified subtree, and your context is a pointer to the subtree's parent along with the rest of the parent's children (the children other than your current point). This otherwise kind of useless Wikibook article has a nice picture about halfway down visualizing it, where your pointer kinda "grabs" the tree and "lifts it up" so it hangs around that point, which is a nice structural metaphor. In terms of memory, it's sorta like the list zipper in that most of the tree is left alone and can be shared during movement; all the zipper adds is a reversed path back to the root.

What struck me recently is that with a tree zipper you've stepped around the problem of parent pointers in a tree while still allowing movement in any direction. The zipper here is effectively just building a stack of parent pointers as you walk down the tree and popping the stack to walk up, which is the sort of trick you've likely used (or at least could have invented) in an imperative programming context to conserve memory while allowing arbitrary tree navigation.
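
A sketch of the same idea on a binary tree (again, the names are mine): the context is the stack of crumbs you collect walking down, each holding the parent's value and the sibling subtree you didn't descend into, so walking back up just pops the stack and reassembles the node.

data Tree a = Leaf | Node (Tree a) a (Tree a)

data Crumb a = WentLeft a (Tree a)   -- parent's value and its right subtree
             | WentRight a (Tree a)  -- parent's value and its left subtree

type Zipper a = (Tree a, [Crumb a])

downLeft, downRight, up :: Zipper a -> Maybe (Zipper a)
downLeft  (Node l x r, cs) = Just (l, WentLeft x r : cs)
downLeft  _                = Nothing
downRight (Node l x r, cs) = Just (r, WentRight x l : cs)
downRight _                = Nothing
up (t, WentLeft x r : cs)  = Just (Node t x r, cs)
up (t, WentRight x l : cs) = Just (Node l x t, cs)
up (_, [])                 = Nothing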

This is again all nice enough -- two related concepts that do similar tricks with data structures -- except it can be generalized even further. A zipper can be defined for any data structure systematically by computing the derivative (in the calculus sense, kinda) of its data type! What's the use? Even the guy who discovered it isn't sure: "While this connection seems unlikely to be a mere coincidence, it is perhaps surprising to find a use for the laws of the infinitesimal calculus in such a discrete setting." More research is warranted, I guess.

  • evan

gat update

I've still been playing with gat occasionally; getting sick recently (I blame snej!) probably helped. The needless goto I mentioned in my last entry was, of course, found and cleaned up by the multitude of people chiseling away at git.

I added a command like "git log" to display the history of a project, and that finally had enough work to do for me to look into performance. GHC has a profiler but for some reason I had expected it to not be helpful, which is not at all the case -- it did pretty much exactly what I wanted! I guess I had thought it'd be hard to properly attribute blame when everything's computed lazily, but on the other hand I can imagine how it might work. My initial naive code (e.g. repeatedly concatenating linked lists) was somewhere around 10x as slow as git; now it's about 1.4x, which is quick enough that I don't care. (Git is of course doing a lot more, so it remains to be seen whether that will hold.)
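
As a hypothetical illustration of the kind of thing the profiler points at (this isn't gat's actual code): left-nested (++) re-walks the accumulated prefix on every append, so building a result that way is quadratic, while associating to the right touches each element only once.

slowConcat :: [[a]] -> [a]
slowConcat = foldl (++) []   -- ((a ++ b) ++ c) ++ ... : quadratic

fastConcat :: [[a]] -> [a]
fastConcat = foldr (++) []   -- a ++ (b ++ (c ++ ...)) : linear, same as concat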

I also got the pager working, so gat diff HEAD^ now shows a paged, colored diff.

For those following at home, the performance improvements were pretty standard: stuff like dropping a high-level library or two, using better algorithms, and caching. Brad commented that it's sad that for performance I have to stop using my nice libraries, but I think I'm just dropping a level or two of abstraction and not the entire benefit of using a high-level language. You make separate modules that are covered by tests and then you're free to make the internals as gnarly as you want, and even my gnarlified parser is still using stuff like foldl.



I've commented many times about how the culture of the people using a language drives the direction it heads in, for better or for worse. I find it interesting that for Haskell the two main drivers seem to be performance and research into ever-fancier abstractions, as those seem at odds. What you end up with is this incredible mathematical purity alongside operations talking about pointers; my binary search, above, juxtaposes being parameterized by two type classes (one of which was Monad) with searching within a block of mmap'ed memory.

I especially dig this because you typically think of programming languages and libraries as trading off productivity and performance, and it's nice to at least imagine there's some way to get both. For example, recent work on left fold enumerators promises to be a high-level abstraction around event-driven block-based IO, and it tickles me to see quotes like "The ordinary, recursive left fold is the fix point of the non-recursive one. On the other hand, the instantiation of the non-recursive left fold as a stream, as we shall see, effectively captures a continuation in a monadic action" in a discussion of a performant IO representation.
  • evan

de facto

I enjoyed reading this interview of Simon Peyton-Jones of Haskell fame. He comes across as so humble and good-natured, and uses "jolly" as an intensifier. This especially resonated:
One way that this [discussion about writing down a new shared standard for recent Haskell developments] has come about, is that the compiler I am responsible for (the GHC or Glasgow Haskell Compiler), has become the de facto standard. There are lots of people using that, so if you use GHC then your program will work.

I don’t think that’s a good thing in principle, however, for a language to be defined by an implementation. Haskell is based on whatever GHC accepts right now, but it [Haskell] should have an independent definition. So I would like to see Haskell Prime [their upcoming language spec] happen because I think it’s healthy to see an independent definition of the language rather than for it to be defined by a de facto standard of a particular compiler.
I've seen it over and over -- no matter how well-intentioned the stewards of a standard are, when you only have one implementation you end up getting tied to unintentional details.
  • evan

gat, a git clone in haskell

I've been pretty busy with work lately, so I may as well dump this on the internet before it gets too dusty. Though I think I understand Git decently enough now, I've been curious about the internals. So I sat down to start reading and documenting, for example, the on-disk pack file format. (Git includes some technical docs but they're pretty thin.) But then I started writing some Haskell code for fun and to verify my understanding was correct and before I knew it I had 5% of a Git clone written in Haskell. That is, not a "porcelain" (a frontend on top of their utilities) but a separate implementation that works with the on-disk format.

So, gat. It can read the index, loose objects, and a bit of pack files; it has an implementation of the delta decoder, so it can read delta-compressed objects out of the pack format; it understands how to resolve git-svn^^ by reading refs/remotes/git-svn, recursively parsing out its parent pointers, and decoding that; it can do Git-style color diffs between various trees and commits. And finally and most importantly: it does some mmap()ing but is not at all fast, it is incorrect in the pieces it has implemented, and it does no writing of any data structures whatsoever -- that is to say, nothing useful.

But it's been fun to play with, at least. Maybe I'll clean up those docs and post them too at some point.

A few notes from reading Git code:
  • Git is a jumble of random, nearly commentless code, full of globals, strange state, and far-from-clear control flow. On the other hand, it's also much more Unixy than the code I'm used to reading, doing all sorts of tricks like using mmap() instead of read() (because the latter just involves an extra copy, y'know?) and forking. I am simultaneously impressed by and terrified of what's likely going on in my kernel.
  • To align some disk-based data structure Git uses this code:
    (offsetof(struct cache_entry,name) + (len) + 8) & ~7
    That's an off-by-one error (gets eight bytes of padding where zero would do) that I'm a little puzzled by -- maybe it's intentional? Maybe it's remained for backwards compatibility?
  • This goto also makes no sense to me:
        ret = ent->data;
        if (ret && ent->p == p && ent->base_offset == base_offset)
          goto found_cache_entry;
        return unpack_entry(p, base_offset, type, base_size);
      found_cache_entry:
    (Those are the only two instances of found_cache_entry in the file...)
  • Git contains at least three incompatible bit-packed integer encodings, found in pack file offsets, object headers, and the delta format.
  • It's crazy to me that in 2008 people are still writing code that is manually passing around buffers and lengths and verifying by hand that the space works out.
  • less can take all sorts of useful flags, causing it to e.g. only page if necessary, allow color but not other control codes through, and more! When Git spawns less as a pager, it actually execs less in the parent side of the fork, making itself the child of the less process; dear Unix people, explain to me why you'd do that? (I have my thoughts, but I'm curious about yours.)
  • evan

the barbarians are at the gates

I enjoyed this introduction to the tenth issue of "The Monad.Reader":
The barbarians are at the gates. Hordes of Java programmers are being exposed to generics and delegates; hundreds of packages have been uploaded to Hackage; the Haskell IRC channel has nearly hit 500 users; and it’s only a matter of time before Microsoft seals that multi-billion dollar bid for Hayoo.

The time has come to retreat and climb higher into our ivory tower: we need to design a language that is so devious, so confusing, and so bizarre, it will take donkey’s years for mainstream languages to catch up. Agda, Coq, and Epigram are some approximation of what functional programming might become, but why stop there? I want strict data, lazy codata, quotient types, and a wackier underlying type theory.
  • evan

haskell syntax idea

I often end up with nested parens when I need binding on the right:
foo (bar (baz (etc x)))
That's OK in Lisp, where the parens become invisible, but parens are infrequent enough in Haskell that they're ugly. So Haskell provides an "apply" operator that just has different binding rules, and the above can be written:
foo $ bar $ baz $ etc x
but the dollar signs are huge! If it were a colon, it could be written like this:
foo: bar: baz: etc x
where the colon instinctively means: "I'm done supplying arguments to this function; everything to the right of here is its final argument."