evan_tech

Previous Entry Share Next Entry
10:05 pm, 13 May 08

type-safe printf

printf is a function with a complicated type. In C we used to just give up and tell the compiler "this function takes some other stuff that you shouldn't worry about" with the amusing "..." builtin. These days compilers have special support for annotating printf-like functions to provide type-checking. The other side of this is that an implementation of printf necessarily has a little tokenizer/parser for run-time processing of the format string, along with the associated performance penalty*.

Yet pretty much all programs that involve format strings ought to have the format strings known statically. Even a mini-language like printf turns out to have enough power to not be able to safely process untrusted input, as the "poke" instruction (named %n) demonstrated by creating a completely new class of security vulnerabilities. And without the compiler to help with type coercions, it's easy to write something invalid, especially when you're playing fast and loose with integer and pointer sizes across platforms.

Perl and Ruby neatly sidestep this problem by using string interpolation: at parse/compile time, the compiler scans the strings for bits to be expanded and just rewrites the "format" string to the equivalent concatenation of literal strings and variable values, which then uses the normal language's support for pasting strings together. (Prove it to yourself: perl -e 'use strict; print "$x";' aborts with a "compilation error".) But sometimes you really just want something like printf, and both those languages fall back on "figure it out at runtime" for that.


Supporting printf at all proves to be pretty difficult in more strict languages which generally require all types to be known. OCaml's compiler does some crazy hacks where sometimes a quoted string is interpreted as a format, a six-parameter type that, for example, needs its own concatenation operator. Haskell at least encodes it in the user-available language with some typeclass magic that gets you to more or less to feature parity with dynamic languages -- failure at runtime if the parameters don't properly line up.

But it turns out there's a nice paper that provides a type-safe encoding of printf that doesn't rely on any fancy language features. The paper is structured like this: (1) "wouldn't it be nice if printf worked like this?" (2) "oh wow check it out, here are the functions!" I've been staring at it for a week and though I can sorta see how it works, it's unclear to me how anyone would come up with it. Here's an overview from a person who lacks sufficient brain to say much smarter (me).


To start with, you don't use a string for the format string. This sorta seems like you've already given up, but you could imagine a macro expanding a format string into the proper expression here, much like how Perl/Ruby's interpolation works. Since this is functional code we're talking about, it ends up being an expression involving some functions. Format string concatenation ends up being encodable as composition, which means you end up with the same operator as Perl (.) for pasting them together.

The basic task then is that you need print (some magic here) to be able to give you a function of varying types, depending on what the magic is, so that print format1 3 "foo" can be type-checked that format expects an int and a string. So the type of printf must be (some magic type involving an a) -> a, where the polymorphic a is a function produced by the magic. And here's where the magic drops, painful in its simplicity:

lit :: String -> (String -> a) -> String -> a
lit text k s = k (s ++ text)

int :: (String -> a) -> String -> Int -> a
int k s val = k (s ++ show val)

printf :: ((String -> String) -> String -> a) -> a
printf fmt = fmt id ""


And that's it. lit is what converts a literal string into a format, while int is the placeholder for an int. So "my int is %d\n" would be expressed as lit "my int is " . int . lit "\n". If you drop this in to GHC you'll see that the type of printf (lit "my int is " . int . lit "\n") really is, as you'd expect, Int -> String -- it's waiting for you to give it an int so it can dump out the formatted string. The result of printf is just a plain function, so you do all the normal sorts of things you'd want, like partially apply it or pass it to map. The formatters are plain functions, too, so you can add your own formatter that, for example, can accept a list (as he does in the paper).


So how's it work?

Look at the two formatters, int and lit. The k parameter to the formatters is a continuation: it's where each formatter should pass the string constructed so far when the formatter is done. Then the s input parameter is the string as constructed so far. You can mentally expand printf (lit "foo" . int) as (lit "foo" (int (id))) "", where that empty string is your starting string and id is the innermost continuation which just gives you back the string that's been constructed.

You can also look at int like this, just with some added parens for clarity: (String -> a) -> (String -> (Int -> a)). It takes the continuation as a parameter, and the function it returns is sorta the same shape as the continuation but with Int -> a in place of a -- that's how it tacks on its need for an int to the greater formatting requirement.

But from there... uh, the types just work out. I don't know. It's pretty much magic.


[Edit: a commenter on reddit linked to this discussion of the general technique, which I haven't read yet but looks promising.]


* Though I'd argue you're in a pretty bad place if printf performance is your bottleneck.