September 10th, 2007

  • evan

crash mainly

Work code liberally uses assertions, both for the normal use of invariants and also more broadly for just expected values. For example, if a command expects you to pass a --foo argument, it's common to just assert that the foo flag is not empty. Or if a program expects a data file to be in a certain directory, it's common to just assert that opening the file succeeds. This means that the common (almost the only) "invalid input" behavior for everything is to crash.

There are a few reasons this works out to be sane. One is that there's an auto-backtracing system, so every crash includes its call stack. Another is that while in most cases the failed condition itself is enough to make clear what went wrong (e.g. "assertion failed: !FLAG_foo.empty()"), the assertion macros also support providing a diagnostic string (e.g. "assertion failed: file != NULL (while opening /foo/bar/)"). A third is that the only consumers of this software are programmers, who will often find a call stack more useful than a diagnostic error message (to answer questions like "what code path made it even try to load that file?").

Now that I work in client software, the world is reversed. One of my first code reviews met the response: "Client software should never crash." Even in the craziest sorts of error conditions, it's better to try to limp along enough to at least close gracefully. And especially in the sorts of contexts where recovery may be possible, like a missing data file, you can at least inform the user that the software needs to be reinstalled.

So the assertion pattern now is:
if (!invariant) {
  NOTREACHED();  // debug-mode-only macro (int 3 or whatever)
  return NULL;  // indicate failure to caller
}
A more controversial situation is long-running server code. You'd think it's in the first category of software (as long as you don't output internal data to users), but it may be in the second.

Here's an example: suppose you have a server that can either show a help page or a search result, and that the contents of the help page are loaded from disk at startup and saved into some global. Now suppose some combination of control flow, perhaps even a user-exposed bug involving memory corruption, causes the help page pointer to go NULL at some point. This would normally be the sort of invariant your server ought to abort on. But if you're running n thousand such servers, a single popular query ("omg digg this error 500 help page result!") will take out your entire farm, killing the search functionality (which is maybe still limping along, despite the memory corruption -- say you're lucky, and the corruption is limited only to nulling out the help page pointer). It's good to learn about and fix the errors, but you don't want your service to be completely down while you try to figure out what went wrong.

That's the rationale, anyway, but it still feels a bit awkward to me. On the other hand, when you've got memory corruption bugs I don't think there's a non-awkward solution.

Further / not-quite-the-same-but-somehow-relevant reading: crash-only software.

(PS: the data-leak bug I linked to above is actually not an instance of the above policy.)