Previous Entry Share Next Entry
10:36 pm, 2 Jan 07

rethink the sync

Rethink the Sync -- one of two best papers at OSDI '06. This paper is neat. The basic problem they address is that synchronous I/O is slow but safe, while async I/O is orders of magnitude faster but can cause data loss if a disk crashes:
The tension between durability and performance leads to surprising behavior. For instance, on most desktop operating systems, even executing an explicit synchronization command such as fsync does not protect against data loss in the event of a power failure. This behavior is not a bug, but rather a conscious design decision to sacrifice durability for performance. For example, on fsync, the Linux 2.4 kernel commits data to the volatile hard drive cache rather than to the disk platter. If a power failure occurs, the data in the drive cache is lost. Because of this behavior, applications that require stronger durability guarantees, such as the MySQL database, recommend disabling the drive cache.
Their solution is to take a step back and re-examine the sync problem in terms of what you're actually trying to accomplish, and restate it as this: if an external observer thinks that a disk process has committed, then you must have it committed for the system to be safe (imagine a file server that buffers a write to disk then writes to the network a "file saved" packet, causing the user of the file server to delete their copy, but then crashing immediately before the disk write goes through). But (and this is the crux of their idea) within a world of applications without external observers, nobody's gotta know whether things have committed or not, as long as the proper commit ordering is preserved.

Consider a single application, which writes five times in a row and then does the network I/O: you can asynchronously buffer the writes and then block the network write until the disk writes complete. In fact, you can even buffer the network write, too, as long as all subsequent operations are also buffered and that they're all eventually executed in order. This is a bigger win if you can coalesce those writes into one disk hit, or if the file operations involve a temp file that is deleted before it even needs to be put out on disk. What's even cooler about this work as that they do this sort of dependency analysis (like "network output depends on previous five disk transactions to finish") across processes -- because really, if three processes all communicate between themselves and only later decide to write an "all done" on the screen, it's again the same (from a reliability standpoint) if you buffer it all as long as you don't announce you're done until you really are. One of their benchmarks that demonstrates this (which, to highlight a flaw in the paper, I think is pretty unrealistic from a real-world perspective) has multiple clients interacting with a local (as in on the same machine) MySQL server, where their file system is "up to three times faster than when [MySQL] runs on top of ext3 mounted asynchronously" (note it's outperforming even async here, but again that this benchmark is flawed from a real-world perspective).

Some observations not from the paper: one is that this doesn't really help "normal" client apps, because they typically inform the user with every I/O operation they do (at least by hiding the "file save" dialog). The other is that to me this has some (maybe-)deep connection to Haskell's notion of I/O, in that in Haskell what you express to the compiler is not a series of instructions to be executed, but instead a sort of graph of dependencies between operations -- in your code you write "all of this part can be executed whenever you need its output, but before you can use that variable you need to pause and wait for the operating system".

Also in a Haskell-ish sort of spirit, I believe their higher-level abstraction of I/O could break software that isn't "properly written", in the sense of having no race conditions: imagine a pair of processes, one which periodically wakes up and checks for a flag file to be on disk, and then the other that periodically writes the flag file and later deletes it. This pair of programs was incorrect to begin with -- even on a normal operating system perhaps the flag-checker program just always happens to be unluckily scheduled and isn't ever awake when the flag file is available -- but with the paper's file system it could potentially be the case that the flag file is never even committed to disk! The reason I again call this Haskell-ish is that Haskell plays a lot with these sorts of abstraction games: by following higher-level rules about your program -- for example, not getting any guarantees about when a given bit of code will run (another way of looking at this is ceding more control to the compiler) -- you allow it to make more interesting decisions about the execution of your program. I remember reading about a cross-language benchmark that had to be changed because the Haskell compiler noticed that the final value produced by some lengthy computation was never used, and so it never even computed it and executed the benchmark instantaneously.

This paper additionally alludes to another project they've worked on that actually performs speculative execution of processes: for example, when waiting for a slow RPC to complete, checkpoint the process and continue running with the assumption that the RPC succeeded, and then roll back if it turns out that the RPC failed. But the project in this paper compromises with a subset of that functionality, requiring no speculative execution (which has a hairy, invasive implementation, I'd bet) but still getting a win. Pretty neat.