evan_tech

12:18 pm, 23 Feb 08

array-based index files

1. LogJam (a LiveJournal client with some offline support) at one point stored entry data in date-organized trees of XML files. This is pretty much the least convenient data format ever, yes, but at the time I was concerned with keeping the number of dependencies down. That concern spilled into the next problem: various operations worked in the space of entry numbers, so we needed an index somewhere. After a few false starts I realized we could store a flat file of time_t, 4 bytes per entry, where the nth slot in the array was the time for the nth entry. Simple¹ and tiny, with one disk seek to access. Even if I wrote ten entries a day for ten years, it'd only be about 150kb; a more realistic long-lived LJ like evan is still under 4k entries.

2. git-svn, the utility that bridges svn repositories into git branches, has a similar index for mapping svn revision numbers to git tree hashes. Theirs is one hash per line, with newlines for clarity. It seems they had a very similar struggle to mine (from the git-svn source):
# Tie::File seems to be prone to offset errors if revisions get sparse,
# it's not that fast, either. Tie::File is also not in Perl 5.6. So
# one of my favorite modules is out :< Next up would be one of the DBM
# modules, but I'm not sure which is most portable... So I'll just
# go with something that's plain-text, but still capable of
# being randomly accessed. So here's my ultra-simple fixed-width
# database. All records are 40 characters + "\n", so it's easy to seek
# to a revision: (41 * rev) is the byte offset.
# A record of 40 0s denotes an empty revision.
# And yes, it's still pretty fast (faster than Tie::File).

Both of these rely on the source space being dense. If you had an LJ itemid of 10e6 you'd get a 40mb index file. I relied on this density even though LJ doesn't strictly guarantee it. git-svn has the same shortcoming, especially when used with a tree managed with gvn, which tends to make the revision counter climb rapidly. A shared tree at work has 110k revisions and a 4mb revision database, even though my shallow git-svn clone of it only takes 4mb itself.

(PS: of course, both of these could easily be solved while still not pulling in a database by using a slightly more complicated file format, like a sorted list of entries. But it's not worth the effort; even 4mb is tiny, especially for data that's just a cache anyway.)
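For concreteness, here's a minimal sketch of that sorted-list alternative, assuming fixed-width packed (key, value) records sorted by key; all names are mine. Space becomes proportional to the entries that actually exist, and lookup is a binary search, O(log n) probes instead of one:

```python
import struct

REC = struct.Struct(">II")  # 4-byte key (itemid/revision), 4-byte value

def pack_index(pairs):
    """Build the file contents from (key, value) pairs, sorted by key."""
    return b"".join(REC.pack(k, v) for k, v in sorted(pairs))

def lookup(data, key):
    """Binary search over the fixed-width records."""
    lo, hi = 0, len(data) // REC.size
    while lo < hi:
        mid = (lo + hi) // 2
        k, v = REC.unpack_from(data, mid * REC.size)
        if k == key:
            return v
        if k < key:
            lo = mid + 1
        else:
            hi = mid
    return None
```

With this layout a lone itemid of 10e6 costs 8 bytes, not 40mb of zero padding.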

1 And with an endianness bug waiting to happen, which it of course did, but only after the code had been released and had been working fine, when someone managed to run LogJam with an NFS-mounted home directory shared between two systems of different endianness. Ugh.