evan_tech

Previous Entry Share Next Entry
I spent a few hours today reverse engineering the ".puz" crossword file format. It's used by a popular java applet used by, for example, the New York Times. Reverse engineering can be fun: it's a mixture of patience and creativity. You mix some knowledge on the bit level (byte order, file offsets) and on the higher level, looking for patterns within examples and then patterns across examples. It's also about putting yourself in the creator's position: did they encode strings as length+data or null-terminated? If they have shorts/longs, are they in network order or Windows order, how can you tell (usually more information in the lower bits), and why would they have chosen that? What information does the file need to contain, and how would you have gone about representing it?

I spent quite a bit of time on some mysterious patterns within the bytes in the headers of some files I had until I looked at a new one and saw that region containing random text. I now suspect it's either random memory garbage (they forgot to initialize that part of the struct to zero) or perhaps some sort of cut'n'paste buffer -- either way, a big waste of time. Oh well.

For future reference, then, here's what I have:
The file consists of: a 52-byte header, the key, dashes, and then a table of strings. (These are all defined below.)

The header consists of:
OffsetContents
212 bytes: "ACROSS&DOWN\0"
441 byte: puzzle width
451 byte: puzzle height
461 byte: # of clues

I don't know the other bytes in the header, but the random junk I mentioned is around byte 30.

The key is width*height bytes long, and is the answer to the puzzle, one byte per cell. A period is used for a black cell.

The dashes are width*height bytes long, and appear to mirror the key with dashes in place of the key letters. I'm not sure what this is for -- perhaps it allows some puzzles to come with some answers filled in already?

The string table is the rest of the file. It is just a sequence of strings, each terminated by nul. Strings appear to be ISO-8859-1 -- at least they use their © symbol. I've also seen lines terminated with Windows-style \r\n.

The first string is the title, the second the authors, and the third is the copyright. The next n (using the field from the header) are the clues for the puzzle. Finally, there is sometimes one remaining string, which is a comment. (I have one file where this contains more binary data -- haven't figured that out yet.)

The clues can be mapped to cells like this: first, numbers are assigned in the standard crossword way. Then, the first clue is 1 down (if it exists), the next is 1 across (if it exists), the next 2 down, etc.

I have an implementation of a crossword puz decoder here.