# evan_tech

11:34 pm, 3 Jun 03

### Judy trick

Via via , from the source of Judy:

```// Test if Word has any of the 4 bytes == ‘\0’:
//
// This arcane code takes advantage of the CPU having a fast barrel shifter and
// a carry bit. Doug stole this code from elsewhere, and we know it works, but
// we really can’t explain why.

#define NOZEROBYTE(Word) \
((((((Word) - 0x01010101) | (Word)) ^ (Word)) & 0x80808080) ? 0 : 1)```

It took me a while to follow, but it’s not that bad after I understood it. Consider it a series of steps:

1. Subtracting 0x01010101 subtracts 1 from each byte, which forces a carry into any byte that is 0 already.
2. (A | B) ^ B, as you can prove to yourself, zeros all bits in A that are set in B.
3. A & 0x80808080 tests the high bit in each byte.
So basically, you force there to be a carry into any zeroed bytes, then clear the high bits in any bytes that had the high bit set in the first place, and then finally you test if there are any high bits left over. If there are, there was a zeroed byte.

•  You can take it a step further if you want, by locating the actual zero byte. From "Hacker's Delight" by Harry S. Warren Jr.:`/* Original byte: 00 80 other */y = (x & 0x7f7f7f7f) + 0x7f7f7f7f /* 7f 7f 1xxxxxxx */y = ~(y | x | 0x7f7f7f7f); /* 80 00 00000000 *//* These steps map: */if (y == 0) return 4; /* 00000000 ==> 4, */else if (y > 0x0000ffff) return (y >> 31) ^ 1; /* 80xxxxxx ==> 0 */ /* 0080xxxx ==> 1 */else return (y >> 15) ^ 3; /* 00000080 ==> 3 */`Alternatively, if you have an architecture that can locate the first bit set in a word (e.g. the BSF operation on x86), you can perform this operation without any branches.I highly recommend that book. You should get a copy if you don't already have one. reply to this
•  I'm awfully tired, so maybe I'm just being a cracker.. That comment really doesn't make sense to me though. I don't see any shifts in that code at all. The CPU has a fast full adder.It's worth pointing out that the pentium4 doesn't have a barrel shifter now. They have shifts for several common sizes and implement the rest as groups of those.Although, now that I've said that, my measurements show that every 32bit shift costs 8 cycles on my pentium4.Neat trick anyhow. reply to this