02:40 pm, 11 Mar 08
deflate compression in googlebot
Random bit of Google trivia: Googlebot recently added support for "deflate" compression, despite no real demand for it. Why? Go ahead and guess; the answer follows.
This allowed changing the Accept-Encoding header to gzip,deflate, which makes it a better match for the Accept-Encoding sent by web browsers. That in turn allows proxies like Squid to share cache entries between requests from browsers and requests from Googlebot. Here's an interesting thread on the way the Accept-Encoding and Vary headers interplay to ruin things for proxies. (Here's a page I skimmed that goes into a bunch more detail.) You could argue the blame lies with Squid, which apparently(?) treats the Accept-Encoding header value as an opaque string rather than a list of encodings. On the other hand, doing something smarter depends on Squid magically knowing which Accept-Encoding values would cause a server to choose which Content-Encoding.

[N.B.: I'm not an expert on any of this stuff, so feel free to correct me. Thankfully, I was also not involved in the implementation details, so I can excuse myself from blame!]
This was all prompted by an observation by one of the wikipedia folks, who emailed me about it. However, I asked him if this Google-side change helped and he never replied. :~(
This is an interesting case. Intuitively (i.e. without reading the HTTP spec) I'd expect that a caching proxy should decode the message body and cache that, and then apply the encoding itself on subsequent requests based on the client's headers.
I guess this fundamentally depends on how HTTP treats Content-Encoding. If it is considered to create a new representation of the resource -- with its own ETag and everything, like varying values of Content-Type -- then Squid wouldn't be allowed to do the above. However, if it's just considered a mechanical transformation that the browser will decode transparently, then it'd be acceptable.
I vaguely remember this being the distinction between Content-Encoding and Transfer-Encoding, though when I think it through, I guess it could be dangerous to do this: proxies without such support could end up caching the encoded version and serving it to clients that can't decode it.
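For what it's worth, the decode-and-re-encode idea can be sketched with Python's standard library. This is only an illustration of the mechanics, not anything Squid actually does: HTTP's "gzip" and "deflate" encodings are both zlib compression under different wrappers, so a cache holding the decoded body could in principle serve either one on demand.

```python
import gzip
import zlib

body = b"<html>hello</html>"

# Content-Encoding: gzip uses the gzip container; "deflate" is, per the
# HTTP spec, a zlib stream (though some servers send raw deflate instead,
# which is its own source of interop bugs).
gzipped = gzip.compress(body)
deflated = zlib.compress(body)

# A proxy that cached the decoded bytes could reproduce either encoding:
assert gzip.decompress(gzipped) == body
assert zlib.decompress(deflated) == body
```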
We had just rolled out a workaround in Squid, so we didn't feel the effect of the change: we already normalized all Accept-Encoding headers to two forms (compressed/uncompressed), so that various stray headers wouldn't break our caching.
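The normalization this comment describes might look something like the following sketch. It's hypothetical (not Wikipedia's actual Squid configuration): every incoming Accept-Encoding header collapses to one of two canonical forms, so the cache holds at most two variants per URL no matter what stray header strings clients send.

```python
# Hypothetical sketch of Accept-Encoding normalization: collapse every
# header to one of two canonical forms ("gzip" or "", i.e. uncompressed)
# before it reaches the cache key.

def normalize_accept_encoding(header):
    if header:
        # Parse as a comma-separated list, dropping any ";q=..." params.
        tokens = [t.split(";")[0].strip().lower() for t in header.split(",")]
        if "gzip" in tokens:
            return "gzip"
    return ""  # everything else is treated as "uncompressed"

# All of these now map to the same two cache variants:
assert normalize_accept_encoding("gzip,deflate") == "gzip"
assert normalize_accept_encoding("gzip") == "gzip"
assert normalize_accept_encoding("x-gzip, gzip;q=1.0") == "gzip"
assert normalize_accept_encoding("identity") == ""
assert normalize_accept_encoding(None) == ""
```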
Thanks for getting this fixed though! You guys are brilliant :)