August 24th, 2003

  • evan

parse as you type, 2

One of the reasons I've been thinking about "parse while you type" is my experience with a simpler but quite similar problem: "spell check while you type".

The reduction here is that all parsing is local. The algorithm is simple: whenever you add or remove text, scan forwards and backwards to the ends of the affected words, feed them to the spell checker, and then mark up the text as necessary.
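A minimal sketch of that algorithm, assuming plain ASCII text and a toy word list standing in for a real spell checker. `KNOWN_WORDS`, `affected_span`, and `recheck` are made-up names for illustration, not any library's API:

```python
import re

# Hypothetical stand-in for a real spell checker's dictionary.
KNOWN_WORDS = {"the", "quick", "brown", "fox"}


def affected_span(text, start, end):
    """Widen the edited range [start, end) out to the ends of the words it touches."""
    while start > 0 and text[start - 1].isalpha():
        start -= 1
    while end < len(text) and text[end].isalpha():
        end += 1
    return start, end


def recheck(text, start, end):
    """Return (offset, word) for each unknown word touching the edit."""
    lo, hi = affected_span(text, start, end)
    return [
        (lo + m.start(), m.group())
        for m in re.finditer(r"[A-Za-z]+", text[lo:hi])
        if m.group().lower() not in KNOWN_WORDS
    ]


# The user just typed the "i" in "quikc" (offsets 6..7 of the buffer).
print(recheck("the quikc brown fox", 6, 7))
```

Only the words touching the edit are rechecked, so the cost per keystroke stays small no matter how long the document is. The returned offsets are what an editor would use to place the underline.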

But a good spell checker would (in theory: I don't know if any open source ones do) actually need context to properly spell check. For a language like Japanese or Thai, you don't have spaces to delimit words. (There are even difficulties with simpler languages.)
Worse, discovering homophone errors requires(?) syntactic parsing, which is sorta like compiling but about a million times harder. (There's a subset of linguistics related to "discourse analysis", meaning analysis across sentence boundaries, and the potential complexity of that terrifies me.)

I know Microsoft's Word does this to some extent because it underlines questionable structures with green. I can't, however, imagine how far they go or how they determined how far they could go.

(Now that I look at that Pango bug again, I really ought to fix it myself. That'd be a fun and worthwhile project, and Noah Levitt even provided a test case...
NO! Bad Evan! Finish your existing projects first.)
  • evan

remote loading

(Skip down to the stuff you didn’t already know.)
On the web, image data is almost always separate from page content. (There is a way to include images inline using a data: URI, but I think it’s only supported by Mozilla.) Images are retrieved in a separate HTTP request. You probably already know this, and don’t really even think about it.

But these sorts of technical decisions often have social implications: someone can take the URL to an image on my site and paste it into a webpage on their site. Now I’m paying for “their” (really, their viewers’) bandwidth. This is called “remote loading” in some circles, though to me the term feels a bit pseudo-technical.

Remote loading is a pretty big problem for hosting sites. People want to be able to put their images up on their site, but the hosting company doesn’t want Slashdot to put the image on its front page and spike the host’s bandwidth.

The solution? You check the Referer header (or is that Referrer? someone misspelled it in some spec and now I’m always confused about what to call it) and make sure the image loads are coming from your site. LiveJournal even has a FAQ on how to make Apache do it.
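The check itself is small. Here is a rough sketch in Python rather than Apache configuration; the host names and the `allow_image_request` function are hypothetical, and letting requests with no Referer through is an assumed policy choice (some browsers and proxies strip the header), not something the FAQ prescribes:

```python
from urllib.parse import urlsplit

# Hypothetical: the hosts whose pages are allowed to embed our images.
ALLOWED_HOSTS = {"example.com", "www.example.com"}


def allow_image_request(headers):
    """Decide whether to serve an image, based on the request's Referer header.

    `headers` is assumed to be a plain dict keyed by canonical header name.
    """
    referer = headers.get("Referer", "")
    if not referer:
        # Assumed policy: a missing Referer is allowed.
        return True
    # hostname is None for values that do not parse as a URL with a host.
    return urlsplit(referer).hostname in ALLOWED_HOSTS


print(allow_image_request({"Referer": "http://www.example.com/post/1"}))
print(allow_image_request({"Referer": "http://slashdot.org/"}))
```

Note that the header is entirely under the client's control, so this only stops ordinary browsers embedding the image from another page.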

Why do I bring this up?
I just read that AOL is now blocking LiveJournal specifically, and no other site, as a referer. It’s completely within their rights to do so, but it’s a bit of an underhanded move to make. Now their users put images up on other sites, see them work fine, then try them on LiveJournal, and come to LJ’s tech support asking “why won’t my images load?”

Brad (really a fountain of good ideas / neat hacks) had an idea a while back to get around this: make a Java applet that behaves just like an <img> tag. You could even add features to browsers, like showing a progress bar while the image loaded. But the important twist is you’d let the user specify the Referer header to send when retrieving the image. Now there is quite literally no way for AOL to block us (unless they want to add session cookies or try to track multiple hits, both of which they’re unlikely to do).
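The reason this would work is that the Referer is just a request header the client fills in. The idea above is a Java applet; as an illustration of the mechanism only, here is a sketch using Python's standard `urllib.request`, with made-up URLs and a hypothetical `build_image_request` helper. It builds the request without sending it:

```python
from urllib.request import Request


def build_image_request(image_url, referer):
    """Build (but do not send) a GET request carrying a caller-chosen Referer."""
    return Request(image_url, headers={"Referer": referer})


req = build_image_request(
    "http://images.example.com/cat.png",   # hypothetical image URL
    "http://images.example.com/gallery/",  # the Referer the user chose to send
)
print(req.get_method(), req.full_url)
print(req.get_header("Referer"))
```

A server that only inspects the Referer header cannot tell this request apart from one made by a page on its own site.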

This is unlikely to ever happen, as it’s even more devious than I think we’re willing to be.