September 5th, 2003

  • evan

automatic screen scraping

I've been thinking for a few days about screen-scraping weblogs. If you assume the text is generated by a computer, the problem boils down to this:

Suppose you have some text of the form AaBbCcAdBeCf, where each of those letters can correspond to an arbitrary string. You want to algorithmically notice that you have the pattern AxByCz repeated twice, and extract the data ((a,b,c), (d,e,f)). Humans have almost no problem doing this on web pages (though I think our understanding of HTML helps; doing it on arbitrary strings seems like it would be hard), but it's hard for me to describe the process algorithmically.
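To make the problem concrete, here's a sketch of what extraction looks like once the template is already known (the tag strings are hypothetical stand-ins for A, B, C). The hard part described above is discovering A, B, and C automatically; given them, extraction is just a regular expression:

```python
import re

# Hypothetical fixed template pieces A, B, C (not from the post itself).
A, B, C = "<h2>", "</h2><p>", "</p>"

# Computer-generated text of the form AaBbCcAdBeCf:
text = A + "title1" + B + "body1" + C + A + "title2" + B + "body2" + C

# With A, B, C known, extracting ((a,b,c), (d,e,f)) is easy:
pattern = re.escape(A) + "(.*?)" + re.escape(B) + "(.*?)" + re.escape(C)
records = re.findall(pattern, text)
# records == [("title1", "body1"), ("title2", "body2")]
```

The sketch is meant to show the target output of the whole exercise, not the discovery algorithm.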

One of the reasons I'm mentioning this here is that I'm hoping that someone knows of relevant research / algorithms. It seems like this problem has either already been solved, or already been determined to be unsolvable.

I'm thinking of two approaches: my first intuition is to search for repeated substrings, preferring longer ones, and collapse them to identifiers. (We're trying to find A, B, C above.) From there, though, it gets fuzzier: do you collapse all remaining strings into unique identifiers, then try finding repeated similar substrings? Murky at best.
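The first step of that intuition (find the longest repeated substring, the candidate for A, B, or C) can at least be sketched. This is a naive version, binary-searching on the length, since a repeated substring of length k implies one of every shorter length:

```python
def longest_repeated_substring(s):
    """Return the longest substring of s that occurs at least twice."""
    def repeated_of_length(k):
        # Slide a window of width k, remembering substrings we've seen.
        seen = set()
        for i in range(len(s) - k + 1):
            sub = s[i:i + k]
            if sub in seen:
                return sub
            seen.add(sub)
        return None

    # Binary search on the length of the repeat: the property
    # "a repeat of length k exists" is monotone in k.
    lo, hi, best = 1, len(s), ""
    while lo <= hi:
        mid = (lo + hi) // 2
        cand = repeated_of_length(mid)
        if cand:
            best, lo = cand, mid + 1
        else:
            hi = mid - 1
    return best

# e.g. longest_repeated_substring("xabcyabcz") == "abc"
```

After finding such a repeat you'd collapse its occurrences into a single identifier and recurse, which is where the murkiness above begins. (A suffix tree would do this step in linear time; the sketch just shows the idea.)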

The other attack that just occurred to me about 30 seconds ago (prompting this post) is to hope an algorithm like Sequitur would magically do it for me. (Sorry, that link is the best I could find: basically, it's an algorithm that produces a context-free grammar for an arbitrary string.) Maybe this approach would work better? I will investigate!
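Sequitur works by repeatedly replacing repeated digrams (adjacent symbol pairs) with grammar rules. This is a simplified offline sketch in that spirit, not the real algorithm (real Sequitur maintains its digram-uniqueness and rule-utility constraints incrementally as symbols arrive):

```python
from collections import Counter

def induce_grammar(s):
    """Greedy digram substitution: replace the most frequent adjacent
    pair with a fresh rule until no pair repeats. Returns the compressed
    sequence and the rules produced along the way."""
    seq = list(s)
    rules = {}
    next_id = 0
    while len(seq) > 1:
        pairs = Counter(zip(seq, seq[1:]))
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break
        name = f"R{next_id}"
        next_id += 1
        rules[name] = (a, b)
        # Replace non-overlapping occurrences, left to right.
        new_seq, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                new_seq.append(name)
                i += 2
            else:
                new_seq.append(seq[i])
                i += 1
        seq = new_seq
    return seq, rules

# e.g. induce_grammar("abab") -> (["R0", "R0"], {"R0": ("a", "b")})
```

If something like this magically surfaced the AxByCz structure as rules, the extraction problem above would reduce to reading the grammar off.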

Update: better sequitur link.