Evan Martin (evan) wrote in evan_tech,
Evan Martin

automatic screen scraping

I've been thinking for a few days about screen-scraping weblogs. If you assume the text is generated by a computer, the problem boils down to this:

Suppose you have some text of the form AaBbCcAdBeCf, where each of those letters can correspond to an arbitrary string. You want to algorithmically notice that you have the pattern AxByCz repeated twice, and extract the data ((a,b,c), (d,e,f)). Humans have almost no problem doing this on web pages (but I think our understanding of HTML helps; doing it on arbitrary strings seems like it would be hard) but it's hard for me to describe the process algorithmically.

One of the reasons I'm mentioning this here is that I'm hoping that someone knows of relevant research / algorithms. It seems like this problem has either been already solved, or already determined to be unsolvable.

I'm thinking of two approaches: my first intuition is to search for repeated substrings, preferring longer ones, and collapsing them to identifiers. (We're trying to find A, B, C above.) From there, though, it gets fuzzier: do you collapse all remaining strings into unique identifiers, then try finding repeated similar substrings? Murky at best.

The other attack that just occurred to me about 30 seconds ago (prompting this post) is to hope an algorithm like Sequitur would magically do it for me. (Sorry, that link is the best I could find: basically, it's an algorithm that produces a context-free grammar for an arbitrary string.) Maybe this approach would work better? I will investigate!

Update: better sequitur link.
Tags: project

  • blog moved

    As described elsewhere, I've quit LiveJournal. If you're interested in my continuing posts, you should look at one of these (each contains feed…

  • dremel

    They published a paper on Dremel, my favorite previously-unpublished tool from the Google toolchest. Greg Linden discusses it: "[...] it is capable…

  • treemaps

    I finally wrote up my recent adventures in treemapping, complete with nifty clickable visualizations.

  • Post a new comment


    default userpic
    When you submit the form an invisible reCAPTCHA check will be performed.
    You must follow the Privacy Policy and Google Terms of use.