Evan Martin (evan) wrote in evan_tech,
Evan Martin

automatic screen scraping

I've been thinking for a few days about screen-scraping weblogs. If you assume the text is generated by a computer, the problem boils down to this:

Suppose you have some text of the form AaBbCcAdBeCf, where each of those letters can correspond to an arbitrary string. You want to algorithmically notice that you have the pattern AxByCz repeated twice, and extract the data ((a,b,c), (d,e,f)). Humans have almost no problem doing this on web pages (but I think our understanding of HTML helps; doing it on arbitrary strings seems like it would be hard) but it's hard for me to describe the process algorithmically.

One of the reasons I'm mentioning this here is that I'm hoping that someone knows of relevant research / algorithms. It seems like this problem has either been already solved, or already determined to be unsolvable.

I'm thinking of two approaches: my first intuition is to search for repeated substrings, preferring longer ones, and collapsing them to identifiers. (We're trying to find A, B, C above.) From there, though, it gets fuzzier: do you collapse all remaining strings into unique identifiers, then try finding repeated similar substrings? Murky at best.

The other attack that just occurred to me about 30 seconds ago (prompting this post) is to hope an algorithm like Sequitur would magically do it for me. (Sorry, that link is the best I could find: basically, it's an algorithm that produces a context-free grammar for an arbitrary string.) Maybe this approach would work better? I will investigate!

Update: better sequitur link.
Tags: project

  • megaupload captcha

    Someone make a Javascript-based captcha cracker for megaupload. It's strange to see those captchas again because I idly myself wrote a…

  • zombie ghosd

    I was tickled to discover another IBM developerworks article on one of my abandoned hacks and that both it and its predecessor have been translated…

  • gat, a git clone in haskell

    I've been pretty busy with work lately, so I may as well dump this on the internet before it gets too dusty. Though I think I understand Git decently…

  • Post a new comment


    default userpic
    When you submit the form an invisible reCAPTCHA check will be performed.
    You must follow the Privacy Policy and Google Terms of use.