evan_tech

Previous Entry Share Next Entry
12:07 pm, 22 Sep 03

more scraping

(previous entry on this)

Sequitur won’t work. The grammar it produces always has one and only one expansion for any non-terminal. It took me a while to realize this, but simply typing my “AaBbCcAxByCz” into their web form shows it.

I keep thinking about this, and I keep thinking I’m getting nowhere. I glanced through a few papers on “automatic document structure” but everything I found was more concerned with document structure from OCR’d papers, where they used visual layout information. I think what I’m looking at is actually an easier problem...

There seem to be two approaches:
  1. Use HTML document structure to help.
  2. Treat even HTML as just a body of text.

(1) clearly sounds easier, but I’m not sure if it is. The approach would be something like taking the DOM tree of the document and looking for similar branches, where “similar” is a pretty unclear term but could be fuzzed. Clearly, node names matching is a very good match (match div to div), attributes less so (class=subject to class=subject), and then text nodes (“subject:” to “subject:”) matching the least important.
But a simple glance at evan reveals it isn’t valid XML, so I can’t rely on DOM. Maybe an HTML-aware parser? (eg, if the <p> tag doesn’t have a matching close, assume it was a <p/>).

(2) is clearly much cooler, and goes back to that longest-repeated substring problem, which I am still trying to pin down as to what I mean by that. If you have nine matches on a string “abcd”, ten matches on simply “bc” is potentially more relevant (but obviously nine matches on any substring of abcd is probably not useful-- there’s still the possible case where every entry begins with, say, the text “Today I” and because it’s on every entry it gets interpreted into the document’s structure). But even once I have some sort of substring set, I’m back to the tree-matching (except with a flattened tree) problem in (1). A extra hint such as “there are 10 entries on this page you’re parsing” would be an incredible hint—any 10-times-reoccurring substring would be much more relevant—but ideally this could work without human interaction.

It seems that any compression algorithm (sequitur included) could be used to reduce repeated substrings to identifiers... bleh.

Going nowhere.