Evan Martin (evan) wrote in evan_tech,

update on wikipedia

I've been putting about ten minutes a night into my wikipedia project, so there's not much to say. But my goal is to extract all the links, and I think I have a program doing that now. The problem is that the export is just the raw WikiText, so I have to pull each row (entry) from the database, extract all the links, and then map those textual names back to entry ids.
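A minimal sketch of the link-extraction step, written in Python rather than the O'Caml program described here. The names `LINK_RE` and `extract_links` are made up for illustration, and the pattern only handles the plain `[[Target]]` and `[[Target|label]]` forms; it makes no attempt at section anchors, namespaces, or nested markup.

```python
import re

# Matches [[Target]] and [[Target|label]]; captures only the target part.
# Hypothetical pattern for illustration, not taken from MediaWiki itself.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")


def extract_links(wikitext):
    """Return the raw link targets found in a chunk of WikiText, in order."""
    return [match.group(1) for match in LINK_RE.finditer(wikitext)]


if __name__ == "__main__":
    print(extract_links("See [[Alaska]] and [[United States|the US]]."))
```

The targets that come out are still textual names, so they need to be normalized and looked up before they can be stored as entry ids.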

It's complicated by the fact that the mapping from link text to entry title is a little ad hoc: there's whitespace stripping, mapping spaces to underscores, and capitalization of the first letter of the entry, but I'm not sure where I was supposed to discover that. I'm also a little concerned about the high-bit-set characters in the entry titles; I think they may be ISO-8859-1? I dug through the MediaWiki source for a bit but it didn't seem too principled.
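The three rules named above can be sketched as follows. This is Python, not the O'Caml program, and `normalize_title` and `decode_title` are hypothetical helpers; the real MediaWiki rules may well cover more cases than these, and the ISO-8859-1 decoding is only the guess made above, not something confirmed.

```python
def normalize_title(link_text):
    """Apply only the three rules mentioned in the post:
    strip surrounding whitespace, map spaces to underscores,
    and capitalize the first letter."""
    title = link_text.strip().replace(" ", "_")
    # Slicing keeps this safe for the empty string.
    return title[:1].upper() + title[1:]


def decode_title(raw_title):
    """Decode title bytes, ASSUMING they are ISO-8859-1 (unconfirmed)."""
    return raw_title.decode("iso-8859-1")


if __name__ == "__main__":
    print(normalize_title("  alaska purchase "))
    print(decode_title(b"Caf\xe9"))
```

Only the first character is upper-cased, so the rest of the title keeps whatever case the link used.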

But thanks to Brad, I'm using the HANDLER OPEN, etc. syntax for MySQL and my program is now iterating through all 600,000 rows of this database. It's taking a while, and the limiting factor seems to be CPU in my export program: profiles indicate it's spending much of its time in the GC, and a bunch more converting strings to integers and back from SQL. I don't know too much about optimizing O'Caml, nor do I have an intuition as to whether just writing it in straightforward Perl would do the job better.

By the way, there are only about 600,000 rows in this database. The Wikipedia stuff I've read trumpeted reaching a million entries but as far as I can tell that only means their primary key hit that point. After stripping out all of the namespaces (such as Talk), I'm left with 646,068 rows, and many of them are redirects: the text of the entry for "AlaskA" is #REDIRECT [[Alaska]].
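Since so many rows are redirects, it helps to spot them before counting links. A rough sketch in Python, assuming a redirect entry starts with `#REDIRECT` followed by a single link as in the "AlaskA" example; `REDIRECT_RE` and `redirect_target` are illustrative names, and matching the keyword case-insensitively is an assumption.

```python
import re

# Assumes the entry text begins with "#REDIRECT [[Target]]".
# Case-insensitive matching of the keyword is a guess, not a documented rule.
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*\[\[([^\[\]|]+)", re.IGNORECASE)


def redirect_target(wikitext):
    """Return the redirect target of an entry, or None if it is not a redirect."""
    match = REDIRECT_RE.match(wikitext)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(redirect_target("#REDIRECT [[Alaska]]"))
    print(redirect_target("Alaska is a state."))
```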

In any case, I'm currently around row 70,000 and I've extracted 1.4 million links. (I haven't yet decided what to do with links to pages that don't exist; right now I'm dropping them.)
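The drop-missing-links policy could look something like this in Python. It assumes the titles have already been normalized and that a `title_to_id` dictionary has been built from the database; both names, and `resolve_links` itself, are hypothetical. Counting the dropped links makes it easy to revisit the decision later.

```python
def resolve_links(titles, title_to_id):
    """Map normalized titles to entry ids, dropping titles with no entry.

    Returns the list of resolved ids and a count of dropped links.
    """
    ids = []
    dropped = 0
    for title in titles:
        entry_id = title_to_id.get(title)
        if entry_id is None:
            dropped += 1  # link to a page that doesn't exist
        else:
            ids.append(entry_id)
    return ids, dropped


if __name__ == "__main__":
    known = {"Alaska": 1, "Juneau": 2}  # made-up ids for illustration
    print(resolve_links(["Alaska", "Nowhere", "Juneau"], known))
```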
