12:28 am, 10 Oct 04
update on wikipedia
I've been putting about ten minutes a night into my wikipedia project, so there's not much to say. But my goal is to extract all the links, and I think I have a program doing that now. The problem is that the export is just the raw WikiText, so I have to pull each row (entry) from the database, extract all the links, and then map those textual names back to entry ids.
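In case it helps picture the extraction step, here's roughly the shape of it (a simplified sketch using the standard Str library, not my actual code; it only grabs the target half of piped [[Target|label]] links and ignores interwiki prefixes):

(* Match "[[" followed by the link target, stopping at "|" or "]]". *)
let link_re = Str.regexp "\\[\\[\\([^]|]+\\)"

(* Return the link targets found in one entry's raw WikiText. *)
let extract_links text =
  let rec loop pos acc =
    match (try Some (Str.search_forward link_re text pos) with Not_found -> None) with
    | None -> List.rev acc
    | Some i ->
        let target = Str.matched_group 1 text in
        loop (i + String.length (Str.matched_string text)) (target :: acc)
  in
  loop 0 []

(Link against str.cma; extract_links "See [[Alaska]] and [[Puget Sound|the Sound]]." gives ["Alaska"; "Puget Sound"].)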
It's complicated by the fact that the link->entry title mapping is a little ad-hoc; there's whitespace stripping and mapping spaces to underscores, as well as capitalization of the first letter of the entry, but I'm not sure where I was supposed to discover that. I'm also a little concerned about the high-bit-set characters in the entry titles; I think they may be ISO-8859-1? I dug through the MediaWiki source for a bit but it didn't seem too principled.
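For what it's worth, my current guess at the normalization is something like this (a sketch based on observed behavior, not MediaWiki's code; note it only capitalizes ASCII, which dodges the high-bit question entirely):

(* Normalize a link target to the form used for entry titles:
   trim whitespace, map spaces to underscores, uppercase the first letter. *)
let normalize_title s =
  let s = String.trim s in
  let s = String.map (fun c -> if c = ' ' then '_' else c) s in
  if s = "" then s else String.capitalize_ascii s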
But thanks to Brad, I'm using the HANDLER OPEN, etc. syntax for MySQL, and my program is now iterating through all 600,000 rows of this database. It's taking a while, and the limiting factor seems to be CPU in my export program, but profiles indicate it's spending much of its time in the GC, and a bunch more converting strings to integers and back for SQL. I don't know too much about optimizing O'Caml, nor do I have an intuition as to whether just writing it in straightforward Perl would do the job better.
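Roughly, the scan looks like this (a sketch assuming the ocaml-mysql bindings -- Mysql.exec / Mysql.fetch -- and MediaWiki's old cur table; the batch size and the process_row callback are placeholders):

(* Walk the whole cur table with HANDLER, a batch of rows at a time,
   calling process_row on each row (a string option array). *)
let scan_all_rows db process_row =
  ignore (Mysql.exec db "HANDLER cur OPEN");
  let batch = 1000 in
  let rec loop stmt =
    let result = Mysql.exec db stmt in
    let rec drain n =
      match Mysql.fetch result with
      | None -> n
      | Some row -> process_row row; drain (n + 1)
    in
    if drain 0 > 0 then
      loop (Printf.sprintf "HANDLER cur READ NEXT LIMIT %d" batch)
  in
  loop (Printf.sprintf "HANDLER cur READ FIRST LIMIT %d" batch);
  ignore (Mysql.exec db "HANDLER cur CLOSE")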
By the way, there are only about 600,000 rows in this database. The Wikipedia stuff I've read trumpeted reaching a million entries, but as far as I can tell that only means their primary key hit that point. After stripping out all of the namespaces (such as Talk), I'm left with 646,068 rows, and many of them are redirects: the text of the entry for "AlaskA" is #REDIRECT [[Alaska]].
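Spotting those is straightforward; something along these lines works (a sketch reusing the extract_links helper from above; the case-insensitive prefix check is my own guess at how lenient MediaWiki is):

(* If the entry text is a redirect, return the target it points at. *)
let redirect_target text =
  let prefix = "#redirect" in
  let len = String.length prefix in
  if String.length text >= len
     && String.lowercase_ascii (String.sub text 0 len) = prefix
  then match extract_links text with
       | target :: _ -> Some target   (* "#REDIRECT [[Alaska]]" -> Some "Alaska" *)
       | [] -> None
  else None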
In any case, I'm currently around row 70,000 and I've extracted 1.4 million links. (I haven't yet decided what to do with links to pages that don't exist; right now I'm dropping them.)
I spent a week trying to tune it and eventually just gave up. I wish I hadn't had to, because the pragmatics of developing in O'Caml are generally much nicer than in C++, but it seemed I had no choice.