Evan Martin (evan) wrote in evan_tech,

wikipedia / databases

I'm looking to play with the Wikipedia data, but first I need to load their database dumps. I tried the naive way (tell FreeBSD to install mysql, then run their database dump) and it just sat there, grinding. Thankfully, I know a MySQL expert, and he told me enough to point me in the right direction. (The main problem was that the Wikipedia database dump wanted InnoDB, which needs more configuring before it'll work.)

The bottleneck was CPU while un-bzipping the data (0.39 GB compressed, 1.4 GB uncompressed), then disk speed. systat -vmstat indicates only 10 MB/sec on my RAID, which feels low, but I also really don't want to fight with it right now. The load took under half an hour to run, and now it's building the index (/*!40000 ALTER TABLE cur ENABLE KEYS */).
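
For anyone wanting to check the decompression side on its own, here is a minimal sketch (not what I actually ran) that stream-decompresses a .bz2 file and reports how many bytes came out and how long it took. The function name decompress_throughput and the 1 MB chunk size are made up for illustration; it only uses Python's standard bz2 module. If the resulting rate is well below what the disk can sustain, the CPU is probably the limit.

```python
import bz2
import time


def decompress_throughput(path, chunk_size=1 << 20):
    """Stream-decompress a .bz2 file and return (bytes_out, seconds).

    Hypothetical helper for illustration: the output is discarded, so the
    timing covers reading the compressed file plus bzip2 decompression,
    not writing to disk or loading into MySQL.
    """
    total = 0
    start = time.perf_counter()
    with bz2.open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)  # up to chunk_size decompressed bytes
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total, elapsed
```

Calling decompress_throughput on the dump file and dividing the byte count by the elapsed seconds gives an uncompressed bytes/sec figure to compare against what systat reports for the disk.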

It's weird struggling with a gigabyte of data now; it feels like such a small amount of data compared to the stuff we deal with at work.

