Evan Martin (evan) wrote in evan_tech,

screen scraping with ruby

For various reasons I often need to screen-scrape websites. I've tried Rubyful Soup, but it takes tens of seconds to parse relatively small files. I just tried htmltools as well, and it's also incredibly slow: 9.7 seconds to parse a 400 KB file on a 3.8 GHz P4!

The file is really straightforward, too: it's just one huge table. How hard is this to parse quickly? Firefox seems to have no trouble with it...
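To make "straightforward" concrete: for a page that really is one flat table, a couple of regexes get the cells out without any parser at all. This is only a rough sketch, assuming no nested tables and properly closed `<tr>`/`<td>` tags (real-world HTML often breaks both assumptions); the helper name `extract_rows` and the sample markup are made up for illustration.

```ruby
# Hypothetical helper: pull cell text out of a flat HTML table using regexes.
# Assumes no nested tables and that every <tr>, <td>, and <th> is closed.
def extract_rows(html)
  # Each match is a one-element array holding the inner HTML of a row.
  html.scan(%r{<tr\b[^>]*>(.*?)</tr>}mi).map do |(row)|
    row.scan(%r{<t[dh]\b[^>]*>(.*?)</t[dh]>}mi).map do |(cell)|
      cell.gsub(/<[^>]+>/, "").strip # drop inline markup, trim whitespace
    end
  end
end

# Made-up sample input standing in for the real page.
html = <<~HTML
  <table>
    <tr><th>name</th><th>size</th></tr>
    <tr><td><b>foo</b></td><td> 12 </td></tr>
  </table>
HTML

p extract_rows(html)
# => [["name", "size"], ["foo", "12"]]
```

It's brittle by design, and it doesn't decode entities like `&amp;`, so it's no substitute for a real parser on messy pages.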

So my question for you: what should I be using instead? (I'm happy to switch languages -- gaal/brad, which Perl HTML parsing library is the most pleasant? Preferably something that doesn't use SAX or DOM APIs...)
Tags: grumpy, ruby

  • your vcs sucks

    I've been hacking on some Haskell stuff lately that's all managed in darcs and it's reminded me of an observation I made over two years ago now (see…

  • inspiration

    _why: "when you don't create things, you become defined by your tastes rather than ability. your tastes only narrow & exclude people. so create."

  • perl people, explain your language to me

    Every time I use perl I feel mildly positive about it right up until I encounter CPAN. I've never managed to make CPAN work, despite the multitude of…

  • 19 comments