Scraping the Craigslist blog: FAIL

03 Apr 2008

I heard from Jason Calacanis this morning that Craigslist has a new blog. The downside? No feed. A little searching turned up that Josh Catone (a great guy I met through RailsForum) pieced together a feed that just has the title and date of the blog entries for those of us who’re feed-reader dependent.

But I don’t want to have to visit the site; I just want the posts to appear in Google Reader like TechCrunch and RobotWalrus do. So I tried to piece together a Ruby script that would scrape the blog and turn it into a feed. Conclusion? Total failure.

Here’s the code that should work:

    require 'rubygems'
    require 'hpricot'
    require 'date'      # Date.parse is used below
    require 'rss/maker'
    require 'net/http'

    # Fetch the page and find the table cell that holds the posts (the one with width 625).
    blog = Hpricot.parse(Net::HTTP.get(URI.parse('http://blog.craigslist.org')))
    main_table_cell = (blog / 'td').find {|td| td.attributes['width'] == '625' }

    feed = RSS::Maker.make('1.0') do |rss|
      rss.channel.about         = "http://blog.craigslist.org/"  # rdf:about should be a URI in RSS 1.0
      rss.channel.title         = "Craigslist Blog"
      rss.channel.description   = "Craigslist Blog"
      rss.channel.link          = "http://blog.craigslist.org/"
      # Each post is assumed to start at an anchor with no text.
      (main_table_cell / 'a').select {|a| '' == a.inner_text }.each do |anchor|
        intro         = anchor.next_sibling
        header        = (intro / 'h2').first
        date          = Date.parse(header.next.inner_text.scan(/Posted (.*) by/).flatten.first)
        author_link   = header.next_sibling
        comments_link = author_link.next_sibling
        permalink     = comments_link.next_sibling

        contents = []
        paragraph = intro
        while paragraph = paragraph.next_sibling do
          contents << paragraph.inner_html
        end
        contents << "<a href='#{comments_link.attributes['href']}'>Comments</a>"

        item          = rss.items.new_item
        item.author   = author_link.inner_text
        item.title    = header.inner_text
        item.link     = permalink.attributes['href']
        item.date     = date
        item.description = '<p>' + contents.join('</p><p>') + '</p>'
      end
    end

    puts feed

But the Craigslist blog has the least valid HTML I’ve seen since the ’90s. This script barfs out pretty quickly because of wildly inconsistent placement of <p>, <a>, and even <hr> tags. It would take a much bigger script to outsmart the CHTML (Crappy Hypertext Markup Language) on the Craigslist blog.

So here’s my petition to Craigslist: let me read! Please! Even something as simple as wrapping each post in a <div class='post'> would fix everything.
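To show how little code that wrapper would leave for the scraper, here is a rough sketch run against made-up markup. Everything in it is an assumption for illustration: the sample HTML, the `Post` struct, and the `extract_posts` helper are hypothetical, and none of it is Craigslist's real page. It uses plain regular expressions from the standard library so it runs without any gems, which only holds up because the sample posts contain no nested divs. With Hpricot, a CSS-style search for `div.post` would presumably do the same job more robustly.

```ruby
# Hypothetical markup: what the blog could look like if every post were
# wrapped in <div class='post'>. This is NOT the real Craigslist HTML.
SAMPLE_HTML = <<~HTML
  <div class='post'>
    <h2>First post</h2>
    <p>Hello world.</p>
  </div>
  <div class='post'>
    <h2>Second post</h2>
    <p>More news.</p>
    <p>Even more.</p>
  </div>
HTML

# Hypothetical container for the pieces a feed item would need.
Post = Struct.new(:title, :paragraphs)

# Pull out each wrapped post. Regex matching is only safe here because
# the sample posts contain no nested <div> elements.
def extract_posts(html)
  html.scan(%r{<div class='post'>(.*?)</div>}m).flatten.map do |body|
    title      = body[%r{<h2>(.*?)</h2>}m, 1].to_s.strip
    paragraphs = body.scan(%r{<p>(.*?)</p>}m).flatten.map(&:strip)
    Post.new(title, paragraphs)
  end
end

posts = extract_posts(SAMPLE_HTML)
posts.each { |post| puts "#{post.title}: #{post.paragraphs.length} paragraph(s)" }
```

With one predictable wrapper per post, there is no guessing about where a post starts or ends, which is exactly the part the sibling-walking script above keeps getting wrong.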

  • feed43 said: just use feed43.com
