Firebug + hpricot

October 10th, 2006

Probably everyone else knows this already, but for all you screen scrapers out there digging away at your scrAPIs: I've just discovered a neat little trick that will save me tons of time.

Firebug

Firebug allows you to inspect an HTML document just by moving the mouse around on the screen. The really neato part about this is that it shows the XPath to the element that the cursor is pointing at. (Incidentally, the implementation of this must be amazing 'cause it's super responsive.) To access this, click the 'Inspect' button on the top right of the Firebug window, and make sure that the DOM tab is selected in the bottom right of the Firebug window. As you move the mouse around you can see the XPath appear in the status line right at the bottom of Firefox. As an example, the XPath for the WordPress text area that I am posting in is:

[text]
/html/body/form/div/div/fieldset[2]/div[2]/textarea
[/text]

Hpricot

_why's excellent Hpricot library allows you to parse and manipulate HTML and other XML-type languages with XPath, among other ways. But it doesn't handle XPaths with indices yet, so it won't work with the fieldset[2] or div[2] parts of the XPath above. Fortunately, Peter Szinek came up with a fix for that.
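To make the index problem concrete, here is a minimal sketch in plain Ruby (no Hpricot calls) of how that XPath breaks down into steps. The helper name split_indexed_xpath is made up for illustration and is not part of Hpricot or of Peter's fix. It only shows that each [n] in XPath counts from 1, while Ruby arrays count from 0.

```ruby
# Hypothetical helper, not part of Hpricot: break an indexed XPath into
# [path, zero_based_index] steps. A nil index means the step has no [n].
def split_indexed_xpath(xpath)
  steps = []
  rest = xpath.sub(%r{\A/}, '')                  # drop the leading slash
  until rest.empty?
    m = rest.match(%r{\A(.*?)\[(\d+)\]/?})       # everything up to the next [n]
    if m
      steps << [m[1], m[2].to_i - 1]             # XPath counts from 1, Ruby from 0
      rest = m.post_match
    else
      steps << [rest, nil]                       # trailing part without an index
      rest = ''
    end
  end
  steps
end

steps = split_indexed_xpath('/html/body/form/div/div/fieldset[2]/div[2]/textarea')
steps.each { |path, index| puts "#{path} -> #{index.inspect}" }
# html/body/form/div/div/fieldset -> 1
# div -> 1
# textarea -> nil
```

With the steps in this form, each one could be applied in turn: search with the path, then pick the element at the zero-based index.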

So, to get the contents of the textarea you can do:

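RUBY:
  require 'rubygems'
  require 'hpricot'
  require 'open-uri'

  class AnodyneScraper

    def initialize
      @url = "http://www.anodyne.ca/wp-admin/post.php?action=edit&post=19"
      @doc = Hpricot(open(@url))
      parse
    end

    def parse
      # Translate the Firebug XPath into a Ruby expression and evaluate it.
      eval(to_hpricot('@doc/html/body/form/div/div/fieldset[2]/div[2]/textarea'))
    end

    private

    # Turns "@doc/a/b[2]/c" into "(@doc/'a/b')[1]/'c'":
    # one opening paren per index, each index shifted from XPath's 1-based
    # counting to Ruby's 0-based counting, each path in front of an index
    # quoted, and a trailing step without an index quoted as well.
    def to_hpricot(xpath)
      ('(' * (xpath.split(/\d+/).size - 1)) +
      xpath.gsub(/\d+/) { |num| (num.to_i - 1).to_s }.
          gsub(/\/(.*?)\[/) { "/'#{$1}')[" }.
          sub(/\]\/([^\[\]]+)\z/) { "]/'#{$1}'" }
    end
  end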

Dead easy.

Of course, if you knew the insides of Firebug, you could almost get to the stage of code generation there.

I'll feed my hpricot trees with firebug droppings in the next few days and let you know if it works as I think it does. I know that it works in simple cases.
