Firebug + hpricot
October 10th, 2006
Probably everyone else out there knows this already, but for all you screen scrapers digging away at your scrAPIs: I've just discovered a neat little trick that will save me tons of time.
Firebug
Firebug allows you to inspect an HTML document just by moving the mouse around on the screen. The really neato part about this is that it looks like it shows the XPath to the element that the cursor is pointing at. (Incidentally, the implementation of this must be amazing, 'cause it's super responsive.) To access this, click the 'Inspect' button at the top right of the Firebug window, and make sure that the DOM tab is selected at the bottom right of the Firebug window. As you move the mouse around, you can see the XPath appear in the status line right at the bottom of Firefox. As an example, the XPath for this WordPress text area that I am posting in is:
[text]
/html/body/form/div/div/fieldset[2]/div[2]/textarea
[/text]
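To make the shape of that path concrete, here is a minimal sketch in plain Ruby (no Hpricot needed) that splits a Firebug-style XPath into its steps. The helper name `xpath_steps` is made up for illustration, and it only understands simple `tag` and `tag[n]` steps, which is all the path above contains.

```ruby
# Split an absolute XPath into [tag, index] pairs.
# index is nil when the step has no [n] part.
# xpath_steps is a hypothetical helper, not part of Firebug or Hpricot.
def xpath_steps(xpath)
  xpath.split('/').reject(&:empty?).map do |step|
    match = step.match(/\A([^\[\]]+)(?:\[(\d+)\])?\z/)
    raise ArgumentError, "unsupported step: #{step}" unless match
    tag, index = match.captures
    [tag, index && index.to_i]
  end
end

xpath_steps('/html/body/form/div/div/fieldset[2]/div[2]/textarea').each do |tag, index|
  puts index ? "#{tag} (match number #{index}, counting from 1)" : tag
end
```

The thing to notice is that XPath indices count from 1, so `fieldset[2]` is the second fieldset, not the third.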
Hpricot
_why's excellent Hpricot library allows you to parse and manipulate HTML and other XML-type languages with XPath, among other ways. But it doesn't handle XPaths with indices yet, so it won't work with the fieldset[2] or div[2] parts of the XPath above. Fortunately, Peter Szinek came up with a fix for that.
So, to get the contents of the textarea you can do:
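require 'rubygems'
require 'hpricot'
require 'open-uri'

class AnodyneScraper

  def initialize
    @url = "http://www.anodyne.ca/wp-admin/post.php?action=edit&post=19"
    @doc = Hpricot(open(@url))
    parse
  end

  def parse
    eval(to_hpricot('@doc/html/body/form/div/div/fieldset[2]/div[2]/textarea'))
  end

  private

  # Turn a Firebug XPath into a Ruby expression for eval:
  # - open one paren per [n] index,
  # - shift each index down by one (XPath counts from 1, Ruby from 0),
  # - quote the steps in front of each index and close a paren,
  # - quote a trailing step with no index, such as /textarea.
  def to_hpricot(xpath)
    "#{'(' * xpath.scan(/\[\d+\]/).size}" +
      xpath.gsub(/\[(\d+)\]/) { "[#{$1.to_i - 1}]" }.
      gsub(/\/(.*?)\[/) { "/'#{$1}')[" }.
      sub(/\]\/([^\[\]]+)\z/) { "]/'#{$1}'" }
  end

end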
Dead easy.
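If you want to see what actually gets handed to eval, here is a minimal sketch of the same string rewriting as a standalone function, so it runs without Hpricot or a network connection. The name `hpricot_expression` and the `root` parameter are made up for illustration. It quotes a trailing step that has no index (such as /textarea) so that the result is a valid Ruby expression, and it only handles plain `tag` and `tag[n]` steps.

```ruby
# Build the Ruby expression string for a Firebug-style XPath.
# hpricot_expression and root are hypothetical names, not Hpricot API.
# root is the name of the variable that holds the parsed document.
def hpricot_expression(xpath, root = '@doc')
  # No indices at all: the whole path can be passed as one quoted string.
  return "#{root}/'#{xpath.sub(/\A\//, '')}'" unless xpath =~ /\[\d+\]/

  # XPath counts from 1, Ruby arrays count from 0.
  path = xpath.gsub(/\[(\d+)\]/) { "[#{$1.to_i - 1}]" }
  # Quote the run of steps in front of each index and close one paren.
  path = path.gsub(/\/(.*?)\[/) { "/'#{$1}')[" }
  # Quote whatever trails the last index, e.g. /textarea.
  path = path.sub(/\]\/([^\[\]]+)\z/) { "]/'#{$1}'" }
  # One opening paren per index, matching the parens closed above.
  '(' * xpath.scan(/\[\d+\]/).size + root + path
end

puts hpricot_expression('/html/body/form/div/div/fieldset[2]/div[2]/textarea')
# => ((@doc/'html/body/form/div/div/fieldset')[1]/'div')[1]/'textarea'
```

One caveat on the design: indexing the result of a search picks the nth element of the whole result set, while XPath's `fieldset[2]` means the second fieldset under each parent. The two agree when the steps before the index match a single element, as they do in this example.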
Of course, if you knew the insides of Firebug, you could almost get to the stage of code generation there.
I'll feed my Hpricot trees with Firebug droppings in the next few days and let you know if it works as I think it does. I know that it works in simple cases.




