Using Hpricot to Scrub HTML
[UPDATE 2007-01-10] I’ve updated the scrubber; see Hpricot Scrub for more. [/UPDATE]
I went looking for a Ruby replacement for Perl’s HTML::Scrubber for a gig and came up blank. Can it really be possible that nobody is doing anything more than blindly stripping tags?
I had seen Hpricot and thought I needed to find a reason to use it; well, here it is. I monkey-patched a couple of methods into Hpricot and off I went.
Here are the Hpricot bits.
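require 'hpricot'

module Hpricot
  class Elements
    # Strip every element in the set, keeping its inner content.
    def strip
      each { |x| x.strip }
    end

    # Apply Elem#strip_attributes to every element in the set.
    def strip_attributes(safe=[], patterns={})
      each { |x| x.strip_attributes(safe, patterns) }
    end
  end

  class Elem
    # Replace this element with its own inner HTML, dropping the tag itself.
    def strip
      parent.replace_child self, Hpricot.make(inner_html) unless parent.nil?
    end

    # Keep an attribute only if its name is in +safe+ and its value matches
    # the pattern for that name; no pattern means any value matches.
    def strip_attributes(safe=[], patterns={})
      attributes.each { |atr|
        pat = patterns[atr[0].to_sym] || ''
        remove_attribute(atr[0]) unless safe.include?(atr[0]) &&
          atr[1].match(pat)
      } unless attributes.nil?
    end
  end
end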
Just that bit gets me to the point where I can do things like this:
require 'open-uri'

doc = Hpricot(open('http://slashdot.org/').read)

# remove all anchors, leaving behind the text inside
(doc/:a).strip

# strip all attributes except for src from all images
(doc/:img).strip_attributes(['src'])
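The keep-or-remove rule inside strip_attributes is easy to miss in that one-liner: an attribute survives only if its name is in the safe list and its value matches the pattern registered for that name, and a name with no pattern matches anything. Here is a rough stand-alone sketch of just that rule, using a plain Hash in place of an Hpricot element. The filter_attributes helper and the sample attributes are made up for illustration and are not part of Hpricot.

```ruby
# Stand-alone sketch of the keep-or-remove rule used by strip_attributes.
# filter_attributes is a hypothetical helper, not an Hpricot method.
def filter_attributes(attributes, safe = [], patterns = {})
  attributes.select do |name, value|
    # A missing pattern falls back to '', which matches any value.
    pattern = patterns[name.to_sym] || ''
    safe.include?(name) && !value.match(pattern).nil?
  end
end

# Made-up attributes standing in for an element's attribute hash.
attrs = {
  'src'     => 'http://example.com/logo.png',
  'onclick' => 'alert(1)',
  'href'    => 'javascript:alert(1)'
}

# onclick is not in the safe list; href is safe but fails its pattern.
kept = filter_attributes(attrs, %w[src href], { href: /\Ahttps?:/ })
p kept.keys
```

Only src is left: onclick is dropped because it is not whitelisted, and href is dropped because its value does not match the pattern. That second case is what the patterns hash is for.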
Then I made a scrubber that passes the array and hash to those methods to handle the dirty work. It looks like this, though I’m also using Tidy, so mine is a little different.
class HtmlScrubber
  # Load the whitelist once and share it across calls.
  @@config = YAML.load_file(
    "#{RAILS_ROOT}/config/html_scrubber.yml") unless
    defined?(@@config)

  def self.scrub(markup)
    doc = Hpricot(markup || '', :xhtml_strict => true)
    raise 'No markup specified' if doc.nil?

    # Remove unwanted tags entirely, contents included.
    @@config[:nuke_tags].each { |tag| (doc/tag).remove }

    # Allowed tags stay, but only with whitelisted attributes.
    @@config[:allow_tags].each { |tag|
      (doc/tag).strip_attributes(@@config[:allow_attributes],
        @@config[:attribute_patterns]) }

    # Every other tag is stripped, leaving its inner content behind.
    doc.traverse_all_element { |e|
      e.strip unless @@config[:allow_tags].include?(e.name)
    }
    doc.inner_html
  end
end
Here is a zip of the code and a sample config: html_scrubber.zip
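If you don’t want to open the zip, here is a guess at what a config of that shape could look like. The four top-level keys are the ones HtmlScrubber.scrub reads above; every tag, attribute and pattern listed is an assumption for illustration, not the contents of the real sample file.

```ruby
require 'yaml'

# Hypothetical contents for config/html_scrubber.yml. Only the four key
# names come from the scrubber code; all of the values are assumptions.
SAMPLE_CONFIG = <<~YAML_TEXT
  :nuke_tags:
    - script
    - style
  :allow_tags:
    - p
    - a
    - img
  :allow_attributes:
    - href
    - src
  :attribute_patterns:
    :href: '^https?://'
    :src: '^https?://'
YAML_TEXT

# The keys are symbols, so Symbol has to be permitted when loading safely.
config = YAML.safe_load(SAMPLE_CONFIG, permitted_classes: [Symbol])

p config[:nuke_tags]                  # tags removed along with their contents
p config[:attribute_patterns][:href]  # pattern an href value must match
```

The patterns are plain strings here, which works because String#match turns a string argument into a regular expression before matching.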