Using Hpricot to Scrub HTML
[UPDATE 2007-01-10] I’ve updated the scrubber; see Hpricot Scrub for more. [/UPDATE]
I went looking for a Ruby replacement for Perl’s HTML::Scrubber for a gig and came up blank. Can it really be possible that nobody is doing anything more than blindly stripping tags?
I had seen Hpricot and thought I needed to find a reason to use it; well, here it is. I monkey-patched a couple of methods into Hpricot and off I went.
Here are the Hpricot bits.
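require 'hpricot'

# Monkey patches that add tag and attribute stripping to Hpricot.
module Hpricot
  class Elements
    # Strip every element in the set, keeping its contents.
    def strip
      each { |x| x.strip }
    end

    # Strip attributes from every element in the set.
    def strip_attributes(safe=[], patterns={})
      each { |x| x.strip_attributes(safe, patterns) }
    end
  end

  class Elem
    # Replace this element with its own inner HTML, dropping the tag itself.
    def strip
      parent.replace_child self, Hpricot.make(inner_html) unless parent.nil?
    end

    # Remove every attribute that is not in +safe+ or whose value fails
    # the pattern registered for it (no pattern means any value passes).
    def strip_attributes(safe=[], patterns={})
      attributes.each { |name, value|
        pat = patterns[name.to_sym] || ''
        remove_attribute(name) unless safe.include?(name) && value.match(pat)
      } unless attributes.nil?
    end
  end
end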
Just that bit gets me to the point where I can do things like this:
require 'open-uri'  # lets open() fetch a URL

doc = Hpricot(open('http://slashdot.org/').read)
# remove all anchors, leaving behind the text inside.
(doc/:a).strip
# strip all attributes except for src from all images
(doc/:img).strip_attributes(['src'])
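The patterns hash isn’t exercised in the calls above. As a rough sketch of how the safe list and the patterns combine, here is the same keep-or-remove test pulled out into a standalone method that runs without Hpricot. Note that keep_attribute? is a made-up name for illustration, not something Hpricot provides, and the href pattern is only an example.

```ruby
# Hypothetical helper (not part of Hpricot): mirrors the keep/remove test
# used in Elem#strip_attributes so the rule can be tried without a document.
def keep_attribute?(name, value, safe = [], patterns = {})
  pattern = patterns[name.to_sym] || ''  # an empty pattern matches any value
  safe.include?(name) && !value.match(pattern).nil?
end

safe     = ['href', 'title']
patterns = { :href => /\Ahttps?:\/\//i }  # illustrative pattern only

keep_attribute?('href', 'http://example.com/', safe, patterns)  # => true
keep_attribute?('href', 'javascript:alert(1)', safe, patterns)  # => false
keep_attribute?('title', 'anything', safe, patterns)            # => true
keep_attribute?('onclick', 'x()', safe, patterns)               # => false
```

An attribute survives only if its name is in the safe list and its value matches the pattern registered under its name; attributes with no pattern only need to be in the safe list.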
Then I made a scrubber that passes the array and hash to those methods to handle the dirty work. It looks like this, though I’m also using Tidy so mine is a little different.
require 'yaml'

class HtmlScrubber
  # Load the tag and attribute rules once.
  @@config = YAML.load_file(
    "#{RAILS_ROOT}/config/html_scrubber.yml") unless defined?(@@config)

  def self.scrub(markup)
    doc = Hpricot(markup || '', :xhtml_strict => true)
    raise 'No markup specified' if doc.nil?
    # Remove unwanted tags along with everything inside them.
    @@config[:nuke_tags].each { |tag| (doc/tag).remove }
    # Keep only whitelisted attributes on the allowed tags.
    @@config[:allow_tags].each { |tag|
      (doc/tag).strip_attributes(@@config[:allow_attributes],
                                 @@config[:attribute_patterns]) }
    # Unwrap every remaining tag that is not allowed, keeping its contents.
    doc.traverse_all_element { |e|
      e.strip unless @@config[:allow_tags].include?(e.name)
    }
    doc.inner_html
  end
end
Here is a zip of the code and a sample config: html_scrubber.zip
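For a feel of what the config might contain, here is a minimal sketch built from the four keys the scrubber reads. The tag names, attribute names, and patterns are my own placeholders, not the contents of the sample in the zip. The keys are written as YAML symbols because the scrubber looks them up with symbols. On newer Rubies, loading symbols from YAML needs them explicitly permitted, so this sketch uses YAML.safe_load rather than the YAML.load_file call in the class.

```ruby
require 'yaml'

# Hypothetical config contents; the real sample ships in html_scrubber.zip.
SAMPLE_CONFIG = <<~'YAML'
  :nuke_tags:
    - script
    - style
  :allow_tags:
    - a
    - p
    - img
  :allow_attributes:
    - href
    - src
    - alt
  :attribute_patterns:
    :href: '\Ahttps?://'
    :src: '\Ahttps?://'
YAML

# Symbols must be permitted explicitly when loading with safe_load.
CONFIG = YAML.safe_load(SAMPLE_CONFIG, permitted_classes: [Symbol])

CONFIG[:nuke_tags]                  # tags removed along with their contents
CONFIG[:allow_tags]                 # tags kept, with attributes filtered
CONFIG[:attribute_patterns][:href]  # a string, which String#match turns into a Regexp
```

The patterns are stored as plain strings; since strip_attributes hands them to String#match, they are compiled to regular expressions at match time.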