[UPDATE 2007-02-07] Changed scrub to return self [/UPDATE]
Using Hpricot to Scrub HTML - The remix
I wanted to fold the HTML Scrubber into my Hpricot tweaks to tidy it up a bit, and this is what I ended up with.
Now you can use the following to remove all tags from an HTML snippet:
require 'open-uri' # lets open() fetch a URL
doc = Hpricot(open('http://slashdot.org/').read)
doc.scrub
Strip all links, leaving the text inside intact:
(doc/:a).strip
Scrub the snippet based on a config hash:
doc.scrub(hash)
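As a rough sketch of what that hash can look like: the three keys are the ones scrub reads, and the defaults are copied from the scrub method in hpricot_scrub.rb. The tag and attribute values here are made up for illustration, and the example uses plain Ruby hashes only, so it runs without Hpricot.

```ruby
# Defaults that scrub falls back to (copied from the scrub method).
DEFAULTS = {
  :nuke_tags        => [],
  :allow_tags       => [],
  :allow_attributes => []
}

# Hypothetical config: keep <b> and <img>, drop <script> entirely,
# and let only src/alt survive on the tags that are kept.
hash = {
  :allow_tags       => %w[b img],
  :nuke_tags        => %w[script],
  :allow_attributes => %w[src alt]
}

# Keys you pass in win; keys you leave out keep their empty-array default.
full    = DEFAULTS.merge(hash)
partial = DEFAULTS.merge(:nuke_tags => %w[script])

p full
p partial
```

Because every key defaults to an empty array, calling doc.scrub with no arguments allows nothing and strips every tag.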
hpricot_scrub.rb
require 'hpricot'

module Hpricot
  class Elements
    # Strip every element in the set.
    def strip
      each { |x| x.strip }
    end

    # Strip attributes from every element in the set, keeping those in safe.
    def strip_attributes(safe=[])
      each { |x| x.strip_attributes(safe) }
    end
  end

  class Elem
    # Remove this element, and everything inside it, from its parent.
    def remove
      parent.children.delete(self)
    end

    # Replace this element with its (already stripped) contents, or
    # remove it outright when strip_removes? says so.
    def strip
      children.each { |x| x.strip unless x.class == Hpricot::Text }
      if strip_removes?
        remove
      else
        parent.replace_child self, Hpricot.make(inner_html) unless parent.nil?
      end
    end

    # Remove every attribute whose name is not listed in safe.
    def strip_attributes(safe=[])
      attributes.each {|atr|
        remove_attribute(atr[0]) unless safe.include?(atr[0])
      } unless attributes.nil?
    end

    def strip_removes?
      # I'm sure there are others that should be ripped instead of stripped
      attributes && attributes['type'] =~ /script|css/
    end
  end

  class Doc
    def scrub(config={})
      config = {
        :nuke_tags => [],
        :allow_tags => [],
        :allow_attributes => []
      }.merge(config)

      # Nuked tags go away together with their contents.
      config[:nuke_tags].each { |tag| (self/tag).remove }

      # Allowed tags stay, minus any attributes that are not allowed.
      config[:allow_tags].each { |tag|
        (self/tag).strip_attributes(config[:allow_attributes])
      }

      # Everything else is stripped down to its contents.
      children.reverse.each {|e|
        e.strip unless e.class == Hpricot::Text ||
          config[:allow_tags].include?(e.name)
      }
      self
    end
  end
end
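To make the two per-element rules easier to see, here is a small plain-Ruby sketch. It mirrors the logic of strip_attributes and strip_removes? on an ordinary hash rather than a real Hpricot element, so the attribute data and the removes? helper are stand-ins for illustration, not part of the library.

```ruby
# Hypothetical attributes of an <img> tag, as a plain hash.
attributes = { 'src' => 'cat.png', 'onclick' => 'evil()', 'alt' => 'A cat' }
safe = %w[src alt]

# Same rule as strip_attributes: a name survives only if it is in safe.
kept = attributes.select { |name, _value| safe.include?(name) }

# Same test as strip_removes?: a type attribute that mentions script or
# css marks the element for removal instead of unwrapping.
def removes?(attributes)
  !!(attributes && attributes['type'] =~ /script|css/)
end

p kept
p removes?('type' => 'text/javascript')
p removes?('type' => 'text/css')
p removes?('href' => '/index.html')
```

Note that the regex only looks at the type attribute, so a tag without one is unwrapped rather than removed unless it is listed under :nuke_tags.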
Sample config in YAML
---
:allow_tags: # keep these tags, but strip any attributes not listed under :allow_attributes
- 'b'
- 'blockquote'
- 'br'
- 'div'
- 'h1'
- 'h2'
- 'h3'
- 'h4'
- 'h5'
- 'h6'
- 'hr'
- 'i'
- 'em'
- 'img'
- 'li'
- 'ol'
- 'p'
- 'pre'
- 'small'
- 'span'
- 'strike'
- 'strong'
- 'sub'
- 'sup'
- 'table'
- 'tbody'
- 'td'
- 'tfoot'
- 'thead'
- 'tr'
- 'u'
- 'ul'
:nuke_tags: # completely removes everything between open and close tag
- 'form'
- 'script'
:allow_attributes: # let these attributes stay, strip all others
- 'src'
- 'font'
- 'alt'
- 'style'
- 'align'
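One way to turn a config like this into the hash that scrub expects is Ruby's standard yaml library. The sketch below is an assumption about how you might load it, not code from the original source, and it uses a shortened inline copy of the config so it runs on its own. Because the keys are written with a leading colon, they load as symbols, and recent Psych versions need Symbol permitted explicitly.

```ruby
require 'yaml'

# Shortened copy of the sample config (only a few entries kept).
yaml_text = <<~YAML
  ---
  :allow_tags:
  - 'b'
  - 'img'
  :nuke_tags:
  - 'form'
  - 'script'
  :allow_attributes:
  - 'src'
  - 'alt'
YAML

# Keys such as ':allow_tags' are loaded as Ruby symbols, which safe_load
# rejects unless Symbol is listed in permitted_classes.
config = YAML.safe_load(yaml_text, permitted_classes: [Symbol])

p config.keys
# The result can then be passed along, e.g. doc.scrub(config)
```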
The source includes sample data and a test. Run the test with:
ruby test