Hpricot Scrub
[UPDATE 2007-02-07] Changed scrub to return self [/UPDATE]
Using Hpricot to Scrub HTML – The remix
So I wanted to bring the HTML Scrubber into my Hpricot tweaks to tidy it up a bit and this is what I ended up with.
Now you can use the following to remove all tags from an HTML snippet:
require 'open-uri' # lets open() fetch a URL
doc = Hpricot(open('http://slashdot.org/').read)
doc.scrub
Strip all links (a tags), leaving the text inside intact
(doc/:a).strip
Scrub the snippet based on a config hash
doc.scrub(hash)
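As a minimal sketch of what such a hash might look like, assuming the three keys that scrub reads in the listing that follows (:nuke_tags, :allow_tags, :allow_attributes). SCRUB_DEFAULTS and effective_scrub_config are hypothetical names used only to show how missing keys fall back to empty arrays; they are not part of hpricot_scrub.rb.

```ruby
# Hypothetical stand-in for the defaults that scrub merges into its argument.
SCRUB_DEFAULTS = {
  :nuke_tags        => [],
  :allow_tags       => [],
  :allow_attributes => []
}

# Hypothetical helper: returns the config scrub would effectively work with.
def effective_scrub_config(config = {})
  SCRUB_DEFAULTS.merge(config)
end

# Keep b, i and p tags, drop script tags along with their contents.
hash = { :allow_tags => %w[b i p], :nuke_tags => %w[script] }
effective = effective_scrub_config(hash)

# With hpricot_scrub.rb loaded, the same hash would be passed as:
#   doc.scrub(hash)
```

Any key left out of the hash ends up as an empty array, so calling scrub with no arguments allows no tags and strips everything.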
hpricot_scrub.rb
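require 'hpricot'

module Hpricot
  class Elements
    # Strip every element in the set, keeping the content inside each one.
    def strip
      each { |x| x.strip }
    end

    # Remove all attributes not listed in safe from every element in the set.
    def strip_attributes(safe=[])
      each { |x| x.strip_attributes(safe) }
    end
  end

  class Elem
    # Remove this element, and everything inside it, from its parent.
    def remove
      parent.children.delete(self)
    end

    # Strip the child elements first, then either remove this element
    # outright (script/CSS) or replace it with its inner HTML.
    def strip
      children.each { |x| x.strip unless x.class == Hpricot::Text }
      if strip_removes?
        remove
      else
        parent.replace_child self, Hpricot.make(inner_html) unless parent.nil?
      end
    end

    # Remove every attribute whose name is not listed in safe.
    def strip_attributes(safe=[])
      attributes.each {|atr|
        remove_attribute(atr[0]) unless safe.include?(atr[0])
      } unless attributes.nil?
    end

    # True when the element's type attribute mentions script or css.
    def strip_removes?
      # I'm sure there are others that should be ripped instead of stripped
      attributes && attributes['type'] =~ /script|css/
    end
  end

  class Doc
    def scrub(config={})
      # Any key missing from config defaults to an empty list.
      config = {
        :nuke_tags => [],
        :allow_tags => [],
        :allow_attributes => []
      }.merge(config)
      # Nuked tags are removed together with their contents.
      config[:nuke_tags].each { |tag| (self/tag).remove }
      # Allowed tags stay, but keep only the allowed attributes.
      config[:allow_tags].each { |tag|
        (self/tag).strip_attributes(config[:allow_attributes])
      }
      # Strip the remaining top-level elements, leaving their text behind.
      children.reverse.each {|e|
        e.strip unless e.class == Hpricot::Text ||
          config[:allow_tags].include?(e.name)
      }
      # Return self so calls can be chained.
      self
    end
  end
end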
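The attribute whitelist in strip_attributes can be tried out without Hpricot. This is only a sketch on a plain Ruby Hash, assuming attributes behave like a name-to-value hash as the listing above treats them; whitelist_attributes is a hypothetical name, not a method of the library.

```ruby
# Hypothetical helper mirroring Elem#strip_attributes on a plain Hash:
# keep only the attributes whose names appear in safe.
def whitelist_attributes(attributes, safe = [])
  attributes.select { |name, _value| safe.include?(name) }
end

attrs = { 'src' => 'a.png', 'onclick' => 'evil()', 'alt' => 'A' }
kept  = whitelist_attributes(attrs, %w[src alt])
```

With an empty safe list, which is the default, every attribute is dropped.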
Sample config in YAML
---
:allow_tags: # let these tags stay, but will strip attributes
- 'b'
- 'blockquote'
- 'br'
- 'div'
- 'h1'
- 'h2'
- 'h3'
- 'h4'
- 'h5'
- 'h6'
- 'hr'
- 'i'
- 'em'
- 'img'
- 'li'
- 'ol'
- 'p'
- 'pre'
- 'small'
- 'span'
- 'strike'
- 'strong'
- 'sub'
- 'sup'
- 'table'
- 'tbody'
- 'td'
- 'tfoot'
- 'thead'
- 'tr'
- 'u'
- 'ul'
:nuke_tags: # completely removes everything between open and close tag
- 'form'
- 'script'
:allow_attributes: # let these attributes stay, strip all others
- 'src'
- 'font'
- 'alt'
- 'style'
- 'align'
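A config like the one above could be turned into the hash that scrub expects along these lines. This is a sketch, assuming a reasonably recent Ruby whose standard YAML library requires symbols to be permitted explicitly; the shortened yaml_text here is a stand-in for the full file.

```ruby
require 'yaml'

# Shortened stand-in for the sample config; the keys are Ruby symbols.
yaml_text = <<~YAML
  ---
  :allow_tags: # let these tags stay, but will strip attributes
  - 'b'
  - 'p'
  :nuke_tags: # completely removes everything between open and close tag
  - 'form'
  - 'script'
  :allow_attributes: # let these attributes stay, strip all others
  - 'src'
  - 'alt'
YAML

# Symbol keys must be permitted explicitly when loading safely.
config = YAML.safe_load(yaml_text, permitted_classes: [Symbol])

# With hpricot_scrub.rb loaded, the result would be passed as:
#   doc.scrub(config)
```

The keys come back as symbols, matching the :nuke_tags, :allow_tags and :allow_attributes keys that scrub looks up.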
The source comes with sample data and a test. Run the test with
ruby test
blowmage on Hpricot Scrub on Aug 20, 2008 at 09:24 PM
It would be nice if scrub() returned self, so you could chain the calls.
html = Hpricot(open(url).read).scrub.inner_html
michael on Hpricot Scrub on Aug 20, 2008 at 09:24 PM
so be it…
I updated the .zip and the post to reflect the change.
epugh on Hpricot Scrub on Aug 20, 2008 at 09:24 PM
Just an FYI for anyone else, it is now a gem!
gem install hpricot-scrub
chick on Hpricot Scrub on Mar 26, 2009 at 02:43 PM
Seems close, but:
1) The gem did not seem to include lib/hpricot_scrub/hpricot_scrub.rb
2) hpricot_scrub.rb, which I grabbed from the post, did not extend the Comment and BogusETag classes with remove, strip, etc.
3) Line 73 did not handle children that do not respond_to? :name
After fixing those things locally, it seems to work OK.
chick on Hpricot Scrub on Mar 26, 2009 at 02:44 PM
forgot to say thanks
michael on Hpricot Scrub on Apr 07, 2009 at 10:11 AM
@chick – was the gem for version 0.3.5 broken for you?
gem install hpricot_scrub
Sounds like you may have gotten the old hpricot-scrub gem, which should have been removed from RubyForge but apparently was still there.