UnderpantsGnome

1. Collect Underpants 2. ??? 3. Profit!

Hpricot Scrub

January 20, 2007 21:23

[UPDATE 2007-02-07] Changed scrub to return self [/UPDATE]

Using Hpricot to Scrub HTML – The remix

I wanted to fold the HTML Scrubber into my Hpricot tweaks to tidy it up a bit, and this is what I ended up with.

Now you can use the following to remove all tags from an HTML snippet:

doc = Hpricot(open('http://slashdot.org/').read)
doc.scrub

Strip all links, leaving the text inside intact:

(doc/:a).strip

Scrub the snippet based on a config hash:

doc.scrub(hash)
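For reference, here is a rough sketch of what that hash could look like. The three keys and their empty-array defaults are the ones scrub merges in (see hpricot_scrub.rb below); the helper name effective_scrub_config and the tag lists are made up for illustration and are not part of the library.

```ruby
# Defaults copied from Doc#scrub in hpricot_scrub.rb.
SCRUB_DEFAULTS = {
  :nuke_tags        => [],
  :allow_tags       => [],
  :allow_attributes => []
}

# Hypothetical helper (not part of hpricot_scrub): returns the config
# scrub would end up working with after merging in its defaults.
def effective_scrub_config(config = {})
  SCRUB_DEFAULTS.merge(config)
end

# Illustrative values only; any key left out falls back to [].
hash = {
  :nuke_tags  => %w[script form],
  :allow_tags => %w[p b i]
}

p effective_scrub_config(hash)

# With Hpricot loaded, the same hash would be passed straight in:
#   doc.scrub(hash)
```

Since missing keys default to empty arrays, calling scrub with no arguments allows nothing and nukes nothing, which is why a bare doc.scrub strips every tag.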

hpricot_scrub.rb
require 'hpricot'

module Hpricot
  class Elements
    # Strip every element in the set.
    def strip
      each { |x| x.strip }
    end

    # Strip attributes from every element in the set, keeping those in safe.
    def strip_attributes(safe=[])
      each { |x| x.strip_attributes(safe) }
    end
  end

  class Elem
    # Remove this element, and everything inside it, from its parent.
    def remove
      parent.children.delete(self)
    end

    # Replace this element with its contents, or remove it outright
    # when strip_removes? says so.
    def strip
      children.each { |x| x.strip unless x.class == Hpricot::Text }

      if strip_removes?
        remove
      else
        parent.replace_child self, Hpricot.make(inner_html) unless parent.nil?
      end
    end

    # Remove every attribute whose name is not in safe.
    def strip_attributes(safe=[])
      attributes.each { |atr| remove_attribute(atr[0]) unless safe.include?(atr[0]) } unless attributes.nil?
    end

    def strip_removes?
      # I'm sure there are others that should be ripped instead of stripped
      attributes && attributes['type'] =~ /script|css/
    end
  end

  class Doc
    def scrub(config={})
      config = {
        :nuke_tags => [],
        :allow_tags => [],
        :allow_attributes => []
      }.merge(config)

      # Nuked tags go away along with their contents.
      config[:nuke_tags].each { |tag| (self/tag).remove }

      # Allowed tags stay, minus any attribute that is not allowed.
      config[:allow_tags].each { |tag| (self/tag).strip_attributes(config[:allow_attributes]) }

      # Everything else is stripped down to its contents.
      children.reverse.each { |e| e.strip unless e.class == Hpricot::Text || config[:allow_tags].include?(e.name) }
      self
    end
  end
end
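strip_attributes walks attributes entry by entry. Assuming attributes behaves like a plain Ruby Hash of name => value, each entry arrives as a [name, value] pair, and it is the name (element 0) that gets checked against the safe list. Here is a small Hpricot-free sketch of just that check; the method name unsafe_attribute_names and the sample attributes are made up for illustration.

```ruby
# Hypothetical stand-in for the loop in Elem#strip_attributes: instead
# of removing attributes from an element, it collects the names that
# would be removed. Assumes attributes is a Hash of name => value.
def unsafe_attribute_names(attributes, safe = [])
  names = []
  attributes.each do |atr|
    # atr is a [name, value] pair, so atr[0] is the attribute name.
    names << atr[0] unless safe.include?(atr[0])
  end
  names
end

attrs = { 'src' => 'cat.png', 'onclick' => 'evil()', 'alt' => 'A cat' }

p unsafe_attribute_names(attrs, %w[src alt])
p unsafe_attribute_names(attrs)
```

With an empty safe list every attribute name comes back, which matches the default of :allow_attributes => [] in scrub.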


Sample config in YAML
---
  :allow_tags: # let these tags stay, but will strip attributes
    - 'b'
    - 'blockquote'
    - 'br'
    - 'div'
    - 'h1'
    - 'h2'
    - 'h3'
    - 'h4'
    - 'h5'
    - 'h6'
    - 'hr'
    - 'i'
    - 'em'
    - 'img'
    - 'li'
    - 'ol'
    - 'p'
    - 'pre'
    - 'small'
    - 'span'
    - 'strike'
    - 'strong'
    - 'sub'
    - 'sup'
    - 'table'
    - 'tbody'
    - 'td'
    - 'tfoot'
    - 'thead'
    - 'tr'
    - 'u'
    - 'ul'

  :nuke_tags: # completely removes everything between open and close tag
    - 'form'
    - 'script'
  :allow_attributes: # let these attributes stay, strip all others
    - 'src'
    - 'font'
    - 'alt'
    - 'style'
    - 'align'
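To turn a YAML config like this into the hash that scrub expects, something along these lines should do. This is a sketch, not part of hpricot_scrub: load_scrub_config and the shortened sample are made up for illustration. It assumes a reasonably current Ruby, where the keys (which are symbols such as :allow_tags) have to be explicitly permitted when loading safely; on the Ruby of this post's era a plain YAML.load was enough.

```ruby
require 'yaml'

# Shortened version of the sample config, for illustration only.
SAMPLE_SCRUB_YAML = <<~YAML
  ---
  :allow_tags: # let these tags stay, but will strip attributes
    - 'b'
    - 'p'
  :nuke_tags: # completely removes everything between open and close tag
    - 'form'
    - 'script'
  :allow_attributes: # let these attributes stay, strip all others
    - 'src'
    - 'alt'
YAML

# Hypothetical helper: parse the YAML text into a Hash with symbol keys.
# Symbol has to be permitted because the keys are written as :name.
def load_scrub_config(yaml_text)
  YAML.safe_load(yaml_text, permitted_classes: [Symbol])
end

config = load_scrub_config(SAMPLE_SCRUB_YAML)
p config[:nuke_tags]

# With Hpricot loaded, the result would be handed to scrub:
#   doc.scrub(config)
```

To read the config from disk instead, pass File.read of the YAML file's path to the same helper.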


The source ships with sample data and a test. Run the test with:
ruby test

Comments

blowmage August 21, 2008 04:24

It would be nice if scrub() returned self, so you could chain the calls.

html = Hpricot(open(url).read).scrub.inner_html

michael August 21, 2008 04:24

so be it…

I updated the .zip and the post to reflect the change.

epugh August 21, 2008 04:24

Just an FYI for anyone else, it is now a gem! gem install hpricot-scrub

chick March 26, 2009 21:43

Seems close but
1 ) gem did not seem to have the lib/hpricot_scrub/hpricot_scrub.rb
2 ) hpricot_scrub.rb, which I grabbed from post did not extend classes Comment and BogusETag with remove, strip etc. methods
3 ) Line 73 did not deal with children that do not respond_to? :name
After fixing up those things locally it seems to work ok

chick March 26, 2009 21:44

forgot to say thanks

micahel April 07, 2009 17:11

@chick – was the gem for version 0.3.5 broken for you?

gem install hpricot_scrub

sounds like you may have gotten the old hpricot-scrub gem, which should have been removed from rubyforge, but it appears it was still there.
