Underpants Gnome


Hpricot Scrub

Posted in ruby, rails by Michael on January 20th, 2007

[UPDATE 2007-02-07] Changed scrub to return self [/UPDATE]

Using Hpricot to Scrub HTML - The remix

So I wanted to bring the HTML Scrubber into my Hpricot tweaks to tidy it up a bit, and this is what I ended up with.

Now you can use the following to remove all tags from an HTML snippet:


require 'open-uri' # lets open() fetch a URL
doc = Hpricot(open('http://slashdot.org/').read)
doc.scrub

Strip all links (a tags), leaving the text inside intact:


(doc/:a).strip

Scrub the snippet based on a config hash:


doc.scrub(hash)
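
For a rough idea of what that hash can look like, here is a minimal sketch that needs nothing but plain Ruby. The three keys and their empty defaults are copied from Doc#scrub in the listing in this post; the constant name SCRUB_DEFAULTS and the tag choices are just assumptions for illustration.

```ruby
# Defaults copied from Doc#scrub in hpricot_scrub.rb.
# SCRUB_DEFAULTS is a hypothetical name used only in this sketch.
SCRUB_DEFAULTS = {
  :nuke_tags => [],
  :allow_tags => [],
  :allow_attributes => []
}.freeze

# Hypothetical config: keep a few inline tags, drop forms and scripts entirely.
hash = {
  :allow_tags => %w[b i em strong p],
  :nuke_tags  => %w[form script]
}

# Doc#scrub merges the caller's hash over the defaults, so any key left out
# (here :allow_attributes) falls back to an empty list.
effective = SCRUB_DEFAULTS.merge(hash)
```

Note that the keys are symbols. Since scrub reads config[:nuke_tags] and friends, a hash with string keys would be merged in but never looked at.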

hpricot_scrub.rb


require 'hpricot'

module Hpricot
  class Elements
    def strip
      each { |x| x.strip }
    end
    
    def strip_attributes(safe=[])
      each { |x| x.strip_attributes(safe) }
    end
  end

  class Elem
    def remove
      parent.children.delete(self)
    end

    def strip
      children.each { |x| x.strip unless x.class == Hpricot::Text }

      if strip_removes?
        remove
      else
        parent.replace_child self, Hpricot.make(inner_html) unless parent.nil?
      end
    end
    
    def strip_attributes(safe=[])
      attributes.each {|atr|
          remove_attribute(atr[0]) unless safe.include?(atr[0])
      } unless attributes.nil?
    end
    
    def strip_removes?
      # I'm sure there are others that should be ripped instead of stripped
      attributes && attributes['type'] =~ /script|css/
    end
  end

  class Doc
    def scrub(config={})
      config = {
        :nuke_tags => [],
        :allow_tags => [],
        :allow_attributes => []
      }.merge(config)
      
      config[:nuke_tags].each { |tag| (self/tag).remove }
      config[:allow_tags].each { |tag|
        (self/tag).strip_attributes(config[:allow_attributes])
      }
      children.reverse.each {|e|
        e.strip unless e.class == Hpricot::Text ||
          config[:allow_tags].include?(e.name)
      }
      self
    end
  end
end
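
To see which elements Elem#strip removes outright rather than unwrapping, here is a small sketch that applies the same pattern as strip_removes? to a plain attribute hash, so it runs without Hpricot. The standalone method is an assumption for illustration only; the real check lives on Hpricot::Elem.

```ruby
# Same test as Elem#strip_removes?, but taking a plain Hash of attributes.
# Standalone helper for illustration; not part of hpricot_scrub.rb.
def strip_removes?(attributes)
  !!(attributes && attributes['type'] =~ /script|css/)
end

strip_removes?('type' => 'text/javascript') # => true
strip_removes?('type' => 'text/css')        # => true
strip_removes?('href' => '/about')          # => false (no type attribute)
strip_removes?(nil)                         # => false (no attributes at all)
```

As far as I can tell from the listing, a script tag with no type attribute would only be unwrapped, leaving its contents behind as text, which is a good reason to list 'script' under :nuke_tags as well.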

Sample config in YAML


---
    :allow_tags: # let these tags stay, but will strip attributes
        - 'b'
        - 'blockquote'
        - 'br'
        - 'div'
        - 'h1'
        - 'h2'
        - 'h3'
        - 'h4'
        - 'h5'
        - 'h6'
        - 'hr'
        - 'i'
        - 'em'
        - 'img'
        - 'li'
        - 'ol'
        - 'p'
        - 'pre'
        - 'small'
        - 'span'
        - 'strike'
        - 'strong'
        - 'sub'
        - 'sup'
        - 'table'
        - 'tbody'
        - 'td'
        - 'tfoot'
        - 'thead'
        - 'tr'
        - 'u'
        - 'ul'

    :nuke_tags: # completely removes everything between open and close tag
        - 'form'
        - 'script'
        
    :allow_attributes: # let these attributes stay, strip all others
        - 'src'
        - 'font'
        - 'alt'
        - 'style'
        - 'align'
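
One way to turn a file like this into the hash that scrub expects is sketched below. It assumes a reasonably recent Ruby where YAML.safe_load accepts the permitted_classes keyword; load_scrub_config is a made-up helper name, and the inline YAML is a trimmed copy of the sample above.

```ruby
require 'yaml'

# Trimmed copy of the sample config. The leading colon on each key
# makes Ruby's YAML parser load it as a Symbol.
SAMPLE_CONFIG = <<~CONFIG
  ---
  :allow_tags: # let these tags stay, but will strip attributes
    - 'b'
    - 'p'
  :nuke_tags: # completely removes everything between open and close tag
    - 'form'
    - 'script'
  :allow_attributes: # let these attributes stay, strip all others
    - 'src'
    - 'alt'
CONFIG

# Hypothetical helper: parse the YAML text into a symbol-keyed Hash.
# Symbol has to be permitted explicitly for safe_load to accept the keys.
def load_scrub_config(yaml_text)
  YAML.safe_load(yaml_text, permitted_classes: [Symbol])
end

config = load_scrub_config(SAMPLE_CONFIG)
```

With hpricot_scrub.rb loaded, the result should be usable directly as doc.scrub(config).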

The source comes with sample data and a test. Run the test with:


ruby test

3 Responses to 'Hpricot Scrub'


  1. blowmage said,

    on February 6th, 2007 at 8:25 pm

    It would be nice if scrub() returned self, so you could chain the calls.

    html = Hpricot(open(url).read).scrub.inner_html

  2. Michael said,

    on February 7th, 2007 at 12:35 pm

    so be it…

    I updated the .zip and the post to reflect the change.

  3. epugh said,

    on December 3rd, 2007 at 12:55 pm

    Just an FYI for anyone else, it is now a gem! gem install hpricot-scrub
