Hpricot Scrub

[UPDATE 2007-02-07] Changed scrub to return self [/UPDATE]

Using Hpricot to Scrub HTML - The remix

So I wanted to bring the HTML Scrubber into my Hpricot tweaks to tidy it up a bit, and this is what I ended up with.

Now you can use the following to remove all tags from an HTML snippet:

require 'open-uri' # lets open() fetch a URL (Ruby before 3.0; newer versions use URI.open)
doc = Hpricot(open('http://slashdot.org/').read)
doc.scrub

Strip all links (a tags), leaving the text inside intact:

(doc/:a).strip

Scrub the snippet based on a config hash:

doc.scrub(hash)
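
The hash uses the three symbol keys that scrub reads, and scrub merges it over a set of empty defaults. Here is a small sketch of that merge in plain Ruby, without Hpricot. The tag and attribute values are made up for illustration, and the variable names are my own.

```ruby
# Defaults copied from Doc#scrub in hpricot_scrub.rb.
defaults = {
  :nuke_tags        => [],
  :allow_tags       => [],
  :allow_attributes => []
}

# Hypothetical config: drop <script> entirely, keep <b> and <p>,
# and let only the alt attribute survive on the kept tags.
hash = {
  :nuke_tags        => ['script'],
  :allow_tags       => ['b', 'p'],
  :allow_attributes => ['alt']
}

# Same merge that scrub performs on its argument.
merged = defaults.merge(hash)

# A partial hash is fine; missing keys fall back to the empty defaults.
partial = defaults.merge(:nuke_tags => ['form'])

# String keys do not override the symbol defaults; they just sit beside them.
stringy = defaults.merge('nuke_tags' => ['form'])
```

So the keys have to be symbols. A hash with string keys would leave all three lists empty, and scrub would strip every tag.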

hpricot_scrub.rb

require 'hpricot'

module Hpricot
  class Elements
    def strip
      each { |x| x.strip }
    end

    def strip_attributes(safe=[])
      each { |x| x.strip_attributes(safe) }
    end
  end

  class Elem
    def remove
      parent.children.delete(self)
    end

    def strip
      # Iterate over a copy: strip/remove below mutate the children array
      children.dup.each { |x| x.strip unless x.class == Hpricot::Text }

      if strip_removes?
        remove
      else
        parent.replace_child self, Hpricot.make(inner_html) unless parent.nil?
      end
    end

    def strip_attributes(safe=[])
      attributes.each {|atr|
          remove_attribute(atr[0]) unless safe.include?(atr[0])
      } unless attributes.nil?
    end

    def strip_removes?
      # I'm sure there are others that should be ripped instead of stripped
      attributes && attributes['type'] =~ /script|css/
    end
  end

  class Doc
    def scrub(config={})
      config = {
        :nuke_tags => [],
        :allow_tags => [],
        :allow_attributes => []
      }.merge(config)

      config[:nuke_tags].each { |tag| (self/tag).remove }
      config[:allow_tags].each { |tag|
        (self/tag).strip_attributes(config[:allow_attributes])
      }
      children.reverse.each {|e|
        e.strip unless e.class == Hpricot::Text ||
          config[:allow_tags].include?(e.name)
      }
      self
    end
  end
end
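
To make the three outcomes (nuke, allow, strip) easier to see, here is a toy model that does not use Hpricot at all. Node and scrub_node are made-up names and the tree is a plain Struct, so treat it as a rough sketch of what the config options are meant to do, not as a copy of the code above. In particular, it applies the rules at every depth, which is a simplification I am assuming for the sketch.

```ruby
# Toy node: a tag name, an attribute hash, and children (Nodes or Strings).
Node = Struct.new(:name, :attrs, :children)

# Returns an array of whatever survives in place of the given node.
def scrub_node(node, config)
  return [node] if node.is_a?(String)                 # text always survives
  return [] if config[:nuke_tags].include?(node.name) # tag and contents dropped

  kids = node.children.flat_map { |c| scrub_node(c, config) }

  if config[:allow_tags].include?(node.name)
    # Allowed tag: keep it, but filter its attributes against the safe list.
    safe = node.attrs.select { |k, _| config[:allow_attributes].include?(k) }
    [Node.new(node.name, safe, kids)]
  else
    # Any other tag: drop the tag itself, keep what was inside it.
    kids
  end
end

config = {
  :nuke_tags        => ['script'],
  :allow_tags       => ['div'],
  :allow_attributes => ['align']
}

tree = Node.new('div', { 'class' => 'x', 'align' => 'left' }, [
  'Hi ',
  Node.new('a', { 'href' => '#' }, ['link']),
  Node.new('script', {}, ['alert(1)'])
])

result = scrub_node(tree, config)
```

Here the div stays with only its align attribute, the a tag is unwrapped down to its text, and the script disappears along with its contents.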

Sample config in YAML

---
  :allow_tags: # let these tags stay, but strip their attributes
    - 'b'
    - 'blockquote'
    - 'br'
    - 'div'
    - 'h1'
    - 'h2'
    - 'h3'
    - 'h4'
    - 'h5'
    - 'h6'
    - 'hr'
    - 'i'
    - 'em'
    - 'img'
    - 'li'
    - 'ol'
    - 'p'
    - 'pre'
    - 'small'
    - 'span'
    - 'strike'
    - 'strong'
    - 'sub'
    - 'sup'
    - 'table'
    - 'tbody'
    - 'td'
    - 'tfoot'
    - 'thead'
    - 'tr'
    - 'u'
    - 'ul'

  :nuke_tags: # completely removes everything between open and close tag
    - 'form'
    - 'script'

  :allow_attributes: # let these attributes stay, strip all others
    - 'src'
    - 'font'
    - 'alt'
    - 'style'
    - 'align'

The source comes with sample data and a test; run the test with:

ruby test