Hpricot Scrub
[UPDATE 2007-02-07] Changed scrub to return self [/UPDATE]
Using Hpricot to Scrub HTML - The remix
So I wanted to fold the HTML Scrubber into my Hpricot tweaks and tidy it up a bit, and this is what I ended up with.
Now you can use the following to remove all tags from an HTML snippet
require 'open-uri'
doc = Hpricot(open('http://slashdot.org/').read)
doc.scrub
Strip all links (a tags), leaving the text inside intact
(doc/:a).strip
Scrub the snippet based on a config hash
doc.scrub(hash)
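For reference, the hash passed to scrub uses the three keys that hpricot_scrub.rb merges over its defaults. Here is a minimal sketch of such a hash and of how a missing key falls back to an empty list. The tag and attribute choices are only illustrative, and the defaults variable is a stand-in for the hash literal inside scrub.

```ruby
# Defaults mirrored from Doc#scrub: nothing nuked, nothing allowed.
defaults = {
  :nuke_tags        => [],
  :allow_tags       => [],
  :allow_attributes => []
}

# Illustrative choices only; pick whatever suits your content.
hash = {
  :allow_tags       => ['b', 'i', 'p', 'img'],
  :allow_attributes => ['src', 'alt']
}

config = defaults.merge(hash)
p config[:nuke_tags]   # key left out of hash, so the default applies
p config[:allow_tags]
```

With an empty hash every tag ends up stripped, which is what the bare doc.scrub call does.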
hpricot_scrub.rb
require 'hpricot'
module Hpricot
  class Elements
    # Strip every element in the set, keeping its inner content
    def strip
      each { |x| x.strip }
    end
    # Remove all attributes except those listed in safe
    def strip_attributes(safe=[])
      each { |x| x.strip_attributes(safe) }
    end
  end
  class Elem
    # Remove this element and everything inside it
    def remove
      parent.children.delete(self)
    end
    # Strip the child elements first, then replace this element
    # with its inner content (or remove it outright, see strip_removes?)
    def strip
      children.each { |x| x.strip unless x.class == Hpricot::Text }
      if strip_removes?
        remove
      else
        parent.replace_child self, Hpricot.make(inner_html) unless parent.nil?
      end
    end
    def strip_attributes(safe=[])
      attributes.each {|atr|
        remove_attribute(atr[0]) unless safe.include?(atr[0])
      } unless attributes.nil?
    end
    def strip_removes?
      # I'm sure there are others that should be ripped instead of stripped
      attributes && attributes['type'] =~ /script|css/
    end
  end
  class Doc
    def scrub(config={})
      config = {
        :nuke_tags => [],
        :allow_tags => [],
        :allow_attributes => []
      }.merge(config)
      # nuked tags go away along with their contents
      config[:nuke_tags].each { |tag| (self/tag).remove }
      # allowed tags stay, minus any attributes that are not allowed
      config[:allow_tags].each { |tag|
        (self/tag).strip_attributes(config[:allow_attributes])
      }
      # everything else is stripped down to its inner content
      children.reverse.each {|e|
        e.strip unless e.class == Hpricot::Text ||
          config[:allow_tags].include?(e.name)
      }
      self
    end
  end
end
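To make the three outcomes in scrub easier to see, here is a rough sketch in plain Ruby with no Hpricot involved. The action_for and filter_attributes helpers are hypothetical names made up for this illustration. They mirror the decision scrub makes for one of the document's direct children and the filtering done by strip_attributes, not the tree walking itself.

```ruby
# Hypothetical helper: which of scrub's three outcomes applies to a tag name.
def action_for(tag, config)
  return :nuke  if config[:nuke_tags].include?(tag)   # tag and contents removed
  return :allow if config[:allow_tags].include?(tag)  # tag kept, attributes filtered
  :strip                                              # tag dropped, inner content kept
end

# Hypothetical helper: keep only the attributes named in the safe list.
def filter_attributes(attributes, safe=[])
  attributes.select { |name, _| safe.include?(name) }
end

config = {
  :nuke_tags        => ['script'],
  :allow_tags       => ['img'],
  :allow_attributes => ['src']
}

p action_for('script', config)
p action_for('img', config)
p action_for('font', config)
p filter_attributes({ 'src' => 'a.png', 'onclick' => 'x()' }, config[:allow_attributes])
```

Note that scrub as written only checks allow_tags against the document's direct children, since Elem#strip strips every nested element regardless of the allow list.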
Sample config in YAML
---
:allow_tags: # let these tags stay, but will strip attributes
- 'b'
- 'blockquote'
- 'br'
- 'div'
- 'h1'
- 'h2'
- 'h3'
- 'h4'
- 'h5'
- 'h6'
- 'hr'
- 'i'
- 'em'
- 'img'
- 'li'
- 'ol'
- 'p'
- 'pre'
- 'small'
- 'span'
- 'strike'
- 'strong'
- 'sub'
- 'sup'
- 'table'
- 'tbody'
- 'td'
- 'tfoot'
- 'thead'
- 'tr'
- 'u'
- 'ul'
:nuke_tags: # completely removes everything between open and close tag
- 'form'
- 'script'
:allow_attributes: # let these attributes stay, strip all others
- 'src'
- 'font'
- 'alt'
- 'style'
- 'align'
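One way to turn this YAML into the hash that scrub expects is sketched below. The leading colons make the keys load as Ruby symbols, which is what scrub looks up. The inline YAML is a shortened copy of the sample. The permitted_classes option assumes a recent Ruby whose YAML loader refuses symbols unless told otherwise; on the Ruby of this post's era a plain YAML.load was enough.

```ruby
require 'yaml'

yaml = <<EOS
---
:allow_tags:
- 'b'
- 'p'
:nuke_tags:
- 'form'
- 'script'
:allow_attributes:
- 'src'
- 'alt'
EOS

# Symbol keys need explicit permission under recent versions of Psych.
config = YAML.safe_load(yaml, permitted_classes: [Symbol])

p config.keys
p config[:nuke_tags]
# doc.scrub(config) would take it from here
```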
The source includes sample data and a test. Run the test with
ruby test
Managing gems with Rake
[UPDATE 2007-02-07] I realized I left some extra junk in the version of Util in the zip, it’s been updated [/UPDATE]
I have a rake task and a Util class that I use to make setting up required gems painless and to be sure that I’m always running the versions I think I am.
Install or update required gems
rake gems:install
To make sure they are loaded with the right versions during startup, add the following to environment.rb
Util.load_gems
This uses a config file, config/gems.yml, that looks like
:source: http://local_mirror.example.com # this is optional
:gems:
  - :name: mongrel
    :version: "1.0"
    # this gem has a specific source URL
    :source: 'http://mongrel.rubyforge.org/releases'
  - :name: hpricot
    :version: '0.4'
    # this tells us to load not just install
    :load: true
  - :name: postgres
    :version: '0.7.1'
    :load: true
    # any extra config that needs to be passed to gem install
    :config: '--with-pgsql-include-dir=/usr/local/pgsql/include
      --with-pgsql-lib-dir=/usr/local/pgsql/lib'
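As a rough sketch of how this file reads once parsed, the snippet below loads a trimmed inline copy, works out which source each gem would install from, and picks the gems flagged for loading. The source_for helper is a hypothetical name for this illustration, and permitted_classes is only needed on recent Rubies that refuse symbols by default.

```ruby
require 'yaml'

yaml = <<EOS
:source: http://local_mirror.example.com
:gems:
  - :name: mongrel
    :version: "1.0"
    :source: 'http://mongrel.rubyforge.org/releases'
  - :name: hpricot
    :version: '0.4'
    :load: true
EOS

config = YAML.safe_load(yaml, permitted_classes: [Symbol])

# Hypothetical helper: a gem's own source wins over the file-wide one.
def source_for(gem, config)
  gem[:source] || config[:source]
end

config[:gems].each do |gem|
  puts "#{gem[:name]} #{gem[:version]} from #{source_for(gem, config)}"
end

# Only gems marked :load: true are required at startup.
to_load = config[:gems].select { |gem| gem[:load] }
p to_load.map { |gem| gem[:name] }
```

The versions are quoted so that YAML keeps them as strings; an unquoted 1.0 would load as a number.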
Here’s the Util class
require 'yaml'
class Util
  def self.load_gems
    config = YAML.load_file(
      File.join(RAILS_ROOT, 'config', 'gems.yml'))
    gems = config[:gems].reject {|gem| ! gem[:load] }
    gems.each do |gem|
      require_gem gem[:name], gem[:version]
      require gem[:name]
    end
  end
end
Here’s the rake task
require 'yaml'
namespace :gems do
  require 'rubygems'
  task :install do
    # defaults to --no-rdoc, set DOCS=(anything) to build docs
    docs = (ENV['DOCS'].nil? ? '--no-rdoc' : '')
    # grab the list of gems/versions to check
    config = YAML.load_file(File.join('config', 'gems.yml'))
    gems = config[:gems]
    gems.each do |gem|
      # load the gem spec
      gem_spec = YAML.load(`gem spec #{gem[:name]} 2> /dev/null`)
      gem_loaded = false
      begin
        gem_loaded = require_gem gem[:name], gem[:version]
      rescue Exception
        # require_gem raised, so treat the gem as not loaded
      end
      # install if forced
      # or there is no gem_spec
      # or the spec version doesn't match the required version
      # and require_gem did not load the gem
      # (require_gem also returns false if the gem has already been loaded)
      if ! ENV['FORCE'].nil? ||
         ! gem_spec ||
         (gem_spec.version.version != gem[:version] && ! gem_loaded)
        gem_config = gem[:config] ? " -- #{gem[:config]}" : ''
        source = gem[:source] || config[:source] || nil
        source = "--source #{source}" if source
        # keep the command on one line so the shell sees a single command
        ret = system "gem install #{gem[:name]} " +
          "-v #{gem[:version]} -y #{source} #{docs} #{gem_config}"
        # something bad happened, pass on the message
        p $? unless ret
      else
        puts "#{gem[:name]} #{gem[:version]} already installed"
      end
    end
  end
end
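The install decision in the task can be pulled out into a small predicate, which makes it easier to check by hand. This is only a sketch: needs_install? is a hypothetical helper, and installed_version stands in for gem_spec.version.version, with nil meaning no spec was found.

```ruby
# Hypothetical helper mirroring the task's condition.
# forced:            true when FORCE is set
# installed_version: version string from the gem spec, or nil if none was found
# wanted_version:    version string from gems.yml
# gem_loaded:        what require_gem returned (false if it raised)
def needs_install?(forced, installed_version, wanted_version, gem_loaded)
  forced ||
    installed_version.nil? ||
    (installed_version != wanted_version && !gem_loaded)
end

p needs_install?(false, nil,   '0.4', false)  # no spec found
p needs_install?(false, '0.4', '0.4', false)  # right version already there
p needs_install?(false, '0.3', '0.4', false)  # wrong version, nothing loaded
p needs_install?(true,  '0.4', '0.4', true)   # FORCE set
```

Only the second case skips the install; the other three would run gem install.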