Underpants Gnome


Hpricot Scrub

Posted in ruby, rails by Michael on the January 20th, 2007

[UPDATE 2007-02-07] Changed scrub to return self [/UPDATE]

Using Hpricot to Scrub HTML - The remix

So I wanted to bring the HTML Scrubber into my Hpricot tweaks to tidy it up a bit, and this is what I ended up with.

Now you can use the following to remove all tags from an HTML snippet


require 'open-uri' # lets open() fetch a URL
doc = Hpricot(open('http://slashdot.org/').read)
doc.scrub

Strip all links, leaving the text inside intact

(doc/:a).strip

Scrub the snippet based on a config hash


doc.scrub(hash)

hpricot_scrub.rb


require 'hpricot'

module Hpricot
  class Elements
    def strip
      each { |x| x.strip }
    end
    
    def strip_attributes(safe=[])
      each { |x| x.strip_attributes(safe) }
    end
  end

  class Elem
    def remove
      parent.children.delete(self)
    end

    def strip
      # walk a reversed copy so replacing a child can't skip its siblings
      children.reverse.each { |x| x.strip unless x.class == Hpricot::Text }

      if strip_removes?
        remove
      else
        parent.replace_child self, Hpricot.make(inner_html) unless parent.nil?
      end
    end
    
    def strip_attributes(safe=[])
      attributes.each {|atr|
          remove_attribute(atr[0]) unless safe.include?(atr[0])
      } unless attributes.nil?
    end
    
    def strip_removes?
      # I'm sure there are others that should be ripped instead of stripped
      attributes && attributes['type'] =~ /script|css/
    end
  end

  class Doc
    def scrub(config={})
      config = {
        :nuke_tags => [],
        :allow_tags => [],
        :allow_attributes => []
      }.merge(config)
      
      config[:nuke_tags].each { |tag| (self/tag).remove }
      config[:allow_tags].each { |tag|
        (self/tag).strip_attributes(config[:allow_attributes])
      }
      children.reverse.each {|e|
        e.strip unless e.class == Hpricot::Text ||
          config[:allow_tags].include?(e.name)
      }
      self
    end
  end
end
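
The defaults merged in at the top of scrub are why a bare doc.scrub strips every tag: with no config, nothing is allowed and nothing is nuked. Here is a minimal sketch of that merge on plain hashes, no Hpricot needed. SCRUB_DEFAULTS and effective_config are names made up for this illustration; they are not part of hpricot_scrub.rb.

```ruby
# Illustrative only: mirrors the defaults hash inside Doc#scrub.
SCRUB_DEFAULTS = {
  :nuke_tags => [],
  :allow_tags => [],
  :allow_attributes => []
}

# Hypothetical helper: caller-supplied keys win, missing keys fall back
# to the empty lists above.
def effective_config(config = {})
  SCRUB_DEFAULTS.merge(config)
end

# No config: every list is empty, so every tag ends up stripped.
p effective_config()

# Partial config: only :allow_tags changes, the rest stay empty.
p effective_config(:allow_tags => ['b', 'i'])
```

Since merge returns a new hash, the defaults themselves are never modified by a caller's config.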

Sample config in YAML


---
    :allow_tags: # let these tags stay, but will strip attributes
        - 'b'
        - 'blockquote'
        - 'br'
        - 'div'
        - 'h1'
        - 'h2'
        - 'h3'
        - 'h4'
        - 'h5'
        - 'h6'
        - 'hr'
        - 'i'
        - 'em'
        - 'img'
        - 'li'
        - 'ol'
        - 'p'
        - 'pre'
        - 'small'
        - 'span'
        - 'strike'
        - 'strong'
        - 'sub'
        - 'sup'
        - 'table'
        - 'tbody'
        - 'td'
        - 'tfoot'
        - 'thead'
        - 'tr'
        - 'u'
        - 'ul'

    :nuke_tags: # completely removes everything between open and close tag
        - 'form'
        - 'script'
        
    :allow_attributes: # let these attributes stay, strip all others
        - 'src'
        - 'font'
        - 'alt'
        - 'style'
        - 'align'

The source includes sample data and a test. Run the test with


ruby test

Managing gems with Rake

Posted in ruby, rails by Michael on the January 16th, 2007

[UPDATE 2007-02-07] I realized I left some extra junk in the version of Util in the zip. It’s been updated. [/UPDATE]

I have a rake task and a Util class that I use to make setting up required gems painless and to be sure that I’m always running the versions I think I am.

Install or update required gems

rake gems:install

Make sure they are loaded with the right versions during startup, by adding the following to environment.rb

Util.load_gems

This uses a config file that looks like


:source: http://local_mirror.example.com # this is optional
:gems:
  - :name: mongrel
    :version: "1.0"
    # this gem has a specific source URL
    :source: 'http://mongrel.rubyforge.org/releases'

  - :name: hpricot
    :version: '0.4'
    # this tells us to load not just install
    :load: true 

  - :name: postgres
    :version: '0.7.1'
    :load: true
    # any extra config that needs to be passed to gem install
    :config: '--with-pgsql-include-dir=/usr/local/pgsql/include
              --with-pgsql-lib-dir=/usr/local/pgsql/lib' 

Here’s the Util class


require 'yaml'

class Util
  def self.load_gems
    config = YAML.load_file(
      File.join(RAILS_ROOT, 'config', 'gems.yml'))
    gems = config[:gems].select { |gem| gem[:load] }
    gems.each do |gem|
      require_gem gem[:name], gem[:version]
      require gem[:name]
    end
  end
end
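
load_gems only touches entries flagged with `:load: true`. Everything else is installed by the rake task but left for the app to require itself. Below is a small sketch of that filter, run against an inline copy of part of the config above. It assumes Ruby 2.6+ for safe_load's permitted_classes; gems_to_load is a hypothetical helper that only reports names and never calls require_gem.

```ruby
require 'yaml'

# Partial inline copy of gems.yml, for illustration.
GEMS_YAML = <<-YAML
:gems:
  - :name: mongrel
    :version: "1.0"
  - :name: hpricot
    :version: '0.4'
    :load: true
  - :name: postgres
    :version: '0.7.1'
    :load: true
YAML

# Hypothetical helper: names of the gems Util.load_gems would require.
def gems_to_load(config)
  config[:gems].select { |gem| gem[:load] }.map { |gem| gem[:name] }
end

# Symbol keys have to be permitted explicitly with safe_load.
GEMS_CONFIG = YAML.safe_load(GEMS_YAML, permitted_classes: [Symbol])
p gems_to_load(GEMS_CONFIG)
```

Here mongrel is skipped because it has no :load key, which is how a gem can be kept installed at a pinned version without being required at startup.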

Here’s the rake task


require 'yaml'

namespace :gems do
  require 'rubygems'

  task :install do
    # defaults to --no-rdoc, set DOCS=(anything) to build docs
    docs = (ENV['DOCS'].nil? ? '--no-rdoc' : '')
    # grab the list of gems/versions to check
    config = YAML.load_file(File.join('config', 'gems.yml'))
    gems = config[:gems]

    gems.each do |gem|
      # load the gem spec
      gem_spec = YAML.load(`gem spec #{gem[:name]} 2> /dev/null`)
      gem_loaded = false
      begin
        gem_loaded = require_gem gem[:name], gem[:version]
      rescue Exception
      end

      # install if forced,
      # or there is no gem_spec,
      # or the spec version doesn't match the required version
      # and require_gem did not return true
      # (false also comes back if the gem has already been loaded)
      if ! ENV['FORCE'].nil? ||
         ! gem_spec ||
         (gem_spec.version.version != gem[:version] && ! gem_loaded)
        gem_config = gem[:config] ? " -- #{gem[:config]}" : ''
        source = gem[:source] || config[:source] || nil
        source = "--source #{source}" if source
        # keep the command on one shell line; a raw newline would split it
        ret = system "gem install #{gem[:name]} " \
            "-v #{gem[:version]} -y #{source} #{docs} #{gem_config}"
        # something bad happened, pass on the message
        p $? unless ret
      else
        puts "#{gem[:name]} #{gem[:version]} already installed"
      end
    end
  end
end
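
Pulling the command string out into its own function makes it easier to see how the per-gem and global :source settings interact. This is a sketch under that assumption: install_command is a hypothetical helper, not part of the rake task, and it only builds the string without running gem install. The -y flag is carried over from the task as written.

```ruby
# Hypothetical helper mirroring the string built in the :install task.
# gem            - one entry from config[:gems]
# default_source - config[:source], used when the gem has no :source
# docs           - '--no-rdoc' or '' as computed from ENV['DOCS']
def install_command(gem, default_source = nil, docs = '--no-rdoc')
  # a per-gem source takes precedence over the global one
  source = gem[:source] || default_source
  parts = ['gem install', gem[:name], '-v', gem[:version], '-y']
  parts << "--source #{source}" if source
  parts << docs unless docs.empty?
  # everything after "--" is handed to the gem's native build step
  parts << "-- #{gem[:config]}" if gem[:config]
  parts.join(' ')
end

# Falls back to the global mirror.
puts install_command(
  { :name => 'hpricot', :version => '0.4' },
  'http://local_mirror.example.com')

# The per-gem source overrides the mirror.
puts install_command(
  { :name => 'mongrel', :version => '1.0',
    :source => 'http://mongrel.rubyforge.org/releases' },
  'http://local_mirror.example.com')
```

Building the command from a list of parts also avoids the stray double spaces that show up when an optional piece is empty.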

zipped source

FCKeditor on Rails goes plugin

Posted in ruby, rails by Michael on the January 9th, 2007

Just a quick announcement: FCKeditor on Rails will run in Rails 1.2 as a plugin (with a little help). More info is on the blog or in trac.