Underpants Gnome


Hpricot Scrub

Posted in ruby, rails by Michael on the January 20th, 2007

[UPDATE 2007-02-07] Changed scrub to return self [/UPDATE]

Using Hpricot to Scrub HTML - The remix

So I wanted to bring the HTML Scrubber into my Hpricot tweaks to tidy it up a bit, and this is what I ended up with.

Now you can use the following to remove all tags from an HTML snippet:


require 'open-uri' # needed for open() to fetch a URL

doc = Hpricot(open('http://slashdot.org/').read)
doc.scrub

Strip all links (the a tags), leaving the text inside intact:

(doc/:a).strip

Scrub the snippet based on a config hash:


doc.scrub(hash)
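
For reference, a rough sketch of what such a hash might look like. The three keys are the ones scrub reads in hpricot_scrub.rb; the tag lists here are made-up example values, not a recommendation. Any key left out falls back to an empty list, which is what the merge at the top of scrub does.

```ruby
# The defaults, as in Doc#scrub in hpricot_scrub.rb
defaults = {
  :nuke_tags => [],
  :allow_tags => [],
  :allow_attributes => []
}

# Example values only (assumed for illustration)
hash = {
  :nuke_tags  => %w[script form], # removed along with their contents
  :allow_tags => %w[p b i em]     # kept, minus attributes not in :allow_attributes
}

# What scrub ends up working with: missing keys become empty lists
effective = defaults.merge(hash)
p effective[:allow_attributes]

# With Hpricot and hpricot_scrub.rb loaded you would then call:
#   doc.scrub(hash)
```

Since :allow_attributes is empty in this example, every attribute on the allowed tags would be stripped.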

hpricot_scrub.rb


require 'hpricot'

module Hpricot
  class Elements
    def strip
      each { |x| x.strip }
    end
    
    def strip_attributes(safe=[])
      each { |x| x.strip_attributes(safe) }
    end
  end

  class Elem
    def remove
      parent.children.delete(self)
    end

    def strip
      children.each { |x| x.strip unless x.class == Hpricot::Text }

      if strip_removes?
        remove
      else
        parent.replace_child self, Hpricot.make(inner_html) unless parent.nil?
      end
    end
    
    def strip_attributes(safe=[])
      attributes.each {|atr|
          remove_attribute(atr[0]) unless safe.include?(atr[0])
      } unless attributes.nil?
    end
    
    def strip_removes?
      # I'm sure there are others that should be ripped instead of stripped
      attributes && attributes['type'] =~ /script|css/
    end
  end

  class Doc
    def scrub(config={})
      config = {
        :nuke_tags => [],
        :allow_tags => [],
        :allow_attributes => []
      }.merge(config)
      
      config[:nuke_tags].each { |tag| (self/tag).remove }
      config[:allow_tags].each { |tag|
        (self/tag).strip_attributes(config[:allow_attributes])
      }
      children.reverse.each {|e|
        e.strip unless e.class == Hpricot::Text ||
          config[:allow_tags].include?(e.name)
      }
      self
    end
  end
end

Sample config in YAML


---
    :allow_tags: # let these tags stay, but will strip attributes
        - 'b'
        - 'blockquote'
        - 'br'
        - 'div'
        - 'h1'
        - 'h2'
        - 'h3'
        - 'h4'
        - 'h5'
        - 'h6'
        - 'hr'
        - 'i'
        - 'em'
        - 'img'
        - 'li'
        - 'ol'
        - 'p'
        - 'pre'
        - 'small'
        - 'span'
        - 'strike'
        - 'strong'
        - 'sub'
        - 'sup'
        - 'table'
        - 'tbody'
        - 'td'
        - 'tfoot'
        - 'thead'
        - 'tr'
        - 'u'
        - 'ul'

    :nuke_tags: # completely removes everything between open and close tag
        - 'form'
        - 'script'
        
    :allow_attributes: # let these attributes stay, strip all others
        - 'src'
        - 'font'
        - 'alt'
        - 'style'
        - 'align'

The source comes with sample data and a test. Run the test with:


ruby test

Managing gems with Rake

Posted in ruby, rails by Michael on the January 16th, 2007

[UPDATE 2007-02-07] I realized I left some extra junk in the version of Util in the zip; it’s been updated [/UPDATE]

I have a rake task and a Util class that I use to make setting up required gems painless and to be sure that I’m always running the versions I think I am.

Install or update required gems

rake gems:install

Make sure they are loaded with the right versions during startup, by adding the following to environment.rb

Util.load_gems

This uses a config file that looks like


:source: http://local_mirror.example.com # this is optional
:gems:
  - :name: mongrel
    :version: "1.0"
    # this gem has a specific source URL
    :source: 'http://mongrel.rubyforge.org/releases'

  - :name: hpricot
    :version: '0.4'
    # this tells us to load not just install
    :load: true 

  - :name: postgres
    :version: '0.7.1'
    :load: true
    # any extra config that needs to be passed to gem install
    :config: '--with-pgsql-include-dir=/usr/local/pgsql/include
              --with-pgsql-lib-dir=/usr/local/pgsql/lib' 

Here’s the Util class


require 'yaml'

class Util
  def self.load_gems
    config = YAML.load_file(
      File.join(RAILS_ROOT, 'config', 'gems.yml'))
    gems = config[:gems].reject {|gem| ! gem[:load] }
    gems.each do |gem|
      require_gem gem[:name], gem[:version]
      require gem[:name]
    end
  end
end
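
To make the :load flag concrete, here is a small sketch of the selection step on its own, using an in-memory copy of the sample config instead of reading gems.yml. The gems_to_load helper name is hypothetical. The actual activation is left as a comment because it needs the gems installed; as far as I know, later RubyGems versions dropped require_gem in favor of gem name, version, so treat that line as an assumption about newer setups.

```ruby
# In-memory stand-in for config/gems.yml (same entries as the sample)
config = {
  :gems => [
    { :name => 'mongrel',  :version => '1.0' },
    { :name => 'hpricot',  :version => '0.4',   :load => true },
    { :name => 'postgres', :version => '0.7.1', :load => true }
  ]
}

# Hypothetical helper: only entries flagged with :load get required
def gems_to_load(config)
  config[:gems].select { |gem| gem[:load] }
end

gems_to_load(config).each do |gem|
  puts "would load #{gem[:name]} #{gem[:version]}"
  # On a newer RubyGems (assumption), in place of require_gem:
  #   gem gem[:name], gem[:version]
  #   require gem[:name]
end
```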

Here’s the rake task


require 'yaml'

namespace :gems do
  require 'rubygems'

  task :install do
    # defaults to --no-rdoc, set DOCS=(anything) to build docs
    docs = (ENV['DOCS'].nil? ? '--no-rdoc' : '')
    #grab the list of gems/version to check
    config = YAML.load_file(File.join('config', 'gems.yml'))
    gems = config[:gems]

    gems.each do |gem|
      # load the gem spec
      gem_spec = YAML.load(`gem spec #{gem[:name]} 2> /dev/null`)
      gem_loaded = false
      begin
        gem_loaded = require_gem gem[:name], gem[:version]
      rescue Exception
      end

      # install if forced,
      # or there is no gem_spec,
      # or the spec version doesn't match the required version
      # and require_gem didn't load it
      # (require_gem also returns false if the gem has already been loaded)
      if ! ENV['FORCE'].nil? ||
         ! gem_spec ||
         (gem_spec.version.version != gem[:version] && ! gem_loaded)
        gem_config = gem[:config] ? " -- #{gem[:config]}" : ''
        source = gem[:source] || config[:source] || nil
        source = "--source #{source}" if source
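        # keep the command on one line for the shell; a newline inside
        # the string would split it into two separate commands
        ret = system "gem install #{gem[:name]} " \
            "-v #{gem[:version]} -y #{source} #{docs} #{gem_config}"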
        # something bad happened, pass on the message
        p $? unless ret
      else
        puts "#{gem[:name]} #{gem[:version]} already installed"
      end
    end
  end
end
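
The part of the task that is easiest to get wrong is how the install command is put together, so here is a sketch of just that step as a standalone method. The install_command name is hypothetical and not part of the task above; the flags are the same ones the task passes, and nothing is actually installed.

```ruby
# Hypothetical helper: builds the shell command the :install task
# would run for one gem entry. It only returns a string.
def install_command(gem, config, docs = '--no-rdoc')
  # a gem-specific source wins over the global one; both are optional
  source = gem[:source] || config[:source]

  parts = ['gem install', gem[:name], '-v', gem[:version], '-y']
  parts << "--source #{source}" if source
  parts << docs unless docs.empty?
  # extra build options go after a bare "--"
  parts << "-- #{gem[:config]}" if gem[:config]
  parts.join(' ')
end

config  = { :source => 'http://local_mirror.example.com' }
mongrel = { :name => 'mongrel', :version => '1.0',
            :source => 'http://mongrel.rubyforge.org/releases' }
hpricot = { :name => 'hpricot', :version => '0.4', :load => true }

puts install_command(mongrel, config)
puts install_command(hpricot, config)
```

Building the command from a list and joining it keeps everything on one line, so the shell sees a single command.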

zipped source

FCKeditor on Rails goes plugin

Posted in ruby, rails by Michael on the January 9th, 2007

Just a quick announcement: FCKeditor on Rails will run in Rails 1.2 as a plugin (with a little help). More info on the blog or in trac.

Best thing since sliced bread

Posted in ruby, rails by Michael on the September 22nd, 2006

Jamis Buck has shed a little light on figuring out WTF that Ruby process eating all your processor is actually doing.

Alright, maybe not quite the same as sliced bread, but very nice nonetheless.

I can’t tell you how many times I could have used this; now I just need to wait for the need to pop up again.

[UPDATE] Apparently it gets better than this, much better

Rails in LA, WTF??

Posted in General, rails by Michael on the April 29th, 2006

So I’m not one to normally bitch about stuff like this in public, but this one kind of forced me to.

So I landed on digg.com while reading my feeds tonight and I see an ad up top for Rails jobs

rails-wtf-1.png

so I figure I’ll see what’s up (no Burt, I’m not looking). I click through and decide to narrow down the search to Los Angeles and I get this:

rails-wtf-2.png

Ok, it’s in Los Angeles, but as best I can gather, the only thing it has to do with Rails is that there may be train tracks close by??

mmm Feeds

Posted in Tech, ruby, rails by Michael on the March 8th, 2006

Ok, so the project I’ve been working on is getting close…

Feed Harvest

If you are interested in the (very) private beta, let us know.