
Using Hpricot to Scrub HTML

[UPDATE 2007-01-10] I’ve updated the scrubber, see Hpricot Scrub for more. [/UPDATE]

I went looking for a Ruby replacement for Perl’s HTML::Scrubber for a gig and came up blank. Can it really be possible that nobody is doing anything more than blindly stripping tags?

I had seen Hpricot and thought I needed to find a reason to use it. Well, here it is. I monkey patched a couple of methods into Hpricot and off I went.

Here are the Hpricot bits.


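require 'hpricot'

module Hpricot
  class Elements
    # Replace every element in the set with its own contents.
    def strip
      each { |x| x.strip }
    end

    # Strip attributes from every element in the set.
    def strip_attributes(safe=[], patterns={})
      each { |x| x.strip_attributes(safe, patterns) }
    end
  end

  class Elem
    # Drop this tag but keep whatever was inside it.
    def strip
      parent.replace_child self, Hpricot.make(inner_html) unless
        parent.nil?
    end

    # Keep an attribute only if its name is in safe and its value
    # matches the pattern registered for it ('' matches anything).
    def strip_attributes(safe=[], patterns={})
      attributes.each { |atr|
          pat = patterns[atr[0].to_sym] || ''
          remove_attribute(atr[0]) unless safe.include?(atr[0]) &&
            atr[1].match(pat)
      } unless attributes.nil?
    end
  end
end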

Just that bit gets me to the point where I can do things like this:


require 'open-uri'
require 'hpricot'

doc = Hpricot(open('http://slashdot.org/').read)

# remove all anchors leaving behind the text inside.
(doc/:a).strip 

# strip all attributes except for src from all images
(doc/:img).strip_attributes(['src']) 
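
The keep-or-remove rule inside strip_attributes is easy to check without Hpricot. Here is a minimal sketch in plain Ruby, assuming the attributes arrive as a name/value hash. keep_attribute? and filter_attributes are hypothetical helpers, not part of Hpricot, and the sample attributes and the href pattern are made up for illustration:

```ruby
# Hypothetical helpers (not Hpricot API) mirroring the rule in
# Elem#strip_attributes: an attribute survives only when its name is in
# the safe list AND its value matches the pattern registered for it.
def keep_attribute?(name, value, safe = [], patterns = {})
  pattern = patterns[name.to_sym] || ''   # '' matches any value
  safe.include?(name) && !value.match(pattern).nil?
end

def filter_attributes(attrs, safe = [], patterns = {})
  attrs.select { |name, value| keep_attribute?(name, value, safe, patterns) }
end

# Made-up attribute set for illustration.
attrs = {
  'src'     => 'http://example.com/a.png',
  'onclick' => 'steal()',
  'href'    => 'javascript:alert(1)'
}
safe     = %w[src href]
patterns = { :href => '^https?://' }

kept = filter_attributes(attrs, safe, patterns)
puts kept.inspect
```

Only src survives here: onclick is not in the safe list, and href is in the list but its value fails the pattern.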

Then I made a scrubber that passes the array and hash to those methods to handle the dirty work. It looks like this, though I’m also using Tidy, so mine is a little different.


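require 'yaml'
require 'hpricot'

class HtmlScrubber
  # Load the tag and attribute rules once.
  @@config = YAML.load_file(
    "#{RAILS_ROOT}/config/html_scrubber.yml") unless
      defined?(@@config)

  def self.scrub(markup)
    doc = Hpricot(markup || '', :xhtml_strict => true)
    raise 'No markup specified' if doc.nil?
    # Remove unwanted tags along with everything inside them.
    @@config[:nuke_tags].each { |tag| (doc/tag).remove }
    # Allowed tags stay, minus any attributes that fail the rules.
    @@config[:allow_tags].each { |tag|
      (doc/tag).strip_attributes(@@config[:allow_attributes],
        @@config[:attribute_patterns]) }
    # Every other tag is unwrapped, leaving its contents in place.
    doc.traverse_all_element {|e|
      e.strip unless @@config[:allow_tags].include?(e.name)
    }
    doc.inner_html
  end
end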

Here is a zip of the code and a sample config: html_scrubber.zip
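
The zip is not reproduced here, but the scrubber reads four config keys: nuke_tags, allow_tags, allow_attributes, and attribute_patterns. A config of roughly this shape would satisfy it. The tag lists and the href pattern are illustrative assumptions, not the contents of the original file:

```ruby
require 'yaml'

# Illustrative config only; every value below is an assumption.
SAMPLE_CONFIG = <<~YAML
  nuke_tags:
    - script
    - style
  allow_tags:
    - p
    - a
    - img
  allow_attributes:
    - href
    - src
  attribute_patterns:
    href: "^https?://"
YAML

# symbolize_names turns the keys into symbols, so the result can be
# indexed as config[:nuke_tags], the way the scrubber does.
config = YAML.safe_load(SAMPLE_CONFIG, symbolize_names: true)

puts config[:nuke_tags].inspect
puts config[:attribute_patterns][:href]
```

With a plain YAML.load_file call like the one in the scrubber, the keys would instead need to be written as symbols in the file itself (for example :nuke_tags:) to be readable as @@config[:nuke_tags].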


Profiling Rails end-to-end

I wanted to profile a Rails app, so I did a little digging and found ruby-prof, with new and improved call graphs. Plus it’s very fast. The install couldn’t be easier:
sudo gem install ruby-prof
Then I wanted to see if I could get this to run in before and after filters. I haven’t had any luck, though I haven’t tried all that hard. Since I wanted to be able to do this relatively easily, I threw together a mini module to handle the report generation piece for me. So now I can profile a controller action by adding this to my application controller:

require 'ruby_profiler'

class ApplicationController < ActionController::Base
  include RubyProfiler
end
Then in the controller I just need to

def some_action
  result = RubyProf.profile {
    ...
  }
  write_profile(result, 5, RubyProfiler::GRAPH_HTML)
end

source: ruby_profiler.rb
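
ruby_profiler.rb itself is not shown in the post, so what follows is only a rough guess at the shape of such a module. The module name, the dir keyword, the file naming, and the meaning of the second argument (taken here to be a minimum-percent threshold) are all assumptions. The printer is passed in as a class so the sketch runs without ruby-prof installed; with the gem, a printer such as RubyProf::GraphHtmlPrinter would take the place of the stand-in:

```ruby
require 'tmpdir'

# Hypothetical stand-in for the module in the post; not the author's code.
module RubyProfilerSketch
  # Writes a report for result and returns the path of the file written.
  # printer_class is assumed to behave like a ruby-prof printer:
  #   printer_class.new(result).print(io, options)
  def write_profile(result, min_percent, printer_class, dir: Dir.tmpdir)
    path = File.join(dir, "profile-#{Process.pid}-#{Time.now.to_i}.html")
    File.open(path, 'w') do |io|
      printer_class.new(result).print(io, min_percent: min_percent)
    end
    path
  end
end

# Stand-in printer so the example runs without the ruby-prof gem.
class EchoPrinter
  def initialize(result)
    @result = result
  end

  def print(io, options = {})
    io.puts "<p>#{@result} (min_percent=#{options[:min_percent]})</p>"
  end
end

# Stand-in for a controller that includes the module.
class FakeController
  include RubyProfilerSketch
end

report = FakeController.new.write_profile('fake result', 5, EchoPrinter)
puts File.read(report)
```

Keeping the file handling in one method means the controller action only has to wrap its body in the profiling block and hand the result over.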


mmm Feeds

OK, so the project I’ve been working on is getting close…

Feed Harvest: if you are interested in the (very) private beta, let us know.


FCKeditor on Rails

I’ve been reading all this great stuff about Ruby on Rails, so I told my boss that we should look into it. Then I expensed a copy of Agile Web Development with Rails and gave it a read. It looked promising.

I read a post on the Rails blog the other day about integrating FCKeditor with Rails and thought that would be a nice addition. Unfortunately, the method mentioned was little more than how to drop tags in a page to get FCKeditor to go. There wasn’t any real Rails to it.

I decided that would make a somewhat interesting project to start playing with Rails, as it needs to interact with the file system a little. So I spent the past day-and-a-halfish building FCKeditor on Rails. It’s a little rough around the edges, and I still want to integrate the mcpuk File Browser because it has so much more functionality than the default.

The end result is a Rails helper/controller that lets you add an FCKeditor instance just like you would expect in Rails:

fckeditor(:object, :param, {:width => '600px', :height => '500px'})

Not too shabby. Now we will see how long it takes me to get around to adding mcpuk support.

The source can also be found in the FCKeditor trac project.


swish-e, weee

So I spent some time this weekend putting together a PHP class to wrap the swish-e search engine. It handles most of the settings you would want to set and also takes care of showing highlighted contextual results, provided you used StoreDescription when you built the index.

When I get it a little more ironed out I’ll cut it loose, provided anybody gives a rip.


Latest comments

A very happy Sean on rvm friendly TextMate bundles on Aug 10, 2010 at 02:43 AM

Thanks so much for these. I’d almost given up on getting TextMate working with RVM.

A very happy Sean

Ed Ruder on rvm + gemsets + TextMate == yay! * 2 on Jul 30, 2010 at 04:06 PM

What got everything working for me was to: a) follow the directions on http://rvm.beginrescueend.com/integration/textmate/, b) rename a second Builder.rb file in ~/Library/Application\ Support/TextMate/Pristine\ Copy/Support/lib, and c) set up TM_RUBY per slides 36 & 37 of http://www.slideshare.net/freelancing_god/zsh-and-rvm (create a shell script that resolves to rvm’s ruby on the fly and point TextMate to it)!

Now, everything is working like a charm! (No mucking with TextMate’s Bundle Editor, either, which is nice.)

Ed Ruder

Trevor on rvm friendly TextMate bundles on Jun 22, 2010 at 10:20 AM

Really excited to see that you’ve put this together, but I’m having trouble getting it to see anything other than my system Ruby. I’ve set up the .rvmrc in my project w/ the proper ruby and can confirm that it’s seeing the right one. I assume that this is intended to use project-level .rvmrc?

Trevor

Gerhard on rvm + gemsets + TextMate == yay! * 2 on Jun 02, 2010 at 04:27 AM

This kinda’ works, but I’m getting a weird error related to paths even though the path that it’s complaining about is correct and the file exists, is readable etc.

Also, formatting is all screwy when it uses rvm & gemsets.

Gerhard

jake on rvm + gemsets + TextMate == yay! * 2 on May 18, 2010 at 10:10 AM

You will have to edit each command you want to have these changes in the Bundle Editor.

jake