Interesting Python libraries for natural language processing

Posted by Eric Kidd Mon, 28 Dec 2009 15:56:00 GMT

I’ve been looking at various libraries for natural language processing, and I’m pleasantly surprised by the tools created by the Python community. Some examples:

  • The Python NLTK library provides parsers for many popular corpora, visualization tools, and a wide variety of simple natural language algorithms (though few of these are probabilistic).
  • ConceptNet provides a simple semantic model of the world.
  • NumPy (and SciPy) provide extensive support for linear algebra and data visualization.
  • PyCUDA provides access to Nvidia GPUs for high-performance scientific computation, and it integrates with NumPy.

If you need to build a web crawler, there’s Twisted, which makes it easy to write fast, asynchronous networking code.

All in all, I usually prefer Ruby to Python, because I love Ruby’s metaprogramming support. But the Python community has built an impressive variety of scientific and linguistic tools. Many thanks to everybody who contributed to these projects!


Wave Hackathon

Posted by Eric Kidd Sat, 21 Nov 2009 23:35:00 GMT

I’m currently attending the Wave hackathon at the Massachusetts GTUG. Here’s some code from a protocol-level Wave agent that I just demoed:

# Capitalize random words.
replace /\b(random|words)\b/i do |word|
  # ...
end

# Shorten URLs.
replace /\bhttp:\/\/([^ ]+)/ do |url|
  # ...
end

In keeping with the traditions of hackathons, this agent is horribly fragile. It only works with FedOne’s console-based wave client, and it doesn’t handle annotations correctly.

Some earlier—and more robust—wave-related projects:

  • Pick Several: A gadget which implements approval voting. Written using GWT. Includes a reusable library for writing GWT-based wave gadgets.
  • BugLinky: A robot which links bug numbers to a bug tracker. Includes a reusable library for simple pattern-matching, text replacement and annotation.

Many thanks to GTUG and to Google for organizing this hackathon!



Posted by Eric Kidd Sat, 12 Sep 2009 00:37:00 GMT

I’m currently upgrading, and you may encounter some RSS-related weirdness. Also, some programs in locally-hosted version control repositories may be unavailable temporarily.

My apologies for any inconveniences. I hope to post some more Ruby and Haskell code soon!

Real-time text annotation with Google Wave

Posted by Eric Kidd Tue, 01 Sep 2009 13:06:00 GMT

If you haven’t seen Google Wave yet, you may want to watch this 10-minute demo or take a look at the official Wave site. Otherwise, this code won’t make a lot of sense. :-)

Figure: Real-time text annotation with Google Wave.

I received my developer sandbox account last week, and spent some time experimenting with Wave. So far, it seems like a really sweet tool—it’s fast (if you stay away from waves with 200+ messages), convenient, and fun to use. Wave isn’t yet ready for prime time, but it could be fairly solid in six months and widely deployed by this time next year.

After using Wave for less than a week, I really wish my friends and coworkers had accounts. This is a good sign.

Google has already released about 40,000 lines of Wave code as open source, and there’s apparently quite a bit more in the pipeline.

Extending wave with gadgets and robots

Wave supports two major kinds of extensions:

  1. Gadgets are interactive content that users can embed in a wave. In the demo, Google shows off a map gadget, a chess game, a handy “Yes/No/Maybe” poll, and many others.
  2. Robots are essentially a cross between IRC bots, text editor extensions, and web form processors. They can do anything another human could do, in real-time, as you type. They can also create and respond to HTML-style form elements.

So far, I’ve tried writing a robot in Java. (Why Java? Wave is built using Google Web Toolkit, which compiles Java to JavaScript. So right now, the Java libraries are slightly more mature than the Python libraries.)

buglinky: Link to bugs as you type

buglinky reads your text as you type, and automatically links strings of the form “bug #123” to a bug tracker of your choice. It can also detect raw bug tracker URLs and replace them with the corresponding text.
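To make the two transformations concrete, here is a rough sketch in plain Ruby that works on an ordinary string. It does not use the Wave robot API at all: the method names `find_bug_links` and `replace_bug_urls` and the example tracker URL are made up for illustration, and a real robot would attach annotations to the wave instead of returning hashes.

```ruby
# Find text like "bug #123" and report where a link annotation would go.
# Returns one hash per match: the character range and the target URL.
def find_bug_links(text, bug_url)
  links = []
  text.scan(/(?:[Bb]ug|[Ii]ssue|[Tt]icket|[Cc]ase) #?(\d+)/) do
    m = Regexp.last_match
    links << { :range => m.begin(0)...m.end(0), :url => bug_url + m[1] }
  end
  links
end

# Replace raw bug tracker URLs with short text such as "issue #42".
def replace_bug_urls(text, bug_url)
  text.gsub(/#{Regexp.escape(bug_url)}(\d+)/) { "issue ##{$1}" }
end

# Hypothetical tracker URL, used only for this example.
tracker = 'http://bugs.example.com/show?id='
p find_bug_links('Please look at bug #123 today.', tracker)
puts replace_bug_urls("See #{tracker}42 for details.", tracker)
```

Returning ranges instead of editing the string mirrors the idea that linking is an annotation on existing text, while URL replacement actually changes the text.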

Internally, buglinky is built around a custom BlipProcessor class that does all the heavy lifting. All I need to do is hook up my individual processors and run them:

ArrayList<BlipProcessor> processors =
  new ArrayList<BlipProcessor>();
processors.add(new BugUrlReplacer(BUG_URL));
processors.add(new BugNumberLinker(BUG_URL));
  processors, bundle, BOT_ADDRESS);

Here’s the code which links “bug #123” to the bug tracker:

class BugNumberLinker extends BlipProcessor {
  // ...constructor sets up bugUrl...

  protected String getPattern() {
    return "(?:[Bb]ug|[Ii]ssue|[Tt]icket|[Cc]ase) #?(\\d+)";
  }

  protected void processMatch(
      TextView doc, Range range, Matcher match) {
    annotate(doc, range, "link/manual",

Replacing URLs with plain text is similarly easy:

class BugUrlReplacer extends BlipProcessor {
  // ...constructor sets up bugUrl...

  protected String getPattern() {
    return Pattern.quote(bugUrl) + "(\\d+)";
  }

  protected void processMatch(
      TextView doc, Range range, Matcher match) {
    replace(doc, range, "issue #" +;

If you have a sandbox account, you can experiment with buglinky. You can also download the source code for buglinky from github.

Future directions

Obviously, this will look much better once somebody has the time to port it to JRuby. How about something like this?

match /#{Regexp.escape(bug_url)}(\d+)/ do
  replace "issue ##{$1}"
end

match /(?:[Bb]ug|[Ii]ssue|[Tt]icket|[Cc]ase) #?(\d+)/ do
  link(bug_url + $1)
end

Additionally, the current robot API relies on a JSON-RPC-based proxy between the Wave server and a robot. This is fine for processing entire waves in a single pass, but it uses too much bandwidth for real-time text processing. So I would love to be able to run this code inside a real wave server. But that will have to wait until the federation protocol is turned on.


Write a 32-line chat client using Ruby, AMQP & EventMachine (and a GUI using Shoes)

Posted by Eric Kidd Fri, 08 May 2009 18:06:00 GMT

Have you ever considered using instant messages to communicate between programs? You can do this using Jabber’s XMPP protocol, of course. But it’s also worth taking a look at AMQP, a distributed messaging protocol first used at JPMorgan Chase. AMQP is fast, easy to use, and implemented by at least 4 open source servers.

To try it out, install the excellent Ruby AMQP bindings, and set up the RabbitMQ server (which is written in Erlang using Mnesia). On a Mac, you might do something like this:

sudo gem install amqp
sudo port install python25 rabbitmq-server
sudo rabbitmq-server

Once your server is running, save the following code as chat.rb:

require 'rubygems'
gem 'amqp'
require 'mq'

unless ARGV.length == 2
  STDERR.puts "Usage: #{$0} <channel> <nick>"
  exit 1
end
$channel, $nick = ARGV

AMQP.start(:host => 'localhost') do
  $chat = MQ.topic('chat')

  # Print any messages on our channel.
  queue = MQ.queue($nick)
  queue.bind('chat', :key => $channel)
  queue.subscribe do |msg|
    if msg.index("#{$nick}:") != 0
      puts msg
    end
  end

  # Forward console input to our channel.
  module KeyboardInput
    include EM::Protocols::LineText2
    def receive_line data
      $chat.publish("#{$nick}: #{data}",
                    :routing_key => $channel)
    end
  end
  EM.open_keyboard(KeyboardInput)
end

Now, run copies in two different terminals:

ruby chat.rb channel_1 sarah
ruby chat.rb channel_1 joe

Everything you type into one terminal will be relayed to the other.

How it works

The following line creates a topic exchange named “chat”:

$chat = MQ.topic('chat')

A topic exchange allows many-to-many communication. Here, we bind a listener to our exchange, and ask to receive all messages tagged with our channel name:

queue.bind('chat', :key => $channel)

Note that :key may be hierarchical, and it may contain wildcards. To write data to our topic exchange, we use publish:

$chat.publish("#{$nick}: #{data}",
              :routing_key => $channel)
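The wildcard matching mentioned above happens inside the broker, but a small model of it may help. In AMQP topic exchanges, keys are dot-separated words, `*` stands for exactly one word, and `#` stands for zero or more words. The sketch below imitates that rule in plain Ruby; `topic_match?` and `match_words` are hypothetical helper names, not part of the amqp gem.

```ruby
# Does a binding pattern such as "chat.*" match a routing key such as
# "chat.ruby"? Both are split into dot-separated words first.
def topic_match?(pattern, key)
  match_words(pattern.split('.'), key.split('.'))
end

def match_words(pattern_words, key_words)
  return key_words.empty? if pattern_words.empty?
  head, *rest = pattern_words
  case head
  when '#'
    # '#' may absorb zero or more words, so try every split point.
    (0..key_words.length).any? { |n| match_words(rest, key_words.drop(n)) }
  when '*'
    # '*' must consume exactly one word.
    !key_words.empty? && match_words(rest, key_words.drop(1))
  else
    key_words.first == head && match_words(rest, key_words.drop(1))
  end
end

p topic_match?('chat.*', 'chat.ruby')
p topic_match?('chat.*', 'chat.ruby.jobs')
p topic_match?('chat.#', 'chat.ruby.jobs')
```

With keys like these, one client could bind to `chat.ruby` alone while another binds to `chat.#` and sees every channel.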

Our keyboard input is processed using EventMachine, a Ruby library for writing high-performance, multi-protocol servers. It’s very similar to Python’s Twisted library, though it has less documentation and support for fewer protocols.

We use EventMachine’s EM.open_keyboard to create an asynchronous keyboard input channel, and we use EM::Protocols::LineText2 to treat the keyboard input as a line-oriented protocol.

Adding a Shoes GUI

Shoes is an eccentric, entertaining, and highly-portable GUI library by _why the lucky stiff. With a certain amount of grotesque kludging (and some pointers from “s1kx” on the #shoes IRC channel), I managed to get the Mac version of Shoes to talk to EventMachine. You may find that this code fails strangely on your computer. Honestly, I don’t know anything about Shoes. And I’m doing some pretty bad things with threads.

First, the pretty pictures:

Next, the code:

Shoes.setup { gem 'amqp' }
require 'mq'

$app = => 256) do
  background(gradient('#CFF', '#FFF'))
  @output = stack(:margin => 10)

  def nick str
    span(str, :stroke => red)

  def display text
    @output.append do
      if text =~ /^([^:]+): (.*)$/
        para nick("#{$1}: "), $2
        para text
end do
    AMQP.start(:host => 'localhost') do
      queue = MQ.queue('shoes')
      queue.subscribe do |msg|
  rescue => e
    # Try to report at least _some_ errors
    # where we'll be able to see them.

Note that the GUI client listens to all channels simultaneously, because it doesn’t pass a :key to bind. And when writing code to run in a Shoes background thread, don’t expect to see any error messages.

Learning more about AMQP

The Ruby AMQP documentation page has a good list of papers, magazine articles, and other background material on AMQP.


Financial crisis background and Munger on the banks

Posted by Eric Kidd Tue, 05 May 2009 09:36:00 GMT

Charlie Munger is the long-time partner of Warren Buffett. Of the two, he’s the more politically conservative. Their company, Berkshire Hathaway, has recently bought big stakes in several of the better-off investment banks.

Recently, Munger sharply criticized the management of the investment banks, saying they’ve grown too politically powerful. The key quote:

“We need to remove from the investment banking and the commercial banking industries a lot of the practices and prerogatives that they have so lovingly possessed,” Munger said. “If they are too big to fail, they are too big to be allowed to be as gamey and venal as they’ve been – and as stupid as they’ve been.” (Via Baseline Scenario.)

What does the bankers’ stupidity have to do with the usual themes of this blog? Well, much of the crisis comes down to bad probability calculations: the big banks have been treating highly correlated events as though they were independent events.
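To see how much the independence assumption can matter, here is a toy calculation in Ruby. Every number in it is invented for illustration, and the "common shock" model is only one simple way to get correlated defaults, not a description of how any bank actually priced risk. Each loan has the same overall chance of defaulting in both cases; the only difference is whether the defaults share a cause.

```ruby
# Chance that every loan defaults if defaults are independent.
def all_default_if_independent(p_default, loans)
  p_default**loans
end

# Chance that every loan defaults when a shared downturn raises
# every loan's default rate at the same time.
def all_default_with_common_shock(p_shock, p_bad, p_good, loans)
  p_shock * p_bad**loans + (1 - p_shock) * p_good**loans
end

# Overall default rate of a single loan under the common-shock model.
def marginal_default(p_shock, p_bad, p_good)
  p_shock * p_bad + (1 - p_shock) * p_good
end

# Hypothetical numbers, chosen so each loan defaults about 5% of the time.
loans   = 10
p_shock = 0.04     # chance of a market-wide downturn
p_bad   = 0.80     # per-loan default rate during a downturn
p_good  = 0.01875  # per-loan default rate otherwise

p_each      = marginal_default(p_shock, p_bad, p_good)
independent = all_default_if_independent(p_each, loans)
correlated  = all_default_with_common_shock(p_shock, p_bad, p_good, loans)

puts "Each loan defaults with probability %.3f" % p_each
puts "All %d default, assuming independence: %.2e" % [loans, independent]
puts "All %d default, with a common shock:   %.2e" % [loans, correlated]
```

With these made-up inputs, the chance of all ten loans failing together comes out many orders of magnitude higher in the correlated case, even though each loan looks equally safe on its own.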

Some good background material on the crisis:


Designing programs with RSpec and Cucumber (plus a book recommendation)

Posted by Eric Kidd Thu, 30 Apr 2009 15:07:00 GMT

Over the last couple of years, I’ve occasionally written Ruby programs using RSpec and (more recently) Cucumber. These two tools are inspired by Test Driven Development (TDD), a school of thought which says you should write unit tests before implementing a feature.

When doing TDD, you work inwards from the interface to the implementation. You start by writing a test case against the interface you wish you had, and then you make that test case work. This is a subtle shift in how you approach a design problem, but it frequently results in beautiful APIs. (And you also get a fully automated test suite for your software, liberating you to make much larger changes without fear of breaking things.)

The problem with the word “test”

Unfortunately, the name “Test Driven Development” is misleading. Most folks think of “testing” as something you do after development is complete. But TDD is really more of a design activity—you’re specifying how your APIs should work before you actually start coding.

Dan North spent some time struggling to teach developers about TDD. After a while, he decided that the main barrier to understanding was the word “test.” He proposed replacing TDD with Behavior Driven Development (BDD), and he started referring to unit tests as “specifications.”

In the Ruby community, the most popular BDD tool is RSpec. Using RSpec, you might specify an API something like this:

describe "simplify_name" do
  it "should convert all letters to lowercase" do
    simplify_name("AbC").should == "abc"
  end

  it "should remove everything but letters and spaces" do
    simplify_name(" Joe Smith 3 -+\n").should == "joe smith"
  end
end

After writing this specification, you would then go ahead and implement simplify_name. And from then on, whenever you changed your program, you could automatically check it against this specification.
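For example, one possible implementation that satisfies the two examples in the specification might look like the sketch below. The specification says nothing about runs of spaces or leading and trailing whitespace, so collapsing and trimming them here is an assumption made to match the second example.

```ruby
# Lowercase the name, drop everything except letters and spaces,
# then collapse repeated spaces and trim the ends.
def simplify_name(name)
  name.downcase.gsub(/[^a-z ]/, '').squeeze(' ').strip
end

puts simplify_name("AbC")
puts simplify_name(" Joe Smith 3 -+\n")
```

The point of writing the specification first is that the expected results were pinned down before any of this code existed.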

Using specifications to communicate with clients and users

By itself, RSpec is mostly useful for programmers. Sure, a specification looks a lot like English. But would you really want to show it to an end user?

Cucumber goes one step further. Instead of using code to specify how an API should work, it uses plain text to describe how a user interface should work. For example:

Feature: Log in and out
  As an administrator
  I want to restrict access to certain portions of my site
  In order to prevent users from changing the content

  Scenario: Logging in
    Given I am not logged in as an administrator
    When I go to the administrative page
    And I fill in the fields
      | Username | admin  |
      | Password | secret |
    And I press "Log in"
    Then I should be on the administrative page
    And I should see "Log out"

  Scenario: Logging out

Here’s the neat part: This specification is actually an executable program. Each line of text corresponds to a “step”, which is defined in another file. Here’s an example from the standard webrat_steps.rb file:

Then /^I should see "([^\"]*)"$/ do |text|
  response.should contain(text)
end

Cucumber encourages you to think at a very high level, and to specify how different users will actually use your software. It’s particularly helpful if you need to communicate between programmers and end-users.
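The link between a line of plain text and its step is just a regular expression match. The toy sketch below shows the idea in plain Ruby. It is not how Cucumber is actually implemented, and `StepRegistry`, `define` and `run` are hypothetical names chosen for this example.

```ruby
# A toy step registry: each step is a regex paired with a block.
class StepRegistry
  def initialize
    @steps = []
  end

  # Register a step definition.
  def define(pattern, &block)
    @steps << [pattern, block]
  end

  # Find the first step whose pattern matches the line, and call its
  # block with the captured groups as arguments.
  def run(line)
    @steps.each do |pattern, block|
      match = pattern.match(line)
      return block.call(*match.captures) if match
    end
    raise ArgumentError, "Undefined step: #{line}"
  end
end

registry = StepRegistry.new
registry.define(/^I should see "([^"]*)"$/) do |text|
  "would check the page for: #{text}"
end

puts registry.run('I should see "Log out"')
```

A line with no matching pattern raises an error here, which loosely corresponds to an undefined step in a feature file.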

My experiences with RSpec and Cucumber

I’ve been using RSpec on and off for a couple of years now, and Cucumber since late last year. Initially, I found both tools fascinating, but also a bit frustrating. Both RSpec and Cucumber have very strong opinions about how you should write software. Now, I found those opinions very interesting, and I was quite happy to be influenced by the assumptions built into the tools. But every now and then, I would need to do something that the authors of RSpec and Cucumber hadn’t anticipated, and I would inevitably wind up struggling to make things work.

But recent versions of RSpec and Cucumber are richer and more flexible. They cover more important cases straight out of the box, and they’re easier to customize. So I can finally recommend both tools for real-world projects: They’ll still guide your thinking, but they should give you enough flexibility to handle the corner-cases.

The RSpec (and Cucumber) book

Unfortunately, the documentation for RSpec and Cucumber is scattered around the web, and there aren’t enough online guides showing the best way to solve common problems.

But the Pragmatic Press is working on The RSpec Book, which contains a large section on Cucumber, and a walkthrough of a typical development session using Cucumber and RSpec.

Currently, the RSpec book is available as a “beta book”. This is a downloadable, DRM-free PDF, with periodic updates throughout the publishing process. Right now, between one-third and one-half of the chapters have been roughed in, and the book is already very useful.

So if you’re curious about RSpec and Cucumber, have a look around the two web sites, and maybe watch some of the screencasts. If you decide to investigate further, pick up the beta book and dive in.


Remote root holes reported as "denial of service"

Posted by Eric Kidd Thu, 30 Apr 2009 12:57:00 GMT

Via LWN.

If you’re a Linux system administrator, you shouldn’t put your faith in security advisories. The kernelbof blog accuses Linux distributors of being too quick to label security bugs as “denial of service” attacks:

I’m wondering why kernel developers (or vendors?) continue to claim that kernel memory corruption are just Denial of Service. Most of the times they _are_ exploitable.

As an example, the author quotes Ubuntu Security Notice 751:

The SCTP stack did not correctly validate FORWARD-TSN packets. A remote attacker could send specially crafted SCTP traffic causing a system crash, leading to a denial of service.

(Emphasis added.)

The author claims, however, to have created an exploit for this bug. He says his exploit allows a remote attacker to gain root access, often on the first attempt. If this is true, it would give him a quick way to gain control over any Linux system which has a process listening to an SCTP socket.

Ubuntu’s security team is not doing system administrators any favors by labeling memory corruption as “denial of service” attacks. If you can corrupt memory, there are some terrifyingly clever ways to run code. And marking memory as non-executable won’t necessarily protect you.

If you administer a Linux system, you should probably aim to patch alleged “denial of service” bugs as quickly as you can.


Installing TortoiseGit

Posted by Eric Kidd Sun, 11 Jan 2009 09:55:00 GMT

On December 12th, Frank Li released TortoiseGit 0.1. When we downloaded this initial release at work, we were underwhelmed:

git log     OK, but not as good as gitk
git commit  Broken

On January 4th, however, Frank Li released TortoiseGit 0.2. He’d been extremely busy:

git log           OK, but not as good as gitk
git commit        OK (except for add and rm)
git add           Broken (see Bug 6 for workaround)
git rm            Broken
git status        OK
git pull          OK
git push          OK
SSH               OK (tested with PuTTY)
git clone         Always clones to home directory (see Bug 8)
Clean merge       OK
Conflicted merge  Manual, as with command-line tool
Submodules        No support

Basically, TortoiseGit 0.2 is almost usable, and the project is proceeding at a breakneck pace. If you have Windows users that you want to migrate to Git—and who don’t want to use the command-line tool—it’s worth a look.

Installation instructions follow.



Ubiquitous Hoogle

Posted by Eric Kidd Mon, 01 Sep 2008 10:33:00 GMT

Ubiquity is an experimental Firefox plugin. It’s a “graphical command line” similar to QuickSilver on the Macintosh.

You can easily add your own commands to Ubiquity. The following article shows how to create a Hoogle search command that looks up Haskell functions by name or by type signature.

Searching for putStr

You can press Return or click on one of the links in the preview.


