Best article I've seen on SOPA

Posted by Eric Kidd Wed, 18 Jan 2012 19:29:00 GMT

Wikipedia, Google and many other internet sites are protesting PIPA and SOPA today. But their official explanations don’t include very many details about the actual legislation.

If you’d like to learn more, check out this excellent background piece by a freelance film editor.

Screencast: Use Rails and RDF.rb to parse Best Buy product reviews

Posted by Eric Kidd Sun, 05 Jun 2011 19:07:00 GMT

In the past few years, many companies have been embedding machine-readable metadata in their web pages. Among these is Best Buy, which provides extensive RDFa data describing their products, prices and user reviews.

The following 20-minute screencast shows how to use Ruby 1.9.2, Rails 3.1rc1, RDF.rb and my rdf-agraph gem to compare user ratings of the iPad and various Android Honeycomb tablets.



Heroku "Celadon Cedar" review

Posted by Eric Kidd Fri, 03 Jun 2011 19:20:00 GMT

Heroku just released a new version of their hosting service for Ruby on Rails. It’s called Celadon Cedar, and it adds support for arbitrary background processes, Node.js servers and long-polling over HTTP.

I just finished porting a large Rails 3.0 application to Heroku’s Cedar stack from Chef+EC2, and I’m deeply impressed. But there are still some rough edges, especially with regard to asset caching.

Procfiles are really cool

Previous versions of Heroku could only run two types of processes: Web servers and delayed_job workers. If you needed to monitor a ZeroMQ queue or run a cron job every minute, you were out of luck. So even though I loved Heroku, about 2/3rds of my clients couldn’t even consider using it.

Celadon Cedar, however, allows you to create a Procfile specifying a list of process types to run:

web:    bundle exec rails server -p $PORT
worker: bundle exec rake jobs:work
clock:  bundle exec clockwork config/clock.rb

Once you’ve deployed your project, you can specify how many of each process you want:

heroku scale web=3 worker=2 clock=1

Even better, if you’re running on a development machine, or if you want to deploy to a regular Linux server, you can use the Foreman gem to launch the processes manually, or to generate init scripts:

foreman start
foreman export upstart /etc/init -u username

If you’re feeling more ambitious, you can also run Unicorn and Node.js servers on Heroku.
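To make the Procfile format concrete, here is a rough Python sketch of how a launcher might read the file and expand a scale request into individual process instances. It is not how Heroku or Foreman are implemented; the function names and the `web.1`-style labels are choices made for this illustration only.

```python
# Hypothetical helpers, for illustration only.
PROCFILE = """\
web:    bundle exec rails server -p $PORT
worker: bundle exec rake jobs:work
clock:  bundle exec clockwork config/clock.rb
"""

def parse_procfile(text):
    """Map each process type to its command line."""
    processes = {}
    for line in text.splitlines():
        # Skip blank lines and anything without a "name: command" shape.
        if ":" not in line:
            continue
        name, _, command = line.partition(":")
        processes[name.strip()] = command.strip()
    return processes

def expand_scale(processes, scale):
    """Turn {'web': 3, ...} into one (label, command) pair per instance."""
    instances = []
    for name, count in scale.items():
        for i in range(1, count + 1):
            instances.append((f"{name}.{i}", processes[name]))
    return instances

processes = parse_procfile(PROCFILE)
instances = expand_scale(processes, {"web": 3, "worker": 2, "clock": 1})
for label, command in instances:
    print(label, "->", command)
```

The point is only that a Procfile is a flat list of named commands, and scaling means running more copies of the same command.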

Asset caching is even worse than before

Previous versions of Heroku had a built-in Varnish cache, which would cache CSS, JavaScripts and images for 12 hours. The Varnish cache was automatically flushed on redeploy, so it gave you a nice performance boost for zero work.

However, if you were running a high-performance site, you would generally want to run all your JavaScript and CSS through YUI Compressor, which vastly improves your download times. Under the previous version of Heroku, this was annoying to set up: You had to either commit your compiled assets into git, or deploy them to a CDN manually.

The Celadon Cedar stack, unfortunately, doesn’t make it any easier to set up YUI Compressor, and it removes the existing Varnish cache. In place of Varnish, Heroku encourages you to set up Rack::Cache with memcached as a storage backend.

You may want to consider adding the following line to your file, right before the run statement:

use Rack::Deflater

Combined with Rack::Cache, this will give you back some of the functionality of Varnish. But it’s a lot more work than you needed to do before, and the results aren’t as good. Heroku made this decision deliberately, because Varnish prevented them from doing cool things with Node.js servers and long-polled HTTP connections. But it still represents a retreat from Heroku’s famous ease of use.

What Heroku’s Cedar stack really needs is first-class support for Rack::Cache, Rack::Deflater, and the new Sprockets asset caching in Rails 3.1. Please, just allow me to add a couple of lines to my Gemfile and have everything work automagically. Yeah, you’ve spoiled me and made me lazy.

You’ll have to upgrade to Ruby 1.9.2

According to the official documentation, only Ruby 1.9.2 is supported under Celadon Cedar. This isn’t entirely surprising—Rails 3.1 recommends Ruby 1.9.2 as well—but it may be a problem for some users.

Fortunately, my client’s application worked flawlessly under Ruby 1.9.2 with only a single change to the Gemfile.

Running a cron job once per minute is really easy, but it costs $71/month

One of Heroku’s engineers explains how to run high-frequency cron jobs using Clockwork and delayed_job.

Basically, you add a couple of lines to your Procfile:

worker: bundle exec rake jobs:work
clock:  bundle exec clockwork config/clock.rb

…and you put something like the following in config/clock.rb:

require File.expand_path('../environment',  __FILE__)

# Run our heartbeat once per minute.
every(1.minutes, 'myapp.heartbeat') { MyApp.delay.heartbeat }

This creates a DelayedJob and hands it off to our worker process. According to the tutorial, you’re supposed to do the actual work in a separate process, so as not to interfere with other events. This approach is elegant, but it’s going to cost you $71/month for two “dynos”. Ouch.
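The division of labor described here (the clock only enqueues, the worker does the real work) can be sketched in a few lines of Python. This is an analogy, not the Clockwork or delayed_job implementation: the in-process `queue.Queue` stands in for the job table, and `clock_tick` and `worker_drain` are invented names.

```python
import queue

# Stand-in for the shared job store; in the real setup this would be
# a database table that both processes can reach.
jobs = queue.Queue()

def clock_tick(minute):
    """Clock process: record that a heartbeat is due, and nothing more."""
    jobs.put(("heartbeat", minute))

def worker_drain():
    """Worker process: pull queued jobs and run them one at a time."""
    done = []
    while not jobs.empty():
        name, minute = jobs.get()
        # The slow, failure-prone work would happen here.
        done.append(f"{name}@{minute}")
    return done

# Simulate three ticks of the clock, then let the worker catch up.
for minute in range(3):
    clock_tick(minute)
results = worker_drain()
print(results)
```

Because the clock never blocks on the job itself, a slow heartbeat cannot delay the next tick, which is why the tutorial asks for two separate processes.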

Cedar is a great new stack, but it needs polishing

I’m really impressed with Celadon Cedar. Heroku has vastly improved their support for complex applications with a lot of moving parts. But along the way, they’ve made it slightly harder to deploy simple applications, and they still don’t have a painless way to do asset caching. Of course, these minor drawbacks should improve dramatically once the Ruby community plays with Cedar for a few weeks.

Many thanks to Heroku for a great new release! I’ll be moving more applications over soon.

Does anybody have any suggestions on how to make better use of Cedar and Rails?


Derivatives of algebraic data structures: An excellent tutorial

Posted by Eric Kidd Fri, 20 May 2011 20:01:00 GMT

Last month, the folks at Lab49 explained how to compute the derivative of a data structure. This is a great example of how to write about mathematical subjects for a casual audience: They draw analogies to well-known programming languages, they follow a single, well-chosen thread of explanation, and there’s a clever payoff at the end.

The Lab49 blog post is, of course, based on two classic papers by Conor McBride, and Huet’s original paper The Zipper.

If you’re interested in real-world applications of this technique, there’s a great explanation in the final chapter of Learn You a Haskell for Great Good. If you’re interested in some deeper mathematical connections, see the discussion at Lambda the Ultimate.


What do these fixed points have in common?

Posted by Eric Kidd Thu, 12 May 2011 12:09:00 GMT

A question asked while standing in the shower: What do all of the following have in common?

  1. Banach and Brouwer fixed points. If you’re in Manhattan, and you crumple up a map of Manhattan and place it on the ground, at least one point on your map will be exactly over the corresponding point on the ground. (This is true even if your map is larger than life.)
  2. The fixed points computed by the Y combinator, which is used to construct anonymous recursive functions in the lambda calculus.
  3. The Nash equilibrium, which is the stable equilibrium of a multi-player game (and one of the key ideas of economics). See also this lovely—if metaphorical—rant by Scott Aaronson.
  4. The eigenvectors of a matrix, which will still point in the same direction after multiplication by the matrix.

At what level of abstraction are all these important ideas really just the same idea? If we strip everything down to generalized abstract nonsense, is there a nice simple formulation that covers all of the above?
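This doesn't answer the question, but it may help to see three of the four examples written as "find an x such that f(x) = x". The sketch below is a plain-Python illustration, with helper names invented for the purpose; it uses the Z combinator (the strict-evaluation variant of Y), and it leaves out the Nash equilibrium entirely.

```python
import math

def fixed_point(f, x, tol=1e-12, max_iter=10_000):
    """Iterate x -> f(x) until successive values differ by less than tol."""
    for _ in range(max_iter):
        nxt = f(x)
        if abs(nxt - x) < tol:
            return nxt
        x = nxt
    raise RuntimeError("iteration did not converge")

# 1. Banach-style iteration: near its fixed point, cos is a contraction,
#    so repeated application converges to the x with cos(x) == x.
x = fixed_point(math.cos, 1.0)

# 2. Z combinator: Z(f) is a fixed point of f, which is what lets an
#    anonymous function refer to itself.
Z = lambda f: (lambda g: f(lambda v: g(g)(v)))(lambda g: f(lambda v: g(g)(v)))
fact = Z(lambda self: lambda n: 1 if n == 0 else n * self(n - 1))

# 3. Power iteration: the dominant eigenvector is a fixed point of
#    v -> normalize(A v).
def power_iteration(a, v, steps=100):
    for _ in range(steps):
        w = [sum(a[i][j] * v[j] for j in range(len(v))) for i in range(len(a))]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v

v = power_iteration([[2.0, 1.0], [1.0, 2.0]], [1.0, 0.0])

print(x, fact(5), v)
```

In each case the interesting object is whatever the map leaves unchanged: a number, a function, or a direction.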

(I can’t play with this shiny toy today; I have to work.)


AWS outage timeline & downtimes by recovery strategy

Posted by Eric Kidd Mon, 25 Apr 2011 08:41:00 GMT

Renting a server from Amazon is no substitute for a disaster recovery plan.

If you run your own servers, you need backups. If you can’t afford to go down, you also need offsite replication. But if you lease servers in the cloud, how can you protect against problems like this week’s Amazon outage?

Keep reading for a timeline of the outage, plus a list of recovery strategies and the minimum downtime that each would have incurred.

A timeline of the Amazon outage

Here’s a timeline of what went wrong, and when it was fixed. Note, in particular, the window from roughly 1:00 AM to 1:48 PM PDT when several of Amazon’s availability zones were partially unavailable. (For a glossary of Amazon Web Service terminology, see the bottom of this post.)

I’ve also included Heroku’s status reports on this timeline.

21 April 2011

1:15 AM PDT Heroku begins investigating high error rates.

1:41 AM PDT Amazon admits they are seeing problems with EBS volumes and EC2 instances in US East 1. The outage affects multiple availability zones. Amazon later described the problem as follows:

A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it’s difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We’re starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.

1:52 AM PDT Heroku reports that applications and tools are functioning intermittently.

3:05 AM PDT Amazon reports that RDS databases replicated across multiple Availability Zones are not failing over as expected. This is a big deal, because these multi-AZ RDS databases are intended to be an expensive, highly-reliable option for storing data.

1:48 PM PDT EBS volumes and EC2 instances are now working correctly in all but one availability zone.

2:15 PM PDT Heroku reports that they can now launch new EBS instances.

2:35 PM PDT Amazon restores access to “majority” of multi-AZ RDS databases. (There’s nothing in the Amazon timeline to indicate when all of the multi-AZ RDS databases came back online.)

3:07 PM PDT Heroku brings core services back online, and restores service to many applications.

4:15 PM PDT Heroku reports: “In some cases the process of bringing many applications online simultaneously has created intermittent availability and elevated error rates.”

8:27 PM PDT Heroku finishes restoring API services.

22 April 2011

2:19 AM PDT Heroku reports that all dedicated databases are back online.

6:25 AM PDT Heroku reports that new application creation is enabled.

1:30 PM PDT Amazon reports “majority” of EBS volumes in affected zone have been recovered. Remaining volumes will require a more time-consuming recovery process.

9:11 PM PDT Amazon reports that “control plane” congestion is limiting the speed at which they can recover the remaining volumes.

23 April 2011

11:54 AM PDT Amazon is still wrestling with control plane congestion.

Quick update. We’ve tried a couple of ideas to remove the bottleneck in opening up the APIs, each time we’ve learned more but haven’t yet solved the problem. We are making progress, but much more slowly than we’d hoped. Right now we’re setting up more control plane components that should be capable of working through the backlog of attach/detach state changes for EBS volumes. These are coming online, and we’ve been seeing progress on the backlog, but it’s still too early to tell how much this will accelerate the process for us.

8:39 PM PDT Amazon finishes re-enabling their APIs for all recovered volumes in the affected zone. Not all EBS volumes have been recovered yet, however.

We continue to see stability in the service and are confident now that that the service is operating normally for all API calls and all restored EBS volumes.

8:39 PM PDT Heroku reports that all applications are back online, though a few still cannot deploy new code via git.

24 April 2011

3:26 AM PDT Amazon re-enables RDS APIs in the affected zone, but not all databases have been recovered:

The RDS APIs for the affected Availability Zone have now been restored. We will continue monitoring the service very closely, but at this time RDS is operating normally in all Availability Zones for all APIs and restored Database Instances. Recovery is still underway for a small number of Database Instances in the affected Availability Zone.

5:21 AM PDT Heroku reports that all functionality is fully restored, including deploying new applications.

7:35 PM PDT Amazon reports that all EBS volumes are back online.

7:39 PM PDT Amazon reports that all RDS databases are back online.
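As a sanity check on the downtime figures, the elapsed times can be computed directly from the timestamps in the timeline. This is a small sketch using only the Python standard library; the variable names are mine, and all times are treated as PDT.

```python
from datetime import datetime

FMT = "%Y-%m-%d %I:%M %p"

def hours_between(start, end):
    """Elapsed hours between two timestamps in the same time zone."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 3600.0

# Timestamps taken from the timeline above.
first_amazon_report = "2011-04-21 01:41 AM"
most_zones_recovered = "2011-04-21 01:48 PM"
all_ebs_recovered = "2011-04-24 07:35 PM"

partial_outage = hours_between(first_amazon_report, most_zones_recovered)
full_recovery = hours_between(first_amazon_report, all_ebs_recovered)

print(f"Most zones recovered after {partial_outage:.1f} hours")
print(f"All EBS volumes recovered after {full_recovery:.1f} hours "
      f"({full_recovery / 24:.2f} days)")
```

That works out to a little over 12 hours for most zones and roughly 3.7 days for the last EBS volumes.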

Strategies for surviving a major cloud outage, and associated downtime

1. Rely on a single EBS volume with no snapshots. If you relied on a single EBS volume with no snapshots, there’s a chance that your site would have been offline for over 3.5 days after the initial outage. There’s also at least a 0.1% to 0.5% annual chance of losing your EBS volume entirely. This is not a recommended approach.

2. Deploy into a single availability zone, with EBS snapshots. In this scenario, if an availability zone goes down, you can theoretically restore from backup into another availability zone. During this recent outage, your site might have remained offline for over 12 hours, and you might have lost any changes since your last backup (unless you reintegrated them manually). Given Amazon’s record during 2009 and 2010, this could still give you 99.95% uptime if no other EBS volume failures occurred. Despite the recent events, this may still be a viable strategy for many smaller, lower-revenue sites.

3. Rely on multi-AZ RDS databases to fail over to another availability zone. This approach should have lower downtime than relying on EBS snapshots, but in this case, the multi-AZ RDS failover mechanisms took longer than 14 hours for some users.

4. Run in 3 AZs, at no more than 60% capacity in each. This is the approach taken by Netflix, which sailed through this outage with no known downtime. If a single AZ fails, then the remaining two zones will be at 90% capacity. And because the extra capacity is running at all times, Netflix doesn’t need to launch new instances in the middle of a “bank run” (see below).

5. Replicate data to another AWS region or cloud provider. This is still the gold standard for sites which require high uptime guarantees. Unfortunately, it requires transmitting large amounts of data over the public internet, which is both expensive and slow. In this case, downtime is a function of external systems and how quickly they can fail over to the replicated database.

There are some other approaches, such as writing backups and transaction logs to S3, where they are likely to remain available even in the case of severe outages.
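The capacity arithmetic behind strategy 4 is worth spelling out. The sketch below assumes load is spread evenly across zones and redistributes evenly onto the survivors; the function names are invented for this example.

```python
def load_after_failure(zones, utilization, failed=1):
    """Per-zone utilization once `failed` zones drop out and the rest absorb the load."""
    survivors = zones - failed
    if survivors <= 0:
        raise ValueError("no zones left to absorb the load")
    return zones * utilization / survivors

def max_safe_utilization(zones, failed=1):
    """Highest steady-state utilization that still fits after `failed` zones are lost."""
    return (zones - failed) / zones

after = load_after_failure(3, 0.60)
ceiling = max_safe_utilization(3)

print(f"3 zones at 60%: survivors run at about {after:.0%}")
print(f"3 zones can each run at up to about {ceiling:.1%} and still survive one failure")
```

With three zones, anything above roughly two-thirds utilization leaves no room to absorb a zone failure, so 60% gives a modest safety margin.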

Lessons learned

For some excellent post-mortems, see:

Here are some of the most important points:

1. The biggest danger in a well-engineered cloud system is a “run on the bank”, where initial failures trigger error-recovery code, which in turn may drive the load far beyond normal limits. According to Amazon, an initial network problem triggered an EBS re-mirroring, which in turn overloaded their management plane. This, in turn, triggered emergency recovery scripts written by AWS customers, forcing the total load even higher. To stabilize the situation, Amazon was forced to disable API access to multiple zones. Just as in 1933, the easiest solution to a bank run is a bank holiday.

2. Availability Zone failures are correlated. Even though Amazon claims that multiple availability zones should not fail at the same time, it’s clear that all the availability zones within a region share a management plane. This means that a large enough failure can overload the shared management plane.

3. EBS remains the weakest link. Recent months have seen widespread complaints about EBS, and Netflix has published an article on working around those limitations.

4. Few cloud providers publish their disaster recovery plans, making it hard to estimate downtime. If you were a Heroku customer last week, you had no way to evaluate how Heroku would respond to a major outage, or their plans for keeping your site on the air. As it turns out, they had widespread dependencies on EBS, and no plan for getting Heroku-based sites back on the air if an availability zone failed.

5. Test your disaster recovery plan. If you haven’t tested your disaster recovery plan, then you have no idea how long it will take you to get back on the air.



The state of Ruby, RDF and Rails 3

Posted by Eric Kidd Mon, 20 Dec 2010 19:56:00 GMT

Recently, I was investigating the state of RDF in the Ruby world. Here are some notes, in case anybody is curious. I have used only a few of these Ruby RDF libraries, so please feel free to add your own comments with corrections and other alternatives.

There’s also some stuff about ActiveModel and ActiveRelation down at the end, for people who are interested in Rails 3.



Feedhose demo: Real-time RSS using Node.js and

Posted by Eric Kidd Wed, 13 Oct 2010 12:09:00 GMT

Yesterday evening, I released an experimental Node.js/ application:

Just leave your web browser open, and watch the New York Times headlines scroll by. Dave Winer is sending me some traffic this morning, so I’m going to find out how well this stack scales.

I’ve tested it in IE 6, IE 7, Firefox 3.5 and a ridiculously new version of Chrome, and it runs without any major problems. Please let me know if you encounter any problems in other browsers!

During the day, I’ll update this post with technical details: How it works, how much it costs to run, and some tricks I’m using to keep the system alive.


Visualizing WordNet relationships as graphs

Posted by Eric Kidd Tue, 29 Dec 2009 20:38:00 GMT

The WordNet database contains all sorts of interesting relationships between words: it can categorize words into hierarchies, find the parts of an object, and answer many other interesting questions.

The code below relies on the NLTK and NetworkX libraries for Python.

Categorizing words

What, exactly, is a dog? It’s a domestic animal and a carnivore, not to mention a physical entity (as opposed to an abstract entity, such as an idea). WordNet knows all these facts:

How do we generate this image? First, we look up the first entry for “dog” in WordNet. This returns a “synset”, or a set of words with equivalent meanings.

dog = wn.synset('dog.n.01')

Next, we compute the transitive closure of the hypernym relationship, or (in English) we look for all the categories to which “dog” belongs, and all the categories to which those categories belong, recursively:

graph = closure_graph(dog,
                      lambda s: s.hypernyms())

After that, we just pass the resulting graph to NetworkX for display:


The implementation

The closure_graph function repeatedly calls fn on the supplied synset, and uses the result to build a NetworkX graph. This code goes at the top of the file, so you can use wn and nx in your own code.

from nltk.corpus import wordnet as wn
import networkx as nx

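def closure_graph(synset, fn):
    """Build a directed graph of everything reachable from synset via fn."""
    seen = set()
    graph = nx.DiGraph()

    def recurse(s):
        if s not in seen:
            seen.add(s)
            # NLTK 3 exposes name() as a method; older releases used s.name.
            graph.add_node(s.name())
            for s1 in fn(s):
                graph.add_node(s1.name())
                graph.add_edge(s.name(), s1.name())
                recurse(s1)

    recurse(synset)
    return graph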

By using a high-quality graph library, we make it much easier to merge, analyze and display our graphs.
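To see what the traversal does without installing NLTK or NetworkX, here is a stripped-down sketch of the same idea over a hand-written hierarchy. The `TOY_HYPERNYMS` table is invented and much smaller than WordNet's real hierarchy, and `closure_edges` returns a plain edge list rather than a graph object.

```python
# A made-up, simplified hierarchy; WordNet's real one has more levels.
TOY_HYPERNYMS = {
    "dog": ["canine", "domestic_animal"],
    "canine": ["carnivore"],
    "carnivore": ["animal"],
    "domestic_animal": ["animal"],
    "animal": [],
}

def closure_edges(start, fn):
    """Collect every (node, parent) edge reachable from start by following fn."""
    seen = set()
    edges = []

    def recurse(node):
        if node not in seen:
            # Mark first, so shared ancestors are expanded only once.
            seen.add(node)
            for parent in fn(node):
                edges.append((node, parent))
                recurse(parent)

    recurse(start)
    return edges

edges = closure_edges("dog", lambda n: TOY_HYPERNYMS.get(n, []))
for child, parent in edges:
    print(child, "->", parent)
```

Note that "animal" is reached by two different routes but expanded only once; the `seen` set is what keeps the traversal from repeating work on shared ancestors.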

More graphs

Parts of the finger, generated with synset('finger.n.01') and part_meronyms:

Types of running, generated with synset('run.v.01') and hyponyms:


Experimenting with NLTK

Posted by Eric Kidd Mon, 28 Dec 2009 21:31:00 GMT

The Natural Language Toolkit for Python is a great framework for simple, non-probabilistic natural language processing. Here are some example snippets (and some troubleshooting notes).


We can search for “dog” in Chesterton’s The Man Who Was Thursday:

>>> from nltk.book import *
>>> text9.concordance("dog", width=40)
Displaying 4 of 4 matches:
ead of a cat or a dog , it could not ha
d you ever hear a dog bark like that ?"
aid , " is that a dog -- anybody ' s do
og -- anybody ' s dog ?" There broke up
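If you're curious what a concordance view involves, here is a rough pure-Python approximation: find each match and show a fixed amount of context on either side. This is not NLTK's implementation, and the `concordance` helper below is only a sketch whose exact line widths will differ from the output above.

```python
def concordance(tokens, word, width=40):
    """Return one context line per occurrence of word in a token list."""
    half = (width - len(word)) // 2
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():
            # Every token contributes at least one character, so looking
            # at `half` tokens on each side is always enough context.
            left = " ".join(tokens[max(0, i - half):i])[-half:]
            right = " ".join(tokens[i + 1:i + 1 + half])[:half]
            lines.append(f"{left:>{half}} {tok} {right}")
    return lines

tokens = "Did you ever hear a dog bark like that ?".split()
lines = concordance(tokens, "dog", width=40)
for line in lines:
    print(line)
```

Right-aligning the left-hand context is what makes the matched word line up in a single column across results.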

Synonyms and categories

We can use WordNet to look up synonyms:

from nltk.corpus import wordnet

dog = wordnet.synset('dog.n.01')
print(dog.lemma_names())  # lemma_names() is a method in NLTK 3

This prints:

['dog', 'domestic_dog', 'Canis_familiaris']

We can also look up the “hypernyms”, or larger categories that include the word “dog”:

paths = dog.hypernym_paths()

def simple_path(path):
    # lemmas() and name() are methods in NLTK 3.
    return [s.lemmas()[0].name() for s in path]

for path in paths:
    print(simple_path(path))

This prints:

['entity', 'physical_entity', 'object',
 'whole', 'living_thing', 'organism',
 'animal', 'domestic_animal', 'dog']
['entity', 'physical_entity', 'object',
 'whole', 'living_thing', 'organism',
 'animal', 'chordate', 'vertebrate',
 'mammal', 'placental', 'carnivore',
 'canine', 'dog']

For more neat examples, take a look at the NLTK book.

Installation notes

While setting up NLTK, I bumped into a few problems.

Problem: The dispersion_plot function returns immediately without displaying anything.

Fix: Configure your matplotlib back-end correctly.

Problem: The GUI fails with the error:

out of stack space (infinite loop?)

Fix: Recompile Tcl with threads. On the Mac:

sudo port install tcl +threads

