Posted by Eric Kidd
Wed, 18 Jan 2012 19:29:00 GMT
Wikipedia, Google and many other internet sites are protesting PIPA and SOPA today. But their official explanations don’t include very many details about the actual legislation.
If you’d like to learn more, check out this excellent background piece by a freelance film editor.
9 comments
Posted by Eric Kidd
Sun, 05 Jun 2011 19:07:00 GMT
In the past few years, many companies have been embedding machine-readable metadata in their web pages. Among these is Best Buy, which provides extensive RDFa data describing their products, prices and user reviews.
The following 20-minute screencast shows how to use Ruby 1.9.2, Rails 3.1rc1, RDF.rb and my rdf-agraph gem to compare user ratings of the iPad and various Android Honeycomb tablets.
Tags RDF, Rails, Ruby | 1 comment
Posted by Eric Kidd
Fri, 03 Jun 2011 19:20:00 GMT
Heroku just released a new version of their hosting service for Ruby on Rails. It’s called Celadon Cedar, and it adds support for arbitrary background processes, Node.js servers and long-polling over HTTP.
I just finished porting a large Rails 3.0 application from Chef+EC2 to Heroku’s Cedar stack, and I’m deeply impressed. But there are still some rough edges, especially with regard to asset caching.
Procfiles are really cool
Previous versions of Heroku could only run two types of processes: web servers and delayed_job workers. If you needed to monitor a ZeroMQ queue or run a cron job every minute, you were out of luck. So even though I loved Heroku, about two-thirds of my clients couldn’t even consider using it.
Celadon Cedar, however, allows you to create a Procfile specifying a list of process types to run:
web: bundle exec rails server -p $PORT
worker: bundle exec rake jobs:work
clock: bundle exec clockwork config/clock.rb
Once you’ve deployed your project, you can specify how many of each process you want:
heroku scale web=3 worker=2 clock=1
Even better, if you’re running on a development machine, or if you want to deploy to a regular Linux server, you can use the Foreman gem to launch the processes manually, or to generate init scripts:
foreman start
foreman export upstart /etc/init -u username
If you’re feeling more ambitious, you can also run Unicorn and Node.js servers on Heroku.
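To make the Procfile model concrete, here is a rough Python sketch of what the Procfile and the scale command describe together: a mapping from process type to command, plus a count for each type. This is only an illustration, not how Heroku or Foreman implement it; parse_procfile, expand_scale and the web.1-style instance labels are hypothetical names chosen for the example.

```python
PROCFILE = """\
web: bundle exec rails server -p $PORT
worker: bundle exec rake jobs:work
clock: bundle exec clockwork config/clock.rb
"""

def parse_procfile(text):
    """Map each process type to its command line."""
    processes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only; commands may contain colons too.
        name, _, command = line.partition(":")
        processes[name.strip()] = command.strip()
    return processes

def expand_scale(processes, scale_spec):
    """Turn a spec like 'web=3 worker=2' into one label per process instance."""
    instances = []
    for item in scale_spec.split():
        name, _, count = item.partition("=")
        if name not in processes:
            raise KeyError(f"unknown process type: {name}")
        instances += [f"{name}.{i}" for i in range(1, int(count) + 1)]
    return instances

processes = parse_procfile(PROCFILE)
instances = expand_scale(processes, "web=3 worker=2 clock=1")
```

The point of the sketch is that the Procfile says what can run, while the scale spec says how many of each to run, and the two are kept separate.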
Asset caching is even worse than before
Previous versions of Heroku had a built-in Varnish cache, which would cache CSS, JavaScript and images for 12 hours. The Varnish cache was automatically flushed on redeploy, so it gave you a nice performance boost for zero work.
However, if you were running a high-performance site, you would generally want to run all your JavaScript and CSS through YUI Compressor, which vastly improves your download times. Under the previous version of Heroku, this was annoying to set up: You had to either commit your compiled assets into git, or deploy them to a CDN manually.
The Celadon Cedar stack, unfortunately, doesn’t make it any easier to set up YUI Compressor, and it removes the existing Varnish cache. In place of Varnish, Heroku encourages you to set up Rack::Cache with memcached as a storage backend.
You may want to consider adding the following line to your config.ru file, right before the run statement:
Combined with Rack::Cache, this will give you back some of the functionality of Varnish. But it’s a lot more work than you needed to do before, and the results aren’t as good. Heroku made this decision deliberately, because Varnish prevented them from doing cool things with Node.js servers and long-polled HTTP connections. But it still represents a retreat from Heroku’s famous ease of use.
What Heroku’s Cedar stack really needs is first-class support for Rack::Cache, Rack::Deflater, and the new Sprockets asset caching in Rails 3.1. Please, just allow me to add a couple of lines to my Gemfile and have everything work automagically. Yeah, you’ve spoiled me and made me lazy.
You’ll have to upgrade to Ruby 1.9.2
According to the official documentation, only Ruby 1.9.2 is supported under Celadon Cedar. This isn’t entirely surprising—Rails 3.1 recommends Ruby 1.9.2 as well—but it may be a problem for some users.
Fortunately, my client’s application worked flawlessly under Ruby 1.9.2 with only a single change to the Gemfile.
Running a cron job once per minute is really easy, but it costs $71/month
One of Heroku’s engineers explains how to run high-frequency cron jobs using Clockwork and delayed_job.
Basically, you add a couple of lines to your Procfile:
worker: bundle exec rake jobs:work
clock: bundle exec clockwork config/clock.rb
…and you put something like the following in config/clock.rb:
require File.expand_path('../environment', __FILE__)
every(1.minutes, 'myapp.heartbeat') { MyApp.delay.heartbeat }
This creates a DelayedJob and hands it off to our worker process. According to the tutorial, you’re supposed to do the actual work in a separate process, so as not to interfere with other events. This approach is elegant, but it’s going to cost you $71/month for two “dynos”. Ouch.
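The division of labor is easier to see in miniature. Below is a Python sketch (not Clockwork or delayed_job, just an in-memory stand-in with hypothetical Clock and work_off names) in which the clock only decides when a job is due and pushes its name onto a queue, while a separate worker step actually runs it.

```python
from collections import deque

class Clock:
    """Decides *when* a job is due; it never runs the job itself."""

    def __init__(self, queue):
        self.queue = queue
        self.entries = []  # each entry: [interval_seconds, job_name, next_due]

    def every(self, interval, job_name, start=0):
        self.entries.append([interval, job_name, start + interval])

    def tick(self, now):
        for entry in self.entries:
            interval, job_name, next_due = entry
            if now >= next_due:
                self.queue.append(job_name)      # cheap: enqueue and move on
                entry[2] = next_due + interval   # schedule the next run

def work_off(queue, handlers):
    """Worker side: pop queued job names and run the matching handler."""
    done = 0
    while queue:
        handlers[queue.popleft()]()
        done += 1
    return done

# Simulate three minutes of one-second ticks with a fake clock.
queue = deque()
clock = Clock(queue)
clock.every(60, "myapp.heartbeat")
for now in range(0, 181):
    clock.tick(now)

beats = []
completed = work_off(queue, {"myapp.heartbeat": lambda: beats.append("beat")})
```

Because the clock never does the slow part, a long-running heartbeat cannot delay the next tick. That separation is what the second process buys you.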
Cedar is a great new stack, but it needs polishing
I’m really impressed with Celadon Cedar. Heroku has vastly improved their support for complex applications with a lot of moving parts. But along the way, they’ve made it slightly harder to deploy simple applications, and they still don’t have a painless way to do asset caching. Of course, these minor drawbacks should improve dramatically once the Ruby community plays with Cedar for a few weeks.
Many thanks to Heroku for a great new release! I’ll be moving more applications over soon.
Does anybody have any suggestions on how to make better use of Cedar and Rails?
Tags Rails, Ruby | no comments
Posted by Eric Kidd
Fri, 20 May 2011 20:01:00 GMT
Last month, the folks at Lab49 explained how to compute the derivative of a data structure. This is a great example of how to write about mathematical subjects for a casual audience: They draw analogies to well-known programming languages, they follow a single, well-chosen thread of explanation, and there’s a clever payoff at the end.
The Lab49 blog post is, of course, based on two classic papers by Conor McBride, and Huet’s original paper The Zipper.
If you’re interested in real-world applications of this technique, there’s a great explanation in the final chapter of Learn You a Haskell for Great Good. If you’re interested in some deeper mathematical connections, see the discussion at Lambda the Ultimate.
Tags Haskell, Math | 5 comments
Posted by Eric Kidd
Thu, 12 May 2011 12:09:00 GMT
A question asked while standing in the shower: What do all of the following have in common?
- Banach and Brouwer fixed points. If you’re in Manhattan, and you crumple up a map of Manhattan and place it on the ground, at least one point on your map will be exactly over the corresponding point on the ground. (This is true even if your map is larger than life.)
- The fixed points computed by the Y combinator, which is used to construct anonymous recursive functions in the lambda calculus.
- The Nash equilibrium, which is the stable equilibrium of a multi-player game (and one of the key ideas of economics). See also this lovely—if metaphorical—rant by Scott Aaronson.
- The eigenvectors of a matrix, which will still point in the same direction after multiplication by the matrix.
At what level of abstraction are all these important ideas really just the same idea? If we strip everything down to generalized abstract nonsense, is there a nice simple formulation that covers all of the above?
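At the most concrete level, three of these can be written as literal fixed points, x = f(x). The Python sketch below shows a contraction iterated until it stops moving, a strict-evaluation variant of the Y combinator (usually called Z), and an eigenvector found by repeatedly multiplying by a matrix and rescaling. The helper names and the example matrix are chosen purely for illustration, and none of this answers the deeper question about a common abstraction.

```python
import math

def fixed_point(f, x, tol=1e-12, max_iter=10_000):
    """Iterate x -> f(x) until successive values agree to within tol."""
    for _ in range(max_iter):
        nxt = f(x)
        if abs(nxt - x) < tol:
            return nxt
        x = nxt
    raise RuntimeError("iteration did not converge")

# 1. Banach-style: cos is a contraction on [0, 1], so iterating it converges.
x_star = fixed_point(math.cos, 1.0)

# 2. Z combinator: Z(g) behaves like a fixed point of g, which gives us
#    recursion without ever naming the recursive function.
Z = lambda g: (lambda x: g(lambda v: x(x)(v)))(lambda x: g(lambda v: x(x)(v)))
factorial = Z(lambda self: lambda n: 1 if n == 0 else n * self(n - 1))

# 3. Eigenvector: a fixed point of "multiply by the matrix, rescale to length 1".
def step(matrix, v):
    w = [sum(a * b for a, b in zip(row, v)) for row in matrix]
    length = math.sqrt(sum(c * c for c in w))
    return [c / length for c in w]

v = [1.0, 0.0]
for _ in range(100):
    v = step([[2.0, 1.0], [1.0, 2.0]], v)
```

In all three cases the interesting object is the thing that a transformation leaves unchanged; only the space it lives in differs (numbers, functions, directions).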
(I can’t play with this shiny toy today; I have to work.)
Tags Haskell, Math | 2 comments
Posted by Eric Kidd
Mon, 25 Apr 2011 08:41:00 GMT
Renting a server from Amazon is no substitute for a disaster recovery plan.
If you run your own servers, you need backups. If you can’t afford to go
down, you also need offsite replication. But if you lease servers in the
cloud, how can you protect against problems like this week’s Amazon outage?
Keep reading for a timeline of the outage, plus a list of recovery
strategies and the minimum downtime that each would have incurred.
A timeline of the Amazon outage
Here’s a timeline of what went wrong, and when it was fixed. Note, in
particular, the window from roughly 1:00 AM to 1:48 PM PDT when several of
Amazon’s availability zones were partially unavailable. (For a
glossary of Amazon Web Service terminology, see the bottom of this post.)
I’ve also included Heroku’s status reports on this timeline.
21 April 2011
1:15 AM PDT Heroku begins investigating high error rates.
1:41 AM PDT Amazon admits they are seeing problems with EBS volumes and
EC2 instances in US East 1. The outage affects multiple availability
zones. Amazon later described the problem as follows:
A networking event early this morning triggered a large amount of
re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a
shortage of capacity in one of the US-EAST-1 Availability Zones, which
impacted new EBS volume creation as well as the pace with which we could
re-mirror and recover affected EBS volumes. Additionally, one of our
internal control planes for EBS has become inundated such that it’s
difficult to create new EBS volumes and EBS backed instances. We are
working as quickly as possible to add capacity to that one Availability
Zone to speed up the re-mirroring, and working to restore the control plane
issue. We’re starting to see progress on these efforts, but are not there
yet. We will continue to provide updates when we have them.
1:52 AM PDT Heroku reports that applications and tools are functioning
intermittently.
3:05 AM PDT Amazon reports that RDS databases replicated across
multiple Availability Zones are not failing over as expected. This is a
big deal, because these multi-AZ RDS databases are intended to be an
expensive, highly-reliable option for storing data.
1:48 PM PDT EBS volumes and EC2 instances are now working correctly in
all but one availability zone.
2:15 PM PDT Heroku reports that they can now launch new EBS instances.
2:35 PM PDT Amazon restores access to “majority” of multi-AZ RDS
databases. (There’s nothing in the Amazon timeline to indicate when all
of the multi-AZ RDS databases came back online.)
3:07 PM PDT Heroku brings core services back online, and restores
service to many applications.
4:15 PM PDT Heroku reports: “In some cases the process of bringing many
applications online simultaneously has created intermittent availability
and elevated error rates.”
8:27 PM PDT Heroku finishes restoring API services.
22 April 2011
2:19 AM PDT Heroku reports that all dedicated databases are back
online.
6:25 AM PDT Heroku reports that new application creation is enabled.
1:30 PM PDT Amazon reports “majority” of EBS volumes in affected zone
have been recovered. Remaining volumes will require a more time-consuming
recovery process.
9:11 PM PDT Amazon reports that “control plane” congestion is limiting
the speed at which they can recover the remaining volumes.
23 April 2011
11:54 AM PDT Amazon is still wrestling with control plane congestion.
Quick update. We’ve tried a couple of ideas to remove the bottleneck in
opening up the APIs, each time we’ve learned more but haven’t yet solved
the problem. We are making progress, but much more slowly than we’d
hoped. Right now we’re setting up more control plane components that should
be capable of working through the backlog of attach/detach state changes
for EBS volumes. These are coming online, and we’ve been seeing progress on
the backlog, but it’s still too early to tell how much this will accelerate
the process for us.
8:39 PM PDT Amazon finishes re-enabling their APIs for all recovered
volumes in the affected zone. Not all EBS volumes have been recovered yet,
however.
We continue to see stability in the service and are confident now that that
the service is operating normally for all API calls and all restored EBS
volumes.
8:39 PM PDT Heroku reports that all applications are back online,
though a few still cannot deploy new code via git.
24 April 2011
3:26 AM PDT Amazon re-enables RDS APIs in the affected zone, but not
all databases have been recovered:
The RDS APIs for the affected Availability Zone have now been restored. We
will continue monitoring the service very closely, but at this time RDS is
operating normally in all Availability Zones for all APIs and restored
Database Instances. Recovery is still underway for a small number of
Database Instances in the affected Availability Zone.
5:21 AM PDT Heroku reports that all functionality is fully restored,
including deploying new applications.
7:35 PM PDT Amazon reports that all EBS volumes are back online.
7:39 PM PDT Amazon reports that all RDS databases are back online.
Strategies for surviving a major cloud outage, and associated downtime
1. Rely on a single EBS volume with no snapshots. If you relied on a
single EBS volume with no snapshots, there’s a chance that your site
would have been offline for over 3.5 days after the initial outage.
There’s also at least a 0.1% to 0.5% annual chance of losing your EBS
volume entirely. This is not a recommended approach.
2. Deploy into a single availability zone, with EBS snapshots. In this
scenario, if an availability zone goes down, you can theoretically
restore from backup into another availability zone. During this recent
outage, your site might have remained offline for over 12 hours, and you
might have lost any changes since your last backup (unless you
reintegrated them manually). Given Amazon’s record during 2009
and 2010, this could still give you 99.95% uptime if no other EBS volume
failures occurred. Despite the recent events, this may still be a viable
strategy for many smaller, lower-revenue sites.
3. Rely on multi-AZ RDS databases to fail over to another availability zone. This approach should have lower downtime than
relying on EBS snapshots, but in this case, the multi-AZ RDS failover
mechanisms took longer than 14 hours for some users.
4. Run in 3 AZs, at no more than 60% capacity in each. This is the
approach taken by Netflix, which sailed through this
outage with no known downtime. If a single AZ fails, then the
remaining two zones will be at 90% capacity. And because the extra
capacity is running at all times, Netflix doesn’t need to launch new
instances in the middle of a “bank run” (see below).
5. Replicate data to another AWS region or cloud provider. This is still
the gold standard for sites which require high uptime guarantees.
Unfortunately, it requires transmitting large amounts of data over the
public internet, which is both expensive and slow. In this case,
downtime is a function of external systems and how quickly they can fail
over to the replicated database.
There are some other approaches, such as writing backups and transaction
logs to S3, where they are likely to remain available even in the case of
severe outages.
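The 60% figure in strategy 4 falls out of simple arithmetic. Here is that calculation as a small Python sketch, assuming equally sized zones and load that redistributes evenly across the survivors (both simplifications, and both function names are hypothetical).

```python
def load_after_failure(zones, utilization, failed=1):
    """Per-zone utilization once the failed zones' traffic moves to the rest."""
    surviving = zones - failed
    if surviving <= 0:
        raise ValueError("no zones left to absorb the load")
    # Total work is zones * utilization; it now has to fit in fewer zones.
    return zones * utilization / surviving

def max_safe_utilization(zones, failed=1):
    """Highest steady-state utilization that still fits after a failure."""
    return (zones - failed) / zones

after = load_after_failure(3, 0.60)   # three zones at 60%, one fails
ceiling = max_safe_utilization(3)     # the break-even point for three zones
```

With three zones, anything above roughly two-thirds utilization leaves the survivors overloaded after a single failure, so 60% keeps a small margin.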
Lessons learned
For some excellent post-mortems, see:
Here are some of the most important points:
1. The biggest danger in a well-engineered cloud system is a “run on the bank”, where initial failures trigger error-recovery code, which in turn may drive the load far beyond normal limits. According to Amazon, an initial network problem triggered an
EBS re-mirroring, which in turn overloaded their management plane. This,
in turn, triggered emergency recovery scripts written by AWS customers,
forcing the total load even higher. To stabilize the situation, Amazon
was forced to disable API access to multiple zones. Just as in 1933, the
easiest solution to a bank run is a bank holiday.
2. Availability Zone failures are correlated. Even though Amazon claims
that multiple availability zones should not fail at the same time, it’s
clear that all the availability zones within a region share a management
plane. This means that a large enough failure can overload the shared
management plane.
3. EBS remains the weakest link. Recent months have seen widespread
complaints about EBS, and Netflix has published an article
on working around those limitations.
4. Few cloud providers publish their disaster recovery plans, making it hard to estimate downtime. If you were a Heroku customer last week,
you had no way to evaluate how Heroku would respond to a major outage, or
their plans for keeping your site on the air. As it turns out, they had
widespread dependencies on EBS, and no plan for getting Heroku-based
sites back on the air if an availability zone failed.
5. Test your disaster recovery plan. If you haven’t tested your
disaster recovery plan, then you have no idea how long it will take you
to get back on the air.
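One common way for client-side recovery code to avoid feeding a bank run is to retry with capped exponential backoff and random jitter, so that thousands of customers do not hammer an API in lockstep. This technique is not described in Amazon’s reports; the Python sketch below is a generic illustration, and call_with_backoff and its parameters are hypothetical names.

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(), retrying failures with capped, jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The ceiling doubles each attempt, up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Sleep a random fraction of the ceiling so clients spread out.
            sleep(rng() * ceiling)

# Demo with a fake flaky call and a recorded (not real) sleep.
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise IOError("service unavailable")
    return "ok"

slept = []
result = call_with_backoff(flaky, sleep=slept.append, rng=lambda: 1.0)
```

The sleep and rng parameters exist only so the behavior can be tested without waiting; in real use the defaults apply.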
3 comments
Posted by Eric Kidd
Mon, 20 Dec 2010 19:56:00 GMT
Recently, I was investigating the state of RDF in the Ruby world. Here are
some notes, in case anybody is curious. I have used only a few of
these Ruby RDF libraries, so please feel free to add your own comments with
corrections and other alternatives.
There’s also some stuff about ActiveModel and ActiveRelation down at the end, for people who are interested in Rails 3.
Tags RDF, Rails, Ruby | 6 comments
Posted by Eric Kidd
Wed, 13 Oct 2010 12:09:00 GMT
Yesterday evening, I released an experimental Node.js/Socket.io application:
feedhose.randomhacks.net
Just leave your web browser open, and watch the New York Times headlines scroll by. Dave Winer is sending me some traffic this morning, so I’m going to find out how well this stack scales.
I’ve tested it in IE 6, IE 7, Firefox 3.5 and a ridiculously new version of Chrome, and it runs without any major problems. Please let me know if you encounter any problems in other browsers!
During the day, I’ll update this post with technical details: How it works, how much it costs to run, and some tricks I’m using to keep the system alive.
2 comments
Posted by Eric Kidd
Tue, 29 Dec 2009 20:38:00 GMT
The WordNet database contains all sorts of interesting relationships between words: it can categorize words into hierarchies, find the parts of an object, and answer many other interesting questions.
The code below relies on the NLTK and NetworkX libraries for Python.
Categorizing words
What, exactly, is a dog? It’s a domestic animal and a carnivore, not to mention a physical entity (as opposed to an abstract entity, such as an idea). WordNet knows all these facts:

How do we generate this image? First, we look up the first entry for “dog” in WordNet. This returns a “synset”, or a set of words with equivalent meanings.
dog = wn.synset('dog.n.01')
Next, we compute the transitive closure of the hypernym relationship, or (in English) we look for all the categories to which “dog” belongs, and all the categories to which those categories belong, recursively:
graph = closure_graph(dog,
                      lambda s: s.hypernyms())
After that, we just pass the resulting graph to NetworkX for display:
The implementation
The closure_graph function repeatedly calls fn on the supplied synset, and uses the result to build a NetworkX graph. This code goes at the top of the file, so you can use wn and nx in your own code.
from nltk.corpus import wordnet as wn
import networkx as nx

def closure_graph(synset, fn):
    """Build a directed graph of everything reachable from synset via fn."""
    seen = set()
    graph = nx.DiGraph()

    def recurse(s):
        if s not in seen:
            seen.add(s)
            # NLTK 3 exposes name as a method; older releases used s.name.
            graph.add_node(s.name())
            for s1 in fn(s):
                graph.add_node(s1.name())
                graph.add_edge(s.name(), s1.name())
                recurse(s1)

    recurse(synset)
    return graph
By using a high-quality graph library, we make it much easier to merge, analyze and display our graphs.
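To see what the closure computes without installing the WordNet corpus or NetworkX, here is a standard-library sketch of the same traversal. The HYPERNYMS table is hand-written toy data (a heavily simplified stand-in for WordNet’s real hierarchy), and closure_edges is a hypothetical name for this example.

```python
def closure_edges(start, fn):
    """Collect every (node, parent) edge reachable from start via fn."""
    seen = set()
    edges = []

    def recurse(node):
        if node in seen:
            return  # already expanded; avoids repeated work and cycles
        seen.add(node)
        for parent in fn(node):
            edges.append((node, parent))
            recurse(parent)

    recurse(start)
    return edges

# Toy stand-in for the hypernym relation: word -> broader categories.
HYPERNYMS = {
    "dog": ["canine", "domestic_animal"],
    "canine": ["carnivore"],
    "carnivore": ["animal"],
    "domestic_animal": ["animal"],
    "animal": [],
}

edges = closure_edges("dog", lambda word: HYPERNYMS.get(word, []))
```

The seen set is what makes this a closure rather than a plain tree walk: "animal" is reachable along two paths, but it is only expanded once.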
More graphs
Parts of the finger, generated with synset('finger.n.01') and part_meronyms:

Types of running, generated with synset('run.v.01') and hyponyms:

Tags NLP, Python
Posted by Eric Kidd
Mon, 28 Dec 2009 21:31:00 GMT
The Natural Language Toolkit for Python is a great framework for simple, non-probabilistic natural language processing. Here are some example snippets (and some trouble-shooting notes).
Concordances
We can search for “dog” in Chesterton’s The Man Who Was Thursday:
>>> from nltk.book import *
>>> text9.concordance("dog", width=40)
Displaying 4 of 4 matches:
ead of a cat or a dog , it could not ha
d you ever hear a dog bark like that ?"
aid , " is that a dog -- anybody ' s do
og -- anybody ' s dog ?" There broke up
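If you’re curious what a concordance does under the hood, a keyword-in-context listing is easy to approximate in plain Python. This sketch is not NLTK’s implementation, and the kwic function and sample sentence are invented for illustration; it just shows the idea of centering each match in a fixed-width window.

```python
def kwic(tokens, word, width=40):
    """Return one fixed-width line per occurrence of word in tokens."""
    # Characters available on each side of the keyword and its two spaces.
    half = max(1, (width - len(word) - 2) // 2)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() != word.lower():
            continue
        left = " ".join(tokens[:i])[-half:]        # tail of the left context
        right = " ".join(tokens[i + 1:])[:half]    # head of the right context
        lines.append(f"{left:>{half}} {tok} {right}")
    return lines

tokens = "the quick brown fox saw a dog , and the dog barked".split()
lines = kwic(tokens, "dog")
for line in lines:
    print(line)
```

Truncating by characters rather than by whole words is why the real output above begins and ends mid-word.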
Synonyms and categories
We can use WordNet to look up synonyms:
from nltk.corpus import wordnet
dog = wordnet.synset('dog.n.01')
# In NLTK 3, lemma_names is a method (older releases used an attribute).
print(dog.lemma_names())
This prints:
['dog', 'domestic_dog', 'Canis_familiaris']
We can also look up the “hypernyms”, or larger categories that include the word “dog”:
paths = dog.hypernym_paths()

def simple_path(path):
    # Use the first lemma name of each synset along the path.
    return [s.lemmas()[0].name() for s in path]

for path in paths:
    print(simple_path(path))
This prints:
['entity', 'physical_entity', 'object',
'whole', 'living_thing', 'organism',
'animal', 'domestic_animal', 'dog']
['entity', 'physical_entity', 'object',
'whole', 'living_thing', 'organism',
'animal', 'chordate', 'vertebrate',
'mammal', 'placental', 'carnivore',
'canine', 'dog']
For more neat examples, take a look at the NLTK book.
Installation notes
While setting up NLTK, I bumped into a few problems.
Problem: The dispersion_plot function returns immediately without displaying anything.
Fix: Configure your matplotlib back-end correctly.
Problem: The nltk.app.concordance() GUI fails with the error:
out of stack space (infinite loop?)
Fix: Recompile Tcl with threads. On the Mac:
sudo port install tcl +threads
Tags NLP, Python