<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="/stylesheets/rss.css"?>
<rss version="2.0" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Random Hacks: AWS outage timeline &amp; downtimes by recovery strategy</title>
    <link>http://www.randomhacks.net/articles/2011/04/25/aws-outage-timeline-and-recovery-strategy-downtimes</link>
    <language>en-us</language>
    <ttl>40</ttl>
    <description>Technology and Other Fun Stuff</description>
    <item>
      <title>"AWS outage timeline &amp; downtimes by recovery strategy" by Eric</title>
      <description>&lt;p&gt;The links should be fixed. Thanks, linkchecker!&lt;/p&gt;</description>
      <pubDate>Mon, 25 Apr 2011 19:44:49 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:a551bba6-cd2b-40bf-8e8b-9f4703b22ccf</guid>
      <link>http://www.randomhacks.net/articles/2011/04/25/aws-outage-timeline-and-recovery-strategy-downtimes#comment-818</link>
    </item>
    <item>
      <title>"AWS outage timeline &amp; downtimes by recovery strategy" by linkchecker</title>
      <description>&lt;p&gt;Under &amp;#8220;Lessons Learned&amp;#8221;, #3, both links point to the reddit blog. Presumably one should point to Netflix instead.&lt;/p&gt;</description>
      <pubDate>Mon, 25 Apr 2011 16:56:45 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:b3348deb-8a96-4c8c-8210-c938d0c59070</guid>
      <link>http://www.randomhacks.net/articles/2011/04/25/aws-outage-timeline-and-recovery-strategy-downtimes#comment-817</link>
    </item>
    <item>
      <title>"AWS outage timeline &amp; downtimes by recovery strategy" by Noah</title>
      <description>&lt;p&gt;I&amp;#8217;ve been following this story pretty closely. We&amp;#8217;ve used Amazon in the past, and it is fascinating to me that the &amp;#8216;whole&amp;#8217; internet could go down as we consolidate on a few cloud providers.&lt;/p&gt;


	&lt;p&gt;Compare this to even a few years ago and one web host didn&amp;#8217;t host more than 1 or 2 &amp;#8216;big&amp;#8217; sites back then. Now Reddit et al go down in one big fireworks display.&lt;/p&gt;</description>
      <pubDate>Mon, 25 Apr 2011 16:33:42 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:d11b5bdc-9573-44b7-9296-7f7c076ea7f5</guid>
      <link>http://www.randomhacks.net/articles/2011/04/25/aws-outage-timeline-and-recovery-strategy-downtimes#comment-816</link>
    </item>
    <item>
      <title>AWS outage timeline &amp; downtimes by recovery strategy</title>
      <description>&lt;p&gt;Renting a server from Amazon is no substitute for a disaster recovery plan.&lt;/p&gt;

&lt;p&gt;If you run your own servers, you need backups.  If you can&amp;#8217;t afford to go
down, you also need offsite replication. But if you lease servers in the
cloud, how can you protect against problems like this week&amp;#8217;s Amazon outage?&lt;/p&gt;

&lt;p&gt;Keep reading for a timeline of the outage, plus a list of recovery
strategies and the minimum downtime that each would have incurred.&lt;/p&gt;

&lt;h3&gt;A timeline of the Amazon outage&lt;/h3&gt;

&lt;p&gt;Here&amp;#8217;s a timeline of what went wrong, and when it was fixed. Note, in
particular, the window from roughly 1:00 AM to 1:48 PM PDT when several of
Amazon&amp;#8217;s availability zones were partially unavailable. (For a
glossary of Amazon Web Services terminology, see the bottom of this post.)&lt;/p&gt;

&lt;p&gt;I&amp;#8217;ve also included Heroku&amp;#8217;s status reports on this timeline.&lt;/p&gt;

&lt;div style="font-weight: bold; text-align: center"&gt;21 April 2011&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;1:15 AM PDT&lt;/strong&gt; Heroku begins investigating high error rates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1:41 AM PDT&lt;/strong&gt; Amazon admits they are seeing problems with EBS volumes and
EC2 instances in US East 1.  The outage affects multiple availability
zones.  Amazon later described the problem as follows:&lt;/p&gt;

&lt;blockquote&gt;
A networking event early this morning triggered a large amount of
re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a
shortage of capacity in one of the US-EAST-1 Availability Zones, which
impacted new EBS volume creation as well as the pace with which we could
re-mirror and recover affected EBS volumes. Additionally, one of our
internal control planes for EBS has become inundated such that it&amp;#8217;s
difficult to create new EBS volumes and EBS backed instances. We are
working as quickly as possible to add capacity to that one Availability
Zone to speed up the re-mirroring, and working to restore the control plane
issue. We&amp;#8217;re starting to see progress on these efforts, but are not there
yet. We will continue to provide updates when we have them.
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;1:52 AM PDT&lt;/strong&gt; Heroku reports that applications and tools are functioning
intermittently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3:05 AM PDT&lt;/strong&gt; Amazon reports that RDS databases replicated across
multiple Availability Zones are not failing over as expected.  This is a
big deal, because these multi-AZ RDS databases are intended to be an
expensive, highly-reliable option for storing data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1:48 PM PDT&lt;/strong&gt; EBS volumes and EC2 instances are now working correctly in
all but one availability zone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2:15 PM PDT&lt;/strong&gt; Heroku reports that they can now launch new EBS instances.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2:35 PM PDT&lt;/strong&gt; Amazon restores access to &amp;#8220;majority&amp;#8221; of multi-AZ RDS
databases.  (There&amp;#8217;s nothing in the Amazon timeline to indicate when &lt;em&gt;all&lt;/em&gt;
of the multi-AZ RDS databases came back online.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3:07 PM PDT&lt;/strong&gt; Heroku brings core services back online, and restores
service to many applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4:15 PM PDT&lt;/strong&gt; Heroku reports: &amp;#8220;In some cases the process of bringing many
applications online simultaneously has created intermittent availability
and elevated error rates.&amp;#8221;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8:27 PM PDT&lt;/strong&gt; Heroku finishes restoring API services.&lt;/p&gt;

&lt;div style="font-weight: bold; text-align: center"&gt;22 April 2011&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;2:19 AM PDT&lt;/strong&gt; Heroku reports that all dedicated databases are back
online.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6:25 AM PDT&lt;/strong&gt; Heroku reports that new application creation is enabled.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1:30 PM PDT&lt;/strong&gt; Amazon reports &amp;#8220;majority&amp;#8221; of EBS volumes in affected zone
have been recovered.  Remaining volumes will require a more time-consuming
recovery process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;9:11 PM PDT&lt;/strong&gt; Amazon reports that &amp;#8220;control plane&amp;#8221; congestion is limiting
the speed at which they can recover the remaining volumes.&lt;/p&gt;

&lt;div style="font-weight: bold; text-align: center"&gt;23 April 2011&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;11:54 AM PDT&lt;/strong&gt; Amazon is still wrestling with control plane congestion.&lt;/p&gt;

&lt;blockquote&gt;
Quick update. We&amp;#8217;ve tried a couple of ideas to remove the bottleneck in
opening up the APIs, each time we&amp;#8217;ve learned more but haven&amp;#8217;t yet solved
the problem.  We are making progress, but much more slowly than we&amp;#8217;d
hoped. Right now we&amp;#8217;re setting up more control plane components that should
be capable of working through the backlog of attach/detach state changes
for EBS volumes. These are coming online, and we&amp;#8217;ve been seeing progress on
the backlog, but it&amp;#8217;s still too early to tell how much this will accelerate
the process for us.
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;8:39 PM PDT&lt;/strong&gt; Amazon finishes re-enabling their APIs for all recovered
volumes in the affected zone.  Not all EBS volumes have been recovered yet,
however.&lt;/p&gt;

&lt;blockquote&gt;
We continue to see stability in the service and are confident now that that
the service is operating normally for all API calls and all restored EBS
volumes.
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;8:39 PM PDT&lt;/strong&gt; Heroku reports that all applications are back online,
though a few still cannot deploy new code via git.&lt;/p&gt;

&lt;div style="font-weight: bold; text-align: center"&gt;24 April 2011&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;3:26 AM PDT&lt;/strong&gt; Amazon re-enables RDS APIs in the affected zone, but not
all databases have been recovered:&lt;/p&gt;

&lt;blockquote&gt;
The RDS APIs for the affected Availability Zone have now been restored. We
will continue monitoring the service very closely, but at this time RDS is
operating normally in all Availability Zones for all APIs and restored
Database Instances. Recovery is still underway for a small number of
Database Instances in the affected Availability Zone.
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;5:21 AM PDT&lt;/strong&gt; Heroku reports that all functionality is fully restored,
including deploying new applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7:35 PM PDT&lt;/strong&gt; Amazon reports that all EBS volumes are back online.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7:39 PM PDT&lt;/strong&gt; Amazon reports that all RDS databases are back online.&lt;/p&gt;

&lt;h3&gt;Strategies for surviving a major cloud outage, and associated downtime&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Rely on a single EBS volume with no snapshots.&lt;/strong&gt; If you relied on
  a single EBS volume with no snapshots, there&amp;#8217;s a chance that your site
  would have been offline for &lt;strong&gt;over 3.5 days&lt;/strong&gt; after the initial outage.
  There&amp;#8217;s also at least a 0.1% to 0.5% annual chance of losing your EBS
  volume entirely.  This is &lt;em&gt;not&lt;/em&gt; a recommended approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Deploy into a single availability zone, with EBS snapshots.&lt;/strong&gt; In this
  scenario, if an availability zone goes down, you can theoretically
  restore from backup into another availability zone.  During this recent
  outage, your site might have remained offline for over &lt;strong&gt;12 hours&lt;/strong&gt;, and you
  might have lost any changes since your last backup (unless you
  reintegrated them manually).  Given Amazon&amp;#8217;s record during 2009
  and 2010, this could still give you 99.95% uptime if no other EBS volume
  failures occurred.  Despite the recent events, this may still be a viable
  strategy for many smaller, lower-revenue sites.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Rely on multi-AZ RDS databases to fail over to another availability zone.&lt;/strong&gt; This approach &lt;em&gt;should&lt;/em&gt; have lower downtime than
  relying on EBS snapshots, but in this case, the multi-AZ RDS failover
  mechanisms took &lt;strong&gt;longer than 14 hours&lt;/strong&gt; for some users.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Run in 3 AZs, at no more than 60% capacity in each.&lt;/strong&gt; This is the
  approach taken by &lt;a href="https://twitter.com/#!/adrianco/status/61076362680745984"&gt;Netflix&lt;/a&gt;, which sailed through this
  outage with &lt;strong&gt;no known downtime&lt;/strong&gt;.  If a single AZ fails, then the
  remaining two zones will be at 90% capacity.  And because the extra
  capacity is running at all times, Netflix doesn&amp;#8217;t need to launch new
  instances in the middle of a &amp;#8220;bank run&amp;#8221; (see below).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Replicate data to another AWS region or cloud provider.&lt;/strong&gt; This is still
  the gold standard for sites which require high uptime guarantees.
  Unfortunately, it requires transmitting large amounts of data over the
  public internet, which is both expensive and slow.  In this case,
  downtime is a function of external systems and how quickly they can fail
  over to the replicated database.&lt;/p&gt;

&lt;p&gt;There are some other approaches, such as writing backups and transaction
logs to S3, where they are likely to remain available even in the case of
severe outages.&lt;/p&gt;
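
&lt;p&gt;The capacity arithmetic behind the three-zone strategy can be sketched in a few lines of Python. This is only an illustration, not anything Netflix has published: the function names are hypothetical, and the sketch assumes that load is spread evenly and that every zone has identical capacity.&lt;/p&gt;

```python
from fractions import Fraction


def utilization_after_failure(zones, utilization, failed=1):
    """Per-zone utilization once `failed` zones are lost.

    Assumption: load is spread evenly and all zones have equal capacity.
    """
    if failed >= zones:
        raise ValueError("at least one zone must survive")
    surviving = zones - failed
    # The total load (zones * utilization) is shared by the survivors.
    return Fraction(utilization) * zones / surviving


def max_safe_utilization(zones, failed=1):
    """Highest steady-state utilization that still fits after a failure."""
    if failed >= zones:
        raise ValueError("at least one zone must survive")
    return Fraction(zones - failed, zones)


if __name__ == "__main__":
    after = utilization_after_failure(3, Fraction(60, 100))
    print(f"3 zones at 60%: {float(after):.0%} per zone after losing one")
    print(f"ceiling for 3 zones: {float(max_safe_utilization(3)):.1%}")
```

&lt;p&gt;Under those assumptions, three zones running at 60% land at 90% after losing one zone, and anything above roughly 66.7% per zone would leave no room to absorb the failure.&lt;/p&gt;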

&lt;h3&gt;Lessons learned&lt;/h3&gt;

&lt;p&gt;For some excellent post-mortems, see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://agilesysadmin.net/ec2-outage-lessons"&gt;Today’s EC2 / EBS Outage: Lessons learned&lt;/a&gt;. A good overall analysis, with recommendations.&lt;/li&gt;
&lt;li&gt;&lt;a href="http://joyeur.com/2011/04/22/on-cascading-failures-and-amazons-elastic-block-store/"&gt;On Cascading Failures and Amazon’s Elastic Block Store&lt;/a&gt;. How emergency fail-over code can actually make an outage worse.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Update:&lt;/em&gt; &lt;a href="http://blog.rightscale.com/2011/04/25/amazon-ec2-outage-summary-and-lessons-learned/"&gt;Amazon EC2 outage: summary and lessons learned&lt;/a&gt;. RightScale has posted an excellent post-mortem. They note that the outage actually spread to more EBS volumes over time, and link to a long list of related posts. (They also claim that the other AZs were functioning again after 4 hours, which doesn&amp;#8217;t match either Amazon&amp;#8217;s public claims or the experiences of people I&amp;#8217;ve spoken to.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here are some of the most important points:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The biggest danger in a well-engineered cloud system is a &amp;#8220;&lt;a href="http://joyeur.com/2011/04/22/on-cascading-failures-and-amazons-elastic-block-store/"&gt;run on the bank&lt;/a&gt;&amp;#8221;, where initial failures trigger error-recovery code, which in turn may drive the load far beyond normal limits.&lt;/strong&gt; According to Amazon, an initial network problem triggered an
  EBS re-mirroring, which in turn overloaded their control plane.  This,
  in turn, triggered emergency recovery scripts written by AWS customers,
  forcing the total load even higher.  To stabilize the situation, Amazon
  was forced to disable API access to multiple zones.  Just as in 1933, the
  easiest solution to a bank run is a &lt;a href="http://en.wikipedia.org/wiki/Emergency_Banking_Act"&gt;bank holiday&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Availability Zone failures are correlated.&lt;/strong&gt; Even though Amazon claims
  that multiple availability zones should not fail at the same time, it&amp;#8217;s
  clear that all the availability zones within a region share a control
  plane.  This means that a large enough failure in one zone can overload
  the shared control plane and affect the others.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. EBS remains the weakest link.&lt;/strong&gt; Recent months have seen widespread
  &lt;a href="http://blog.reddit.com/2011/03/why-reddit-was-down-for-6-of-last-24.html"&gt;complaints about EBS&lt;/a&gt;, and Netflix has published an article
  on &lt;a href="http://perfcap.blogspot.com/2011/03/understanding-and-using-amazon-ebs.html"&gt;working around those limitations&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Few cloud providers publish their disaster recovery plans, making it hard to estimate downtime.&lt;/strong&gt;  If you were a Heroku customer last week,
  you had no way to evaluate how Heroku would respond to a major outage, or
  their plans for keeping your site on the air.  As it turns out, they had
  widespread dependencies on EBS, and no plan for getting Heroku-based
  sites back on the air if an availability zone failed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Test your disaster recovery plan.&lt;/strong&gt;  If you haven&amp;#8217;t tested your
  disaster recovery plan, then you have no idea how long it will take you
  to get back on the air.&lt;/p&gt;&lt;p&gt;&lt;a href="http://www.randomhacks.net/articles/2011/04/25/aws-outage-timeline-and-recovery-strategy-downtimes"&gt;Read More&lt;/a&gt;&lt;/p&gt;</description>
      <pubDate>Mon, 25 Apr 2011 08:41:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:a86cd5b6-62f2-493f-adab-e9955d543b3a</guid>
      <author>Eric Kidd</author>
      <link>http://www.randomhacks.net/articles/2011/04/25/aws-outage-timeline-and-recovery-strategy-downtimes</link>
      <trackback:ping>http://www.randomhacks.net/articles/trackback/815</trackback:ping>
    </item>
  </channel>
</rss>
