You may have heard about problems with Amazon’s server farm earlier this week and, I think, last weekend. I just ran across a technical description of what happened:
When this network connectivity issue occurred, a large number of EBS nodes in a single EBS cluster lost connection to their replicas. When the incorrect traffic shift was rolled back and network connectivity was restored, these nodes rapidly began searching the EBS cluster for available server space where they could re-mirror data. Once again, in a normally functioning cluster, this occurs in milliseconds. In this case, because the issue affected such a large number of volumes concurrently, the free capacity of the EBS cluster was quickly exhausted, leaving many of the nodes “stuck” in a loop, continuously searching the cluster for free space. This quickly led to a “re-mirroring storm,” where a large number of volumes were effectively “stuck” while the nodes searched the cluster for the storage space they needed for their new replicas. At this point, about 13% of the volumes in the affected Availability Zone were in this “stuck” state.
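If you want a feel for why the cluster wedged, here’s a tiny Python sketch of the failure mode as I read it: far more volumes needing a new replica than the cluster has spare capacity, so most of them end up “stuck.” Every number and name here is made up for illustration; this is nothing like the actual EBS code.

```python
# Toy sketch of a "re-mirroring storm" -- hypothetical, NOT Amazon's EBS logic.
# A fixed pool of free capacity is shared by many nodes that all lost their
# replicas at once; each volume tries to claim a slot for its new replica.

from collections import deque

FREE_SLOTS = 100                              # spare capacity (made-up number)
volumes_needing_mirror = deque(range(1000))   # far more volumes than free slots

free_slots = FREE_SLOTS
stuck = []

while volumes_needing_mirror:
    vol = volumes_needing_mirror.popleft()
    if free_slots > 0:
        free_slots -= 1        # re-mirror succeeds
    else:
        stuck.append(vol)      # no space anywhere: the volume is "stuck",
                               # and in the real system it keeps retrying

print(f"re-mirrored: {FREE_SLOTS}, stuck: {len(stuck)} "
      f"({len(stuck) / 1000:.0%} of volumes)")
```

In the toy version the stuck volumes just pile up in a list; in the real cluster they kept hammering the search for free space, which is what turned a capacity shortage into a storm.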
After the initial sequence of events described above, the degraded EBS cluster had an immediate impact on the EBS control plane. When the EBS cluster in the affected Availability Zone entered the re-mirroring storm and exhausted its available capacity, the cluster became unable to service “create volume” API requests. Because the EBS control plane (and the create volume API in particular) was configured with a long time-out period, these slow API calls began to back up and resulted in thread starvation in the EBS control plane. The EBS control plane has a regional pool of available threads it can use to service requests. When these threads were completely filled up by the large number of queued requests, the EBS control plane had no ability to service API requests and began to fail API requests for other Availability Zones in that Region as well. At 2:40 AM PDT on April 21st, the team deployed a change that disabled all new Create Volume requests in the affected Availability Zone, and by 2:50 AM PDT, latencies and error rates for all other EBS-related APIs recovered.
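The thread-starvation part is easy to reproduce in miniature, too. Here’s a hypothetical Python sketch: a small shared thread pool, a handful of slow “create volume” calls that hang until a long timeout, and a cheap request for a healthy zone that has to wait behind them. Pool size, timeout, and function names are all invented, and I’ve shrunk the timeout to seconds so the sketch actually finishes; it only illustrates the mechanism.

```python
# Toy sketch of control-plane thread starvation -- hypothetical, not AWS code.
# A small pool of worker threads serves the whole "Region"; slow create-volume
# calls against the broken zone pin every thread, so requests for healthy
# zones queue up behind them even though they would normally return instantly.

import time
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 4         # regional pool of worker threads (made-up size)
LONG_TIMEOUT = 5.0    # seconds a stuck create-volume call holds a thread

def create_volume_in_degraded_zone(i):
    time.sleep(LONG_TIMEOUT)              # hangs until timeout: no free capacity
    return f"create-volume {i}: timed out"

def describe_volumes_in_healthy_zone(i):
    return f"describe-volumes {i}: ok"    # cheap call, should be instant

pool = ThreadPoolExecutor(max_workers=POOL_SIZE)
start = time.time()

# Enough slow calls to occupy every worker thread...
slow = [pool.submit(create_volume_in_degraded_zone, i) for i in range(POOL_SIZE)]
# ...so even this cheap request for a *different* zone waits behind them.
fast = pool.submit(describe_volumes_in_healthy_zone, 0)

print(fast.result(), f"after {time.time() - start:.1f}s")   # ~LONG_TIMEOUT seconds
pool.shutdown()
```

That delayed “ok” is the whole story: the healthy zones weren’t broken, they just couldn’t get a thread. Which is also why disabling new Create Volume requests brought the other APIs back within minutes.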
I don’t think any gaming companies were affected, though Zynga might have been. But I don’t care. I love failure analysis in general, and in computer systems in particular. I’m such a geek. A 3,000-year-old, fabulously scarlet-coiffed geek, to be sure, but a geek, through and through.
UPDATE: The outage started just after midnight PDT on April 21.