Storm Clouds on the Horizon?

This post sponsored by the Enterprise CIO Forum and HP.

The recent Amazon cloud outage at its Northern Virginia data center will raise concerns among CIOs looking to the “public cloud” to improve IT service delivery and reduce day-to-day operating costs. Industry claims of superior “up-time” performance, reliability and massive redundancy must now be revisited and re-evaluated. In these early days immediately following the outage, the prevailing sentiment will likely be one of shaken confidence in “public cloud” services, accompanied by much finger-pointing amid claims of breached service level agreements (SLAs) and misrepresentation of the cloud’s resiliency. With more time and a calmer approach, however, most enterprises will realize that no cloud implementation is fail-proof and that shifting significant portions of their operations to the cloud should not be undertaken without adequate contingency plans and risk mitigation.

The outage appeared to be limited to a single “availability zone” in only one region. AWS customers that heeded Amazon’s advice to spread their services across multiple availability zones to ensure resiliency felt little impact. Those that chose not to pay the extra cost associated with multiple zones may be regretting that decision now. Amazon says its web services are now operating normally for most customers, and the company said it will post a detailed report on what went wrong last week.

The after-action report is going to be interesting, but lessons learned are already springing up. The key points seem to be:

  • Cloud outages may be rare but they can happen. Enterprises need to architect their cloud services for failure by spreading mission-critical, customer-facing services across multiple zones, physically separate data centers and/or multiple cloud providers. Putting 100% of your IT service eggs in one “cloud” basket is risky.
  • Service level agreements are important and will become even more so. Gartner’s Lydia Leong provides a great recap of what went wrong and insight into Amazon’s standard EC2 SLAs.

Amazon’s SLA for EC2 is 99.95% for multi-availability zone deployments. That means you should expect that you can have about 4.5 hours of total region downtime each year without Amazon violating their SLA. Note, by the way, that this outage does not actually violate their SLA. Their SLA defines unavailability as a lack of external connectivity to EC2 instances, coupled with the inability to provision working instances. In this case, EC2 was just fine by that definition. It was Elastic Block Store (EBS) and Relational Database Service (RDS) which weren’t, and neither of those services has an SLA.
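It is worth checking that downtime budget yourself before signing an SLA. A quick back-of-the-envelope calculation (ignoring leap years and any fine print in how the provider defines a “service year”) looks like this:

```python
# Back-of-the-envelope downtime budget implied by an uptime SLA percentage.
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; ignores leap years

def annual_downtime_hours(uptime_pct: float) -> float:
    """Hours per year a provider can be down without breaching its SLA."""
    return HOURS_PER_YEAR * (1 - uptime_pct / 100)

for pct in (99.95, 99.9, 99.0):
    print(f"{pct}% uptime allows {annual_downtime_hours(pct):.2f} hours down per year")
```

A 99.95% commitment works out to roughly 4.4 hours per year, in line with the figure quoted above; note how quickly the budget grows as the nines drop away.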

It seems likely that some of Amazon’s customers, those most affected by the outage, will seek to renegotiate SLAs to ensure they have more “teeth” and offer compensation for lost revenue.

  • Plan for disaster and how to recover from it, even in the cloud. The outage highlighted the fact that some of Amazon’s customers did not have a disaster recovery strategy. A number of solution providers now cover cloud disaster recovery and most can recover physical or virtual machines in a cloud within minutes. But if you don’t have a plan for it, it won’t happen.
  • Cloud deployments are still cheaper than traditional data centers. For most organizations, the cost of deploying in the cloud remains roughly ten times cheaper than building your own data center or even a private cloud.
  • Hybrid IT service delivery can help reduce cloud exposure. Most mature enterprises move to the cloud in stages, resulting in a hybrid portfolio of IT services provisioned from a mix of public cloud, private cloud and traditional IT infrastructure. While these hybrid service models can be a challenge to manage effectively, they do offer some built-in resiliency in case any one segment fails.


