If The Clouds Burst
Andy Greenberg, 06.05.09, 6:00 AM ET
The much-hyped vision of information technology’s cloud computing future is often described with an analogy to the power industry: Just as we pipe in electricity from a central utility, the comparison goes, so will we someday pay for processing and storage as utility-like services in a central location, a model that’s more efficient, more flexible and cheaper.
But cloud computing, argues the National Institute of Standards and Technology’s (NIST) cloud computing lead Peter Mell, can also be described with an analogy to another industry, and one with a less savory track record: banking. In a cloud computing scenario, as in banking, businesses and governments entrust their precious digital assets to a single central repository–one that’s more interested in maximizing profit than in creating a costly safety buffer.
Just as banks responded to financial incentives that caused them to over-leverage their capital, Mell argues that cloud computing vendors, including Google and Amazon, face pressures to leverage as many of their computing resources as possible, potentially risking the same sort of sudden and catastrophic collapse that’s sent the U.S. economy reeling.
NIST, a part of the U.S. Department of Commerce, doesn’t have the power to impose regulation on cloud computing vendors. But the agency does function as a creator of standards and a watchdog group with a close eye on technology’s security and its impact on the economy. Forbes spoke with Mell about cloud computing’s appeal, what he sees as its hidden instabilities, and what needs to happen to prevent a future cloud computing meltdown.
Forbes: What’s the cloud computing “crisis” scenario that you worry about?
Peter Mell: NIST’s definition of cloud computing talks about resource pooling, location independence and elasticity. Those characteristics have led to the technology’s increasing adoption for important applications, and they give users the appearance of unlimited capacity. The idea is they can have as much storage or processing as they want whenever they want. The only limiting factor is the cost.
But clouds do have a capacity limit, and major cloud vendors don’t publish their overall capacity or utilization rates. So we have no idea if they’re sitting on several idle data centers waiting for customers or if they’re scrambling feverishly to add new capacity to keep up with demand.
Are you comparing the situation to the financial sector, where the economy suffered from a lack of transparency into the industry’s risk?
The analogy is that in the banking crisis we’ve been very concerned with the cash reserves that banks have to ensure their viability to extend credit. We’re very interested in their cash cushion, and in the banking industry there’s visibility into the size of that cushion.
In the cloud computing industry there’s no visibility into that reserve capacity. So as we become dependent on cloud computing, we’re relying on cloud computing vendors to have enough reserve capacity to continue our operations. And they have to pay for any unused computers they maintain; keeping a large reserve costs them resources. So it’s a valid question: If you’re profiting more by reserving less capacity, how much are you actually keeping for us?
How much capacity do you think vendors should keep in reserve?
Cloud computing is new enough that I’m not sure anyone knows. There will be certain situations where the demand suddenly increases or the supply disappears: For instance, a natural disaster, or a hacking attack, or a large influx of customers, as may happen when the economy rebounds over the next few years. And just as important as their reserve is how fast they can add capacity.
Do we need to force cloud vendors to publish their capacity? Or do we need to create a sort of FDIC for cloud computing that can rescue failed clouds?
I can’t take part in any discussion of regulatory policy. But those are extremely interesting ideas.
You mention some scary hypotheticals, but are there real scenarios where this has happened?
We’ve seen clouds go down. Currently we don’t have very good portability between clouds, so the multi-cloud outage concern will only exist in the future.
Within a single cloud, we’ve certainly had a history of public clouds experiencing outages. None of them have been of great duration. We know they can go down, but it’s never been for a significant amount of time, and to my knowledge we’ve never had a cloud overcapacity situation cause an outage.
So when Google’s services temporarily went down last month, that wasn’t due to undercapacity? (See “An (Internet) Day Without Google.”)
No, though it did show us what it looks like when an entire cloud goes down.
When a cloud is overloaded, would it actually collapse, or merely not work as well?
When an individual server goes above roughly 80% utilization, you get thrashing–the computer is constantly moving data between disk and memory, and it slows to a crawl–nothing works.
If a cloud isn’t built to expect an overcapacity situation, it could conceivably result in the entire cloud similarly ceasing to work. What I would hope, and what is also possible, is that clouds could be architected so that, when they reach their capacity limit, applications can no longer request additional computing capacity. They could gracefully degrade each application’s usage, which could keep an individual application from working but allow the cloud itself to remain functional. It’s possible that cloud computing will be architected that way, and it’s critical, because otherwise the individual servers making up the cloud could be overloaded and cease to function altogether.
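The graceful-degradation idea Mell describes can be sketched as a simple admission-control policy. This is a hypothetical illustration, not any vendor's actual mechanism: the `CloudScheduler` class, its capacity units, and the 20% reserve threshold are all invented for the example.

```python
class CloudScheduler:
    """Toy admission controller: refuse new allocations near capacity
    instead of letting individual servers thrash and fail outright."""

    def __init__(self, total_units, reserve_fraction=0.2):
        self.total = total_units
        # Hypothetical safety reserve, analogous to a bank's cash cushion.
        self.reserve = int(total_units * reserve_fraction)
        self.used = 0

    def request(self, units):
        # Grant the request only if it leaves the reserve intact.
        if self.used + units <= self.total - self.reserve:
            self.used += units
            return True
        # Degrade gracefully: deny this application more capacity,
        # but the cloud as a whole keeps running.
        return False

sched = CloudScheduler(total_units=100, reserve_fraction=0.2)
print(sched.request(70))  # True  - within the 80-unit soft limit
print(sched.request(20))  # False - would eat into the reserve
```

The design choice is that overload shows up as a refused request to one tenant rather than as thrashing servers that take the whole cloud down.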
What about the interoperability between clouds? Will the ability to move data and applications from one cloud to another alleviate some of the risk of cloud overload?
There’s a lot of effort now to achieve that data and application portability, moving from one cloud to another. But if you have many vendors, all with razor-thin capacity reserves, and one provider loses a data center or hits an over-capacity situation because of greater economic activity or usage, customers could migrate en masse to another cloud vendor. If the cloud vendors don’t gracefully handle the capacity situation–I hope they would, but we don’t know if they have–there could be a chain reaction of outages.
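The chain reaction Mell describes can be illustrated with a toy model: when one cloud fails, its load is redistributed to the survivors, which may push them past capacity in turn. The function, the capacity numbers, and the even-split migration rule are assumptions made for the sketch, not a real migration protocol.

```python
def cascade(capacities, loads, failed):
    """Toy cascading-overload model: displaced load from failed clouds
    is split evenly among surviving clouds; any survivor pushed past
    its capacity fails too, until the failure set stops growing."""
    failed = set(failed)
    while True:
        up = [i for i in range(len(loads)) if i not in failed]
        if not up:
            return failed  # total blackout: every cloud is down
        displaced = sum(loads[i] for i in failed)
        share = displaced / len(up)  # naive even-split migration
        newly = {i for i in up if loads[i] + share > capacities[i]}
        if not newly:
            return failed  # survivors absorbed the load; cascade stops
        failed |= newly

# Three providers with razor-thin reserves: one outage overloads the rest.
print(cascade(capacities=[100, 100, 100], loads=[90, 85, 80], failed=[0]))
# With ample reserves, the same outage is absorbed and the cascade stops.
print(cascade(capacities=[200, 200, 200], loads=[50, 50, 50], failed=[0]))
```

With thin reserves the single failure takes down all three providers; with ample reserves it stays contained, which is exactly the transparency question Mell raises about unpublished reserve capacity.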
So you’re saying interoperability between clouds could also lead to a kind of cascading cloud blackout?
Interoperability could be a good thing too. If there were an outage in one cloud, customers could be redistributed among other clouds. But a black swan sort of event, or any sudden increase in demand, could also trigger this kind of chain reaction.
If there were a massive cloud outage, who would be held liable?
The typical service level agreement just refunds the customers’ cost of service. It doesn’t cover the cost resulting from an outage. So the cloud vendor wouldn’t pay for the full cost of those outages to the customer.
That includes the SLAs of companies including Google and Amazon?
Right. And another issue is that many of these vendors also use their clouds for their own internal purposes. So in an overcapacity situation, the question would arise of whether customers receive service or the vendor supports its own uses first.
Just as when a bank collapses, some investors are paid back their investment while others lose money.
Exactly. Even in the government, we’ve talked about creating community clouds, where one agency will host another’s applications. We have to ask the question, when there’s a problem, who loses their capacity–the host agency or the one being hosted? Those issues need to be worked out in any sort of shared cloud, and to my knowledge they haven’t been.