
Yes, the cloud is still safe despite the AWS outage – if you learn from failure


By Dave Bartoletti

Earlier this week, many Internet sites and business services suffered disruptions. Some early speculation held that it was another hack like the October DDoS attack on Dyn. The reality was less nefarious but nonetheless far-reaching. Amazon’s AWS unit soon revealed a failure in its S3 storage service in a particular region (Eastern US). Just about anyone who uses AWS has a dependency tied to S3, which maximized the impact of this outage.

However, Forrester’s position can be briefly summarized as, “Wake up, but don’t panic!” The major public cloud provider platforms are incredibly reliable, but they all have failures. This was one of those, it was widely felt, and it was important – but it changes nothing about the viability or future prospects of public cloud.

While the true root cause has not yet been disclosed, this particular incident highlights some aspects of business technology – especially in the cloud – that all companies should understand:

  • Technology breaks – get accustomed to that. No tech platform is bulletproof. Cloud services do come close because they employ redundancy in their services, but they do indeed fail. Even a 99.99% availability target allows for roughly 53 minutes of downtime a year. Don’t count on any layer of your technology stack to be there all the time. Expect it to fail and design accordingly. Then test. Then test again.
  • AWS remains an exemplar of dependable design. AWS continually proves itself among the most innovative companies in the tech world. Its data centers, hardware, software, security, and overall philosophy are rock-solid. With trillions of data objects under management, the track record of S3 is remarkable. The typical on-premises data center is orders of magnitude more fragile.
  • Other cloud services are equally vulnerable. Don’t rush into the arms of Microsoft, Google, or IBM based on this event alone – that’s not a rational response. It’s certainly rational to explore the object storage offerings at AWS’s competitors regularly, though, and we encourage that. In fact, other cloud services also fail and also make the news. Because AWS commands such a large share of the market, its failures tend to make bigger news.
  • Dependability is in YOUR hands, not those of your suppliers. The bottom line here is that your providers – regardless of who they are – are not responsible for your business or your storage resiliency strategy. You are! As with any technology stack, you must choose the cloud materials you use, design solid apps around them, assemble them, and maintain the assembly. The assembly is where you attain true resilience and how you deliver dependability to your customers. Note that S3 was down in one region – not all – and many customer apps handled the failure.
  • Tight dependencies are fragile. One main principle of good systems design is to reduce rigid dependencies. Design your code to adapt to such failure, and your code is less likely to fail. Newer methods and technologies allow modern developers to loosen the coupling between software components so failures are less likely to cascade through the dependency chain.
  • Human error is your bigger enemy. We don’t yet know precisely why S3 failed, but our best guess is either a software glitch or human error. Most technology failures are the result of smart people doing unfortunate things. Every public cloud provider employs an extreme amount of automation to eliminate human error, but, like storage services, that automation is not foolproof.
  • Cloud is not the untamed frontier. Many cloud critics have already pounced on this failure as vindication of their anti-cloud positions. We respectfully disagree. Significant business is already in production in public cloud because these services keep getting better, faster, broader, and yes, safer. And the value keeps expanding, proven by the rapid growth in cloud service spending. Cloud is not only here to stay; it will be the platform for much more of your technology in the future.
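
To put the availability arithmetic in concrete terms, here is a minimal Python sketch that converts an availability target into a yearly downtime budget, and shows how availability erodes when several services must all be up at once. The function names are illustrative only, and the chained calculation assumes the services fail independently, which real systems rarely do exactly.

```python
from math import prod
from typing import Iterable

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Yearly downtime (in minutes) that an availability target permits."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def serial_availability(availabilities_percent: Iterable[float]) -> float:
    """Combined availability (percent) of services that must ALL be up.

    Assumes independent failures, so the individual probabilities multiply.
    """
    return 100 * prod(a / 100 for a in availabilities_percent)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% target -> {minutes:.1f} minutes of downtime per year")

    # Hypothetical app that hard-depends on five services, each at 99.99%.
    chained = serial_availability([99.99] * 5)
    print(f"five chained services -> {chained:.3f}% combined availability")
```

The chained figure is the reason tight dependencies are fragile: every hard dependency you add lowers the ceiling on your own availability, whichever provider sits underneath.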

Do hold your suppliers accountable for their performance. This includes AWS and other cloud providers. Forrester will be staying close to this situation and we’ll share what we know about the root cause as soon as we can verify it. For now, assume failures will happen. Build around them. Netflix has demonstrated this principle beautifully at incredible scale with its Chaos Monkey approach. Chaos Monkey navigates the Netflix systems and intentionally wreaks havoc (chaos). When things break, Netflix learns, redesigns, and starts breaking things again.
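
The "assume failure, then test it" idea can be sketched in a few lines of Python. This is not Netflix's Chaos Monkey and it does not call any real AWS API; every name here (`RegionUnavailable`, `make_fetcher`, `fetch_with_fallback`, the `primary` and `secondary` labels) is hypothetical. It only illustrates the pattern: inject a failure into one regional dependency on purpose and check that the application falls back to another.

```python
import random
from typing import Callable, Dict, Optional, Sequence


class RegionUnavailable(Exception):
    """Raised by a (hypothetical) regional fetcher when its region is down."""


def make_fetcher(region: str, store: Dict[str, bytes],
                 failure_rate: float, rng: random.Random) -> Callable[[str], bytes]:
    """Build a fake regional storage client that fails at the given rate."""
    def fetch(key: str) -> bytes:
        # Fault injection: fail with probability `failure_rate`.
        if rng.random() < failure_rate:
            raise RegionUnavailable(f"{region} is unavailable")
        return store[key]
    return fetch


def fetch_with_fallback(key: str,
                        fetchers: Sequence[Callable[[str], bytes]]) -> bytes:
    """Try each regional fetcher in order; return the first success."""
    last_error: Optional[Exception] = None
    for fetch in fetchers:
        try:
            return fetch(key)
        except RegionUnavailable as exc:
            last_error = exc  # remember the failure and try the next region
    raise RegionUnavailable(f"all regions failed for {key!r}") from last_error


if __name__ == "__main__":
    rng = random.Random(42)
    replicated = {"report.csv": b"quarterly numbers"}

    # Chaos experiment: the primary region is forced to fail every time.
    primary = make_fetcher("primary", replicated, failure_rate=1.0, rng=rng)
    secondary = make_fetcher("secondary", replicated, failure_rate=0.0, rng=rng)

    data = fetch_with_fallback("report.csv", [primary, secondary])
    print("served from fallback:", data == replicated["report.csv"])
```

The sketch assumes the data is already replicated to the second region; in practice that replication, and its cost, is the design decision the customer has to make.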

This is not in any way a “blame the user” post. AWS owns the failure and the customer experience problems it caused. AWS customers should keep the pressure on for a root cause description and a strategy to avoid future S3 problems. In parallel, all cloud customers should take this opportunity to check their own apps for any single point of failure.

Dave Bartoletti is a principal analyst at Forrester, serving infrastructure and operations professionals. Follow Dave on Twitter: @DaveBartoletti.




