Many of us were affected one way or another by the Amazon S3 outage that occurred at the end of February 2017. The outage lasted for a few hours and affected around 150K websites and 121K unique domains that used S3 in one capacity or another. Some sites were crippled, while others were unavailable altogether.
Many service providers use cloud services with the misguided notion that cloud providers are immune to outages. Nothing could be further from the truth. All of the top cloud providers, including Amazon, have had serious outages. Even Amazon’s AWS service status page relied on S3 to store its health marker graphics, so the status page continued to show all services green despite all evidence to the contrary. Perhaps this outage will serve as a wake-up call that you can’t put all of your eggs in one basket, or maybe not.
Vallum Software relies on a number of service providers that use Amazon S3 to provide their services. While it was not surprising that this outage occurred in the first place, the interesting aspect was how the affected sites reacted to it. A few of the sites posted notifications that there was a problem, and a few of those identified it as the result of an S3 outage. The bulk of them posted nothing. I wonder how many of them even knew there was an outage, or knew what was causing it.
There are several important points to mention here. First, service providers need a failover for their cloud providers in the event of an outage. They cannot place all their eggs in one basket with a false sense of security. Second, service providers need network monitoring in place to identify outages, slowdowns or network issues, so that they can quickly pinpoint the cause and address it, or at a minimum alert their customers to it ASAP. My guess is that many of these providers had nothing in place and simply waited for their customers to alert them to any issues. Relying on your customers for network monitoring is not only bad business, it also casts you in a very poor light.
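To make the failover and monitoring points above concrete, here is a minimal Python sketch of one way a provider might combine them: probe a list of storage endpoints in priority order, fall back to the first healthy one, and raise an alert whenever the primary is skipped. It is an illustration under stated assumptions rather than a description of how any particular provider works. The function names (`http_probe`, `pick_endpoint`), the example URLs and the use of `print` as the alert channel are all hypothetical.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional, Sequence


def http_probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers successfully within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Connection errors, timeouts, HTTP error statuses and malformed
        # URLs are all treated as "unhealthy".
        return False


def pick_endpoint(
    endpoints: Sequence[str],
    probe: Callable[[str], bool] = http_probe,
    alert: Callable[[str], None] = print,
) -> Optional[str]:
    """Return the first healthy endpoint, alerting when failover occurs.

    `endpoints` is ordered by preference: primary first, backups after.
    `alert` is a stand-in for a real channel (email, pager, status page).
    """
    for index, url in enumerate(endpoints):
        if probe(url):
            if index > 0:
                # The primary (and possibly other backups) failed the probe.
                alert(f"Primary storage unreachable; failing over to {url}")
            return url
    alert("All storage endpoints are unreachable")
    return None


if __name__ == "__main__":
    # Hypothetical health-check URLs; substitute real ones.
    chosen = pick_endpoint([
        "https://primary.example.com/health",
        "https://backup.example.net/health",
    ])
    print("Serving from:", chosen)
```

Run on a schedule, a check like this means the provider learns about an outage from its own probe, and can notify customers, instead of hearing about it from them.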
Finally, why did the outage last for hours? It would be very interesting to know whether Amazon knew immediately what the cause of the outage was. Some of the reports quoted Amazon stating “they think they understand the root cause”. My guess is that they were scrambling to identify the cause for a while. As it turns out, the cause was an administrator's typo that stopped some critical instances, which resulted in a cascading failure. This outage, along with how the providers reacted or did not react, sheds a bright light on deficiencies in existing network monitoring capabilities.
About the Author:
Lance Edelman is a technology professional with 25+ years of experience in enterprise software, security, document management and network management. He is co-founder and CEO at Vallum Software and currently lives in Atlanta, GA.