Tech Today with Ken May
How did Amazon take down the internet?
On Tuesday, February 28th, a cluster of Amazon Web Services (AWS) servers in the US-EAST-1 region stopped responding. Sites and web apps like Mashable, Trello, Giphy, Quora, Netflix, Spotify, Slack, Pinterest and Buzzfeed, as well as tens of thousands of smaller sites, were suddenly down or slowed to a crawl. All the average person saw was that a ton of commonly used sites and apps were not working. How does this happen?
It was so bad that Amazon wasn’t able to update its own service health dashboard for the first two hours of the outage because the dashboard itself was hosted on AWS.
“This is a pretty big outage,” said Dave Bartoletti, a cloud analyst with Forrester. “AWS had not had a lot of outages and when they happen, they’re famous. People still talk about the one in September of 2015 that lasted five hours,” he said.
The outage affected so many sites because Amazon’s AWS platform hosts the virtual servers used by all of these businesses. Amazon’s S3 cloud storage systems were also affected. So even a site not running on an AWS server might have had issues if its data was on S3. For example, a business might store its videos, images or databases on S3 and access them via the Internet.
As it turns out, it was all due to human error: a simple typo. As Amazon explains it, some of its S3 servers were running sluggishly, so a tech tried to fix the problem by taking a few billing servers offline. A fix straight from the company’s playbook, it says. “Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.” Whoops.
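How might a tool refuse a mistyped input like that? Here is a minimal sketch in Python, not Amazon’s actual tooling. The function name, the server names and the limits (max_batch, min_required) are all made up for illustration. The idea is simply that the command checks the request against a per-command limit and a capacity floor before it touches anything.

```python
class CapacityError(Exception):
    """Raised when a removal request looks unsafe."""


def plan_removal(active_servers, requested_count, min_required, max_batch=2):
    """Return the servers to take offline, or raise if the request is unsafe.

    Hypothetical guard: every name and threshold here is an assumption
    made for illustration, not part of any real AWS tool.
    """
    if requested_count < 0:
        raise ValueError("requested_count must be non-negative")

    # Guard 1: cap how many servers a single command may remove.
    if requested_count > max_batch:
        raise CapacityError(
            f"refusing to remove {requested_count} servers at once; "
            f"limit is {max_batch} per command"
        )

    # Guard 2: never let the fleet drop below its minimum capacity.
    remaining = len(active_servers) - requested_count
    if remaining < min_required:
        raise CapacityError(
            f"removal would leave {remaining} servers; "
            f"at least {min_required} are required"
        )

    return list(active_servers[:requested_count])


# A made-up fleet of ten billing servers.
fleet = [f"billing-{i}" for i in range(10)]

# A sane request goes through.
to_remove = plan_removal(fleet, 2, min_required=6)

# A fat-fingered request (20 instead of 2) is rejected before anything happens.
try:
    plan_removal(fleet, 20, min_required=6)
    typo_blocked = False
except CapacityError:
    typo_blocked = True
```

With guards like these, the typo produces an error message instead of an outage.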
As for why the problem took so long to correct, Amazon says that some of its server systems hadn’t been restarted in “many years.” Given how much the S3 system has expanded, “the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected.”
Cyence, an economic modeling platform, shared some data that show the ramifications:
-Losses of $150 million for S&P 500 companies
-Losses of $160 million for U.S. financial services companies using the infrastructure
Apica Inc., a website-monitoring company, said 54 of the internet’s top 100 retailers saw website performance slow by 20% or more.
Ouch!
Amazon apologized for the issue and said that it has put safeguards in place to keep the same kind of human error from causing problems in the future. Let this stand as a reminder to have adequate failover systems in place! Never put all your eggs in one basket.
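What does failover look like in practice? Here is a minimal sketch in Python, assuming you keep a second copy of your files somewhere other than your primary storage. The two “stores” below are stand-ins that simulate an outage. In a real setup they would be calls to two different regions or providers, and all of the names are made up for illustration.

```python
def fetch_with_failover(key, sources):
    """Try each (name, fetch) pair in order; return (name, data) from the first that works.

    Only connection-style failures (OSError and its subclasses, such as
    ConnectionError and TimeoutError) trigger a fallback to the next source.
    """
    errors = []
    for name, fetch in sources:
        try:
            return name, fetch(key)
        except OSError as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all sources failed: " + "; ".join(errors))


# Simulated primary store that is down, as in the outage.
def primary_store(key):
    raise ConnectionError("primary region is not responding")


# Simulated backup store holding a copy of the same file.
backup_files = {"logo.png": b"fake image bytes"}


def backup_store(key):
    return backup_files[key]


served_by, data = fetch_with_failover(
    "logo.png",
    [("primary", primary_store), ("backup", backup_store)],
)
```

The second basket only helps if the copy is kept up to date, so a setup like this also needs something that replicates new files to the backup.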