Amazon says software problem caused huge internet outage

Automated processes in Amazon’s cloud computing business were the main cause of the cascading outages across the internet on December 7th, according to Amazon.com. The outages affected everything from Disney amusement parks and Netflix videos to robot vacuums and Adele ticket sales.

In a statement Friday, Amazon said the problem began December 7th when an automated computer program, designed to make its network more reliable, ended up causing a “large number” of its systems to behave unexpectedly. That, in turn, created a wave of activity on Amazon’s networks, ultimately preventing users from accessing some of its cloud services.

“Basically, a bad piece of code was executed automatically, and it caused a snowball effect,” Forrester analyst Brent Ellis said.

“The outage persisted because their internal controls and monitoring systems were taken offline by the storm of traffic caused by the original problem,” he noted.

The nature of the failure also prevented teams from pinpointing and fixing the problem, the company added. Teams had to use logs to find out what happened, and internal tools were also affected. Responders were “extremely deliberate” in restoring service to avoid breaking still-functional workloads and had to contend with a “latent issue” that prevented networking clients from backing off and giving systems a chance to recover, it noted.
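
For readers unfamiliar with the term, "backing off" means a client waits progressively longer between retries so that an overloaded system gets room to recover. Amazon did not publish its client logic, so the following is only a generic sketch of one common approach, exponential backoff with random jitter. The function names, the default delays, and the use of ConnectionError are all assumptions made for illustration and are not taken from Amazon's statement.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=10.0, rng=random):
    """Return a randomized wait in seconds for a 0-indexed retry attempt."""
    # The upper bound doubles with each attempt, but never exceeds the cap.
    ceiling = min(cap, base * (2 ** attempt))
    # Pick a random wait in [0, ceiling] so many clients do not retry in lockstep.
    return rng.uniform(0, ceiling)


def call_with_backoff(operation, max_attempts=5, base=0.1, cap=10.0,
                      sleep=time.sleep, rng=random):
    """Call operation(); on ConnectionError, wait and retry (hypothetical helper)."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of retries, so surface the error to the caller
            sleep(backoff_delay(attempt, base, cap, rng))
```

A client that retries immediately and indefinitely, by contrast, keeps adding traffic to a system that is already struggling, which is the kind of pile-up the article describes.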

The AWS division has temporarily disabled the scaling that led to the problem and won’t switch it back on until solutions are found.

A fix for the latest glitch is expected within two weeks, and there is also an extra network configuration to protect devices in the event of a repeat failure, Amazon said.

However, Corey Quinn, cloud economist at Duckbill Group, said that “Amazon didn’t explain what this unexpected behavior was, and they didn’t know what it was. So they were guessing when trying to fix it, which is why it took so long.”

AWS is generally a reliable service. Amazon’s cloud division last suffered a major incident in 2017, when an employee accidentally turned off more servers than intended during repairs of a billing system.