Facebook Blames Engineering Error of ‘Our Own Making’ for Outage

A simple technical mistake caused a global outage Monday that left more than 2.9 billion internet users unable to access Facebook, Instagram, WhatsApp and other tools.

The roughly six-hour disruption, which was the largest in the company’s history, based on the number of users affected, arose when Facebook Inc. was trying to do routine maintenance related to how internet data routes back and forth through its network systems, according to a company blog post Tuesday.

Seeking to get a read on Facebook’s networking capacity, engineers issued a networking command that inadvertently disconnected all of Facebook’s data centers from the company’s network. That set off a cascade of failures that took all of Facebook’s properties off the internet.
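
From outside Facebook, a change like that shows up as routes to the company’s network disappearing from the public routing system. The Python sketch below illustrates the kind of check an outside observer could run against RIPE’s public RIPEstat service; the AS number shown (AS32934, registered to Facebook), the endpoint URL and the response fields are assumptions to verify against the RIPEstat documentation, not a record of what anyone actually ran.

```python
# Sketch: count how many network prefixes an autonomous system is announcing,
# via RIPEstat's "announced-prefixes" endpoint. The endpoint URL and the
# "data" -> "prefixes" response fields are assumptions to check against the
# RIPEstat documentation before relying on them.
import json
import urllib.request

ASN = "AS32934"  # AS number registered to Facebook (used here for illustration)
URL = f"https://stat.ripe.net/data/announced-prefixes/data.json?resource={ASN}"

def announced_prefix_count(url: str = URL) -> int:
    """Return the number of prefixes the API reports as announced."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    return len(payload.get("data", {}).get("prefixes", []))

if __name__ == "__main__":
    # During Monday's outage, a count like this would have dropped sharply as
    # routes to Facebook's network were withdrawn, then recovered hours later.
    print(f"{ASN} announces {announced_prefix_count()} prefixes")
```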

Ultimately, Facebook engineers—a team of people who built one of the world’s most sophisticated networks—had to use pre-internet technology to solve the problem. They had to drive to data centers and restart systems there, the company said.

The outage was “caused not by malicious activity, but an error of our own making,” Santosh Janardhan, Facebook’s vice president of infrastructure, wrote in the blog post.

The outage had widespread and global ripple effects. It cut off essential communication in some areas of the world, upended e-commerce in some countries, hampered some small businesses and led others to see a marketing opportunity. In some quarters, it was a cause for reflection about the extent to which Facebook and its platforms are woven into global connectivity.

Internet giants such as Facebook have poured billions of dollars into their sprawling global data centers in recent decades, designing their own networking gear and the software that powers them.

That has allowed these companies to operate with unmatched speed and efficiency, but it also creates vulnerability. The scale and complexity required to operate and maintain such a network, and the extent to which its infrastructure is managed and controlled by one company, can lead to circumstances in which small errors can have outsize impact, networking experts say.

“This is an outfit with infinite resources and some of the most talented people,” said Doug Madory, director of internet analysis at the network monitoring firm Kentik. He said Facebook may not have applied enough scrutiny to its own solutions and backup processes.

One key question that Facebook has yet to answer is why the company’s backup network, called an out-of-band network, failed to work on Monday. This network is designed to be separate from the rest of Facebook and was supposed to provide engineers with a way to fix systems remotely within minutes when they go down.
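
In principle, an out-of-band network is a second, separate path to the same equipment, so that operators can still reach routers and servers when the production network is down. The sketch below shows that fallback idea in Python; the hostnames and port are hypothetical placeholders and say nothing about how Facebook’s own tooling is built.

```python
# Sketch of the out-of-band idea: if a device cannot be reached over the
# normal (in-band) management path, try a separate out-of-band path instead.
# Both addresses are hypothetical placeholders, not real infrastructure.
import socket

PRIMARY = ("mgmt.example.net", 22)     # in-band management address (hypothetical)
OUT_OF_BAND = ("oob.example.net", 22)  # separate out-of-band address (hypothetical)

def reachable(addr: tuple, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds in time."""
    try:
        with socket.create_connection(addr, timeout=timeout):
            return True
    except OSError:
        return False

def pick_management_path():
    """Prefer the in-band path; fall back to the out-of-band path."""
    for addr in (PRIMARY, OUT_OF_BAND):
        if reachable(addr):
            return addr
    return None  # neither path works: the situation Facebook described Monday

if __name__ == "__main__":
    print("usable management path:", pick_management_path())
```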

In his blog post, Mr. Janardhan said that the out-of-band network didn’t work Monday, but he didn’t explain why.

Instead, with engineers unable to reset their misconfigured gear, a cascading set of failures ensued.

Once the data centers were offline, Facebook’s Domain Name System, or DNS, servers pulled themselves off the internet. DNS is what browsers and mobile phones use to find Facebook’s services on the internet, and without it working, it was “impossible for the rest of the internet to find our servers,” Mr. Janardhan said.
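
That dependence is easy to demonstrate. The short Python sketch below performs the same first step a browser or phone does, asking DNS for the addresses behind Facebook’s domain names; during Monday’s outage, lookups like these failed, which is why apps and sites simply could not find the services.

```python
# Sketch: the DNS lookup a browser or phone performs before it can reach a
# site. When Facebook's DNS servers dropped off the internet, lookups like
# these returned errors, so clients had no address to connect to.
import socket

def resolve(hostname: str) -> list:
    """Return the IP addresses DNS reports for a hostname, or [] on failure."""
    try:
        results = socket.getaddrinfo(hostname, None)
        return sorted({entry[4][0] for entry in results})
    except socket.gaierror:
        return []

if __name__ == "__main__":
    for name in ("facebook.com", "www.instagram.com", "www.whatsapp.com"):
        addresses = resolve(name)
        print(f"{name}: {', '.join(addresses) if addresses else 'lookup failed'}")
```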

The DNS changes also disabled internal tools that would have allowed Facebook’s engineers to restore service remotely, forcing engineering staff to drive to data centers and restart systems there.

That took more time. “They’re hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them,” Mr. Janardhan said. “So it took extra time to activate the secure access protocols needed to get people on-site and able to work on the servers.”

Write to Robert McMillan at Robert.Mcmillan@wsj.com
