The simultaneous outage of Facebook, Instagram, and WhatsApp on Monday (4 October 2021) became a topic of public discussion.
All three services have since returned to normal. However, the question remains how a company as large as the Facebook group could be down for more than six hours.
Facebook's Vice President of Engineering & Infrastructure, Santosh Janardhan, has now explained what caused the Facebook, Instagram, and WhatsApp downtime.
“This outage was triggered by the system that manages our global backbone network capacity,” Santosh said on the official Facebook blog, as quoted on Wednesday (6 October 2021).
The backbone in question is the network that Facebook built to connect all of its computing facilities.
It consists of tens of thousands of miles of fibre-optic cable linking all of Facebook’s data centres. Routers manage the data traffic between these facilities.
Periodically, Facebook performs infrastructure maintenance, such as repairing broken cables and updating software on routers.
During one such maintenance job on Monday, Facebook issued a command intended to check the available capacity of the backbone. Instead, the command took down the backbone connections and cut Facebook’s data centres off from one another and from the internet.
“Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command,” said Santosh.
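To illustrate the kind of safeguard Santosh describes, here is a minimal sketch in Python of a check that audits a command before it runs. It is not Facebook's tool: the `Command` class, the `audit` and `run_if_safe` functions, and the link names are all hypothetical, and the rule applied (at least one backbone link must stay up) is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    """Hypothetical maintenance command and the backbone links it would disable."""
    name: str
    links_disabled: frozenset


def audit(command, all_links, min_links_up=1):
    """Return True if the command would leave at least min_links_up links up."""
    remaining = set(all_links) - set(command.links_disabled)
    return len(remaining) >= min_links_up


def run_if_safe(command, all_links):
    """Execute the command only when the audit passes (assumed policy)."""
    if not audit(command, all_links):
        return "blocked"
    return "executed"


# Illustrative link names, not real infrastructure.
links = {"link-a", "link-b", "link-c"}
capacity_check = Command("check-capacity", frozenset())
drain_all = Command("drain-all", frozenset(links))

print(run_if_safe(capacity_check, links))
print(run_if_safe(drain_all, links))
```

In this sketch a read-only capacity check passes the audit, while a command that would disable every link is blocked. According to Santosh, a bug in Facebook's audit tool meant the harmful command was not stopped in this way.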
The change severed the connection between the data centres and the internet. The problem was made worse because Facebook’s DNS servers stop announcing their routes over the Border Gateway Protocol (BGP) when they cannot reach the data centres, which they treat as a sign of an unhealthy network connection.
“The end result was that our DNS servers became unreachable even though they were still operational. This made it impossible for the rest of the internet to find our servers,” said Santosh.
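The effect Santosh describes, servers that keep running but can no longer be reached, can be sketched in a few lines of Python. This is a simplified model and not real BGP or DNS software: the `DnsNode` class, the `can_resolve` function, and the set standing in for the internet's routing table are hypothetical, and `192.0.2.0/24` is a documentation-only placeholder prefix.

```python
class DnsNode:
    """Hypothetical DNS server that announces its route only while healthy."""

    def __init__(self, prefix):
        self.prefix = prefix
        self.running = True  # the DNS process itself never stops in this model

    def update_announcement(self, routing_table, backbone_reachable):
        """Announce the prefix when the backbone is reachable, else withdraw it."""
        if backbone_reachable:
            routing_table.add(self.prefix)
        else:
            routing_table.discard(self.prefix)


def can_resolve(routing_table, node):
    """The outside world can query the node only if a route to it exists."""
    return node.running and node.prefix in routing_table


internet_routes = set()            # stand-in for the global routing table
node = DnsNode("192.0.2.0/24")     # placeholder documentation prefix

node.update_announcement(internet_routes, backbone_reachable=True)
print(can_resolve(internet_routes, node))

node.update_announcement(internet_routes, backbone_reachable=False)
print(node.running, can_resolve(internet_routes, node))
```

After the backbone becomes unreachable, the node is still marked as running, but its route has been withdrawn, so lookups from outside fail.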
Because Facebook applies strict security safeguards to its facilities, physically resetting the systems took considerable time.