On Monday, Facebook went completely offline and took Instagram and WhatsApp (not to mention a few other sites) down with it. Many have been quick to say that the incident had to do with BGP or the Border Gateway Protocol, citing sources from within Facebook, traffic analysis, and the gut instinct that “it’s always DNS or BGP.” Facebook is on its way back, but it all begs the question:
What is BGP?
On a very basic level, BGP is one of the systems that the internet uses to get your traffic to where it needs to go as quickly as possible. Because there are lots of different ISPs, backbone routers and servers that are responsible for getting your data to, say, Facebook, there are lots of different routes your packets may end up taking. BGP’s job is to show them the way and make sure it is the best route.
I have heard BGP described as a system of post offices, an air traffic controller and more, but I think my preferred explanation was one that likened it to maps. Imagine BGP as a bunch of people making and updating maps that show you how to get to YouTube or Facebook.
When it comes to BGP, the Internet is divided into large networks, known as autonomous systems. You can imagine them as island nations – they are networks controlled by a single entity, which can be an ISP like Comcast, a company like Facebook, or another large organization like a government or a large university. It would be extremely difficult to build bridges that connect each island with all the others, so BGP is responsible for telling you which islands (or autonomous systems) you need to go through to get to your destination.
As the Internet is constantly changing, maps need to be updated – you do not want your ISP to lead you down an old path that no longer goes to Google. Because it would be a massive task to map the entire Internet all the time, autonomous systems share their maps. They will occasionally talk to their island neighbors to see and copy all the updates they have made to their maps.
With maps as the framing, it is easy to imagine how things can go wrong. Back when consumers first got access to GPS, there were always jokes about it driving you off a cliff or into the middle of the desert. The same can happen with BGP – if someone makes a mistake, it can end up sending traffic somewhere it should not go, which will cause problems. If the mistake is not caught, it will end up on everyone’s maps. There are other ways it can go wrong, but we will get to them in a bit.
Yes, yes, maps. Give me an example.
Of course! This is massively simplified, but imagine you want to connect to an imaginary tech news site called Convergence. Convergence uses the ISP NetSend and you use DecadeConnect. In this example, DecadeConnect and NetSend cannot talk directly to each other, but your ISP can talk to Border Communications, who can talk to Form, who can talk to NetSend. If that’s the only route, then BGP would make sure you and Convergence could communicate through it. But if, alternatively, both DecadeConnect and NetSend were connected to ThirdLevel, BGP would probably choose to direct your traffic through it, as it is a shorter hop.
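To make the example concrete, here is a minimal Python sketch that treats the five imaginary networks as a graph and looks for the path that crosses the fewest of them. It is an illustration only: real BGP routers do not run a search like this, they compare routes advertised by their neighbors. The helper names (make_links, shortest_as_path) are made up for this sketch.

```python
from collections import deque

def make_links(pairs):
    """Build a two-way adjacency map from (network, network) pairs."""
    links = {}
    for a, b in pairs:
        links.setdefault(a, set()).add(b)
        links.setdefault(b, set()).add(a)
    return links

def shortest_as_path(links, source, destination):
    """Breadth-first search: return the path that crosses the fewest networks."""
    paths = {source: [source]}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        if current == destination:
            return paths[current]
        for neighbor in sorted(links.get(current, ())):
            if neighbor not in paths:  # first visit is the shortest in a BFS
                paths[neighbor] = paths[current] + [neighbor]
                queue.append(neighbor)
    return None  # no route at all

# The only route in the first version of the example.
long_way = [
    ("DecadeConnect", "Border Communications"),
    ("Border Communications", "Form"),
    ("Form", "NetSend"),
]
# The alternative: both ISPs also connect to ThirdLevel.
shortcut = [("DecadeConnect", "ThirdLevel"), ("ThirdLevel", "NetSend")]

only_route = shortest_as_path(make_links(long_way), "DecadeConnect", "NetSend")
with_shortcut = shortest_as_path(make_links(long_way + shortcut), "DecadeConnect", "NetSend")

print(" -> ".join(only_route))
print(" -> ".join(with_shortcut))
```

With only the long chain available, traffic goes through Border Communications and Form. Once ThirdLevel is connected to both ends, the search picks the two-hop route through it instead.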
Okay, so BGP is like a map that describes all the fastest ways from you to a site?
Right! Unfortunately, it can get even more complicated because the shortest route is not always the best. There are plenty of reasons why a routing algorithm would choose one path over another – cost can also be a factor, as some networks charge others that want to include them in their routes.
Maps are also super difficult! I discovered this recently while trying to plan a trip where roads existed on one map and not another, or were different between maps. One road even had three different names across three maps. If it’s that hard to figure out a “city” that has all of five roads, can you imagine what it’s like trying to connect the entire internet together? Real roads do not change that often, but websites can move from one country to another or change, add or drop service providers, and the Internet just has to deal with it.
I remember something like that from my algorithms and data structures class – trying to build algorithms to find the shortest route.
I’ll take your word for it. I dropped out as soon as I heard about graphs.
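A rough sketch of what "shortest is not always best" can look like: in real BGP, a locally configured preference value is compared before path length, so an operator can favor a cheaper but longer route. The Python below is a toy version of just those two tie-breakers. The Route class, the pick_best function and the preference numbers are assumptions for illustration, and actual routers consider several more criteria.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    via: str               # the neighbor that offered this route
    as_path: tuple         # networks the traffic would cross
    local_pref: int = 100  # operator-set preference; higher wins (illustrative default)

def pick_best(routes):
    """Prefer the highest local preference, then the shortest path."""
    return min(routes, key=lambda r: (-r.local_pref, len(r.as_path)))

# Hypothetical policy: ThirdLevel is short but expensive, so it gets a low preference.
short_but_pricey = Route("ThirdLevel", ("ThirdLevel", "NetSend"), local_pref=80)
long_but_cheap = Route(
    "Border Communications",
    ("Border Communications", "Form", "NetSend"),
    local_pref=200,
)

best = pick_best([short_but_pricey, long_but_cheap])
print(best.via)
```

Here the longer route through Border Communications wins because of its higher preference. If both routes had the same preference, the shorter path through ThirdLevel would be chosen.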
But Facebook did not! In fact, it has built its own BGP system, which allows it to make “rapid incremental updates” according to a paper presented earlier this year. That said, the system the company describes there is intended for communication within data centers – at this point, it’s hard to say what caused Facebook’s problems on Monday, and it would take someone wiser than me to say if Facebook’s data center communications could cause this kind of problem. Cybersecurity reporter Brian Krebs claims that the outage was due to a “routine BGP update”.
What does DNS have to do with all this?
To borrow an explanation from Cloudflare: DNS tells you where to go and BGP tells you how to get there. DNS is how computers know what IP address a website or other resource can be found at, but that knowledge in itself is not helpful – if you ask your friend for their address, you will probably still need GPS to get you there.
Cloudflare also has a great technical overview of how BGP errors can also break DNS requests – the article is specifically about Monday’s Facebook incident, so it’s worth reading if you’re looking for an explanation of what it looked like from an autonomous system’s perspective.
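The "where" versus "how" split can be sketched in a few lines of Python. Everything here is made up for illustration: the FAKE_DNS table, the convergence.example name, the address (taken from a range reserved for documentation) and the ROUTES table. The lookup picks the most specific matching network, which is how routers generally choose among overlapping entries.

```python
import ipaddress

# "Where": a pretend DNS table mapping a name to an address.
FAKE_DNS = {"convergence.example": "203.0.113.10"}

# "How": a pretend routing table mapping networks to the next network to use.
ROUTES = {
    ipaddress.ip_network("203.0.113.0/24"): "via ThirdLevel",
}

def where(name):
    """DNS step: return the address for a name, or None if unknown."""
    return FAKE_DNS.get(name)

def how(address, routes):
    """Routing step: return the entry for the most specific matching network."""
    ip = ipaddress.ip_address(address)
    matches = [net for net in routes if ip in net]
    if not matches:
        return None  # we know the address but have no way to get there
    return routes[max(matches, key=lambda net: net.prefixlen)]

address = where("convergence.example")
print(address, how(address, ROUTES))
print(address, how(address, {}))  # same address, empty map: unreachable
```

The second lookup is the interesting one: the address is known, but without a route the answer is useless, much like having a street address and no map.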
What can go wrong with BGP?
Many things. According to Cloudflare, two notable incidents were a Turkish ISP accidentally telling the entire Internet to direct its traffic to its service in 2004, and a Pakistani ISP accidentally blocking YouTube worldwide while trying to block it only for its own users. Because of BGP’s ability to spread from autonomous system to autonomous system (which, as a reminder, is one of the things that makes it so useful), one group’s mistake can cascade.
A group that gets owned can also cause problems – in 2018, hackers were able to hijack requests to Amazon’s DNS and steal thousands of dollars in Ethereum by compromising a separate ISP’s BGP servers. Amazon was not the one that was hacked, but traffic intended for it ended up somewhere else.
Or you could mess up and delete your entire service from the internet with a bad BGP update. BGP is affectionately called the Internet’s duct tape, but no adhesive is perfect.
So what happened to Facebook?
It seems that Facebook’s servers for some reason told everyone to take them off their maps. We’ll probably have to wait for a report from Facebook if we want to know exactly what happened to its BGP configuration and why that change was made. However, Cloudflare’s CTO reports that the service saw lots of BGP updates from Facebook (most of which were route withdrawals, or erasing the lines on the map that led to Facebook) just before it went dark. One of Fastly’s tech leads tweeted that Facebook stopped providing routes to Fastly when it went offline, and KrebsOnSecurity backs up that it was some update to Facebook’s BGP that knocked out its services.
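Removing routes like this (BGP calls such updates withdrawals) can be pictured as a neighbor's map losing lines one update at a time. The Python below is a toy model, not a description of Facebook's systems: the apply_updates function, the peer names and the prefix (from a range reserved for documentation) are all invented for this sketch.

```python
def apply_updates(table, updates):
    """Return a new routing table after applying announce/withdraw updates.

    table maps a prefix to the set of peers currently offering a route to it.
    """
    table = {prefix: set(peers) for prefix, peers in table.items()}  # work on a copy
    for action, prefix, peer in updates:
        if action == "announce":
            table.setdefault(prefix, set()).add(peer)
        elif action == "withdraw":
            peers = table.get(prefix, set())
            peers.discard(peer)
            if not peers:
                # Nobody offers this prefix anymore: it falls off the map.
                table.pop(prefix, None)
    return table

# Hypothetical starting point: two peers offer a route to the same prefix.
before = {"192.0.2.0/24": {"peer-a", "peer-b"}}

after = apply_updates(before, [
    ("withdraw", "192.0.2.0/24", "peer-a"),
    ("withdraw", "192.0.2.0/24", "peer-b"),
])

print("before:", sorted(before))
print("after: ", sorted(after))
```

Once the last route to a prefix is withdrawn, the table has no entry for it at all, so traffic for those addresses has nowhere to go until someone announces a route again.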
I would recommend Cloudflare’s explanation if you want nitty-gritty technical details.
If BGP was the problem, how does Facebook solve it?
Given that the outage continued for hours, the answer seems to be “not easily.” Facebook needed to make sure that it was advertising the correct records and that those records were picked up by the Internet as a whole. In other words, it had to make sure that its maps were correct and that everyone could see them.
However, that is easier said than done. There were reports of Facebook employees being locked out of badge-access doors and of employees struggling to communicate. In situations like these, you need to figure out not only who has the knowledge to solve the problem and who has the permissions to solve it, but also how to connect those people. And when your entire business is dead in the water, that’s no easy task – The Verge received reports that engineers were physically sent to a Facebook data center in California to try to fix the problem.
Would Web3 solve this problem?
Stop it. I want to cry.
But to answer the question quickly, probably not – even if Facebook jumped on the decentralized train, there would still need to be a protocol that tells you where to find its resources. We have seen that it is possible to misconfigure or break blockchain contracts before, so I would be a little suspicious of anyone who said that a contract- and blockchain-based internet would be immune to these kinds of problems.
Sure was wild timing on that outage given all the bad Facebook news, huh?
Absolutely. The fact that it all happened while a whistleblower was going on TV and airing Facebook’s dirty laundry makes it really easy to come up with alternative explanations. But it’s just as possible that this was an innocent mistake made by a (very, very unfortunate) person on Facebook’s IT staff.