Peering on an Internet exchange like LU-CIX relies on a routing protocol called BGP. This protocol has been around for almost 30 years and has been able to scale to the gigantic size of today’s Internet.
Back when BGP was developed, the largest computers were less powerful than today’s smartphones, the worldwide web hadn’t been invented yet, and store-and-forward email was hip. Fast failure detection wasn’t necessary at all, and nobody cared about a few minutes of outage.
Today networks carry all types of data, and some can’t tolerate the slightest interruption. The “new normal” videoconferences are just bearable when they work, and you don’t want to be interrupted in your critical VoIP call with that super-important prospective customer.
BGP’s timer-based failure detection hasn’t kept pace. In default configurations, when a peer fails silently, BGP needs between 90 and 180 seconds of downtime (depending on the implementation’s default hold timer) before it removes the failed routes from its table. That could mean up to 3 minutes of interruption in your phone call – in other words, the call is dead.
The interruption doesn’t have to be caused by a line or equipment failure; planned maintenance can cause it as well. Procedures exist to manage planned maintenance and limit its impact (e.g. BCP 214), but they are difficult to implement and not always available.
While timers in BGP can be tuned for more aggressive failure detection, this brings its own set of limitations and drawbacks.
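To make the timer arithmetic concrete, here is a minimal Python sketch. It assumes an otherwise idle session where only keepalives reset the hold timer. The function name and the example timer pairs are illustrative: 30/90 and 60/180 are commonly seen keepalive/hold defaults, and 1/3 is the most aggressive setting the BGP specification allows, since a non-zero hold time must be at least 3 seconds.

```python
from typing import Tuple


def bgp_detection_window(keepalive_s: float, hold_s: float) -> Tuple[float, float]:
    """Return (best_case, worst_case) seconds until a silent peer failure is noticed.

    The hold timer restarts whenever a keepalive arrives. A peer that dies
    right after sending a keepalive is noticed a full hold time later (worst
    case). A peer that dies just before its next keepalive was due is noticed
    one keepalive interval earlier (best case).
    """
    if keepalive_s <= 0 or hold_s < keepalive_s:
        raise ValueError("expected 0 < keepalive_s <= hold_s")
    return hold_s - keepalive_s, hold_s


# Illustrative timer pairs (keepalive, hold), in seconds.
for keepalive, hold in [(30, 90), (60, 180), (1, 3)]:
    best, worst = bgp_detection_window(keepalive, hold)
    print(f"keepalive={keepalive}s hold={hold}s -> detected after {best}-{worst}s")
```

Even with the most aggressive timers, detection stays in the range of seconds, and every keepalive has to be processed by the control plane of both routers.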
BFD (Bidirectional Forwarding Detection) is a generic keepalive and failure detection protocol that runs over almost any communication medium. It allows sub-second failure detection, provided the participating systems are fast enough. It is designed as a lightweight protocol that can run autonomously in the forwarding engine of network devices, independent of the control plane.
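How fast "sub-second" is in practice depends on the negotiated timers. The sketch below follows the asynchronous-mode rule from the BFD specification (RFC 5880): the local detection time is the remote detect multiplier times the greater of the local required minimum receive interval and the remote desired minimum transmit interval. The function and parameter names are our own, and the example values are illustrative rather than recommendations.

```python
def bfd_detection_time_ms(
    local_required_min_rx_ms: int,
    remote_desired_min_tx_ms: int,
    remote_detect_mult: int,
) -> int:
    """Detection time on the local system for BFD asynchronous mode."""
    if remote_detect_mult < 1:
        raise ValueError("detect multiplier must be at least 1")
    # The remote system transmits no faster than the slower of what it
    # wants to send and what we are willing to receive.
    agreed_interval_ms = max(local_required_min_rx_ms, remote_desired_min_tx_ms)
    return remote_detect_mult * agreed_interval_ms


# Both sides at 300 ms with a multiplier of 3: failure declared after 900 ms.
print(bfd_detection_time_ms(300, 300, 3))
# We accept 50 ms, but the remote side only sends every 100 ms: 300 ms.
print(bfd_detection_time_ms(50, 100, 3))
```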
In our case, BFD can be configured as an additional failure detection mechanism for BGP. Each BGP session is paired with a dedicated BFD session that runs on UDP ports 3784 (control packets) and 3785 (echo packets), on the same IPv4 or IPv6 addresses as the BGP session itself. When BFD detects the failure of a neighbor, it informs the BGP process, which immediately withdraws the neighbor’s routes, bypassing the overly generous BGP timeouts and enabling the immediate use of an alternate route.
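The interaction between the two protocols can be pictured with a toy model. This is not how any real router implements it: the class, the method names and the single "preference" value are hypothetical, and real BGP best-path selection has many more steps. The sketch only shows the principle that a BFD "down" event removes the neighbor’s routes right away, so the next-best path takes over without waiting for the hold timer.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ToyRouteTable:
    # prefix -> {neighbor name: preference}; a higher preference wins.
    routes: Dict[str, Dict[str, int]] = field(default_factory=dict)

    def announce(self, prefix: str, neighbor: str, preference: int) -> None:
        """Record a route for `prefix` learned from `neighbor`."""
        self.routes.setdefault(prefix, {})[neighbor] = preference

    def best(self, prefix: str) -> Optional[str]:
        """Return the neighbor currently used to reach `prefix`, if any."""
        candidates = self.routes.get(prefix, {})
        if not candidates:
            return None
        return max(candidates, key=candidates.get)

    def on_bfd_down(self, neighbor: str) -> None:
        """Hypothetical callback: BFD reports that `neighbor` is unreachable."""
        for paths in self.routes.values():
            paths.pop(neighbor, None)  # withdraw immediately, no hold timer


table = ToyRouteTable()
table.announce("203.0.113.0/24", "peer-a", preference=200)
table.announce("203.0.113.0/24", "peer-b", preference=100)
print(table.best("203.0.113.0/24"))  # peer-a is preferred
table.on_bfd_down("peer-a")
print(table.best("203.0.113.0/24"))  # traffic now follows peer-b
```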
At LU-CIX, we now encourage members to use BFD on their bilateral (peer-to-peer) peering sessions. This allows for extremely fast failover when the data path fails, whatever the reason. In addition, we will start offering BFD support on the route servers, so that BGP sessions with the route servers can be protected with BFD as well.