Adding Resiliency to BGP Avoids Network Outages, Data Loss

Cisco Umbrella has mechanisms in place to ensure that end users don't lose connectivity even if ISPs and service providers experience outages.

Business suffers when the network goes down or performance lags in today’s hyper-connected, always-on world. A dropped video call potentially means a lost sale. An error message on the website hurts customer experience and brand reputation. Partners can’t deliver the services they are contracted to provide. And employees struggle to perform the basic parts of their jobs.

Since the network is the foundation of all business functions, the modern network architecture has to be resilient enough to maintain connectivity during network disruptions. Security also has to be part of the conversation to minimize potential issues such as downtime and data loss, says Pier Carlo Chiodi, a senior network engineer and technical lead at Cisco. Even more critical, the network needs to be designed to be self-healing so that it can automatically adapt to problems and resume operations as soon as possible.

Resiliency was also part of the plan during a short-lived outage involving Akamai Technologies and its network of authoritative Domain Name System (DNS) servers last July. While many users were unable to access large swathes of the Internet, most Cisco Umbrella users didn't experience any issues.

The outage was avoided because, unlike most recursive name servers, Cisco Umbrella's recursive DNS servers don’t delete expired DNS records, Cisco says. Instead, Umbrella marks those records as expired and stores them in a separate database. When Akamai’s authoritative DNS servers failed, Cisco Umbrella looked up the expired records and connected users to the last known IP address for the domain they were trying to access. Cisco Umbrella's recursive DNS servers were able to complete between 40% and 50% of queries because the IP addresses hadn’t changed for those domains.
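The fallback described above can be illustrated with a minimal sketch. This is not Cisco Umbrella's implementation; the class name `StaleServingCache`, the `upstream` callable, and the result labels are hypothetical names chosen for illustration. It assumes the upstream lookup returns an `(ip, ttl)` pair and raises an exception when the authoritative server cannot be reached.

```python
import time
from dataclasses import dataclass


@dataclass
class Record:
    ip: str
    expires_at: float


class StaleServingCache:
    """Hypothetical resolver cache that keeps expired records for fallback."""

    def __init__(self, upstream, clock=time.monotonic):
        self.upstream = upstream   # assumed callable: name -> (ip, ttl_seconds)
        self.clock = clock
        self.fresh = {}            # records still within their TTL
        self.expired = {}          # separate store for records past their TTL

    def resolve(self, name):
        now = self.clock()
        rec = self.fresh.get(name)
        if rec is not None and rec.expires_at > now:
            return rec.ip, "fresh"
        if rec is not None:
            # Mark as expired by moving it aside instead of deleting it.
            self.expired[name] = self.fresh.pop(name)
        try:
            ip, ttl = self.upstream(name)
        except Exception:
            stale = self.expired.get(name)
            if stale is None:
                raise              # nothing to fall back on
            return stale.ip, "stale"
        self.fresh[name] = Record(ip, now + ttl)
        self.expired.pop(name, None)
        return ip, "refreshed"
```

A stale answer is only useful when the domain's IP address has not changed since the record expired, which is consistent with only a portion of queries succeeding during the outage.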

Another area where resiliency can make a difference is in Border Gateway Protocol (BGP), the routing protocol that lets networks know how to reach a given IP address. When a major transit provider experienced a “severe network issue” that impaired transatlantic connectivity for approximately 12 hours last October, Cisco Umbrella customers experienced virtually no interruption, says Chiodi. That was because customers were rerouted over different providers for the duration of the disruption.

Adding Resiliency to BGP
On the Internet, every network announces to other networks the IP prefixes that can be reached through it. Internet service providers use BGP to exchange these routes with other ISPs and network providers, with each route describing a way to reach a specific IP prefix via a specific network link. BGP lets each network be aware of the paths available to reach a given IP address at a given time. However, BGP on its own doesn't change routing policy to bypass potential issues.
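As a rough illustration of the idea that a network learns several paths toward the same prefix, the sketch below keeps a table of which neighbors have announced each prefix and looks up the candidates for an address using the most specific matching prefix. It is a deliberate simplification: real BGP carries path attributes and runs a best-path selection process that is not modeled here. The class name `PathTable` and the neighbor labels in the usage example are hypothetical.

```python
import ipaddress
from collections import defaultdict


class PathTable:
    """Simplified view of announced prefixes; not a BGP implementation."""

    def __init__(self):
        self.paths = defaultdict(set)   # network -> neighbors announcing it

    def announce(self, prefix, neighbor):
        self.paths[ipaddress.ip_network(prefix)].add(neighbor)

    def withdraw(self, prefix, neighbor):
        net = ipaddress.ip_network(prefix)
        self.paths[net].discard(neighbor)
        if not self.paths[net]:
            del self.paths[net]         # no neighbor offers this prefix any more

    def candidates(self, address):
        """Neighbors offering the most specific prefix covering the address."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.paths if addr in net]
        if not matches:
            return set()
        best = max(matches, key=lambda net: net.prefixlen)
        return set(self.paths[best])
```

For example, after `table.announce("203.0.113.0/24", "isp-a")` and `table.announce("203.0.113.0/24", "isp-b")`, calling `table.candidates("203.0.113.200")` returns both neighbors, and withdrawing one of them leaves the other as the remaining path.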

Umbrella adds intelligence to the network via its “special sauce,” the purpose-built systems and tools that check for latency and packet loss on each network path, Chiodi says. The tools are designed to automatically instruct the network to change the path as soon as they detect a network issue along the current one. For situations where the network disruption is confined to a limited number of locations, Umbrella automatically reroutes traffic away from the affected sites by shutting down the BGP session with that network.
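The kind of decision such tooling makes can be sketched as follows. This is an illustrative sketch only, not Umbrella's logic: the thresholds (150 ms latency, 1% packet loss), the `PathStats` type, and the function names are assumptions made for the example. The sketch keeps the current path while it is healthy and otherwise switches to the healthy path with the lowest loss, breaking ties on latency.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathStats:
    provider: str
    latency_ms: float
    loss_pct: float


def is_healthy(path, max_latency_ms=150.0, max_loss_pct=1.0):
    # Thresholds are illustrative assumptions, not values from the article.
    return path.latency_ms <= max_latency_ms and path.loss_pct <= max_loss_pct


def choose_path(current, measurements):
    """Return the provider to use given the latest per-path measurements."""
    by_provider = {m.provider: m for m in measurements}
    cur = by_provider.get(current)
    if cur is not None and is_healthy(cur):
        return current                  # avoid needless path changes
    healthy = [m for m in measurements if is_healthy(m)]
    if not healthy:
        return current                  # nothing better to move to
    best = min(healthy, key=lambda m: (m.loss_pct, m.latency_ms))
    return best.provider
```

Staying on a healthy current path, rather than always picking the best measurement, is a design choice in this sketch to limit route flapping.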

However, in a widespread outage where the same ISP is affecting a large number of sites, just removing that faulty ISP can potentially overload the remaining sites, Chiodi says. The servers at those sites would max out their CPUs, services would respond slowly, and traffic to and from users would potentially be dropped. This is why it's not enough to shut down all BGP sessions with the faulty ISP at the same time. There needs to be a mechanism to evenly spread out end users across the remaining sites so that traffic does not overload any specific one.
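One way to reason about this spreading problem is to distribute the displaced traffic in proportion to each remaining site's spare capacity. The article does not describe the actual mechanism, so the sketch below is an assumption-laden illustration: the function name `spread_load`, the site names, and the load and capacity units are all hypothetical.

```python
def spread_load(displaced, sites):
    """Split displaced load across sites in proportion to their headroom.

    sites maps a site name to (current_load, capacity), in the same units
    as displaced. Returns a mapping of site name to the extra load assigned.
    """
    headroom = {
        name: max(capacity - load, 0.0)
        for name, (load, capacity) in sites.items()
    }
    total = sum(headroom.values())
    if displaced > total:
        # Moving everything at once would overload at least one site.
        raise ValueError("not enough spare capacity for the displaced load")
    if total == 0:
        return {name: 0.0 for name in sites}
    return {name: displaced * room / total for name, room in headroom.items()}
```

With hypothetical sites `{"ams": (60, 100), "fra": (80, 100), "lon": (40, 100)}` and 60 units of displaced load, the busiest site receives the smallest share and no site exceeds its capacity.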

Having complete visibility into all the combinations available to route internal traffic is key because the network needs to know what possible alternative routes exist if the current route experiences issues, Chiodi says.