Quantitative Comparison of MPLS Resiliency Approaches
Ramesh Nagarajan
Bell Labs, Lucent Technologies
The desire to simultaneously support mission-critical applications and dynamic, reliable signaling and control for optical transport network infrastructure and voice/video services is driving the need for new protocols and architectures with increased reliability in IP/MPLS networks. Several resiliency approaches for IP/MPLS networks have been proposed, and they either are being standardized or have already been standardized. We compare four MPLS resiliency approaches that are being deployed in service-provider networks: packet 1+1, connection 1+1, fast reroute and standby LSP. Packet and connection 1+1 have been standardized in ITU-T G.7712 and Y.1720 for optical signaling and MPLS protection switching. Fast reroute and standby LSP are being standardized in the IETF.
We provide a comprehensive quantitative comparison of the above resiliency approaches in terms of their failure coverage, restoration time, network capacity overbuild for restoration and service availability. While some qualitative understanding of these attributes exists, along with selected quantitative comparisons for some of them, we believe this is the first quantitative comparison that uses real service-provider network data in all four dimensions to give a comprehensive view of the relative performance of each scheme. The failure detection and notification mechanisms used by each architecture dictate its failure coverage and restoration time. Failure coverage specifies the types of data plane failures that the approach can protect against, including hard physical layer failures and soft failures such as label table corruption. Restoration time is the time taken at the network layer to recover from a failure, including detection, notification and switching of LSP paths (as appropriate to each approach). We discuss the detection and notification mechanisms for each resiliency approach in detail and derive the resulting failure coverage and restoration times.
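The decomposition of restoration time into detection, notification and switching can be sketched as a simple additive budget. This is only an illustrative sketch: the class name `RestorationBudget`, the scheme labels and every millisecond value below are hypothetical placeholders, not measurements or results from this work. A component that does not apply to a given approach is modeled as zero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RestorationBudget:
    """Hypothetical per-approach restoration time components, in milliseconds."""
    detection_ms: float     # time to detect the failure
    notification_ms: float  # time to notify the node that must act (0 if not needed)
    switching_ms: float     # time to switch traffic onto the protection path

    def total_ms(self) -> float:
        # Restoration time is modeled as the sum of the three phases.
        return self.detection_ms + self.notification_ms + self.switching_ms


# Placeholder values only; they do not describe any real approach.
scheme_without_notification = RestorationBudget(10.0, 0.0, 5.0)
scheme_with_notification = RestorationBudget(10.0, 20.0, 5.0)

for name, budget in [
    ("no notification phase", scheme_without_notification),
    ("with notification phase", scheme_with_notification),
]:
    print(f"{name}: {budget.total_ms():.1f} ms")
```

The additive model assumes the phases run strictly one after another; overlapping phases would need a different combination rule.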
We also compare the resiliency approaches in terms of redundant capacity. Redundant capacity is the excess network capacity that each resiliency approach needs in order to recover from network failures. We first discuss ways to minimize the amount of excess capacity needed in each approach. These include both distributed approaches (implemented in network devices) and centralized approaches (implemented in network management/operation support systems), with a special focus on dynamic networks where LSPs are set up and torn down. We evaluate the redundant capacity needs in such dynamic environments. Most previous work in this area has focused on static evaluation, where traffic is assumed to be static and known a priori.
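As a rough illustration of how redundant capacity can be quantified, the sketch below computes a capacity overbuild ratio (backup capacity divided by working capacity) for a static set of LSPs, with and without sharing of backup capacity. The sharing rule is an assumption chosen for illustration (a single-link-failure model in which backup bandwidth on a link is sized for the worst single failure); it is not presented as the method used in this work. The function names, link labels and bandwidth values are hypothetical.

```python
from collections import defaultdict


def backup_capacity(lsps, shared):
    """Total backup bandwidth summed over all links.

    lsps: list of (bandwidth, working_links, backup_links) tuples.
    shared=False: every LSP reserves its own backup bandwidth (dedicated).
    shared=True: each backup link is sized for the worst single working-link
                 failure (assumes at most one link fails at a time).
    """
    dedicated = defaultdict(float)
    # per_failure[b][w] = bandwidth rerouted onto backup link b if working link w fails
    per_failure = defaultdict(lambda: defaultdict(float))
    for bandwidth, working_links, backup_links in lsps:
        for b in backup_links:
            dedicated[b] += bandwidth
            for w in working_links:
                per_failure[b][w] += bandwidth
    if not shared:
        return sum(dedicated.values())
    return sum(max(by_failure.values()) for by_failure in per_failure.values())


def overbuild(lsps, shared):
    """Redundant capacity expressed as a fraction of working capacity."""
    working = sum(bandwidth * len(working_links) for bandwidth, working_links, _ in lsps)
    return backup_capacity(lsps, shared) / working


# Two hypothetical unit-bandwidth LSPs with disjoint working links and a common backup link.
example_lsps = [
    (1.0, ["w1"], ["b1"]),
    (1.0, ["w2"], ["b1"]),
]
print(overbuild(example_lsps, shared=False))  # 1.0
print(overbuild(example_lsps, shared=True))   # 0.5
```

In a dynamic setting the same quantity would have to be re-evaluated as LSPs arrive and depart, which this static sketch does not attempt.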
Finally, we compare the service availability of the various resiliency mechanisms. Service availability is the percentage of time that the service, as experienced by the "user" of the MPLS network, is up. The particular resiliency approach and the amount of redundant network capacity have a large impact on service availability. We quantify and compare the availability of each resiliency approach in sample service-provider networks.
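The definition of service availability as a percentage of time, and the effect a protection path can have on it, can be sketched as follows. The series and parallel formulas are the standard textbook ones and assume independent failures; they are an idealization that ignores restoration time and any shortage of redundant capacity. The function names and numeric values are hypothetical.

```python
from math import prod


def availability_percent(uptime_hours, total_hours):
    """Percentage of time the service is up."""
    return 100.0 * uptime_hours / total_hours


def series_availability(element_availabilities):
    """A path is up only if every element on it is up (independent failures assumed)."""
    return prod(element_availabilities)


def protected_availability(working, backup):
    """Service is up if the working path or the backup path is up.

    Assumes disjoint paths with independent failures and instantaneous switchover.
    """
    return 1.0 - (1.0 - working) * (1.0 - backup)


# Placeholder element availabilities, not data from any real network.
working_path = series_availability([0.99, 0.99])
backup_path = series_availability([0.99, 0.99])
print(f"unprotected: {100 * working_path:.4f}%")
print(f"protected:   {100 * protected_availability(working_path, backup_path):.4f}%")
```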