Fault Tolerance for Network Functions and Middleboxes

Justine Sherry

UC Berkeley



Session Description

This talk will present FTMB, a system for fault tolerance in middleboxes and network functions. Network middleboxes must offer high availability, with automatic failover when a device fails. Achieving high availability is challenging because failover must correctly restore lost state (e.g., activity logs, port mappings), must do so quickly (e.g., in less than typical transport timeout values, to minimize disruption to applications), and must add little overhead to failure-free operation (e.g., additional per-packet latencies of tens to hundreds of µs). No existing middlebox design provides failover that is correct, fast to recover, and low in added latency during failure-free operation. FTMB is a new design for fault tolerance in middleboxes and network functions that achieves all three goals. FTMB adopts the classical approach of “rollback recovery,” in which a system uses information logged during normal operation to correctly reconstruct state after a failure. FTMB adds only 30µs to median per-packet latency and can reconstruct lost state in 40-275ms for practical systems.
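
To make the idea of rollback recovery concrete, the following is a minimal sketch of the general checkpoint-and-replay pattern, not FTMB's actual design. The names ToyNat and RollbackRecovery are hypothetical, and the sketch assumes a single-threaded middlebox whose packet processing is fully deterministic, so replaying logged inputs reproduces the lost state exactly.

```python
import copy


class ToyNat:
    """Hypothetical stateful middlebox: maps each flow to an external port."""

    def __init__(self, first_port=5000):
        self.next_port = first_port
        self.port_map = {}

    def process(self, flow):
        # Allocate a port the first time a flow is seen; reuse it afterwards.
        if flow not in self.port_map:
            self.port_map[flow] = self.next_port
            self.next_port += 1
        return self.port_map[flow]


class RollbackRecovery:
    """Generic checkpoint-plus-log wrapper (an assumption, not FTMB itself)."""

    def __init__(self, box):
        self.box = box
        self.checkpoint = copy.deepcopy(box)  # last saved snapshot of state
        self.log = []                         # inputs seen since that snapshot

    def handle(self, flow):
        # Log the input during normal operation, then process it.
        self.log.append(flow)
        return self.box.process(flow)

    def take_checkpoint(self):
        # Snapshot current state; earlier log entries are no longer needed.
        self.checkpoint = copy.deepcopy(self.box)
        self.log.clear()

    def recover(self):
        # Restore the snapshot, then replay logged inputs in order.
        replica = copy.deepcopy(self.checkpoint)
        for flow in self.log:
            replica.process(flow)
        self.box = replica
        return replica


ft = RollbackRecovery(ToyNat())
ft.handle("flow-a")
ft.handle("flow-b")
ft.take_checkpoint()
ft.handle("flow-c")
ft.handle("flow-a")

state_before_failure = dict(ft.box.port_map)
ft.box = None                  # simulate losing the live middlebox
recovered = ft.recover()
assert recovered.port_map == state_before_failure
```

The sketch also shows the trade-off named above: logging on every packet adds work to failure-free operation, while recovery time grows with the number of inputs replayed since the last checkpoint.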