Optionality: Realizing the Full Potential of Redundant Application Infrastructure

by Clare Gollnick

April 22, 2019

Modern traffic steering has evolved into a multi-dimensional optimization problem. From ensuring availability and minimizing latency to maximizing throughput, preventing flapping, and meeting CDN commits, we ask more from our infrastructure today than ever before. With so many competing interests, how do we guarantee an exceptional user experience while still achieving a diverse set of business goals?

Redundancy is an essential first requirement in modern application architecture. Investing in redundancy means investing in drop-in replacements to be available in times of crisis. Back-up infrastructure prevents downtime for users and reduces the scale and impact of incidents when they occur.

Good Enough is Great

The key to optimization of redundant architecture is using data to steer the traffic to the endpoint that will give the best user experience. The self-organizing nature of the internet produces substantial variability in performance and user experience, due to geography, ISP peering relationships, and current network conditions, among other factors. The most performant CDN or endpoint is likely different in Asia than in the United States, and different for users in the same location but with different service providers. With latency-based routing, we choose the endpoint with the lowest latency, taking into account the specific user’s characteristics. This can be done even when multiple endpoints are online and available, increasing the ROI of the redundant infrastructure.
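As a minimal sketch of the latency-based routing described above, the following groups RTT measurements by user segment and picks the endpoint with the lowest median for each one. The segment key (region and ISP), the endpoint names, and the measurement values are all hypothetical, chosen only for illustration.

```python
import statistics
from collections import defaultdict

# Hypothetical measurements: (region, isp, endpoint, rtt_ms).
measurements = [
    ("asia", "isp_x", "cdn_a", 120.0),
    ("asia", "isp_x", "cdn_a", 130.0),
    ("asia", "isp_x", "cdn_b", 60.0),
    ("asia", "isp_x", "cdn_b", 70.0),
    ("us", "isp_y", "cdn_a", 30.0),
    ("us", "isp_y", "cdn_a", 34.0),
    ("us", "isp_y", "cdn_b", 55.0),
    ("us", "isp_y", "cdn_b", 50.0),
]

def fastest_by_segment(rows):
    """Return {(region, isp): endpoint with the lowest median RTT}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for region, isp, endpoint, rtt_ms in rows:
        grouped[(region, isp)][endpoint].append(rtt_ms)
    return {
        segment: min(by_endpoint, key=lambda e: statistics.median(by_endpoint[e]))
        for segment, by_endpoint in grouped.items()
    }

routing = fastest_by_segment(measurements)
for segment, endpoint in routing.items():
    print(segment, "->", endpoint)
```

With these made-up numbers, the two segments are steered to different endpoints, which is the point: the best answer depends on who is asking.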

The secret to multi-dimensional optimization is to avoid unnecessary optimization. If two out of three endpoints would provide acceptable performance, keep both good ones. Not reducing to the single best alternative means we still have options. Options are remaining degrees of freedom and create the opportunity to do more — like choosing the cheaper CDN or load balancing. Traffic engineering teams can maximize the return on investment of redundant infrastructure by adopting a philosophy of ‘good enough is great’ at each stage of the optimization process.
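The ‘good enough is great’ idea can be sketched in a few lines: instead of reducing to the single fastest endpoint, keep every endpoint within a latency margin of the best one, then spend the remaining freedom on a second goal such as cost. The endpoint names, latencies, prices, and the 10 ms margin below are assumptions for illustration, not recommended values.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Endpoint:
    name: str
    latency_ms: float   # hypothetical measured latency
    cost_per_gb: float  # hypothetical price

def good_enough(endpoints, margin_ms):
    """Keep every endpoint whose latency is within margin_ms of the best."""
    best = min(e.latency_ms for e in endpoints)
    return [e for e in endpoints if e.latency_ms - best <= margin_ms]

def pick_cheapest(endpoints):
    """Use the remaining degrees of freedom to optimize a second dimension."""
    return min(endpoints, key=lambda e: e.cost_per_gb)

endpoints = [
    Endpoint("cdn_a", latency_ms=48.0, cost_per_gb=0.08),
    Endpoint("cdn_b", latency_ms=51.0, cost_per_gb=0.05),
    Endpoint("cdn_c", latency_ms=140.0, cost_per_gb=0.03),
]

candidates = good_enough(endpoints, margin_ms=10.0)
choice = pick_cheapest(candidates)
print([e.name for e in candidates], "->", choice.name)
```

Here the slow endpoint is dropped, both acceptable endpoints survive the latency stage, and the cheaper of the two is selected. Picking only the lowest latency would have left no room for the cost decision.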

It’s Not As Simple As A = B

Equivalence is harder to measure than one might expect. Variability and error occur every time a measurement is taken. Repeated measurements of the round trip time (RTT) between a user and a service endpoint result in a distribution. Data differ from each other even though the measurements are ostensibly of the same physical route (see the blue distributions below). Measure RTT to a different endpoint (red), and observe a different distribution. Are these routes different, or the same? Our confidence in this inference depends on many factors, including the temporal dynamics of the signal as well as the sample size (though this is just a start).

Figure 1 Sampling variability tends towards over-optimization

Each panel above shows the variability of latency measurements of identical paths. It is nearly impossible to differentiate which endpoint is reliably faster (red or blue). The question of optionality is whether we should try to differentiate or be satisfied that they are, for all intents and purposes, equivalent.

It is tempting to start by summarizing a distribution with a single number (e.g., P50 or P90). This allows a simple logical test that can easily be written into code (if A equals B do this, if A is greater than B do that). However, we expect that measured summary statistics will differ by some amount even if the underlying process is identical. Sampling and measurement error are unavoidable. Using only a summary statistic to steer traffic assumes, incorrectly, that measurements have the same level of precision as the bytes of memory used to encode the information. This effectively removes ‘good enough’ options and limits the total potential of the investment in redundant infrastructure.
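A small simulation illustrates the problem. The sketch below draws two sets of RTT samples from the same underlying distribution and compares their medians with a naive test. The log-normal shape, the 50 ms median, the sample size, and the seed are assumptions for illustration and are not taken from the data in the figures.

```python
import math
import random
import statistics

def sample_rtts(rng, n, median_ms=50.0, sigma=0.2):
    """Draw n simulated RTTs (ms) from one log-normal distribution."""
    return [rng.lognormvariate(math.log(median_ms), sigma) for _ in range(n)]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Both "routes" share the same underlying process by construction.
rtt_blue = sample_rtts(rng, 30)
rtt_red = sample_rtts(rng, 30)

p50_blue = statistics.median(rtt_blue)
p50_red = statistics.median(rtt_red)

# The naive comparison always declares a winner, even though
# the only difference between the samples is sampling noise.
naive_winner = "blue" if p50_blue < p50_red else "red"

print(f"P50 blue = {p50_blue:.2f} ms, P50 red = {p50_red:.2f} ms")
print("naive comparison prefers:", naive_winner)
```

Changing the seed changes which route "wins", which is a sign that the comparison is measuring noise rather than a real difference between routes.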

There are statistical approaches to test for equivalence. These algorithms are known as equivalence or non-inferiority testing. However, all of these approaches start with defining an equivalence margin, a step so critical it ends up eclipsing the details of the math. An equivalence margin is an answer to the question: how different must two outcomes be to matter?
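One common form of equivalence testing is the two one-sided tests (TOST) procedure. The sketch below is a simplified version that compares mean latency using a normal approximation: two routes are declared equivalent when the confidence interval for the difference in means lies entirely inside the equivalence margin. The function name, the sample values, and the 10 ms margin are hypothetical, and a production test might prefer a t-distribution or a percentile-based statistic.

```python
import math
import statistics
from statistics import NormalDist

def equivalent(a, b, margin, alpha=0.05):
    """TOST under a normal approximation on the difference in means.

    Returns True when the (1 - 2 * alpha) confidence interval for
    mean(a) - mean(b) lies strictly inside (-margin, +margin).
    """
    diff = statistics.fmean(a) - statistics.fmean(b)
    se = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    z = NormalDist().inv_cdf(1 - alpha)
    lower, upper = diff - z * se, diff + z * se
    return -margin < lower and upper < margin

# Hypothetical RTT samples in milliseconds.
route_a = [80.0, 83.0, 78.0, 81.0, 79.0, 82.0, 80.0, 77.0]
route_b = [x + 2.0 for x in route_a]   # about 2 ms slower
route_c = [x + 60.0 for x in route_a]  # about 60 ms slower

print("a vs b:", equivalent(route_a, route_b, margin=10.0))
print("a vs c:", equivalent(route_a, route_c, margin=10.0))
```

Note how much of the outcome is decided by `margin` rather than by the statistics: the same pair of routes is equivalent under a 10 ms margin and not equivalent under a 1 ms margin.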

How Different Must Two Outcomes Be to Matter?

There are a few ways to choose an equivalence margin. In almost all cases, the practical approach is both simple and effective: acknowledgment of diminishing returns. For example, great UX is a common business goal. Improving page load time can dramatically improve customer experience and increase customer engagement. However, page load time (measured in milliseconds or seconds) is still an indirect measurement of customer engagement (measured in purchases or clicks). When we consider cause and effect, we find practical limitations exist on the ability of page load time to impact customer experience.

The human brain processes information slower than a modern computer network can send it. Suppose we’re given the option to choose between two CDNs. CDN A is expected to result in a page load time of 500ms, and CDN B is predicted to take 502ms. It’s true that CDN A is more performant; however, in the context of the business goal of user engagement, CDN A is equivalent to CDN B. This is a perfect opportunity to adopt a minimally viable optimization strategy, keeping satisfactory options in order to optimize in other dimensions.

Figure 2 One Bad, Two Good Options

Business outcomes are not measured in milliseconds. A customer engagement goal can be achieved with either CDN A or CDN B, letting traffic engineers use this remaining flexibility to load balance, reduce cost, or meet CDN commits. The latency data displayed in the bottom panel is real data. The scale of difference between ‘good enough’ and ‘bad’ outcomes is representative of real outcomes.

It’s usually possible to achieve the same overall performance improvement with a ‘good enough’ optimization strategy as with a strategy that always chooses the ‘absolute best’. The intuition is that the majority of the improvements of an optimization strategy are realized from systematically avoiding bad outcomes. Bad outcomes, like CDN C in Figure 2, are substantially and meaningfully different from ‘good enough’ outcomes.

Even use cases that require the absolute best performance can benefit from avoiding over-optimization. Performance itself can be modeled as multi-dimensional: latency (time to first byte), throughput (kb/sec), or jitter (ms² or peak-to-peak time). Each of these performance dimensions reaches its own point of diminishing returns, often with an equivalence margin substantially more flexible than the precision of the measurement itself.
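One way to picture per-dimension margins is as a chain of filters, each of which keeps every endpoint that is close enough to the best remaining one on a single metric. The following is a sketch under assumed values: the metric names, margins, and endpoint figures are hypothetical, and this is not a description of any particular product's implementation.

```python
# Hypothetical margins: metric -> (which direction is better, margin).
METRICS = {
    "latency_ms": ("lower", 10.0),
    "throughput_kbps": ("higher", 500.0),
    "jitter_ms": ("lower", 2.0),
}

def within_margin(value, best, direction, margin):
    """True when value is no more than margin worse than best."""
    gap = value - best if direction == "lower" else best - value
    return gap <= margin

def retain(endpoints):
    """Apply each metric's margin in turn, keeping all good-enough endpoints."""
    kept = dict(endpoints)
    for metric, (direction, margin) in METRICS.items():
        values = [m[metric] for m in kept.values()]
        best = min(values) if direction == "lower" else max(values)
        kept = {
            name: m
            for name, m in kept.items()
            if within_margin(m[metric], best, direction, margin)
        }
    return kept

endpoints = {
    "cdn_a": {"latency_ms": 40.0, "throughput_kbps": 9000.0, "jitter_ms": 3.0},
    "cdn_b": {"latency_ms": 45.0, "throughput_kbps": 9300.0, "jitter_ms": 2.5},
    "cdn_c": {"latency_ms": 42.0, "throughput_kbps": 6000.0, "jitter_ms": 1.0},
}

survivors = retain(endpoints)
print(sorted(survivors))
```

In this made-up example, all three endpoints pass the latency stage, one is dropped for throughput, and two remain after the jitter stage, leaving a choice for cost or load balancing.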

The Practice of Minimal Optimization

As a rule of thumb, if the goal has multiple outcome measures (e.g., cost, stability, meeting CDN commits, and complying with legal regulations), it’s best to retain all good-enough options. As CDNs and edge networks improve and multi-endpoint infrastructures become the norm, there will be many ways to achieve a given outcome and many viable paths to success.

At NS1, we’ve designed our filter chain, and Pulsar active traffic steering in particular, to retain good options out of the box. By leaving the maximum degrees of freedom available for other outcomes, traffic engineering teams are able to achieve more than ever before, without ever writing a piece of code.

Author's Bio

Clare Gollnick

Director of Data Science at NS1