If software is indeed moving to the cloud, new tools will be required for debugging when errors culminate in outages. Without physical access to the servers running their applications, teams have to rely on higher-level abstractions, chiefly monitoring, to understand what went wrong. Lately, I’ve been thinking a lot about the types of assurances (and startup solutions) enterprises will need in a cloud-native world to guarantee uptime.
Site Reliability Engineering is an established function in many businesses today, and it has seen a recent spike in popularity with the growth of the highly available services that power internet businesses. Distributed systems have many points of breakage, and testing and improving the reliability of services is at the core of maintaining uptime. Google, for example, has done extensive work in defining how the software lifecycle has changed and the best practices it has deployed to ensure reliability and prevent failure. Similarly, Netflix has popularized the notion of chaos engineering: deliberately stressing components of your environment in order to find and fix issues proactively and safely.
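To make the chaos engineering idea a bit more concrete, here is a minimal sketch of a fault-injection experiment in Python. It is only an illustration, not how Netflix or any particular tool does it: the names `FlakyService`, `call_with_retries`, and `run_experiment` are hypothetical, the "service" is a stand-in function, and the injected fault is a simple `ConnectionError` raised at a configurable rate. The experiment measures how many requests still succeed while faults are being injected, which is roughly the kind of steady-state check a chaos experiment looks at.

```python
import random


class FlakyService:
    """Wraps a callable and raises an injected fault at a given rate (hypothetical helper)."""

    def __init__(self, func, failure_rate, rng):
        self.func = func
        self.failure_rate = failure_rate
        self.rng = rng

    def __call__(self, *args):
        # Inject a failure before the real call, simulating a dependency outage.
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return self.func(*args)


def call_with_retries(service, arg, attempts=3):
    """Client-side resilience under test: retry on connection errors, then give up."""
    for _ in range(attempts):
        try:
            return service(arg)
        except ConnectionError:
            continue
    return None  # every attempt failed


def run_experiment(failure_rate, requests=1000, attempts=3, seed=0):
    """Return the fraction of requests that succeed while faults are injected."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    service = FlakyService(lambda x: x * 2, failure_rate, rng)
    succeeded = sum(
        1
        for i in range(requests)
        if call_with_retries(service, i, attempts) is not None
    )
    return succeeded / requests


if __name__ == "__main__":
    for attempts in (1, 3):
        rate = run_experiment(failure_rate=0.3, attempts=attempts)
        print(f"attempts={attempts}: success rate {rate:.3f}")
```

Comparing the success rate with one attempt against the rate with three gives a rough sense of how much a retry policy absorbs the injected failures. A real experiment would target actual infrastructure and start with a small blast radius.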
Reliability and Chaos Engineering are emerging trends worth digging into, and I think there’s an opportunity to build systems of intelligence around the data they generate. Sharing these learnings across enterprises would be a valuable proposition for an enterprise startup. As I get up to speed, here are some resources I really enjoyed.
SRE@Xero: Managing Incidents Part I (and II) by Karthik Nilakant
- “Over a series of blog posts Karthik discusses some of the tools and frameworks developed at Xero to manage incidents.”
The Rise of Chaos Engineering by Danny Crichton
- “The methodology of chaos engineering is simple in concept, but hard in execution. Software systems today are complex and tightly-coupled, meaning that the delivery of a webpage may actually rely on hundreds of database, file, image, and other requests in order to render. There has been a “combinatorial explosion” according to Andrus, particularly for engineering teams that have chosen a microservices architecture.”
2018 and the Dawn of Network Reliability Engineering (NRE) by James Kelly
- “The thing about creating simplicity is that it’s not just about tools or products. If you’re a network operator, another tool won’t make a revolutionary nor lasting impact toward simplifying your life any more than the momentary joy of a holiday gift. Big impacts and life changes start inside out. They don’t happen have-do-be, rather they are be-do-have. Juniper is doing its own work to put simplicity at its company’s core being, but this article, besides some gratuitous predictions, is about transforming your network operations, putting simplicity at your core for a happier prosperous 2018.”
SRE at Google
- JC walks through the life of an SRE at Google, including key responsibilities and how the teams are organized.
The Journey to Chaos Engineering begins with a single step
- This talk aims to equip any company with a strategy to start Chaos Engineering today through tips, tricks and lessons learned from Twilio’s journey with Chaos Engineering.
“GameDay” — Achieving Resilience through Chaos Engineering
- Pete Cohen and Matt Fellows discuss GameDay and chaos engineering: what they are, and how some organizations have run them successfully.
People to follow
- Charity Majors is a cofounder and engineer at Honeycomb.io, a startup that blends the speed of time series with the raw power of rich events to give you interactive, iterative debugging of complex systems.
- Brendan Gregg is an industry expert in computing performance and cloud computing. He is a senior performance architect at Netflix, where he does performance design, evaluation, analysis, and tuning.
- Russ Miles is the CEO of ChaosIQ.io where he and his team build commercial and open source (ChaosToolkit.org) products and provide services to companies applying Chaos Engineering to build confidence in their Cloud Native, Microservice-based systems on Kubernetes, AWS, Pivotal Cloud Foundry and more.