Chaos engineering started at Netflix, which publicly said that deliberately bringing down production systems helped it become more resilient. Still, taking down servers on purpose isn’t something most companies do, even after hearing Netflix’s success stories.

Some people might be thinking that chaos engineering is simply another way of testing how resilient systems are. And, to a certain extent, that’s true. However, it goes beyond testing, and it’s not only for companies like Netflix. So what exactly is chaos engineering? What is it not? And what are the steps to practice chaos engineering? Let me answer these questions briefly. Let’s start!

What’s Chaos Engineering?

Chaos engineering is an emerging discipline of running experiments to gain new knowledge about a complex system. You might think that these experiments sound much like test cases and wonder why you would bother using a different name for testing resiliency. The difference is that test cases make assertions based on existing knowledge of the system. If all tests pass, it means that the system behaves as expected.

However, the purpose of running chaos experiments is to generate new knowledge from the system. For instance, how does the system behave right now when you bring a server down? Is it able to continue functioning? Initially, you don’t know how it will behave. You might have a hypothesis on what could happen, but you’re not completely sure. These experiments can then be transformed into a regression test case. But you start by experimenting, getting new knowledge, and improving the system’s resilience or security.

What’s Not Chaos Engineering?

Perhaps the scariest part of chaos engineering for most companies is hearing that a company like Netflix runs chaos experiments in production. Many companies already struggle to keep a system stable and secure, so deliberately injecting more chaos into it, on the promise that this makes the system better, is a hard sell. But chaos engineering aims to bring chaos with a clear purpose in mind. Every experiment you run is based on a hypothesis you have. Breaking things without one is easy, and we can do it in countless ways with minimal effort.

So chaos engineering isn’t about injecting random failures into the system and seeing what happens. Chaos engineering exposes us to the chaos that’s already present in the system; it shouldn’t create new problems. To implement this practice correctly, you have to start in a controlled environment and turn off experiments right away if things go badly.
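The “turn it off right away” rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `inject_fault`, `rollback`, and `system_is_healthy` are hypothetical callables that you would supply for your own environment, and none of them belong to a real chaos tool.

```python
import time


def run_guarded_experiment(inject_fault, rollback, system_is_healthy,
                           max_checks=5, interval_seconds=0.0):
    """Inject a fault, then stop at the first failed health check.

    All three callables are hypothetical hooks provided by the caller.
    Returns True if the system stayed healthy for every check.
    """
    inject_fault()
    try:
        for _ in range(max_checks):
            if not system_is_healthy():
                return False  # abort early; the finally block still rolls back
            time.sleep(interval_seconds)
        return True
    finally:
        # The rollback runs whether the experiment passed, failed, or raised.
        rollback()
```

Putting the rollback in a `finally` block means the fault is removed even when a health check itself throws an error, which is the situation where you most want the experiment to end.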

Steps to Perform Chaos Engineering

Now that you have a better idea of what chaos engineering is and isn’t, how would you run a chaos experiment? The principles of chaos engineering give you a blueprint for doing so. The first thing you need to do is define what “normal” or “steady” looks like in your system. For instance, if the system normally responds in under 500 ms with a throughput of 300 requests per second, then you can create a hypothesis that this “normal” state will continue even if certain things happen. For example: if a server goes down, the system will keep responding within those limits.
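To make the steady-state idea concrete, here is a minimal sketch that turns the example numbers above into a check. The 500 ms and 300 requests per second thresholds come from the text; measuring latency at the 95th percentile and the names `SteadyState` and `is_steady` are assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass(frozen=True)
class SteadyState:
    """Thresholds describing 'normal'; defaults mirror the example in the text."""
    max_p95_latency_ms: float = 500.0
    min_throughput_rps: float = 300.0


def p95(latencies_ms):
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return quantiles(latencies_ms, n=20)[-1]


def is_steady(latencies_ms, window_seconds, steady=SteadyState()):
    """Return True if the observed window stays within the steady-state limits.

    latencies_ms holds one latency per request served during the window
    (at least two requests are needed to compute a percentile).
    """
    throughput_rps = len(latencies_ms) / window_seconds
    return (p95(latencies_ms) <= steady.max_p95_latency_ms
            and throughput_rps >= steady.min_throughput_rps)
```

The hypothesis then becomes a statement you can evaluate: `is_steady(...)` should return the same answer before and during the experiment.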

Once you have the hypothesis in place, it’s time to run experiments and observe how the system behaves when you vary real-world events. For instance, what happens if you bring a server down on purpose or inject networking problems that simulate a disruption? You should always look for ways to minimize the blast radius. An example would be to whitelist customers that will use only healthy servers.
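One way to picture the whitelist idea is a routing rule that keeps protected customers away from any server that is part of an experiment. The sketch below is hypothetical: the function name, the server labels, and the customer IDs are invented for illustration, and a real system would apply this rule in its load balancer or service mesh.

```python
def eligible_servers(customer_id, servers, under_experiment, protected_customers):
    """Return the servers that may handle a request from this customer.

    Protected (whitelisted) customers only ever see servers that are not
    part of a running experiment; everyone else can be routed anywhere.
    """
    if customer_id in protected_customers:
        return [s for s in servers if s not in under_experiment]
    return list(servers)


servers = ["app-1", "app-2", "app-3"]   # hypothetical server names
under_experiment = {"app-2"}            # the server we plan to take down
protected = {"acme-corp"}               # hypothetical whitelisted customer

protected_route = eligible_servers("acme-corp", servers, under_experiment, protected)
regular_route = eligible_servers("someone-else", servers, under_experiment, protected)
```

Here `protected_route` excludes `app-2`, while `regular_route` still includes it, so only non-protected traffic is exposed to the failure.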

Finally, you’ll observe the results of the experiments and confirm or refute your hypothesis. From those findings, you can then decide how to improve the system’s reliability, security, performance, and so on.

Conclusion

Chaos engineering is not only for companies like Netflix, and you don’t need to break systems in production if you’re not ready yet. Be mindful that the idea behind this emerging practice is to surface the problems that are already present in the system but that you don’t see. At the end of the day, the purpose of running chaos experiments in your systems is to help them become better.

Read part 2 of this blog series, Chaos Engineering In Practice.

An Alten Company, Cprime is a global consulting firm helping transforming businesses get in sync.