Chaos Engineering – Chaos in a production environment

Ronaldo Sales

Chaos theory holds that a small change in the initial conditions of a system can have large, unpredictable consequences later. Who hasn’t heard one version or another of the famous butterfly effect phrase?

“The fluttering of a butterfly’s wings in Brazil could trigger a tornado in Texas.”

BUTTERFLY EFFECT – VIA WIKIPEDIA

Based on this principle, Chaos Engineering, a methodology created and practiced at Netflix and adopted by other large companies, consists of injecting disturbances into a system and evaluating its behavior in search of aberrations, weaknesses and unpredictable results (those not anticipated when the non-functional requirements were gathered). And, of course, it is done in a PRODUCTION environment: that’s right, in the environment your customers are using, during business hours.

The methodology is based on some principles:

Build a hypothesis

Despite its name, this hypothesis must be built on the standard (steady-state) behavior of the environment/system. For example: response times are always below 5 s and the error rate is around 1%. Better still, use more business-oriented assumptions: 500 purchases are completed per minute, or 1,000 proposals are closed per hour.
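
As a minimal sketch, such a hypothesis can be written down as explicit thresholds and checked against an observed window of metrics. The names SteadyState and hypothesis_holds are hypothetical, and the thresholds are simply the example figures from the paragraph above; in practice the values would come from your own monitoring.

```python
from dataclasses import dataclass


@dataclass
class SteadyState:
    """Hypothetical container for the example thresholds from the text."""
    max_response_s: float = 5.0        # response times stay below 5 s
    max_error_rate: float = 0.01       # error rate around 1%
    min_purchases_per_min: int = 500   # business-level expectation


def hypothesis_holds(h, response_times_s, errors, requests, purchases_per_min):
    """Return True when the observed window matches the steady-state hypothesis."""
    if requests == 0:
        return False  # no traffic observed, so nothing confirms the hypothesis
    slowest = max(response_times_s) if response_times_s else 0.0
    error_rate = errors / requests
    return (
        slowest < h.max_response_s
        and error_rate <= h.max_error_rate
        and purchases_per_min >= h.min_purchases_per_min
    )
```

The same check would be run before and during an experiment: if it holds before the fault is injected and stops holding afterwards, the experiment has found a weakness.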

Vary Real World Events

What counts as a real-world event? Common events that can happen in everyday operation, such as a database node crash, a virtual machine crash, a network slowdown, and so on. These events must be purposely simulated in the environment.
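
One way to simulate such events at the application level, shown here only as a sketch, is to wrap a call to a dependency so that it sometimes fails or answers slowly. The names with_faults and InjectedFailure are hypothetical; real setups often inject faults at the infrastructure or network layer instead.

```python
import random
import time


class InjectedFailure(Exception):
    """Hypothetical error raised in place of a real dependency crash."""


def with_faults(call, failure_rate=0.0, delay_s=0.0, rng=random, sleep=time.sleep):
    """Wrap `call` so that it sometimes fails or answers slowly."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure("simulated dependency crash")
        if delay_s > 0:
            sleep(delay_s)  # simulated network slowdown
        return call(*args, **kwargs)
    return wrapped
```

For example, with_faults(fetch_order, failure_rate=0.1, delay_s=2.0) would make roughly one call in ten fail and slow the rest by two seconds, where fetch_order stands for any function of your own.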

Run Experiments

Here you need to be creative. Think of unusual situations, such as opening hundreds of telnet connections to a server, killing a service, unplugging a machine (logging in a little), or logging in and out massively and at random.

Automate experiments to run continuously and randomly

Schedule events and experiments to run automatically, varying the day, time and target.
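
A rough sketch of such a schedule follows. It draws a random day, time and target for each experiment within one work week. The function name plan_experiments and the 09:00 to 17:00 window are assumptions made for illustration, and week_start is assumed to be a Monday at midnight.

```python
import random
from datetime import datetime, timedelta


def plan_experiments(targets, experiments, week_start, count, seed=None):
    """Pick `count` random (when, target, experiment) slots in one work week."""
    rng = random.Random(seed)  # a seed makes the plan reproducible
    plan = []
    for _ in range(count):
        day = rng.randrange(5)        # Monday to Friday
        hour = rng.randrange(9, 17)   # start between 09:00 and 16:59
        minute = rng.randrange(60)
        when = week_start + timedelta(days=day, hours=hour, minutes=minute)
        plan.append((when, rng.choice(targets), rng.choice(experiments)))
    return sorted(plan, key=lambda slot: slot[0])
```

The resulting list can be handed to whatever scheduler you already use; the point is only that day, time and target are not chosen by a person.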

Minimize negative impacts

This is not one of the methodology’s original principles, but care must be taken that the experiments do not hurt the customer experience by causing discomfort, loss of revenue or a similar impact.
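
One possible way to limit the impact, sketched below under assumptions, is to watch an impact metric while the fault is active and to roll back as soon as it crosses a limit. The name run_with_guard, the 5% error-rate limit and the callables passed in are all hypothetical.

```python
import time


def run_with_guard(inject, rollback, read_error_rate,
                   max_error_rate=0.05, checks=10, interval_s=0.0, sleep=time.sleep):
    """Inject a fault, watch an impact metric, and always roll back."""
    try:
        inject()
        for _ in range(checks):
            if read_error_rate() > max_error_rate:
                return "aborted"   # impact too high, stop the experiment early
            sleep(interval_s)
        return "completed"
    finally:
        rollback()  # runs whether the experiment completes, aborts or raises
```

Keeping the rollback in a finally block means the fault is removed even when the experiment itself fails unexpectedly.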

Now the big question: how do you apply this concept? Should you use Chaos Monkey or a similar tool?

For those unfamiliar, Chaos Monkey is a system that randomly picks one or more virtual machine instances or systems and takes them down. Of course, this is done during business hours, when all the technical professionals are present and can act. So if your business has a well-implemented delivery pipeline or a very mature lifecycle process, with an always-testing, shift-left mindset running in your team’s veins and excellent observability of your environment/system (monitoring, logs, traces, etc.), you can choose to run a system along the lines of Chaos Monkey. If that is not your scenario, you can organize a Chaos Day (mobilizing the professionals involved) and outline a Failure Injection Testing strategy that follows the principles already listed. Observe the behavior of your environment and identify the points of failure, the points of slowness and the points without monitoring/observability coverage. Each finding should then be addressed accordingly, whether by correcting an error, optimizing a slow path or improving observability.
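
The selection logic described above can be sketched in a few lines. This is not Chaos Monkey’s actual implementation; pick_victims and in_business_hours are hypothetical names, the business hours are assumed to be weekdays from 09:00 to 17:00, and actually terminating an instance is left to your own tooling.

```python
import random
from datetime import datetime


def in_business_hours(now, start_hour=9, end_hour=17):
    """True on weekdays between start_hour (inclusive) and end_hour (exclusive)."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def pick_victims(instances, now, count=1, rng=random):
    """Choose instances to take down, or none outside business hours."""
    if not in_business_hours(now) or not instances:
        return []
    return rng.sample(list(instances), min(count, len(instances)))
```

Refusing to pick anything outside business hours is what keeps the people who can react in the loop.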

It may seem a little intimidating to run Chaos Engineering in your environment, but it can increase the reliability of your production environment. Of course, this should be done in applications and environments that are mature in their lifecycle, since the main goal is to identify unexpected behaviors, those that were not raised when the non-functional requirements were mapped, and perhaps even to identify some functional inconsistency that results from the purposely injected failures. If your maturity is not so high, you can choose to inject failures at times of low utilization and for a very short period (Minimize negative impacts).

Netflix even simulates the failure of an entire Amazon EC2 region (EC2 is Amazon’s server hosting service, and a region is a geographic division of that service). Besides the already mentioned Chaos Monkey, other kinds of failures are injected as well. This has led system engineers to increasingly build services/systems that can handle failures, whatever they may be. It is a search for resilience, high availability and care for the user experience.

So, shall we break things in production on purpose?

Main reference: Ali Basiri, Niosha Behnam, Ruud de Rooij, Lorin Hochstein, Luke Kosewski, Justin Reynolds, Casey Rosenthal, “Chaos Engineering”, IEEE Software, vol. 33, no. 3, pp. 35-41, May-June 2016, DOI: 10.1109/MS.2016.60

*By Ronaldo Sales – Bachelor in Computer Science from Unesp Rio Claro and Manager of the SRE & Automation Services Division at www.Yaman.com.br.