Chaos Testing Guide
Chaos Engineering and Testing
Resilience
Reliability
04.05.2022 — 13 min read
Chaos Engineering has never been more important for ensuring the best user experience. Whether for e-commerce, AI/ML jobs, or business-critical information gathering, no one likes it when an application suffers downtime or poor response times. To meet the SLAs and SLOs for a service, it’s important to ensure that both the platform and the services running on top of it are resilient to failures, degrade minimally in terms of performance, and avoid any potential downtime. The Chaos Testing Guide helps achieve this. It covers the need for chaos testing, the test methodology to embrace, best practices, tooling to uncover areas that need to be hardened, and test environment recommendations - how and where to run the chaos tests. This blog supports the Chaos Engineering and Testing ADR: https://github.com/operate-first/sre/pull/12.
Introduction
Let’s start by understanding the need for chaos testing. There are a number of false assumptions that users might make when operating and running their applications in distributed systems:
- The network is reliable.
- There is zero latency.
- Bandwidth is infinite.
- The network is secure.
- Topology never changes.
- The network is homogeneous.
- Consistent resource usage with no spikes.
- All shared resources are available from all places.
These assumptions have led to a number of outages in production environments in the past. The affected services suffered from poor performance or were inaccessible to customers, leading to missed Service Level Agreement uptime promises, revenue loss, and a degradation in the perceived reliability of those services.
Issues can arise at various levels: at the cloud provider layer, at the software layer that manages the applications (for example, OpenShift or Kubernetes), or at the application layer.
How can we best prevent this from happening? This is exactly where chaos testing adds value.
Here are examples of recent service outages where companies running large scale production environments could have benefited from Chaos Testing.
Cloudflare degraded performance and limited API availability Source
- On November 2nd, 2020, Cloudflare’s infrastructure encountered degraded performance and limited availability because of issues with a switch in one of the racks. The switch isolated traffic flow and caused a network partition, which led to multiple leader elections in a distributed system with a master/worker setup. The API was dropping 25% of requests, and users were experiencing performance 80 times slower than usual.
Facebook services outage for 6 hours due to network failures Source
- Facebook was most recently the victim of an outage in October 2021, when over 3 million users and businesses that rely on Facebook services couldn’t access instagram.com, facebook.com, WhatsApp, etc. for nearly six hours. The root cause analysis revealed that the audit system missed a misconfiguration in the backbone routers that control traffic between the data centers, which led to the outage. The good news is that Facebook had been doing chaos testing and capacity planning, which helped them get the systems back online without many issues once they resolved the misconfiguration.
Netflix Outage on a Saturday Night Source
- The streaming service was reported down in the US and in at least a dozen other countries. This led to negative feedback from customers on social media.
Test Strategies and Methodology
Best Practices
Now that we understand the test methodology, let’s take a look at best practices for an OpenShift cluster. On that platform, both user applications and cluster workloads need to be designed for stability and the best possible user experience:
Alerts with appropriate severity
- Alerts are key to identifying when a component starts degrading and can help focus the investigation on the affected system components.
- Alerts should have a proper severity, description, notification policy, escalation policy, and SOP in order to reduce MTTR for the responding SRE or Ops resources. Detailed information on alert consistency can be found here.
Minimal performance impact - Network, CPU, Memory, Storage, Throughput etc.
- The system, as well as the applications, should be designed to have minimal performance impact during disruptions to ensure stability and also to avoid hogging resources that other applications can use.
- We want to look at this in terms of CPU, Memory, Storage, Throughput, Network etc.
Minimal initialization time and fast recovery logic
- The controller watching the component should recognize a failure as soon as possible. The component needs to have minimal initialization time to avoid extended downtime, or to avoid overloading the other replicas in a highly available configuration. The failure can be caused by issues with the infrastructure it runs on, by application failures, or by failures of the services it depends on.
Appropriate CPU/Memory limits set to avoid performance throttling and OOM kills
- There might be rogue applications hogging resources (CPU/memory) on the nodes, which might lead to applications underperforming or, worse, getting OOM killed. It’s important to ensure that the applications and system components have reserved resources for the kube-scheduler to take into consideration in order to keep them performing at the expected levels.
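To make the reserved-resources point concrete, here is a minimal stdlib-Python sketch: the requests/limits values are illustrative (not a recommendation), mirroring the Kubernetes container spec as a dict, with helpers that parse resource quantities and sanity-check that requests do not exceed limits.

```python
# Illustrative requests/limits for a container, mirroring the Kubernetes
# YAML as a Python dict (values are examples, not a recommendation).
RESOURCES = {
    "requests": {"cpu": "500m", "memory": "256Mi"},  # guaranteed; used by the scheduler
    "limits": {"cpu": "1", "memory": "512Mi"},       # hard ceiling; exceeding memory => OOM kill
}

def parse_cpu(quantity):
    """Convert a Kubernetes CPU quantity ('500m' or '1') into cores."""
    if quantity.endswith("m"):
        return float(quantity[:-1]) / 1000
    return float(quantity)

def parse_mem(quantity):
    """Convert a memory quantity ('256Mi', '1Gi', plain bytes) into bytes."""
    units = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}
    for suffix, multiplier in units.items():
        if quantity.endswith(suffix):
            return int(quantity[:-2]) * multiplier
    return int(quantity)

def requests_within_limits(resources):
    """Requests above limits are invalid: the API server rejects such a spec."""
    return (parse_cpu(resources["requests"]["cpu"]) <= parse_cpu(resources["limits"]["cpu"])
            and parse_mem(resources["requests"]["memory"]) <= parse_mem(resources["limits"]["memory"]))
```

Setting requests lets the kube-scheduler account for the workload when placing pods; setting limits bounds the damage a rogue neighbor can do.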
High Availability deployment strategy
- There should be multiple replicas (both OpenShift and application control planes) running preferably in different availability zones to survive outages while still serving the user/system requests.
Services dependent on the system under test need to handle the failure gracefully to avoid performance degradation and downtime - appropriate timeouts
- In a distributed system, the deployed services coordinate with each other and might have external dependencies. Each service, deployed as a deployment, pod, or container, needs to handle the downtime of dependent services gracefully instead of crashing because it lacks appropriate timeouts, fallback logic, etc.
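As a small illustration of the timeout-plus-fallback pattern, here is a hedged sketch; the endpoint URL and cached value are hypothetical, but the shape — a hard timeout on the dependency call and a stale-data fallback instead of a crash — is what the bullet above describes.

```python
import json
import urllib.error
import urllib.request

# Hypothetical dependency endpoint and cached response, for illustration only.
PRIMARY_URL = "http://dependency.internal/api/v1/status"
CACHED_FALLBACK = {"status": "unknown", "stale": True}

def fetch_status(url=PRIMARY_URL, timeout=2.0):
    """Call a dependent service with a hard timeout and degrade gracefully.

    Without the timeout, a hung dependency would block this caller
    indefinitely; without the except branch, its downtime would crash us."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.load(response)
    except OSError:  # covers URLError, timeouts, connection resets
        # Dependency is down or slow: serve stale data instead of failing.
        return CACHED_FALLBACK
```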
Proper node sizing to avoid cascading failures and ensure cluster stability especially when the cluster is large and dense
- The platform needs to be sized taking into account the resource usage spikes that might occur during chaotic events. For example, if one of the control plane nodes goes down, the other two need to have enough resources to handle the load. The resource usage depends on the load, i.e., the number of objects running and being managed by the control plane (API server, etcd, controller manager, and scheduler). As such, it’s critical to test such conditions, understand the behavior, and leverage the data to size the platform appropriately. This can help keep the applications stable during unplanned events without the control plane undergoing cascading failures, which could potentially bring down the entire cluster.
Proper node sizing to avoid application failures and maintain stability
- An application pod might use more resources during reinitialization after a crash, so it’s important to take that into account when sizing the nodes in the cluster. For example, monitoring solutions like Prometheus need large amounts of memory to replay the write-ahead log (WAL) on restart. As such, it’s critical to test such conditions, understand the behavior, and leverage the data to size the platform appropriately. This can help keep the application stable during unplanned events without degraded performance or, even worse, hogging resources on the node, which can impact other applications and system pods.
Backed by persistent storage
- It’s important to have the system/application backed by persistent storage. This is especially important when the application is a database or another stateful application, given that a node, pod, or container failure would otherwise wipe out the data.
There should be fallback routes to the backend when using a CDN - for example, Akamai in the case of console.redhat.com, a managed service deployed on top of OpenShift Dedicated:
- Content delivery networks (CDNs) are commonly used to host resources such as images, JavaScript files, and CSS. The average web page is nearly 2 MB in size, and offloading heavy resources to third-parties is extremely effective for reducing backend server traffic and latency. However, this makes each CDN an additional point of failure for every site that relies on it. If the CDN fails, its customers could also fail.
- To test how the application reacts to failures, drop all network traffic between the system and CDN. The application should still serve the content to the user irrespective of the failure.
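A sketch of that fallback route, with hypothetical CDN and origin URLs: the client walks an ordered list of sources so that content is still served when the CDN path is blocked.

```python
import urllib.request

# Hypothetical asset locations for illustration: CDN edge first, origin second.
ASSET_SOURCES = [
    "https://cdn.example.com/static/app.js",
    "https://origin.example.com/static/app.js",
]

def fetch_asset(sources=ASSET_SOURCES, timeout=2.0):
    """Return the asset from the first reachable source.

    If the CDN is unreachable, the origin route keeps content flowing
    to users instead of turning a third-party failure into our outage."""
    last_error = None
    for url in sources:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read()
        except OSError as error:
            last_error = error  # this source failed; try the next one
    raise RuntimeError(f"all sources failed: {last_error}")
```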
Appropriate caching and Content Delivery Network usage should be enabled to be performant and usable when there’s a latency on the client side:
- Not every user or machine has access to unlimited bandwidth; there might be a delay on the user (client) side in accessing the APIs due to limited bandwidth, throttling, or latency depending on the geographic location. It’s important to inject latency between the client and the API calls to understand the behavior and optimize things, including caching wherever possible, using CDNs, or opting for different protocols like HTTP/2 or HTTP/3 instead of HTTP/1.1.
Tooling
Now that we have looked at the best practices, in this section we will go through how Krkn, a chaos testing framework, can help test the resilience of OpenShift and make sure the applications and services follow the best practices.
Workflow
Let us start by understanding the workflow of Krkn. The user starts by pointing Krkn at a specific OpenShift cluster using a kubeconfig, so it can talk to the platform on which the OpenShift cluster is hosted, either through the oc/kubectl API or through the cloud API. Based on its configuration, Krkn injects specific chaos scenarios as shown below, talks to Cerberus to get the go/no-go signal representing the overall health of the cluster (optional - can be turned off), scrapes metrics from the in-cluster Prometheus given a metrics profile with PromQL queries and stores them long term in the configured Elasticsearch (optional - can be turned off), evaluates the PromQL expressions specified in the alerts profile (optional - can be turned off), and aggregates everything to set the pass/fail result, i.e., exits 0 or 1. More about the metrics collection, Cerberus, and metrics evaluation can be found in the next section.
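The workflow above can be sketched as a small driver loop. Every function name here is a hypothetical stand-in rather than Krkn's real API; the point is how scenario recovery and the optional checks aggregate into a single exit status.

```python
# Sketch of the run loop described above; the scenario and check callables
# are hypothetical stand-ins, not Krkn's real API.
def run_chaos(scenarios, checks):
    """Inject each scenario, gate on its built-in recovery check plus the
    optional checks (e.g. Cerberus go/no-go, PromQL alert evaluation), and
    aggregate everything into a single pass/fail exit status."""
    failures = 0
    for inject_scenario in scenarios:
        healthy = inject_scenario()   # inject + built-in recovery check
        for check in checks:          # optional cluster-wide checks
            healthy = check() and healthy
        if not healthy:
            failures += 1
    return 0 if failures == 0 else 1

# Toy usage: one passing scenario and one passing check -> exit status 0.
exit_status = run_chaos([lambda: True], [lambda: True])
```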
Cluster recovery checks, metrics evaluation and pass/fail criteria
Most of the scenarios have built-in checks to verify that the targeted component recovered from the failure within the specified duration, but other components might also be impacted by a given failure, and it’s extremely important to make sure that the system/application is healthy as a whole post-chaos. This is exactly where Cerberus comes to the rescue.
If the monitoring tool Cerberus is enabled, Krkn consumes its signal and decides whether or not to continue running chaos based on it.
Apart from checking recovery and cluster health status, it’s equally important to evaluate performance metrics like latency, resource usage spikes, throughput, and etcd health (disk fsync, leader elections, etc.). To help with this, Krkn has a way to evaluate PromQL expressions from the in-cluster Prometheus and set the exit status to 0 or 1 based on the severity set for each query. Details on how to use this feature can be found here.
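A minimal sketch of that evaluation step, using the standard Prometheus HTTP query API; the profile format and severity handling below are illustrative assumptions, not Krkn's actual alert-profile schema.

```python
import json
import urllib.parse
import urllib.request

# Illustrative (expression, severity) pairs - NOT Krkn's actual schema,
# just the shape of the idea. The metric names are standard etcd and
# Kubernetes API server metrics.
ALERT_PROFILE = [
    ("histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.01",
     "error"),    # etcd disk fsync too slow -> fail the run
    ("rate(apiserver_request_total{code=~'5..'}[5m]) > 0",
     "warning"),  # API server 5xx errors -> log only
]

def query(prometheus_url, expression):
    """Instant query via the standard Prometheus HTTP API; returns result rows."""
    url = prometheus_url + "/api/v1/query?" + urllib.parse.urlencode({"query": expression})
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)["data"]["result"]

def evaluate(profile, run_query):
    """Return exit status 1 if any 'error'-severity expression returned samples."""
    status = 0
    for expression, severity in profile:
        if run_query(expression) and severity == "error":
            status = 1
    return status
```

`run_query` is injected so the evaluation logic can be exercised without a live cluster; in practice it would be `lambda e: query("http://prometheus:9090", e)`.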
The overall pass or fail of Krkn is based on the recovery of the specific component (within a certain amount of time), the Cerberus health signal, which tracks the health of the entire cluster, and the metrics evaluation from the in-cluster Prometheus.
Scenarios
Let us take a look at how to run the chaos scenarios on your OpenShift clusters using Krkn-hub, a lightweight wrapper around Krkn that eases the runs: you just run container images using podman, with parameters set as environment variables. This eliminates the need to carry around and edit configuration files and makes integration with any CI framework easy. Here are the scenarios supported:
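For illustration, such a run can be composed as a podman command with scenario parameters passed as environment variables. The image path, tag, and variable names below are assumptions; check the Krkn-hub documentation for the exact values your scenario expects.

```python
import subprocess

# Compose (but do not run) a Krkn-hub invocation. Image path and env var
# names are illustrative assumptions - consult the Krkn-hub docs.
def krkn_hub_command(scenario, kubeconfig, env=None):
    command = ["podman", "run", "--net=host",
               "-v", f"{kubeconfig}:/root/.kube/config:Z"]
    for name, value in (env or {}).items():
        command += ["-e", f"{name}={value}"]   # scenario parameters as env vars
    command.append(f"quay.io/krkn-chaos/krkn-hub:{scenario}")
    return command

cmd = krkn_hub_command("pod-scenarios", "/home/user/.kube/config",
                       {"NAMESPACE": "openshift-etcd", "POD_LABEL": "app=etcd"})
# subprocess.run(cmd, check=True)  # uncomment to actually launch the scenario
```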
Pod Scenarios (Documentation)
- Disrupts OpenShift/Kubernetes and applications deployed as pods:
- Helps understand the availability of the application, the initialization timing and recovery status.
- Demo
Container Scenarios (Documentation)
- Disrupts OpenShift/Kubernetes and applications deployed as containers running as part of a pod(s) using a specified kill signal to mimic failures:
- Helps understand the impact and recovery timing when the program/process running in the container is disrupted - hung, paused, killed, etc. - using various kill signals, i.e., SIGHUP, SIGTERM, SIGKILL, etc.
- Demo
Node Scenarios (Documentation)
- Disrupts nodes as part of the cluster infrastructure by talking to the cloud API. AWS, Azure, GCP, OpenStack and Baremetal are the supported platforms as of now. Possible disruptions include:
- Terminate nodes
- Fork bomb inside the node
- Stop the node
- Crash the kubelet running on the node
- etc.
- Demo
Zone Outages (Documentation)
- Creates an outage of availability zone(s) in a targeted region of the public cloud where the OpenShift cluster is running by tweaking the zone’s network ACL to simulate the failure. This stops both ingress and egress traffic from all nodes in the particular zone for the specified duration, and then reverts the ACL to its previous state.
- Helps understand the impact on both Kubernetes/OpenShift control plane as well as applications and services running on the worker nodes in that zone.
- Currently only set up for the AWS cloud platform: one VPC and multiple subnets within the VPC can be specified.
- Demo
Application Outages (Documentation)
- Blocks the traffic (ingress/egress) of an application matching the given labels for the specified duration to understand the behavior of the service, and of other services that depend on it, during the downtime.
- Helps understand how the dependent services react to the unavailability.
- Demo
Power Outages (Documentation)
- This scenario imitates a power outage by shutting down the entire cluster for a specified duration, then restarting all the nodes after the specified time and checking the health of the cluster.
- There are various use cases in customer environments. For example, some clusters are shut down when the applications are not needed for a particular time/season in order to save costs.
- The nodes are stopped in parallel to mimic a power outage, i.e., pulling the plug.
- Demo
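The parallel, pull-the-plug nature of the scenario can be sketched like this; `stop_node` is a hypothetical stand-in for the cloud provider's stop-instance call, and the node names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def stop_node(name):
    """Hypothetical stand-in: replace the print with the real cloud API call."""
    print(f"stopping {name}")
    return name

def power_outage(nodes):
    """Issue every stop call concurrently so the nodes go down together,
    mimicking a pulled plug rather than a graceful rolling shutdown."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        return list(pool.map(stop_node, nodes))

stopped = power_outage(["master-0", "master-1", "master-2", "worker-0"])
```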
Resource Hog
- Hogs CPU, Memory and IO on the targeted nodes
- Helps verify that application/system components have reserved resources so they are not disrupted, or performance throttled, by rogue applications.
Time Skewing (Documentation)
- Manipulate the system time and/or date of specific pods/nodes.
- Verify scheduling of objects so they continue to work.
- Verify time gets reset properly.
Namespace Failures (Documentation)
- Delete namespaces for the specified duration.
- Helps understand the impact on other components and tests/improves recovery time of the components in the targeted namespace.
Persistent Volume Fill (Documentation)
- Fills up the persistent volumes, up to a given percentage, used by the pod for the specified duration.
- Helps understand how an application copes when it is no longer able to write data to disk. For example, Kafka’s behavior when it is not able to commit data to disk.
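A simple sketch of the fill step: given a mount path and a target usage percentage (both illustrative inputs), compute how many bytes are missing and write them as a filler file.

```python
import os
import shutil

def fill_to_percent(mount_path, target_percent):
    """Write a filler file so total disk usage on the volume reaches
    target_percent; returns the number of bytes written. A real scenario
    would also revert the fill after the specified duration."""
    usage = shutil.disk_usage(mount_path)
    target_used = int(usage.total * target_percent / 100)
    bytes_needed = max(0, target_used - usage.used)  # 0 if already above target
    chunk = b"\0" * (1 << 20)  # write in 1 MiB chunks
    remaining = bytes_needed
    with open(os.path.join(mount_path, "chaos_filler.dat"), "wb") as filler:
        while remaining > 0:
            step = min(len(chunk), remaining)
            filler.write(chunk[:step])
            remaining -= step
    return bytes_needed
```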
Network Chaos (Documentation)
- Supported scenarios include:
- Network latency
- Packet loss
- Interface flapping
- DNS errors
- Packet corruption
- Bandwidth limitation
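Scenarios like latency and packet loss are typically implemented on Linux with the traffic-control (tc) netem qdisc. The sketch below only composes the commands; the interface name and values are illustrative, and the commented lines show how they would be applied and reverted.

```python
import subprocess

def netem_commands(interface="ens3", delay_ms=100, loss_percent=5):
    """Build the tc/netem commands to inject latency + packet loss and to revert."""
    inject = ["tc", "qdisc", "add", "dev", interface, "root", "netem",
              "delay", f"{delay_ms}ms", "loss", f"{loss_percent}%"]
    revert = ["tc", "qdisc", "del", "dev", interface, "root", "netem"]
    return inject, revert

inject_cmd, revert_cmd = netem_commands()
# subprocess.run(inject_cmd, check=True)  # add latency + packet loss
# subprocess.run(revert_cmd, check=True)  # restore normal traffic
```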
Test Environment Recommendations - how and where to run chaos tests
Let us take a look at few recommendations on how and where to run the chaos tests:
Run the chaos tests continuously in your test pipelines:
- Software, systems, and infrastructure do change, and the condition/health of each can change pretty rapidly. A good place to run the tests is in your CI/CD pipeline, on a regular cadence.
Run the chaos tests manually to learn from the system:
- When running a chaos scenario or fault test, it is more important to understand how the system responds and reacts than to mark the execution as pass or fail.
- It is important to define the scope of the test before the execution to avoid some issues from masking others.
Run the chaos tests in production environments or mimic the load in staging environments:
- As scary as the thought of testing in production is, production is the environment that users are in, and the traffic spikes/load there are real. To fully test the robustness/resilience of a production system, running chaos engineering experiments in the production environment provides the needed insights. A couple of things to keep in mind:
- Minimize blast radius and have a backup plan in place to make sure the users and customers do not undergo downtime.
- Mimic the load in a staging environment in case Service Level Agreements are too tight to cover any downtime.
Enable Observability:
- Chaos Engineering Without Observability ... Is Just Chaos.
- Make sure to have logging and monitoring installed on the cluster to help understand the behavior and why it is happening. When running the tests in CI, where it is not humanly possible to monitor the cluster all the time, it is recommended to leverage Cerberus to capture the state during the runs and Krkn’s metrics collection to store metrics long term, even after the cluster is gone.
- Krkn ships with dashboards that will help understand API, Etcd and OpenShift cluster level stats and performance metrics.
- Pay attention to Prometheus alerts. Check if they are firing as expected.
Run multiple chaos tests at once to mimic the production outages:
- For example, hogging both IO and Network at the same time instead of running them separately to observe the impact.
- You might have existing test cases, be they related to performance, scalability, or QE. Run the chaos in the background during those test runs to observe the impact. The signaling feature in Krkn can help coordinate the chaos runs, i.e., start, stop, or pause the scenarios based on the state of the other test jobs.
Contributing
We are always looking to enhance Krkn, Krkn-hub, and Cerberus, and we welcome any and all suggestions and help. Please feel free to open issues, or if you want to contribute yourself, refer to the documents below:
What's next?
In summary, we looked at the need for chaos testing, test strategies and methodology, best practices and recommendations, and how we can leverage tooling to ensure that both the system and the services running on it are resilient and performant. Do give this guide a try and feel free to open issues and enhancement ideas on GitHub. Any feedback or contributions are most welcome and appreciated. We would love to hear from you:
Embrace chaos and make it part of your environment!