Chaos Engineering and Testing for Availability and Resiliency in Cloud

June 21, 2018

Amith Shetty, McAfee

As enterprises embrace the cloud for large-scale distributed application deployments, it has become crucial to make sure that systems are always available and resilient to failures. We are quick to adopt practices that increase the flexibility of development and the velocity of deployment. But as systems scale, we must expect parts of the infrastructure to fail ungracefully in random and unexpected ways. We must design cloud architectures in which part of the system can fail and recover without affecting the availability of the entire system.

We make our architecture fault-tolerant and resilient by adopting multi-availability-zone deployments, multi-region deployments, and other techniques. However, it is equally important to test these failure scenarios so that we can be confident of surviving such disruptions and recovering quickly. Functional specifications fail to describe distributed systems because we cannot characterize all possible inputs; hence the need to validate system availability in production with chaos engineering.
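To make the value of multi-zone redundancy concrete, here is a small worked calculation. It is only a sketch: the function name `combined_availability` and the 99% per-zone figure are illustrative assumptions, and the formula assumes zones fail independently, which is exactly the kind of assumption a chaos experiment is meant to check.

```python
def combined_availability(zone_availability: float, zones: int) -> float:
    """Probability that at least one of `zones` redundant zones is up.

    Assumes zone failures are independent (an idealization; correlated
    failures make the real number lower).
    """
    if not 0.0 <= zone_availability <= 1.0:
        raise ValueError("zone_availability must be between 0 and 1")
    if zones < 1:
        raise ValueError("zones must be at least 1")
    # The system is down only if every zone is down at the same time.
    return 1.0 - (1.0 - zone_availability) ** zones


if __name__ == "__main__":
    # Illustrative figure, not a measured or published value.
    for n in (1, 2, 3):
        print(n, combined_availability(0.99, n))
```

Under these assumptions, a second zone at 99% takes the combined figure from 0.99 to roughly 0.9999. The calculation says nothing about whether failover actually works, which is why the scenarios still have to be tested.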

Chaos Engineering is the discipline of experimenting on a distributed system to build confidence in the system’s capability to withstand turbulent conditions in production.

Chaos Testing is the deliberate introduction of disaster scenarios into our infrastructure to test the system’s ability to respond to them. It is an effective way to practice, prepare for, and prevent or minimize downtime and outages before they occur.

If adopted and applied in a controlled manner, chaos experiments let us learn how the system behaves in a chaotic state and devise preventive actions to recover from disaster situations and reduce downtime.

Chaos Testing generally includes the following:

Collect data on the health of the system and hypothesize about its steady state
Introduce real-world events, such as turning off a server or simulating an availability zone failure
Run tests close to the production environment
Automate these experiments to run continuously
Minimize the effects of your experiments to keep from blowing everything up
Learn to design Chaos Engineering experiments
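
The steps above can be sketched as a small, self-contained simulation. This is a minimal illustration under stated assumptions, not the tooling described in the talk: `Server`, `Cluster`, `run_experiment`, and the `min_healthy_fraction` threshold are hypothetical names, and the "failure" is a flag flipped in memory rather than a real call to a cloud provider.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Server:
    name: str
    zone: str
    healthy: bool = True


@dataclass
class Cluster:
    servers: List[Server]

    def healthy_fraction(self) -> float:
        """Fraction of servers currently marked healthy."""
        return sum(s.healthy for s in self.servers) / len(self.servers)

    def serves_traffic(self) -> bool:
        """Simplified health check: at least one server is up."""
        return any(s.healthy for s in self.servers)


def run_experiment(cluster: Cluster, zone_to_fail: str,
                   min_healthy_fraction: float = 0.5) -> dict:
    """Simulate losing one zone and check a steady-state hypothesis."""
    # 1. Collect health data; do not experiment on an unhealthy system.
    baseline = cluster.healthy_fraction()
    if baseline < 1.0:
        return {"ran": False, "hypothesis_held": None, "observed": baseline}

    # 2. Introduce the event: mark every server in the zone as failed.
    victims = [s for s in cluster.servers if s.zone == zone_to_fail]
    for s in victims:
        s.healthy = False
    try:
        # 3. Observe and compare against the hypothesis.
        observed = cluster.healthy_fraction()
        held = cluster.serves_traffic() and observed >= min_healthy_fraction
    finally:
        # 4. Always roll back, to limit the effects of the experiment.
        for s in victims:
            s.healthy = True
    return {"ran": True, "hypothesis_held": held, "observed": observed}


if __name__ == "__main__":
    cluster = Cluster([Server("a1", "zone-a"), Server("a2", "zone-a"),
                       Server("b1", "zone-b"), Server("b2", "zone-b")])
    print(run_experiment(cluster, "zone-a"))
```

In a real setup, the injection step would stop actual instances or block traffic, and the observation step would read production metrics. Wrapping the whole function in a scheduler is one way to run such experiments continuously.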

This talk is about how our team took its first small steps in learning and building Chaos Engineering and Testing tools for our cloud deployment in AWS.

Key takeaways:

Introduction to Chaos Engineering and Testing
Factors to consider in building highly available, redundant architecture in the cloud
When do you need Chaos Testing?
How our team started Chaos Testing and the benefits we gained from it
Overview of open source tools available for chaos testing in AWS and Azure
Chaos testing for microservices and containers

Amith Shetty, Krithika Hadge, Atul Ahire, 2018 Technical Presentation, Abstract, Paper, Slides
