Amith Shetty, McAfee
As enterprises embrace the cloud for large-scale distributed application deployments, it has become crucial to make sure that systems are always on, available, and resilient to failures. We are quick to adopt practices that increase the flexibility of development and the velocity of deployment. But as systems scale, we should expect parts of the infrastructure to fail ungracefully in random and unexpected ways. We must design cloud architectures in which part of the system can fail and recover without affecting the availability of the entire system.
We make our architecture fault-tolerant and resilient by adopting multi-availability zone deployments, multi-region deployments, and other techniques. However, it is equally important to test these failure scenarios so that we can be confident of surviving such disruptions and recovering quickly. Functional specifications fail to describe distributed systems because we cannot characterize all possible inputs; hence the need to validate system availability in production with chaos engineering.
Chaos Engineering is the discipline of experimenting on a distributed system to build confidence in the system’s capability to withstand turbulent conditions in production.
Chaos Testing is the deliberate introduction of disaster scenarios into our infrastructure to test the system’s ability to respond to them. It is an effective way to practice, prepare for, and prevent or minimize downtime and outages before they occur.
If chaos experiments are adopted and run in a controlled manner, we can learn how the system behaves in a chaotic state and devise preventive actions to recover from disaster situations and reduce downtime.
Chaos Testing generally includes the following:
- Collect data on the health of the system and hypothesize about its steady state
- Introduce real-world events, such as turning off a server or simulating an availability zone failure
- Run tests close to the production environment
- Automate these experiments to run continuously
- Minimize the effects of your experiments to keep from blowing everything up
- Learn to design Chaos Engineering experiments
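The steps above can be illustrated with a minimal, self-contained sketch. It is not the tooling described in the talk: it simulates a cluster in memory rather than calling any cloud API, and every name in it (`Server`, `Cluster`, `fail_zone`, `run_experiment`, the zone labels `az-a` and `az-b`, and the `min_availability` threshold) is a hypothetical choice made for illustration. The sketch checks the steady state, injects an availability zone failure, observes whether the hypothesis still holds, and always rolls the failure back to limit the effect of the experiment.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    zone: str
    healthy: bool = True


@dataclass
class Cluster:
    servers: list

    def availability(self) -> float:
        """Fraction of servers that are currently healthy."""
        healthy = sum(1 for s in self.servers if s.healthy)
        return healthy / len(self.servers)


def fail_zone(cluster: Cluster, zone: str) -> list:
    """Inject a failure: mark every healthy server in `zone` as unhealthy.

    Returns the affected servers so the caller can roll the failure back.
    """
    affected = [s for s in cluster.servers if s.zone == zone and s.healthy]
    for server in affected:
        server.healthy = False
    return affected


def restore(servers: list) -> None:
    """Roll back an injected failure."""
    for server in servers:
        server.healthy = True


def run_experiment(cluster: Cluster, zone: str, min_availability: float = 0.5) -> dict:
    """Run one simulated chaos experiment against an in-memory cluster."""
    # Step 1: confirm the steady state before injecting anything.
    baseline = cluster.availability()
    if baseline < 1.0:
        raise RuntimeError("cluster is already degraded; experiment aborted")

    # Step 2: inject the failure and observe the system.
    affected = fail_zone(cluster, zone)
    try:
        observed = cluster.availability()
        # Hypothesis (an assumption of this sketch): losing one zone
        # leaves at least `min_availability` of the servers healthy.
        passed = observed >= min_availability
    finally:
        # Step 3: always roll back, even if the observation step raised.
        restore(affected)

    return {
        "baseline": baseline,
        "observed": observed,
        "passed": passed,
        "recovered": cluster.availability(),
    }


if __name__ == "__main__":
    cluster = Cluster([
        Server("web-1", "az-a"), Server("web-2", "az-a"),
        Server("web-3", "az-b"), Server("web-4", "az-b"),
    ])
    print(run_experiment(cluster, "az-a"))
```

In a real deployment the injection and observation steps would be replaced by calls to the cloud provider and the monitoring system, and the same function could be scheduled to run continuously.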
This talk is about how our team took its first small steps in learning about and building Chaos Engineering and Testing tools for our cloud deployment in AWS.
Key takeaways:
- Introduction to Chaos Engineering and Testing
- Factors to consider in building highly available, redundant architecture in the cloud
- When do you need Chaos Testing?
- How our team started Chaos Testing and the benefits we gained from it
- Overview of open-source tools available for Chaos Testing in AWS and Azure
- Chaos Testing for microservices and containers
Amith Shetty, Krithika Hadge, Atul Ahire, 2018 Technical Presentation, Abstract, Paper, Slides