Chaos Testing and Engineering in the AI Era

In the era of AI-driven microservices and cloud-driven infrastructure, application complexity has grown significantly, making failures harder to predict. Outages are costly for organizations and directly affect customers who are trying to access the application, transact business, and complete tasks. To address this challenge, more companies are turning to Chaos Engineering, a disciplined approach to identifying failures before they become outages.

This presentation will provide an overview of Chaos Engineering techniques, including how to intentionally break things in order to improve application availability, reduce mean time to detection (MTTD) and mean time to resolution (MTTR), ship fewer bugs to production, and experience fewer outages. It will also discuss the role of QA in chaos engineering, focusing on how QA engineers can collaborate with architects to run experiments on known knowns, known unknowns, unknown knowns, and unknown unknowns. The presentation will offer guidance on how QAs can move from learning about Chaos Testing to practicing it, including preparing for Gameday and simulating production-like behavior to achieve site reliability.
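As a rough illustration of the two metrics named above, the sketch below computes MTTD and MTTR from a list of incident records. The `Incident` fields and the choice to measure resolution time from the start of the fault (rather than from detection) are assumptions made for this example; teams define these metrics in different ways.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    started: datetime   # when the fault began
    detected: datetime  # when monitoring or a person noticed it
    resolved: datetime  # when service was restored


def mttd(incidents):
    """Mean time to detection: average of (detected - started)."""
    seconds = mean((i.detected - i.started).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


def mttr(incidents):
    """Mean time to resolution, measured here from fault start (an assumption)."""
    seconds = mean((i.resolved - i.started).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 12, 0)
    history = [
        Incident(t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=30)),
        Incident(t0, t0 + timedelta(minutes=15), t0 + timedelta(minutes=60)),
    ]
    print("MTTD:", mttd(history))
    print("MTTR:", mttr(history))
```

Tracking both numbers before and after a round of chaos experiments is one way to check whether detection and recovery are actually improving.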

Finally, best practices for QAs will be highlighted, including insights on how AI-based monitoring and automated incident response streamline recovery during production outages.

  • Design and run chaos experiments by simulating real-world failures in the system, then collect and analyze data to identify areas for improvement.
  • QA and architects can collaborate to carry out experiments and improve overall system reliability.
  • Preparing for Gameday and simulating production-like behavior can help achieve site reliability.
  • Best practices such as implementing monitoring and alerting, creating runbooks, and conducting post-incident reviews can help minimize the impact of production outages.
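The experiment loop described in the takeaways above can be sketched in a few lines: state a steady-state hypothesis, inject a fault, collect data, and compare. Everything in this sketch is hypothetical and runs in-process; `FlakyDependency`, `handle_request`, the retry count, and the 95% threshold are illustrative assumptions, not a real chaos tool or the presenters' method.

```python
import random


class FlakyDependency:
    """Hypothetical downstream service with an injectable failure rate."""

    def __init__(self, failure_rate=0.0, rng=None):
        self.failure_rate = failure_rate
        self.rng = rng if rng is not None else random.Random()

    def call(self):
        # Fault injection: fail with the configured probability.
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return "ok"


def handle_request(dep, retries=2):
    """System under test: retry the dependency, then degrade gracefully."""
    for _ in range(retries + 1):
        try:
            return dep.call()
        except ConnectionError:
            continue
    return "fallback"


def run_experiment(failure_rate, requests=1000, seed=42):
    """Send traffic while the fault is active; return the share of full successes."""
    dep = FlakyDependency(failure_rate, random.Random(seed))
    results = [handle_request(dep) for _ in range(requests)]
    return results.count("ok") / requests


def hypothesis_holds(success_ratio, threshold=0.95):
    """Steady-state hypothesis: at least `threshold` of requests fully succeed."""
    return success_ratio >= threshold


if __name__ == "__main__":
    for rate in (0.0, 0.3, 0.9):
        ratio = run_experiment(rate)
        print(f"failure_rate={rate}: success={ratio:.3f}, "
              f"hypothesis holds: {hypothesis_holds(ratio)}")
```

In a real Gameday the fault would be injected into infrastructure (for example network latency or an instance shutdown) and the success ratio would come from production-like monitoring, but the structure of hypothesis, injection, measurement, and comparison stays the same.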

Jigesh Shah

Jigesh Shah is an Engineering Manager specializing in Quality Assurance, with extensive experience in the eLearning and Healthcare industries. He has developed numerous reusable testing tools, frameworks, and templates, significantly enhancing testing efficiency. Currently, he provides innovative testing solutions that improve quality for both US and global clients. Jigesh has played a pivotal role in test advisory, management, planning, and execution, and has led offshore testing efforts. His expertise also extends to Performance Engineering, SQL optimization, system resource monitoring, and the implementation of Generative AI technologies in testing processes, achieving a 20% increase in overall efficiency.

Mradul Kapoor

Mradul Kapoor is a QA leader working with Deloitte Consulting. He specializes in fast-paced, large-scale implementations with low- and no-code platforms and QA delivery. He leads overall quality assurance (QA) for client deliverables, including setting up large-scale testing practices and Testing Centers of Excellence (TCoE) that meet quality standards across the various pillars of quality, in line with business specifications, agreed timelines, and cost targets. He is experienced in setting up and managing large-scale TCoEs across shores with the latest QA technology stacks, for Fortune 500 firms as well as startups. Mradul has strong hands-on experience in test team management and has demonstrated skills across all test levels, including specialist testing such as automation, performance, and accessibility testing.