Real time failure point detection and operational health monitoring of Cloud Infra

Cloud infra downtime directly results in lost engineering productivity. Infra resilience and robustness are every engineering team's responsibility. A complex cloud software solution has interconnected dependencies, which exposes us to multi-point failures. The current cloud infra monitoring process has three challenges. First, the current tools monitor individual parameters, and real-time correlation across them is missing. Second, in a complex infra of interconnected services, when a component fails, locating the component actually at fault takes up a major portion of the troubleshooting time. Third, proactive health monitoring is limited to checking whether features are running after deployment, which leaves a gap in understanding how the system behaves during the deployment itself. Our proposed solution for building a robust software health monitoring system draws a parallel with health care. During a critical operation, full patient monitoring is kept active to keep the overall situation under control. Following this analogy, in this paper we present how we addressed infra failure issues and real-time monitoring, both during active deployment and during regular operation. We correlate the dependency tree, AWS real-time logs, DB queries, Jenkins and Harness logs, network connectivity, CPU consumption, and many other attributes. We take inputs from multiple monitoring systems and correlate the data to give a better picture of infra health status.

Paper | Presentation

Vittalkumar Mirajkar

Vittalkumar Mirajkar holds a degree in Electronics and Communication Engineering. He has been with the McAfee India team since 2006 and has experience across different domains, covering Tech Support, Research, and Testing. He is an experienced exploratory tester with a wide range of testing experience, from device driver testing to application testing and server testing. He specializes in testing security products, including Anti-Virus, Firewall, Hooking and Injection, and Data Loss Prevention product lines. His areas of interest are performance testing, soak testing, data analysis, and exploratory testing. He has been actively working on bringing newer data-driven test techniques to detect bugs early. He has vast experience in testing both consumer and enterprise security products. Vittal has expertise in testing intercompatibility issues when multiple security products are deployed together.