Wayne Roseberry, Microsoft
Server products operate under complex conditions with unpredictable workload characteristics, concurrency rates, infrastructure conditions, operational demands and resource utilization. Driving for high quality in such cases is extremely difficult, particularly when production and testing environments lack consistent, symmetric monitoring methods and models to aid in diagnosing issues, predicting failure rates after ship, discovering product flaws and assessing readiness to ship.
One way to solve part of this problem is to establish monitoring tools and methods that map an end-user-based view of software quality to a product-engineering-based view of system flaws and failures. By basing the metrics on experiences that map directly to customer impact, you can focus the product team's time on the issues that truly matter to customers. By building a monitoring solution that can be applied in both production systems and test environments, you enable a quality feedback loop that facilitates improved test engineering, product design, investigational methodology and ship schedule management.
This presentation describes how the SharePoint 2010 team deployed such a solution to monitor both in-house production environments and performance and stress testing lab environments. It describes the components of the solution, which span in-product instrumentation, custom solutions built on existing product features, and team methods and processes for monitoring and triaging issues coming from production systems. It also demonstrates the gains we achieved: higher bug discovery yield, improved fix rates and more successful pre-release production runs.
2010 Poster Paper, Wayne Roseberry, Abstract