Wayne Roseberry, Microsoft
Reliability is one of the most difficult areas in which to establish confidence before shipping a product. There is always that nagging question: “Will it be reliable enough for real-world load and demand? Is what we built this time better than what we had before?”
There are two parts to the solution:
- Choose the right set of metrics and measure against them consistently in every deployment. By carefully defining your availability and reliability metrics, you can better assess whether your product is headed toward success or miserable failure.
- Build a toolset and methodology that joins the metrics to investigation and data collection.
This paper will describe how the Microsoft SharePoint 2010 team used reliability and monitoring tools in lab and real-world environments to substantially improve service availability and performance. The presentation will discuss our key definitions of availability, failure, and performance targets, and show how we used them to establish confidence in reliability before the product shipped.
Wayne Roseberry, 2011 Technical Paper, Abstract, Paper






