Keith Stobie, Microsoft
There are many benefits to be realized by Testing Services in Production when the risks are properly mitigated. Testing in production finds problems at a scale that most groups can’t afford to duplicate with a test environment. Using production systems for testing is critical to business success of an effective software service. This paper describes and demonstrates several different approaches to using production systems for testing including: when each approach is appropriate, what prerequisites are needed, and how each approach would be used.
Monitoring of services, Controlled Experiments, and production data are well-known forms of testing. You can also use tracers to follow service flow, do destructive testing (killing services, networks, etc.), and even do load, capacity, performance, and stress testing in production. In short, almost all kinds of testing can be done in production, but how do you mitigate the risk? Testing in production allows customers to benefit (and occasionally to suffer) from the most current advances. Throttling requests or work, exposure control, incremental rollout, and especially superb monitoring are all needed to control production testing risk.
Keith Stobie, 2011 Technical Paper, Abstract, Paper, Slides