The Quality Death Spiral
This is the outline for a talk that Jeff Bonwick gave on Solaris 2.5 to the Systems Group in September 1994. (Jeff was the gatekeeper for Solaris 2.5.) It would be of only historical interest were it not for the section on the Quality Death Spiral -- a timeless phenomenon that remains our omnipresent fear:
goals
    - FCS quality all the time
    - minimal process overhead
    - rapid deployment and bug discovery

single gate model
    - one-stop shopping for source, archives, tools, etc.
    - single golden source eliminates gate merges
    - faster, wider exposure (via bfu) for all changes
        - over 1000 bfus since the gate opened
        - on495 running on nearly 200 different machines
        - currently averaging over 20 bfus per day
        - on495-clone to address bandwidth problems
    - gatekeeping now consumes two engineers instead of six

if it's broken, rip it out
    - FCS quality all the time
    - put it back today, 20 of your (current) friends will be running it tomorrow
    - gate breakage grinds other development to a halt
    - the product, not any one project, is what matters
    - mistakes will happen; negligence cannot

is bonwick gonna tear me a new address space gap?
    - only if you really beg:
        - integrate a source file that doesn't compile
        - a kernel that doesn't boot or panics hourly
        - code that *could not possibly* have been tested *at all*
    - gatekeeper's job: keep the golden source golden
        - fix dozens of minor mistakes weekly
        - rip out major breakage

FCS quality all the time -- why is this so important?
    - only way to avoid the quality death spiral:
        - people hear the gate is broken
        - decide not to risk a bringover
        - fewer people run the latest stuff
        - less real-life testing
        - new bugs not found
        - quality drops further
    - morale tracks quality
    - downward spiral hard to break
    - recovery time can be very long

how are we doing?
    - after a rough start, things are going well
    - eliminated gatekeeper putback approval
    - still no netinstall, but lots of bfu
    - high volume of change, very little breakage
        - several major projects (KBI, C-O-C, NFS V3, NFS/TCP, ...)
        - over 300 bug fixes
    - on495 already deployed on jurassic, updated weekly
    - lots more good stuff on the way
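The feedback loop in the quality death spiral bullets can be illustrated with a toy model. This is only a sketch and is not part of the original talk: the function name `simulate_gate`, the idea of scoring gate quality between 0 and 1, and every coefficient are assumptions made up for illustration. The one thing it tries to capture is the chain from the outline: lower quality means fewer people bring over, fewer people means fewer bugs found, and unfound bugs lower quality further.

```python
def simulate_gate(initial_quality, weeks=10, new_bugs_per_week=10.0):
    """Toy feedback loop; all numbers are hypothetical, not measured.

    quality is a score in [0, 1]. The fraction of people willing to
    bring over and run the latest bits is assumed equal to quality.
    """
    quality = initial_quality
    history = [quality]
    for _ in range(weeks):
        adoption = quality                      # trust in the gate drives usage
        found = new_bugs_per_week * adoption    # real-life testing finds bugs
        missed = new_bugs_per_week - found      # the rest go unnoticed
        # Assumed effect sizes: a found (and fixed) bug helps a little,
        # a missed bug hurts twice as much.
        quality = quality + 0.01 * found - 0.02 * missed
        quality = min(1.0, max(0.0, quality))   # keep the score in [0, 1]
        history.append(quality)
    return history


if __name__ == "__main__":
    for start in (0.9, 0.5):
        trace = simulate_gate(start)
        print(start, [round(q, 3) for q in trace])
```

With these made-up coefficients there is a tipping point: a gate that starts above it climbs toward full quality, while one that starts below it slides to zero and stays there, which is the "downward spiral hard to break" behavior in miniature.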
on 2009/10/26 12:08