Reliability Surrogates for a Corporate Balanced Scorecard
SRE98 Workshop, Ottawa, Canada, July 14-15, 1998
James Cusick, Senior Technical Staff Member, AT&T
30 Knightsbridge Road, Piscataway, NJ 08854
[email protected]
Today’s Talk
✦ Introduction & Context
✦ Balanced Scorecard Background
✦ Components of a Scorecard
✦ A Scorecard Process
✦ Reliability Surrogates Considered
✦ Reliability’s Optimal Role Considered
AT&T 2
So Many Metrics, So Little Time
✦ Large-scale business operations require software-intensive systems. Which metrics are most useful to manage these systems?
✦ What role should reliability measurement play in IT management?
Reliability Matters: The AT&T Frame Relay Outage
“AT&T has stood for reliability and still does. It is at the core of our service and our customer relationship, and it is a top priority for every AT&T person and for me personally.”
Mike Armstrong, AT&T CEO. Press Briefing, Basking Ridge, NJ, April 14, 1998
Introducing the BBS Model
✦ Balanced Business Scorecard developed by Kaplan & Norton

[Diagram: four perspectives arranged around a central “Vision and Strategy” box, each with its own Objectives, Measures, Targets, and Initiatives:]
- Financial: “Are we meeting shareholder expectations for value delivered?”
- Customer: “How well do we deliver software solutions to the business?”
- Process: “Do we meet business partner expectations?”
- Learning and Growth: “How will we sustain our ability to change and meet future needs?”

Kaplan, R. S., and Norton, D. P., “Putting the Balanced Scorecard to Work”, Harvard Business Review, 1993.
BM-CIO Scorecard Metrics*

Financial:
- IT Spending vs. Revenue
- Development vs. Maintenance
- Labor Cost per Function Point

Customer:
- Defects per Function Point
- System Availability
- Response Time

Development Process:
- On-Time Delivery
- Cycle Time
- Effort Variance
- SEI Level Distribution

Learning & Growth:
- Attrition Rate
- Training Days per Employee

* Scope of reporting: Business Markets CIO “critical” systems with active development. These are the 1997 metrics.
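Several of the scorecard metrics above are simple ratios over data the directorates already report. A minimal sketch of how a few of them could be computed, using entirely hypothetical input figures (not actual AT&T data):

```python
# Sketch of computing a few BM-CIO-style scorecard metrics.
# All input figures below are hypothetical, for illustration only.

def scorecard_metrics(it_spend, revenue, labor_cost, function_points,
                      defects, uptime_hours, total_hours):
    """Return a small dictionary of scorecard ratios."""
    return {
        # Financial: IT spend as a percentage of revenue
        "it_spend_vs_revenue_pct": 100.0 * it_spend / revenue,
        # Financial: labor cost per delivered function point
        "labor_cost_per_fp": labor_cost / function_points,
        # Customer: field defects per function point
        "defects_per_fp": defects / function_points,
        # Customer: system availability over the reporting period
        "availability_pct": 100.0 * uptime_hours / total_hours,
    }

m = scorecard_metrics(it_spend=5_000_000, revenue=250_000_000,
                      labor_cost=1_200_000, function_points=2_400,
                      defects=96, uptime_hours=8_700, total_hours=8_760)
```

With these made-up inputs, IT spend works out to 2% of revenue and labor cost to $500 per function point; the real scorecard would draw the inputs from finance, development, and operations reporting.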
Organization

[Organization chart: the CIO Office oversees Development Directorates A through n and an Architecture Directorate. Each development directorate has Metrics Coordinators covering its systems (System A, System B, …, System n); the Metrics Office sits within the Architecture Directorate.]
BM-CIO* Scorecard Process

[Process flow diagram:]
- Metrics Definition and Benchmark Research drive Collection Requests.
- Data Collection: templates are collected from the Development Directorates (via Metrics Coordinators), Operations/Release Mgmt, HR/Training, and Finance.
- Data Analysis: data merge & analysis produces Drafts 1-n.
- Data Reporting: Internal Review and VP Review, then Draft & Publish Scorecard; the Published Scorecard goes to the Scorecard Archives and to the Biz Staff Meeting and CIO Staff Meeting, which feed back goals and tracking.

* BM-CIO = Business Markets CIO
Measuring Reliability in DPMs

From the AT&T Web Site (5/10/98), http://www.att.com/network/standrd.html:

“At AT&T's Network and Computing Services organization, one of the most important gauges of network reliability is Defects Per Million. This measurement is a statistically valid record of how many calls per million did not go through the first time because of a network procedural, hardware or software failure.”

“During 1997, AT&T's Defects-Per-Million performance was 173, which means that of every one million calls placed on the AT&T network, only 173 did not go through the first time due to a network failure. That equals a network reliability rate of 99.98 percent for 1997.”

DPMs are now being used in ever-widening metrics applications within AT&T.
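The arithmetic behind the quoted figures is straightforward: DPM is failures per million attempts, and the implied success rate follows directly. A minimal sketch:

```python
# DPM (Defects Per Million) arithmetic, as in the quoted 1997 figures.

def dpm(failures, attempts):
    """Failed attempts per million attempts."""
    return 1_000_000 * failures / attempts

def reliability_pct(dpm_value):
    """Success rate implied by a DPM figure, as a percentage."""
    return 100.0 * (1 - dpm_value / 1_000_000)

# The 1997 network figure quoted above: 173 failed calls per million,
# which implies a success rate of about 99.98 percent.
rate = reliability_pct(dpm(173, 1_000_000))
```

The same two functions apply unchanged to any "operational unit" (transactions, batch runs, bills produced), which is what makes DPM attractive as a common surrogate across systems.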
Reliability Surrogates

[Chart: 1997 monthly availability, January through December, for selected systems, expressed in DPMs.]

Other measures in DPMs:
- System Response Time
- Batch Runs (e.g., bill production)
- Customer Satisfaction
Using DPMs within SRE

Process Modifications: Require up-front specification of reliability, expressed in DPMs, that balances the cost of development, the cost of a failure, and other factors to set an appropriate reliability goal.

Architecture and Architecture Reviews: The architecture phase must consider the DPM rating the system must attain.

Software Reliability Engineering Techniques: Operational profiles, reliability modeling, and profile testing with consequence factors are used to provide estimates of production availability and DPM rates.

Operational Feedback: Verify application defect rates in production; make appropriate adjustments to the process.
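An up-front DPM goal can be translated into the failure-intensity terms SRE planning works in. A minimal sketch, with an assumed (hypothetical) transaction rate and DPM target:

```python
# Sketch: converting an up-front DPM goal into a failure-intensity
# objective for SRE planning. Both input values are assumptions.

def failure_intensity_objective(dpm_goal, transactions_per_hour):
    """Allowed failures per hour implied by a DPM goal at a given load."""
    return dpm_goal / 1_000_000 * transactions_per_hour

# e.g., a 200-DPM goal on a system handling 50,000 transactions/hour
lam = failure_intensity_objective(200, 50_000)  # 10 failures/hour allowed
```

Reliability modeling and profile testing can then be run against this failure-intensity objective, and operational feedback checks whether production defect rates actually stay within it.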
Also required: Training, Rewards, Executive Sponsorship.

Proposal Status: Under Consideration
Which Reliability to Use?
✦ AT&T’s Operational Model: Defects per Million given an “operational” unit
✦ HP’s FURPS model: Traditional reliability figures
✦ SRE model(s)?
Issues with Reliability & DPMs
✦ Is rolled-up reliability across systems meaningful? In what instances?
✦ Is reliability or DPMs more understandable to management or non-technical audiences?
✦ What are the limitations of using DPMs?
✦ If failure rate per million is a reliability figure, can it be a predictor as well?
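One concrete reading of the roll-up question: a combined DPM across systems is only meaningful if each system's DPM is weighted by its transaction volume, since a raw average lets a tiny system distort the picture. A minimal sketch with hypothetical figures:

```python
# One way to roll DPMs up across systems: weight each system's
# DPM by its transaction volume. All figures are hypothetical.

def rolled_up_dpm(systems):
    """systems: list of (dpm, transactions) pairs for the period."""
    total = sum(tx for _, tx in systems)
    return sum(d * tx for d, tx in systems) / total

# A high-volume healthy system plus a low-volume troubled one:
combined = rolled_up_dpm([(150, 2_000_000), (600, 500_000)])
# volume-weighted result: 240 DPM (an unweighted average would say 375)
```

Whether such a roll-up is meaningful still depends on whether the underlying "operational units" are comparable across systems, which is exactly the question the slide raises.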
Wrap-Up
✦ Thanks for your attention …
✦ Thanks to the Scorecard team ...

Any Questions?