Towards Understanding Software Reliability (brief notes from the trenches)
Duncan Hall 23 September 2004
Software Reliability:
• Why worry about it?
• Definitions
• While in development
• While in production
• Why does software wear out?
• Cautionary tales
• So what?
Airplane video
A few stories from the wild:
• They Write the Right Stuff (Fishman)
• ScAirbus A340: Tokyo to London
• www.risks.org
• UBS
UBS: Unbundled Bit Stream
Stories coming down the track:
• Product liability
  – e-commerce
  – embedded software
• Government (USA, UK and NZ) requirements
• Business risk management
Most of the time you will be working on systems for business or government customers and users:
A software system exists to support/allow a person/system/organisation to perform a function (an algorithmic process) to generate a deliverable triggered by a business event to achieve a specific purpose/objective/outcome
“Reliability” is one example of a business-driven requirement:
Requirements Dimension A
Requirements Dimension B
Getting requirements sorted is important ...
• Glass’s Law:
  – Requirements deficiencies are the prime source of project failures
• Boehm’s First Law:
  – Errors are most frequent during the requirements and design activities and are more expensive the later they are removed
Sources:
Glass, R. L. (1998) Software Runaways
Boehm, B. W. et al. (1975) Some experience ... IEEE Trans. Software Engineering; 1/1, 125ff
Boehm, B. W. et al. (1984) Prototyping ... IEEE Trans. Software Engineering; 10/3, 290ff
... but hard to get right
“The hardest single part of building a software system is deciding precisely what to build. No other part of the conceptual work is as difficult as establishing the requirements. ... No other part of the work so cripples the resulting system if done wrong. No other part is as difficult to rectify later.”
Source: Fred Brooks (1987) No Silver Bullet
Eliciting business requirements requires dialogue, analysis and iteration:
[Cycle diagram; step labels listed in extraction order]
• Understand the technology context/drivers
• Produce an assessment
• Understand the business context/drivers (start here)
• Check that it makes sense
• Collect the requirements
• Document the requirements
• Prioritise the requirements
• Resolve any conflicts
• Classify the requirements
Requirements have many dimensions:
• Necessary?
• Unambiguous?
• Traceable?
• Testable?
• Measurable?
• Ranked?
• Complete?
• Consistent?
• Modifiable?
• Customer’s view?
• Users’ view?
Source: IEEE SESC Standard 830-1998
• Functional
• Performance
• External Interfaces
• Design Constraints
• Quality Attributes:
  – Reliability
  – Survivability
  – Maintainability
  – Portability
  – Security
  – User friendliness
  – ++
Pick any two:
Functionality/Quality
Cost
Time
What's important … is the customer's experience
• The customer is NOT interested in:
  – Underlying technologies
  – Technological constraints
  – Challenges to delivering reliability
  – Why it might not work some of the time
• The customer IS interested in:
  – Predictability of performance
  – Reliability of service
  – Getting it back working with no losses if it goes wrong
Golden Rule 1:
From the customer's perspective: it's more important to get an approximate number which describes the right measure of performance than an exact number for the wrong measure
Golden Rule 2:
Them's that have the gold … they make the rules!
The end-users’ reliability perceptions are driven by many factors: (1/5)
• Latency – response to client request to server:
  – For greater than PQ% of requests:
    • Expected less than XYZ milliseconds
    • Minimum less than ABC milliseconds
  – 24x7?
  – Maintenance window?
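A latency requirement of the form "for greater than PQ% of requests, less than XYZ milliseconds" can be checked directly against measured response times. The following is a minimal sketch in Python; the function names, the sample measurements and the 85%/90% and 200 ms figures are illustrative assumptions, not values from the slides.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def meets_latency_target(latencies_ms, pct, limit_ms):
    """True if strictly more than pct% of requests completed in under limit_ms."""
    if not latencies_ms:
        raise ValueError("no samples")
    within = sum(1 for value in latencies_ms if value < limit_ms)
    # Compare as products to avoid floating-point division.
    return within * 100 > pct * len(latencies_ms)

# Hypothetical response times in milliseconds for ten requests.
latencies = [120, 135, 150, 180, 95, 110, 400, 130, 125, 140]

print(percentile(latencies, 90))                 # 90th-percentile latency
print(meets_latency_target(latencies, 85, 200))  # more than 85% under 200 ms?
print(meets_latency_target(latencies, 90, 200))  # more than 90% under 200 ms?
```

Stating the target as a percentage of requests, rather than as an average, keeps a single slow outlier (the 400 ms request here) from either hiding behind or dominating the figure.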
The end-users’ reliability perceptions are driven by many factors: (2/5)
• Throughput – workstation to server:
  – For greater than PQ% of requests:
    • Expected less than A seconds for X Megabits of data
    • Minimum B seconds for X Megabits of data
  – 24x7?
  – Maintenance window?
The end-users’ reliability perceptions are driven by many factors: (3/5)
• Availability:
  – AB.XYZ%
  – 24x7?
  – Maintenance windows
  – Less than X% of all sessions lost?
• Recovery from outage:
  – Service restoration within XY minutes for greater than PQ% of incidents
  – Client application restoration time within AB minutes for greater than RS% of incidents
The end-users’ reliability perceptions are driven by many factors: (4/5)
• Security:
  – For greater than AB% of incidents:
    • Disconnect if user session greater than XY hours or
    • No activity for more than PQ minutes
The end-users’ reliability perceptions are driven by many factors: (5/5)
• Moves, Adds, Changes:
  – Time to enable a user to use the service: less than XY hours for greater than PQ% of requests
  – Time to disable a user's access to the service: less than AB hours for greater than RS% of requests
  – Time to modify a user's resources for the service: less than CD hours for greater than TU% of requests
Software reliability engineering:
• Quantitatively planning and guiding:
  – Software development
  – Software testing
  – Software maintenance
• With emphasis on:
  – Availability
  – Reliability
John Musa, SRE tutorial for IEEE
Quantitatively: Lord Kelvin
“When you can measure what you are speaking about, and express it in numbers, you know something about it. But when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind. It may be the beginning of knowledge, but you have scarcely in your thoughts advanced to the state of science, whatever the matter may be.”
What is “measurement”?
The process by which numbers or symbols are assigned to entities in the real world in such a way as to describe them according to clearly defined rules
see for example Gary Ford, “Lecture Notes on Engineering Measurement for Software Engineers”; CMU/SEI-93-EM-9; SEI
The pragmatic engineer’s view:
If you can’t measure it ...
... you can’t control it!
What is “availability”?
The probability at any given time that a system functions satisfactorily in a specified environment
John Musa, SRE tutorial for IEEE
Philip Koopman, Carnegie Mellon
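One common way to put a number on this definition is the steady-state formula availability = MTBF / (MTBF + MTTR). The slides do not give this formula; the sketch below assumes it, and the function name and the 999-hour/1-hour figures are hypothetical.

```python
def steady_state_availability(mtbf_hours, mttr_hours):
    """Long-run fraction of time the system is up.

    mtbf_hours: mean time between failures (average uptime per failure cycle)
    mttr_hours: mean time to repair (average downtime per failure cycle)
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical system: fails on average every 999 hours, takes 1 hour to restore.
availability = steady_state_availability(999.0, 1.0)
print(f"{availability:.3%}")
```

The formula shows why availability can be improved either by failing less often (raising MTBF) or by recovering faster (lowering MTTR).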
What is “reliability”?
The probability that a system functions without failure for a specified time in a specified environment
John Musa, SRE tutorial for IEEE
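If the failure rate is assumed to be constant over the period of interest (a simplifying assumption that the slides do not state), this definition reduces to the exponential model R(t) = exp(-λt). The sketch below uses a hypothetical failure rate and mission time.

```python
import math

def reliability(failure_rate_per_hour, mission_hours):
    """Probability of running for mission_hours without failure,
    assuming a constant failure rate (exponential model)."""
    if failure_rate_per_hour < 0 or mission_hours < 0:
        raise ValueError("rate and time must be non-negative")
    return math.exp(-failure_rate_per_hour * mission_hours)

# Hypothetical: one failure per 1,000 hours on average, 100-hour mission.
r = reliability(0.001, 100.0)
print(round(r, 4))
```

Note that reliability always refers to a stated duration: the same system has a different reliability for a 10-hour and a 1,000-hour mission.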
Reliability is a conditional probability:
[Diagram labels: Reliability, Availability (now), Availability (future)]
Philip Koopman, Carnegie Mellon
Software reliability doesn’t exist in isolation:
People Software
Hardware Process
Philip Koopman, Carnegie Mellon
Many models exist to predict reliability of software in development ...
Jiantao Pan, Carnegie Mellon
... and some could even be used in practice ...
Jiantao Pan, Carnegie Mellon
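As an illustration of what such a model looks like, the sketch below implements the Goel-Okumoto reliability growth model, in which the expected number of failures found by test time t is a(1 - exp(-bt)). This is one well-known example chosen for illustration; it is not necessarily one of the models shown on the slide, and the parameter values are hypothetical rather than fitted to real data.

```python
import math

def expected_failures(t, a, b):
    """Goel-Okumoto mean value function: expected cumulative failures by time t.

    a: expected total number of failures eventually observed
    b: per-fault detection rate
    """
    return a * (1.0 - math.exp(-b * t))

def failure_intensity(t, a, b):
    """Expected failures per unit time at time t (derivative of the mean value function)."""
    return a * b * math.exp(-b * t)

# Hypothetical parameters; in practice a and b are estimated from test failure data.
A_TOTAL = 100.0
B_RATE = 0.05

for week in (0, 10, 20, 40):
    found = expected_failures(week, A_TOTAL, B_RATE)
    remaining = A_TOTAL - found
    rate = failure_intensity(week, A_TOTAL, B_RATE)
    print(f"week {week}: found {found:.1f}, remaining {remaining:.1f}, intensity {rate:.2f}/week")
```

The falling failure intensity is what a release decision would be based on: testing stops when the predicted intensity drops below an agreed target.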
... BUT very little use made of them! WHY??
• “[Reliability] measurement is far from commonplace in the software development world”
• “Despite some heroic efforts from a small number of research centres and individuals ... there continues to be a dearth of published empirical data relating to the quality and reliability of realistic commercial software systems”
• “Measurement in software is still in its infancy. No good quantitative methods have been developed to represent Software Reliability without excessive limitations.”
see: Jiantao Pan, Carnegie Mellon; Fenton and Ohlsson, IEEE Trans. Software Engineering, August 2000
Reliability by itself is hard to measure
• So we measure:
  – Lines Of Code
  – Function Point Metrics
  – Complexity-Oriented Metrics
  – Project Management Metrics
  – Test Metrics
  – Change control requests
  – Fault incidence by time and severity of impact
  – Process metrics
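Of the measures listed above, fault incidence by time and severity is among the simplest to collect. A minimal sketch, assuming a fault log of (date, severity) pairs; the log entries and severity labels are invented for illustration.

```python
from collections import Counter
from datetime import date

# Hypothetical fault log: (date reported, severity of impact).
fault_log = [
    (date(2004, 7, 5), "high"),
    (date(2004, 7, 19), "low"),
    (date(2004, 8, 2), "low"),
    (date(2004, 8, 9), "high"),
    (date(2004, 8, 23), "low"),
]

def fault_incidence(log):
    """Count faults per (year-month, severity) pair."""
    return Counter((day.strftime("%Y-%m"), severity) for day, severity in log)

counts = fault_incidence(fault_log)
for (month, severity), number in sorted(counts.items()):
    print(month, severity, number)
```

A table like this, tracked release by release, gives a trend that can stand in for reliability when reliability itself cannot be measured directly.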
Apart from delivering “trivial solutions”, software development processes have a lot of room for improvement beyond “code and fix”
People Software
Hardware Process
Pure specification-driven development processes are rare in practice
Paul Krause, Bernd Freimut and Witold Suryn, New Directions in Measurement for Software Quality Control, IEEE STEP’02 Conference Proceedings
A “Leap of Faith”: Reliable processes deliver reliable software:
CMM levels of maturity:
• Optimizing (5):
  – Process change management
  – Technology change management
  – Defect prevention
• Managed (4):
  – Software quality management
  – Quantitative process management
• Defined (3):
  – Peer reviews
  – Intergroup coordination
  – Software product engineering
  – Integrated software management
  – Training program
  – Organization process definition
  – Organization process focus
• Repeatable (2):
  – Software configuration management
  – Software quality assurance
  – Software subcontract management
  – Software project tracking and oversight
  – Software project planning
  – Requirements management
• Initial (1)
Why work to build process reliability?
• All those practising as software engineers should desire to evolve out of the chaotic activities and heroic efforts of a Level 1 organisation
• Because no one likes a ‘painful’ work environment
• Good software can be developed by a Level 1 organisation, but often at the expense of the developers
• People get tired of being the hero
• At the repeatable level, Level 2, software engineering processes are under basic management control and there is a management discipline
• Even the most die-hard techie needs time away from work
Climbing up the CMM / CMMI ladder is a major investment:
CMMI uses production engineering measurement and control approaches:
[Diagram: structure of the CMMI model]
• Maturity Levels
  – Process Area 1, Process Area 2, ... Process Area n (KPAs)
    • Specific Goals
      – Specific Practices
    • Generic Goals
      – Common Features: Commitment to Perform; Ability to Perform; Directing Implementation; Verifying Implementation
      – Generic Practices
CMMI-SW KPAs
Maturity Level: Key Process Area (KPA) Name (# of Key Practices)
• 5 Optimising:
  – Organisational Innovation and Deployment (19)
  – Causal Analysis and Resolution (17)
• 4 Quant. Managed:
  – Organisational Process Performance (17)
  – Quantitative Project Management (20)
• 3 Defined:
  – Requirements Development (20)
  – Technical Solution (21)
  – Product Integration (21)
  – Verification/Validation (20)
  – Organisational Process Focus (19)
  – Organisational Process Definition (17)
  – Organisational Training (19)
  – Integrated Project Management (20)
  – Risk Management (19)
• 2 Managed:
  – Requirements Management (15)
  – Project Planning (24)
  – Project Monitoring and Control (20)
  – Process and Product Quality Assurance (14)
  – Configuration Management (17)
  – Supplier Agreement Management (17)
  – Measurement and Analysis (18)
CMMI-SW cross-reference to IEEE standards (for example):
Level 2 CMMI-SW KPA: IEEE Standards
• Requirements Management: IEEE Std 830 – 1998 IEEE Recommended Practice for Software Requirements Specifications
• Project Planning: IEEE Std 1058 – 1998 IEEE Standard for Software Project Management Plans
• Project Monitoring and Control: IEEE Std 1058 – 1998 IEEE Standard for Software Project Management Plans
• Process and Product Quality Assurance: IEEE Std 730 – 2002 IEEE Standard for Software Quality Assurance
• Configuration Management: IEEE Std 828 – 1998 IEEE Standard for Software Configuration Management Plans
• Supplier Agreement Management: IEEE Std 1062 – 1998 IEEE Recommended Practice for Software Acquisition
• Measurement and Analysis: IEEE Std 1045 – 2002 IEEE Standard for Software Productivity Metrics
Philip Koopman, Carnegie Mellon
Error message haiku

A file that big?
It might be very useful.
But now it is gone.

Having been erased,
The document you’re seeking
Must now be retyped.

Yesterday it worked.
Today it is not working.
Windows is like that.
Glass’s 1992 observation:
• Hardware deteriorates due to lack of maintenance
• But software deteriorates because of maintenance!
Glass, Robert L., Building Quality Software, Prentice Hall, Englewood Cliffs, NJ, 1992.
Five Types of Software Maintenance
• Corrective Maintenance
  – Identify and remove defects
  – Correct actual errors
• Perfective Maintenance
  – Improve performance, dependability, maintainability
  – Update documentation
• Adaptive Maintenance
  – Adapt to a new/upgraded environment (e.g., hardware, operating system, middleware)
  – Incorporate new capability
• Preventative Maintenance
  – Identify and detect latent faults
  – Systems with safety concerns
• Emergency Maintenance
  – Unscheduled corrective maintenance (Risks due to reduced testing)
Mechanisms to maintain reliability in production include (1/2):
• Rigorous configuration management
• Change Control Board reviews
• Use of SCM tools
  – They can force compliance to processes
• Testing, testing, testing:
  – Component, Staging, Integration
  – Regression
  – UAT
• Infrastructure capacity management
Mechanisms to maintain reliability in production include (2/2):
• Documentation of all changes
  – Business process and software systems
• Application Reference Documents:
  – What it does
  – What it interfaces to, parameters, interface standards
  – What data it masters, what data is sourced elsewhere
  – Who owns it, who understands it
• Helpdesk and other support mechanisms
• Training
Philip Koopman, Carnegie Mellon
Software reliability doesn’t exist in isolation:
People Software
Hardware Process
Pick any two:
Reliability
Cost
Time
Software reliability theory appears to work accurately in telecommunications and aerospace ...
WHY?
• Governments regulate product quality:
  – Telephones must work
  – Airplanes must fly
• In other disciplines, quality – and specifically reliability – has historically been an add-on, of lesser market value than feature-richness or short release cycles
• Telecommunications and aerospace software works in embedded environments:
  – Tied to the hardware on which it resides
  – Apply hardware reliability directly to embedded software
• Elsewhere there is scepticism and doubt among potential practitioners
Whittaker and Voas; “Toward a More Reliable Theory of Software Reliability”; IEEE Computer, December 2000, pp 36-42
[1] UAT of telephone traffic statistics gathering system (1981)
• International traffic
• Large investment $$ at stake
• Add-on to billing system
• 9,000+ hours development
• “Cunning” database schema
• What is “reliability”?
• Pareto approach
[2] How much is the client willing to pay? (1/4)
Design Reliability: 99% (two nines)
Maximum Total (Planned and Unplanned) Annual Downtime: 3.5 days
Typical Hardware Configuration and Performance Characteristics: An active / passive cluster with RAID protected database. The database operation may be affected and will need to be synchronised to ensure integrity. If the data recovery includes the restore transaction ‘redo’ logs, then there could be a sizeable delay while the database is rebuilt. The end users will need to be prompted when the secondary system is ready.
Application Software and Performance Characteristics: In the event of a failure, the secondary system database needs to be restored to an effective operational status and the respective server and client application restarted. Data associated with one or more transactions may be lost. The end user will be prompted to restart the client application including login and to re-input one or more transactions.
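The downtime figure follows from the availability percentage by simple arithmetic. The sketch below assumes a 365-day year measured around the clock (8,760 hours, with no excluded maintenance windows); the function name is illustrative. On that basis 99% allows 87.6 hours (3.65 days), 99.9% allows 8.76 hours and 99.99% allows about 53 minutes, so the values quoted on these slides (3.5 days, 8 hours, 1 hour) appear to be rounded design targets.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored (assumption)

def annual_downtime_hours(availability_pct):
    """Maximum downtime per year, in hours, implied by an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return (100 - availability_pct) / 100 * HOURS_PER_YEAR

for target in (99.0, 99.9, 99.99):
    hours = annual_downtime_hours(target)
    print(f"{target}% -> {hours:.2f} hours ({hours * 60:.0f} minutes) per year")
```

Each extra "nine" cuts the downtime budget by a factor of ten, which is why the hardware and application requirements escalate so sharply from one slide to the next.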
[2] How much is the client willing to pay? (2/4)
Design Reliability: 99.9% (three nines)
Maximum Total (Planned and Unplanned) Annual Downtime: 8 hours
Typical Hardware Configuration and Performance Characteristics: Requires an active / active cluster with RAID protected database and automated server application failover. The database operation should not be affected. The cluster failover may need to physically restart the server application on the secondary host. The end users may need to be prompted to restart their client applications.
Application Software and Performance Characteristics: In the event of a failure, the secondary system automatically cuts in with only the data associated with the last incomplete transactions (disk and memory) being lost. The server and possibly the client application will need to be restarted. The end user may be requested to restart the client application, login and then reinput the last transaction.
[2] How much is the client willing to pay? (3/4)
Design Reliability: 99.99% (four nines)
Maximum Total (Planned and Unplanned) Annual Downtime: 1 hour
Typical Hardware Configuration and Performance Characteristics: Requires either a fault tolerant configuration or an active / active cluster with RAID 1/0 database and automated application failover. The client application needs to be automatically pointed in real time to the secondary server operation. It may be necessary to alert the end users there has been a problem and that it may be necessary to check their last transaction has been successfully completed.
Application Software and Performance Characteristics: In the event of a failure, the secondary system automatically cuts in. Only the data associated with the last incomplete transactions (disk and memory) is lost. The end user experiences a slight delay in service (less than 5 minutes) and is then prompted to re-input the last transaction.
[2] How much is the client willing to pay? (4/4)
Design Reliability: 99.999% (five nines)
Typical Hardware Configuration and Performance Characteristics: Normally requires the presence of a real time secondary data centre with the processing shared between two systems located in each of the respective centres.
Maximum Total (Planned and Unplanned) Annual Downtime: