Towards Understanding Software Reliability

Towards Understanding Software Reliability (brief notes from the trenches)

Duncan Hall 23 September 2004

Software Reliability:

•  Why worry about it?
•  Definitions
•  While in development
•  While in production
•  Why does software wear out?
•  Cautionary tales
•  So what?

Airplane video

A few stories from the wild:

•  They Write the Right Stuff (Fishman)
•  ScAirbus A340: Tokyo to London
•  www.risks.org
•  UBS

UBS: Unbundled Bit Stream

Stories coming down the track:

•  Product liability
   –  e-commerce
   –  embedded software
•  Government (USA, UK and NZ) requirements
•  Business risk management

Most of the time you will be working on systems for business or government customers and users:

A software system exists to support/allow a person/system/organisation to perform a function (an algorithmic process) to generate a deliverable triggered by a business event to achieve a specific purpose/objective/ outcome

“Reliability” is one example of a business-driven requirement:

Requirements Dimension A

Requirements Dimension B

Getting requirements sorted is important ...

•  Glass’s Law:
   –  Requirements deficiencies are the prime source of project failures
•  Boehm’s First Law:
   –  Errors are most frequent during the requirements and design activities and are more expensive the later they are removed

Sources:
•  Glass, R. L. (1998) Software Runaways
•  Boehm, B. W. et alia (1975) Some experience ... IEEE Trans. Software Engineering; 1/1, 125ff
•  Boehm, B. W. et alia (1984) Prototyping ... IEEE Trans. Software Engineering; 10/3, 290ff

... but hard to get right

“The hardest single part of building a software system is deciding precisely what to build. No other part of the conceptual work is as difficult as establishing the requirements. ... No other part of the work so cripples the resulting system if done wrong. No other part is as difficult to rectify later.”

Source: Fred Brooks (1987) No Silver Bullet

Eliciting business requirements requires dialogue, analysis and iteration:

•  Understand the business context/drivers (start here)
•  Understand the technology context/drivers
•  Collect the requirements
•  Document the requirements
•  Classify the requirements
•  Resolve any conflicts
•  Prioritise the requirements
•  Check that it makes sense
•  Produce an assessment

Requirements have many dimensions:

•  Necessary?
•  Unambiguous?
•  Traceable?
•  Testable?
•  Measurable?
•  Ranked?
•  Complete?
•  Consistent?
•  Modifiable?
•  Customer’s view?
•  Users’ view?

Source: IEEE SESC Standard 830-1998

•  Functional
•  Performance
•  External Interfaces
•  Design Constraints
•  Quality Attributes:
   –  Reliability
   –  Survivability
   –  Maintainability
   –  Portability
   –  Security
   –  User friendliness
   –  ++

Pick any two:

Functionality/Quality

Cost

Time

What's important ... is the customer's experience

•  The customer is NOT interested in:
   –  Underlying technologies
   –  Technological constraints
   –  Challenges to delivering reliability
   –  Why it might not work some of the time
•  The customer IS interested in:
   –  Predictability of performance
   –  Reliability of service
   –  Getting it back working with no losses if it goes wrong

Golden Rule 1:

From the customer's perspective: it's more important to get an approximate number which describes the right measure of performance than an exact number for the wrong measure

Golden Rule 2:

Them's that have the gold ... they make the rules!

The end-users’ reliability perceptions are driven by many factors: (1/5)

•  Latency – response to client request to server:
   –  For greater than PQ% of requests:
      •  Expected less than XYZ milliseconds
      •  Minimum less than ABC milliseconds
   –  24x7?
   –  Maintenance window?
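A requirement of the form "for greater than PQ% of requests, less than XYZ milliseconds" can be checked directly against measured response times. The sketch below is only an illustration: the function name `meets_latency_target`, the sample data, the 250 ms threshold and the 80% fraction are all hypothetical stand-ins for the placeholders on the slide.

```python
def meets_latency_target(latencies_ms, threshold_ms, required_fraction):
    """Return True if at least `required_fraction` of the requests
    completed in under `threshold_ms` milliseconds."""
    if not latencies_ms:
        raise ValueError("no latency samples supplied")
    within = sum(1 for t in latencies_ms if t < threshold_ms)
    return within / len(latencies_ms) >= required_fraction


# Hypothetical measurements (milliseconds) for ten requests.
samples = [120, 95, 210, 180, 400, 130, 110, 90, 150, 1000]

# 8 of the 10 samples are under 250 ms, so an 80% target is met
# while a 90% target is not.
print(meets_latency_target(samples, threshold_ms=250, required_fraction=0.80))
print(meets_latency_target(samples, threshold_ms=250, required_fraction=0.90))
```

The same check applies to the throughput and moves/adds/changes targets, since each is phrased as a threshold that a stated percentage of requests must satisfy.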

The end-users’ reliability perceptions are driven by many factors: (2/5)

•  Throughput – workstation to server:
   –  For greater than PQ% of requests:
      •  Expected less than A seconds for X Megabits of data
      •  Minimum B seconds for X Megabits of data
   –  24x7?
   –  Maintenance window?

The end-users’ reliability perceptions are driven by many factors: (3/5)

•  Availability:
   –  AB.XYZ%
   –  24x7?
   –  Maintenance windows
   –  Less than X% of all sessions lost?
•  Recovery from outage:
   –  Service restoration within XY minutes for greater than PQ% of incidents
   –  Client application restoration time within AB minutes for greater than RS% of incidents
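Whether maintenance windows count against the availability figure changes the number the customer sees, so it helps to compute both. The following is a minimal sketch under assumed data: the `Outage` record, the 30-day period and the outage log are hypothetical, and the 30-minute restoration limit stands in for the slide's "XY minutes".

```python
from dataclasses import dataclass


@dataclass
class Outage:
    minutes: float  # how long the service was down
    planned: bool   # True if it fell inside an agreed maintenance window


def availability(outages, period_minutes, count_planned=True):
    """Fraction of the period the service was up.
    With count_planned=False, maintenance windows are not counted as downtime."""
    down = sum(o.minutes for o in outages if count_planned or not o.planned)
    return 1.0 - down / period_minutes


def fraction_restored_within(outages, limit_minutes):
    """Fraction of unplanned incidents restored within `limit_minutes`."""
    unplanned = [o for o in outages if not o.planned]
    if not unplanned:
        return 1.0
    return sum(o.minutes <= limit_minutes for o in unplanned) / len(unplanned)


month = 30 * 24 * 60  # minutes in a 30-day period
log = [Outage(120, True), Outage(12, False), Outage(45, False), Outage(8, False)]

print(f"availability, all downtime:       {availability(log, month):.4%}")
print(f"availability, unplanned only:     {availability(log, month, False):.4%}")
print(f"incidents restored within 30 min: {fraction_restored_within(log, 30):.0%}")
```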

The end-users’ reliability perceptions are driven by many factors: (4/5)

•  Security:
   –  For greater than AB% of incidents:
      •  Disconnect if user session greater than XY hours or
      •  No activity for more than PQ minutes

The end-users’ reliability perceptions are driven by many factors: (5/5)

•  Moves, Adds, Changes:
   –  Time to enable a user to use the service: less than XY hours for greater than PQ% of requests
   –  Time to disable a user's access to the service: less than AB hours for greater than RS% of requests
   –  Time to modify a user's resources for the service: less than CD hours for greater than TU% of requests


Software reliability engineering:

•  Quantitatively planning and guiding: –  Software development –  Software testing –  Software maintenance

•  With emphasis on: –  Availability –  Reliability

John Musa, SRE tutorial for IEEE

Quantitatively: Lord Kelvin

“When you can measure what you are speaking about, and express it in numbers, you know something about it. But when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind. It may be the beginning of knowledge, but you have scarcely in your thoughts advanced to the state of science, whatever the matter may be.”

What is “measurement”?

The process by which numbers or symbols are assigned to entities in the real world in such a way as to describe them according to clearly defined rules

see for example Gary Ford, “Lecture Notes on Engineering Measurement for Software Engineers”; CMU/SEI-93-EM-9; SEI

The pragmatic engineer’s view:

If you can’t measure it ...

... you can’t control it!

What is “availability”?

probability at any given time that a system functions satisfactorily in a specified environment

John Musa, SRE tutorial for IEEE

Philip Koopman, Carnegie Mellon

What is “reliability”?

probability that a system functions without failure for a specified time in a specified environment

John Musa, SRE tutorial for IEEE
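The two definitions answer different questions, which a small numerical sketch can make concrete. The formulas below are standard textbook simplifications and are not taken from the slides: steady-state availability as MTBF / (MTBF + MTTR), and reliability under an assumed constant failure rate of 1/MTBF. The function names and the figures of 1000 hours MTBF and 1 hour MTTR are hypothetical.

```python
import math


def steady_state_availability(mtbf_hours, mttr_hours):
    """Long-run fraction of time the system is up.
    Assumes alternating up/down periods with the given mean lengths."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def reliability(mission_hours, mtbf_hours):
    """Probability of running `mission_hours` without any failure.
    Assumes a constant failure rate of 1/MTBF (exponential model)."""
    return math.exp(-mission_hours / mtbf_hours)


# Hypothetical system: fails on average every 1000 hours, repaired in 1 hour.
a = steady_state_availability(mtbf_hours=1000, mttr_hours=1)
r = reliability(mission_hours=24, mtbf_hours=1000)

print(f"availability at a random instant:     {a:.4f}")
print(f"reliability over a 24-hour operation: {r:.4f}")
```

Under these assumptions the system is almost always available at any given instant, yet the chance of getting through a full day with no failure at all is noticeably lower. Repair time affects availability but has no effect on reliability as defined here.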

Reliability is a conditional probability:

Reliability

Availability (now)

Availability (future)

Philip Koopman, Carnegie Mellon
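The diagram for this slide did not survive extraction. One common way to write the distinction, offered as an assumption rather than as Koopman's exact formulation, is shown below. The symbols $A(t)$ and $R(t)$ are introduced here for illustration only.

```latex
% Availability: an unconditional probability at a single instant.
A(t) = P(\text{system is operating at time } t)

% Reliability: conditional on the system operating now (time 0),
% the probability that it keeps operating through time t.
R(t) = P(\text{no failure in } (0, t] \mid \text{system is operating at time } 0)
```

Read this way, availability "now" supplies the condition, and reliability describes what happens between now and a future time.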


Software reliability doesn’t exist in isolation:

People Software

Hardware Process



Many models exist to predict reliability of software in development ...

Jiantao Pan, Carnegie Mellon
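The slide's catalogue of models did not survive extraction. As one illustration of what such a model looks like, the sketch below uses the Goel-Okumoto reliability growth model, a well-known non-homogeneous Poisson process model. It is not necessarily one of the models shown on the slide, and the parameter values (`a = 100` total expected failures, `b = 0.05` per hour detection rate) are hypothetical; in practice they would be fitted to observed failure data.

```python
import math


def expected_failures(t, a, b):
    """Goel-Okumoto mean value function: expected cumulative failures
    observed by test time t, where a is the total expected number of
    failures and b is the per-fault detection rate."""
    return a * (1.0 - math.exp(-b * t))


def failure_intensity(t, a, b):
    """Expected failures per unit time at test time t (derivative of the above)."""
    return a * b * math.exp(-b * t)


def reliability(x, t, a, b):
    """Probability of no failure in the next x time units, given testing
    has run to time t (standard result for a Poisson process)."""
    return math.exp(-(expected_failures(t + x, a, b) - expected_failures(t, a, b)))


a, b = 100.0, 0.05  # hypothetical fitted parameters
for t in (0, 20, 40, 80):
    print(
        f"after {t:3d} h of test: "
        f"found {expected_failures(t, a, b):6.1f}, "
        f"intensity {failure_intensity(t, a, b):5.2f}/h, "
        f"P(no failure in next hour) = {reliability(1, t, a, b):.3f}"
    )
```

The predicted failure intensity falls as testing proceeds, which is the behaviour such models use to estimate when a reliability target will be reached.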

... and some could even be used in practice ...

Jiantao Pan, Carnegie Mellon

... BUT very little use made of them! WHY??

•  “[Reliability] measurement is far from commonplace in the software development world”
•  “Despite some heroic efforts from a small number of research centres and individuals ... there continues to be a dearth of published empirical data relating to the quality and reliability of realistic commercial software systems”
•  “Measurement in software is still in its infancy. No good quantitative methods have been developed to represent Software Reliability without excessive limitations.”

See: Jiantao Pan, Carnegie Mellon; Fenton and Ohlsson, IEEE Trans. Software Engineering, August 2000

Reliability by itself is hard to measure

•  So we measure:
   –  Lines Of Code
   –  Function Point Metrics
   –  Complexity-Oriented Metrics
   –  Project Management Metrics
   –  Test Metrics
   –  Change control requests
   –  Fault incidence by time and severity of impact
   –  Process metrics
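Of the proxies listed above, "fault incidence by time and severity of impact" is among the simplest to collect from a fault log. The sketch below tallies it by calendar month. The record layout, the severity labels and the sample entries are all hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical fault log: (date reported, severity of impact).
faults = [
    (date(2004, 7, 3), "critical"),
    (date(2004, 7, 19), "minor"),
    (date(2004, 7, 25), "minor"),
    (date(2004, 8, 2), "major"),
    (date(2004, 8, 30), "minor"),
]


def incidence_by_month_and_severity(records):
    """Count faults per (year-month, severity) pair."""
    counts = Counter()
    for day, severity in records:
        counts[(day.strftime("%Y-%m"), severity)] += 1
    return counts


for (month, severity), n in sorted(incidence_by_month_and_severity(faults).items()):
    print(f"{month}  {severity:<8}  {n}")
```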

Apart from delivering “trivial solutions”, software development processes have a lot of room for improvement beyond “code and fix”


Pure specification-driven development processes are rare in practice

Paul Krause, Bernd Freimut and Witold Suryn, New Directions in Measurement for Software Quality Control, IEEE STEP’02 Conference Proceedings

A “Leap of Faith”: Reliable processes deliver reliable software

CMM levels of maturity:

•  Optimizing (5)
   –  Process change management
   –  Technology change management
   –  Defect prevention
•  Managed (4)
   –  Software quality management
   –  Quantitative process management
•  Defined (3)
   –  Peer reviews
   –  Intergroup coordination
   –  Software product engineering
   –  Integrated software management
   –  Training program
   –  Organization process definition
   –  Organization process focus
•  Repeatable (2)
   –  Software configuration management
   –  Software quality assurance
   –  Software subcontract management
   –  Software project tracking and oversight
   –  Software project planning
   –  Requirements management
•  Initial (1)

Why work to build process reliability?

•  All those practising as software engineers should desire to evolve out of the chaotic activities and heroic efforts of a Level 1 organisation
   –  Because no one likes a ‘painful’ work environment
•  Good software can be developed by a Level 1 organisation, but often at the expense of the developers
   –  People get tired of being the hero
•  At the repeatable level, Level 2, software engineering processes are under basic management control and there is a management discipline
   –  Even the most die-hard techie needs time away from work

Climbing up the CMM / CMMI ladder is a major investment:

CMMI uses production engineering measurement and control approaches:

[Diagram: Maturity Levels contain Process Areas 1 to n (KPAs). Each Process Area has Specific Goals, met by Specific Practices, and Generic Goals, organised by Common Features (Commitment to Perform, Ability to Perform, Directing Implementation, Verifying Implementation) and met by Generic Practices.]

CMMI-SW KPAs

Maturity Level     Key Process Area (KPA) Name                # of Key Practices
5 Optimising       Organisational Innovation and Deployment   19
                   Causal Analysis and Resolution             17
4 Quant. Managed   Organisational Process Performance         17
                   Quantitative Project Management            20
3 Defined          Requirements Development                   20
                   Technical Solution                         21
                   Product Integration                        21
                   Verification/Validation                    20
                   Organisational Process Focus               19
                   Organisational Process Definition          17
                   Organisational Training                    19
                   Integrated Project Management              20
                   Risk Management                            19
2 Managed          Requirements Management                    15
                   Project Planning                           24
                   Project Monitoring and Control             20
                   Process and Product Quality Assurance      14
                   Configuration Management                   17
                   Supplier Agreement Management              17
                   Measurement and Analysis                   18

CMMI-SW cross-reference to IEEE standards (for example), Level 2 CMMI-SW KPA and corresponding IEEE Standard:

•  Requirements Management: IEEE Std 830-1998, IEEE Recommended Practice for Software Requirements Specifications
•  Project Planning: IEEE Std 1058-1998, IEEE Standard for Software Project Management Plans
•  Project Monitoring and Control: IEEE Std 1058-1998, IEEE Standard for Software Project Management Plans
•  Process and Product Quality Assurance: IEEE Std 730-2002, IEEE Standard for Software Quality Assurance
•  Configuration Management: IEEE Std 828-1998, IEEE Standard for Software Configuration Management Plans
•  Supplier Agreement Management: IEEE Std 1062-1998, IEEE Recommended Practice for Software Acquisition
•  Measurement and Analysis: IEEE Std 1045-2002, IEEE Standard for Software Productivity Metrics


Philip Koopman, Carnegie Mellon

Error message haiku

A file that big?
It might be very useful.
But now it is gone.

Having been erased,
The document you’re seeking
Must now be retyped.

Yesterday it worked.
Today it is not working.
Windows is like that.

Glass’s 1992 observation:

•  Hardware deteriorates due to lack of maintenance
•  But software deteriorates because of maintenance!

Glass, Robert L., Building Quality Software, Prentice Hall, Englewood Cliffs, NJ, 1992.

Five Types of Software Maintenance

•  Corrective Maintenance
   –  Identify and remove defects
   –  Correct actual errors
•  Perfective Maintenance
   –  Improve performance, dependability, maintainability
   –  Update documentation
•  Adaptive Maintenance
   –  Adapt to a new/upgraded environment (e.g., hardware, operating system, middleware)
   –  Incorporate new capability
•  Preventative Maintenance
   –  Identify and detect latent faults
   –  Systems with safety concerns
•  Emergency Maintenance
   –  Unscheduled corrective maintenance (risks due to reduced testing)

Mechanisms to maintain reliability in production include (1/2):

•  Rigorous configuration management
•  Change Control Board reviews
•  Use of SCM tools
   –  They can force compliance to processes
•  Testing, testing, testing:
   –  Component, Staging, Integration
   –  Regression
   –  UAT
•  Infrastructure capacity management

Mechanisms to maintain reliability in production include (2/2):

•  Documentation of all changes
   –  Business process and software systems
•  Application Reference Documents:
   –  What it does
   –  What it interfaces to, parameters, interface standards
   –  What data it masters, what data is sourced elsewhere
   –  Who owns it, who understands it
•  Helpdesk and other support mechanisms
•  Training


Software reliability doesn’t exist in isolation:

People Software

Hardware Process

Pick any two:

Reliability

Cost

Time

Software reliability theory appears to work accurately in telecommunications and aerospace ...

WHY?

•  Governments regulate product quality:
   –  Telephones must work
   –  Airplanes must fly
•  In other disciplines, quality, and specifically reliability, has historically been an add-on, of lesser market value than feature-richness or short release cycles
•  Telecommunications and aerospace software works in embedded environments:
   –  Tied to the hardware on which it resides
   –  Apply hardware reliability directly to embedded software
•  Elsewhere there is scepticism and doubt among potential practitioners

Source: Whittaker and Voas; “Toward a More Reliable Theory of Software Reliability”; IEEE Computer, December 2000, pp 36-42

[1] UAT of telephone traffic statistics gathering system (1981)

•  International traffic
•  Large investment $$ at stake
•  Add-on to billing system
•  9,000+ hours development
•  “Cunning” database schema
•  What is “reliability”?
•  Pareto approach

[2] How much is the client willing to pay? (1/4)

Design Reliability: 99% (two nines)

Maximum Total (Planned and Unplanned) Annual Downtime: 3.5 days

Typical Hardware Configuration and Performance Characteristics: An active / passive cluster with RAID protected database. The database operation may be affected and will need to be synchronised to ensure integrity. If the data recovery includes the restore transaction ‘redo’ logs, then there could be a sizeable delay while the database is rebuilt. The end users will need to be prompted when the secondary system is ready.

Application Software and Performance Characteristics: In the event of a failure, the secondary system database needs to be restored to an effective operational status and the respective server and client application restarted. Data associated with one or more transactions may be lost. The end user will be prompted to restart the client application including login and to re-input one or more transactions.

[2] How much is the client willing to pay? (2/4)

Design Reliability: 99.9% (three nines)

Maximum Total (Planned and Unplanned) Annual Downtime: 8 hours

Typical Hardware Configuration and Performance Characteristics: Requires an active / active cluster with RAID protected database and automated server application failover. The database operation should not be affected. The cluster failover may need to physically restart the server application on the secondary host. The end users may need to be prompted to restart their client applications.

Application Software and Performance Characteristics: In the event of a failure, the secondary system automatically cuts in with only the data associated with the last incomplete transactions (disk and memory) being lost. The server and possibly the client application will need to be restarted. The end user may be requested to restart the client application, login and then re-input the last transaction.

[2] How much is the client willing to pay? (3/4)

Design Reliability: 99.99% (four nines)

Maximum Total (Planned and Unplanned) Annual Downtime: 1 hour

Typical Hardware Configuration and Performance Characteristics: Requires either a fault tolerant configuration or an active / active cluster with RAID 1/0 database and automated application failover. The client application needs to be automatically pointed in real time to the secondary server operation. It may be necessary to alert the end users there has been a problem and that it may be necessary to check their last transaction has been successfully completed.

Application Software and Performance Characteristics: In the event of a failure, the secondary system automatically cuts in. Only the data associated with the last incomplete transactions (disk and memory) is lost. The end user experiences a slight delay in service (less than 5 minutes) and is then prompted to re-input the last transaction.

[2] How much is the client willing to pay? (4/4)

Design Reliability: 99.999% (five nines)

Typical Hardware Configuration and Performance Characteristics: Normally requires the presence of a real time secondary data centre with the processing shared between two systems located in each of the respective centres.
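The downtime budget implied by each "nines" level follows from simple arithmetic on the hours in a year. The sketch below assumes a 365-day year and treats the design reliability percentage as the fraction of the year the service must be up; the function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, assuming a non-leap year


def max_annual_downtime_hours(availability_percent):
    """Hours per year a service may be down while still meeting
    the stated availability percentage."""
    return (1.0 - availability_percent / 100.0) * HOURS_PER_YEAR


for pct in (99.0, 99.9, 99.99, 99.999):
    hours = max_annual_downtime_hours(pct)
    print(f"{pct}% -> {hours:.3f} hours/year ({hours * 60:.1f} minutes)")
```

Exact arithmetic gives about 3.65 days, 8.76 hours and 53 minutes for two, three and four nines, so the 3.5 days, 8 hours and 1 hour quoted on the slides appear to be rounded figures. Five nines works out to a little over 5 minutes a year.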
