Software Reliability Engineering for System Test and Production Support
James Cusick
AT&T Bell Laboratories
[email protected]
September 1993
ABSTRACT
Presentation of the application of Software Reliability Engineering (SRE) practices to both the System Test phase and the ongoing Production Support phase of a teleconferencing software system. Discussion of setting a reliability target by baselining existing systems, timing software releases according to reliability tracking, and monitoring service quality in software reliability terms. Tools, techniques, theory, and processes of SRE are presented as applied in a resource-restricted and deadline-oriented shop. Costs of applying SRE and benefits resulting from its use are detailed.
NOTE: This version of the paper is missing its original graphs and drawings due to format incompatibilities. A paper copy with full diagrams can be provided upon request.
Software Reliability Engineering for System Test and Production Support
1. OVERVIEW
In response to the classic challenges of software development, namely faster delivery of ever more complicated systems, AT&T TeleConference Service sought out new process improvement approaches. Among the approaches selected for integration with our existing software development processes, Software Reliability Engineering (SRE) quickly began to pay off in assisting the organization to meet its demands for software of known quality and reliability delivered on schedule. This paper presents the experiences of the AT&T TeleConference System Test Group in applying SRE to software development and maintenance. A brief discussion of Software Reliability Engineering precedes an overview of the software system delivered with the help of SRE and the integration of SRE within the development process. Data collection, tools, and results of the SRE effort are discussed. Finally, costs and benefits of the SRE process are listed.
2. SOFTWARE RELIABILITY ENGINEERING
Software Reliability Engineering consists of processes and statistical methods used in predicting and tracking software reliability and related measures. SRE methodology and tools are derived from reliability theory transferred from the computer hardware industry. Software reliability is defined as follows: Software reliability is the probability that the software will execute for a particular period of time in a specified environment without a failure (Mollenkamp, 1992).
Many models for software reliability exist. Our team chose the Poisson Exponential Execution Time Model of Software Reliability as presented by Musa (1987). This model provides statistical methodology for measuring reliability and predicting reliability growth. The model allows for the calculation of reliability based on the nonnormal distribution of software failures. Equation 2.1 below yields a single number which represents the probability of the software executing without a failure for a given time period (Musa, 1987).
R(τ) = exp(-λτ)     (2.1)

where
  R(τ) = reliability for execution time τ
  exp  = the exponential function e^x
  λ    = failure intensity
  τ    = execution time
The execution time above simply represents some measure of software usage such as CPU hours or clock cycles. Failure Intensity represents the rate at which failures during operation of the software are encountered. Different usage patterns of any software often produce a variety of reliability levels. Operational Profiles quantify the usage pattern of the software by the intended users. Such a profile of utilization may be used to run tests on the system in order to track the reliability of the software as it might be seen by the eventual users (Ackerman, 1987). This acts to assure accuracy of the reliability projections between testing and production use.
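To make formula 2.1 concrete, the following sketch (in Python, with purely hypothetical numbers rather than UTOSS data) computes the probability of failure-free operation from a failure intensity and an amount of execution time; the function and variable names are illustrative only.

    import math

    def reliability(failure_intensity, execution_time):
        # Basic execution time model: R(tau) = exp(-lambda * tau).
        # failure_intensity: failures per unit of usage (e.g., per conference run)
        # execution_time: amount of usage, in the same units
        return math.exp(-failure_intensity * execution_time)

    # Hypothetical values for illustration: 0.002 failures per conference,
    # evaluated over 100 conferences.
    print(reliability(0.002, 100))   # about 0.82, an 82% chance of 100 failure-free conferences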
3. SYSTEM OVERVIEW AND THE ROLE OF SRE

3.1 Teleconference System Configuration
The target system for the application of SRE within AT&T TeleConference Service was a newly introduced support system for inbound conference calls. A simplified view of the service shows the placement of the software system, called UTOSS (Universal Teleconference Operations Support System), within the service. Figure 1 indicates voice and data flows among the customers, Teleconference Specialists, and the equipment. The UTOSS system provides reservation support and semi-real-time conference control support. The software consists of many related components running on multiple machines connected via a local area network.

3.2 Process Integration of SRE
The integration of SRE into the development process coincided with a revamping of the System Test process and the testing of some new releases of UTOSS. As shown in Figure 2, SRE captures data from the test process and from the ongoing monitoring of production software. The results of SRE analysis feed back into system design. Analysis of data from the production system served as our starting point in the application of SRE. After baselining the reliability of the production system, a reliability objective became a requirement for the next software version. Adherence to the reliability objective was then confirmed during the test phase of the next scheduled release.

4. DATA COLLECTION AND TOOLS
We collect several types of data as part of our SRE work. Some data collection relies on manual techniques. Some data reside in the UTOSS database and may be accessed via SQL (Structured Query Language). The team uses a Personal Computer equipped with the SRE TOOLKIT (Copyright 1991 AT&T) for data storage and some SRE calculations. Failure and execution time data are typed into the PC. Simple calculations are performed by hand.

4.1 Code Counting and Defect Prediction
The first data element required by SRE and our test process was Lines of Code (LOC). Using simple UNIX shell scripts to call tools provided by "exptools", the UTOSS data were collected. (UNIX is a registered trademark of UNIX System Laboratories.) Specifically, the "metrics" program was used for two purposes: 1) to collect total code counts; and 2) to conduct McCabe complexity analysis (McCabe, 1983). The LOC metric feeds many other calculations used for predicting the number of faults in the system and the length of time required to find them.

Three methods for predicting defects offered a range of estimates. Conducting a Function Point "backfire" (Jones, 1991, 1992) provided a rough estimate of anticipated defects. The fault prediction formulas provided by Musa (1987) gave a closer approximation of defects. Finally, estimates of faults remaining in the software after input of failure and execution time data came from the SRE TOOLKIT software on a continuing basis throughout the system test phase. These predictions acted as a complement to the reliability calculations in gauging the readiness of the software for release.
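As an illustration of the general shape of these calculations (not the calibrated formulas from Musa (1987) or Jones (1991), and with assumed coefficients rather than our actual parameters), a rough sketch might look like the following.

    def faults_from_loc(kloc, faults_per_kloc=6.0):
        # Rough size-based fault prediction; faults_per_kloc is an assumed
        # density, not a value taken from our project data.
        return kloc * faults_per_kloc

    def backfire_function_points(loc, loc_per_fp=128):
        # Function Point "backfire": convert LOC to function points with a
        # language-dependent factor (128 LOC per FP is an assumed figure).
        return loc / loc_per_fp

    loc = 50000                                   # hypothetical system size
    print("Estimated function points:", round(backfire_function_points(loc)))
    print("Rough fault estimate:", round(faults_from_loc(loc / 1000.0)))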
4.2 Failure Tracking
The next variable required by our study was the number of failures. Our earliest efforts to track failures counted only failures reported by the application programs themselves in the production environment. While these failures are important, the team decided that they are not as important as the failures observed by the users. Furthermore, accurate counting would have required software development resources beyond our budget. Production environment failures are now counted by the production site staff who use the system. These failures had always been carefully tracked and reported to our Production Support team; the reports now arrive weekly via electronic mail for analysis, which required no additional cost to implement. We define a failure as any behavior of the software which deviates from its planned or expected behavior. The working assumption in collecting failures from the field is that the users define what is and is not a failure. (This does not, however, free the team from the additional task of monitoring internal software failures not seen by the users, such as failures in record format as reported by "downstream" systems fed by UTOSS.) Failures in the laboratory environment are tracked by our System Test Group and input directly into the PC. Counting of failures differs between test and production in one key way. Occurrences of unique failures are counted only once during testing, on the assumption that the cause of the failure will be corrected. However, each occurrence of any failure is counted during production cycles, including repetitions of the same failure, since corrections in production normally must wait for a new software release.
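The counting rule above can be summarized in a short sketch; the report identifiers below are invented and do not reflect the UTOSS trouble report format.

    def count_failures(reports, phase):
        # In system test, repeated occurrences of the same underlying failure
        # count once (a fix is assumed to follow); in production every
        # occurrence counts, since fixes wait for the next release.
        if phase == "test":
            return len(set(reports))
        return len(reports)

    weekly_reports = ["F101", "F102", "F101", "F103", "F101"]
    print(count_failures(weekly_reports, "test"))        # 3 unique failures
    print(count_failures(weekly_reports, "production"))  # 5 occurrences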
4.3 Execution Time Metric
Our team uses an approximation of execution time rather than the actual execution time of the software. The use of conferences run as the execution time metric was the most important choice in our SRE process. An attempt was made to use CPU usage of the application programs as the execution time metric. This number was hard to calculate, would have required additional development of supporting software, and the reliability statistics based on CPU usage were not intuitive to the team. When we switched to the number of conferences, the figures made more immediate sense. The UTOSS system has a built-in pulse: the number of conferences processed each day can be used as the execution time metric (Ackerman, 1993). Failures per conference run is a concept we could grasp immediately. This number can be unambiguously determined from the same reports which provide the failures seen by the operator center. During laboratory tests simple SQL commands collect the same data.
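With conferences run as the execution time metric, the failure intensity for a reporting period reduces to a simple ratio, as in the sketch below (the numbers are illustrative, not actual UTOSS traffic).

    def failure_intensity(failures, conferences_run):
        # Approximate failure intensity: failures per conference run
        # over the reporting period.
        return float(failures) / conferences_run

    # Hypothetical weekly report:
    print(failure_intensity(failures=3, conferences_run=2500))   # 0.0012 failures per conference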
4.4 Operational Profile
Using an Operational Profile to guide System Testing produces failure data in accordance with the probabilities that similar functional usage patterns will be followed in production. Basing testing on this profile assures that when the reliability objective is reached, the most heavily used functions have been adequately tested (Musa, 1993). Data collected from the production version of UTOSS provided historical usage patterns for nearly all system functions. The granularity of the functional breakdown provided adequate direction to build test runs in accordance with actual customer and specialist usage patterns. Functional probability is calculated in relation to conferences run. For example, if 10 conferences are run per hour, a relationship to new reservations created in that hour can be determined. Testing based on these observations was conducted during the System Test phase for the R1.2 version of UTOSS.
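A minimal sketch of profile-driven test selection follows; the functions and probabilities are assumptions for illustration, not the actual UTOSS operational profile.

    import random

    # Assumed operational profile: probability of each function per conference run.
    operational_profile = {
        "create_reservation": 0.45,
        "modify_reservation": 0.20,
        "cancel_reservation": 0.10,
        "conference_control": 0.20,
        "reporting":          0.05,
    }

    def build_test_run(n_cases, profile, rng=random.Random(42)):
        # Draw test cases so functions are exercised in proportion to their
        # production usage, per the operational profile approach (Musa, 1993).
        functions = list(profile)
        weights = [profile[f] for f in functions]
        return rng.choices(functions, weights=weights, k=n_cases)

    print(build_test_run(10, operational_profile))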
5. RESULTS AND DISCUSSION

5.1 Presentation Format of Results
Due to the proprietary nature of the execution time metric used in this study, the presentation of results excludes precise quantification. Each chart detailing Failure Intensity vs. Calendar Time uses the same exponential y-axis with an identical set of alphabetic markers. These charts are shown with confidence levels (dashed lines) above and below the most likely Failure Intensity (solid lines). The chart showing Failures vs. Execution Time provides failure data on the y-axis and uniformly spaced alphabetic markers representing linear execution time.
5.2 UTOSS Reliability Baseline
The team needed to baseline the reliability of the software version in the field in order to know what level of reliability to aim for in the next version. Plots of the field trouble reports versus the number of conferences run using the R1.1 and R1.1.1 software versions provided a baseline of reliability for the system. The initial baseline results showed a steady failure occurrence level of λ = B for UTOSS version R1.1.1 (Figure 3). This provided a failure rate objective which we needed to match or improve upon when building subsequent software releases.
5.3 System Test Results of UTOSS R1.2
Tracking defects and execution time during system test produced a day-by-day estimate of software reliability. Early results tended to fit the theoretically predicted patterns of defect discovery and reliability growth. Figure 4 plots the failure discovery rate against the execution time of the system up to the planned release date. This plot indicates a flattening of the failure discovery rate. A key decision was whether the software would be ready for release to production on or close to the promised delivery date. While many factors contributed to that decision, the SRE results provided a strong indicator that the release of the new software would match the reliability of the existing software. Figure 5 shows the reliability growth tracking available as a decision input at the time of the release to production. The lab results indicated a λ of B, nearly identical to the value seen in the production system baseline.
5.4 Production Results of UTOSS R1.2
Even though the decision to ship the software was made, one major question remained: how the reliability demonstrated by the software in the lab would translate to operational reliability in production. Many environmental conditions could not be simulated adequately in the lab. Further, the equipment differed somewhat in configuration and throughput. As a result, a certain amount of variance from the reliability observed in the lab was anticipated following the release of the software to the field.
After releasing the software to production, several defects not observed in the lab became apparent. The failure intensity began to rise and stabilized at a new λ between B and C, as shown in Figure 6. After correcting these problems and releasing a final tape of the software to production, the operational failure intensity dropped back towards the λ objective of B. This information allows us to draw a ratio between the failure rate observed in the lab and the rate observed in production. Future releases of this software can benefit from this ratio by applying it as a correction factor to gauge actual production behavior of the software from the failure rates observed in the lab. To meet or surpass the performance in the field, testing could, for example, continue until the λ demonstrated in the lab was lower than the field objective by that factor.
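A sketch of how such a lab-to-field factor could be applied is shown below, using invented failure intensities in place of the proprietary values.

    def lab_to_field_factor(lab_lambda, field_lambda):
        # Ratio between the failure intensity observed in production and in the lab.
        return field_lambda / lab_lambda

    def lab_release_target(field_objective, factor):
        # Failure intensity to demonstrate in the lab so that, once the observed
        # lab-to-field factor is applied, the field objective is still met.
        return field_objective / factor

    # Invented values for illustration only:
    factor = lab_to_field_factor(lab_lambda=0.001, field_lambda=0.002)
    print(lab_release_target(field_objective=0.002, factor=factor))   # 0.001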
5.5 Production Results: Customer Affecting Troubles
A separation of Customer Affecting Troubles (CAT) from operational failures provides another significant view of the system performance. DMOQs (Direct Measures of Quality) for the service define several types of CAT. Breaking out these failure instances from the overall operational failure events gives us a plot indicating CAT failure rates. Figure 7 shows results for Customer Affecting Troubles during the first half of 1993. Such a breakout provides a simple discrimination of failures along severity or priority lines. Namely, any failure affecting customers is considered to fall into the highest severity failure category. Separation of Customer Affecting Troubles from all Operational Troubles thereby assists in prioritizing development fixes.
5.6 Reliability Comparisons of All UTOSS Software Versions
A comparison of all UTOSS software versions provides perspective on the ability to control and monitor software reliability. Figure 8 plots Failure Intensity over the life of the service. (Single-letter abbreviations for calendar months appear along the x-axis.) Precise values for the R1.0 and R1.1 versions cannot be determined due to differences in the trouble reporting process before the introduction of SRE. The important point of comparison is the relative success of maintaining the software within much more limited failure occurrence levels following the integration of SRE into the test process.
5.7 Reliability Projections
With failure intensity and conference volume established, reliability formula 2.1 presented above calculates reliability for UTOSS at any point in its operational life. Of more interest is the reliability of the software now and what it will be in the future. Future reliability levels can be estimated by substituting higher values of τ, as provided by capacity planning, and holding λ constant. These calculations feed the decision process by allowing for an unambiguous quantification of system behavior given a projected conference load level. Trade-off decisions can be made more rationally with such data in hand.
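For example, holding λ at its currently demonstrated level and substituting projected conference volumes from capacity planning gives forward-looking reliability figures, as in this sketch (all numbers are invented).

    import math

    def projected_reliability(failure_intensity, conferences):
        # Formula 2.1 with tau expressed as conferences run.
        return math.exp(-failure_intensity * conferences)

    current_lambda = 0.0015   # assumed current failure intensity per conference
    for volume in (200, 400, 800):
        print(volume, round(projected_reliability(current_lambda, volume), 3))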
6. LIMITATIONS IN THIS APPLICATION OF SRE
We recognize that some limitations and errors in our SRE program remain. Ignoring software failures reported by the software itself and relying only on the accuracy of the trouble incidence reports of the Teleconference Center results in under-reporting of errors. However, we tend to view this low but consistent method of reporting errors by the users as an adequate barometer of reliability. Other problems were encountered in extending our analysis to test time prediction. While the projections of the number of defects remaining and of reliability levels were useful, the staff time estimates produced by SRE were unusable. The staff time and calendar time estimates put delivery of the system many years in the future; as the lab results neared the failure objective, the time estimates suddenly dropped to zero. Continued work in parameter definition may eventually yield worthwhile data.

7. COSTS AND BENEFITS OF APPLYING SRE
A variety of process, organizational, and technology changes took place at the same time that we began using SRE. It is not clear that all the savings and success of the team derive only from the use of SRE. Nevertheless, SRE clearly contributed more to our control of the development process than it cost to integrate the process into our work.

The costs of SRE for our team included:
• 1.5 months of staff effort for start-up, including process research, 3 weeks of training, and data collection and analysis.
• 1 staff day per week of ongoing effort.

The benefits of using SRE included:
• An 89% cost savings in the field test phase by reducing the number of field trial tests required, as measured between the R1.1 and R1.2 releases.
• Release of the new version on time with predictable reliability.
• A 54% decrease in failure rate while traffic increased 34%.
• A new process paradigm for objective decision making.
8. CONCLUSIONS
The SRE process provided a confidence level going into production releases not previously available. Further integration of SRE earlier in the development cycle offers additional promise. AT&T TeleConference Service manages a variety of systems in order to provide our customers with the highest quality service. Our team sees SRE as a key component in our development toolkit and plans to employ SRE on future generations of systems now under development.
9. ACKNOWLEDGEMENTS
Several AT&T TeleConference Service team members have been instrumental in assisting our SRE initiative. Jerry Pascher, Jerry Manese, Joe Chacon, Max Fine, and the entire System Test Group have lent support to the effort at each step. Members of the AT&T TeleConference Service Systems Development group, especially Bob Stokey, Grant Davis, Rao Karanam, and Andy Johnson, contributed valuable insights and suggestions during our work on metrics. Frank Ackerman of The Institute for Zero-Defect Software has also provided ongoing support by reviewing the application of SRE to AT&T TeleConference Service Development. Finally, I would like to acknowledge Bill Everett of AT&T Bell Laboratories and the Quest organization for helping to answer questions, suggesting new directions for our work on SRE, and serving as a source of materials for the presentation of this paper.
10. REFERENCES
[1] Ackerman, F., "Software Reliability Engineering Applications", AT&T Technical Education, February 1993.
[2] Ackerman, F., and Musa, J. D., "Quantifying Software Validation: When to Stop Testing?", IEEE Software, May 1989.
[3] Jones, C., Applied Software Measurement: Assuring Productivity and Quality, McGraw-Hill, New York, 1991.
[4] Jones, C., "Applied Software Measurement", AT&T Technical Education, November 1992.
[5] McCabe, T. J., "Structured Testing", IEEE Computer Society, 1983.
[6] Mollenkamp, D., ed., Software Reliability Engineering: Best Current Practices, AT&T Bell Laboratories, Software Quality & Productivity Cabinet, October 1992.
[7] Musa, J. D., Iannino, A., and Okumoto, K., Software Reliability: Measurement, Prediction, Application, McGraw-Hill, 1987.
[8] Musa, J. D., "Operational Profiles in Software-Reliability Engineering", IEEE Software, March 1993.