
Software Testing: A Machine Learning Experiment

Thomas J. Cheatham, Jungsoon P. Yoo, and Nancy J. Wahl
Department of Computer Science
Middle Tennessee State University
Murfreesboro, Tennessee 37132

Abstract

Testing is a critical part of the software development process. As the cost of software development has escalated, attempts to accurately estimate the cost (and time) of software testing have become more important. Research is being done to predict software development costs and to develop tools to help in cost estimation. In this research, machine learning techniques are applied to determine which software testing attributes are important in predicting software testing costs, and more specifically software testing time. Testing data on 25 software projects were collected. Using this database, a machine learning system identifies the factors that affect testing time by creating a classification tree in which the programs at a node share similar attribute values. By analyzing the classification tree we are able to determine the salient factors that placed a program into its group. The factors we consider include code complexity measures, measures of programmer and tester experience, measures of the use of software engineering principles such as structured programming techniques, and statistics collected during actual testing of the projects. The programs in a node have an average testing time that is representative of any program whose attributes would place it into that group. Thus, the tree can be used, among other things, to predict testing time.

Partially supported by NSF ILI Grant DUE-9352219.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

© 1995 ACM 0-89791-737-5

I. INTRODUCTION

The cost of developing a computer-based system consists primarily of the cost of developing the software; software development is estimated to be 80% of the cost of developing a computer-based system. Testing is a critical part of the software development process. As the cost of software development has escalated, attempts to accurately estimate the costs of all phases of the software development process, including software testing, have become more important. Since resources are limited, it is critical to determine how they should be allocated throughout the software life-cycle. Tools are being developed to estimate software development costs. The purpose of this research is to demonstrate a machine-learning technique for identifying the attributes that are important in predicting software testing costs, and more specifically software testing time, in a particular company. Understanding the factors that affect the cost of software testing allows management to plan for testing.

It can be argued that a company will develop a style of software development that is unique to that company, and that the types of software systems developed by the company will be somewhat consistent. Therefore, predictions of testing costs are based on data collected within the given company; these predictions are calibrated to the company's software development environment. Our research focuses on determining the testing time of a software system after all modules have been integrated. In this research, we address the following questions: 1) Can we classify software in order to predict the resources needed for testing? 2) Which factors affect system testing time?

Our approach consists of the following: 1) Create a database of attributes for software systems previously developed by a particular company. 2) Use a machine learning tool to "learn" which

attributes affect software testing by building a classification tree. 3) Use the classification tree to estimate the testing time of a new system. Part of the appeal of this approach is that the attributes that affect one company's testing costs may differ from the attributes affecting another company's. The machine learning system determines what is important in a given environment, in a sense customizing the outcome for a particular company.

The remainder of this article is organized as follows. Related research is discussed in Section II. The machine learning tool is explained in Section III. The experiment is outlined in Section IV, and conclusions are given in Section V.

II. RELATED RESEARCH

The research described in this article builds on work in software development cost models, error projection techniques, and previous applications of machine learning tools. Research in each of these areas is discussed below, starting with a brief description of relevant cost estimation models.

Cost Estimation Models

There are many cost estimation models for predicting software development costs. These include experiential models such as Wolverton's Software Cost Matrix [20], which is based on an expert's judgement and experience with similar projects, and Albrecht's Function Points [1], which estimates costs based on the requirements for a software project. The problem with such models is that they are subject to the expert's biases.

Halstead's Software Science [8] uses complexity measures such as the number of operators and operands to understand software systems. He related the "complexity" of a system to its cost, but his complexity measures are no better indicators of coding effort than the number of lines of code, and they do not account for control-flow complexity.

Nelson's System Development Corporation study [13] was one of the early research efforts to develop a list of factors that affect the cost of software development. The factors Nelson identified include the percentage of math instructions, percentage of I/O instructions, number of subprograms, programming language, lack of requirements, application area, and so on. Nelson's linear model does not account for the interdependence of the factors that affect software development.

Walston and Felix also developed a list of factors affecting software development effort [19]. These factors include user participation in the definition of requirements, experience of the personnel on the software development team, use of software engineering principles, constraints on the design, concurrent hardware development, and so on.

Bailey and Basili developed a model [2] to predict effort in developing software that grouped the factors affecting effort into methodology factors (top-down design, formal documentation, chief programmer teams, formal test plans, and code readings), interface and complexity factors (application, program flow complexity), and experiential factors (programmer qualifications, machine experience, language experience, application experience, and team experience).

Boehm's COCOMO model uses program "type" to determine costs [4]. After the program type is determined, the number of delivered source lines is estimated and the formula corresponding to that type of program is computed. The multipliers in the formula depend on project-specific factors such as reliability, database size, software complexity, memory constraints, operating system volatility, use of software tools, schedule constraints, and analyst and programmer experience. Boehm developed his formulas based on his personal experience, the experience of others, and analysis of data from the development of 63 projects at TRW. The model would have to be recalibrated for a different software development environment.
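For concreteness, the standard textbook forms of Halstead's effort measure and the basic COCOMO effort equation are sketched below. They are quoted from the general literature on the two models [4, 8], not taken from this paper. In Halstead's notation, n_1 and n_2 are the numbers of distinct operators and operands, and N_1 and N_2 are their total occurrence counts:

    V = (N_1 + N_2)\,\log_2(n_1 + n_2) \quad\text{(volume)}, \qquad
    D = \frac{n_1}{2}\cdot\frac{N_2}{n_2} \quad\text{(difficulty)}, \qquad
    E = D \cdot V \quad\text{(effort)}.

For basic COCOMO, with effort E in person-months and KDSI the estimated thousands of delivered source instructions, the constants depend on the program "type":

    E = a\,(\mathrm{KDSI})^{b}, \qquad (a, b) =
      \begin{cases}
        (2.4,\ 1.05) & \text{organic} \\
        (3.0,\ 1.12) & \text{semidetached} \\
        (3.6,\ 1.20) & \text{embedded.}
      \end{cases}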

Error Projection

Another important factor in software testing is the number of failures observed during testing. Several researchers have analyzed factors such as complexity measures to predict errors. For example, Shen et al. [17] found that metrics relating to the number of unique operands, together with Halstead's effort measures, are the best predictors of which modules will have the most errors. Basili and Selby [3] found that the number of faults observed, the fault detection rate, and the total detection effort were related to the type of software tested.

Machine Learning Research

Machine learning techniques have been applied to the problem of estimating software development time. Srinivasan and Fisher [18] used decision trees and a neural-network approach known as backpropagation to estimate software development time. They concluded that the machine learning techniques were comparable to SLIM [15], COCOMO, and Function Points.


We have applied a similar technique to estimating software testing time. A machine learning system was used to identify the factors that affect testing; the system is described in the next section.

III. A MACHINE LEARNING APPLICATION

We have developed a methodology, based on machine learning, to help managers predict software testing time. Static attributes describing the software, together with actual test statistics, were presented as input to a conceptual clustering system named COBWEB/3, a machine learning system developed at NASA Ames Research Center [11]. The system organizes descriptions of objects or events into a classification tree.

COBWEB/3 is based on COBWEB [5], which was originally developed as a model of incremental concept formation to demonstrate some psychological aspects of the human learning process. When humans observe objects, they often group the objects into classes by finding common features. This classification process aids our understanding of objects by finding common features within a class and differences among classes. COBWEB can be used only for attributes with symbolic values; COBWEB/3 is an extension of COBWEB that handles attributes with symbolic or numeric values. Since the majority of attributes in this study had numeric values, COBWEB/3 was used. Some of the software testing attributes were initially evaluated as high, medium, or low, but these values were changed to numeric values. COBWEB/3 uses the standard deviation to measure the closeness of numeric values when classifying objects.

COBWEB/3 incrementally organizes objects into a classification tree that maximizes inference ability. Each node in the classification tree corresponds to a concept that represents the class of objects classified under the node, and each node is labeled by a probabilistic concept that summarizes the attribute-value distribution of the objects classified under it. Higher-level nodes represent more general concepts than lower-level nodes, and the terminal nodes are the individual objects. The goal is to apply the classification scheme of COBWEB/3 to software systems in order to identify factors that aid in the estimation of testing time, as explained in the next section.
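As a rough sketch of the mechanism, following the COBWEB literature [5, 7] rather than anything stated explicitly in this paper: COBWEB-style systems incorporate each new object by choosing the placement that maximizes the category utility of the resulting partition {C_1, ..., C_n} over attributes A_i with values v_ij,

    CU = \frac{1}{n}\sum_{k=1}^{n} P(C_k)\Big[\sum_i\sum_j P(A_i = v_{ij} \mid C_k)^2 - \sum_i\sum_j P(A_i = v_{ij})^2\Big].

For numeric attributes, COBWEB/3 assumes a normal distribution within each node, so the inner sums become integrals that evaluate to

    \sum_j P(A_i = v_{ij} \mid C_k)^2 \;\longrightarrow\; \frac{1}{2\sqrt{\pi}\,\sigma_{ik}},

and the user-supplied minimum standard deviation (the "acuity"; 0.33 in this study, see Phase Three) bounds sigma_ik from below so that a singleton node cannot score infinitely well.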

IV. THE EXPERIMENT

The empirical study consists of four phases: database formation, data collection, classification of software, and analysis of results. These phases are described in the following sections.

Phase One: Database Formation

In the first phase a database of twenty-five programs was constructed. Software systems were collected from several sources. A dozen systems were selected from the examples in various chapters of the textbook C by Discovery by Foster [6]. Six systems were chosen from the textbook Problem Solving and Program Design in C by Koffman, Hanly, and Friedman [10]. Four systems were solutions to laboratory assignments in a C class, all written by the same student. The three largest programs were from different sources: one is a random graph generation system distributed by Johnsonbaugh and Kalin [9]; another is a graduate student's solution to a 3-D graphics laboratory assignment using X-windows; the last, developed by a team of students in a software engineering class, is a spell-checker for FORTRAN programs. The systems were all written in C and range in size from 17 to 1198 non-commented source lines (NCSL). They contain from 1 to 50 functions and vary in complexity.

Phase Two: Data Collection

The second phase involved the actual data collection. As mentioned earlier, a list of attributes was selected from the software testing literature (see Table 1). All attributes have integer values. A range of 1-3 means low (1), medium (2), and high (3). Testing time is measured in minutes; it ranges from half an hour to about three hours. Tools were developed to gather the static code complexity measures (attributes 8-15). The test designer and the tester (usually two different people) evaluated the attributes that required low, medium, or high answers. A set of functional requirements was developed for each system, and then a set of test cases was constructed to test each requirement. One graduate student did the majority of the testing. An on-line summary sheet was completed for each system. Using an input test data file, a UNIX shell script executed the program with each test case and collected the output in files for comparison with expected results. The UNIX "diff" program was used to detect differences. This approach did not work for all software systems; for instance, it was not used on the 3-D X-windows program. Once the testing was completed, the tester encoded values for attributes 24-27 on the summary sheet.
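The authors' harness was a UNIX shell script built around "diff". The following C program is a minimal sketch of the same idea; the file layout (tests/case%d.in, tests/case%d.expected) and the case count are illustrative assumptions, not the authors' actual setup.

    /* Minimal sketch of the test harness described above: run the
     * program under test on each test case and diff the output
     * against the expected result.  File names are hypothetical;
     * the authors used a UNIX shell script. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        const char *prog = (argc > 1) ? argv[1] : "./a.out";
        int failures = 0;

        for (int i = 1; i <= 20; i++) {   /* Table 1: 2-20 cases */
            char cmd[512];
            /* Run one test case, capturing its output. */
            snprintf(cmd, sizeof cmd,
                     "%s < tests/case%d.in > tests/case%d.out 2>&1",
                     prog, i, i);
            if (system(cmd) == -1) {
                perror("system");
                return EXIT_FAILURE;
            }
            /* Compare actual output with expected output. */
            snprintf(cmd, sizeof cmd,
                     "diff -q tests/case%d.out tests/case%d.expected",
                     i, i);
            if (system(cmd) != 0) {
                printf("case %d: FAILED\n", i);
                failures++;
            }
        }
        printf("%d failing case(s)\n", failures);
        return failures ? EXIT_FAILURE : EXIT_SUCCESS;
    }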


Table 1. Attributes

 #   Attribute                                             Range of Values
 1   Number of test cases                                  2 - 20
 2   Is data entered interactively?                        0 - 1
 3   What is the user's expected skill level?              1 - 3
 4   What are the requirements for error recovery?         1 - 3
 5   Required software reliability level                   1 - 3
 6   Non-commented source lines (NCSL)                     17 - 1198
 7   Nesting levels                                        1 - 7
 8   Number of decisions                                   1 - 199
 9   Token count                                           85 - 12515
 10  Function count                                        1 - 50
 11  Depth of calling tree                                 1 - 8
 12  Math instructions                                     0 - 267
 13  I/O instructions                                      5 - 144
 14  Use of structured programming techniques              1 - 3
 15  Use of global variables                               1 - 3
 16  Use of meaningful variable names                      1 - 3
 17  Internal documentation                                1 - 3
 18  Programmer experience in the operating environment    1 - 3
 19  Programmer experience in the programming language     1 - 3
 20  Programmer experience in the application area         1 - 3
 21  Tester experience in the operating environment        1 - 3
 22  Tester experience in the programming language         1 - 3
 23  Tester experience in the application area             1 - 3
 24  Number of errors found                                0 - 11
 25  Testing time (minutes)                                30 - 225
 26  Classify errors by difficulty for testing             1 - 3
 27  Classify errors by difficulty for correcting          1 - 3

Phase Three: Classification of Software

The third phase was the actual classification of the software systems based on conceptual clustering. C programs were developed to extract the attribute values from the summary sheets and "normalize" them for use by COBWEB/3. COBWEB/3 uses the same (user-supplied) minimum standard deviation for all attributes; therefore, it is not acceptable to have one attribute that ranges from 1-3 and another that ranges from 17-1198. Normalization made the ranges of the attribute values consistent while preserving the relative significance of the values.
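The paper does not give the normalization formula. A plausible reading is a simple min-max rescaling of each attribute onto a common range, sketched below in C; the target range [1.0, 3.0] and the sample values are illustrative assumptions, not the authors' actual choices.

    #include <stdio.h>

    /* Hypothetical min-max normalization of one attribute across all
     * programs, mapping raw values onto a common range so that a
     * wide-range attribute (e.g., token count, 85-12515) does not
     * dominate a narrow one (e.g., a 1-3 rating) under a shared
     * minimum standard deviation. */
    static void normalize(double *v, int n, double lo, double hi)
    {
        double min = v[0], max = v[0];
        for (int i = 1; i < n; i++) {
            if (v[i] < min) min = v[i];
            if (v[i] > max) max = v[i];
        }
        for (int i = 0; i < n; i++)
            v[i] = (max == min) ? lo
                 : lo + (hi - lo) * (v[i] - min) / (max - min);
    }

    int main(void)
    {
        /* Example NCSL values; 17 and 1198 are the extremes reported
         * in Phase One, the middle values are made up. */
        double ncsl[] = { 17.0, 150.0, 420.0, 1198.0 };
        normalize(ncsl, 4, 1.0, 3.0);
        for (int i = 0; i < 4; i++)
            printf("%.2f\n", ncsl[i]);
        return 0;
    }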

COBWEB/3 produced the classification tree shown in Table 2, using the normalized data and a minimum standard deviation of 0.33. The tree has two main branches, Node 27 and Node 8. Node 27 consists of the three large programs, which are clustered in Node 29, and the four class assignments. Node 8 consists of the remaining programs, which are clustered into two nodes, Node 24 and Node 15. Node 24 consists of eight programs taken from Foster's book and one program from Koffman's book. Node 15 has a combination of programs from both books.


Table 2. Concept Hierarchy Tree

> (run "c29" :acuity 0.33)
> (print-tree)
Concept hierarchy is:
ROOT
  Node27
    Node29
      large/spell
      large/graph
      large/3dgraph
    Node5
      class/lab5
      class/lab3
      class/lab2
      class/lab4
  Node8
    Node24
      koffman/precision
      foster/string
      Node25
        foster/struct2
        foster/enumindx
        foster/classify
      Node21
        foster/stack
        foster/arraypar
        foster/array
        foster/mergesort
    Node15
      koffman/water
      koffman/sets
      koffman/convert
      Node17
        foster/linklist
        foster/fseek
        foster/chlincnt
      Node31
        koffman/apply
        koffman/3parts
        foster/counter

Phase Four: Analysis of Results

As those who study software metrics would have predicted, the primary attributes that separated Node 27 and Node 8 in Table 2 are the number of non-commented source lines of code (NCSL), the number of decisions (a la Thomas McCabe), and the token count (a la Maurice Halstead). Programmer experience (attributes 18-20) is also significant. The Foster and Koffman programs were all given the same level of programmer experience, so the difference obviously lies in Node 27 (the class assignments and larger programs); the class assignment programs were written by a student with little programming experience.

The essential distinguishing factor between Node 24 and Node 15 is the number of errors found. Fewer errors were found in the Node 15 programs, which are most of the Koffman programs. These programs range in size from 17 to 182 lines, but their testing time is not significantly different, perhaps because the test cases were derived from the requirements (or problem statement) and not from the code.

The primary differences between the class assignment programs (Node 5) and the larger programs (Node 29) include the code complexity measures (attributes 6-13), programmer experience (attributes 18-19), the requirements for error recovery and reliability (attributes 4-5), and testing time (attribute 25). The larger programs are in the 1000-NCSL range while the class assignment programs are in the 150-NCSL range. More importantly, the larger programs have higher reliability and error-recovery requirements than the student's solutions to class assignments.

From the empirical study we see that, given a

system with properties similar to the class assignment programs (Node 5) or the larger programs (Node 29), we can reasonably estimate the testing time for the system, since testing time is a significant attribute in Node 27. On the other hand, we may not be able to adequately estimate the testing time for a system with attributes similar to the Foster or Koffman systems, since testing time is not a property that divides these classes. Even so, the classification has merit: it can be used to estimate the number of errors such a system will have. The system would predict zero (0) errors in a program equivalent to the Koffman programs (Node 15) and one (1) error in a program similar to the Foster programs (Node 24).

Further, the classification can point out possible errors in the data collecting process. For instance, in


an early classification tree the "koffman/sets" program was classified with the "class/lab2" and "class/lab3" programs. Looking at the data we noticed that it had only half as many test cases as they did. Upon investigation we discovered that half the test cases for "sets" had been accidentally omitted. The problem was corrected and the system reclassified.

A software development company could collect data on the attributes of its software systems and, using a tool like COBWEB/3, make reasonable estimates of testing time based on the classification tree. If a new system differs significantly from earlier work, it will be apparent in the classification tree, and management can look for an explanation of the differences.
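To make the prediction step concrete, the toy C sketch below places a new program in the nearest leaf-level cluster and reports that cluster's mean testing time. This is a deliberate simplification: COBWEB/3 chooses a host node by maximizing category utility, not Euclidean distance, and the centroid and testing-time values here are invented for illustration, not taken from the paper's data.

    /* Toy illustration of prediction from a classification tree:
     * assign a new program to the closest cluster of normalized
     * attributes and report that cluster's mean testing time.
     * Euclidean nearest-centroid matching stands in for COBWEB/3's
     * category-utility-based placement; all numbers are made up. */
    #include <math.h>
    #include <stdio.h>

    #define NATTR 4   /* e.g., NCSL, decisions, tokens, experience */

    struct cluster {
        const char *name;
        double centroid[NATTR];    /* normalized attribute means    */
        double mean_test_minutes;  /* average testing time in node  */
    };

    static double dist(const double *a, const double *b)
    {
        double s = 0.0;
        for (int i = 0; i < NATTR; i++)
            s += (a[i] - b[i]) * (a[i] - b[i]);
        return sqrt(s);
    }

    int main(void)
    {
        /* Hypothetical clusters loosely modeled on Table 2's nodes. */
        struct cluster nodes[] = {
            { "Node15 (small Koffman-like)", {1.1, 1.1, 1.0, 3.0},  45.0 },
            { "Node24 (small Foster-like)",  {1.3, 1.2, 1.1, 3.0},  60.0 },
            { "Node5 (class assignments)",   {1.4, 1.5, 1.3, 1.0},  90.0 },
            { "Node29 (large systems)",      {2.8, 2.7, 2.9, 2.0}, 170.0 },
        };
        int n = (int)(sizeof nodes / sizeof nodes[0]);
        double newprog[NATTR] = {2.6, 2.5, 2.8, 2.0}; /* normalized */

        int best = 0;
        for (int k = 1; k < n; k++)
            if (dist(newprog, nodes[k].centroid) <
                dist(newprog, nodes[best].centroid))
                best = k;

        printf("closest cluster: %s, predicted testing time ~%.0f min\n",
               nodes[best].name, nodes[best].mean_test_minutes);
        return 0;
    }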

V. CONCLUSIONS AND FUTURE WORK

The methodology described in this article will allow management to estimate testing time or any other unknown attribute of the software. If finding the attributes that determine testing time were the only objective, a supervised machine learning system such as ID3 [16] could be used. However, unsupervised learning systems such as COBWEB/3 have the advantage of clustering software systems by considering all attributes of the systems, not just testing time.

Several statistical analyses were performed on the testing database in addition to the COBWEB/3 analysis. Multiple linear regression analysis at the 95% confidence level did not discover a subset of the attributes that could reasonably estimate testing time. A factor analysis was also performed on the data; it grouped the software systems into two factors, but it is hard to make a clear distinction between the two. In general, the statistical analyses were inconclusive. COBWEB/3, on the other hand, was able to group the systems satisfactorily.

Our data set is admittedly small (25 programs), and the software systems themselves are small. However, there is no apparent problem in scaling the approach up to real-world systems. Software managers need help in allocating scarce software development resources, especially testing time. We have demonstrated a methodology, based on an analysis of real project data, that can assist the manager. The method involves (1) creating a database of attributes for previously developed systems, (2) applying a machine learning tool to "learn" the salient attributes in this environment, (3) classifying a new system based on these attributes, and (4) analyzing the results to predict testing costs.

The approach helps management discover "unusual" attributes and estimate missing attributes such as testing time and number of errors. The approach is customized to a particular software development environment, allowing for differences due to development practices and/or type of software. It is critical that a company collect data about its software systems as they are developed and tested. This database can then be used by the methodology outlined in this article to predict testing costs for new software projects. We expect to apply a similar process to measure reliability and to estimate when a system is stable enough to release.

VI. REFERENCES

1. Albrecht, A., "Measuring Application Development Productivity," Proceedings of the IBM Application Development Symposium, Monterey, CA, October 1979, pp. 83-92.

2. Bailey, J. and V. Basili, "A Meta-Model for Software Development Resource Expenditures," Proceedings of the 5th International Conference on Software Engineering, IEEE/ACM/NBS, March 1981, pp. 107-116.

3. Basili, V. and R. Selby, "Comparing the Effectiveness of Software Testing Strategies," IEEE Transactions on Software Engineering, Vol. SE-13, No. 12, December 1987, pp. 1278-1296.

4. Boehm, B., "Software Engineering Economics," IEEE Transactions on Software Engineering, Vol. SE-10, No. 1, January 1984, pp. 4-21.

5. Fisher, D., "Knowledge Acquisition via Incremental Conceptual Clustering," Machine Learning, Vol. 2, 1987, pp. 139-172.

6. Foster, L., C by Discovery, Scott/Jones Inc., El Granada, CA, 1991.

7. Gluck, M. and J. Corter, "Information, Uncertainty, and the Utility of Categories," Proceedings of the Seventh Annual Conference of the Cognitive Science Society, Irvine, CA, Lawrence Erlbaum, 1985, pp. 283-287.

8. Halstead, M., Elements of Software Science, Elsevier, New York, 1977.

9. Johnsonbaugh, R. and M. Kalin, "A Graph Generation Software Package," SIGCSE Bulletin, Vol. 23, No. 1, 1991, pp. 151-154.

10. Koffman, E., Hanly, J., and F. Friedman, Problem Solving and Program Design in C, Addison-Wesley, Reading, MA, 1993.

11. McKusick, K. and K. Thompson, COBWEB/3: A Portable Implementation, Technical Report FIA-90-6-18-2, NASA Ames Research Center, June 1990.

12. Motley, R. and W. Brooks, Statistical Prediction of Programming Errors, RADC-TR-77-175, Rome Air Development Center, Griffiss AFB, NY, May 1977.

13. Nelson, E., Management Handbook for the Estimation of Computer Programming Costs, System Development Corporation, AD-A648750, October 31, 1966.

14. Pfleeger, S., A Model of Cost and Productivity for Object-Oriented Development, Ph.D. dissertation, George Mason University, Fairfax, Virginia, March 1989.

15. Putnam, L., "A General Empirical Solution to the Macro Software Sizing and Estimating Problem," IEEE Transactions on Software Engineering, Vol. 4, 1978, pp. 345-361.

16. Quinlan, J.R., "Induction of Decision Trees," Machine Learning, Vol. 1, No. 1, 1986, pp. 81-106.

17. Shen, V., Yu, T., Thebaut, S., and L. Paulsen, "Identifying Error-Prone Software - An Empirical Study," IEEE Transactions on Software Engineering, Vol. SE-11, No. 4, April 1985, pp. 317-324.

18. Srinivasan, K. and D. Fisher, Machine Learning Approaches to Estimating Software Development Time, Technical Report CS-92-09, Vanderbilt University, 1992.

19. Walston, C. and C. Felix, "A Method of Programming Measurement and Estimation," IBM Systems Journal, Vol. 16, No. 1, 1977, pp. 54-73.

20. Wolverton, R.W., "The Cost of Developing Large-Scale Software," IEEE Transactions on Computers, June 1974, pp. 615-636.
