
The Scaling Problem in Neural Networks for Software Reliability Prediction

N. Karunanithi and Y. K. Malaiya
Computer Science Department
Colorado State University
Fort Collins, CO 80523
(303) 491-1943
Email: karunani,malaiya@cs.colostate.edu

Abstract

Recently neural networks have been applied to software reliability growth prediction. Although the predictive capability of the neural network models is better than that of some of the well-known analytic models, the scaling problem has not yet been completely addressed. With the present neural network models, it is necessary to scale the cumulative faults to a 0.0 to 1.0 range, so the user has to estimate in advance a maximum value for the total number of faults to be detected by the end of the test phase. In practice, such an estimate may not be accurate, and use of an inaccurate value for scaling the cumulative faults can severely affect the predictive capability of neural network models. This paper presents a solution to the scaling problem in terms of a clipped linear unit in the output layer. With a clipped linear output unit, the neural network can predict positive values in any unbounded range. We demonstrate the applicability of the proposed network structure with three data sets and compare its predictive accuracy with that of our earlier models. Expressions for the failure rate process represented by the models of the proposed network structure are also derived.

Keywords: Software reliability growth prediction, Neural network models, Linear output unit, Failure rate process.

1 Introduction

Recently Karunanithi et al. [5, 6, 7] demonstrated the applicability of neural networks for software reliability growth prediction. The neural network approach differs from the traditional analytic models in that the underlying failure process is learned from the failure history of a software system rather than based on a priori assumptions. Since the structure of the network is constructed from the failure history supplied by the user, the neural network models exhibit better adaptability than analytic models. However, in order to apply the neural network models recommended in [6, 7], it is necessary to scale the accumulated faults (the output variable) to a 0.0 to 1.0 range. This means that the user has to estimate in advance a maximum value for the total number of faults to be detected by the end of the test phase. In practice such an estimate may not be accurate, and use of an inaccurate value for scaling the cumulative faults can severely affect the predictive capability of the neural network models. In this paper we propose a solution to the scaling problem in terms of a clipped linear unit in the output layer. With a clipped linear output unit the neural network can predict positive values in any unbounded range. We demonstrate the applicability of the proposed network structure with three data sets and compare its predictive accuracy with that of our earlier models as well as with a few well-known analytic models. We also derive expressions for the failure rate process corresponding to the models developed by the proposed network structure.

2 Overview of Neural Network Models

In our previous research [6, 7] we studied the following neural network models: 1) feed-forward networks (FFN) and 2) Jordan's [4] semi-recurrent networks (JN) with "Teacher Forced" learning. Both of these networks were constructed and trained using Fahlman's [3] Cascade-correlation learning algorithm.

Features Compared      Analytic Models               Neural Network Models
--------------------------------------------------------------------------------
Type of structure      Analytic expressions          Network structures (can be
                                                     expressed mathematically)
Mapping function       f : t_k -> μ_k                FFN: f : t_k -> μ_k
for μ(t)                                             JN:  f : {t_k, μ_{k-1}} -> μ_k
Parameters adjusted    Model parameters (β_0, β_1)   Network weights
Parameter estimation   (Non)linear regression,       Learning algorithms
techniques             maximum likelihood, etc.      (Cascade-Correlation,
                                                     Back-Propagation, etc.)
Model selection        Manual, selected a priori     No selection, evolved
                                                     during training
Model complexity       Remains fixed                 May vary dynamically
                       (2 or 3 parameters)           (number of weights)

Table 1: Analytic Models Versus Neural Network Models

The Cascade-correlation algorithm is a constructive algorithm which starts with no hidden units and builds a suitable network by adding one hidden unit at a time until the learning is successfully completed. (For more details on neural networks and their implementation refer to [6, 7].) These neural network models are shown in Figure 1. A comparison between analytic models and the neural network models is shown in Table 1. The neural network models shown in Figure 1 have three types of layers. The processing units in the "input layer" receive inputs from the external world (i.e., cumulative execution time) and directly copy them to their output terminals. The "hidden layer" has one or more units which receive weighted inputs from all the input units and all preexisting hidden units. The power of a multilayer neural network lies in its hidden units; the hidden units are capable of transforming an external input into internal representations of the network. The "output layer" unit receives weighted inputs from the input units and all hidden units and maps them into a cumulative fault count. The major difference between the feed-forward network model and the Jordan network model is that the latter has an additional input unit for the cumulative faults, whereas the feed-forward network does not have this feedback input. During training, at each time step t_i this additional input unit is fed with the actual cumulative faults observed at time step t_{i-1}. When the network is used for prediction, this additional input unit does not receive any external input; rather, the output of the

network at the previous time step is directly fed back as input, as the sketch below illustrates.
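For concreteness, the following minimal sketch shows how a trained Jordan network would be driven in the two modes. The forward-pass interface and variable names here are our own illustration under stated assumptions, not the implementation used in [6, 7].

    import numpy as np

    def teacher_forced_pairs(times, observed_mu):
        # Training mode: at each step t_i the extra input unit I2 receives the
        # cumulative faults actually observed at t_{i-1}; mu is taken as 0 at t_0.
        prev = np.concatenate(([0.0], np.asarray(observed_mu, dtype=float)[:-1]))
        return list(zip(times, prev))

    def predict_with_feedback(forward, future_times, last_observed_mu):
        # Prediction mode: I2 no longer receives external input; the network's
        # own previous output is fed back instead.
        preds, prev = [], float(last_observed_mu)
        for t in future_times:
            mu = forward(t, prev)  # hypothetical forward pass with inputs (I1=t, I2=prev)
            preds.append(mu)
            prev = mu              # feedback link from the output unit
        return preds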

3 The Proposed Method

Typically, neural networks are constructed using the same type of processing unit in both the output layer and the hidden layer(s). The structure and the input/output response of processing units are shown in Figure 2. Figure 2a shows how sum is computed as a weighted sum of inputs. The processing unit transfers this sum into an output value by applying its activation function. The output layer of the neural network models examined in our previous research [5, 6, 7] was constructed using one of the most commonly used processing units, known as the sigmoidal unit, with a logistic activation function. The output of a sigmoidal unit is given by

    Output = 1 / (1 + e^(-sum))                                    (1)

where sum is the weighted input from all units in the hidden layer, the input layer and a bias unit. (The bias unit is a special input unit whose output always remains constant at 1.0; it acts as an additional input to all hidden units and the output unit of the network.) Figure 2b shows the activation function of a sigmoidal unit. Note that the output of the sigmoidal unit is bounded between 0 and 1. When we use a sigmoidal unit in the output layer, the cumulative faults need to be properly scaled down to this range by using a known maximum value.

[Figure 1 omitted.] Figure 1: A typical feed-forward network and a Jordan network developed by the Cascade-correlation algorithm. Note that the Jordan network has an additional input unit (I2) for the cumulative fault observed (or predicted) at the previous time step. This additional input unit acts as an external input during training and as a feedback link from the output unit during prediction.

[Figure 2 omitted; panels: (a) a typical processing unit computing sum = w_0 x_0 + ... + w_n x_n, (b) a logistic function, (c) a linear function.] Figure 2: A typical processing unit used in the output and hidden layers of a neural network. The activation function of the most commonly used sigmoidal unit is shown in Figure 2b. The proposed clipped linear function is shown in Figure 2c.

In practice there is no a priori maximum value for the cumulative faults that could reasonably be set. Figure 2c represents a possible solution to this problem. Instead of using a sigmoidal unit in the output layer, one can use a linear unit with a clipped linear function. The output of a linear unit is given by

    Output = { sum,  if sum > 0
             { 0,    otherwise                                     (2)

By replacing the sigmoidal output unit with a linear unit, the network can produce any positive value as its output. However, one difficulty with this clipped linear function is that it does not have a continuous derivative for sum <= 0. The derivative must be continuous in order for the learning algorithm to work correctly.

A simple alternative is to replace the clipped linear function with a function of the form 1/(a - b·sum) for sum <= 0. This forces the output of the linear unit to drop to zero very quickly when sum < 0. This type of function provides a well-behaved, non-zero derivative over all parts of the activation function while adding only a negligible value to the output around sum = 0. The actual activation function examined in this paper is given by

    Output = { sum,             if sum > 0
             { 1/(a - b·sum),   otherwise                          (3)

where a and b are constants. The derivative of this modified function is given by

    dOutput/dsum = { 1,                   if sum > 0
                   { b/(a - b·sum)^2,     otherwise                (4)

By setting b = a^2, the above derivative reduces to 1.0 at sum = 0. However, it is necessary to select a reasonable value for a so that the Output (eqn. 3) rapidly drops to zero when sum < 0. In our experiments we set a = 1.0x10^3 and b = 1.0x10^6; note that the function then has a derivative of 1.0 at sum = 0. The sketch below illustrates these activation functions.
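As a concrete illustration, the two output-unit activations discussed above can be written as follows. This is a minimal sketch using the constants from our experiments, not a full network implementation.

    import numpy as np

    # a = 1.0e3 and b = 1.0e6 as in our experiments; since b = a**2 the
    # derivative of the modified function equals 1.0 at sum = 0.
    A, B = 1.0e3, 1.0e6

    def logistic(s):
        # Sigmoidal unit (eqn. 1): output bounded in (0, 1), hence the
        # need to prescale the cumulative faults.
        return 1.0 / (1.0 + np.exp(-s))

    def clipped_linear(s):
        # Modified clipped linear unit (eqn. 3): unbounded above for s > 0,
        # and 1/(a - b*s), which decays rapidly toward zero, for s <= 0.
        s = np.asarray(s, dtype=float)
        return np.where(s > 0, s, 1.0 / (A - B * s))

    def clipped_linear_deriv(s):
        # Derivative (eqn. 4): 1 for s > 0 and b/(a - b*s)^2 otherwise;
        # the two branches meet at 1.0 when s = 0 because b = a**2.
        s = np.asarray(s, dtype=float)
        return np.where(s > 0, 1.0, B / (A - B * s) ** 2)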

The predictive capability of this new neural network structure is evaluated in the next section.

4 Prediction Results

The predictive capabilities of the neural network models with a clipped linear output unit and of the networks with a sigmoidal output unit are compared using three data sets: Data1 [12], Data2 [13] and Data3 [1]. To further evaluate the predictive accuracy of the proposed neural network models we include results from five well-known analytic models: the Logarithmic model proposed by Musa and Okumoto [12], the Inverse-Polynomial model of Littlewood and Verrall [8], the Power model of Crow [2], the Delayed S-shaped model of Yamada et al. [15] and the Exponential model [11]. All of these analytic models are two-parameter models. We trained the neural networks with execution time as the input and the observed cumulative faults as the target output. The size of the training set was varied from a minimum consisting of the data points that cover the initial 20% of the total execution time up to a set with all but the last point. Thus the training set at time t_i consists of the complete failure history of the system since the beginning of testing (t_0). At the end of training we fed the network a future value of the cumulative execution time and measured its prediction of the cumulative faults. To compare the competing models we used a simple predictability measure proposed by Malaiya et al. [9]. A brief overview of this measure is as follows. Let the data be grouped into m points (t_i, μ_i), i = 1 to m, where μ_i is the cumulative number of faults found at the end of the i-th test session and t_i is the accumulated execution time. Let μ^p_k(n) be the predicted total number of faults for a future test session t_n, predicted at t_k, where t_k < t_n <= t_m. Using the predicted value at each point t_k, we can calculate the model's prediction error (μ^p_k(n) - μ_n)/μ_n for k = 1 to m-1. Then the average prediction error measure is given by

    Aerror = (1/(n-1)) · Σ_{k=1}^{n-1} |μ^p_k(n) - μ_n| / μ_n

Aerror provides a summary of how well a model predicted throughout the test phase. Two extreme horizons at which we compare these competing models are: n = k + 1, the next-step prediction; and n = m, the end-point prediction. (A short computational sketch of Aerror is given at the end of this section.)

The results reported for the neural network models are averaged over 50 experiments. A summary of prediction results in terms of Aerror is given in Table 2. In this table EPP represents the end-point prediction error while NSP denotes the next-step prediction error. To distinguish the different neural network models the following legends are used: FFNS for the feed-forward network with a sigmoidal output unit, FFNL for the feed-forward network with a linear output unit, JNS for the Jordan network with a sigmoidal output unit and JNL for the Jordan network with a linear output unit. For clarity's sake, let us call the neural network models with a linear output unit the "proposed models" and the neural network models with sigmoidal output units the "earlier models". First we compare these models in terms of their EPP. The proposed models exhibit lower predictive accuracy than our earlier neural network models. However, among the proposed models the JNL model has better accuracy than all the other models in Data2 and Data3, while its accuracy is comparable to that of some of the analytic models in Data1. In Data1, the FFNL model lost its predictive accuracy significantly with the introduction of the linear output unit, but its performance is marginally better than the analytic models in Data2 and most of the analytic models in Data3. Thus among the proposed models, the JNL model is a better end-point predictor than the FFNL model. This result is similar to our observation in [6]. When we compare the next-step prediction accuracy of the proposed models with the other competing models, the results are less conclusive. Among the proposed models, the JNL model is slightly more accurate than the FFNL model. The proposed JNL model exhibits prediction accuracy competitive with the analytic models in Data3 and lower accuracy than the analytic models in the remaining two data sets. However, the analytic models are more accurate in next-step prediction than the proposed FFNL model in all three data sets.
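For concreteness, the Aerror measure defined above can be computed with a short routine of the following form. This is a minimal sketch; the array layout is an assumption.

    import numpy as np

    def average_prediction_error(predictions, mu_n):
        # Sketch of the Aerror measure of Malaiya et al. [9]: predictions[k]
        # holds the prediction of mu_n made from the data available at t_k,
        # and mu_n is the value actually observed. Aerror is the mean
        # magnitude of the relative prediction error over all k.
        predictions = np.asarray(predictions, dtype=float)
        return np.mean(np.abs(predictions - mu_n) / mu_n)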

5 Discussion


One main advantage of the Cascade-correlation algorithm is that the complexity of the network can be easily controlled by adjusting a parameter called "ErrorIndexThreshold". This is a unit-free parameter given by the rms of the sum squared residual error of the training set normalized by the standard deviation of the training outputs.

Models             Data1            Data2            Data3
                   EPP     NSP      EPP     NSP      EPP     NSP
Logarithmic        3.83    1.33     16.15   5.93     15.59   2.66
Inv. Polynomial    3.98    1.39     19.15   5.81     15.68   2.68
Exponential        8.57    1.33     20.99   6.01     23.45   2.97
Power              3.94    1.45     17.97   5.56     9.25    2.59
Delayed S-shape    10.99   1.58     20.87   6.25     30.59   4.53
FFNS               3.52    2.45     5.69    3.95     5.28    5.40
JNS                3.11    3.79     4.24    1.49     8.84    4.33
FFNL               20.11   3.36     16.18   7.36     15.13   5.01
JNL                5.41    2.66     9.27    6.76     3.68    2.87

Table 2: A Summary of Prediction Results for Data1, Data2 and Data3.

This parameter represents the amount of residual error that is tolerated during the training phase. By controlling this value one can constrain the number of hidden units added to the network. Though a value of 0.20 is quite useful in a variety of applications, one can experiment with it and settle on a value that is most useful for the particular application at hand. In our experiments, we examined four values (0.05, 0.10, 0.15 and 0.20) and settled on 0.10 for Data1 and 0.20 for Data2 and Data3. One heuristic that guided us to these values is that the complexity of the network should be neither too low nor very high. Thus the results in Table 2 are for the FFNL model with 1 or 2 hidden units and the JNL model with 0 or 1 hidden unit. Once training is complete, the structure of the network can be represented as a mathematical expression for the cumulative faults μ(t_i). The ensuing discussion is for the neural network models with a linear output unit; details on the models with a sigmoidal output unit can be found in [6]. Assume that we are interested in μ(t_i) for sum > 0 (see Figure 2a for details on sum), and let μ(0) = 0. Thus the model developed by the feed-forward network with no hidden unit is

    μ(t_i) = w_0 + w_1 t_i                                         (5)

where w_0 is the weight feeding the output unit from the bias unit and w_1 is the weight from the input unit. This network is equivalent to a linear model. As expected, this neural network model did not predict well. For the Jordan network with no hidden unit, the

corresponding expression is

    μ(t_i) = w_0 + w_1 t_i + w_2 μ(t_{i-1})                        (6)

where μ(t_{i-1}) is the output of the additional input unit that receives the cumulative faults of the previous time step as its input. This simple model exhibited better predictive accuracy than more complex Jordan network models. The model has both a time component and an autoregressive component. By extending the above equations to networks with one hidden unit we obtain more complex expressions. The feed-forward network with one hidden unit can be expressed as

    μ(t_i) = w_0 + w_1 t_i + w_2 h_1(t_i)                          (7)

where w_2 is the weight from the hidden unit to the output unit. The activation of the hidden unit h_1(t_i) is a sigmoidal expression. By adding a sigmoidal hidden unit to the network we introduce a non-linear component into the model. Thus the network implements a linear relation in the beginning and adds a non-linear component if the training set is not learnable with the linear relation alone. Similarly, for the Jordan network with one hidden unit, the resulting model is

    μ(t_i) = w_0 + w_1 t_i + w_2 μ(t_{i-1}) + w_3 h_1(t_i)         (8)

where h_1(t_i) is a sigmoidal expression computed using a weighted sum of t_i and μ(t_{i-1}). This equation represents an autoregressive model that has both a linear and a non-linear regressive component. By extending the above expressions we can write model equations for networks with 2 or more hidden units.
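As an illustration, equation (8), which subsumes equations (5) through (7) as special cases when terms are dropped, can be evaluated directly from a trained network's weights. The weight names and dictionary layout below are our own, and the hidden-unit bias weight follows the bias-unit convention described earlier; this is a sketch, not our training implementation.

    import numpy as np

    def logistic(s):
        return 1.0 / (1.0 + np.exp(-s))

    def jnl_mu(t, mu_prev, w):
        # Evaluate eqn. (8) for the Jordan network with one hidden unit.
        # w11 and w12 feed the hidden unit from the time and feedback
        # inputs; bh is the bias unit's weight into the hidden unit.
        h1 = logistic(w["w11"] * t + w["w12"] * mu_prev + w["bh"])
        return w["w0"] + w["w1"] * t + w["w2"] * mu_prev + w["w3"] * h1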

Another major advantage is that the neural network approach is a "black box" approach. If one is interested in predicting the MTTF or the time between failures using a time-series model, then the above network structures can easily be trained to learn a time-series process. Alternatively, the failure rate λ(t_i) can be derived from the above expressions by taking the derivative with respect to time t_i, as shown below. For the feed-forward network with no hidden unit, the resulting process is a constant failure rate process of the form

    λ(t_i) = μ'(t_i) = dμ(t_i)/dt_i = w_1                          (9)

For the feed-forward network with one hidden unit, the failure rate expression is given by

    λ(t_i) = w_1 + w_2 h'(t_i)                                     (10)

where

    h'(t_i) = dh(t_i)/dt_i = (∂h(t_i)/∂sum) · (∂sum/∂t_i)

For the sigmoidal unit the above derivative reduces to

    h'(t_i) = h(t_i)(1 - h(t_i)) w_11                              (11)

Here w_11 is the weight feeding the hidden unit from the input unit, and w_1 and w_2 are the weights feeding the output unit. By substituting equation (11) into equation (10) we get the failure rate function

    λ(t_i) = w_1 + w_2 (h(t_i)(1 - h(t_i)) w_11)                   (12)

Similarly, for the Jordan network we can express the failure rate equations as follows. For the network with no hidden unit,

    λ(t_i) = w_1 + w_2 μ'(t_{i-1})                                 (13)

where μ'(t_i) = 0 at t_i = 0. Since μ'(t_{i-1}) = λ(t_{i-1}), we can rewrite the above expression as

    λ(t_i) = w_1 + w_2 λ(t_{i-1})                                  (14)

This is a first-order autoregressive process with a random coefficient w_2, which is analogous to the model proposed by Singpurwalla et al. [14]. Unlike in a true first-order autoregressive model, we need not make any assumption about the structure of the coefficient term w_2. The failure rate process for the Jordan network with one hidden unit is

    λ(t_i) = w_1 + w_2 μ'(t_{i-1}) + w_3 h'(t_i)                   (15)

where h'(t_i) = h(t_i)(1 - h(t_i))(w_11 + w_12 μ'(t_{i-1})), and w_11 and w_12 are the weights feeding the hidden unit from the input units. Since μ'(t_{i-1}) = λ(t_{i-1}), we can rewrite this expression in terms of the failure rate process as

    λ(t_i) = w_1 + w_2 λ(t_{i-1}) + w_3 h'(t_i)
           = w_1 + w_3 w_11 (h(t_i)(1 - h(t_i)))
             + (w_2 + w_3 w_12 (h(t_i)(1 - h(t_i)))) λ(t_{i-1})    (16)

This autoregressive process has both a linear and a non-linear autoregressive component. Similar expressions for both the feed-forward and the Jordan networks with 2 or more hidden units can be derived by extending the above equations. Thus the feed-forward network is capable of modeling any reliability growth process that can be characterized by a non-linear function, while the Jordan network with teacher-forced learning is capable of modeling any non-linear first-order autoregressive process.
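A sketch of the failure rate recursions of equations (14) and (16) follows; again the weight container and the hidden-unit bias weight are illustrative assumptions rather than our implementation.

    import numpy as np

    def logistic(s):
        return 1.0 / (1.0 + np.exp(-s))

    def lam_jn_no_hidden(lam_prev, w1, w2):
        # Eqn. (14): a first-order autoregressive failure rate process.
        return w1 + w2 * lam_prev

    def lam_jn_one_hidden(t, lam_prev, mu_prev, w):
        # Eqn. (16) for the Jordan network with one hidden unit. h(t_i) is
        # the hidden-unit activation, which depends on t_i and mu(t_{i-1});
        # bh is the assumed bias weight into the hidden unit.
        h = logistic(w["w11"] * t + w["w12"] * mu_prev + w["bh"])
        g = h * (1.0 - h)  # derivative factor of the sigmoidal unit
        return (w["w1"] + w["w3"] * w["w11"] * g
                + (w["w2"] + w["w3"] * w["w12"] * g) * lam_prev)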

6 Conclusion

We demonstrated the applicability of neural network models with a clipped linear output unit. This modified structure paves the way for applying neural networks without prescaling the data. Among the proposed models, the Jordan network model exhibits better accuracy than the feed-forward model. We developed expressions for the network structures and showed how the underlying failure rate process λ(t_i) can be derived. These expressions can be used for further analysis of the models developed by the neural network approach. In our research we restricted the applicability of the neural networks to modeling only the cumulative faults. However, this need not be the case. Since the neural network approach is a "black box" approach, models that incorporate additional knowledge, such as program complexity, can easily be developed by adding more input units. This approach is currently being investigated.

Acknowledgements

This work was partially supported by a project funded by SDIO/IST and monitored by ONR.

References

[1] B. M. Anna-Mary, "A Study of the Musa Reliability Model", M.S. Thesis, CS Dept., Univ. of Maryland, 1980.

[2] L. H. Crow, "Reliability analysis for complex repairable systems", Reliability and Biometry, SIAM, Philadelphia, 1974, pp. 379-410.

[3] S. E. Fahlman and C. Lebiere, "The Cascaded-Correlation Learning Architecture", School of Computer Science, Carnegie Mellon University, Pittsburgh, Tech. Rep. CMU-CS-90-100, Feb. 1990.

[4] M. I. Jordan, "Attractor Dynamics and Parallelism in a Connectionist Sequential Machine", Proc. of the 8th Annual Conf. of the Cognitive Science Society, 1986, pp. 531-546.

[5] N. Karunanithi, Y. K. Malaiya, and D. Whitley, "Prediction of Software Reliability Using Neural Networks", Proc. 1991 IEEE Int. Symp. on Software Reliability Engineering, May 1991, pp. 124-130.

[6] N. Karunanithi, D. Whitley, and Y. K. Malaiya, "Prediction of Software Reliability Using Connectionist Models", IEEE Trans. Software Eng., Vol. 18, No. 7, pp. 563-574, July 1992.

[7] N. Karunanithi, D. Whitley, and Y. K. Malaiya, "Using Neural Networks in Reliability Prediction", IEEE Software, Vol. 9, No. 4, pp. 53-59, July 1992.

[8] B. Littlewood and J. L. Verrall, "A Bayesian reliability model with a stochastically monotone failure rate", IEEE Trans. Reliability, Vol. R-23, No. 2, pp. 108-114, 1974.

[9] Y. K. Malaiya, N. Karunanithi, and P. Verma, "Predictability of Software Reliability Models", IEEE Trans. Reliability, to appear, Dec. 1992.

[10] Y. K. Malaiya and P. K. Srimani, Eds., Software Reliability Models: Theoretical Developments, Evaluation and Applications, IEEE Computer Society Press, 1990.

[11] P. B. Moranda, "Predictions of Software Reliability during debugging", Proc. of Annual Reliability and Maintainability Symp., Washington, DC, 1975, pp. 327-332.

[12] J. D. Musa, A. Iannino, and K. Okumoto, Software Reliability - Measurement, Prediction, Applications, McGraw-Hill, 1987.

[13] M. Ohba, "Software reliability analysis models", IBM J. Res. Development, Vol. 28, No. 4, pp. 428-443, July 1984.

[14] N. D. Singpurwalla and R. Soyer, "Assessing (Software) Reliability Growth Using a Random Coefficient Autoregressive Process and Its Ramifications", IEEE Trans. Software Eng., Vol. SE-11, No. 12, pp. 1456-1464, Dec. 1985.

[15] S. Yamada, M. Ohba, and S. Osaki, "S-Shaped Reliability Growth Modeling for Software Error Detection", IEEE Trans. Reliability, Vol. R-32, pp. 475-478, Dec. 1983.
