Use of Methodological Diversity to Improve Neural Network Generalisation

W.B. Yates and D. Partridge
Department of Computer Science, University of Exeter, Exeter EX4 4PT

August 18, 1995

Abstract

Littlewood and Miller [1989] present a statistical framework for dealing with coincident failures in multiversion software systems. They develop a theoretical model that holds the promise of high system reliability through the use of multiple, diverse sets of alternative versions. In this paper we adapt their framework to investigate the feasibility of exploiting the diversity observable in multiple populations of neural networks developed using diverse methodologies. We evaluate the generalisation improvements achieved by a range of methodologically diverse network generation processes. We attempt to order the constituent methodological features with respect to their potential for use in the engineering of useful diversity. We also define and explore the use of relative measures of the diversity between version sets as a guide to the potential for exploiting inter-set diversity.

Keywords: Multilayer Perceptrons, Backpropagation, Radial Basis Function Networks, Generalisation, Multiversion Systems, Software Reliability.


1 Introduction

This paper further develops earlier results that used two strategies (majority vote and selector network) to obtain a better generalisation performance from a set of versions as a whole than is exhibited by any individual version in the set (see Partridge and Griffith [1994]). The basis for the demonstrated improvement was the presence of "useful diversity" in a set of alternative networks, each trained to perform a given target function. Such a version set exhibits useful diversity when test failures are limited to a (minority) subset of the complete version set. For example, if all test pattern failures occur in 4 (or fewer) out of 10 version networks in a set, then a simple majority vote approach will always give the correct result, i.e., 100% generalisation performance for the version set as a whole, despite the fact that no individual network operates at this level of performance. Several strategies for engineering diversity were explored: varying weights, architectures, and training sets, as well as selection of versions according to a minimum coincident failure heuristic.

As Littlewood and Miller [1989] note on page 1596, "Although diversity of the products (versions) will often be the ultimate goal, it seems clear that the achieved level will depend on the diversity of the process (software development methodology)". Accordingly, they extend earlier models of multiversion software to encompass the possibility of diverse development methodologies being used to produce the versions of a multiversion system. We explore the possibility of improving the generalisation performance of a neural computing system by exploiting methodological diversity.

In this paper, we begin in section 2 by reviewing the statistical framework developed in Partridge and Griffith [1994] (based on the work of Littlewood and Miller [1989]) and extend it to accommodate multiple methodologies. In section 3 we introduce two neural computing "methods": Multilayer Perceptron networks trained with Backpropagation (see Hornik et al. [1989] and Rumelhart et al. [1986]), and Radial Basis Function networks (see Moody and Darken [1988] and Holden and Rayner [1993]), and specify a number of network versions. In section 4, we use these networks to construct two simple multiversion systems and demonstrate the benefits of employing methodological diversity. Finally, in section 5 we present our conclusions.

2 The Statistical Model

The earlier paper developed statistical measures for computing the probability of correctness (i.e., generalisation performance) within a single diverse set of versions. In order to properly assess the impact of methodological diversity between sets of versions, we require inter-set statistics that can be directly compared with the intraset ones (developed in Partridge and Griffith [1994]).


2.1 Intraset Measures

1. The probability that a randomly selected version is incorrect on a randomly selected input, prob(1 fails):

$$\mathrm{prob}(1\text{ fails}) = \sum_{n=1}^{N}\mathrm{prob}(\text{chosen net fails}\mid\text{exactly }n\text{ fail})\cdot\mathrm{prob}(\text{exactly }n\text{ fail}) = \sum_{n=1}^{N}\frac{n}{N}\,p_n$$

where p_n is the probability that exactly n versions in a set will fail, i.e., the proportion of tests that failed on precisely n versions.

2. The probability that two randomly selected versions are both incorrect on a randomly selected input, prob(2 both fail):

$$\mathrm{prob}(2\text{ both fail}) = \sum_{n=2}^{N}\mathrm{prob}(\text{chosen nets both fail}\mid\text{exactly }n\text{ fail})\cdot\mathrm{prob}(\text{exactly }n\text{ fail}) = \sum_{n=2}^{N}\frac{n(n-1)}{N(N-1)}\,p_n$$

3. A measure of the diversity within a set of versions, Generalisation Diversity (GD):

$$GD = 1 - \frac{\mathrm{prob}(2\text{ both fail})}{\mathrm{prob}(1\text{ fails})}$$

This quantity is unity when maximum generalisation difference is actually achieved (i.e., prob(2 both fail) = 0), and is zero when generalisation difference is completely absent (i.e., prob(1 fails) = prob(2 both fail)).
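As a concrete illustration (not from the paper), all three intraset statistics can be estimated directly from the observed failure distribution. The following Python sketch assumes `p` is an array in which `p[n]` is the proportion of test patterns that failed on exactly n of the N versions; the function name and example data are hypothetical:

```python
import numpy as np

def intraset_measures(p):
    """Intraset statistics from p[n], the estimated probability that a test
    pattern fails on exactly n of the N versions (n = 0..N)."""
    N = len(p) - 1
    n = np.arange(N + 1)
    p1 = float(np.sum(n / N * p))                        # prob(1 fails)
    p2 = float(np.sum(n * (n - 1) / (N * (N - 1)) * p))  # prob(2 both fail)
    gd = 1.0 - p2 / p1                                   # generalisation diversity
    return p1, p2, gd

# Example: 10 versions where most test patterns fail on no version at all
p = np.array([0.90, 0.04, 0.03, 0.02, 0.01] + [0.0] * 6)
print(intraset_measures(p))
```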

2.2 Two-level statistics: inter-set measures

In the previous study the effectiveness of a single majority vote statistic within single sets of versions was explored. The supposed benefit of employing methodologically diverse sets is to minimize coincident failure by increasing diversity through the use of diverse methodologies and by selecting versions one from each of the different version sets. A simple multi-set strategy is to use three version sets and to take the majority result from three versions selected at random, one from each set. Let p(maj3)_ABC be the probability of a majority correct result from three versions, one randomly chosen from each of the sets A, B, and C. The following statistic defines the probability that a majority of 3 versions from a single set are correct. This statistic will provide a direct comparison with p(maj3)_ABC.


The probability that a majority out of 3 randomly chosen versions are correct is:

$$
\begin{aligned}
p(maj3) &= \mathrm{prob}((0\text{ out of }3)\text{ or }(1\text{ out of }3)\text{ fail on this input})\\
&= \sum_{n=0}^{N}\mathrm{prob}(\text{exactly }n\text{ versions fail on this input and either 0 or 1 of chosen versions fail})\\
&= \sum_{n=0}^{N}\mathrm{prob}(\text{either 0 or 1 of chosen versions fail}\mid\text{exactly }n\text{ fail})\cdot\mathrm{prob}(\text{exactly }n\text{ fail})\\
&= \sum_{n=0}^{N}\mathrm{prob}(\text{none of chosen versions fail}\mid\text{exactly }n\text{ fail})\cdot p_n + \sum_{n=1}^{N}\mathrm{prob}(\text{exactly 1 chosen version fails}\mid\text{exactly }n\text{ fail})\cdot p_n\\
&= \sum_{n=0}^{N}\frac{(N-n)(N-n-1)(N-n-2)}{N(N-1)(N-2)}\,p_n + 3\sum_{n=1}^{N}\frac{n(N-n)(N-n-1)}{N(N-1)(N-2)}\,p_n
\end{aligned}
$$
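A minimal Python sketch of this statistic, under the same assumptions as the previous snippet (`p[n]` estimated from test data; function name hypothetical; N is assumed to be at least 3):

```python
def p_maj3(p):
    """prob that a majority of 3 versions, drawn at random without
    replacement from one set of N versions, is correct; p[n] as above."""
    N = len(p) - 1
    total = 0.0
    for n in range(N + 1):
        # no chosen version fails, or exactly one of the three fails
        none_fail = (N - n) * (N - n - 1) * (N - n - 2) / (N * (N - 1) * (N - 2))
        one_fails = 3 * n * (N - n) * (N - n - 1) / (N * (N - 1) * (N - 2))
        total += (none_fail + one_fails) * p[n]
    return total
```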

In order to compute p(maj3)_ABC, consider a multi-set system composed of three disjoint sets, A, B and C, containing N_A, N_B and N_C versions, respectively; then p_{n_A n_B n_C} is the probability that exactly n_A, n_B and n_C versions from each set fail.¹

$$
\begin{aligned}
p(maj3)_{ABC} &= \sum_{n_A n_B n_C=0}^{N_A N_B N_C}\frac{(N_A-n_A)}{N_A}\cdot\frac{(N_B-n_B)}{N_B}\cdot\frac{(N_C-n_C)}{N_C}\,p_{n_A n_B n_C}\\
&\quad+ \sum_{n_A=1;\,n_B n_C=0}^{N_A N_B N_C}\frac{n_A}{N_A}\cdot\frac{(N_B-n_B)}{N_B}\cdot\frac{(N_C-n_C)}{N_C}\,p_{n_A n_B n_C}\\
&\quad+ \sum_{n_B=1;\,n_A n_C=0}^{N_A N_B N_C}\frac{(N_A-n_A)}{N_A}\cdot\frac{n_B}{N_B}\cdot\frac{(N_C-n_C)}{N_C}\,p_{n_A n_B n_C}\\
&\quad+ \sum_{n_C=1;\,n_A n_B=0}^{N_A N_B N_C}\frac{(N_A-n_A)}{N_A}\cdot\frac{(N_B-n_B)}{N_B}\cdot\frac{n_C}{N_C}\,p_{n_A n_B n_C}
\end{aligned}
$$
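The three-set statistic can be computed the same way from the joint failure distribution. A hedged Python sketch, assuming `p` is a 3-dimensional array with `p[na, nb, nc]` estimated from a common test run (names hypothetical):

```python
def p_maj3_abc(p):
    """prob of a correct majority from three versions, one drawn at random
    from each of sets A, B, C. p[na, nb, nc] is the estimated probability
    that exactly na, nb, nc versions of A, B, C fail on a random input."""
    NA, NB, NC = (s - 1 for s in p.shape)
    total = 0.0
    for na in range(NA + 1):
        for nb in range(NB + 1):
            for nc in range(NC + 1):
                qa, qb, qc = (NA - na) / NA, (NB - nb) / NB, (NC - nc) / NC
                fa, fb, fc = na / NA, nb / NB, nc / NC
                # no chosen version fails, or exactly one of the three fails
                ok = qa*qb*qc + fa*qb*qc + qa*fb*qc + qa*qb*fc
                total += ok * p[na, nb, nc]
    return total
```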

If we wish to exploit diversity both within and between version sets, then we require "majority of majorities" statistics. Let p(majmaj) be the probability of a correct majority outcome from three separate majorities, one each from sets A, B and C.

¹ To increase readability we condense triple summations such as $\sum_{n_A=1}^{N_A}\sum_{n_B=0}^{N_B}\sum_{n_C=0}^{N_C}$ to $\sum_{n_A=1;\,n_B n_C=0}^{N_A N_B N_C}$.

The probability that a majority is correct when exactly n versions fail, p(maj)_{n_X} (for set X), is simply 1 when fewer than a majority of versions fail, and 0 when a majority of versions fail:

$$p(maj)_{n_X} = \begin{cases} 1 & \text{if } n_X < \text{majority number} \\ 0 & \text{otherwise} \end{cases}$$

$$
\begin{aligned}
p(majmaj) &= \sum_{n_A n_B n_C=0}^{N_A N_B N_C} p(maj)_{n_A}\,p(maj)_{n_B}\,p(maj)_{n_C}\cdot p_{n_A n_B n_C}\\
&\quad+ \sum_{n_A n_B n_C=0}^{N_A N_B N_C} (1-p(maj)_{n_A})\,p(maj)_{n_B}\,p(maj)_{n_C}\cdot p_{n_A n_B n_C}\\
&\quad+ \sum_{n_A n_B n_C=0}^{N_A N_B N_C} p(maj)_{n_A}\,(1-p(maj)_{n_B})\,p(maj)_{n_C}\cdot p_{n_A n_B n_C}\\
&\quad+ \sum_{n_A n_B n_C=0}^{N_A N_B N_C} p(maj)_{n_A}\,p(maj)_{n_B}\,(1-p(maj)_{n_C})\cdot p_{n_A n_B n_C}
\end{aligned}
$$

A full derivation is given in the appendix. For comparison purposes we also need the simple majority-of-N statistic:

$$p(maj) = \mathrm{prob}(\text{at most }k\text{ fail}) = \mathrm{prob}(\text{exactly 0 fail, or exactly 1 fails, or}\ldots\text{or exactly }k\text{ fail}) = \sum_{n=0}^{k} p_n$$

where N is the number of versions in the set and k = N - (majority number).²

² As detailed in the earlier paper, this statistic requires modification if used with sets of even numbers of versions; in this paper we use only sets with odd numbers of versions.
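For completeness, a one-function Python sketch of this simple majority statistic under the same odd-set-size assumption:

```python
def p_maj(p):
    """prob that a simple majority of all N versions is correct (odd N);
    p[n] is the estimated failure distribution as before."""
    N = len(p) - 1
    k = N - (N // 2 + 1)   # at most k failures still leaves a correct majority
    return float(sum(p[: k + 1]))
```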

2.3 Inter-set Generalisation Diversity

Suppose that A and B are two randomly chosen versions:

$$
\begin{aligned}
p(AB) &= \mathrm{prob}(\text{both }A\text{ and }B\text{ fail on a randomly chosen input})\\
p(A\mid B) &= \mathrm{prob}(A\text{ fails}\mid B\text{ has failed})\\
p(A) &= \mathrm{prob}(A\text{ fails})\\
p(B) &= \mathrm{prob}(B\text{ fails})
\end{aligned}
$$


Then, from standard probability theory:

$$p(AB) = p(A\mid B)\,p(B) = p(B\mid A)\,p(A)$$

Now maximum diversity occurs when p(A|B) = 0 for any A and B, and so p(AB) = 0. Minimum diversity occurs when p(A|B) = 1, in which case p(AB) = p(A) = p(B). Since each of these latter terms represents the probability that a randomly chosen version fails on a given input, this is p(1 fails) in the earlier formulation of the GD measure. The actual probability that A and B both fail for a given population is p(AB), which is p(2 both fail) in the earlier notation.

We now denote by A a randomly chosen version from set A and by B a randomly chosen version from set B. The maximum inter-set generalisation diversity occurs when failure of one version is accompanied by non-failure of the other, and this again gives p(AB) = 0. Minimum inter-set diversity again implies that failure of one version is accompanied by failure of the other, hence p(A|B) = p(B|A) = 1. There is, however, a problem with this approach: p(A) will not in general equal p(B), and so we have two different candidates for the single-version failure probability. There are a number of possible ways to solve this problem (e.g., use a "mean" of the two values), but since the purpose of the calculation in the single-set case was to establish the upper limit of p(AB) in order to standardize GD to the range (0, 1), the obvious resolution in the two-set case is to take the larger of p(A) and p(B) as this upper limit. Thus we define

$$p_{AB}(1\text{ fails}) = \max(p(A), p(B))$$

the larger of the two probabilities that a single randomly chosen version will fail in either set. Next we need to define the actual probability, p_AB(2 both fail), that two randomly chosen versions, each from a different set, will simultaneously fail on a given input. The new statistic required is based on two sets of networks, A and B, where |A| = N_A and |B| = N_B.³ Both sets of networks are subjected to exactly the same generalisation set of test patterns. We record precisely how many of the tests fail on every combination of numbers of networks between the two sets, i.e., the number of test patterns failing on 1 network in set A and 1 network in set B, on 1 network in set A and 2 networks in set B, and so on. From the number of tests that fail on each possible combination, divided by the total number of test patterns, we can compute an estimated probability p_{n_A,n_B}, which is interpreted as the probability that exactly n_A randomly selected networks from set A and exactly n_B randomly selected networks from set B will fail on a randomly selected input. Then

$$
\begin{aligned}
p_{AB}(2\text{ both fail}) &= \sum_{n_A=1}^{N_A}\sum_{n_B=1}^{N_B}\mathrm{prob}(\text{chosen net fails}\mid\text{exactly }n_A\text{ fail})\cdot\mathrm{prob}(\text{chosen net fails}\mid\text{exactly }n_B\text{ fail})\cdot\mathrm{prob}(\text{exactly }n_A\text{ and }n_B\text{ nets fail})\\
&= \sum_{n_A=1}^{N_A}\sum_{n_B=1}^{N_B}\frac{n_A}{N_A}\cdot\frac{n_B}{N_B}\,p_{n_A,n_B}
\end{aligned}
$$

³ The cardinality of a set S is |S|.


Finally, we can define a between-set GD measure, GDB:

$$GDB_{AB} = 1 - \frac{p_{AB}(2\text{ both fail})}{p_{AB}(1\text{ fails})}$$

This is a strict between-set measure; it does not take account of the within-set diversities of the two constituent sets. It may therefore be useful to define a composite measure, GDBW, that takes into account both between- and within-set diversities:

$$GDBW_{AB} = GDB_{AB} - \frac{1}{2}(GD_A + GD_B)$$

where GD_A and GD_B are the GDs of sets A and B respectively.
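Both inter-set quantities follow mechanically from the joint two-set failure distribution. A hedged Python sketch (not from the paper): `pAB[na, nb]` is the estimated probability that exactly na nets of A and nb nets of B fail on a random input, and the within-set GDs are supplied from the earlier intraset computation:

```python
import numpy as np

def inter_set_gd(pAB, gdA, gdB):
    """Between-set diversity GDB and composite GDBW for sets A and B."""
    NA, NB = (s - 1 for s in pAB.shape)
    na = np.arange(NA + 1)[:, None]
    nb = np.arange(NB + 1)[None, :]
    pA = float(np.sum(na / NA * pAB))       # p(A): a random net of A fails
    pB = float(np.sum(nb / NB * pAB))       # p(B): a random net of B fails
    p1 = max(pA, pB)                        # p_AB(1 fails), the upper limit
    p2 = float(np.sum((na / NA) * (nb / NB) * pAB))   # p_AB(2 both fail)
    gdb = 1.0 - p2 / p1
    gdbw = gdb - 0.5 * (gdA + gdB)
    return gdb, gdbw
```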

3 Diverse Methodologies

In this section we introduce the two methodologies we shall use to synthesize our multiversion system components. Specifically, we shall employ Multilayer Perceptron networks trained using Backpropagation (see Hornik et al. [1989] and Rumelhart et al. [1986]) and Radial Basis Function networks (see Moody and Darken [1988] and Holden and Rayner [1993]). We restrict our attention to these neural computing paradigms because they are relatively simple, well documented, and widely used, and so constitute (in a practical sense) two important types of neural network technology. They should also be a source of methodological diversity. We shall use these two paradigms (introduced in sections 3.2 and 3.3) to implement, in a precise and formal manner (see section 3.1), a well defined task presented in section 3.4. From these components we shall define a space or population of distinct neural network versions which will form the basis of our experiments in section 4.

3.1 Correctness

Our neural networks compute parameterized functions of the form f : W × X^r → Y^s, where W is the network's weight space, r is the number of input units and s is the number of output units. The behaviour of a neural network is determined by the choice of weights w ∈ W. Typically, such networks are not programmed; rather, they are trained with a learning algorithm to perform a task typically encoded in a set P of examples called training patterns. A learning algorithm processes a sequence of patterns and iteratively modifies the network's weights until some termination criterion defined over the network and pattern set is satisfied. The learning algorithm is itself parameterized and must be initialized with a set of initial weights (for the network) and other control parameters. The initial weights must all be different, and are usually chosen from a small interval, such as [-0.5, 0.5], according to some random procedure. The control parameters are usually determined experimentally during the prototyping stage of the network design.

In this paper, our training pattern set is the graph of a target function g : X^r → Y^s, and our task is to implement this function. Training continues until the learning algorithm returns a set of weights w such that the neural network f(w) is an approximately correct implementation of the target function g, to some accuracy ε and some tolerance δ, under metric d and measure μ. In symbols, training continues until the correctness specification

$$\mu[\{x \in P \mid d(f(w, x), g(x)) \geq \varepsilon\}] < \delta$$

is satisfied. In general, the notion of correctness is dictated by the structure of P, X, Y, and the choice of r, s, μ and d. In this paper we shall assume that P is finite, X = Y = [0, 1], r = 5, s = 1, μ is the measure defined by

$$(\forall x \in P)\;\mu(\{x\}) = \frac{1}{|P|} \qquad (\forall E \subseteq P)\;\mu(E) = \frac{|E|}{|P|}$$

and d is the usual metric on the reals. Intuitively, our correctness specification implies that learning terminates when our neural network is ε-close to our target function on all but some fraction δ of training patterns. The network training phase is followed by a testing phase. During testing, the performance of the resulting network is evaluated on an unseen set of test patterns Q, to accuracy ε, and a value for δ is computed. In the parlance of neural computing, the value 1 - δ corresponds to the generalisation performance of the network on test set Q.
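This testing-phase computation is simple to state in code. A minimal sketch (not from the paper), assuming `f` is a trained network with its weights already bound, `g` is the target function, and `Q` is a list of input vectors; all names are hypothetical:

```python
def generalisation(f, g, Q, eps=0.5):
    """Generalisation performance 1 - delta: the fraction of test patterns
    on which network f is eps-close to the target g under the usual metric
    on the reals."""
    failures = sum(1 for x in Q if abs(f(x) - g(x)) >= eps)
    delta = failures / len(Q)
    return 1.0 - delta
```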

3.2 Multilayer Perceptron Networks

We shall concern ourselves with a class of feedforward 2-layer neural networks based on a definition presented in Hornik et al. [1989]. They specify a 2-layer feedforward neural network by a triple (r, A, G) consisting of a number r ∈ N, a family of affine functions A, and an activation function G. Each a_j ∈ A has the form

$$a_j : \mathbb{R}^r \to \mathbb{R}, \qquad a_j(x) = \sum_{i=1}^{r} w_{ji} x_i + b_j$$

for j = 1, ..., n, where x ∈ R^r are the inputs to the network, w_{ji} ∈ R is the jth hidden unit's weight from the ith input unit, and b_j ∈ R is the jth hidden unit's bias. The class of single hidden layer feedforward neural networks with one output, denoted Σ^r(G), is the class of functions

$$\left\{ f : \mathbb{R}^r \to \mathbb{R} \;\middle|\; f(x) = \sum_{j=1}^{n} w'_j\, G(a_j(x)) \right\}$$

for n = 1, 2, ... hidden units, where w'_j ∈ R is the output unit's weight from the jth hidden unit. Our specification differs from the one found in Hornik et al. [1989] in that we allow the output unit an activation function G' and a bias b'. Thus

$$\left\{ f : \mathbb{R}^r \to \mathbb{R} \;\middle|\; f(x) = G'\!\left( \sum_{j=1}^{n} w'_j\, G(a_j(x)) + b' \right) \right\}$$

The computational properties of the neural network are determined by the choice of activation function G. In this study G is the logistic or sigmoid activation function (see Rumelhart et al. [1986], page 329), which has the form G : R → [0, 1] and is defined by

$$G(x) = \frac{1}{1 + \exp(-x)}$$

We shall train our 2-layer feedforward neural networks using an online backpropagation learning algorithm with momentum. The learning algorithm is a neural network based implementation of the steepest (gradient) descent optimization algorithm, due originally to Fermat (see Rumelhart et al. [1986]). The network structure, learning algorithm parameters, termination criterion, and other salient details used to generate the network models used here are described in more detail in Partridge and Yates [1994] and are summarized in table 1.

Name   Parameter
MLP    Input Units r = 5
       Hidden Units n = 8, 9, 10, 11, 12
       Output Units s = 1
       Activation G = sigmoid
BP     Learning Rate 0.05
       Momentum 0.5
       Iterations t0 = 20000
       Initial Weights Range [-0.5, 0.5]
LIC1   Accuracy ε = 0.5
       Tolerance δ = 0.0
       Number of Training Patterns m = 1000
       Number of Test Patterns (struc) l = 161051
       Number of Test Patterns (rand) l = 10000

Table 1: MLP experimental parameters.
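To make the network definition concrete, here is a short Python sketch (not from the paper) of one forward pass of the 2-layer MLP described above, using the variant with an output activation G' and bias b'. The dimensions and the [-0.5, 0.5] initial-weight range follow table 1; all function and variable names are hypothetical:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(x, W1, b1, w2, b2):
    """Forward pass: r inputs, n hidden sigmoid units, one sigmoid output
    with its own bias. Shapes: W1 (n, r), b1 (n,), w2 (n,), b2 scalar."""
    hidden = sigmoid(W1 @ x + b1)       # G(a_j(x)) for each hidden unit j
    return sigmoid(w2 @ hidden + b2)    # G'(sum_j w'_j G(a_j(x)) + b')

# Example with the paper's dimensions: r = 5 inputs, n = 10 hidden units
rng = np.random.default_rng(3)
r, n = 5, 10
W1 = rng.uniform(-0.5, 0.5, (n, r))
b1 = rng.uniform(-0.5, 0.5, n)
w2 = rng.uniform(-0.5, 0.5, n)
b2 = rng.uniform(-0.5, 0.5)
print(mlp_forward(np.array([0.1, 0.9, 0.2, 0.4, 0.3]), W1, b1, w2, b2))
```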

3.3 Radial Basis Function Networks

We shall concern ourselves with the class of radial basis function networks presented in Moody and Darken [1988]. This network is a specific example of the wider class of Generalized Single Layer Networks described in Holden and Rayner [1993]. A radial basis function network (or RBF) consists of a set Φ of n - 1 radial basis functions (or RBFs)

$$\Phi = \{\phi_i : \mathbb{R}^r \to \mathbb{R} \mid i = 1, \ldots, n-1\}$$

and a bias unit, φ_n(x) = 1, where typically n > r. Each RBF receives r input signals from the external environment and is connected to a single output unit which computes a function f : R^n → R defined by

$$f(x) = \frac{\sum_{i=1}^{n} \phi_i(x)\,w_i}{\sum_{i=1}^{n} \phi_i(x)}$$

where x ∈ R^r is an input pattern and w ∈ R^n are the weights from the RBFs to the output unit. The computational properties of the network are determined by the choice of radial basis functions. In this paper we shall employ Gaussian response functions. These functions have the form

$$\phi_i(x) = \exp\!\left( -\frac{\|x - c_i\|^2}{(d_i)^2} \right)$$

for i = 1, ..., n-1, where c_i ∈ R^r is the center and d_i ∈ R is the diameter of unit i's receptive field. The function ||·|| is the Euclidean norm, defined by

$$\|x - c_i\| = \sqrt{(x_1 - c_{i,1})^2 + (x_2 - c_{i,2})^2 + \cdots + (x_r - c_{i,r})^2}$$

The Gaussian function is radially symmetric; that is, the function attains a single maximum value at its center, and drops off rapidly to zero at large radii. By varying the center c_i and diameter d_i we may vary the position and shape of the function's receptive field.

Radial basis network learning proceeds in three stages. First, the r-dimensional coordinates of the Gaussian units' centers c_i ∈ R^r, for i = 1, ..., n-1, are chosen. The center c_i of a unit i determines the position in input space R^r where the unit is most sensitive. In order to position these centers effectively with respect to an arbitrary data set, we employ the adaptive k-means clustering algorithm (see figure 1 and Moody and Darken [1988]) to adapt the centers so that they lie in "interesting" parts of the input space. Next, the diameters d_i, for i = 1, ..., n-1, of the Gaussian units' receptive fields are determined according to a simple heuristic: for each radial basis unit i, set the diameter d_i to twice the root-mean-square distance between the cluster center c_i and the input points x that are clustered at unit i. Finally, the weights from the Gaussian units to the output unit are determined by the Widrow-Hoff gradient descent learning algorithm (see Widrow and Hoff [1960]). The network structure, learning algorithm parameters, termination criterion, and other salient details used to generate the network models used here are described in more detail in Yates [1994] and are summarized in table 2.

3.4 Target Function

Our task is to implement one of the 15 so-called Launch Interceptor Conditions, LIC1. Knight and Leveson [1986] specify LIC1 as a predicate that evaluates to true if: "There exists at least one set of two consecutive data points that are a distance greater than the length LENGTH1 apart. (0 ≤ LENGTH1)", where LENGTH1 is a parameter of the condition. We shall abstract a functional representation from this precise, though informal, description. The intention here is to make explicit the exact structure and semantics of the network task.



    for t = 1 to t0 do
    begin
        for each input pattern x do
        begin
            find rbf i with center ci closest to x
            update each component cij = cij + η * (xj - cij) for j = 1, ..., r
        end
        η = η - Δη
    end

Figure 1: The k-means clustering algorithm.

Let S = [0, 1] × [0, 1] represent the set of all data points, let a = (x1, y1) and b = (x2, y2) represent two consecutive data points, and note that the function

$$d(a, b) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$

is the Euclidean distance on S. Formally, we require a neural network implementation of the bounded function g : S × S × [0, 1] → B defined by

$$g(a, b, \mathrm{LENGTH1}) = \begin{cases} 1 & \text{if } d(a, b) > \mathrm{LENGTH1} \\ 0 & \text{otherwise} \end{cases}$$

In fact, as our task will ultimately be executed on some digital computer, we shall restrict our attention to the finite subset of rationals, with 6 decimal places, in S × S × [0, 1]. In this special case we note that g is also continuous. The five "random" training sets (that we shall use later) were constructed by selecting a seed i = 1, 2, ..., 5, generating 1000 random triples <a, b, LENGTH1>, and applying the task specification g. The five "rational" training sets are randomly generated boundary patterns, i.e., similar patterns that are "just" true or "just" false. Our two test sets consist of 10000 randomly generated patterns and a set of patterns constructed according to the algorithm shown in figure 2. The resulting test set (of 161051 = 11^5 patterns) ensures that our networks are tested over the whole input space and not some restricted region. In principle, we are able to generate the entire set of example patterns (recall that we have restricted our attention to rationals of 6 decimal places). In practice, however, this is (usually) not the case, and a specification of the task function is usually encoded in a (small) training set acquired during the requirements analysis phase of the network design.

Name          Parameter
RBF           Input Units r = 5
              Hidden Units n = 50, 60, 70, 80, 90
              Output Units s = 1
              Activation φ = Gaussian
Cluster       Neighborhood 0.6
              Iterations t0 = 150
Widrow-Hoff   Learning Rate 0.1
              Iterations t1 = 1000
              Initial Weights Range [-0.5, 0.5]
LIC1          Accuracy ε = 0.5
              Tolerance δ = 0.0
              Number of Training Patterns m = 1000
              Number of Test Patterns (struc) l = 161051
              Number of Test Patterns (rand) l = 10000

Table 2: RBF experimental parameters.

    for x1 = 0 to 1.0 step 0.1 do
        for y1 = 0 to 1.0 step 0.1 do
            for x2 = 0 to 1.0 step 0.1 do
                for y2 = 0 to 1.0 step 0.1 do
                    for LENGTH1 = 0 to 1.0 step 0.1 do
                        print <x1, y1, x2, y2, LENGTH1, g(x1, y1, x2, y2, LENGTH1)>

Figure 2: Structured test set generator algorithm.

3.5 Engineering Methodological Diversity

The two distinct types of neural network, MLP and RBF, are expected to provide the major source of methodological diversity. In order to test this conjecture we shall require a number of RBF and MLP neural network versions. By systematically manipulating the network structure, the learning algorithm parameterization and the task specification, we shall construct a space or population of distinct and hopefully diverse neural network components. Specifically, we shall vary four major parameters:

1. the number of hidden units in a network,
2. the learning algorithm's initial weights,
3. the training set used, and
4. the training set construction method.

We shall reference individual versions in our space by labels that reflect the trained networks' initial conditions, with the convention that xi-Hj-Wk denotes a neural network with j ∈ {8, 9, 10, 11, 12} or j ∈ {50, 60, 70, 80, 90} hidden units, trained with initial weight set k ∈ {1, 2, 3, 4, 5}, on training set xi. The variable x ∈ {T, R} denotes the training set construction method: the T training sets are generated randomly, while the R training sets are generated rationally (see section 3.4). The variable i ∈ {1, 2, 3, 4, 5} denotes the particular training set used. As our MLP and RBF networks have different numbers of hidden units, we may determine the type of network (i.e., RBF or MLP) by examining the value of j. For example, T3-H10-W2 is an MLP network (because it has only 10 hidden units), trained with initial weight set 2, on random training set 3. As a result of the systematic manipulation of each of these four parameters, 250 distinct networks can be specified and then, if necessary, generated as a basis for study.

Later, in section 4, we shall refer to certain subsets of our version space. Specifically, within the MLP subset of our version space we identify named subsets of five specific versions:

1. WMLP: vary initial weight set (1, 2, 3, 4, and 5); fixed random training set T3 and 10 hidden units,
2. HMLP: vary number of hidden units (8, 9, 10, 11 and 12); fixed initial weight set 3 and rational training set R3,
3. TMLP: vary randomly constructed training set (T1, T2, T3, T4, and T5); fixed initial weight set 3 and 10 hidden units, and
4. RMLP: vary "rationally" constructed training set (R1, R2, R3, R4, and R5); fixed initial weight set 3 and 10 hidden units.

4 Results

In this section we present the empirical results of our experiments designed to exploit the methodological diversity between MLPs and RBFs in order to improve generalisation performance. As this study was conceived as an extension of the earlier one, which examined the potential for improvement in diverse version sets, we shall first establish the baseline for comparison. The problem is that both the useful diversity obtained in a version set and the average level of generalisation of the individual versions are now better understood. Significantly higher levels of both measures (i.e., GD and average generalisation) can be routinely engineered. It therefore seems reasonable to use the "better" version sets as our new basis, because improvements in individual network construction will contribute directly to the ultimate goal of improving the generalisation performance of the neural computing system. We can, however, relate the baseline performance of the new version sets to that of the previous study.

version set   max     min     av      maj. vote   selector net   GD
T′            94.67   92.41   93.77   94.54       97.24          0.554
W′            94.76   93.46   94.11   94.15       96.69          0.384
A′            94.55   93.33   93.91   94.24       96.10          0.443
TMLP          97.64   96.46   96.98   97.94       -              0.611
RMLP          97.66   97.47   97.55   97.98       -              0.550
WMLP          97.54   97.35   97.47   98.00       -              0.571
HMLP          97.59   96.90   97.28   98.35       -              0.583

Table 3: A comparison of earlier and current baseline performances.

4.1 Establishing a Baseline of Performance

In order to provide some reasonable basis for comparison we have generated some results from the version sets used in the earlier study, and from comparable version sets which form the basis for the current study. The previous study involved only MLP networks and was founded on three sets, each containing 10 networks. These were:

1. the T′ set: vary randomly constructed training set,
2. the W′ set: vary initial weight set,
3. the A′ set: vary number of repetitions of output unit.

The main diversity and performance "engineering" improvements introduced are:

1. the A′ style of architecture variation (repetition of output units) is replaced by variation in the number of hidden units, i.e., H variation (using 8, 9, 10, 11 and 12 hidden units),
2. training sets are constructed from either randomly or "rationally" selected patterns.

In table 3 comparative results are presented (all tabulated results, except GD values, are percentages). The first three lines of results are from the earlier study and are largely self-explanatory. The column headed "selector net" is the best selector-network average from the previous study; a selector network is a meta-level network trained as a switching mechanism to choose between individual version networks dependent upon characteristics of the input pattern. Four variations on the selector-network idea were examined; the tabulated result is an average for whichever variant proved to be the best.

As can be seen in table 3, the current version sets are approximately 3% better than the earlier ones, which means that potential generalisation improvements, although starting from a higher baseline, have less room for demonstrating improvement. Rough comparisons, which are all that is possible within this table, do not indicate any obvious diminishing of returns as average network performance is increased. For example, the two T sets (T′ and TMLP), possessing similar GD values, show a generalisation improvement from average performance to majority vote, i.e., p(majority), of 0.77% and 0.96% respectively, despite the fact that the TMLP set is performing approximately 3.5% better than the T′ set. Similarly, if we compare the W′ and WMLP sets, the same gains are 0.04% and 0.53%, respectively. However, in both of these cases the comparison is distorted in favor of the current version sets because they have the higher GD values. But if we compare an early and a current set with similar GD values (but favouring the earlier set), such as T′ and RMLP, then we see gains of 0.77% and 0.44%, respectively. The size of the improvement is reduced for the higher performance version set, but not to such a degree as might be expected.

network      gen (struc)   gen (rand)
R3-H9-W4     97.98         98.00
R2-H11-W1    97.93         98.31
R5-H12-W3    97.90         98.40
R5-H10-W1    97.87         97.95
R2-H9-W5     97.87         97.60
R4-H11-W1    97.87         98.24
R2-H10-W4    97.87         97.92
R1-H11-W1    97.82         98.13
R4-H8-W3     97.81         97.90
R5-H11-W3    97.78         97.89
R3-H11-W2    97.77         98.07
R2-H11-W4    97.75         97.68
R4-H11-W2    97.74         97.82
R5-H10-W4    97.74         97.77
R4-H9-W5     97.73         97.87

Table 4: The 15 best networks.

In this section we present the results of two experiments designed to demonstrate the positive e ects of methodological diversity on generalisation performance of multiversion systems. 4.2.1

Single Set

All our networks were evaluated on the structured test set described in gure 2, and the 15 networks with the best generalisation statistic were selected (see table 4, column 2). In order to distance ourselves from any unseen biases that may be present in our structured test set each selected network was also tested on a set of 10000 randomly generated patterns (see table 4, column 3). The results of considering these networks as the components of a majority vote multiversion system are shown in table 5. Clearly, even though the GDs are relatively low, the overall system is still more accurate than any of the individual networks. Intuitively, one would expect a multiversion system constructed from these components to be superior to any constructed from inferior components. It is the purpose of this experiment to demonstrate the contrary. 15

test    average   GD      majority
struc   97.83     0.468   98.44
rand    97.97     0.530   98.48

Table 5: Majority vote statistics.

network      gen (struc)   gen (rand)
R1-H50-W3    84.10         84.79
R1-H50-W4    85.38         86.68
R2-H60-W5    86.69         87.53
R3-H60-W2    87.29         90.38

Table 6: Four RBF networks.

Consider the four RBF networks shown in table 6. The values in tables 7 and 8 demonstrate the effect on our multiversion system of replacing the first (that is, the best) network in table 4 with the first network in table 6. The next entry in the table is derived by replacing the first and second networks in table 4 with the first and second networks in table 6. The third and fourth entries in the table are derived by replacing the first three and four MLP networks in table 4 with the first three and four RBF networks in table 6, respectively. The results clearly demonstrate that although the networks we are substituting are markedly inferior, their addition leads to an increase in overall performance and an increase in GD. The reason for this is, in our opinion, due solely to the increase in methodological diversity: the processes involved in constructing trained RBF and MLP networks are so different that the resultant products are extremely diverse. This is, however, a worst case: very low performance RBFs are substituted for the best MLPs. It shows the power of diversity in a highly disadvantageous context. The optimum strategy is to exploit diversity together with individual high performers, but we have, as yet, not generated very high performance individual RBF networks for the chosen task.

average   GD      majority
96.902    0.638   98.41
96.065    0.709   98.46
95.318    0.734   98.47
94.613    0.752   98.47

Table 7: Majority vote statistics after replacement on the structured test set.

average   GD      majority
97.089    0.678   98.53
96.314    0.728   98.53
95.589    0.737   98.51
95.085    0.746   98.51

Table 8: Majority vote statistics after replacement on the random test set.

4.2.2 Multi-Set

A further implication of diverse processes generating diverse products is that multiversion system generalisation ought to be optimizable by selecting versions from between methodologically different sets, instead of selecting them from within a maximally diverse set. Given a system design based on three methodologically diverse sets of versions, we can compare the relative merits (with respect to generalisation performance) of selecting versions within and between sets by applying the statistical selection strategies developed earlier. In particular, we can compare the majority selection strategies (both outright majority and majority of three randomly selected) in order to assess the methodological diversity idea. The three methodologically diverse sets each contain five versions and attempt to exploit diverse processes: network type and training set construction. The three sets are:

TRBF: a set of RBF versions each trained on the same random training set (T2), each containing 50 hidden units (H50), but each initialized with a different set of random weights (W1 to W5).

RMLP: a set of MLP versions each trained on a rational training set (R1, R2 or R4), each containing a different number of hidden units (H8 to H12), and initialized with a variety of random weights (W2 to W5).

TMLP: a set of MLP versions each trained on a random training set (T5), each containing 11 hidden units (H11), and each initialized with a different set of random weights (W1 to W5).

The individual networks are shown in table 9, together with the generalisation results for each of these sets, individually, on the structured test set and on the same, previously unseen, random test set of 10,000 patterns. In table 10 the first two columns give the generalisation performance of the best and worst individual version in each set, and the third column gives the average of all five versions. The last two columns give the results from two majority-vote statistics: the simple majority (of five), and the majority of three randomly selected versions, i.e., the p(maj3) statistic derived earlier. The table also includes the generalisation diversity, GD, that each set exhibited in this test. The first point to notice is that none of the sets is very diverse (the maximum GD is just greater than one half). This is to be expected, as the sources of diversity were minimal: primarily weight seed variation and/or hidden unit number variation.

network      gen (struc)   gen (rand)
T2-H50-W1    96.36         96.99
T2-H50-W2    96.05         96.75
T2-H50-W3    96.37         97.04
T2-H50-W4    96.79         97.56
T2-H50-W5    95.84         96.56
R1-H8-W2     97.40         97.64
R4-H9-W5     97.73         97.87
R2-H10-W3    97.66         97.66
R4-H11-W4    97.40         97.49
R4-H12-W3    97.43         97.86
T5-H11-W1    97.12         97.98
T5-H11-W2    96.82         97.68
T5-H11-W3    96.79         97.17
T5-H11-W4    97.38         97.78
T5-H11-W5    97.64         97.15

Table 9: The 15 networks.

set     max     min     aver    GD      majority   maj3
TRBF    97.56   96.56   96.98   0.471   97.56      97.38
RMLP    97.87   97.49   97.70   0.419   97.93      97.88
TMLP    98.15   97.17   97.75   0.506   98.37      98.16

Table 10: The performance of three version sets on the random test set.

set     max     min     aver    GD      majority   maj3
TRBF    97.79   95.84   96.28   0.399   96.81      96.69
RMLP    97.73   97.40   97.53   0.335   97.79      97.73
TMLP    97.64   96.79   97.15   0.423   97.60      97.46

Table 11: The performance of three version sets on the structured test set.


sets           GDB     GDBW
TRBF × RMLP    0.761   0.316
TRBF × TMLP    0.758   0.269
RMLP × TMLP    0.673   0.211

Table 12: Inter-set diversity on the random test set.

sets           GDB     GDBW
TRBF × RMLP    0.613   0.317
TRBF × TMLP    0.689   0.277
RMLP × TMLP    0.607   0.228

Table 13: Inter-set diversity on the structured test set.

The result of this can be seen in the last two columns of table 10: the majority-vote performances are only slightly better than the best version in each set, and in one case a majority performance is worse. But because these sets have been constructed using maximally diverse processes, we expect some considerable diversity between the sets as a whole. Table 12 gives the inter-set diversity measures.

The inter-set diversity results in table 12 support the conjecture that network type and training set structure are greater sources of diversity than either weight seed or hidden unit number variation. The between-set diversities are 33% to 82% greater than the within-set diversities. The significant positive values for GDBW illustrate this gain: GDBW would be zero if within-set and between-set diversities were equal, and would be negative if the within-set diversity were greater. Similar results obtained with the structured test set are given in tables 11 and 13. However, notice that in tables 12 and 13, network type generates a greater diversity than training set structure (approximately 8% more in table 12). This is expected, but the differential is less than might be anticipated on the basis that network type is a far more drastic methodological variation than training set structure.

The question to be answered now is: does this diversity pattern provide a basis for improved generalisation performance when we select between these methodologically diverse sets rather than simply within them? The relevant results are presented in table 14. The most demanding comparison will be provided by 'equivalent' selection processes applied to a single set composed of the three version sets; this gives the 'flat' 15 set, which is also included in the tabulation. The first row gives the results derived from the structured 161051 test set (see figure 2), and the second row is from the random 10000 test set. The trends are the same with respect to either of these two large and very different test sets. The third row is a limiting-case result: one network from each set was selected and replicated four times to give a three-set system in which all five networks in each set are identical. The within-set GD for each version set was zero, and so all three statistical between-set measures give the same result of 98.52%. And all between-set statistics are superior to the within-set ones, except for the total majority vote, which is the same, 98.52%.

        between set selection              'flat' 15 results
Test    maj3rand   majofmaj3   majofmaj    maj3rand   majorityof9   majority
Struc   98.35      98.55       98.65       98.14      98.61         98.74
Rand    97.87      98.04       98.07       97.63      98.07         98.17
Path    98.52      98.52       98.52       98.04      98.44         98.52

Table 14: Exploiting methodological diversity.

This (contrived) system presents a best case for between-set selection, and the gain from selecting three random networks, one from each set, is 98.52 - 98.04, which is 0.48%. In this extreme case even the majority-of-three majorities of three, one from each set, is superior to a simple majority of nine.

The first point to notice is that a comparison of the first and fourth columns in this table shows that between-set selection delivers a better generalisation performance than within-set selection, i.e., an appropriate organization of methodological diversity can improve the generalisation performance of the overall system. However, notice also that if all versions are to be evaluated (rather than just three) for a total majority-vote strategy, then the separate-set approach (i.e., take the majority of each of the sets and then take the majority of these three outcomes, the "majofmaj" column) is inferior to simply treating all three sets as one larger set and taking the overall majority (the "majority" column). Intuitively, when every version is to contribute to the outcome, the two-step approach is disadvantageous because each version set of five is limited to contributing precisely three versions to the final outcome. But when treated as one set of 15 versions, any subset of five versions can contribute anything from zero to all five versions to the final outcome. In the terminology of Littlewood and Miller [1989], the between-set majority-of-three statistic, p(maj3)_ABC, is a "diverse 2-out-of-3 system", and a majority-of-three from a "flat" 15 set is an example of a "homogeneous 2-out-of-3 system". Our studies support their analysis, which concludes "that the diverse design is not always superior to the homogeneous design" (see page 1606), and we provide guidance for making the optimal choice.

5 Conclusions

First and foremost, the study shows that diversity of versions is more important than individual levels of performance. In order to maximize the generalisation performance of a version set, the top priority (given at least one high performance version) is to construct highly diverse versions. And one way to maximize diversity is to employ methodological diversity in the version construction process. In terms of potential to lead to version diversity, a number of factors can be given an approximate ordering: network type, training set structure, number of hidden units, and weight seed (in order of decreasing diversity generating potential); results in Partridge [1994] confirm this ordering for training set, hidden units, and weight seed.

Littlewood and Miller [1989] stress the importance of relating the software development process (for us, training regimes) to the failure behavior of products (for us, trained networks). They continue: "In particular, we are interested in whether it is possible to understand the way in which the 'diversity of version behavior' depends on 'diversity of methodology'." (see page 1609). Clearly, we are beginning to do this in the context of neural computing. Additionally, it was shown that the most effective use of high-diversity generating features is to generate, and select from, independent, methodologically diverse sets, provided only a small number of versions are to be used to determine the system outcome. But when all versions are to be employed to determine the system outcome, then the simplest and best strategy is to treat all versions as one large set. In order to determine the generality of these results, two large and quite different test sets were used; all of the same effects were observed in both cases. Finally, two inter-set diversity quantities, GDB and GDBW, were defined and used, and it was shown how GDBW provides a guide to the efficacy of employing a between-set selection strategy on a collection of sets of versions.

6 Acknowledgements

The help of Wojtek Krzanowski on the statistical work is gratefully acknowledged: he developed the basic framework and proposed the two inter-set GD measures. Simon Klyne is to be thanked for his implementation of the statistical model. Finally, this study and one of the authors (WBY) have been supported by a grant from the Safety-Critical Systems Programme (no. GR/H85427) of the EPSRC and DTI.


A Appendix

Derivation of the statistic p(maj3)_ABC, which is the probability of correctness of a majority of three versions, one selected at random from each of the sets A, B and C:

$$
\begin{aligned}
p(maj3)_{ABC} &= p((0\text{ out of }3)\text{ or }(1\text{ out of }3)\text{ fail on this input})\\
&= \sum_{n_A n_B n_C=0}^{N_A N_B N_C} p(\text{exactly }n_A, n_B\text{ and }n_C\text{ versions fail on this input and either 0 or 1 of chosen versions fail})\\
&= \sum_{n_A n_B n_C=0}^{N_A N_B N_C} p(\text{either 0 or 1 of chosen versions fail}\mid\text{exactly }n_A, n_B\text{ and }n_C\text{ fail})\cdot p(\text{exactly }n_A, n_B\text{ and }n_C\text{ fail})\\
&= \sum_{n_A n_B n_C=0}^{N_A N_B N_C} p(\text{none of chosen versions fail}\mid\text{exactly }n_A, n_B\text{ and }n_C\text{ fail})\cdot p_{n_A n_B n_C}\\
&\quad+ \sum_{n_A n_B n_C=0}^{N_A N_B N_C} p(\text{exactly 1 chosen version fails}\mid\text{exactly }n_A, n_B\text{ and }n_C\text{ fail})\cdot p_{n_A n_B n_C}\\
&= \sum_{n_A n_B n_C=0}^{N_A N_B N_C} p(\text{none of chosen versions fail}\mid\text{exactly }n_A, n_B\text{ and }n_C\text{ fail})\cdot p_{n_A n_B n_C}\\
&\quad+ \sum_{n_A=1;\,n_B n_C=0}^{N_A N_B N_C} p(A\text{ version fails, }B\text{ and }C\text{ versions correct}\mid\text{exactly }n_A, n_B\text{ and }n_C\text{ fail})\cdot p_{n_A n_B n_C}\\
&\quad+ \sum_{n_B=1;\,n_A n_C=0}^{N_A N_B N_C} p(B\text{ version fails, }A\text{ and }C\text{ versions correct}\mid\text{exactly }n_A, n_B\text{ and }n_C\text{ fail})\cdot p_{n_A n_B n_C}\\
&\quad+ \sum_{n_C=1;\,n_A n_B=0}^{N_A N_B N_C} p(C\text{ version fails, }A\text{ and }B\text{ versions correct}\mid\text{exactly }n_A, n_B\text{ and }n_C\text{ fail})\cdot p_{n_A n_B n_C}\\
&= \sum_{n_A n_B n_C=0}^{N_A N_B N_C} \frac{(N_A-n_A)}{N_A}\cdot\frac{(N_B-n_B)}{N_B}\cdot\frac{(N_C-n_C)}{N_C}\,p_{n_A n_B n_C}\\
&\quad+ \sum_{n_A=1;\,n_B n_C=0}^{N_A N_B N_C} \frac{n_A}{N_A}\cdot\frac{(N_B-n_B)}{N_B}\cdot\frac{(N_C-n_C)}{N_C}\,p_{n_A n_B n_C}\\
&\quad+ \sum_{n_B=1;\,n_A n_C=0}^{N_A N_B N_C} \frac{(N_A-n_A)}{N_A}\cdot\frac{n_B}{N_B}\cdot\frac{(N_C-n_C)}{N_C}\,p_{n_A n_B n_C}\\
&\quad+ \sum_{n_C=1;\,n_A n_B=0}^{N_A N_B N_C} \frac{(N_A-n_A)}{N_A}\cdot\frac{(N_B-n_B)}{N_B}\cdot\frac{n_C}{N_C}\,p_{n_A n_B n_C}
\end{aligned}
$$

The derivation of the statistic p(majmaj)_ABC, which is the probability of correctness of a majority vote over three majority outcomes, one each from sets A, B and C, is as follows:

$$
\begin{aligned}
p(majmaj)_{ABC} &= \mathrm{prob}((0\text{ out of }3\text{ majorities})\text{ or }(1\text{ out of }3\text{ majorities})\text{ fail})\\
&= \sum_{n_A n_B n_C=0}^{N_A N_B N_C} \mathrm{prob}(\text{exactly }n_A, n_B\text{ and }n_C\text{ versions fail on this input and either 0 or 1 of the individual majorities fail})\\
&= \sum_{n_A n_B n_C=0}^{N_A N_B N_C} \mathrm{prob}(\text{either 0 or 1 of chosen majorities fail}\mid\text{exactly }n_A, n_B\text{ and }n_C\text{ versions fail})\cdot\mathrm{prob}(\text{exactly }n_A, n_B\text{ and }n_C\text{ versions fail})\\
&= \sum_{n_A n_B n_C=0}^{N_A N_B N_C} \mathrm{prob}(\text{none of chosen majorities fail}\mid\text{exactly }n_A, n_B\text{ and }n_C\text{ versions fail})\cdot p_{n_A n_B n_C}\\
&\quad+ \sum_{n_A n_B n_C=0}^{N_A N_B N_C} \mathrm{prob}(\text{one of chosen majorities fails}\mid\text{exactly }n_A, n_B\text{ and }n_C\text{ versions fail})\cdot p_{n_A n_B n_C}\\
&= \sum_{n_A n_B n_C=0}^{N_A N_B N_C} \mathrm{prob}(\text{none of chosen majorities fail}\mid\text{exactly }n_A, n_B\text{ and }n_C\text{ versions fail})\cdot p_{n_A n_B n_C}\\
&\quad+ \sum_{n_A n_B n_C=0}^{N_A N_B N_C} \mathrm{prob}(A\text{ majority fails, }B, C\text{ majorities correct}\mid\text{exactly }n_A, n_B\text{ and }n_C\text{ versions fail})\cdot p_{n_A n_B n_C}\\
&\quad+ \sum_{n_A n_B n_C=0}^{N_A N_B N_C} \mathrm{prob}(B\text{ majority fails, }A, C\text{ majorities correct}\mid\text{exactly }n_A, n_B\text{ and }n_C\text{ versions fail})\cdot p_{n_A n_B n_C}\\
&\quad+ \sum_{n_A n_B n_C=0}^{N_A N_B N_C} \mathrm{prob}(C\text{ majority fails, }A, B\text{ majorities correct}\mid\text{exactly }n_A, n_B\text{ and }n_C\text{ versions fail})\cdot p_{n_A n_B n_C}\\
&= \sum_{n_A n_B n_C=0}^{N_A N_B N_C} p(maj)_{n_A}\,p(maj)_{n_B}\,p(maj)_{n_C}\cdot p_{n_A n_B n_C}\\
&\quad+ \sum_{n_A n_B n_C=0}^{N_A N_B N_C} (1-p(maj)_{n_A})\,p(maj)_{n_B}\,p(maj)_{n_C}\cdot p_{n_A n_B n_C}\\
&\quad+ \sum_{n_A n_B n_C=0}^{N_A N_B N_C} p(maj)_{n_A}\,(1-p(maj)_{n_B})\,p(maj)_{n_C}\cdot p_{n_A n_B n_C}\\
&\quad+ \sum_{n_A n_B n_C=0}^{N_A N_B N_C} p(maj)_{n_A}\,p(maj)_{n_B}\,(1-p(maj)_{n_C})\cdot p_{n_A n_B n_C}
\end{aligned}
$$

References

S. B. Holden and P. J. W. Rayner. Generalization and PAC learning: some new results for the class of generalized single layer networks. IEEE Transactions on Neural Networks, (submitted), 1993.

K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359-366, 1989.

J. C. Knight and N. G. Leveson. An experimental evaluation of the assumption of independence in multiversion programming. IEEE Transactions on Software Engineering, 12(1):96-109, 1986.

B. Littlewood and D. R. Miller. Conceptual modeling of coincident failures in multiversion software. IEEE Transactions on Software Engineering, 15(12):1596-1614, 1989.

J. Moody and C. Darken. Learning with localized receptive fields. In D. Touretzky, G. E. Hinton, and T. J. Sejnowski, editors, Proceedings of the 1988 Connectionist Models Summer School, pages 133-143. Morgan Kaufmann, 1988.

D. Partridge and N. Griffith. Strategies for improving neural network generalisation. Neural Computing and Applications, 4:1-11, 1994.

D. Partridge and W. B. Yates. Replicability of neural computing experiments. IEEE Transactions on Neural Networks, (submitted), 1994.

D. Partridge. Network generalization differences quantified. Research Report 291, University of Exeter, 1994.

D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations, pages 318-362. MIT Press, 1986.

B. Widrow and M. E. Hoff. Adaptive switching circuits. In J. A. Anderson and E. Rosenfeld, editors, Neurocomputing: Foundations of Research, pages 126-134. MIT Press, 1960.

W. B. Yates. Radial basis function networks. Research Report R317, University of Exeter, 1994.
