IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 16, NO. 4, JULY 2005


A Model Selection Algorithm for a Posteriori Probability Estimation With Neural Networks Juan Ignacio Arribas and Jesús Cid-Sueiro, Member, IEEE

Abstract—This paper proposes a novel algorithm to jointly determine the structure and the parameters of an a posteriori probability model based on neural networks (NNs). It makes use of well-known ideas of pruning, splitting, and merging neural components and takes advantage of the probabilistic interpretation of these components. The algorithm, called the a posteriori probability model selection (PPMS) algorithm, is applied to an NN architecture called the generalized softmax perceptron (GSP), whose outputs can be understood as probabilities, although the results shown can be extended to more general network architectures. Learning rules are derived from the application of the expectation–maximization algorithm to the GSP-PPMS structure. Simulation results show the advantages of the proposed algorithm with respect to other schemes.

Index Terms—Expectation–maximization, model selection, neural network (NN), objective function, posterior probability, regularization.

I. INTRODUCTION

THE ESTIMATION of a posteriori probabilities is an important problem in pattern recognition. In addition to making decisions, there are many applications where posterior probabilities provide an essential degree of confidence about those decisions, which may be as important as the decision itself: medical diagnosis, aircraft control, financial risk analysis, and critical/tactical communications are only some examples. Moreover, from classical decision theory, once the costs associated with decision errors have been fixed, the posterior probabilities of the hypotheses carry enough information to make optimal decisions, minimizing the mean value of the cost [1]. Therefore, posterior probabilities are sufficient statistics in decision problems and may provide an added value to the decisions undertaken. The problem of estimating probabilities with neural networks (NNs) has received wide attention in the literature. It has often been noted that when training an NN in order to minimize the sum of square errors (SE) [2]–[4] or the cross entropy (CE)

Manuscript received March 20, 2003; revised December 2, 2004. This work was supported by grant numbers FP6-507609 SIMILAR Network of Excellence from the European Union, TIC01-3808-C02-02, TIC02-03713 and TEC04-06647-C03-01 from the Comisión Interministerial de Ciencia y Tecnología and CAM-07T/0017/2003-1 and GR/SAL/0471/2004 from La Comunidad Autónoma de Madrid, Spain. J. I. Arribas is with the Departamento de Teoría de la Señal y Comunicaciones e Ingeniería Telemática, Universidad de Valladolid, 47011 Valladolid, Spain (e-mail: [email protected]). J. Cid-Sueiro is with the Department of Signal Theory and Communications, University Carlos III de Madrid, 28911 Leganés-Madrid, Spain (e-mail: [email protected]). Digital Object Identifier 10.1109/TNN.2005.849826

[5]–[7], the soft decisions of the NN provide estimates of the a posteriori class probabilities. More generally, necessary and sufficient conditions on the cost function to provide posterior probability estimates have been analyzed in [8] for binary cases and in [9]–[11] for multiclass problems. An important issue in posterior probability estimation is model selection: finding a tradeoff between the over-fitting problems of large networks and the limited expression capabilities of small networks, that is, the bias-variance dilemma, which is well known in machine learning [12], [13]. The model selection problem poses many algorithmic difficulties in real applications, where large and highly structured networks may be needed. Several strategies have been proposed to solve it: a wide family of methods is based on training different architectures and using a validation set to evaluate them according to some performance measure that combines classification capabilities with a penalty or regularization term stating some preference for small over large networks [14]–[19]. Alternatively, other constructive approaches try to adaptively modify the network structure in order to find the best configuration. These methods, which combine ideas of pruning, splitting, and merging neural components, avoid the need for a validation set, taking all data to jointly determine the structure and the weights. They have been widely explored in the literature, especially for estimating Gaussian mixtures [20]–[23] and training multilayer perceptrons [13]. There are not many papers in the NN literature devoted to the problem of estimating the size and configuration of an architecture for estimating posterior probabilities in classification problems. This is mainly because, in principle, any general model selection algorithm can be applied to estimate an a posteriori probability model. However, the immediate application of general-purpose schemes to this problem may be inefficient.
For instance, the number of configurations required to evaluate an a posteriori probability model using validation data may grow excessively with the number of neural components and with the number of classes, requiring some heuristic to reduce the exploration space. The most well-known structures for posterior probability estimation in the NN field are the mixture of experts (ME) [24] and its generalization, the hierarchical mixture of experts (HME) [25], [26]. Some constructive algorithms have been proposed for this kind of structure: [27] and [28] for classification and [20] for regression problems. All these methods are based on taking some performance measure of network components as the accumulation of node statistics computed from data. According to these measures, nodes are split, merged, or pruned.

1045-9227/$20.00 © 2005 IEEE

In both the HME and the ME, the network is a combination of expert modules, which are essentially decision machines trying to solve the classification task in localized regions of the feature space. Therefore, each node added to the tree increases the network complexity for all classes, which may not be advisable if some class is less represented than others. Our work is based on the use of an architecture that decomposes the posterior probability as a mixture of components, as in the ME or HME, but nonhierarchically organized. In this way, complexity is added specifically to different classes. In any case, the extension of our algorithm to ME and HME structures does not pose special difficulties. We propose a novel model selection algorithm that is also based on taking performance measures from the mixture components, which are used to modify the network structure. The pruning mechanism is essentially the same as that used by Fritsch [27] and Waterhouse [28]. However, both the performance measure and the criterion to select the node to split are different. As in [27], and in contrast to [28], our algorithm does not hypothesize all possible node splits, but selects them according to a performance criterion. However, while in [27] the worst expert is selected for splitting, we select the most active nodes, so as to provide them a wider expression capability to capture details escaping a single node. The remainder of the paper is organized as follows. In Section II, a brief review of existing methods to compute probabilities is presented; we indicate some NN architecture candidates to estimate probabilities and introduce a probability model based on the generalized softmax perceptron (GSP) architecture [29]. Section III presents a brief overview of some well-known model selection algorithms and criteria that we have used for comparison purposes.
In Section IV, we introduce and derive the new PPMS model selection algorithm based on the previous estimation of posterior probabilities, give details concerning the PPMS implementation, and briefly define the performance measures we shall use in the results section. In Section V, we carry out computer simulations in order to show the benefits of the PPMS algorithm in comparison with three well-known exhaustive techniques to determine the optimal model complexity. For that, we compute posterior probability surface estimates as well as numerical comparisons based on the Kullback–Leibler (KL) divergence and the CE cost function, with both synthetic and real data. Finally, Section VI summarizes some conclusions and outlines further lines of work.

II. PROBABILITY ESTIMATION

Consider a sample set S = {(x^(k), d^(k)), k = 1, ..., K}, where x^(k) is an observation vector and d^(k) is an element in the set {u_0, ..., u_{L-1}} of possible target classes. The class-i label u_i is a unit L-dimensional vector with components u_ij = δ_ij (the Kronecker delta function): every component is equal to 0, except the ith component, which is equal to 1. In order to estimate posterior probabilities in a multiclass problem with L mutually exclusive classes, an NN with one output per class is required (or, at least, a network with L - 1 outputs, the remaining one being computed following probability

rules). In this paper, we consider the case of NNs that compute a nonlinear mapping y = f(x, W), where x is a point in the input feature space X, y lies in a probability space P, and W are the network parameters. Therefore, the network computes one soft decision y_i per class, satisfying the probability constraints

0 ≤ y_i ≤ 1,   Σ_{i=0}^{L-1} y_i = 1    (1)

The hard output d̂ depends on the decision criterion. For instance, provided that the soft outputs are posterior probability estimates, maximum a posteriori (MAP) decisions can be computed as a winner-takes-all (WTA) decision over the soft outputs. Training pursues the minimization of some empirical estimate of E{C(y, d)} based on the samples in S, where C(y, d) is the cost function. In [10] and [30], it is shown that any cost function providing posterior probability estimates, called strict sense Bayesian (SSB), can be written in the form

C(y, d) = h(y) + (d - y)^T ∇_y h(y)    (2)

where h(y) is a strictly concave function in P, which can be interpreted as an entropy measure. In fact, the formula also comprises a sufficient condition: any cost function expressed in this way provides probability estimates. A more general class of cost functions are the wide sense Bayesian (WSB) objective functions, whose structure is analyzed in [31]. The selection of an SSB cost function for a particular situation is an open problem. This work is based on the CE

C(y, d) = - Σ_{i=0}^{L-1} d_i log y_i    (3)

which comes from (2) by taking h(y) = - Σ_{i=0}^{L-1} y_i log y_i (i.e., the Shannon entropy). There are other widely used measures, like the SE, defined as

C(y, d) = Σ_{i=0}^{L-1} (d_i - y_i)^2    (4)

Our choice of the CE is justified by several reasons. First, the CE has been shown to outperform the SE for classification problems in experimental studies [32], [33]. Second, some theoretical evidence [34], [35] has shown that the CE can solve class separation problems in situations where the SE cannot. Finally, the minimization of the CE is equivalent to maximum likelihood (ML) estimation of the posterior class probabilities [36], which allows the derivation of a wide family of learning rules by application of the EM algorithm or its variants. Note, however, that although our parameter updating rules are derived from an ML perspective, the model selection algorithm that will be proposed does not impose any particular SSB cost function during learning.
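As an illustration, the two cost functions of (3) and (4) can be evaluated as follows (a minimal NumPy sketch; the function names are ours, not the paper's):

```python
import numpy as np

def cross_entropy(y, d):
    """CE cost of (3): C(y, d) = -sum_i d_i * log(y_i).

    y: soft network outputs (a probability vector); d: one-hot class label.
    """
    eps = 1e-12                      # guard against log(0)
    return -np.sum(d * np.log(y + eps))

def square_error(y, d):
    """SE cost of (4): C(y, d) = sum_i (d_i - y_i)^2."""
    return np.sum((d - y) ** 2)

# Both costs are minimized when y equals the one-hot label d.
y = np.array([0.7, 0.2, 0.1])        # soft decisions for a 3-class problem
d = np.array([1.0, 0.0, 0.0])        # the sample belongs to class 0
print(cross_entropy(y, d))           # -log(0.7), about 0.357
print(square_error(y, d))            # 0.09 + 0.04 + 0.01 = 0.14
```

Note that the CE only looks at the output of the true class, while the SE penalizes all outputs; this is one intuition behind their different behavior in classification.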

Fig. 1. GSP multiple-output network.

A. Architecture

We have used a neural architecture based on the softmax nonlinearity, which guarantees that the outputs satisfy the probability constraints given in (1). In particular, output y_i corresponding to class i is given by

y_i = Σ_{j=1}^{M_i} y_ij    (5)

where

y_ij = exp(o_ij) / Σ_{k=0}^{L-1} Σ_{l=1}^{M_k} exp(o_kl)    (6)

and

o_ij = w_ij^T x + b_ij    (7)

for i = 0, ..., L-1 and j = 1, ..., M_i, where L is the number of classes, M_i is the number of softmax outputs used to compute y_i, and x is the input sample. This network, which is called the generalized softmax perceptron (GSP) [29], can easily be shown to be a universal probability estimator, in the sense that it can approximate with arbitrary precision any posterior probability map by increasing M_i for every class. A GSP example architecture for a three-class problem is shown in Fig. 1.

As we will see later, outputs y_ij can be interpreted as subclass probabilities, M_i being the number of subclasses within class i. Therefore, classes are decomposed into subclasses, and their probabilities are computed as the sum of subclass probabilities. Since subclass labels are not available in S, learning is supervised at the class level, but not at the subclass level.

B. Probability Model Based on the GSP

In this section, we derive learning rules for estimating, via ML, a probability model based on the GSP network, which provides an interpretation for the softmax outputs and some insight into the behavior of the GSP network. Since the GSP outputs satisfy the probability constraints, we can say that any GSP implements a multinomial probability model of the form

P(d | x, W) = Π_{i=0}^{L-1} y_i^{d_i}    (8)

where matrix W encompasses all GSP weight parameters (subclass weight vectors and biases), being formed by subclass weight column vectors w_ij, and y_i is given by (5) and (6). This probabilistic interpretation of the network outputs is useful for training: the model parameters that best fit the data in S can be estimated via ML as

W_ML = arg max_W l(W)    (9)

where

l(W) = Σ_{k=1}^{K} log P(d^(k) | x^(k), W)    (10)

It is worth noting that, apart from a scale factor, (10) also results from taking the negative of the CE in (3), averaged over the training set. For this reason, minimizing the CE is equivalent to ML estimation. Following an analysis similar to that of Jordan et al. [25] for the HME, this maximization can be done iteratively by means of the EM algorithm. To do so, let us assume that each class is partitioned into several subclasses, M_i being the number of subclasses from class i. The subclass for a given sample will be represented by a subclass label vector z, with components z_ij, i = 0, ..., L-1, j = 1, ..., M_i, such that at most one of them is equal to 1 and

d_i = Σ_{j=1}^{M_i} z_ij    (11)

Also, let us assume that the joint probability model of d and z given x is

P(d, z | x, W) = I(d, z) Π_{i=0}^{L-1} Π_{j=1}^{M_i} y_ij^{z_ij}    (12)

where I(d, z) is an indicator function equal to one if d and z satisfy (11), and equal to zero otherwise. Note that, according to (6), variables z_ij follow a multinomial logit model. Note, also, that the joint model in (12) is consistent with that in (8), in the sense that the latter results from the former by marginalization

P(d | x, W) = Σ_z P(d, z | x, W)    (13)

Since there is no information about subclasses in S, learning can be supervised at the class level, but must be unsupervised at the subclass level. The EM algorithm based on hidden variables z provides a way to do this while maximizing the log-likelihood in (10). Consider the complete data set S_c = {(x^(k), d^(k), z^(k)), k = 1, ..., K}, which extends S by including the subclass labels. According to (12), the complete data log-likelihood is

l_c(W) = Σ_{k=1}^{K} Σ_{i=0}^{L-1} Σ_{j=1}^{M_i} z_ij^(k) log y_ij(x^(k), W)    (14)

The EM algorithm for ML estimation of the GSP posterior model proceeds, at iteration n, in two steps:
1) E-step: compute Q(W | W^(n)) = E{l_c(W) | S, W^(n)};
2) M-step: compute W^(n+1) = arg max_W Q(W | W^(n)).
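The GSP forward pass (5)-(7) and the posterior expectation of the hidden subclass labels used in the E-step can be sketched as follows (a minimal NumPy sketch with our own variable names; shapes and initialization are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def gsp_forward(W, b, x):
    """Forward pass of a GSP, eqs. (5)-(7).

    W: list (one entry per class) of (M_i, n) weight matrices;
    b: list of (M_i,) bias vectors; x: (n,) input sample.
    Returns subclass probabilities y_ij and class probabilities y_i.
    """
    o = [Wi @ x + bi for Wi, bi in zip(W, b)]          # activations o_ij (7)
    flat = np.concatenate(o)
    flat = np.exp(flat - flat.max())                   # stable softmax (6)
    flat /= flat.sum()
    sizes = np.cumsum([len(bi) for bi in b])[:-1]
    y_sub = np.split(flat, sizes)                      # subclass probs y_ij
    y_class = np.array([ys.sum() for ys in y_sub])     # class probs y_i (5)
    return y_sub, y_class

def responsibilities(y_sub, y_class, d):
    """E-step expectation: E{z_ij | x, d} = d_i * y_ij / y_i."""
    return [d[i] * y_sub[i] / y_class[i] for i in range(len(y_class))]

rng = np.random.default_rng(0)
# A 3-class GSP over 4-dimensional inputs, with 2, 3, and 1 subclasses.
W = [rng.normal(size=(2, 4)), rng.normal(size=(3, 4)), rng.normal(size=(1, 4))]
b = [np.zeros(2), np.zeros(3), np.zeros(1)]
x = rng.normal(size=4)
y_sub, y_class = gsp_forward(W, b, x)
assert np.isclose(y_class.sum(), 1.0)                  # constraint (1)
```

Because the softmax normalizes over all subclasses of all classes jointly, the class probabilities automatically satisfy (1), and the responsibilities of the true class's subclasses sum to one for each sample.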

In order to apply the E-step, note that, given S and assuming that the true parameters are W^(n), the only unknown components in S_c are the hidden variables z^(k). Therefore

Q(W | W^(n)) = Σ_k Σ_{i,j} E{z_ij^(k) | S, W^(n)} log y_ij(x^(k), W)    (15)

Let us use the compact notation y_ij^(k,n) to refer to the softmax output relative to class i and subclass j, for input x^(k) at iteration n (and the same for y_i^(k,n)). Then, using (8) and (12),

E{z_ij^(k) | S, W^(n)} = d_i^(k) y_ij^(k,n) / y_i^(k,n)    (16)

Substituting (16) into (15), we get

Q(W | W^(n)) = Σ_k Σ_{i,j} (d_i^(k) y_ij^(k,n) / y_i^(k,n)) log y_ij(x^(k), W)    (17)

Therefore, the M-step reduces to

W^(n+1) = arg max_W Q(W | W^(n))    (18)

Note that, during the M-step, only the factor log y_ij(x^(k), W) depends on W. The maximization can be done in different ways. For the HME, the iteratively reweighted least squares algorithm [25] and the Newton–Raphson algorithm [37] have been explored. A simpler solution (also suggested in [25]) consists of replacing the M-step by a single iteration of a gradient search rule

W^(n+1) = W^(n) + μ ∇_W Q(W | W^(n))    (19)

which leads to

w_ij^(n+1) = w_ij^(n) + μ Σ_k (d_i^(k) y_ij^(k,n) / y_i^(k,n) - y_ij^(k,n)) x^(k)    (20)

b_ij^(n+1) = b_ij^(n) + μ Σ_k (d_i^(k) y_ij^(k,n) / y_i^(k,n) - y_ij^(k,n))    (21)

where μ is the learning step. A further simplification can be made by replacing the gradient search rule by an incremental (sample-by-sample) version. In doing so, the following rules for weight and bias adaptation result:

w_ij^(k+1) = w_ij^(k) + μ^(k) (d_i^(k) y_ij^(k) / y_i^(k) - y_ij^(k)) x^(k)    (22)

b_ij^(k+1) = b_ij^(k) + μ^(k) (d_i^(k) y_ij^(k) / y_i^(k) - y_ij^(k))    (23)

where w_ij^(k) and b_ij^(k) are the weight vectors and biases at iteration k, y_ij^(k) and y_i^(k) are the softmax outputs for input x^(k), d^(k) are the labels for sample k, and μ^(k) is the learning step at iteration k. It is not difficult to show that rules (22) and (23) can also be derived by computing the stochastic gradient learning rules minimizing the CE. They have been used in our simulations. However, the EM formulation suggests other estimation algorithms, besides providing the concept of subclass, which is essential to the PPMS algorithm.

III. MODEL SELECTION ALGORITHMS: COMPLEXITY

The problem of determining the optimal network size, also known as the model selection problem, is well known and, in general, difficult to solve [38]. The selected architecture must find a balance between the approximation power of large networks and the usually higher generalization capabilities of small networks. One can distinguish between pruning and growing algorithms [39] and algorithms based on adding a complexity penalty term to the objective function, for instance, weight decay (WD) [40] or the algorithms based on Akaike's information theoretic criterion (AIC)^1 [43] or Rissanen's minimum description length (MDL) criterion [44]. In all these algorithms, complexity criteria are evaluated upon minimizing a total cost function composed of the addition of two distinctive terms, an error term plus an additional complexity-penalizing term

C_T(W, S) = C(W, S) + α C_P(W)    (24)

where C(W, S) is the cost function of the input data (e.g., the empirical risk based on the SE or the CE), and C_P(W) is the penalization term. Note that the dependency on the NN weights and the training set has been made explicit in this formula. Function C_P penalizes the excessive complexity of the network and depends exclusively on the model (weights). Variable α in (24) can be understood as a regularization parameter following Tikhonov theory [45], which weights the relative importance of both terms. The difference among model selection algorithms resides in the complexity-penalizing term C_P. It takes the following values for cross validation (CV) (defining a validation set independent of the training set), CV-WD, AIC, and MDL strategies [19]:

(25)

where N_W represents the dimension of the parameter space (i.e., the number of network weights) and n is the dimension of the input space X. The regularization parameter α may be adapted based on the analysis carried out in [12], as will be shown later in Section V.

^1 Algorithm AIC was originally developed by Akaike for linear systems, but there exist newer versions valid in nonlinear NN approaches, like the network information criterion [41], [42].

IV. A POSTERIORI PROBABILITY MODEL SELECTION (PPMS) ALGORITHM

While, by hypothesis, the number of classes is fixed and assumed known, the number of subclasses inside each class is unknown and must be estimated from samples during training: a high number of subclasses may lead to data over-fitting. The algorithm we propose to determine the GSP configuration, called the a posteriori probability model selection (PPMS) algorithm [30], belongs to the family of growing and pruning algorithms [39]: starting from a predefined architecture, subclasses are added to or removed from the network during learning according to needs. PPMS determines the number of subclasses by seeking a balance between generalization capability and learning toward minimal output errors. Although PPMS could easily be extended to other networks, like the HME [25], the formulation presented here assumes a GSP architecture. The fundamental idea behind the PPMS algorithm is the following: according to the GSP structure and its underlying probability model, the posterior probability of each class is a sum of subclass probabilities. The importance of a subclass in the sum can be measured by means of its prior probability

P_ij = P{z_ij = 1}    (26)

which can be approximated using samples in S as^2

P̂_ij = (1/K) Σ_{k=1}^{K} d_i^(k) y_ij^(k) / y_i^(k)    (27)

A small value of P̂_ij is an indicator that only a few samples are being captured by subclass j in class i, so its elimination should not affect the whole network performance in a significant way. On the contrary, a high subclass prior probability may indicate that the subclass captures too many input samples. PPMS explores this hypothesis by dividing the subclass into two halves in order to represent the data distribution more accurately. Finally, it has also been observed that, under some circumstances, weights in different subclasses of the same class tend to follow a very similar time evolution. This fact suggests a third action: merging similar subclasses into a single subclass. PPMS implements these ideas via three actions called prune, split, and merge, as follows.

1) Prune: remove a subclass by eliminating its weight vector if its a priori probability estimate is below a certain pruning threshold P_prune. That is, subclass j of class i is pruned if

P̂_ij < P_prune    (28)

2) Split: add a new subclass by splitting in two a subclass whose prior probability estimate is greater than a split threshold P_split. That is, subclass j of class i is split if

P̂_ij > P_split    (29)

Splitting subclass j is done by removing weight vector w_ij and constructing a pair of new weight vectors w_ij1 and w_ij2 such that, at least initially, the posterior probability map is approximately the same

w_ij1 = w_ij + δ,   w_ij2 = w_ij - δ,   b_ij1 = b_ij2 = b_ij - log 2    (30)

It is not difficult to show that, for δ = 0, the probability maps before and after the split are identical. In order that the new weight vectors can evolve differently during time, using a small nonzero value of δ is advisable.

3) Merge: mix or fuse two subclasses into a single one; that is, the respective weight vectors of both subclasses are fused into a single one if they are close enough. Subclasses j and l in class i are merged if ρ(w_ij, w_il) < ρ_merge, where ρ represents a distance measure and ρ_merge is the merging threshold. In our simulations, ρ was taken to be the Euclidean distance. After merging subclasses j and l, a new weight vector and bias are constructed according to

w_ij' = (w_ij + w_il) / 2,   b_ij' = log(exp(b_ij) + exp(b_il))    (31)

Note that there are three threshold parameters, P_prune, P_split, and ρ_merge, that should be fixed in some way. Also, note that P_split and P_prune implicitly state upper and lower bounds, respectively, on the number of network subclasses. In order to avoid a dependency on the threshold selection, we have made P_prune and P_split variable during learning as follows: P_split was increased each time a split operation is accomplished and, conversely, P_prune was decreased every time a pruning action takes place. They were deterministically initialized to nonrestrictive values: P_split quite big, P_prune rather small. The value of the initial thresholds has shown to be noncritical in our simulations. On the contrary, the merge distance threshold ρ_merge was fixed as a small constant value.

A. Implementation of PPMS

An iteration of the PPMS algorithm can be implemented after each M-step during learning, or after several iterations of the EM algorithm. In our simulations, we have implemented a PPMS step after each M-step, where the M-step is reduced to a single iteration of rules (22) and (23). Also, to reduce computational demands, prior probabilities are not estimated using (27), but updated iteratively as

P̂_ij^(k+1) = (1 - λ) P̂_ij^(k) + λ d_i^(k) y_ij^(k) / y_i^(k)    (32)

where 0 < λ < 1. Note that, if the prior probabilities are initially nonzero and sum up to 1, the updating rule preserves the fulfillment of these probability constraints.

^2 Since P̂_ij is computed based on samples in S, prior probability estimates based on the same sample set are biased. Therefore, priors should be estimated by averaging over a validation set different from S. In spite of this, in order to reduce the data demand, our simulations are based on only one sample set.
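The prune and split actions, together with the renormalization that keeps the running priors summing to one (the pruned mass is redistributed proportionally, as described below), can be sketched as follows. This is a sketch under our own naming and default thresholds, not the authors' implementation, and the merge action is omitted for brevity:

```python
import numpy as np

def ppms_step(W, b, P, p_prune=0.025, p_split=0.25, delta=1e-3):
    """One PPMS structural step (prune and split only).

    W[i]: (M_i, n) subclass weights of class i; b[i]: (M_i,) biases;
    P[i]: (M_i,) subclass prior estimates; priors sum to 1 over ALL (i, j).
    """
    # Prune (28): remove subclasses whose prior is below p_prune, then
    # redistribute the removed mass proportionally among the survivors.
    removed = 0.0
    for i in range(len(W)):
        keep = P[i] >= p_prune
        if keep.sum() == 0:                       # keep at least one subclass
            keep[np.argmax(P[i])] = True
        removed += P[i][~keep].sum()
        W[i], b[i], P[i] = W[i][keep], b[i][keep], P[i][keep]
    if removed > 0.0:
        for i in range(len(W)):
            P[i] = P[i] / (1.0 - removed)
    # Split (29)-(30): replace a heavy subclass by two perturbed copies whose
    # summed probability map is initially (almost) unchanged, halving its prior.
    for i in range(len(W)):
        j = int(np.argmax(P[i]))
        if P[i][j] > p_split:
            w1, w2 = W[i][j] + delta, W[i][j] - delta
            bj = b[i][j] - np.log(2.0)            # 2*exp(b - log 2) = exp(b)
            W[i] = np.vstack([np.delete(W[i], j, axis=0), w1, w2])
            b[i] = np.concatenate([np.delete(b[i], j), [bj, bj]])
            P[i] = np.concatenate([np.delete(P[i], j), [P[i][j] / 2] * 2])
    return W, b, P

# Toy 2-class GSP: subclass (0, 1) has a negligible prior and gets pruned,
# while the heavy subclasses get split; the priors still sum to one.
W = [np.zeros((2, 2)), np.zeros((1, 2))]
b = [np.zeros(2), np.zeros(1)]
P = [np.array([0.5, 0.01]), np.array([0.49])]
W, b, P = ppms_step(W, b, P)
```

The bias shift by log 2 in the split is what keeps the class probability map unchanged at δ = 0: the two copies each carry half of the original subclass's exponential weight.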


After each application of the PPMS algorithm, the prior estimates must also be updated. This is done as follows.

1) Pruning: When a subclass is removed, the a priori probability estimates of the remaining subclasses do not sum up to one. This inconsistency can be resolved by redistributing the prior estimate of the removed subclass proportionally among the rest of the subclasses. In particular, given that condition (28) is true, the new prior probability estimates P̂'_ml are computed from P̂_ml after pruning subclass j of class i as follows:

P̂'_ml = P̂_ml / (1 - P̂_ij),   (m, l) ≠ (i, j)    (33)

2) Splitting: The a priori probability of the split subclass is divided into two equal parts that go to each of the descendants. That is, if subclass j in class i is split into subclasses j1 and j2, prior estimates are assigned to the new subclasses as follows:

P̂_ij1 = P̂_ij2 = P̂_ij / 2    (34)

3) Merging: Probability vectors are modified in accordance with the probability constraints (1). Formally speaking, if subclasses j and l in class i are merged into subclass j', then

P̂_ij' = P̂_ij + P̂_il    (35)

B. Performance Measures

In [10], [11], and [30], it is shown that any SSB cost function can be related to an SSB divergence measure between the true posterior probabilities p and their estimates y by means of

D(p, y) = h(y) - h(p) + (p - y)^T ∇_y h(y)    (36)

where h is the entropy measure that appears in (2). In particular, for the CE cost function, D becomes

D(p, y) = Σ_{i=0}^{L-1} p_i log (p_i / y_i)    (37)

which is the KL divergence. Therefore, the average value of the SSB divergence is a natural dissimilarity measure between true and estimated posterior probability maps. Whenever possible, it will be used as a performance value in our simulations. To make comparisons between different estimation algorithms using real data, D cannot usually be computed, since the true posterior p is unknown. However, since the term depending only on p is independent of the network, it can be omitted. Therefore, under these circumstances, comparisons can be carried out by averaging the CE cost function over a test set.

V. RESULTS

In this section, we apply PPMS to a GSP NN. We present five experiments: a visual example computing probability maps (Experiment 1); a numerical experiment over synthetically generated Gaussian input data (Experiment 2), in order to control all the parameters of the input distribution and be able to compute the true a posteriori probabilities for each class; two numerical experiments over real high-dimensional data sets (Experiments 3 and 4); and a final numerical experiment with synthetic data to compare PPMS with other schemes (Experiment 5). We give numerical comparison results in terms of the average KL divergence, the CE cost, or error rates over the test set after training, together with network complexity estimates. In doing so, we try to objectively compare PPMS with other algorithms over a GSP architecture using the learning rules defined in (22) and (23). We compare PPMS with three classical regularization criteria from information theory that make a full search over network sizes (CV-WD, AIC, and MDL) and, in terms of the error rate, with neural network ensembles (NNEs), including majority voting (MV) [46], ensemble averaging (EA) [47], and the HME [25].

A. Experiment 1: Probability Maps Over Synthetic Data

In Experiment 1, true posterior probability surfaces and decision boundaries are visually compared with those estimated by PPMS. A set of 100 000 bidimensional samples from three equally likely classes was generated according to the following Gaussian mixture model:

p(x | d = u_i) = (1/2) [G(x - μ_i1) + G(x - μ_i2)]    (38)

where i is the class index, L = 3, and G is a zero-mean Gaussian pdf with variance matrix σ²I. Mean vectors μ_ij were chosen according to a uniform distribution inside a two-dimensional (2-D) square, resulting in the six mean vectors (12.37, 1.91), (0.61, 8.94), (9.70, 4.20), (7.79, 7.99), (8.17, 4.40), and (7.85, 10.25). As Fig. 2 shows, the classes are highly overlapped in this example. Threshold values for PPMS were chosen as P_prune = 0.025, P_split = 0.250, and ρ_merge = 0.100. This results in a minimum of four kernels and a maximum of 40 kernels for all classes. After training, PPMS converged to a network with two subclasses per class. By averaging, prior probabilities for each subclass were estimated as shown in Table I. As an example, in Fig. 3 we visually compare the true posterior probability map with that obtained with PPMS for the first class, exhibiting a high match. Finally, Fig. 4 shows the optimal Bayes decision regions and those estimated with PPMS, again with an apparent match.

B. Experiment 2: Numerical Results Over Synthetic Data

In this section, we compute the KL divergence (37) obtained via the PPMS, CV, AIC, and MDL model selection strategies

Fig. 2. Experiment 1. Probability maps over input Gaussian mixture class samples and decision boundaries obtained with PPMS.

TABLE I. EXPERIMENT 1: PROBABILITY MAPS. ESTIMATED AND TRUE VALUES FOR PRIOR (SUBCLASS) PROBABILITIES WITH SYNTHETIC DATA.

Fig. 4. Experiment 1. Probability maps. Estimated (PPMS) versus optimal (Bayes) decision regions; first class in black, second in grey, and third in white.

Fig. 3. Experiment 1. Probability maps. Estimated (PPMS) versus Optimal (Bayes) a posteriori probability surface (class 1).

over synthetically generated data. Synthetic data allow us to theoretically determine the true posterior probabilities and, hence, to compute KL divergences. We have selected here a representative numerical simulation with the following characteristics: data from L = 3 equally likely classes were generated according to the probability model

in (38). Mean vectors were randomly generated according to a uniform distribution in an input square. A validation set was used to select the posterior probability model using the CV, AIC, and MDL strategies. Note that the PPMS algorithm might be penalized (at least regarding the mean values) if we do not carry out validation: with the stochastic gradient descent rules (22) and (23), the network weights are initialized at random values, so convergence to global minima is not guaranteed at all, and convergence to local minima might be an important factor to consider. In that respect, the mean value for PPMS might be inferior to that of the rest of the algorithms because, although the same gradient-based stochastic descent algorithm and learning rule are used in all cases, the use of validation sets in CV, AIC, and MDL (N different runs for validation purposes) greatly filters out the possibility of convergence to local minima. In spite of this, the experiment in this section shows that a reduced number of training repetitions provides PPMS the ability to find better solutions than other algorithms making a full search over all possible network configurations. In this numerical experiment, training is repeated N = 8 times for all algorithms, and the configuration achieving the minimum value of the average CE cost over the training set is selected. 400 input training samples were used for PPMS, and they were split into 200 training samples and 200 validation samples for CV, AIC, and MDL. Finally, 200 samples were used for testing. All CV-based methods explored all network configurations starting from one up to five subclasses per class


TABLE II EXPERIMENT 2 OVER SYNTHETIC DATA: KULLBACK–LEIBLER DIVERGENCE D FOR CV, AIC, MDL, AND PPMS. TEST SET, 25 SIMULATIONS AVERAGED, N = 8

TABLE IV EXPERIMENT 3 OVER UCI WINE DATABASE: CE COST FUNCTION C FOR CV, AIC, MDL, AND PPMS. TEST SET, 40 SIMULATIONS AVERAGED, N = 3

TABLE III. EXPERIMENT 2 OVER SYNTHETIC DATA: MEAN NETWORK COMPLEXITY ESTIMATES. 25 SIMULATIONS AVERAGED, N = 8.

TABLE V. EXPERIMENT 3 OVER UCI WINE DATABASE: MEAN NETWORK COMPLEXITY ESTIMATES. 40 SIMULATIONS AVERAGED, N = 8.


(a total of 5 × 5 × 5 = 125 different network complexity configurations were validated). The best structure, based on the results over the validation set, was selected. Here, 25 simulations were carried out, and minimum, mean, and median values were computed. The PPMS thresholds were initialized to 0.20, 0.025, and 0.1. This results in a minimum total initial number of 5 and a maximum of 40 initial kernels for the three classes, so the threshold values are not restrictive at all. The prune and split thresholds were modified each time a pruning or splitting operation took place, decreased by a factor of 0.95 and increased by a factor of 1.01, in an attempt to automate the PPMS algorithm. Also, the regularization parameters could vary during learning, being increased or decreased (with a typical drastic reduction factor of 95%) depending on which of the following three possible cases holds [12], where n is the iteration index, C is the total cost function as defined in (24), and C_max is the maximum admissible total cost.

1) If C(n) < C(n-1) and/or C(n) < C_max, then the regularization parameter is increased (increasing the influence of the complexity penalizing term).

2) If C(n) ≥ C(n-1) but C(n) < C̄(n), i.e., the cost increases but remains below the average total cost C̄(n), computed iteratively as

C̄(n) = γ C̄(n-1) + (1 - γ) C(n)   (39)

then the regularization parameter is slightly decreased (increasing the importance of the cost term).

3) If C(n) ≥ C(n-1) and C(n) ≥ C̄(n), then the regularization parameter is drastically reduced.

Also, the WD pruning algorithm [40], (25), called CV-WD in what follows, was used in the comparison. Table II shows that PPMS outperforms the other schemes in terms of the KL divergence evaluated over the test set. We observe a slightly better performance for CV-WD than for PPMS regarding the minimum value of the KL divergence, but needless to say, the computational complexity of PPMS is well below that of the CV-based algorithms CV-WD, AIC, and MDL. Finally, Table III shows the mean network complexity estimates, measured as the average number of subclasses per class.
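One step of the three-case regularization schedule described above can be sketched as follows. This is a minimal sketch under assumptions: the concrete update factors (`inc`, `dec`, `cut`) and the exact comparison directions are illustrative choices in the spirit of the adaptive weight-decay heuristic of [12], not the authors' exact values.

```python
def adapt_regularization(lam, cost, prev_cost, avg_cost, c_max,
                         gamma=0.99, inc=1.01, dec=0.95, cut=0.05):
    """One cost-driven update of a regularization parameter.

    lam       : current regularization parameter
    cost      : total cost C(n) at the current iteration
    prev_cost : C(n-1)
    avg_cost  : running average C_bar(n-1) of the total cost
    c_max     : maximum admissible total cost
    Returns (new_lam, new_avg_cost).
    """
    # (39): iterative average of the total cost
    avg = gamma * avg_cost + (1.0 - gamma) * cost
    if cost < prev_cost or cost < c_max:
        lam *= inc   # case 1: strengthen the complexity penalizing term
    elif cost < avg:
        lam *= dec   # case 2: cost grew, but stays below its running average
    else:
        lam *= cut   # case 3: cost grew above average; drastic reduction
    return lam, avg
```

The call would be placed once per iteration inside the training loop, feeding back the returned running average.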


C. Experiment 3 and Experiment 4: Numerical Results Over Real Data

In this section, we use two databases for classification problems: the Wine database and the Wisconsin Diagnostic Breast Cancer (WDBC) database, from the UCI repository.³ In these examples (Experiments 3 and 4), training was again repeated 8 times during the learning phase in order to reduce local convergence problems. In both real data examples, CV alone showed some generalization difficulties, especially with the WDBC database, possibly due to the fact that it does not use a complexity penalizing term, showing some tendency to overestimate the network dimension. Both databases are small and high dimensional. In Table IV, corresponding to Experiment 3, we measure CE for the CV, AIC, MDL, and PPMS algorithms over the Wine database. The database has 59 samples for training, 59 for testing, and 59 for validation⁴ (for CV, AIC, and MDL), for a total of 177 twelve-dimensional input samples, each belonging to one of three possible wine varieties (59 cases in the first class, 71 in the second, and 48 in the third). One can observe the superior convergence performance of CE for PPMS, with a slight increase in variance, as the convergence of AIC and MDL remains basically constant over all 40 simulations. CV showed clear generalization problems, a situation also present in the WDBC database. By examining the misclassification errors in detail, we observed that AIC, MDL, and PPMS produced approximately the same number of mistakes over the same samples, obtaining average numbers of errors of 3.17, 5.00, and 5.00 over 59 test samples; each sample in the training set entered only once, that is, neither repetition of samples nor leave-one-out strategies were used. We averaged over 40 runs. Finally, in Table V we show the average network complexities.
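The exhaustive validation search performed by the CV, AIC, and MDL baselines, trying every subclass configuration from (1, 1, 1) up to (5, 5, 5), can be sketched as follows. `train_and_validate` is a hypothetical callback standing in for one full training run plus validation-set scoring.

```python
from itertools import product

def exhaustive_search(train_and_validate, n_classes=3, max_subclasses=5):
    """Try every per-class subclass count, e.g., (1,1,1) ... (5,5,5),
    and keep the configuration with the lowest validation score."""
    best_score, best_cfg = float("inf"), None
    for cfg in product(range(1, max_subclasses + 1), repeat=n_classes):
        score = train_and_validate(cfg)     # hypothetical: train + validate
        if score < best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score
```

With three classes and up to five subclasses each, this loop trains and validates 5³ = 125 networks, which is precisely the computational burden that PPMS avoids by growing and pruning a single network during learning.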
In Table VI, corresponding to Experiment 4, we measure CE for AIC, MDL, and PPMS (CV presented severe generalization problems, as mentioned previously) over the WDBC database, consisting of 2 classes (malignant, 212 cases, or benign,

³http://www.ics.uci.edu/~mlearn/MLRepository.html
⁴To validate a total of 5 × 5 × 5 = 125 different network complexities, exhaustively from complexity (1, 1, 1) up to (5, 5, 5) subclasses in each class.


TABLE VI EXPERIMENT 4 OVER UCI WISCONSIN DIAGNOSTIC BREAST CANCER (WDBC) DATABASE: CE COST FUNCTION C FOR AIC, MDL, AND PPMS. TEST SET, 40 SIMULATIONS AVERAGED, N = 8


TABLE VIII EXPERIMENT 5 OVER SYNTHETIC DATA: ERRORS IN DIFFERENT CNN-NNE FUSION ENSEMBLES (MV, EA, HME) AND VARIOUS CLASSICAL CV-BASED REGULARIZATION APPROACHES (CV, MDL, AIC) VERSUS PPMS (WORST, BEST, AVERAGE, AND ENSEMBLE). TEST SET, 50 SIMULATIONS AVERAGED, N = 8; N.A. IS AN ABBREVIATION OF NOT APPLICABLE

TABLE VII EXPERIMENT 4 OVER UCI WISCONSIN DIAGNOSTIC BREAST CANCER (WDBC): MEAN NETWORK COMPLEXITY ESTIMATES W. 40 SIMULATIONS AVERAGED, N = 8


355 cases), with 30-dimensional real-valued input features. This database has 189 training samples (one third of the total, each passed only once), 189 test samples, and another 189 validation samples⁵ for the AIC and MDL model selection algorithms, totaling 567 input samples, averaged over 40 simulations. Observing Table VI, once more PPMS is ahead of its partners. Again, AIC, MDL, and PPMS produced basically the same number of mistakes over approximately the same outlier input samples, obtaining mean numbers of errors of 8.35, 9.17, and 9 over 189 test samples. Table VII shows the average network complexities for this experiment.

D. Experiment 5: PPMS Versus HME and Other Neural Information Fusion Ensembles

We start by comparing the performance of the in-tandem GSP-PPMS algorithm presented here with HME and other ensemble strategies. We define a new data set of high-overlap, binary, three-dimensional (3-D) Gaussian mixture data, obviously the same for all the algorithms under comparison. In Experiment 5, we compare MV [46], EA [47], HME [25], CV, Akaike's AIC [43], and Rissanen's MDL [44] with PPMS. There are several methods for designing NN ensembles (NNEs). In general, they can be classified into two categories: static and dynamic data fusion [12]. Static fusion methods, like MV and EA, do not acquire any information from the input data. In MV, the winning decision class is the one most often chosen by the different NN components. In the EA approach, the outputs of the different NNE components are linearly fused to form an overall output. On the other side, in dynamic architectures the fusion units integrating the outputs of the NNE components that take part in the overall output are actuated by the input data; the HME can be considered a dynamic fusion ensemble in that sense. We passed the data only once through all algorithms, so we used only 1 epoch, but the process was repeated 8 times for each algorithm, and the best run was selected over the learning data.
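The two static fusion rules just described can be sketched as follows. This is a minimal illustration: the component networks are abstracted away, leaving only their hard decisions for MV and their posterior vectors for EA.

```python
from collections import Counter

def majority_vote(decisions):
    """MV: the winning class is the one most often chosen by the
    component neural networks (CNNs)."""
    return Counter(decisions).most_common(1)[0][0]

def ensemble_average(posterior_sets, weights=None):
    """EA: linearly fuse the component posteriors into an overall output.
    With no weights given, a plain average is used."""
    n = len(posterior_sets)
    weights = weights or [1.0 / n] * n
    fused = [0.0] * len(posterior_sets[0])
    for w, post in zip(weights, posterior_sets):
        for k, p in enumerate(post):
            fused[k] += w * p
    return fused
```

Neither rule inspects the input sample itself, which is what makes them static; a dynamic scheme such as HME would instead make the fusion weights functions of the input.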
Results were averaged over 50 independent runs

⁵To validate a total of 8 × 8 = 64 network configurations, from C = (1, 1) up to C = (8, 8) subclasses in each class.

to achieve statistically significant results; the measured errors are reported in Table VIII. All algorithms but PPMS used a validation set. In the MV approach there are 300 validation samples for each component neural network (CNN) and, since the architecture comprises a total of 3 CNNs, the total number of validation samples is 900. In the EA approach, there are 900 validation samples; in the HME approach there are 300 validation samples for each of the 3 CNNs, again totaling 900 validation samples. For the rest of the nonensemble CV-based procedures, that is, CV, MDL, and AIC, we also defined a validation set of 900 samples apart from the training and test sets (2000 samples), adding up to a total of 3800 samples in all cases but PPMS. On the other hand, for PPMS we simply defined training and test sets of 900 and 2000 samples, respectively, so PPMS uses only 2900 different samples because no validation set is required.

Analyzing the results displayed in Table VIII, we quickly notice that the ensembles achieve the highest worst-error and the lowest best-error values, a direct consequence of using a number of NNs that hopefully cooperate with one another. Looking further into Table VIII, the best average (or ensemble, when applicable) results are obtained by EA, AIC, and PPMS, in that order, but all of them very close to each other, with the advantage for PPMS that it uses a smaller number of samples because no cross validation is ever done. However, the highest best-case results are obtained by some configurations of HME, MV, and EA, and the best (lowest) worst case is obtained once more by PPMS, closely matched by MDL and CV alone. We therefore conclude that PPMS presents small differences between best and worst cases, a fact that can be interpreted as a high consistency of the proposed algorithm, while the best among the best partial results (best CNN) is obtained by the MV and EA ensembles.
VI. CONCLUSION

Now that the proposed algorithm has been presented and tested against other approaches, we would like to emphasize several advantages found in PPMS:
• PPMS does not modify the cost function, as methods with a complexity penalty or regularization term do. Thus, we could say that the PPMS model selection algorithm has been designed ad hoc to estimate posterior probabilities in multiple-hypothesis problems.





• PPMS automatically and dynamically selects the optimal number of neurons during the learning phase. The model complexity is determined at the same time that the posterior probability estimates are obtained.
• For a given data set, PPMS exploits the data better, since it does not need a separate validation set, which results in more accurate probability estimates.
• PPMS has a noticeably reduced computational complexity in comparison with CV-based methods, because in the latter one has to carry out an exhaustive network training for different network sizes, then validate, and finally start the test phase over the winning architecture.
• PPMS finds better results and estimates than its partners (similar to those of the CV-WD and AIC regularization criteria, and to EA in ensemble classification error rates), at least regarding the mean and minimum values of the KL divergence and the CE cost function, with a slight increase in variance due to the fact that no validation phase is present in PPMS.

There remain several open lines for future work. The use of PPMS in combination with other cost functions must be explored. Quantifying the rate of convergence of CE, SE, and other SSB cost functions, as well as their robustness to outliers, can make the selection of the cost function a crucial issue. Today, we continue working on applications of the PPMS algorithm to medical problems, like microcalcification detection in breast cancer diagnosis [48] and bone age assessment in childhood. Finally, we are also working toward extending the PPMS algorithm to work with other NN architectures and neural ensembles [49].

ACKNOWLEDGMENT

J. I. Arribas would like to thank Dr. L. del Pozo and Dr. E. Llamas for reviewing and commenting on the manuscript, and Dr. T. Adali, University of Maryland at Baltimore County, Baltimore, MD, for her enlightening discussion. The authors would like to thank Y. Wu, Beijing University of Posts and Telecommunications, Beijing, China; S.
Aeberhard, Institute of Pharmaceutical and Food Analysis, Genoa, Italy; and W. N. Street, Department of Computer Science, University of Wisconsin, Madison, WI, for providing part of the data. The authors would also like to thank the reviewers for their useful comments and interesting suggestions for further work.

REFERENCES

[1] H. V. Trees, Detection, Estimation and Modulation Theory. New York: Wiley, 1968.
[2] J. Meditch, Stochastic Optimal Linear Estimation and Control. New York: McGraw-Hill, 1969.
[3] D. Ruck, S. Rogers, M. Kabrisky, M. Oxley, and B. Suter, "The multilayer perceptron as an approximation to a Bayes optimal discriminant function," IEEE Trans. Neural Netw., vol. 1, no. 4, pp. 296–298, Dec. 1990.
[4] E. Wan, "Neural network classification: A Bayesian interpretation," IEEE Trans. Neural Netw., vol. 1, no. 4, pp. 303–305, Dec. 1990.
[5] R. Gallager, Information Theory and Reliable Communication. New York: Wiley, 1968.
[6] E. Baum, "Supervised learning of probability distributions by neural networks," Amer. Inst. Phys., pp. 52–61, 1988.

[7] A. El-Jaroudi and J. Makoul, "A new error criterion for posterior probability estimation with neural nets," in Proc. Int. Joint Conf. Neural Networks, San Diego, CA, 1990, pp. 185–192.
[8] J. Miller, R. Godman, and P. Smyth, "On loss functions which minimize to conditional expected values and posterior probabilities," IEEE Trans. Inf. Theory, vol. 39, no. 4, pp. 1404–1408, Jul. 1993.
[9] M. Saerens, "Building cost functions minimizing to some summary statistics," IEEE Trans. Neural Netw., vol. 11, no. 6, pp. 1263–1271, Nov. 2000.
[10] J. Cid-Sueiro, J. I. Arribas, S. Urbán-Munoz, and A. R. Figueiras-Vidal, "Cost functions to estimate a posteriori probability in multiclass problems," IEEE Trans. Neural Netw., vol. 10, no. 3, pp. 645–656, May 1999.
[11] J. Cid-Sueiro and A. R. Figueiras-Vidal, "On the structure of strict sense Bayesian cost functions and its applications," IEEE Trans. Neural Netw., vol. 12, no. 3, pp. 445–455, May 2001.
[12] S. Haykin, Neural Networks: A Comprehensive Foundation. Englewood Cliffs, NJ: Prentice-Hall, 1994.
[13] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford, U.K.: Clarendon, 1995.
[14] J. Moody and J. Utans, "Architecture selection strategies for neural networks: Application to corporate bond rating prediction," in Neural Networks in the Capital Markets. New York: Wiley, 1994.
[15] J. Moody and V. Cherkassky, "Prediction risk and architecture selection for neural networks," in From Statistics to Neural Networks: Theory and Pattern Recognition Applications, ser. NATO ASI Series F, J. Friedman and H. Wechsler, Eds. New York: Springer-Verlag, 1994.
[16] A. S. Weigend, D. E. Rumelhart, and B. A. Huberman, "The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems," in Advances in Neural Information Processing Systems, J. Moody, S. Hanson, and R. P. Lippmann, Eds. San Mateo, CA: Morgan Kaufmann, 1992, vol. 4, pp. 875–882.
[17] J. Larsen, C. Svarer, L. Andersen, and L. Hansen, "Adaptive regularization in neural network modeling," in Neural Networks: Tricks of the Trade, G. Orr and K. Muller, Eds. Berlin, Germany: Springer-Verlag, 1998, Lecture Notes in Computer Science 1524.
[18] L. Hansen and J. Larsen, "Linear unlearning for cross-validation," Adv. Computat. Math., vol. 5, pp. 269–280, 1996.
[19] U. Anders and O. Korn, "Model selection in neural networks," Neural Netw., vol. 12, pp. 309–323, 1999.
[20] V. Ramamurti and J. Ghosh, "Structurally adaptive modular networks for non-stationary environments," 1997. [Online]. Available: citeseer.nj.nec.com/192756.html
[21] J. L. Alba, L. Docio, D. Docampo, and O. W. Márquez, "Growing Gaussian mixtures network for classification applications," Signal Process., vol. 76, no. 1, pp. 43–60, 1999.
[22] N. Ueda, R. Nakano, Z. Ghahramani, and G. Hinton, "Split and merge EM algorithm for improving Gaussian mixture density estimates," J. VLSI Signal Process. Syst., 1999.
[23] N. Ueda and R. Nakano, "EM algorithm with split and merge operations for mixture models," IEICE Trans. Inf. Syst., vol. E83-D, no. 12, 2000.
[24] R. Jacobs, M. Jordan, S. Nowlan, and G. Hinton, "Adaptive mixtures of local experts," Neural Computat., vol. 3, pp. 79–87, 1991.
[25] M. I. Jordan and R. A. Jacobs, "Hierarchical mixtures of experts and the EM algorithm," Neural Computat., vol. 6, no. 2, pp. 181–214, 1994.
[26] L. Xu, M. I. Jordan, and G. E. Hinton, "An alternative model for mixtures of experts," in Advances in Neural Information Processing Systems, G. Tesauro, D. Touretzky, and T. Leen, Eds. Cambridge, MA: MIT Press, 1995, vol. 7, pp. 633–640.
[27] J. Fritsch, M. Finke, and A. Waibel, "Adaptively growing hierarchical mixture of experts," in Advances in Neural Information Processing Systems. San Mateo, CA: Morgan Kaufmann, 1994, vol. 6, pp. 459–465.
[28] S. R. Waterhouse and A. J. Robinson, "Constructive algorithms for hierarchical mixtures of experts," in Advances in Neural Information Processing Systems, D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, Eds. Cambridge, MA: MIT Press, 1996, vol. 8, pp. 584–590.
[29] J. I. Arribas, J. Cid-Sueiro, T. Adali, and A. R. Figueiras-Vidal, "Neural architectures for parametric estimation of a posteriori probabilities by constrained conditional density functions," in Neural Networks for Signal Processing IX, Y. Hu, J. Larsen, E. Wilson, and S. Douglas, Eds. Madison, WI: IEEE Signal Process. Soc., 1999, pp. 263–272.
[30] J. I. Arribas, "Redes neuronales para la estimación de probabilidades a posteriori: Estructuras y algoritmos," Ph.D. dissertation, Elect. Eng. Dept., Univ. Valladolid, Valladolid, Spain, 2001.
[31] A. Guerrero-Curieses, J. Cid-Sueiro, R. Alaiz-Rodríguez, and A. R. Figueiras-Vidal, "Local estimation of posterior class probabilities to minimize classification errors," IEEE Trans. Neural Netw., vol. 15, no. 2, pp. 309–317, Mar. 2004.


[32] S. Amari, "Backpropagation and stochastic gradient descent method," Neurocomput., vol. 5, pp. 185–196, 1993.
[33] B. A. Telfer and H. H. Szu, "Energy functions for minimizing misclassification error with minimum-complexity networks," Neural Netw., vol. 7, no. 5, pp. 809–818, 1994.
[34] B. Wittner and J. Denker, "Strategies for teaching layered neural networks classification tasks," in Neural Information Processing Systems, W. Oz and M. Yannakakis, Eds., 1988, vol. 1, pp. 850–859.
[35] I. Mora-Jiménez and J. Cid-Sueiro, "A universal learning rule that minimizes well-formed cost functions," IEEE Trans. Neural Netw., 2005, to be published.
[36] T. Adali and H. Ni, "Partial likelihood for signal processing," IEEE Trans. Signal Process., vol. 10, no. 1, pp. 204–212, Jan. 2003.
[37] K. Chen, L. Xu, and H. Chi, "Improved learning algorithm for mixture of experts in multiclass classification," Neural Netw., no. 12, pp. 1229–1252, Jun. 1999.
[38] V. Vapnik, "An overview of statistical learning theory," IEEE Trans. Neural Netw., vol. 10, no. 5, pp. 988–999, Sep. 1999.
[39] R. Reed, "Pruning algorithms—A survey," IEEE Trans. Neural Netw., vol. 4, no. 5, pp. 740–747, Sep. 1993.
[40] G. Hinton, "Connectionist learning procedures," Artif. Intell., vol. 40, pp. 185–235, 1989.
[41] N. Murata, S. Yoshizawa, and S. Amari, "Network information criterion—Determining the number of hidden units for an artificial neural network model," IEEE Trans. Neural Netw., vol. 5, no. 6, pp. 865–872, Nov. 1994.
[42] S. Amari, N. Murata, K. R. Muller, M. Finke, and H. H. Yang, "Asymptotic statistical theory of overtraining and cross-validation," IEEE Trans. Neural Netw., vol. 8, no. 5, pp. 985–996, Sep. 1997.
[43] H. Akaike, "A new look at the statistical model identification," IEEE Trans. Autom. Control, vol. AC-19, no. 6, pp. 716–723, Dec. 1974.
[44] J. Rissanen, "A universal prior for integers and estimation by minimum description length," Ann. Statist., vol. 11, no. 2, pp. 416–431, 1983.
[45] A. Tikhonov, "On solving incorrectly posed problems and method of regularization," Doklady Akademii Nauk, vol. 151, pp. 501–504, 1963.
[46] N. Wanas and M. Kamel, "Decision fusion in neural network ensembles," in Proc. Int. Joint Conf. Neural Networks, vol. 4, 2001, pp. 2952–2957.
[47] D. Opitz and J. Shavlik, "Actively searching for an effective neural network ensemble," Connect. Sci., vol. 8, pp. 337–353, 1996.


[48] J. I. Arribas, J. Cid-Sueiro, and C. Alberola-López, "Estimation of posterior probabilities with neural networks: An application to microcalcification detection in breast cancer diagnosis," in Neural Signal Processing and Modeling, M. Akay, Ed. New York: Wiley, 2005.
[49] Y. Wu and J. I. Arribas, "Fusing output information of neural networks: Ensemble performs better," in Proc. 25th Annu. Int. Conf. Engineering in Med. and Biol. Society, Cancún, México, 2003, pp. 2265–2268.

Juan Ignacio Arribas received the Ph.D. degree in electrical engineering from the Universidad de Valladolid, Valladolid, Spain, in 2001. In 1996, he was awarded an FPI Research Fellowship from the Spanish Ministry of Science, becoming a Visiting Research Associate at the University of Maryland Baltimore County, MD, during 1998. In 1999, he joined the Department of Electrical Engineering, Universidad de Valladolid, where he has been an Associate Professor since 2004. His research interests include learning machines and their applications to medical image processing.

Jesús Cid-Sueiro (M'95) received the Telecommunication Engineer degree from the Universidad de Vigo, Vigo, Spain, in 1990, and the Ph.D. degree from the Universidad Politécnica de Madrid, Madrid, Spain, in 1994. Since 1999, he has been an Associate Professor in the Department of Signal Theory and Communications, Universidad Carlos III de Madrid, Madrid, Spain. His main research interests include statistical learning theory, neural networks, Bayesian methods and their applications to communications, multimedia signal processing, and education.
