A Comparative Evaluation of Sequential Constructive Methods

Marco Muselli
Istituto per i Circuiti Elettronici - CNR
via De Marini, 6 - 16149 Genova, Italy

Abstract

A detailed comparison of the most promising algorithms in the class of sequential constructive methods is performed on artificial benchmarks and real-world problems. Four of them incrementally build the hidden layer of the resulting neural network by successively adding threshold units; the fifth method employs neurons with a window-shaped activation function. The performances obtained with each technique, and their comparison with the widely used back-propagation algorithm, allow a thorough examination of the characteristics of these sequential constructive methods. In particular, the trained neural networks show good generalization ability even when the learning time is very low.

Keywords: Supervised learning, constructive methods, sequential learning, performance comparison, generalization ability.

1 Introduction

The solution of most real-world problems through the back-propagation algorithm [1] entails a high computational cost; in many cases this prevents us from analyzing training sets containing a sufficient number of samples. Consequently, the resulting neural network is often unable to capture the underlying input-output function associated with the given problem.

Another limitation inherent in the back-propagation algorithm is the need to choose a priori the architecture of the multilayer perceptron to be trained. Determining the number of layers and the number of nodes in each layer requires additional trials that increase the total learning time spent by the algorithm.

When dealing with classification problems, a possible way to overcome these difficulties is provided by the class of sequential constructive methods: they successively add neurons to the hidden layer until all the input-output pairs contained in the given training set are satisfied. The insertion of a new neuron does not require the updating of the whole weight matrix, but only the training of the connections associated with the unit to be added.

Although the convergence properties of sequential constructive methods have been theoretically proved elsewhere [2], no detailed comparison has yet been presented that establishes the actual performance of this class of algorithms. This paper tries to fill this gap by examining the five most promising sequential constructive methods and by executing a series of common comparative trials. Furthermore, the application of the back-propagation algorithm to the same set of benchmarks allows a correct evaluation of the results obtained.

2 Training algorithms considered

Sequential constructive methods essentially differ in the algorithm employed for the addition of a new hidden neuron. In fact, the weights of the output layer are obtained through the application of simple algebraic equations and therefore do not need a proper learning phase. The following definition plays a fundamental role in the description of the algorithms examined.

Definition 1. Let Q+ and Q− be two subsets of the n-dimensional input space D, containing a finite number of patterns, q+ and q− respectively. A neuron is called a partial classifier if it provides output +1 for all the patterns included in a non-empty subset R of Q+ and output −1 for all the elements of Q−.

Neural networks generated by sequential constructive methods contain partial classifiers in the hidden layer. Thus, at each addition, the main goal is to build a partial classifier that provides the correct output for the greatest number of patterns of Q+ and Q−, so as to minimize the number of connections in the final neural network. Since achieving this result can require the solution of an NP-complete problem, sequential constructive methods only try to obtain near-optimal partial classifiers that approximate the desired configuration.

We now give a brief description of the training algorithms considered in the comparative evaluation. Four of them employ threshold neurons for the construction of the hidden layer; their activation function is given by

\[
\varphi(x) = \operatorname{sgn}\!\left(\sum_{i=0}^{n} w_i x_i\right) =
\begin{cases}
+1 & \text{if } \sum_{i=0}^{n} w_i x_i \ge 0 \\
-1 & \text{otherwise}
\end{cases}
\qquad (1)
\]

As usual, we have added a component x0 = +1 to the input pattern x so as to include the bias in the weights of the neuron. The fifth method builds partial classifiers having the following window-shaped activation function

\[
\psi(x) =
\begin{cases}
+1 & \text{if } \left|\sum_{i=0}^{n} w_i x_i\right| \le \delta \\
-1 & \text{otherwise}
\end{cases}
\qquad (2)
\]

where the (small) real quantity δ is called amplitude and is meaningful only from an implementation point of view.
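For concreteness, the two activation functions (1) and (2) can be written as short Python routines. This is only a minimal sketch: the bias component x0 = +1 is assumed to be already prepended to the input pattern, and the amplitude δ is passed explicitly.

```python
import numpy as np

def threshold_output(w, x):
    """Threshold activation (1): +1 if the weighted sum is non-negative, -1 otherwise.
    The pattern x is assumed to already contain the bias component x[0] = +1."""
    return 1 if np.dot(w, x) >= 0 else -1

def window_output(w, x, delta=1e-3):
    """Window-shaped activation (2): +1 if the weighted sum falls inside the
    window [-delta, +delta], -1 otherwise."""
    return 1 if abs(np.dot(w, x)) <= delta else -1

w = np.array([0.5, 1.0, -1.0])   # [bias weight, w1, w2]
x = np.array([1.0, 0.3, 0.7])    # x[0] = +1 is the bias input
print(threshold_output(w, x), window_output(w, x))
```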

2.1 Irregular Partitioning Algorithm (IPA)

In this method [3] a subset R of largest possible size is incrementally built by successively adding to it the element x ∈ Q+ that maximizes the number of input-output pairs satisfied by the current partial classifier. It is then necessary to employ at every iteration a proper algorithm that verifies the existence of a threshold neuron separating R and Q−. A possible choice is to adopt the Thermal Perceptron Learning Rule (TPLR) [4], because of its high convergence speed.
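A possible skeleton of this greedy growth of R is sketched below, under several assumptions: the thermal perceptron is replaced by an ordinary perceptron routine acting as a stand-in for the separability test, and the helper names (train_threshold_neuron, count_satisfied) are illustrative rather than taken from [3].

```python
import numpy as np

def train_threshold_neuron(R, Q_minus, epochs=200):
    """Stand-in for TPLR [4]: plain perceptron updates searching for weights that give
    +1 on every pattern of R and -1 on every pattern of Q_minus.
    Returns the weight vector (bias included), or None if no separation was found."""
    X = np.hstack([np.ones((len(R) + len(Q_minus), 1)), np.vstack([R, Q_minus])])
    y = np.array([1] * len(R) + [-1] * len(Q_minus))
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:
                w += yi * xi
                errors += 1
        if errors == 0:
            return w
    return None

def count_satisfied(w, Q_plus, Q_minus):
    """Number of training pairs correctly handled by the current partial classifier."""
    out = lambda x: 1 if np.dot(w, np.concatenate(([1.0], x))) >= 0 else -1
    return sum(out(x) == 1 for x in Q_plus) + sum(out(x) == -1 for x in Q_minus)

def build_partial_classifier_ipa(Q_plus, Q_minus):
    """Greedy IPA-style growth of R: at each step add the element of Q+ for which a
    separating threshold neuron still exists and the number of satisfied pairs is largest."""
    remaining = [np.asarray(p, dtype=float) for p in Q_plus]
    R, w_R = [], None
    while remaining:
        best_p, best_w, best_score = None, None, -1
        for p in remaining:
            w = train_threshold_neuron(np.array(R + [p]), np.asarray(Q_minus, dtype=float))
            if w is None:
                continue                       # adding p would break separability
            score = count_satisfied(w, Q_plus, Q_minus)
            if score > best_score:
                best_p, best_w, best_score = p, w, score
        if best_p is None:
            break                              # no further pattern of Q+ can be absorbed
        R.append(best_p)
        w_R = best_w
        remaining = [p for p in remaining if not np.array_equal(p, best_p)]
    return np.array(R), w_R
```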

2.2 Carve Algorithm (CA)

A different approach to the construction of a partial classifier for two sets of input patterns Q+ and Q− is to consider the convex hull generated by the elements of Q−. In fact, it can be observed that only the points of Q+ lying outside this convex hull can be separated from the elements of Q− by a hyperplane. This is the basic idea of CA, but its direct application can lead to a total execution time which is exponential in the number n of inputs. A near-optimal result can be obtained much faster by considering only a restricted number of faces of the convex hull; this can be done by applying a proper algorithm with polynomial complexity [5].
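The geometric observation behind CA can be illustrated with a small sketch: the patterns of Q+ lying strictly outside the convex hull of Q− are exactly those that a hyperplane can cut off. The sketch below identifies them through the facet inequalities returned by scipy; the polynomial-time face-selection strategy of [5] is not reproduced here.

```python
import numpy as np
from scipy.spatial import ConvexHull

def points_outside_hull(Q_plus, Q_minus, tol=1e-9):
    """Return the patterns of Q+ lying strictly outside the convex hull of Q-.
    scipy stores each facet as [normal, offset]; a point x is inside the hull
    iff normal . x + offset <= 0 for every facet."""
    hull = ConvexHull(np.asarray(Q_minus, dtype=float))
    A, b = hull.equations[:, :-1], hull.equations[:, -1]
    return np.array([p for p in np.asarray(Q_plus, dtype=float) if np.any(A @ p + b > tol)])

# Toy example: Q- spans the unit square, so only the distant Q+ point can be carved off.
Q_minus = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Q_plus = [[0.5, 0.5], [2.0, 2.0]]
print(points_outside_hull(Q_plus, Q_minus))   # -> [[2. 2.]]
```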

2.3 Target Switch Algorithm (TSA)

To obtain a subset R ⊂ Q+ of largest possible size, TSA [6] searches for a near-optimal threshold neuron by applying a proper training algorithm, e.g. TPLR [4]. If this neuron does not correctly classify some patterns of Q−, the most critical element of Q+ (chosen according to a precise definition) is moved from Q+ to Q−. It can be shown that such a modification leads to a partial classifier for the given training set in a finite number of iterations.
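A hedged sketch of the target-switching loop follows. It reuses the train_threshold_neuron helper from the IPA sketch above, and the criticality rule adopted here (the element of Q+ closest to Q− in Euclidean distance) is only an illustrative guess, not the precise definition given in [6].

```python
import numpy as np

def target_switch(Q_plus, Q_minus, max_switches=100):
    """TSA-style loop: while no threshold neuron separates Q+ from Q-, move one
    critical element of Q+ to Q- and retrain.  The remaining elements of Q+ form
    the subset R isolated by the resulting partial classifier."""
    Q_plus = [np.asarray(p, dtype=float) for p in Q_plus]
    Q_minus = [np.asarray(p, dtype=float) for p in Q_minus]
    for _ in range(max_switches):
        w = train_threshold_neuron(np.array(Q_plus), np.array(Q_minus))
        if w is not None or len(Q_plus) == 1:
            return w, Q_plus
        # Illustrative criticality rule: switch the Q+ pattern closest to Q-.
        dist = [min(np.linalg.norm(p - q) for q in Q_minus) for p in Q_plus]
        Q_minus.append(Q_plus.pop(int(np.argmin(dist))))
    return None, Q_plus
```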

2.4 Oil Spot Algorithm (OSA)

If the input patterns have binary components, they can be placed at the vertices of an n-dimensional hypercube centered at the origin of the space Rn. Thus, two patterns are contiguous if they are connected by a single edge of the hypercube. It can easily be seen that two edges are parallel if the corresponding pairs of vertices differ in the same component; moreover, two parallel edges are called congruent if their orientations coincide. Finally, according to the classification induced by the sets Q+ and Q−, the edges that join two vertices belonging to opposite classes are named critical. By using these definitions, it is possible to show that the sets R and Q− can be separated by a threshold neuron if the following two conditions are satisfied: (1) the patterns of R are the nodes of a connected subgraph of the hypercube {−1, +1}^n, and (2) two parallel critical edges are always congruent. This property is used by the oil spot algorithm [7] to construct partial classifiers in an incremental way.
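The two conditions can be checked directly on a candidate set R, as in the sketch below. Two assumptions are made for illustration, since [7] is not reproduced here: critical edges are taken between a vertex of R and a vertex of Q−, and the orientation of a critical edge along coordinate i is identified with the sign of component i at its R endpoint.

```python
import numpy as np

def is_connected(R):
    """Condition (1): the patterns of R must be the nodes of a connected subgraph of the
    hypercube {-1, +1}^n, where edges join vertices differing in exactly one component."""
    R = [np.asarray(p) for p in R]
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for v in range(len(R)):
            if v not in seen and np.sum(R[u] != R[v]) == 1:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(R)

def parallel_critical_edges_congruent(R, Q_minus):
    """Condition (2): for every coordinate i, all critical edges along i (edges joining a
    vertex of R to a vertex of Q- that differ only in component i) must share the same
    orientation, identified here with the sign of component i on the R side."""
    orientation = {}                              # coordinate index -> sign on the R endpoint
    for r in map(np.asarray, R):
        for q in map(np.asarray, Q_minus):
            diff = np.nonzero(r != q)[0]
            if len(diff) != 1:
                continue                          # the two vertices are not contiguous
            i = int(diff[0])
            if i in orientation and orientation[i] != r[i]:
                return False                      # two parallel critical edges disagree
            orientation[i] = r[i]
    return True
```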

Algorithm      Problem #1            Problem #2            Problem #3
              w      g      t       w      g       t      w      g      t
BPA          21   95.7%     8      42   84.3%     36     21   90.2%     4
IPA          31   87.8%    44      60   83.3%    163     29   87.8%    33
CA           30   90.7%     5      74   82.9%     13     35   88.7%     5
TSA          64   83.4%    59     116   82.3%    198     64   81.8%    37
OSA         350   82.2%     1     963   74.3%      4    341   80.8%     1
OSA & HC     13   99.7%     1      82   90.5%      1     39   91.1%     1
SWL          38   95.3%    15      16  100.0%      1     65   85.5%    27
SWL & HC     11  100.0%     1      18   98.7%      1     29   91.5%     1

Table 1: Number of nonnull weights w, generalization ability g, and average CPU time t for the neural networks solving the Monk's problems.

2.5 Sequential Window Learning (SWL)

The interest in window neurons (2) is motivated by the existence of a fast and efficient learning algorithm, based on the solution of systems of algebraic equations, which makes it possible to find good partial classifiers for any pair of sets Q+ and Q− [8]. This method is used by SWL, which incrementally builds the set R by adding to it the element x ∈ Q+ that maximizes the number of input patterns correctly classified by the current window neuron.
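The equation-based fitting procedure of [8] is not reproduced here; the sketch below only verifies, in the sense of Definition 1, whether a given window neuron is a partial classifier for a pair of sets, and returns the subset R it isolates. The helper name and the explicit δ argument are illustrative.

```python
import numpy as np

def window_output(w, x, delta):
    """Window activation (2), as in the earlier sketch; x includes the bias x[0] = +1."""
    return 1 if abs(np.dot(w, x)) <= delta else -1

def is_partial_classifier(w, delta, Q_plus, Q_minus):
    """Definition 1 for a window neuron: it must answer +1 on a non-empty subset R of Q+
    and -1 on every element of Q-.  Returns the verdict together with R."""
    with_bias = lambda x: np.concatenate(([1.0], np.asarray(x, dtype=float)))
    R = [x for x in Q_plus if window_output(w, with_bias(x), delta) == 1]
    all_negative = all(window_output(w, with_bias(x), delta) == -1 for x in Q_minus)
    return (len(R) > 0 and all_negative), R
```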

3 Experimental results

The sequential constructive techniques briefly described in the previous section have been extensively tested on five benchmark problems to evaluate their performance. The first of them (the Monk's problems) has been artificially built, while the other four groups of trials derive from a collection of data sets distributed by the machine learning group of the University of California at Irvine [9]. The influence of the Hamming Clustering (HC) procedure [8] on the complexity and the generalization ability of the generated configurations is also analyzed. Thirty runs have been performed for every trial and every training algorithm, so as to obtain sufficient statistics on the measured quantities. In the tables of results we denote by w the number of nonnull weights of the resulting configurations, by g their generalization ability (percentage of correct answers on the test set), and by t the training time (in seconds) needed for each construction. A reference value for each of these quantities has been obtained by applying the Back-Propagation Algorithm (BPA) to every set of trials. All the execution times refer to a DECstation 3000/600 with 64 MB RAM running Digital UNIX 4.0.

3.1 Monk's problems

Three artificial classification problems have been proposed in [10] as benchmarks for training algorithms. The first two Monk's problems are noise-free, whereas in the third case the outputs in the training set can undergo small changes from their correct values. This last test can therefore provide a measure of the robustness of sequential constructive methods. Since the inputs are discrete, a proper transformation must be applied to allow the employment of SWL and OSA; it leads to binary training sets containing 15 inputs.

The results obtained in the simulations are shown in Tab. 1. The use of HC allows good performances to be achieved with both SWL and OSA: the resulting neural networks show a high generalization ability and a fair robustness (Monk's problem #3). Furthermore, the low computational cost of HC reduces the total CPU time in all three tests. Interesting results are also offered by CA, which represents a good compromise between training speed and effectiveness.

Algorithm        GL                      IR                      CH                       VO
              w      g       t        w      g       t        w      g        t        w      g      t
BPA          40   73.6%    143       10   96.1%     93      148   98.6%     799       34   94.9%    24
IPA          59   73.7%     71       35   91.0%     23      491   97.8%    9074       42   94.5%    45
CA           57   73.6%     43       30   89.7%      1     1297   92.5%    8452       69   93.7%   135
TSA          78   75.2%     33       32   93.2%     36      558   96.8%    4202       49   94.6%     3
OSA        4326   80.0%     41     1859   92.2%      6     7329   88.3%   15576     1036   90.7%     5
OSA & HC     81   79.0%      6       35   90.7%      2      156   98.3%     422       47   93.2%     3
SWL         393   52.2%    457      589   11.3%    104      371   93.1%    6146      139   86.4%   755
SWL & HC     37   74.6%     13       28   87.3%      3      102   98.0%     651       32   91.6%     7

Table 2: Number of nonnull weights w, generalization ability g, and average CPU time t for the neural networks solving the four real-world problems.

3.2 Real-world problems

The following four benchmarks maintained at the UCI machine learning repository [9] have been considered:

Glass identification (GL): starting from 9 real parameters, determine whether a given piece of glass is float-processed or not.

Iris data set (IR): three different types of iris plant (virginica, versicolor, and setosa) must be recognized from the length and the width of the sepal and the petal.

Chess end-game (CH): the goal is to determine whether or not a given chess board configuration is a winning position for the white player. The input pattern contains 35 binary components and one ternary component describing a board position of a chess end-game.

United States congressional voting record database (VO): from the results of 16 ternary key votes ('yea', 'nay' or '?'), determine whether a member of the US House of Representatives is a Democrat or a Republican.

In all these benchmarks 2/3 of the patterns (randomly chosen in each of the 30 trials) form the training set, whereas the remaining 1/3 has been used to test the generalization ability of the resulting neural networks. The application of OSA and SWL requires a proper transformation of the input patterns: the corresponding binary training sets contain 256 (GL), 151 (IR), 37 (CH), and 48 (VO) inputs respectively. For real inputs (GL and IR) the thermometer code has always been used, as sketched below.

The results obtained in the simulations are reported in Tab. 2. In all these tests TSA generates two-layer perceptrons with good generalization ability in reasonable training times. As one can note, the performance of the neural networks constructed by OSA and SWL does not suffer from data binarization. HC generally lowers the computational burden and the complexity of the resulting configurations; in many cases (particularly with SWL) it also leads to a better generalization ability.
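As an illustration of the binarization step mentioned above, a possible thermometer encoding of a single real feature is sketched here; the number of quantization levels and the ±1 convention are assumptions made for the example, since they are not specified in the text.

```python
import numpy as np

def thermometer_encode(values, low, high, n_levels):
    """Thermometer code for one real feature: the interval [low, high] is split into
    n_levels bins and each value is mapped to n_levels components in {-1, +1}, with +1
    in every position up to (and including) the bin the value falls in."""
    values = np.asarray(values, dtype=float)
    levels = np.clip(((values - low) / (high - low) * n_levels).astype(int), 0, n_levels - 1)
    code = -np.ones((len(values), n_levels), dtype=int)
    for row, lv in enumerate(levels):
        code[row, :lv + 1] = 1
    return code

# Example: encode one feature of three patterns with 4 levels.
print(thermometer_encode([0.1, 0.5, 0.9], low=0.0, high=1.0, n_levels=4))
# [[ 1 -1 -1 -1]
#  [ 1  1  1 -1]
#  [ 1  1  1  1]]
```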

References

[1] Rumelhart, D.E., Hinton, G.E., and Williams, R.J. Learning internal representations by error propagation. In Parallel Distributed Processing (Rumelhart, D.E., and McClelland, J.L., eds.), Cambridge, MA: MIT Press (1986), 318–362.

[2] Muselli, M. A unified approach to sequential constructive methods. Contribution to WIRN'98, the 10th Italian Workshop on Neural Nets.

[3] Marchand, M., and Golea, M. On learning simple neural concepts: From halfspace intersections to neural decision lists. Network 4 (1993), 67–85.

[4] Frean, M. A "thermal" perceptron learning rule. Neural Computation 4 (1992), 946–957.

[5] Young, S., and Downs, T. Improvements and extensions to the constructive algorithm CARVE. In Artificial Neural Networks — ICANN 96 (von der Malsburg, C., von Seelen, W., Vorbrüggen, J.C., and Sendhoff, B., eds.), Berlin: Springer (1996), 513–518.

[6] Campbell, C., and Perez Vicente, C. The target switch algorithm: A constructive learning procedure for feed-forward neural networks. Neural Computation 7 (1995), 1245–1264.

[7] Frattale Mascioli, F.M., and Martinelli, G. A constructive algorithm for binary neural networks: The oil-spot algorithm. IEEE Transactions on Neural Networks 6 (1995), 794–797.

[8] Muselli, M. On sequential construction of binary neural networks. IEEE Transactions on Neural Networks 6 (1995), 678–690.

[9] Merz, C.J., and Murphy, P.M. UCI repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science (1996).

[10] Thrun, S., et al. A performance comparison of different learning algorithms. Technical Report CMU-CS-91-197, Pittsburgh, PA: Department of Computer Science, Carnegie Mellon University (1991).
