An algorithm for generating modular hierarchical neural network classifiers: a step towards larger scale applications

D. Roverso
Institute for Energy Technology, OECD Halden Reactor Project, P.O. Box 173, N-1751 Halden, Norway
E-mail:
[email protected]

ABSTRACT

Many-class learning is the problem of training a classifier to discriminate among a large number of target classes. Together with the problem of dealing with high-dimensional patterns (i.e. a high-dimensional input space), the many-class problem (i.e. a high-dimensional output space) is a major obstacle to be faced when scaling up classifier systems and algorithms from small pilot applications to large full-scale applications. The Autonomous Recursive Task Decomposition (ARTD) algorithm is here proposed as a solution to the problem of many-class learning. Example applications of ARTD to neural classifier training are also presented. In these examples, improvements in training time are shown to range from 4-fold to more than 30-fold in pattern classification tasks of both static and dynamic character.

Keywords: intelligent computing, many-class learning, task decomposition, modular classifiers, neural networks.
1. INTRODUCTION

Scaling up systems and techniques to real-world problems has been, and still is, a primary concern in the field of soft computing. Most industries, or other potential users of soft computing solutions, will inevitably pose the question of whether a particular technique or process will live up to expectations when a full-blown application has to be developed. In this paper we focus on a subset of soft computing applications, namely pattern classification, and, throughout the paper, we have chosen to use neural networks as the soft computing technique of reference. Issues of scale in a pattern classification task are generally related to three basic factors:

1. Amount of data.
2. Input dimensionality.
3. Output dimensionality.
Several techniques have been proposed, and are commonly used, to tackle these issues. When the amount of data to be used for training of the classifier is vast, training time might become extremely long. In this case, most of the proposed solutions involve one form or another of data sampling [1, 2]. Recently, promising techniques for "data squashing" have also been proposed [3]. When the input dimensionality is large, i.e. the patterns to be classified are described by a large number of features, several techniques for dimensionality reduction can be used. These range from feature sub-set selection algorithms [4], which basically eliminate some dimensions (features), to classical Principal Components Analysis (PCA) and other similar techniques (e.g. non-linear PCA, Independent Component Analysis (ICA), and data fusion techniques [5]),
which instead try to apply a transformation to the original input space so as to obtain a new representation in a lower-dimensional space. When the output dimensionality is large, i.e. we have a many-class learning problem where the patterns to be classified belong to a large number of distinct target classes, the only available option is to adopt some form of task decomposition in order to reduce the size and complexity of the classification models. Several approaches have been proposed in which the task decomposition is achieved through the construction of a modular classifier architecture. This can be done in a number of ways:

• manually, i.e. using explicit expert knowledge of the task at hand prior to the employment of the learning algorithm, in order to decompose the original task into a number of predefined subtasks (see for instance [6]);

• mechanically, i.e. in a predetermined fixed fashion, before starting the learning process; for example by decomposing an N-class task into N independent two-class (binary) tasks [7], where each classifier discriminates one class from the rest, or into N(N-1)/2 binary tasks [8], where the target classes are paired in all possible combinations (a minimal code sketch of both schemes follows this list);

• autonomously, i.e. the task decomposition is carried out during the learning process. In this category we find methods such as the Hierarchical Mixture of Experts (HME) of Jordan & Jacobs [9], and several other modular approaches. For a comparative study of modular neural classifiers see [10].
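To make the two "mechanical" decomposition schemes above concrete, the following minimal sketch (not taken from the paper; it assumes scikit-learn and uses its bundled digits data purely as a placeholder 10-class task) builds both a one-vs-rest ensemble of N binary classifiers and a one-vs-one ensemble of N(N-1)/2 binary classifiers around a small neural network base learner.

```python
# One-vs-rest (N binary tasks) and one-vs-one (N(N-1)/2 binary tasks)
# decompositions of a 10-class problem; dataset and network size are placeholders.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)

# N independent binary classifiers, each separating one class from the rest.
ovr = OneVsRestClassifier(base).fit(X_tr, y_tr)

# One binary classifier per pair of target classes.
ovo = OneVsOneClassifier(base).fit(X_tr, y_tr)

print("one-vs-rest accuracy:", ovr.score(X_te, y_te))
print("one-vs-one accuracy: ", ovo.score(X_te, y_te))
```

Both wrappers replace a single N-output network with many small binary networks, fixing the decomposition before any learning takes place, which is precisely the limitation that autonomous methods such as ARTD aim to remove.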
The algorithm proposed in this paper falls in this last category, and is characterized by a recursive task decomposition procedure which bases the decomposition strategy on the classification performance of partially trained classifiers. We call this new method Autonomous Recursive Task Decomposition (ARTD). In the remainder of the paper, Section 2 describes the ARTD algorithm in detail, Section 3 analyses a series of comparative tests, and Section 4 summarizes the paper, including a discussion of open issues and future work.
2. THE ARTD ALGORITHM

The Autonomous Recursive Task Decomposition (ARTD) algorithm is a hierarchical decomposition procedure for many-class classification tasks, i.e. tasks involving a large number of distinct pattern classes. ARTD generates a hierarchy of classifiers in a recursive way, by decomposing the task at hand into a set of sub-tasks and by reapplying the same procedure in turn to each sub-task. The decision to decompose a task into sub-tasks is based on an analysis of the classification performance of a classifier which has been only partially trained to solve the original task. In the following, the ARTD algorithm for the case of neural network classifiers is shown [a].
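Before the formal step-by-step description, the overall recursion can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a scikit-learn MLPClassifier as the neural classifier, agglomerative clustering of the per-class output centers as one possible realization of the (otherwise unspecified) clustering of Step 5, ad-hoc values for the partial-training budget and the clustering distance threshold, and a hypothetical gating network that routes patterns to the sub-task classifiers.

```python
# Illustrative sketch of the ARTD recursion (not the paper's code).
# Assumptions: scikit-learn MLPClassifier, agglomerative clustering of the
# per-class output centers, placeholder training budget and threshold.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neural_network import MLPClassifier


def artd(X, y, max_refinements=3, distance_threshold=0.5):
    """Recursively decompose task (X, y); return a nested classifier hierarchy."""
    X, y = np.asarray(X), np.asarray(y)
    classes = np.unique(y)
    if len(classes) == 1:                       # degenerate leaf: a single class
        return {"leaf": classes[0]}

    # warm_start lets repeated fit() calls continue training rather than
    # restart, mimicking Step 1 and the "continue training" of Step 6.
    net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=50,
                        warm_start=True, random_state=0)
    K = 1
    for _ in range(max_refinements):
        net.fit(X, y)                           # Step 1: partial training
        out = net.predict_proba(X)              # Step 2: record output patterns o
        centers = np.array([out[y == c].mean(axis=0) for c in classes])  # Steps 3-4
        labels = AgglomerativeClustering(       # Step 5: cluster the class centers
            n_clusters=None,
            distance_threshold=distance_threshold).fit_predict(centers)
        K = labels.max() + 1
        if K > 1:
            break                               # a non-trivial partition was found
    if K == 1 or K == len(classes):             # Steps 6-7: no useful split, or the
        net.set_params(max_iter=2000)           # classes already separate cleanly:
        return {"net": net.fit(X, y)}           # fully train this network and return.

    # 1 < K < N: one sub-task per superclass, solved recursively; the gating
    # network used to dispatch patterns to sub-tasks is an illustrative choice.
    superclass = {c: int(labels[i]) for i, c in enumerate(classes)}
    children = {}
    for k in range(K):
        mask = np.isin(y, [c for c in classes if superclass[c] == k])
        children[k] = artd(X[mask], y[mask], max_refinements, distance_threshold)
    gate = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
    gate.fit(X, np.array([superclass[c] for c in y]))
    return {"gate": gate, "children": children}
```

A call such as artd(X, y) on a many-class dataset returns a nested structure of networks that mirrors the generated hierarchy of superclasses.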
Let:

C = {c1, …, cN} be the set of N target classes defining the current task;

net be a neural network classifier which receives patterns p as input and has N outputs, one for each class in the set C.
1. Partially train net on the available training patterns p (for example, for a limited number of epochs, or until a limited error goal is reached).

2. Run the training patterns p through net, and record the obtained N-dimensional output patterns o.
[a] Equivalent versions of the ARTD algorithm for other types of learning classifiers can be derived from this. One example could be the combination of ARTD with genetic search.
3. Assign each recorded output pattern o to one of the N sets O1, …, ON according to the class membership of the originating training pattern p.

4. Let x1, …, xN be the centers of mass of the sets O1, …, ON.

5. Create a partition S of C by clustering the corresponding class centers x1, …, xN. The clustering shall take into account the distribution of the output patterns o in the N-dimensional output space [b]. This step generates a set S of K superclasses:
$$ S = \{s_1, \ldots, s_K\}, \qquad s_i \cap s_j = \emptyset \;\; \forall\, i \neq j, \qquad \bigcup_{j=1}^{K} s_j \equiv C $$
6. If K = 1 (i.e. the outputs are indistinguishable), continue training net by going to Step 1 (unless a maximum number of iterations has been reached and further decomposition is abandoned, in which case fully train net and return).

7. If K = N (i.e. the superclasses coincide with the original classes, indicating that further decomposition is not necessary and that training net to solve the current task is easy), fully train net and return.
8. If 1