Proc. 6th Int’l Conference on Artificial Neural Networks (ICANN’96), Bochum, Germany, July 16-19, 1996.
Visualizing Similarities in High Dimensional Input Spaces with a Growing and Splitting Neural Network

Monika Köhle, Dieter Merkl
Institute of Software Technology, Vienna University of Technology
Resselgasse 3/188, A-1040 Wien, Austria
{monika, dieter}@ifs.tuwien.ac.at

Abstract. The recognition of similarities in high dimensional input spaces and their visualization in low dimensional output spaces is a highly demanding application area for unsupervised artificial neural networks. Some of the problems inherent to structuring high dimensional input data may be shown with an application such as text document classification. One of the most prominent ways to represent text documents is by means of keywords extracted from the full-text of the documents; thus, document collections represent high dimensional input spaces by nature. We use a growing and splitting neural network as the underlying model for document classification. The apparent advantage of such a neural network is the adaptive network architecture that develops according to the specific requirements of the actual input space. As a consequence, the class structure of the input data becomes visible due to the separation of units into disconnected areas. The results from a growing and splitting neural network are further contrasted with the more conventional neural network approach to classification by means of self-organizing maps.
1. Introduction

Real world data is often represented in high dimensional spaces and hence the inherent similarities are hard to recognize and to illustrate. This fact makes it a challenging task to build tools that are able to visualize these similarities. By nature, visualization requires a mapping process from the high dimensional input space to a low dimensional output space. One well-known approach to dimension reduction is the self-organizing map [6]. This model preserves the structure of the input data as faithfully as possible, and the results can be visualized on a 2D topographic map, indicating the similarities between input vectors in terms of the distance between the respective units. However, it does not represent cluster boundaries explicitly; these rather have to be drawn in by hand. Only recently, variations on the basic idea of self-organizing maps with a special focus on adaptive architectures have been suggested [2], [4]. These architectures incrementally grow and split according to the specific requirements of the actual input space, thus producing isolated clusters of related items. This paper presents an investigation into the applicability of one of these models to a real world problem domain, namely the classification of text documents.

The remainder of the paper is organized as follows. In the next section we present a brief review of the network dynamics of Growing Cell Structures, a growing and splitting artificial neural network. In Section 3 we present the general framework for text document classification. Section 4 contains a description of experimental results from our application. These results are further compared to self-organizing maps. Finally, our conclusions are presented in Section 5.
2. Growing Cell Structures

For the experiments we used a neural network implemented in the spirit of Growing Cell Structures (GCS) as described in [4]. This model may be regarded as a variation on Kohonen’s self-organizing maps [6]. In its basic form, a GCS consists of a two-dimensional output space where the units are arranged in the form of triangles. Each of these output units owns a weight vector of the same dimension as the input data. The learning process of GCS is a repetition of input vector presentation and weight vector adaptation. More formally, during the first step of the learning process
the unit c with the smallest distance between its weight vector w_c and the current input vector x is selected. This unit is further referred to as the best-matching unit, or winner for short. The selection may be implemented by using the Euclidean distance measure as indicated in expression (1), where O denotes the set of units in the output space.

c: ||x − w_c|| ≤ ||x − w_i||,  ∀i ∈ O   (1)

The second step of the learning process is the adaptation of weight vectors in order to enable an improved representation of the input space. This adaptation is performed with the weight vector of the winner and the weight vectors of its directly neighboring units. Denoting the best-matching unit with c and the set of its neighboring units with N_c, the weight vector adaptation may be written as given in expressions (2) and (3), with t denoting the current learning iteration. The terms ε_c and ε_n represent learning rates in the range [0, 1] for the winner and its neighbors. Notice that the definition of a neighborhood function as used with self-organizing maps is omitted, since the adaptation is restricted to direct neighbors only. As another difference to self-organizing maps, the learning rates are fixed throughout the entire learning process.

w_c(t+1) = w_c(t) + ε_c ⋅ (x − w_c)   (2)
w_n(t+1) = w_n(t) + ε_n ⋅ (x − w_n),  ∀n ∈ N_c   (3)

Finally, the third step of the learning process constitutes the major difference to self-organizing maps. In addition to the weight vector, each unit owns a so-called signal counter variable τ. This variable indicates how often a specific unit has been selected as the winner during the learning process. This information is further used to adapt the architecture of the artificial neural network to the specific requirements of the input space. We will return to this adaptation below; for the moment we just provide the formulae for signal counter adjustment in expressions (4) and (5). In these expressions, c again refers to the best-matching unit and α is a fixed rate of signal counter reduction for each unit that is not the winner at the current learning iteration t.

τ_c(t+1) = τ_c(t) + 1   (4)
τ_i(t+1) = τ_i(t) − α ⋅ τ_i(t),  i ≠ c   (5)

Apart from the adaptation of weight vectors as described above, the GCS learning process comprises an adaptation of the overall architecture of the artificial neural network. Pragmatically speaking, units are inserted in those regions of the output space that represent large portions of the input data, whereas units are removed if they do not contribute sufficiently to input data representation. Additionally, due to the deletion of units, the output space of the GCS may split into several disconnected areas, each of which represents a set of highly similar input data. This adaptation of the architecture is performed repeatedly after a fixed number of input presentations; the results presented below are based on an adaptation of the network structure every two epochs of input presentation.
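To make the three steps concrete, the following is a minimal sketch of one GCS learning iteration in Python/NumPy. All names (gcs_learning_step, the neighbor dictionary) and the numerical values of ε_c, ε_n and α are illustrative assumptions; the paper does not report its parameter settings.

import numpy as np

def gcs_learning_step(x, W, neighbors, tau, eps_c=0.06, eps_n=0.002, alpha=0.05):
    """One GCS learning iteration (sketch).

    x         -- current input vector, shape (dim,)
    W         -- weight vectors of all output units, shape (n_units, dim)
    neighbors -- dict: unit index -> set of directly connected unit indices
    tau       -- signal counters, shape (n_units,)
    eps_c, eps_n, alpha -- assumed learning/decay rates, not taken from the paper
    """
    # Expression (1): the winner c has the smallest Euclidean distance to x.
    c = int(np.argmin(np.linalg.norm(W - x, axis=1)))

    # Expressions (2) and (3): adapt the winner and its direct neighbors
    # with fixed learning rates; there is no shrinking neighborhood function.
    W[c] += eps_c * (x - W[c])
    for n in neighbors[c]:
        W[n] += eps_n * (x - W[n])

    # Expressions (4) and (5): increment the winner's signal counter,
    # decay the counters of all other units by the fixed rate alpha.
    winner_count = tau[c]
    tau *= (1.0 - alpha)
    tau[c] = winner_count + 1.0
    return c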
Starting with the insertion of new units, in a first step the unit that served most often as the winner is selected. The selection is based on the signal counter variable, as given in expressions (6) and (7). We denote the unit having the highest relative signal counter value with q.

h_i = τ_i / Σ_j τ_j   (6)
q: h_q ≥ h_i,  ∀i ∈ O   (7)

Subsequently, the neighboring unit r of q with the most dissimilar weight vector is determined as given in expression (8). In this formula, N_q again represents the set of neighboring units of q.

r: ||w_r − w_q|| ≥ ||w_p − w_q||,  ∀p ∈ N_q   (8)

An additional unit s is now added to the artificial neural network in between units q and r, and its weight vector is initially set to the mean of the two existing weight vectors, i.e. w_s = 1/2 ⋅ (w_q + w_r).
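As an illustration of the insertion step, the following sketch (again Python/NumPy, with assumed helper names) selects q and r according to expressions (6) through (8) and places the new unit s halfway between them. The edge bookkeeping of the full triangular cell structure is reduced to a plain neighbor dictionary here.

import numpy as np

def gcs_insert_unit(W, neighbors, tau):
    """Insert one new unit between the busiest unit q and its most
    dissimilar neighbor r (sketch of expressions (6) to (8))."""
    # Expressions (6) and (7): unit with the highest relative signal counter.
    h = tau / tau.sum()
    q = int(np.argmax(h))

    # Expression (8): neighbor of q whose weight vector is most dissimilar.
    r = max(neighbors[q], key=lambda p: np.linalg.norm(W[p] - W[q]))

    # New unit s starts halfway between q and r: w_s = (w_q + w_r) / 2.
    s = W.shape[0]
    W = np.vstack([W, 0.5 * (W[q] + W[r])])
    tau = np.append(tau, 0.0)   # tau_s is initialized later, expressions (9)-(11)

    # Connect s to q and r; in a complete GCS implementation the common
    # neighbors of q and r would be linked to s as well, so that the
    # triangular structure of the output space is preserved.
    neighbors[q].discard(r)
    neighbors[r].discard(q)
    neighbors[s] = {q, r}
    neighbors[q].add(s)
    neighbors[r].add(s)
    return W, neighbors, tau, s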
Finally, the signal counter variables τ in the neighborhood N_s of the newly inserted unit s have to be adjusted. This adjustment approximates a hypothetical situation in which unit s had existed throughout the learning process so far. The necessary operations are given in expressions (9) through (11). With card(N_p) we refer to the cardinality of set N_p, and the term F_p represents an approximation of the size of the decision region covered by unit p, estimated from the distances to its neighbors in N_p. Furthermore, t refers to the situation before and t+1 to the situation after the insertion of unit s. In a last step, the signal counter variable of unit s is initialized with the sum of changes at the existing units.

F_p = (1 / card(N_p)) ⋅ Σ_{i∈N_p} ||w_p − w_i||   (9)
Δτ_p = ((F_p(t+1) − F_p(t)) / F_p(t)) ⋅ τ_p(t),  ∀p ∈ N_s   (10)
τ_s = − Σ_{p∈N_s} Δτ_p   (11)

Units that do not contribute sufficiently to input data representation are removed from the artificial neural network. Again, the contribution of a particular unit is measured by means of its relative signal counter value and the size of the decision region the unit belongs to, i.e. p_i = h_i / F_i, ∀i ∈ O. The final deletion is guided by a simple threshold logic: all units j with contribution p_j below a certain threshold η are removed, i.e. p_j < η.
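Under the same assumptions as in the sketches above, the signal counter redistribution of expressions (9) through (11) and the threshold-based deletion rule might look as follows; the threshold eta and the before/after snapshot arguments are again illustrative.

import numpy as np

def region_size(W, neighbors, p):
    """Expression (9): F_p, the mean distance from unit p to its direct
    neighbors, taken as an estimate of the size of its decision region."""
    dists = [np.linalg.norm(W[p] - W[i]) for i in neighbors[p]]
    return sum(dists) / len(dists)

def redistribute_counters(W_old, neighbors_old, W, neighbors, tau, s):
    """Expressions (10) and (11): shift signal counter mass from the
    neighbors of the newly inserted unit s to s itself."""
    tau[s] = 0.0
    for p in neighbors[s]:
        f_old = region_size(W_old, neighbors_old, p)   # F_p(t), before insertion
        f_new = region_size(W, neighbors, p)           # F_p(t+1), after insertion
        delta = (f_new - f_old) / f_old * tau[p]       # expression (10)
        tau[p] += delta
        tau[s] -= delta                                # expression (11)
    return tau

def units_below_threshold(W, neighbors, tau, eta):
    """Deletion rule: contribution p_i = h_i / F_i compared against eta;
    units falling below the threshold are candidates for removal."""
    h = tau / tau.sum()
    return [i for i in range(len(tau))
            if h[i] / region_size(W, neighbors, i) < eta]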