Proc. 5th Int’l Conference on Artificial Neural Networks (ICANN’95), Paris, October 9-13, 1995, Vol. 2, pp 239-244.
CONTENT-BASED DOCUMENT CLASSIFICATION WITH HIGHLY COMPRESSED INPUT DATA
Dieter Merkl
Institute of Software Technology, Vienna University of Technology
Resselgasse 3/188, A-1040 Vienna, Austria
[email protected]
Abstract: One of the major obstacles to the application of artificial neural networks to real-world problems is the rather time-consuming task of training. In this paper we will demonstrate that considerable acceleration with equal classification results may be achieved by using highly compressed input data for a self-organizing map. As the basis for the experiments we use an application in the area of software reuse, namely the structuring of software components. As a result we are able to show that the self-organizing map is successful in arranging the software components according to their semantic similarity even in the case of highly compressed input data.

INTRODUCTION

The self-organizing map as proposed by Teuvo Kohonen (Kohonen, 1982, 1989, 1990) is one of the most prominent models in the area of unsupervised artificial neural networks. This neural network architecture is well known for successfully representing semantic similarities inherent in the set of input data, e.g. (Merkl, 1995a; Ritter and Kohonen, 1989). More precisely, the semantic relationship of the input data is visualized in terms of distances between the respective winners within a two-dimensional grid of output units. However, the self-organizing process is a rather time-consuming task and thus one pressing need is for approaches that reduce the time needed to train the self-organizing map. A step in the right direction is lateral inhibition of output units (Merkl et al., 1994; Merkl, 1995b; Miikkulainen, 1991) since it accelerates the process of initial cluster formation. In this paper we present an alternative approach which aims at compressing the input vectors and thus obviously contributes to reduced training time. In particular, we suggest the utilization of a multi-layer perceptron for data compression. We will show the effects of such an approach on the basis of a software engineering application, namely structuring the contents of a library of reusable software components.

The remainder of this paper is organized as follows. In the next section we provide a description of the architecture and the learning rule of the artificial neural network we used for our experiments. Furthermore, we describe data compression by using multi-layer perceptrons. Section 3 contains an outline of our application. In Section 4 we show the results of the application. Finally, in Section 5 we draw some conclusions.

SELF-ORGANIZING MAPS

The architecture of a self-organizing map consists of a layer of n input units and a grid of output units, each of which is assigned an n-dimensional weight vector. The task of the input units is to receive the various input vectors representing real-world entities and to propagate them as they are onto the grid of output units. Each of the output units in turn computes exactly one output value which is proportional to the similarity between the current input vector and that unit’s weight vector. This value is commonly referred to as the unit’s activation or the unit’s response to the presentation of an input. Usually, the Euclidean distance is used as the measure of similarity.

During the learning process of a self-organizing map, the adaptation of the weight vectors is performed according to an unsupervised learning rule. Basically, this learning rule may be seen as a generalization of winner-take-all learning. More precisely, the learning process of a self-organizing map includes a spatial dimension. We can describe the learning rule in three steps which are performed repeatedly and which are collectively referred to as one learning iteration. First, one input vector at a time is randomly selected out of the set of possible input data. Second, this vector is mapped onto the grid of output units and the unit with the strongest response is determined. This unit is further referred to as the winning unit, the winner in short. Third, the weight vector of the winning unit as well as the weight vectors of units in the topological neighborhood of the winner are adapted in such a way that these units will exhibit an even stronger response to the same input vector in the future.

More formally, we may describe the adaptation of the weight vector mi assigned to output unit i as presented in expression (1). Unit i is a neighbor of the winning unit c. The amount of adaptation depends on a gain function, η(t), a neighborhood function, ξci(t), and the difference between the current input vector and the weight vector, x(t) − mi(t). The gain function gradually decreases to zero with increasing learning iterations t. With such a restriction it is obvious that the learning process will terminate in a stable arrangement of weight vectors.

    mi(t+1) = mi(t) + η(t) ⋅ ξci(t) ⋅ [x(t) − mi(t)]        (1)

The purpose of the neighborhood function is to determine the set of units that are subject to adaptation as well as the amount of adaptation with respect to, on the one hand, the current learning iteration and, on the other hand, the distance between units c and i as measured in the output space. In general, the amount of adaptation has to decrease with increasing distance from the winner. Our concrete realization of such a neighborhood function is presented in equation (2). A property of this neighborhood function is the incorporation of lateral inhibition of output units. Pragmatically speaking, lateral inhibition may be paraphrased as a learning function that moves the weight vectors of units in close vicinity to the winner towards the current input vector, whereas the weight vectors of units that are farther away from the winner are pushed slightly away from the current input vector. Returning to the formula, the parameter α determines the spatial width of the neighborhood function in terms of lateral excitation and lateral inhibition. Furthermore, the amount of adaptation decreases with increasing learning iterations t according to a time-varying parameter β which is limited to the range [0, 1]. The expression ||⋅|| refers to the Euclidean vector norm.

    ξci(t) = β(t) ⋅ sin(α ⋅ ||c − i||) / (α ⋅ ||c − i||)        (2)
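To make the learning rule concrete, the following sketch (not from the paper) is a minimal NumPy illustration of a training run: it repeatedly selects a random input, determines the winner by Euclidean distance, and adapts all weight vectors with the sinc-shaped neighborhood function of equation (2). The grid size, the linear decay of η(t) and β(t), and the value of α are assumptions chosen for illustration only.

```python
import numpy as np

def train_som(data, grid_shape=(10, 10), iterations=7000,
              eta0=0.7, alpha=0.5, rng=None):
    """Minimal SOM training loop with the sinc-shaped neighborhood of Eq. (2).

    data       : (n_samples, n_features) array of input vectors x
    grid_shape : size of the two-dimensional grid of output units (assumed)
    eta0       : initial gain (learning rate), decayed linearly to zero (assumed schedule)
    alpha      : spatial width parameter of the neighborhood function (assumed value)
    """
    rng = np.random.default_rng() if rng is None else rng
    rows, cols = grid_shape
    # grid coordinates of the output units, used to measure ||c - i|| in the output space
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # one n-dimensional weight vector per output unit, randomly initialized
    weights = rng.random((rows * cols, data.shape[1]))

    for t in range(iterations):
        x = data[rng.integers(len(data))]              # 1) pick an input vector at random
        dists = np.linalg.norm(weights - x, axis=1)    # Euclidean distance to every unit
        winner = np.argmin(dists)                      # 2) unit with the strongest response

        # 3) adapt the winner and its neighbors; eta(t) and beta(t) decay over time
        eta = eta0 * (1.0 - t / iterations)
        beta = 1.0 - t / iterations
        d = np.linalg.norm(coords - coords[winner], axis=1)
        # sinc-shaped neighborhood: excitation near the winner, slight inhibition farther out;
        # np.sinc(z) = sin(pi z)/(pi z), so np.sinc(alpha*d/pi) = sin(alpha*d)/(alpha*d)
        xi = beta * np.sinc(alpha * d / np.pi)
        weights += eta * xi[:, None] * (x - weights)

    return weights.reshape(rows, cols, -1)
```

Note that units with a negative value of ξci(t) are pushed away from the current input vector, which is exactly the lateral inhibition described above.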
From the description of the self-organizing map’s learning process it is obvious that the length of the input vectors has a considerable impact on learning speed. In other words, the higher the input dimension, the longer the time needed to train the self-organizing map. In order to reduce the dimension of the input data we utilize a multi-layer perceptron (MLP). More precisely, we utilize an MLP consisting of three fully connected layers. During the training process of the MLP each input vector is mapped onto itself as the output vector, i.e. the number of input units equals the number of output units of the MLP. The activation levels of the hidden units are further treated as input for the self-organizing map. In case of successful training, the original input vectors may be reconstructed given the activation levels of the hidden units, i.e. the weight vector of the winner in the self-organizing map. The MLP used for our experiments was set up by using the PlaNet neural network simulator (Miyata, 1992) and was trained with the standard error-backpropagation learning rule.
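The sketch below illustrates this compression stage: a three-layer MLP trained to reproduce its input, with the hidden-layer activations serving as the compressed code. It is a minimal NumPy autoencoder with sigmoid units and plain error backpropagation, not the PlaNet setup used for the experiments; the learning rate, epoch count, and weight initialization are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(data, n_hidden=75, epochs=200, lr=0.1, rng=None):
    """Three fully connected layers (e.g. 489-75-489): each input vector is the
    target output, and the hidden activations form the compressed representation."""
    rng = np.random.default_rng() if rng is None else rng
    n_in = data.shape[1]
    W1 = rng.normal(0.0, 0.1, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, (n_hidden, n_in)); b2 = np.zeros(n_in)

    for _ in range(epochs):
        for x in rng.permutation(data):
            h = sigmoid(x @ W1 + b1)                  # hidden (compressed) code
            y = sigmoid(h @ W2 + b2)                  # reconstruction of the input
            # backpropagation of the squared reconstruction error
            delta_out = (y - x) * y * (1.0 - y)
            delta_hid = (delta_out @ W2.T) * h * (1.0 - h)
            W2 -= lr * np.outer(h, delta_out); b2 -= lr * delta_out
            W1 -= lr * np.outer(x, delta_hid); b1 -= lr * delta_hid

    def compress(x):
        return sigmoid(x @ W1 + b1)                   # hidden activations -> SOM input
    return compress
```

After training, the returned compress function maps each original input vector to its low-dimensional code, which is then used as input for the self-organizing map.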
A BRIEF NOTE ON SOFTWARE LIBRARIES

There is common agreement that reusing software is one of the key issues in improving both the productivity of software suppliers and the quality of the software itself. The increased productivity is due to the amount of work that is saved each time a component is reused. The increased quality relates to the fact that the same component is used and tested in many different contexts. Due to the limited space in this paper we cannot provide a more precise exposition of the various areas of software reuse; we rather refer to (Biggerstaff and Perlis, 1989; Krueger, 1992). It is necessary to recognize, however, that the reuse of existing software components relies on the existence of libraries which contain these components (Frakes and Isoda, 1994).

In order to be useful, such libraries have to provide a large number of reusable software components covering a wide spectrum of application domains. Yet, an inherent problem of large libraries is that finding and choosing the appropriate components tend to become troublesome tasks. Consequently, the libraries ought to be organized in a way that facilitates locating and retrieving the needed software components. In other words, the stored software components ought to be arranged in a way that reflects their functional similarity as closely as possible. Pragmatically speaking, components exhibiting similar behavior should be stored near each other and thus be easy to identify.

The task of library structuring is commonly performed by using either semantic networks (Ostertag et al., 1992) or cluster analysis (Maarek et al., 1991). The former approach achieves smart retrieval behavior at the expense of a manually constructed semantic net. The advantage of the latter approach is the high degree of possible automation, since cluster analysis is based on keywords automatically extracted from the textual descriptions of the various software components. The hypothesis behind such an approach may be stated as follows: the more keywords two textual descriptions have in common, the more similar are the respective software components.

Our approach relies on keywords automatically extracted from the manual of the software components. These keywords are further used as the input data of a self-organizing map which performs the task of library structuring. Specifically, each component is described by a set of keywords extracted from the full text of its manual entry. During this so-called indexing process we utilize a small list of stop-words, i.e. words that are to be excluded from the final component description, to clean up the resulting index. In fact, the stop-word list comprises only conjunctions, articles, and pronouns and thus contains no domain-specific knowledge. Subsequently, each software component is represented as a binary-valued vector where each vector component corresponds to a possible document feature, i.e. keyword. An entry of zero denotes that the corresponding feature is not used in the description of the software component, whereas an entry of one means that the corresponding feature is used to describe the software component. Such a representation is known as the vector space model of information retrieval (Turtle and Croft, 1992).
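As an illustration of this indexing step, the following sketch (not from the paper) turns free-text manual entries into binary-valued feature vectors; the tokenization and the tiny stop-word list are assumptions chosen only to mirror the description above.

```python
import re

# Illustrative stop-word list: only conjunctions, articles, and pronouns.
STOP_WORDS = {"a", "an", "and", "but", "it", "its", "or", "the", "they", "this", "which"}

def index_components(manuals):
    """manuals: dict mapping component name -> full text of its manual entry.
    Returns (keywords, vectors), where each vector is binary-valued: 1 if the
    corresponding keyword occurs in the component's description, 0 otherwise."""
    tokenized = {name: {w for w in re.findall(r"[a-z]+", text.lower())
                        if w not in STOP_WORDS}
                 for name, text in manuals.items()}
    keywords = sorted(set().union(*tokenized.values()))   # the document features
    vectors = {name: [1 if k in words else 0 for k in keywords]
               for name, words in tokenized.items()}
    return keywords, vectors
```

Applied to the manual entries of a class library, a procedure of this kind yields the binary keyword vectors used as input data in the experiments below.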
EXPERIMENTAL RESULTS

For our experiments we used the NIH Class Library, the NIHCL in short, as an example of a software library. The NIHCL is a collection of classes developed in the C++ programming language. It includes a number of generally useful data types such as Date and Time as well as a large variety of container classes such as Dictionary and LinkedList which are constructed in the spirit of the Smalltalk-80 classes. Moreover, the NIHCL provides facilities to store arbitrarily complex data structures on disk, e.g. ReadFromTbl, StoreOnTbl. For more detailed information about the NIH Class Library we refer to (Gorlen et al., 1990). Based on the textual descriptions of the various classes as contained in the reference manual of Version 3.10 (Gorlen, 1990) we extracted 489 distinct keywords representing the software components. Thus, each class of the NIHCL is represented by a binary-valued 489-dimensional feature vector. As a result of structuring the NIHCL, consider the mapping depicted
in Figure 1. This example is based on a training process with the full-sized input vectors. The final arrangement was obtained after 7,000 learning iterations with an initial learning rate η of 0.7. For ease of comparison we marked some of the interesting clusters manually. In particular, the final arrangement of the various classes within the output space of the self-organizing map resembles a combination of the structure given in the inheritance hierarchy and the classes that are identified as being related in the manual. Consider for example the upper left part of the final map. This region contains all classes implementing I/O operations. Moreover, within this cluster the classes are arranged according to their relations as indicated in the respective parts of the manual. For another cluster of related classes we refer either to the lower right corner containing Range and Regex or to the cluster built by the classes Date and Time. However, the largest portion of the final map is reserved for class Collection and its derived and related classes. This cluster is located in the lower left part of the map. As an interesting property of this cluster we refer to the separation of classes that provide keyed access to the stored information, i.e. Dictionary and IdentDict. These classes, together with the classes that actually implement such keyed access, i.e. Assoc, AssocInt, and LookupKey, are arranged in close vicinity. Finally, we want to direct attention to yet another cluster which is obvious neither from the inheritance hierarchy nor from the relations indicated in the manual. In particular, we refer to the region containing the classes Float and Integer. Both classes implement basic numerical data types and thus have highly similar descriptions in the manual. Furthermore, please note that the class Random is mapped onto a unit neighboring the cluster of Float and Integer. As may be guessed from the names of the classes, Random produces pseudo-random numbers of the Float data type and thus the classes are highly related.
Figure 1: Final map with uncompressed input vectors (489 components)

On closer inspection of the final maps depicted in Figures 2 and 3, we notice that the same clusters were formed. These maps were trained, however, with highly compressed input data. More precisely, Figure 2 represents the mapping achieved with input data compressed by using a 489-75-489 MLP, i.e. 489 input and output units and 75 hidden units, whereas the map depicted in Figure 3 results from an even higher compression of the input data, namely by using a 489-30-489 MLP. Both maps reached their stable state after about 7,000 learning iterations. Yet, as expected, the time needed to train these maps is much shorter than with the original input data. Information on the training time is presented in Table 1. More precisely, Table 1 contains the time, measured in minutes and seconds, needed to train a self-organizing map for 7,000 learning iterations with varying length of the input data. We used the UNIX command time to obtain the approximate training time on a SUN SPARC 10 workstation.

489 components (uncompressed):  45:26.2
75 components (compressed):      7:13.9
30 components (compressed):      3:06.6

Table 1. Training time for 7,000 learning iterations
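For completeness, a hypothetical end-to-end run combining the sketches above (compress first, then train the self-organizing map on the hidden activations) could look as follows. The data here are random stand-in vectors, not the NIHCL index, and the measured times depend on the machine; the snippet only illustrates how the per-iteration cost of the self-organizing map scales with the input dimension, it does not reproduce Table 1.

```python
import time
import numpy as np

# Uses the hypothetical helpers train_autoencoder and train_som sketched above.
rng = np.random.default_rng(0)
data = (rng.random((92, 489)) < 0.05).astype(float)   # synthetic stand-in component vectors

for n_hidden in (75, 30):
    compress = train_autoencoder(data, n_hidden=n_hidden, epochs=50, rng=rng)
    compressed = np.array([compress(x) for x in data])
    start = time.perf_counter()
    train_som(compressed, iterations=7000, rng=rng)
    print(n_hidden, "hidden units:", time.perf_counter() - start, "seconds")
```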
Figure 2: Final map with compressed input vectors (75 components)
Figure 3: Final map with compressed input vectors (30 components)

CONCLUSION

In this paper we have demonstrated the applicability of self-organizing maps to achieve semantically structured software libraries. Such a structuring was achieved on the basis of binary-valued input vectors representing the features of the stored software components. However, our main concern was to show that such a structuring is robust even in the case of highly compressed input data. More precisely, we used an MLP for preprocessing and compressing the original input vectors. Subsequently, these compressed data were used to train the self-organizing map. In a nutshell, we were able to show that the same clusters were formed with the compressed input data as with the original input data. The major benefit of such a two-level approach is the highly reduced time needed to train the self-organizing map while retaining the full classification information.

REFERENCES

Biggerstaff, T. J., and Perlis, A. J. (Eds.) (1989). Software Reusability. Vol. I: Concepts and Models. Vol. II: Applications and Experience. Addison-Wesley.
Frakes, W. B., and Isoda, S. (1994). Success Factors of Systematic Reuse. IEEE Software 11(5). pp 15-19.
Gorlen, K. E., Orlow, S., and Plexico, P. (1990). Data Abstraction and Object-Oriented Programming in C++. John Wiley & Sons.
Gorlen, K. E. (1990). NIH Class Library Reference Manual (Revision 3.10). National Institutes of Health. Bethesda, MD.
Kohonen, T. (1982). Self-Organized Formation of Topologically Correct Feature Maps. Biological Cybernetics 43. pp 59-69.
Kohonen, T. (1989). Self-Organization and Associative Memory. Springer.
Kohonen, T. (1990). The Self-Organizing Map. Proceedings of the IEEE 78. pp 1464-1480.
Krueger, C. W. (1992). Software Reuse. ACM Computing Surveys 24(2). pp 131-183.
Maarek, Y. S., Berry, D. M., and Kaiser, G. E. (1991). An Information Retrieval Approach for Automatically Constructing Software Libraries. IEEE Trans on Software Engineering 17(8). pp 800-813.
Merkl, D., Tjoa, A M., and Kappel, G. (1994). Application of Self-Organizing Feature Maps with Lateral Inhibition to Structure a Library of Reusable Software Components. Proc IEEE Int’l Conf on Neural Networks. Orlando, FL. pp 3905-3908.
Merkl, D. (1995a). A Connectionist View on Document Classification. Proc Australasian Database Conference. Adelaide. pp 153-161.
Merkl, D. (1995b). The Effects of Lateral Inhibition on Learning Speed and Precision of a Self-Organizing Feature Map. Proc Australian Conf on Neural Networks. Sydney. pp 168-171.
Miikkulainen, R. (1991). Self-Organizing Process Based on Lateral Inhibition and Synaptic Resource Redistribution. Proc Int’l Conf on Artificial Neural Networks. Espoo. pp 415-420.
Miyata, Y. (1992). A User’s Guide to PlaNet Environment for Running, and Looking into a PDP Network. Version 5.8. School of Computer and Cognitive Sciences. Chukyo University. Toyota, Japan.
Ostertag, E., Hendler, J., Prieto-Díaz, R., and Braun, C. (1992). Computing Similarity in a Reuse Library. ACM Trans on Software Engineering and Methodology 1(3). pp 205-228.
Ritter, H., and Kohonen, T. (1989). Self-Organizing Semantic Maps. Biological Cybernetics 54. pp 241-254.
Turtle, H. R., and Croft, W. B. (1992). A Comparison of Text Retrieval Models. The Computer Journal 35(3). pp 279-290.