Data Mining in Large Free Text Document Archives

Dieter Merkl, A Min Tjoa
Department of Software Technology, Vienna University of Technology
Resselgasse 3/188, A-1040 Vienna, Austria

Abstract

Document classification may be regarded as one of the central issues in information retrieval research during the last decades. The challenge of classification is to uncover the similarities between groups of data in order to improve the retrieval effectiveness of the overall system. From an exploratory data analysis point of view, the same process of classification may be used to gain insight into the structure of the various data items and may thus be referred to as data mining in text archives. In this paper we show the results of applying a neural network model, the hierarchical feature map, to such a data mining task. The neural network is carefully designed to impose a hierarchical structure on the underlying document collection, which leads to a straightforward representation of data similarities. Apart from the benefit for text data mining, we are able to demonstrate that the hierarchical feature map leads to a tremendous speed-up of the training process as compared to more traditional neural network architectures that are already known to be effective in text classification tasks. It is this time-consuming training process that is commonly regarded as a major obstacle to real-world large-scale neural network application. Hence, hierarchical feature maps point the way towards an effective usage of neural network technology in realistic applications and thus represent a powerful alternative to traditional methods for text classification.

1 Introduction

Many operations we would like computers to perform on text are classification tasks, i.e. tasks of assigning documents to classes. The whole process of text retrieval may even be described as a classification task itself, namely the attempt to assign documents into

Proceedings of the International Symposium on Cooperative Database Systems for Advanced Applications (CODAS'96), Kyoto, Japan, December 5-7, 1996.

one of two classes, i.e. those that the user would like to see at a particular moment, and those the user would not like to see. Similarly, documents may be classified as belonging to a set of, possibly predefined, classes, say a number of particular topics the documents deal with. Analogously, the various classes may correspond to areas of long-term interest of a particular user [11]. If the classes are not known, however, the application of classification methods may be the only practical way to obtain insight into the underlying structure of the document collection. In other words, the classification process may be regarded as text data mining with the goal of revealing structure in document archives.

Talking about a classification task, a number of approaches are applicable. Among the oldest and most widely used ones we certainly have to mention statistics, and here especially cluster analysis. The usage of cluster analysis for document classification has a long tradition in information retrieval. Its specific strengths and weaknesses are well explored, see for example [20] or [25]. Artificial neural networks, in their turn, represent a methodology that is widely used to uncover and impose structure on a great variety of actual data [22]. Especially challenging in this context are, certainly, high-dimensional input data, an almost perfect example of which is represented by text documents, which are by nature described in a high-dimensional space. Consequently, there is already a substantial amount of work on the application of neural networks to information retrieval. Consider as representatives the research reported in [2, 3, 7, 8, 13, 15, 24], to name just some of the more recent papers.

In general, there is wide agreement that the application of artificial neural networks is suitable in areas that are characterized by, first, noise, second, poorly understood intrinsic structure, and third, changing characteristics. Each of these is present in text classification. The noise is imposed by the fact that no completely satisfying way to represent text documents has been found so far. One certainly is far from being audacious when claiming that such a situation will continue to be present for a long time, at least as long as natural language processing is unable to understand text in arbitrarily complex form. Second, the poorly understood intrinsic structure is due to the non-existence of an authority knowing the contents of each and every document, at least in real-world large-scale text collections. Finally, the changing characteristics of document collections are due to the fact that, at least periodically, the collections tend to be enlarged to comprise additional documents.

From the proposed architectures of artificial neural networks we regard the models adhering to the unsupervised learning paradigm as especially well suited for a task such as text data mining. This is due to the fact that in realistic document archives the assumption of being able to define the input-output-mappings required for supervised learning models is questionable, to say the least. By input-output-mappings we refer to a manual assignment of documents to groups, which is only possible when assuming the availability of considerable insight into the structure of the document archive. Contrary to that, in an unsupervised learning environment it remains the task of the artificial neural network to uncover the structure of the document archive.
Hence, the unrealistic assumption of providing proper input-output-mappings is obsolete in an unsupervised environment. In this paper we report on the classification of text documents by means of hierarchically arranged self-organizing maps. The material contained in this paper is organized as follows.

Section 2 is dedicated to an exposition of the neural network architecture we used for document classification. In Section 3 we provide, first, an outline of our experimental setting and a review of some background information concerning document representation. Second, we discuss the results of classifying the documents by means of a hierarchical arrangement of unsupervised neural networks. These results are further contrasted, in the third subsection, with our previous work on the utilization of standard self-organizing maps for document classification. Finally, we present our conclusions in Section 4.

2 Unsupervised Learning in Hierarchical Feature Maps

Basically, the idea of arranging several unsupervised neural networks along a hierarchy was first proposed by Risto Miikkulainen [17]. In this model, referred to as the hierarchical feature map, the self-organizing map [9, 10] is used as the basic unsupervised neural network architecture that is arranged along the lines of a hierarchy. Thus, we have to provide a brief discussion of self-organizing maps prior to presenting the hierarchical feature map.

2.1 Self-Organizing Maps

The self-organizing map as presented in e.g. [9, 10] is one of the most prominent unsupervised artificial neural network models. The model consists of a layer of input units, each of which is fully connected to a grid of output units. These output units are arranged in some topological order, where the most common choice is a two-dimensional grid. Input units take the input patterns and propagate them as they are onto the output units. Each of the output units is assigned a weight vector with the same dimension as the input data.

The learning process of self-organizing maps can be seen as a generalization of competitive learning. The key idea of competitive learning is to adapt the unit with the highest activity level with respect to a randomly selected input pattern in such a way that it exhibits an even higher activity level with this very input in the future. Commonly, the activity level of an output unit is computed as the Euclidean distance between the unit's weight vector and the actual input pattern. Hence, the so-called winning unit, the winner in short, is the output unit with the smallest distance between the two vectors. Adaptation takes place at each learning step and is performed as a gradual reduction of the difference between the respective components of input and weight vector. The degree of adaptation is guided by a so-called learning rate that gradually decreases in the course of time.

As an extension to competitive learning, units in a time-varying and gradually decreasing neighborhood around the winner are adapted, too. Pragmatically speaking, during the learning steps of self-organizing maps a set of units around the actual winner is tuned towards the currently presented input pattern. This learning rule leads to a clustering of highly similar input patterns in closely neighboring parts of the grid of output units. Thus, the learning process ends up with a topological ordering of the input patterns. One might say that self-organizing maps represent a spatially smooth neural variation of k-means clustering

[19] where k is equal to the number of output units.

More precisely, the steps of the learning process may be outlined as given below. The four steps are performed repeatedly until no more changes to the weight vectors are observable.

1. Selection of an input x(t).

2. Calculation of the distances between weight vectors and the input vector according to D_i(t) = ||x(t) - m_i(t)||. In this expression m_i refers to the weight vector of unit i and || · || represents the Euclidean vector norm. As usual, t represents the time-stamp of the current learning iteration.

3. Determination of the winner c according to c : D_c(t) = min_i(D_i(t)).

4. Adaptation of the weight vectors. In particular we use the following learning rule: m_i(t+1) = m_i(t) + α(t) · δ_{c,i}(t) · [x(t) - m_i(t)]. In this formula, α(t) represents a time-varying gain term, i.e. the learning rate, decreasing in the course of time. δ_{c,i}(t) is a time-varying neighborhood function taking into account the distance between the winner c and unit i within the output space.

The task of the neighborhood function is to impose a spatial structure on the amount of weight vector adaptation in that the function computes values in the range [0, 1], depending on the distance between the unit in question and the winner. By means of this function, units in close vicinity to the winner are provided with a larger value of the neighborhood function and thus are adapted more strongly than units that are farther away from the winner. Please refer to [13, 14] for the exact details of the implementation, especially concerning the neighborhood function δ_{c,i}(t). We feel that the inclusion of the exact realization here would require too lengthy a discussion which is not necessary to understand the learning process in general.

Please note that due to the time-varying, or more precisely decreasing, nature of learning rate and neighborhood function, the learning process will converge towards a stable state. The stable state is reached when no further changes to the various weight vectors are observable. In practice, however, the learning process may be terminated earlier, namely at the time when no further variation within the process of winner selection is detected. In other words, the training process may be terminated when each input datum is repeatedly mapped onto the same unit.
2.2 Hierarchical Feature Maps

The key idea of hierarchical feature maps as proposed by Risto Miikkulainen [17, 18] is to apply a hierarchical arrangement of several layers containing two-dimensional self-organizing maps [9, 10]. The arrangement may be characterized as having the shape of a pyramid. The hierarchy among the maps is established as follows. For each output unit at any level of the hierarchy, a two-dimensional self-organizing map is added to the next level, as shown in Figure 1.

Figure 1: Hierarchical feature map

The training of each single self-organizing map follows the basic self-organizing map learning algorithm. Generally, the training of a hierarchical feature map is performed sequentially from the first level, i.e. one self-organizing map, downwards along the hierarchy. The first-level map is trained as usual. Please note that the top-level map is much smaller in size than it would be when applying the basic model. This is due to the fact that the higher levels in the hierarchy are used to capture the hierarchical taxonomy of the input data whereas the bottom-level maps are utilized to distinguish among individual input data. Therefore, only a short training time is needed until the first-level map converges to a stable state. Due to the small size of the first-level map, each of the units is assigned a relatively large number of input data in comparison with the basic model.

As soon as the first-level map has reached a stable state, training continues with the maps of the second level of the hierarchy. Within the second level, each map is trained only with the input data assigned to the corresponding unit of the first-level map. Moreover, the length of the input vectors may be reduced by omitting the vector components which are equal in all of the original input vectors. Due to this reduction in size of the input vectors, the time needed to train the maps is reduced as well. The training of the second level is completed when every map has reached a stable state. Analogously, the same training procedure is utilized to train the third and any subsequent levels of self-organizing maps.

An interesting property of hierarchical feature maps is the tremendous speed-up as compared to the single-level mapping which is performed in the basic self-organizing map. This speed-up originates partially from the dimension reduction of the input vectors from one level to the next. Furthermore, the maps are much smaller in size than they would be in the basic model. However, the decision on the best size of the various maps as well as on the depth of the hierarchy still remains a non-trivial problem. There is no algorithmic solution to determine these parameters in advance. One rather has to examine various parameter settings to choose the best size of the hierarchical feature map.

At first sight it is not apparent why the hierarchical feature map is more efficient in terms of learning speed than the basic model. On the one hand, the reduction of input vector size certainly contributes to the increased learning speed. On the other hand, the size of the maps in the various layers is comparatively small in terms of output units. Yet, the total number of units is commonly larger than in the basic model, and additional overhead is caused by the determination of the input vectors for the next levels. One possibility to explain the speed-up lies in an investigation of the general properties of the self-organizing learning process.

In the basic model, i.e. self-organizing maps, the units that are subject to adaptation are

selected by using a neighborhood function. It is common practice that at the beginning of the learning process almost the whole map is affected by the presentation of an input vector. Thus, the map is forced to establish large clusters of similar input data at the beginning of learning. The neighborhood size decreases gradually during the learning iterations, leading to finer and finer distinctions within the clusters, whereas the overall topology of the cluster arrangement is maintained. However, in the single-level architecture the self-organizing process of each cluster interferes with the self-organization of its topologically neighboring clusters. Especially units along the boundaries tend to be occasionally modified as belonging to one or another cluster. This interference is one reason for the rather time-consuming self-organizing process.

Contrary to that, such interference is dramatically reduced by the architecture of hierarchical feature maps. The topology of the high-level categories is depicted in the first level of the hierarchy. Each of its subcategories is then independently organized on a separate map at a lower level within the hierarchy. These maps in turn are free from maintaining the overall structure since this structure is already determined by the architecture of the hierarchical feature map. To conclude, much computational effort is saved due to the fact that the overall structure of the clusters is maintained in terms of the architecture rather than in terms of the learning process.
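The level-by-level procedure can likewise be sketched in a few lines. The fragment below is a schematic rendering only, reusing the hypothetical train_som from the previous sketch; the layout argument, the winner_index helper, and the way constant vector components are dropped are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def winner_index(weights, x):
    """Flat index of the unit whose weight vector is closest to input x."""
    rows, cols, _ = weights.shape
    flat = weights.reshape(rows * cols, -1)
    return int(np.argmin(np.linalg.norm(flat - x, axis=-1)))

def train_hfm(data, layout, level=0):
    """Train a hierarchical feature map level by level (sketch).

    layout lists the map size per level, e.g. [(2, 2), (2, 2), (3, 3)].
    Returns a nested dict holding each level's map and its children.
    """
    data = np.asarray(data, dtype=float)
    rows, cols = layout[level]
    weights = train_som(data, rows=rows, cols=cols)   # train this level's map as usual
    node = {"weights": weights, "children": {}}
    if level + 1 == len(layout):
        return node

    # Partition the data among this map's units and train one child map per unit.
    assignment = np.array([winner_index(weights, x) for x in data])
    for unit in range(rows * cols):
        subset = data[assignment == unit]
        if len(subset) < 2:        # nothing left to distinguish on a child map
            continue
        # Dimension reduction: drop components that are identical for every
        # input assigned to this unit before training the child map.
        keep = np.ptp(subset, axis=0) > 0
        node["children"][unit] = train_hfm(subset[:, keep], layout, level + 1)
    return node
```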

3 Data Mining in Document Spaces

3.1 Document Representation

Throughout the remainder of the paper we will use the various manual pages of the NIH Class Library [4], NIHCL, as a sample document archive. The NIHCL is a collection of classes developed in the C++ programming language. The class library covers classes for storing and retrieving arbitrarily complex data structures on disk, generally useful data types such as String, Time, and Date, and finally a number of container classes such as, for example, Set, Dictionary, and OrderedCltn. We do not want to go into further detail concerning the classes as such. The necessary information can be found in the reference manual [4] as well as in [5]. At this point we just want to define the least requirement on the data mining process, namely to uncover the overall structure of the document collection, i.e. the NIH Class Library.

Generally, the task of text classification may be paraphrased as uncovering the semantic similarities between various text documents. However, since natural language processing is still far from understanding arbitrarily complex document structures, the various documents have, in a first step, to be mapped onto some representation language in order to be comparable. Still one of the most widely used representation languages is single-term full-text indexing. Roughly speaking, the documents are represented by the set of words they are built of. In order to achieve a concise representation, words that appear either too often or too rarely within the entire document collection are excluded from the representation language. Additional improvement can be gained by means of prefix and suffix stripping, i.e. by reducing the words to their stems.
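As an illustration of such an indexing step, the sketch below builds a vocabulary by document-frequency thresholding with a very crude suffix-stripping stand-in. The function names, the frequency cut-offs, and the suffix list are purely illustrative assumptions; the paper does not report the exact values or the stemmer it used.

```python
import re
from collections import Counter

MIN_DOC_FREQ = 2            # drop terms that appear in too few documents (assumed value)
MAX_DOC_FREQ_RATIO = 0.5    # drop terms that appear in too many documents (assumed value)
SUFFIXES = ("ing", "ed", "es", "s")   # crude stand-in for real prefix/suffix stripping

def stem(word):
    """Strip one common suffix; a placeholder for proper suffix stripping."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_vocabulary(documents):
    """Select index terms by document frequency from raw manual-page texts."""
    doc_terms = [{stem(w) for w in re.findall(r"[a-z]+", doc.lower())} for doc in documents]
    df = Counter(term for terms in doc_terms for term in terms)
    max_df = MAX_DOC_FREQ_RATIO * len(documents)
    vocab = sorted(t for t, n in df.items() if MIN_DOC_FREQ <= n <= max_df)
    return vocab, doc_terms
```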

Set - Unordered Collection of Non-Duplicate Objects
Base Class: Collection
Derived Classes: Dictionary, IdentSet
Related Classes: Iterator

A Set is an unordered collection of objects. The objects cannot be accessed by a key as can the objects in a Dictionary, for example. Unlike a Bag, in which equal objects may occur more than once, a Set ignores attempts to add any object that duplicates one already in the Set. A Set considers two objects to be duplicates if they are isEqual() to one another. Class Set is implemented using a hash table with open addressing. Set::add() calls the virtual member function hash() and uses the number returned to compute an index to the hash table at which to begin searching, and isEqual() is called to check for equality.

Figure 2: NIHCL manual entry of class Set

As a result, the various text documents are represented by vectors of equal dimension. Each vector component corresponds to a keyword from the representation language, and the entry in a particular vector component reflects the importance of that very keyword in describing the software component. In the basic form, i.e. binary single-term indexing, an entry of one indicates that this specific keyword was extracted from the description of the component at hand. Contrary to that, an entry of zero means that the corresponding keyword is not contained in the component's description. Such a representation is known as the vector space model of information retrieval [23]. Assuming a meaningful representation, the similarity between two documents corresponds to the distance between their vector representations. The benefit of such a vector-based representation is the ease of implementing a best-match retrieval strategy, where the retrieved documents may be ranked according to decreasing similarity between the vector representing the actual query, i.e. the description of the needed documents, and the vectors representing the various text documents stored in the archive.

In order to obtain the final document representation, we accessed the full text of the various manual pages describing NIHCL classes. As an illustrative example we refer to Figure 2, containing the textual description of class Set. The manual entries are further full-text indexed, generating a binary vector-space representation [23] of the documents. Just to provide the exact figure, each component is represented by a 489-dimensional feature vector. These vectors are subsequently used as the input data to the artificial neural network. Please note that we do not make use of the meta-information contained in the manual entries, such as the reference to the name of a class' base class, derived classes, and related classes. The utilization of this kind of information in the document representation might have a positive impact on document classification and thus certainly might be an area for further investigation when more emphasis is directed towards the software reuse aspect of document classification [6, 21]. Such information, however, is not generally available in document archives and thus, for the sake of global applicability, we refrained from using this type of meta-information here.
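Building on the hypothetical build_vocabulary sketch above, the binary vector-space representation and a best-match ranking could look as follows. Cosine similarity is used here as one common choice; the paper itself only speaks of distances between document vectors, so this particular measure is an assumption for illustration.

```python
import numpy as np

def binary_vectors(doc_terms, vocab):
    """Binary single-term indexing: one 0/1 vector per document over the vocabulary."""
    index = {term: j for j, term in enumerate(vocab)}
    vectors = np.zeros((len(doc_terms), len(vocab)))
    for i, terms in enumerate(doc_terms):
        for term in terms:
            if term in index:
                vectors[i, index[term]] = 1.0
    return vectors

def best_match(query_vector, doc_vectors, top=5):
    """Rank documents by decreasing cosine similarity to the query vector."""
    norms = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
    sims = doc_vectors @ query_vector / np.where(norms == 0.0, 1.0, norms)
    return np.argsort(-sims)[:top]
```

In the experiments reported here, the 489-dimensional binary vectors obtained from the NIHCL manual pages serve directly as the input to the neural network.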

Figure 3: Top-level map. The four units are labeled File I/O Classes, Datastructures with keyed access, Data Types, and Datastructures.

3.2 A Hierarchical View of Document Classes

Based on the document representation as presented above, we trained a hierarchical feature map to classify the document space, i.e. the descriptions of the various software components. A typical result of the training process is presented below. The actual setup of the hierarchical map was exactly the same as depicted in Figure 1. Hence, we had four units in the top-level layer. Each of these units is expanded to a four-unit second-level map. The final third-level layer consists of nine units in each single map. This setup has been determined empirically after a number of test runs.

Consider first Figure 3, containing the top-level map. In this figure we refrained from typing the names of the various classes for the sake of readability. Instead, we give the general purpose of the classes as meta-information concerning the mapping process. It is exactly this general structure that is mirrored in the arrangement of classes within the top-level map. In other words, the top-level neural network arranges the documents into four classes which show the variety of classes contained in the NIHCL with respect to their intended functionality. The number of classes, obviously, is given by the setup of the architecture. Worth pointing out, however, is the fact that no misclassification occurred during the learning process of the top-level map.

With the documents assigned to the various classes in the top-level map, the training process continues with the second level. Here, only the subset of documents that is matched onto the corresponding top-level unit is used for presentation. The result of the training process of the second-level maps is presented in Figure 4. There is already a lot of highly interesting information concerning the relationship between the various classes assigned to the same general, or top-level, group. As a first example, consider the map that is assigned the file I/O classes, i.e. the upper left map of Figure 4. In this map, pairs of classes performing "complementary" operations are assigned to the same unit. In fact, the "direction" of the operation is signified by the fourth character of the class name, 'i' for input and 'o' for output.
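A quick check of the resulting architecture size, under the setup just described (four top-level units, a four-unit map below each of them, and a nine-unit map below each second-level unit), reproduces the total of 164 units quoted in the timing comparison of Section 3.3:

```python
top_level = 4                      # units in the top-level map
second_level = top_level * 4       # 4 maps with 4 units each = 16 units
third_level = second_level * 9     # 16 maps with 9 units each = 144 units
print(top_level + second_level + third_level)   # 164 units in total
```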

Figure 4: Middle-level maps, one for each of the four top-level units (File I/O Classes, Data Types, Datastructures with keyed access, Datastructures), showing the NIHCL classes mapped onto their units; classes from the validation set are printed in bold face.

In order to provide at least some basic validation of the training process, we have split the input data into a training set and a validation set. The validation data is printed in bold face in the figure. One element of the validation set was the class OIOostream, which is correctly mapped onto the same second-layer unit as its counterpart class OIOistream.

As a second example, consider the separation of the documents describing data structures into a subset of data structures with keyed access and a subset with "non-keyed access." This classification is already imposed within the top-level map. In the middle-level map, however, we are able to point to a highly interesting and useful arrangement. Consider the right upper unit in the middle-level map containing the data structures with keyed access. This unit comprises not only the data structures themselves, i.e. Dictionary and IdentDict, but also the classes that actually implement the keyed access, namely Assoc, AssocInt, and LookupKey. Here again, we had one representative of a data structure with keyed access as a member of the validation set, i.e. KeySortCltn. This document is mapped correctly onto the unit containing closely related data structures.

There is certainly a lot of other interesting information concerning the text collection contained in the maps depicted in Figure 4. We will, however, pick just one more example for detailed discussion here, namely the lower right data types map. This map contains the basic numerical data types Float and Integer. The class Random, contained in the validation set, was mapped onto this very unit as well. This again corresponds to the functionality of the respective data types, since Random implements a random number generator producing (pseudo-)random numbers of Float data type.

Figure 5: Bottom-level map, distinguishing the classes of the crowded data types unit: Range, Regex, Object, Class, FDSet, Vector, Exception, Nil, Link, Date, Point, and Time.

From the low-level maps we present just one of the data types maps in this paper. More precisely, the third-level representation of the highly crowded right upper unit is depicted in Figure 5. The names of the various classes are mostly self-explanatory, so visual inspection of the classification is readily possible. We refrained from a graphic representation of the other third-level maps since they are highly similar in spirit to Figure 5.

3.3 A Flat View on Document Classes

In order to provide the means for convenient comparison, we present the results of two more conventional approaches to document classification in this subsection. One of these approaches is statistical, i.e. cluster analysis; the other is neural, i.e. self-organizing maps.

The statistical results presented in Table 1 were obtained with hierarchical agglomerative clustering using complete linkage as the fusion algorithm. The results of centroid or Ward clustering are quite similar. Regarding the information on the overall structure of the document archive, we may conclude that cluster analysis tends to produce, on the one hand, large clusters of rather unrelated documents, cf. classes 1 and 2 in the results presented in Table 1. On the other hand, some classes are highly specific in the sense of often containing just one document. Obviously, such a result is not very useful in an environment where one is interested in obtaining insight into the inherent structure of a particular document archive. In this sense, the distinction between the various groups of documents as contained in the archive, i.e. the various software components, is by no means apparent from the results of cluster analysis. For a more detailed exposition of the shortcomings of cluster analysis in comparison with unsupervised neural networks we refer to [13].

Our previous work on document classification was mainly concerned with the usage of self-organizing maps. A comparison with these results is necessary, even more so since hierarchical feature maps represent an extension of self-organizing maps.
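For reference, a grouping of the kind shown in Table 1 can be reproduced with off-the-shelf tools. The sketch below is not the authors' original procedure; it assumes the binary document vectors from the earlier sketches (the name doc_vectors is hypothetical) and uses SciPy's complete-linkage implementation, cut once at 6 and once at 10 clusters to mirror the two parts of Table 1.

```python
from scipy.cluster.hierarchy import fcluster, linkage

def complete_linkage_classes(doc_vectors, n_classes):
    """Hierarchical agglomerative clustering with complete linkage, cut into n_classes groups."""
    dendrogram = linkage(doc_vectors, method="complete", metric="euclidean")
    return fcluster(dendrogram, t=n_classes, criterion="maxclust")

# labels6 = complete_linkage_classes(doc_vectors, 6)    # cf. Table 1(a)
# labels10 = complete_linkage_classes(doc_vectors, 10)  # cf. Table 1(b)
```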

. ReadFromTbl StoreOnTbl . AssocInt . . IdentDict . IdentSet

OIOofd . . Assoc . Dictionary KeySortCltn . Set .

OIOifd . LinkOb . LookupKey . . SortedCltn . Bag

OIOin OIOout . Class . . OrderedCltn . Heap .

OIOistream . OIOostream . Link LinkedList . Stack . Bitset

. OIOnihin . Exception . . SeqCltn . . Vector

OIOnihout . FDSet . Iterator . . . Arrayob .

. . . Nil . Collection . Arraychar . Range

Object . Point . Random . Float . String .

. Date . Time . Integer . . . Regex

Figure 6: A 10 × 10 self-organizing map

Moreover, the self-organizing map was used in [12] for text classification. In that paper the authors use, however, just a fairly small document representation, made up of 25 distinct terms.

A word on the different graphic representation compared to the figures containing the training results of the hierarchical feature map is in order. In the graphical representation of the self-organizing map each unit is signified by a dot in the figure. However, if a unit represents a particular document, or in other words, if a unit is the winner for a document, that document's name, i.e. the class name, is shown at the respective position in the figure.

Figure 6 shows in fact a highly similar classification result, in that the various documents are arranged within the two-dimensional output space of the self-organizing map in concordance with their mutual functional similarity. It is, however, not that intuitively observable where to draw the borderline between the various groups of similar documents. Pragmatically speaking, an erroneous conclusion might be that the documents Object and OIOnihout are as similar to each other as the documents OIOnihout and OIOistream are, because both pairs are arranged at the same distance at the top of the final map. Apparently, such a conclusion results in a highly erroneous perception of the underlying document collection. We should note at this point, however, that this drawback of self-organizing maps is the subject of intensive research in the neural network community, see [1, 16] for recent achievements.

Finally, we want to compare the classification capability of both neural network models according to the time needed for training. The exact figures are presented in Table 2. As already indicated in Section 2, one of the striking arguments in favor of hierarchical feature maps is their tremendous speed-up compared to self-organizing maps. The timing was done on an otherwise idle SUN SPARC-20 workstation with neural networks of exactly the same setup as above. Thus, hierarchical feature maps exhibit a training process needing just a fraction of the time required for self-organizing map training, even though the number of units in the hierarchical feature map was 164 as compared to 100 units of the self-organizing map.

However, to be fair, we have to note that the detection of inter-cluster similarity is by no means as straightforward as in self-organizing maps. Consider for example the separation into data structures with and without keyed access in the hierarchical feature map. The information that both groups are container classes and in this sense related is invisible in the final arrangement. Moreover, the selection of the hierarchical feature map's actual setup is a non-trivial task in general since one has to have some expectation concerning the structure

of the input data prior to neural network training. To conclude, we believe that hierarchical feature maps represent a promising alternative for exploratory data analysis in document collections which can, due to their learning speed, also be used for very large libraries.

4 Conclusion

In this paper we reported on the successful application of a novel neural network model to the task of document classification. We put specific emphasis on the question whether unsupervised neural networks are capable of revealing the inherent structure of document archives; in other words, we were mainly interested in the text data mining capabilities of such models. As the experimental setting we used the manual pages describing the various C++ classes contained in a real-world class library where the inherent structure is known and thus available for evaluation. The document representation was built by using full-text indexing of the textual descriptions. The document representation according to the vector space model was further used as the input data during the learning process of the neural network.

The essentials of the chosen neural network model may be characterized as follows. First, the model adheres to the unsupervised learning paradigm, which makes it well suited for applications with frequently changing underlying data; text classification is certainly a perfect example of such an application. Second, the model imposes a hierarchical structure on the input data because of its architecture made up of small, hierarchically arranged unsupervised neural networks. Third, thanks to this architecture, the model achieves extremely fast training times, making it an interesting alternative to standard approaches to text classification.

Concerning the classification process as such, we have demonstrated that the neural network has successfully uncovered the semantic similarity of the various text documents and structured them accordingly. This fact, together with its high learning speed, makes hierarchical feature maps a promising alternative to standard approaches for text classification.

References

[1] M. Cottrell and E. de Bodt. A Kohonen map representation to avoid misleading interpretations. In Proc of the European Symposium on Artificial Neural Networks (ESANN'96), Brugge, Belgium, 1996.
[2] F. Crestani. Learning strategies for an adaptive information retrieval system using neural networks. In Proc of the IEEE Int'l Conf on Neural Networks (ICNN'93), San Francisco, CA, 1993.
[3] C. J. Crouch, D. B. Crouch, and K. Nareddy. A connectionist model for information retrieval based on the vector space model. Int'l Journal of Expert Systems, 7(2), 1994.
[4] K. E. Gorlen. NIH class library reference manual. National Institutes of Health, Bethesda, MD, 1990.
[5] K. E. Gorlen, S. Orlow, and P. Plexico. Abstraction and Object-Oriented Programming in C++. John Wiley, New York, 1990.

[6] E.-A. Karlsson. Software Reuse: A Holistic Approach. John Wiley, Chichester, 1995.
[7] S. Keane, V. Ratnaike, and R. Wilkinson. Hierarchical news filtering. In Proc of the Int'l Conf on Practical Aspects of Knowledge Management, Basel, Switzerland, 1996.
[8] M. Kohle and D. Merkl. Visualizing similarities in high dimensional input spaces with a growing and splitting neural network. In Proc of the 6th Int'l Conf on Artificial Neural Networks (ICANN'96), Bochum, Germany, 1996.
[9] T. Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 1982.
[10] T. Kohonen. Self-Organizing Maps. Springer-Verlag, Berlin, 1995.
[11] D. D. Lewis. An evaluation of phrasal and clustered representations on a text categorization task. In Proc ACM SIGIR Int'l Conf on Research and Development in Information Retrieval (SIGIR'92), Copenhagen, Denmark, 1992.
[12] X. Lin, D. Soergel, and G. Marchionini. A self-organizing semantic map for information retrieval. In Proc ACM SIGIR Int'l Conf on Research and Development in Information Retrieval (SIGIR'91), Chicago, IL, 1991.
[13] D. Merkl. A connectionist view on document classification. In Proc of the 6th Australasian Database Conf (ADC'95), Adelaide, Australia, 1995.
[14] D. Merkl. Content-based document classification with highly compressed input data. In Proc of the 5th Int'l Conf on Artificial Neural Networks (ICANN'95), Paris, France, 1995.
[15] D. Merkl. Content-based software classification by self-organization. In Proc IEEE Int'l Conf on Neural Networks (ICNN'95), Perth, Australia, 1995.
[16] D. Merkl and A. Rauber. On the similarity of eagles, hawks, and cows: Visualization of similarity in self-organizing maps. In Proc Int'l Workshop on Fuzzy-Neuro-Systems (FNS'97), Soest, Germany, 1997.
[17] R. Miikkulainen. Trace feature map: A model of episodic associative memory. Biological Cybernetics, 66, 1992.
[18] R. Miikkulainen. Subsymbolic Natural Language Processing: An Integrated Model of Scripts, Lexicon, and Memory. MIT Press, Cambridge, MA, 1993.
[19] B. D. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, 1996.
[20] G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, MA, 1989.
[21] W. Schafer, R. Prieto-Diaz, and M. Matsumoto. Software Reusability. Ellis Horwood, New York, 1994.
[22] K. Swingler. Applying Neural Networks: A Practical Guide. Academic Press, London, 1996.
[23] H. R. Turtle and W. B. Croft. A comparison of text retrieval models. Computer Journal, 35(3), 1992.
[24] R. Wilkinson and P. Hingston. Incorporating the vector space model in a neural network used for information retrieval. In Proc ACM SIGIR Int'l Conf on Research and Development in Information Retrieval (SIGIR'91), Chicago, IL, 1991.
[25] P. Willet. Recent trends in hierarchic document clustering: A critical review. Information Processing & Management, 34, 1988.

(a) Complete linkage, 6 classes:
Class 1: Arraychar, Arrayob, Bag, Bitset, Class, Heap, IdentDict, IdentSet, KeySortCltn, LinkedList, LinkOb, OIOifd, OIOin, OIOistream, OIOnihin, OIOnihout, OIOofd, OIOostream, OIOout, OrderedCltn, ReadFromTbl, Regex, SeqCltn, Set, SortedCltn, Stack, StoreOnTbl
Class 2: Assoc, AssocInt, Date, Dictionary, Exception, FDSet, Float, Integer, Link, LookupKey, Nil, Point, Random, Range
Class 3: Collection, Iterator
Class 4: Object
Class 5: String
Class 6: Vector

(b) Complete linkage, 10 classes:
Class 1: Arraychar, Arrayob, Bag, Bitset, KeySortCltn, LinkedList, OIOifd, OIOin, OIOistream, OIOnihin, OIOnihout, OIOofd, OIOostream, OIOout, OrderedCltn, Regex, SeqCltn, SortedCltn, Stack
Class 2: Assoc, AssocInt, Date, Dictionary, Exception, FDSet, Float, Integer, Link, LookupKey, Nil, Point, Random, Range, Time
Class 3: Class, ReadFromTbl, StoreOnTbl
Class 4: Collection
Class 5: Heap, IdentDict, IdentSet, Set
Class 6: Iterator
Class 7: LinkOb
Class 8: Object
Class 9: String
Class 10: Vector

Table 1: Cluster analysis using complete linkage

Training time: Hierarchical Feature Map 09:73, Self-Organizing Map 59:08

Table 2: Time needed for training on SPARC-20