Proc. of the IEEE Int'l Conference on Neural Networks (ICNN'95), Perth, Australia, Nov 27 - Dec 1, 1995, pp. 1086-1091.

Content-Based Software Classification by Self-Organization

Dieter Merkl
Institute of Software Technology
Vienna University of Technology
Resselgasse 3/188, A-1040 Vienna, Austria
[email protected]

ABSTRACT

This paper is concerned with a case study in content-based classification of textual documents. In particular, we compare the application of two prominent self-organizing neural networks to the same problem domain, namely the organization of software libraries. The two models are Adaptive Resonance Theory and Self-Organizing Maps. As a result, we are able to show that both models successfully arrange software components according to their semantic similarity.

1. Introduction

Software reuse is concerned with the technological and organizational issues of using already existing software components to build new applications. It is believed to be one of the most promising proposals to overcome the frequently discussed software crisis, which may be described as the lack of productivity in the software area as well as the inability of software suppliers to satisfy the needs of their customers. However, to make software reuse operational, software developers need to be provided with large libraries containing reusable software components. Furthermore, the user of such a library must be assisted in locating components that are functionally close to the required component. This task is supported by clustering the stored software components into groups of semantically similar components, i.e. software components that exhibit similar behavior.

In this paper we address the organization of a library of reusable software components by means of unsupervised artificial neural networks. Artificial neural networks were chosen for several reasons. First, the models are robust in the sense of tolerating noisy or inexact input data. This property is especially important during query processing, where a query may be regarded as an inexact representation of stored software components. Second, the models serve as associative memories. In other words, artificial neural networks are capable of retrieving software components when given only part of the description of the required software component. Third, the models are able to generalize. Roughly speaking, artificial neural networks learn conceptually relevant features of the input domain and organize the stored components accordingly.

The remainder of this paper is organized as follows. In Section 2 we briefly describe the artificial neural networks that have been used during the experiments. Section 3 contains a brief survey of the application domain and some hints on conventional approaches to software library organization. In Section 4 we present the results of training a self-organizing feature map and an ART model. Finally, Section 5 contains some conclusions.

2. Self-Organizing Neural Networks

2.1. The ART model

Basically, an ART network consists of two layers of units named F1 and F2. The units contained in F1 receive input from the external world; therefore, F1 is referred to as the feature representation field. The units contained in F2 are used to represent the various clusters of the input data and thus this layer is termed the category representation field. Weighted connections exist between the two layers such that each unit contained in F1 is connected to every unit in F2 and vice versa.

During data processing with an ART network no distinction is made between a learning mode and an execution mode as with other artificial neural network models. Rather, learning may be performed continuously whenever a sufficiently novel input is presented to the ART network. In particular, the input data, represented by a vector, is presented to the units contained in F1, where each unit receives one component of the respective input vector. Subsequently, the units in F2 are activated by propagating the input vector along the weighted connections from F1 to F2. The units in F2 perform a competition to represent the current input in the sense that the unit with the highest activation level is selected as the winning unit. The activation of this winning unit is propagated back to the layer F1 via its weighted connections. Hence, a vector of the same dimension as the input data is computed. This vector may be regarded as the prototype of the category of input data represented by the winning unit. In case of sufficient similarity between the prototype and the actual input, the category is enlarged to comprise the actual input, too. This case is referred to as the resonant state of the ART network, in which adaptive behavior is exhibited. Adaptive behavior is related to the change of the connection weights. In particular, the weighted connections between F1 and F2 are adapted in such a way that the activation level of the winning unit will be even higher at future presentations of the same input vector. Moreover, the connections between F2 and F1 are adjusted to increase the similarity between the prototype and the input. The test for sufficient similarity may be seen as a comparison of the corresponding vector components together with a simple threshold logic to define the degree of required similarity. In the terminology of ART networks this threshold is referred to as the attentional vigilance parameter. A higher vigilance parameter results in more precise categories. As an extreme, if complete equality is required, each input will be represented by a separate unit in F2.

In case of insufficient similarity between the prototype and the input with respect to the vigilance parameter, a completely different behavior of the ART network is exhibited. In particular, the input is again propagated onto F2. Yet, this time the previously winning unit is excluded from the competition by means of an inhibiting reset signal provided by the so-called orienting subsystem. Thus, another unit will win the competition and another prototype will be generated for the subsequent test of similarity. This process is repeated until a unit is found that represents a category sufficiently similar to the input. However, this might be a unit representing no input so far, thus leading to the establishment of a new category. In this sense, the continuous adaptability of the ART network is ensured by adding new units to F2 whenever none of the already existing units represents a sufficiently similar category. For a more detailed description of ART models we refer to [2]. A highly idealized schematic representation of an ART model is depicted in Figure 1; a sketch of the corresponding clustering procedure is given below.

Fig. 1. Schematic representation of an ART model (attentional subsystem with F1 and F2 layers, gain control, and orienting subsystem)
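To make the above procedure concrete, the following is a minimal sketch of an ART1-style clustering pass over binary feature vectors. It assumes a simplified choice and match function with fast learning; the function name art1_cluster and all parameter values are illustrative and do not reproduce the exact implementation used in the experiments.

```python
import numpy as np

def art1_cluster(inputs, vigilance=0.7, beta=1.0):
    """Cluster binary input vectors with a simplified ART1-style procedure.

    inputs    : iterable of 0/1 feature vectors (one per software component)
    vigilance : required overlap between prototype and input, in (0, 1]
    beta      : small constant in the choice function to break ties
    """
    prototypes = []                      # one prototype (top-down weights) per F2 unit
    assignments = []
    for x in inputs:
        x = np.asarray(x, dtype=float)
        # choice function: overlap with each prototype, normalized by prototype size
        scores = [np.sum(np.minimum(p, x)) / (beta + np.sum(p)) for p in prototypes]
        winner = None
        for j in np.argsort(scores)[::-1]:              # F2 competition, best unit first
            match = np.sum(np.minimum(prototypes[j], x)) / max(np.sum(x), 1.0)
            if match >= vigilance:                       # resonance: category similar enough
                winner = j
                break                                    # else the orienting subsystem resets j
        if winner is None:                               # no resonant category: new F2 unit
            prototypes.append(x.copy())
            winner = len(prototypes) - 1
        else:                                            # fast learning: prototype AND input
            prototypes[winner] = np.minimum(prototypes[winner], x)
        assignments.append(winner)
    return assignments, prototypes
```

With vigilance close to 1, almost every distinct input opens its own F2 unit, which mirrors the extreme case of complete equality described above.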

2.2. The self-organizing map

The architecture of a self-organizing map [7] consists of a layer of n input units and a grid of output units, each of which has an n-dimensional weight vector assigned to it. The task of the input units is to receive the various input vectors representing real-world entities and to propagate them as they are onto the grid of output units. Each of the output units in turn computes exactly one output value which is proportional to the similarity between the current input vector and that unit's weight vector. This value is commonly referred to as the unit's activation or the unit's response to the presentation of an input. Usually, the Euclidean distance is used as the measure of similarity.

The adaptation of the weight vectors represents the crucial part of any unsupervised learning rule. This process may be described in three steps which are performed repeatedly; these three steps are henceforth collectively referred to as one learning iteration. First, one input vector at a time is randomly selected from the set of input vectors. Second, this input vector is mapped onto the grid of output units of the self-organizing map and the unit with the strongest response is determined. This unit is further referred to as the winning unit, or the winner in short. Notice that with the Euclidean distance metric the unit with the smallest distance between input and weight vector is selected as the winner. Hence, the winner is the output unit holding the most similar internal representation of the input at hand. Third, the weight vector of the winner as well as the weight vectors of units in the topological neighborhood of the winner are adapted in such a way that these units will exhibit an even stronger response to the same input vector in the future. In simpler terms, the third step reduces the distance between the input vector and the weight vectors of a subset of the output units and thus improves the correspondence between the description of an input and its internal representation. Such a distance reduction may easily be accomplished by a gradual reduction of the difference between corresponding vector components. This adaptation is further guided by a so-called learning rate in the interval [0, 1] determining the amount of adaptation and a so-called neighborhood rate determining the spatial range of adaptation. In order to guarantee the convergence of the learning process, i.e. a stable arrangement of weight vectors, both the learning rate and the neighborhood rate have to shrink in the course of time. In other words, the amount of adaptation of weight vectors decreases during the learning process with the decreasing learning rate. Furthermore, the number of units that are subject to adaptation, i.e. the spatial range of adaptation, decreases as well, such that towards the end of learning only the winner is adapted and the weight vectors of neighboring units remain unchanged. Given these two restrictions, the learning process converges towards a stable arrangement of weight vector entries. Moreover, the self-organizing map will assign highly similar input data to neighboring output units thanks to the inclusion of a spatial dimension in the learning process.

As an illustrative example consider Figure 2, which contains a schematic representation of a self-organizing map. In this figure the grid of output units consists of a square of 36 output units. One input vector x is mapped onto the grid of output units and the winning unit is selected; in the figure the winner is depicted as a black node. The weight vector of the winner, i.e. wi(t), is moved towards the current input vector. Since the input and the weight vector have the same dimension, they may both be regarded as vectors of the same space and thus both are depicted as belonging to the input space in the figure. As a consequence of the adaptation, the winner will produce a higher response to the same input vector at the next learning iteration, i.e. t+1, because the unit's weight vector, i.e. wi(t+1), is now nearer to the input vector x. Apart from the winner, adaptation is performed for neighboring units, too. Units that are subject to adaptation are depicted as shaded nodes in the figure. Moreover, the shading of the nodes corresponds to the degree of adaptation and thus to the spatial range of weight vector adaptation. Generally, units in close vicinity to the winner are adapted more strongly and consequently they are depicted with a darker shade.

Fig. 2. Schematic representation of a self-organizing map (input space containing the vector x and the winner's weight vectors wi(t) and wi(t+1); grid of output units in the output space)
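The three steps of one learning iteration translate directly into code. The following is a minimal sketch, assuming a rectangular grid, Euclidean distance, a Gaussian neighborhood, and a simple linear decay of the learning rate and neighborhood radius; the function name train_som and all parameter values are illustrative and are not the exact settings used in the experiments reported below.

```python
import numpy as np

def train_som(inputs, grid_shape=(6, 6), n_iterations=5000,
              lr_start=0.5, radius_start=3.0, rng=None):
    """Train a self-organizing map on row-wise input vectors (illustrative sketch)."""
    rng = rng or np.random.default_rng(0)
    inputs = np.asarray(inputs, dtype=float)          # shape (n_samples, n_features)
    rows, cols = grid_shape
    weights = rng.random((rows, cols, inputs.shape[1]))   # one weight vector per output unit
    # grid coordinates of every output unit, used for the neighborhood function
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)

    for t in range(n_iterations):
        frac = t / n_iterations
        lr = lr_start * (1.0 - frac)                      # shrinking learning rate
        radius = max(radius_start * (1.0 - frac), 0.5)    # shrinking neighborhood range

        x = inputs[rng.integers(len(inputs))]             # step 1: pick a random input
        dists = np.linalg.norm(weights - x, axis=-1)      # step 2: Euclidean response of all units
        winner = np.unravel_index(np.argmin(dists), dists.shape)

        # step 3: move the winner and its grid neighbors towards the input
        grid_dist = np.linalg.norm(coords - np.array(winner), axis=-1)
        influence = np.exp(-(grid_dist ** 2) / (2.0 * radius ** 2))
        weights += lr * influence[..., np.newaxis] * (x - weights)

    return weights
```

Feeding the binary keyword vectors described in Section 3 into such a procedure yields a map in which semantically similar components end up on neighboring output units.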

3. Classification of Software Components

There is common agreement that reusing software is one of the key issues in improving both the productivity of software suppliers and the quality of the software itself. The increased productivity is due to the amount of work that is saved each time a component is reused. The increased quality relates to the fact that the same component is used and tested in many different contexts. Due to the limited space in this paper we cannot provide a more precise exposition of the various areas of software reuse; we rather refer to [1], [6], [13]. It is necessary to recognize, however, that the reuse of existing software components relies on the existence of libraries which contain these components [3]. In order to be useful, such libraries have to provide a large number of reusable software components in a wide spectrum of application domains. Yet, an inherent problem of large libraries is the fact that finding and choosing the appropriate components tend to become troublesome tasks. Consequently, the libraries ought to be organized in a way that facilitates locating and retrieving the needed software components. In other words, the stored software components ought to be arranged in a way that reflects their functional similarity as closely as possible. Pragmatically speaking, we can conclude that components exhibiting similar behavior should be stored near to each other and thus be identifiable.

The task of library structuring is commonly performed by using either semantic networks [12] or cluster analysis [8]. The former approach achieves smart retrieval behavior at the expense of a manually constructed semantic net. The advantage of the latter approach is the high degree of possible automation, since the basis of cluster analysis is keywords automatically extracted from the textual descriptions of the various software components. The hypothesis behind such an approach might be characterized as follows: the more keywords two textual descriptions have in common, the more similar the respective software components are.

Our approach relies on keywords automatically extracted from the manual of the software components. These keywords are further used as the input data to an artificial neural network which performs the task of library structuring. Specifically, each component is described by a set of keywords extracted from the full text of the manual. During this so-called indexing process we utilize a small list of stop-words, i.e. words that are to be excluded from the final component description, to clean up the resulting index. In fact, the stop-word list comprises only conjunctions, articles, and pronouns and thus no domain-specific knowledge. As an aside, however, we should note that the construction of domain-specific stop-word lists is a non-trivial and laborious task since the possible stop-words vary from one text collection to another. Subsequently, each software component is represented as a binary-valued vector where each component corresponds to a possible document feature, i.e. keyword. Thus, a matrix representation as depicted in Table 1 is obtained. In this table each column represents a software component, i.e. SCi, and each row corresponds to a keyword, i.e. KWj. An entry of one in the intersection of column i and row j means that the keyword KWj has been extracted from the description of software component SCi. Contrary to that, an entry of zero represents the fact that the corresponding keyword is not contained in the description of that very software component. Such a representation is known as the vector space model of information retrieval [14]. The columns of the matrix have been used as the input during the training process of the artificial neural networks.

Table 1: Representation of software components

       SC1   SC2   ...   SCm
KW1     1     0    ...    1
KW2     0     1    ...    1
...    ...   ...   ...   ...
KWn     0     0    ...    1

A more detailed description of the approach may be found in [9] and [11], where we describe the application of self-organizing maps with various learning functions to the task of software library organization. In [10] we provide a description of the retrieval process by using the self-organizing map. In that paper the retrieval results are compared with the more classical approach of cluster analysis. As a result, we were able to show that the self-organizing map provides a better classification than cluster analysis. More precisely, cluster analysis failed in providing the user with relevant software components in case of vague queries, whereas the self-organizing map still returned the correct components.
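As an illustration of this indexing step, the following sketch builds the binary keyword-by-component matrix of Table 1 from raw textual descriptions. The stop-word list shown here is a small hypothetical stand-in (the paper only states that the real list contains conjunctions, articles, and pronouns), and the function name, tokenization, and example descriptions are illustrative rather than the original implementation.

```python
import numpy as np

# Hypothetical stand-in for the stop-word list; the actual list used in the
# experiments is not given in the paper.
STOP_WORDS = {"a", "an", "and", "but", "he", "it", "its", "or", "she", "the", "they"}

def index_components(descriptions):
    """Build binary keyword vectors (vector space model, cf. Table 1).

    descriptions : dict mapping component name -> full text of its manual entry
    returns      : (keywords, components, matrix) where matrix[j, i] == 1 iff
                   keyword KWj occurs in the description of component SCi
    """
    # crude tokenization: lowercase alphabetic words with stop-words removed
    indexed = {
        name: {w for w in text.lower().split() if w.isalpha() and w not in STOP_WORDS}
        for name, text in descriptions.items()
    }
    keywords = sorted(set().union(*indexed.values()))      # rows KW1 .. KWn
    components = sorted(indexed)                           # columns SC1 .. SCm
    matrix = np.array([[1 if kw in indexed[c] else 0 for c in components]
                       for kw in keywords], dtype=int)
    return keywords, components, matrix

# Illustrative component descriptions, not the exact MS-DOS help texts.
keywords, components, matrix = index_components({
    "copy":  "copies one or more files to another location",
    "xcopy": "copies files and directory trees",
    "mkdir": "creates a directory",
})
```

The columns of the resulting matrix (one per component) would then serve as the input vectors for the ART network or the self-organizing map sketched in Section 2.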

4. Experimental Results

In order to assess the results of the self-organizing processes we compare them in two case studies. In the first one the experimental software library contains a number of commands from the MS-DOS operating system. Each of these commands is represented by a feature vector containing the words that occur in the description of that command as printed by the help command. The whole test set comprises 36 commands, each of which is described by a 37-dimensional feature vector. This is obviously a fairly small test set; the effects of the approaches, however, may be demonstrated very well since the behavior of the commands is well known and thus the results may be assessed easily.

A typical result of structuring operating system commands by using an ART model is depicted in Figure 3. At first sight, the model is successful in identifying commands that operate on directories, i.e. Cat#2, commands that are used to display system information, i.e. Cat#4, commands for editing files, i.e. Cat#6, commands for copying files, i.e. Cat#7, and finally commands handling disk drives, i.e. Cat#3. The remaining category Cat#1, however, contains a rather large number of commands that are more or less related to file manipulation.

Cat#1: append, attrib, comp, dir, fc, find, path, ren, replace, type, undelete, del
Cat#2: chdir, join, mkdir, rmdir, tree
Cat#3: assign, chkdsk, diskcomp, diskcopy, format, mirror, more, recover, unformat
Cat#4: date, mem, time
Cat#5: cls
Cat#6: edit, edlin
Cat#7: backup, copy, restore, xcopy

Fig. 3. Classification of MS-DOS commands (ART)

Figure 4 contains an example of structuring operating system commands by using the self-organizing map. The obvious difference to the ART-based representation is due to the topographic arrangement of the various software components. More precisely, similarity between two components is represented by the geographic distance between the respective entries in the final map. One of the most important findings is that complementary commands, i.e. commands that undo the effects of each other, are represented by closely neighboring units in the final map, as for example the commands del and undelete or backup and restore.

As a second case study we used the NIH class library, i.e. the NIHCL, as an example of a reusable software library. The NIHCL comprises a collection of classes developed in the C++ programming language and includes generally useful data types such as String or Date as well as a number of container classes such as Set or Dictionary. Moreover, the NIHCL provides the facilities to store arbitrarily complex data structures on disk. For more detailed information about the NIHCL we refer to [4]. In this case study the representation of the various classes is obtained by full-text indexing of the respective descriptions as contained in the reference manual of the class library [5]. To give the exact figure, each class is represented by a 489-dimensional feature vector where each feature corresponds to a particular word extracted from the class description. These feature vectors are further used as the input data during the learning process. In order to ease comparison we present the inheritance hierarchy of the NIH classes in Figure 5. Such an inheritance hierarchy is a convenient starting point to get a first impression of the similarity of the various classes.

Fig. 4. Classification of MS-DOS commands (SOM); map layout not reproduced

Fig. 5. NIHCL class hierarchy (inheritance tree of the NIHCL classes; diagram not reproduced)

Cat#1:  Arraychar, ArrayOb, Collection
Cat#2:  Assoc, AssocInt, Dictionary, Iterator, KeySortCltn, LookupKey, Object, Set
Cat#3:  Bag
Cat#4:  Bitset
Cat#5:  Class
Cat#6:  Date, Time
Cat#7:  Exception, Nil, ReadFromTbl, StoreOnTbl
Cat#8:  FDSet, Link
Cat#9:  Float, Integer, Random
Cat#10: Heap
Cat#11: IdentDict, IdentSet
Cat#12: LinkOb
Cat#13: OIOifd, OIOin, OIOistream, OIOnihin, OIOnihout, OIOofd, OIOostream, OIOout
Cat#14: Point, Range, Vector
Cat#15: Regex
Cat#16: LinkedList, OrderedCltn, SeqCltn, SortedCltn, Stack
Cat#17: String

Fig. 6. Classification of NIHCL (ART)

Fig. 7. Classification of NIHCL (SOM); map layout not reproduced

In Figure 6 we present a result of training an ART model with the various class descriptions. We may notice that the ART model assigns a rather large number of classes to singleton categories. Some of them, however, may easily be identified as container classes, i.e. Cat#3 and Cat#10, and should therefore rather be classified together with the classes in Cat#16. Other classes, although being closely related, are isolated from each other, such as Cat#15 and Cat#17. A rather successful classification is achieved with the I/O classes, which are mostly contained in Cat#13. Their superclasses, however, are assigned to Cat#7. As another example consider Cat#2. Most of the classes contained in that category may be referred to as container classes allowing keyed access to their elements, together with the classes that enable such keyed access. In this sense, however, the classes Iterator, Object, and Set are highly unrelated. Moreover, the importance of Iterator for all of the container classes is not uncovered by the ART model.

With the self-organizing map as depicted in Figure 7 the relationship between the various classes may be seen more intuitively. For example, all I/O classes are grouped together in the upper left part of the final map. Within this cluster, the arrangement of the classes mirrors their behavior in the sense that a class performing some input operation, i.e. a class designated by an 'OIOi*' prefix in its name, is stored neighboring its output counterpart, designated by an 'OIOo*' prefix. As an example consider the classes OIOifd and OIOofd. Contrary to the ART result presented above, the class Iterator is represented by a unit in close vicinity to the container classes. Furthermore, the container classes allowing keyed access to their elements, i.e. Dictionary, IdentDict, and KeySortCltn, together with the classes implementing this keyed access, i.e. Assoc, AssocInt, and LookupKey, are assigned to neighboring units in the left part of the map. Thus, their inherent semantic similarity is more readily revealed to the user of such a software library.

5. Conclusion

In this paper we have demonstrated the applicability of unsupervised artificial neural networks to the task of software library organization. Both the ART model and the self-organizing map were capable of uncovering semantic similarities of the software components based on a feature vector representation of the components. The feature vectors were obtained by indexing the textual descriptions of the various software components. The results achieved with the self-organizing map reflect the similarity of the software components more faithfully, yet at the expense of a longer training time.

Acknowledgments

Thanks are due to Werner Kasser, who provided the implementation of the ART prototype used to perform the experiments.

References

[1] T. J. Biggerstaff and A. J. Perlis (Eds.). "Software Reusability. Vol. I: Concepts and Models. Vol. II: Applications and Experience." Addison-Wesley, 1989.
[2] G. A. Carpenter and S. Grossberg. "The ART of Adaptive Pattern Recognition by a Self-Organizing Neural Network." IEEE Computer 21(3), 1988.
[3] W. B. Frakes and S. Isoda. "Success Factors of Systematic Reuse." IEEE Software, Sept. 1994.
[4] K. E. Gorlen, S. Orlow, and P. Plexico. "Data Abstraction and Object-Oriented Programming in C++." John Wiley & Sons, 1990.
[5] K. E. Gorlen. "NIH Class Library Reference Manual (Revision 3.10)." National Institutes of Health, Bethesda, MD, 1990.
[6] C. W. Krueger. "Software Reuse." ACM Computing Surveys 24(2), 1992.
[7] T. Kohonen. "The Self-Organizing Map." Proceedings of the IEEE 78(9), 1990.
[8] Y. S. Maarek, D. M. Berry, and G. E. Kaiser. "An Information Retrieval Approach for Automatically Constructing Software Libraries." IEEE Transactions on Software Engineering 17(8), 1991.
[9] D. Merkl, A M. Tjoa, and G. Kappel. "Application of Self-Organizing Feature Maps with Lateral Inhibition to Structure a Library of Reusable Software Components." Proc. IEEE Int'l Conference on Neural Networks, Orlando, 1994.
[10] D. Merkl, A M. Tjoa, and G. Kappel. "Learning the Semantic Similarity of Reusable Software Components." Proc. of the 3rd Int'l Conference on Software Reuse, Rio de Janeiro, IEEE CS Press, 1994.
[11] D. Merkl. "A Connectionist View on Document Classification." Proc. 6th Australasian Database Conference, Adelaide, 1995.
[12] E. Ostertag, J. Hendler, R. Prieto-Díaz, and C. Braun. "Computing Similarity in a Reuse Library." ACM Transactions on Software Engineering and Methodology 1(3), 1992.
[13] W. Schäfer, R. Prieto-Díaz, and M. Matsumoto (Eds.). "Software Reusability." Ellis Horwood, 1994.
[14] H. R. Turtle and W. B. Croft. "A Comparison of Text Retrieval Models." The Computer Journal 35(3), 1992.
