Recovering Code to Documentation Links in OO Systems

G. Antoniol, G. Canfora, A. De Lucia
University of Sannio, Faculty of Engineering
Palazzo Bosco Lucarelli, Piazza Roma, I-82100 Benevento, Italy
[email protected]
[email protected]

E. Merlo
Dep. of Electrical and Computer Engineering, Ecole Polytechnique
C.P. 6079, Succ. Centre Ville, Montreal, Quebec, Canada
[email protected]

Abstract

Software system documentation is almost always expressed informally, in natural language and free text. Examples include requirement specifications, design documents, manual pages, system development journals, error logs and related maintenance reports. We propose an approach to establish and maintain traceability links between the source code and free text documents. A premise of our work is that programmers use meaningful names for program items, such as functions, variables, types, classes, and methods. We believe that the application-domain knowledge that programmers process when writing the code is often captured by the mnemonics for identifiers; therefore, the analysis of these mnemonics can help to associate high level concepts with program concepts, and vice-versa. In this paper, the approach is applied to software written in an object-oriented language, namely C++, to trace classes to manual sections.

Keywords: traceability, versions compliance check, object orientation

1. Introduction

Most of the documents that accompany a software system are natural language documents, informally expressed. Examples include requirement specifications, design documents, manual pages, system development journals, error logs and related maintenance reports. Even when (semi-)formal models are applied, free text is largely used either to add semantics and context information in the form of comments, or to ease the understanding of the formal models by non-technical readers. As a matter of fact, diagrammatic representations are often supplemented with free text descriptions that convey information not captured by the diagrams themselves, and a Z [32] specification document is typically an amalgam of mathematics, which is precise and supports reasoning, and explanatory text that makes the document an effective means of communication.

Establishing traceability links between the free text documentation associated with the development and maintenance cycle of a software system and its source code can be helpful in a number of tasks. A few notable examples are:



Program comprehension; existing cognition models share the idea that program comprehension occurs in a bottom-up manner, a top-down manner, or some combination of the two [30, 6, 37, 26]. Tracing areas of code to related sections of an application domain handbook, to a set of design documents, or to manual pages supports both top-down and bottom-up comprehension. In top-down comprehension it provides hints on where to look for beacons that either confirm or confute a hypothesis, whereas in bottom-up comprehension the main support is in the assignment of a concept to a chunk of code.

Maintenance; as the software industry matures, companies have built up a sizable number of legacy systems. A legacy system is an old system that is valuable for the corporation that owns, and often developed, it. For the purpose of maintaining legacy systems, design recovery, as defined in [11], may be performed; it requires different sources of information, such as source code, design documentation, personal experience, and general knowledge about the problem and application domains [4, 27]. Central to design recovery is representation [36], for which different schemes have been used and described in [34] and [5]. Traceability links between code and other sources of information are a sensible help in performing the combined analysis of heterogeneous information and, ultimately, in associating domain concepts with code fragments and vice-versa.







Requirement tracing; traceability links between the requirement specification document and the code are a key to locating the areas of code that contribute to implementing specific user functionality [31, 12, 33]. This is helpful to assess the completeness of an implementation with respect to stated requirements, to devise complete and comprehensive test cases, and to infer requirement coverage from structure coverage during testing. Traceability links between requirements and code can also help to identify the code areas directly affected by a maintenance request as stated by an end-user. Finally, they are useful during code inspection, to provide inspectors with clues about the ultimate goals of the code at hand, and in quality assessment, for example to single out loosely-coupled areas of code that implement heterogeneous requirements.

Impact analysis; a major goal of impact analysis is the identification of the work products affected by a proposed change [2]. Changes may initially be made to any of the documents that comprise a system, or to the code itself, and then have to be propagated to other work products [19]. As an example, enhancing an existing system by adding new functions is in most cases initiated at the requirement specification level, and changes are then propagated through design documents down to the source code. Conversely, a change of an algorithm or a data structure may start at the code level and is then documented in the relevant sections of the design documentation. Hence the need for means of establishing traceability links between code and free text documentation.

Reuse of existing software; deriving reusable assets from existing software systems has emerged as a winning approach to promote the practice of reuse in industry [9, 10]. Means to trace code to free text documents are a sensible help in locating reuse-candidate components. Indeed, since existing software has not been produced with reuse in mind, the knowledge about its functionality and the application domain concepts it implements is spread over several places, including requirement specification documents, manual pages, design documents, and the code itself. In addition, traceability links between the code and an application domain handbook are helpful to index the reusable assets

according to an application domain model and to implement querying facilities that retrieve them for potential reuse based on a user request. More generally, means to trace source code to free text documents are useful whenever bottom-up structured analysis needs to be combined with top-down or model-driven analysis [8, 3, 18].

We propose an approach to establish and maintain traceability links between the source code and free text documents. A premise of our work is that programmers use meaningful names for program items, such as functions, variables, types, classes, and methods. We believe that the application-domain knowledge that programmers process when writing the code is often captured by the mnemonics for identifiers; therefore, the analysis of these mnemonics can help to associate high level concepts with program concepts, and vice-versa.

Our approach to building traceability links exploits the idea of a language model, i.e. a stochastic model that assigns a probability value to every string of words taken from a prescribed vocabulary [28]. We use the documents to estimate the language models, one for each document or identifiable section, and apply Bayesian classification [28] to score the sequence of mnemonics extracted from a selected area of code against the language models. A high score indicates a high probability that a particular sequence of mnemonics derives from the document, or document's section, that generated the language model: this suggests the existence of a semantic link between the document and the area of code from which the particular sequence of mnemonics is extracted.

In this paper, the approach is applied to software written in an object-oriented language, namely C++, to trace classes to manual sections. We use a code intermediate representation that encompasses the essentials of the class diagram in an object-oriented design description language, the Abstract Object Language (AOL) [1, 16].
Our process estimates a language model from each manual section, extracts an AOL system representation from the source code, applies simple transformation rules to the AOL textual representation to alleviate the effects of coding styles and conventions (for example, mnemonic concatenation and abbreviations), and scores the transformed AOL class representations against the language models. As a result, each class is bound to the manual pages where the concepts it implements are described. Support tools have been developed to extract and transform the AOL class representations and to compute the scores by applying Bayesian classification. We adapted the CMU public domain language model tool suite [35] to estimate the language models. We have applied the process to a public domain C++ class library, the LEDA library (Library of Efficient Data types and Algorithms), for which both the source code and the manual pages can be obtained from the net.1

[Figure: the C++ source code (version Vi) undergoes Code2AOL Translation, producing AOL specifications; AOL class descriptions undergo Text Transformation and Bayesian Classification against Language Models obtained from the C++ documentation (version Vj) by Stochastic Language Model Estimation, yielding Scored Models.]

Figure 1. Traceability Recovery Process.

The remainder of the paper is organized as follows: Section 2 introduces our traceability link recovery process and discusses its key issues. Tools supporting the process are illustrated in Section 3, while Section 4 presents the results of the LEDA case study. Related research work is discussed in Section 5. Finally, in Section 6 we draw some preliminary conclusions and indicate directions for further research.

2. Traceability Process

Stochastic LMs are used extensively in several areas: automatic speech recognition, machine translation, spelling correction, text compression, etc. [7, 22, 39]. The framework of stochastic LMs can be well represented by an information theoretical model. A sequence of words (a document) Wi, generated by a source S with probability Pr(Wi), is transmitted through a channel and transformed into a sequence of observations Y with probability Pr(Y | Wi). For instance, Y could represent the acoustic signal produced by uttering Wi, the translation of Wi from Italian to English, a typewritten version of Wi with possible mis-spellings, or the mnemonics for code identifiers chosen by developers on the basis of a given document page. The problem is decoding the observation Y, e.g., code extracted mnemonics, into the original document page/section Wi: that is, finding the Ŵi that maximizes the a-posteriori probability Pr(Wi | Y). Applying Bayes' rule, we obtain the following identity:

Pr(Wi | Y) = Pr(Y | Wi) Pr(Wi) / Pr(Y)    (1)

To further simplify the model, we assume that the source emits the documents Wi, i = 1, …, N, with probability 1/N; furthermore, we assume that the observations Y = y1, …, ym are defined upon the same vocabulary V as the Wi.

1 http://www.mpi-sb.mpg.de/LEDA/

By eliminating the constant factors Pr(Y) and Pr(Wi), decoding is equivalent to finding:

Ŵi = arg max_{Wi} Pr(Y | Wi)    (2)

Usually, the aim of the LM is to supply the decoding algorithm with the probability of the emitted sentence, or a score for it, for every word sequence. In our study, each document is described by a stochastic LM and the document selection process is accomplished by maximizing the quantity Pr(Y | Wi).
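As a concrete illustration of this decoding step, the sketch below ranks a handful of toy "manual sections" by Pr(Y | Wi) using unigram models. The section names, word counts, and floor probability are invented for the example; they are not taken from the paper or from LEDA, and the floor is only a crude stand-in for the discounting strategies discussed later.

```python
import math

# Hypothetical word-frequency table per manual section (illustrative only).
section_unigrams = {
    "list.man":  {"list": 10, "append": 4, "head": 3, "empty": 2},
    "graph.man": {"graph": 9, "node": 5, "edge": 5, "adjacent": 2},
}

def log_prob(tokens, counts, floor=1e-6):
    """Score a token sequence against a unigram model, log Pr(Y | Wi).
    Unseen words receive a small floor probability."""
    total = sum(counts.values())
    return sum(math.log(counts.get(t, 0) / total or floor) for t in tokens)

def decode(tokens):
    """Return sections ordered by decreasing Pr(Y | Wi), as in equation 2."""
    scored = {s: log_prob(tokens, c) for s, c in section_unigrams.items()}
    return sorted(scored, key=scored.get, reverse=True)

# Mnemonics extracted from a hypothetical class interface:
print(decode(["graph", "node", "edge"]))  # graph.man ranks first
```

Taking the first element of the ranked list realizes equation 2; keeping the whole list anticipates the many-to-many mapping discussed in Section 2.6.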

2.1 Process Overview

The whole traceability recovery process is represented in Fig. 1. The process consists of the following activities:

1. Stochastic Language Model Estimation: for any relevant document, or document's section, a Language Model (LM) is estimated;

2. AOL Representation Extraction: in this phase an AOL system representation is recovered from the code through a Code2AOL extractor;

3. Text Transformation: AOL class textual descriptions undergo simple transformations to decrease the test set entropy of the LMs;

4. Bayesian Classification: by means of a Bayesian classifier, a mapping between AOL class descriptions and documents is recovered.

In the following subsections we will highlight the key issues of these process steps and the related implications.

2.2 Stochastic Language Model Estimation

In this phase, the conditioned probabilities needed to compute equation 2 are estimated from a text collection, in our case study a manual section. Let Y = y1 … ym be a sequence of observations (i.e., mnemonics for class item identifiers); the probability Pr(Y | Wi), for any given document Wi (manual section), can be decomposed as

Pr(Y) = Pr(y1) ∏_{t=2}^{m} Pr(yt | y1 … y_{t-1})

where Wi is thought of as fixed and dropped from the equation. When m increases, the conditioned probabilities involved in the above product quickly become difficult to estimate. A simplification can be introduced by conditioning the dependence of each word (regardless of t) on the last n − 1 words:

Pr(Y) ≈ Pr(y1 … y_{n-1}) ∏_{t=n}^{m} Pr(yt | y_{t-n+1} … y_{t-1})

This n-gram approximation, which formally assumes a time-invariant Markov process [14], greatly reduces the statistics to be collected in order to compute Pr(Y), clearly at the expense of precision. However, trainable n-gram models must cope with two important problems: the estimation of a large number of parameters and the data sparseness of texts. Even a 2-gram (bigram) model may require a large amount of data (texts) to reliably estimate a large number of parameters. An important aspect that makes n-gram estimation a difficult task is the inherent data sparseness of real texts. Experimentally, any word sequence is a rare event, as it generally occurs only a few times, if ever, even in very large text collections (corpora). Thus, in describing software documents a compromise may be required between the amount of textual data available, quite often very scarce, and the number of parameters to be estimated.
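The maximum-likelihood n-gram estimates above, and the sparseness problem they run into, can be sketched in a few lines. This is a generic illustration over an invented eight-word corpus, not the CMU toolkit used by the authors:

```python
from collections import Counter

# Tiny invented training text; real manuals are far larger but still sparse.
text = "the list stores the items in the list".split()

unigrams = Counter(text)
bigrams = Counter(zip(text, text[1:]))

def bigram_ml(prev, word):
    """Maximum-likelihood estimate Pr(word | prev) = c(prev word) / c(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_ml("the", "list"))    # 2/3: "the" is followed by "list" twice out of three
print(bigram_ml("the", "stores"))  # 0.0: never observed, the zero-frequency problem
```

The second call returning zero probability for a perfectly plausible pair is exactly the motivation for the discounting strategies of the next subsection.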

2.3 The “zero-frequency” Problem

Consider our case study, the LEDA library, release 3.4: our goal is to establish links between classes and manual pages. Manual pages contain about 2300 different words and/or sequences of characters obtained by concatenating two or more words (e.g. DELAUNAYDIAGRAM, alledges, configureevent, etc.). To estimate an n-gram language model, all possible V^n word sequences must be considered: for any given word yt in the vocabulary V we need to construct and train all the possible contexts y_{t-n+1} … y_{t-1} (y_{t-k} ∈ V, k = 1, …, n−1); i.e., for a trigram language model about 2·10^9 parameters must be estimated. Such a huge number of parameters requires a tremendous amount of text to reliably estimate the probabilities Pr(yt | y_{t-n+1} … y_{t-1}). Furthermore, even if much more material than actually available could be exploited, most of the time a word yt will never appear in a given context y_{t-n+1} … y_{t-1}, either due

to sparseness of the data or simply because it does not make sense. Given a context, also known as a history, h = y_{t-n+1} … y_{t-1}, assumed fixed for the sake of simplicity, and a word w ∈ V, it can be shown that the maximum likelihood estimate of Pr(w | h) is:

Pr(w | h) ≈ fr(w | h) = c(hw) / c(h)

where c(h) is the number of times the context h was found in the training material (i.e., c(h) = Σ_{t=n}^{N} δ(h_t = h), with δ(·) the Kronecker operator). By definition, c(h) = 0 implies fr(w | h) = 0. To overcome the “zero-frequency” problem, discounted conditional frequencies fr*(w | h) ∈ [0, fr(w | h)], for all hw ∈ V^n, are used instead of the frequencies fr(w | h). The “zero-frequency” probability:

λ(h) = 1.0 − Σ_{w ∈ V} fr*(w | h)

is then redistributed among the set of words never observed in the context h. Discounting and redistribution are usually combined according to two main strategies: backing-off and interpolation. In the interpolation strategy two approximations are combined:

Pr(w | h) ≈ fr*(w | h) + λ(h) Pr(w | h′)

where h′ denotes a less specific context. The backing-off strategy smoothes n-gram probabilities by selecting the most significant available approximation. One of the simplest discounting strategies is floor discounting, which simply adds 1 to all never-seen events; thus, given a vocabulary V and a training text w_1, …, w_N, the unigram with back-off smoothing method gives:

Pr(z) = (c(z) + 1) / (c + k)    if c(z) > 0
Pr(z) = 1 / (c + k)             otherwise       (3)

where c = N − 1 and k is the number of unseen events, i.e., words in V not appearing in w_1, …, w_N. The absolute shift method subtracts a small positive quantity β from all c(hw) counters:

fr*(w | h) = max{ (c(hw) − β) / c(h), 0 }

and the zero-frequency probability is:

λ(h) = β · n(h) / c(h)

where n(h) = Σ_{w ∈ V} δ(c(hw) > 0). This method has been adopted for the preliminary experiments presented in this paper since, experimentally, it gives better performance than the floor method and produces a lower number of parameters (taking β = 1, singletons are deleted). More details on the LM estimation can be found in [15, 28].

2.4 AOL Representation Extraction

AOL has been designed to capture OO concepts in a formalism independent of programming languages and tools. However, as with UML [13], AOL allows method signatures to be specified in terms of language specific types (e.g., char, int) and type qualifiers (e.g., pointers * and references &). AOL is a general-purpose design description language, capable of expressing concepts available at the design stage of OO software development. The language resembles other OO design/interface specification languages such as IDL [23, 29] and ODL [24]. More details on AOL can be found in [16, 1].

2.5 Text Transformation

The aim of this step is to cope with two basic aspects: coding standards and programming language characteristics. As far as coding standards are concerned, it is not unlikely that words are concatenated to create more meaningful variable/method/parameter names. These strings may or may not be present in the text documents; in other words, they can be out-of-vocabulary words (i.e., not present in the training materials) or very unlikely words. Furthermore, language specific type qualifiers in the method signatures have to be removed, since the documentation is not expected to contain such language dependent tokens.
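The paper does not list the exact transformation rules, so the sketch below is an assumption of what such a step could look like: a greedy, vocabulary-driven split of concatenated identifiers (the sample vocabulary mirrors the LEDA-style examples cited earlier) plus removal of C++ qualifiers:

```python
import re

# Illustrative vocabulary; in the real process it would come from the manuals.
VOCABULARY = {"delaunay", "diagram", "all", "edges", "configure", "event"}

def split_identifier(ident):
    """Break a concatenated identifier into vocabulary words where possible,
    longest match first, so e.g. 'alledges' -> ['all', 'edges']."""
    ident = ident.lower()
    words, i = [], 0
    while i < len(ident):
        for j in range(len(ident), i, -1):  # try the longest match first
            if ident[i:j] in VOCABULARY:
                words.append(ident[i:j])
                i = j
                break
        else:
            words.append(ident[i:])  # unsplittable remainder kept as-is
            break
    return words

def strip_qualifiers(signature):
    """Drop C++-specific tokens (*, &, const) absent from the manuals."""
    return re.sub(r"[*&]|\bconst\b", " ", signature).split()

print(split_identifier("DELAUNAYDIAGRAM"))  # ['delaunay', 'diagram']
```

A greedy longest-match split can of course mis-segment ambiguous identifiers; any such imprecision simply yields unlikely words, which the stochastic scoring tolerates.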

2.6 Bayesian Classification

In this phase, LMs are used to score the input text, in the current study the mnemonic identifiers of a class interface (attribute and method signatures), against the estimated LMs. The classifier implements equation 2. Theoretically speaking, equation 2 suffices, as it computes a partially injective function. However, it may be over-restrictive in many practical cases. In fact, it associates a single observation (class description) to a single source (manual section). Consider for example the LEDA library, the case study presented in this paper. Lists are implemented by means of an underlying dynamic list structure, and graphs are implemented on top of lists. Nevertheless, manual sections describe in detail only the user visible data structures and do not describe the underlying implementing classes. When constructing a mapping between classes and documents, a many-to-many relation may be more suitable to capture the actual situation. In the current implementation, the classifier produces for each class the list of documents ordered according to Pr(Y | Wi).

3. Tool Support

The process shown in Figure 1 has been completely automated. We used the Code2AOL Extractor tool described in [1] to extract the AOL representation from source code. Extracting information about class relationships from code may entail some degree of imprecision. In fact, given two or more classes and a relation among them, there is an intrinsic ambiguity in classifying the relation, due to the choice left to the programmers implementing an OO design. Associations can be instantiated in C++ by means of pointer data members or by inheritance [20]. Furthermore, aggregation relations could result either from templates (e.g., list), arrays (e.g., Heap a[MAX]), or pointer data members (e.g., Edges = new GraphEdge[MAX]). The Code2AOL Extractor tool recognizes an aggregation if and only if a template, an object array, or an instance of an object is declared as a data member. All the remaining cases, i.e. object pointers and references both as data members and formal parameters of methods, give origin to associations. It is worth noting that the above mentioned imprecision does not affect the overall process; in fact, tracing classes to manual pages requires the names of attributes/methods/parameters, which are safely extracted by the Code2AOL tool, and the knowledge of the existence of some kind of relation between a pair of classes. We developed tools to pre-process text (removing language specific features, formatting symbols, etc.) and to apply the deterministic transformations aimed at reducing the test set entropy of the LMs. We used the CMU tool suite [35] to estimate the stochastic LMs. This tool suite provides utilities to extract dictionaries, word frequencies, and other relevant statistics from textual corpora, and to estimate n-gram LMs.
The tool suite component responsible for estimating the LMs implements the most widely applied discounting strategies, including absolute discounting, Good-Turing discounting, Witten-Bell discounting, and linear discounting (details can be found in [28]). The results reported in this paper were obtained with bigram language models, with a closed vocabulary (i.e., out-of-vocabulary words are discarded), estimated with the absolute discounting strategy. Finally, we implemented a Bayesian classifier that reads a family of LMs and computes, for each LM, the probability Pr(Y | Wi) for the given input text. In the current implementation of the Code2AOL tool, comments are discarded and, as a consequence, the input text is very sparse, as it consists of the mnemonics for code identifiers (classes/attributes/methods/parameters). Accordingly, the probability Pr(Y | Wi) is actually computed as the product of the unigram probabilities:

Pr(Y | Wi) = Pr(y1, …, ym | Wi) = ∏_{t=1}^{m} Pr(yt | Wi)

            Relevant   Irrelevant   Total
Retrieved      81         127        208

Table 2. Relevant and Irrelevant Retrieved Pairs (Best Match).

4. Case Study

We have applied our traceability link recovery approach to a freely available C++ library of foundation classes, called LEDA (Library of Efficient Data types and Algorithms), developed and distributed by Max-Planck-Institut für Informatik, Saarbrücken, Germany (freely available for academic research and teaching from http://www.mpi-sb.mpg.de/LEDA/). The LEDA project started in 1988 and a first version (1.0) was available in 1990. Since then several other versions have been released; the latest version (3.7.1) was released in 1998. In our case study, we analyzed the code and the documentation of LEDA release 3.4, whose main features are summarized in Table 1.

KLOC                95
Classes            208
Methods           4967
Attributes         510
Associations       272
Aggregations       114
Generalizations     98
Manual Sections     88

Table 1. LEDA 3.4 main features.

The results obtained by applying our tool were very encouraging. The number of classes in the LEDA version we analyzed was much greater than the number of manual sections; accordingly, we expected that several classes were not described by any manual section and that some manual sections described more than one class. This was actually the case. In fact, an abstract class and a derived concrete class having mostly the same names for methods and attributes were associated by our tool to the same manual section. This was, for example, the case of classes GRAPH and graph; in total, we discovered 20 such cases. On the other hand, abstract superclasses with several children did not have a corresponding manual section; however, in most cases the top ranked manual sections associated to such classes by our tool were the sections describing their subclasses. Similarly, in many other cases classes without a manual section were associated to the manual sections of classes they were related to through an aggregation or an association. Finally, it is worth noting that 10 out of the 88 manual sections available did not describe LEDA classes, but basic concepts and algorithms.

The proposed approach and tool have been evaluated by using two well known metrics of the information retrieval field [17], related in particular to retrieval effectiveness: recall and precision. Recall is the ratio of relevant documents retrieved for a given query over the number of relevant documents for that query in the “database”. Precision is the ratio of the number of relevant documents retrieved over the total number of documents retrieved. In our case, the relevant documents consist of the subset of possible pairs (C, M), where M is the manual section actually describing the class C. The retrieved documents are the pairs of type (C, M), where M is the manual section that maximizes the probability that class C is described by a given manual section (see Section 2).

            Retrieved   Missed   Total
Relevant        81        17       98

Table 3. Retrieved and Missed Relevant Pairs (Best Match).

The identification of the relevant pairs and the validation of relevant retrieved pairs have been done manually. It is worth noting that the number of retrieved pairs is 208, that is, the total number of classes of the analyzed release of LEDA. The number of relevant pairs is 98 (see footnote 2), while the number of relevant pairs that were retrieved by our tool is 81. Tables 2 and 3 summarize these data. Therefore, the Recall and Precision measures are:

Recall = #(Relevant ∧ Retrieved) / #Relevant = 81/98 = 82.65%

Precision = #(Relevant ∧ Retrieved) / #Retrieved = 81/208 = 38.94%

This means that the misclassification of type I, i.e., the ratio between the number of irrelevant retrieved pairs and the number of retrieved pairs, is 61.06%, while the misclassification of type II, i.e., the ratio between the number of relevant but unretrieved pairs and the number of relevant pairs, is 17.35%.

2 Note that the number of relevant manual sections is 78: this means that in some cases the same manual section is associated with more than one class.
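These figures follow directly from Tables 2 and 3; as a sanity check, the counts reproduce every percentage quoted in this section:

```python
# Counts from Tables 2 and 3 (best match): 208 classes, 98 relevant pairs.
relevant_retrieved = 81
retrieved = 208
relevant = 98

recall = relevant_retrieved / relevant                 # 81/98   = 82.65%
precision = relevant_retrieved / retrieved             # 81/208  = 38.94%
type_i = (retrieved - relevant_retrieved) / retrieved  # 127/208 = 61.06%
type_ii = (relevant - relevant_retrieved) / relevant   # 17/98   = 17.35%

print(f"recall={recall:.2%} precision={precision:.2%}")
```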

[Figure: precision (%) versus the number of top ranked manual pages considered, for ranks 1 to 17.]

Figure 2. Considered top ranked manual pages versus precision.

The low Recall measure can be explained by the fact that 110 out of the 208 classes do not have a corresponding manual section, whereas our tool always produces a pair (C, M). However, it is worth noting that, without tool support, one must analyze all the 88 manual sections to discover that a given class is not described by any manual section. On the other hand, using our tool the analysis can be limited to the n top ranked manual sections with a negligible error. For example, if we decide to limit the manual analysis to the highest scored manual section (n = 1) of each class, the resulting error is given by the ratio between the number of relevant pairs that are missed and the number of retrieved pairs, i.e., 17/208 = 8.17%; however, the effort is reduced by at least one order of magnitude with respect to a manual analysis. It is worth noting that the Precision and Recall measures can be further improved if we consider the n top ranked manual sections, with n > 1. For example, for n = 3 the number of retrieved and relevant pairs is 93, resulting in a Precision of 93/98 = 94.89%, while the Recall is 44.91%. Figure 2 shows the trend of the Precision measure with respect to the number n of top ranked manual sections considered.
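The top-n evaluation just described can be made precise with a small helper. The ranked lists below are invented stand-ins for the classifier's output, not LEDA data:

```python
def precision_at_n(ranked, relevant, n):
    """Fraction of classes whose correct manual section appears among the
    n top-ranked candidates. `ranked` maps each class to its ordered list
    of candidate sections; `relevant` maps each class that has a manual
    section to that section. All names below are illustrative."""
    hits = sum(1 for cls, sec in relevant.items() if sec in ranked[cls][:n])
    return hits / len(relevant)

ranked = {"graph": ["graph.man", "list.man"],
          "list":  ["array.man", "list.man"]}
relevant = {"graph": "graph.man", "list": "list.man"}

assert precision_at_n(ranked, relevant, 1) == 0.5  # only graph hit at rank 1
assert precision_at_n(ranked, relevant, 2) == 1.0  # both hit within rank 2
```

By construction this measure is monotonically non-decreasing in n, which is the trend Figure 2 illustrates.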

5. Related Work

The issue of recovering traceability links between code and free text documentation is, to the authors' knowledge, not yet well understood and investigated. We have found very few related works, mainly in the areas of impact analysis and requirement tracing. Most of the contributions in the area of impact analysis assume the existence of some

form of ripple propagation graph describing relations between software artifacts, including both code and documentation, and focus on the prediction of the effects of a maintenance change request not only on the source code but also on the specification and design documents [38]. There are a number of systems that support requirement and documentation tracing during system development. TOOR [31], IBIS [12], and REMAP [33] are a few examples of software development tools able to build and maintain traceability links among various software artifacts.

Analysis of informal information in the source code (comments and mnemonics for identifiers) can help to associate domain concepts with program fragments and vice-versa. The importance of informal information analysis has been discussed in [4], where an approach based on structures similar to semantic networks has been proposed and the possibility of using some kind of neural networks was addressed. Comments and mnemonics have an information content with an extremely large degree of variance between systems and, often, between different segments of the same system. Furthermore, this informal information is rarely parsable by Natural Language (NL) grammars because it is mostly based on incomplete sentences combined with abbreviations in a way which is not described by NL grammars. The problem presents some analogies with spoken sentence interpretation [21]. Thus, classification methods based on Artificial Neural Networks (ANN) or the stochastic approach exploited in this paper, rather than formal parsers, are more suitable for analyzing this type of information. These methods relate concepts to patterns of word sequences.

For example, ANNs for source code informal information analysis have been investigated in [27], where a connectionist method, which can be used for design recovery in conjunction with more traditional approaches, is proposed for analyzing the informal information (comments and mnemonics) in programs. The proposed approach uses a combination of top down domain analysis (the creation of a concept hierarchy by a domain expert, to be used in the construction of the training set) and a bottom up approach (the analysis of the informal information using the network). In [25], an approach for automatically assembling software libraries based on a free text indexing scheme has also been presented. This approach uses attributes automatically extracted from natural language IBM RISC System/6000 AIX 3 documentation to build a browsing hierarchy which accepts queries expressed in natural language. Indexing is performed by computing Lexical Affinities (LA), which stand for the correlation of common appearance of words in the utterances of the language, and by computing their “resolving power”, defined as a function of the quantity of information of the probability of co-occurrence of words within a sliding window of size +/- five. Classification against a query is performed by ranking documents based on the “resolving power of their LA”. Another indexing approach based on free text analysis is the RSL system [8], which extracts free-text single-term indices from comments in source code files, looking for reuse keywords like “author”, “date created”, etc. REUSE [3] is an information retrieval system which stores software objects as textual documents. CATALOG [18] is an information retrieval system to store and retrieve C components, which are individually characterized by a set of single-term indices automatically extracted from natural language headers of C programs.

6. Conclusions

We have presented an approach to recover traceability links between software written in an object-oriented language, namely C++, and manual sections. The approach relies on a probabilistic mapping to construct a many-to-many relation between classes and manual sections. Our process estimates a language model from each manual section, extracts an AOL system representation from the source code, applies simple transformation rules to the AOL textual representation to alleviate the effects of coding styles and conventions (for example, mnemonic concatenation and abbreviations), and scores the transformed AOL representations against the language models. As a result, each class is bound to the list of manual sections ordered by the computed probability.

Preliminary experimental results on the LEDA library clearly demonstrate the feasibility of the approach. Furthermore, the quantitative evaluations are very promising: the top five candidates almost always contain the correct classification, while considering the top three candidates yields a precision of 83%. Future work will be devoted to extending the investigation to a larger data set and to extending the current approach with a probabilistic modeling of the word transformations applied by programmers.
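The scoring step of the process above can be illustrated with a minimal sketch. This is not the paper's exact stochastic model: it assumes unigram language models with add-one smoothing, a simple camel-case/underscore splitting rule in place of the full AOL transformation rules, and invented function names and toy section texts.

```python
import math
import re
from collections import Counter

def split_identifier(name):
    """Split a mnemonic like 'GraphNode' or 'add_edge' into lowercase words."""
    parts = re.sub(r'([a-z0-9])([A-Z])', r'\1 \2', name).replace('_', ' ').split()
    return [p.lower() for p in parts]

class SectionModel:
    """Unigram language model of one manual section, add-one smoothed."""
    def __init__(self, text, vocab_size=10000):
        self.counts = Counter(text.lower().split())
        self.total = sum(self.counts.values())
        self.vocab_size = vocab_size

    def log_prob(self, words):
        return sum(
            math.log((self.counts[w] + 1) / (self.total + self.vocab_size))
            for w in words
        )

def rank_sections(class_identifiers, section_models):
    """Score a class's identifier words against every section model."""
    words = [w for ident in class_identifiers for w in split_identifier(ident)]
    scored = [(name, m.log_prob(words)) for name, m in section_models.items()]
    return sorted(scored, key=lambda kv: -kv[1])

# Toy manual sections (invented text) and a class with two identifiers.
sections = {
    "graph": SectionModel("a graph consists of nodes and edges"),
    "string": SectionModel("a string is a sequence of characters"),
}
print(rank_sections(["GraphNode", "add_edge"], sections))
```

Each class is thus bound to the list of sections ordered by the log-probability its identifier words receive under each section's model, which is the many-to-many ranking the approach produces.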

References

[1] G. Antoniol, R. Fiutem, and L. Cristoforetti. Using metrics to identify design patterns in object-oriented software. In Proc. of the Fifth International Symposium on Software Metrics - METRICS98, pages 23-34, Nov 2-5 1998.
[2] R. S. Arnold and S. A. Bohner. Impact analysis - towards a framework for comparison. In Proceedings of the International Conference on Software Maintenance, pages 292-301, Montreal, Quebec, Canada. IEEE Computer Society Press, 1993.
[3] S. P. Arnold and S. L. Stepowey. The reuse system: Cataloging and retrieval of reusable software. In W. Tracz, editor, Software Reuse: Emerging Technology. IEEE Computer Society Press, 1987.
[4] T. Biggerstaff. Design recovery for maintenance and reuse. IEEE Computer, Jul 1989.
[5] T. Biggerstaff, B. Mitbander, and D. Webster. The concept assignment problem in program understanding. In Proceedings of the International Conference on Software Engineering, pages 482-498. IEEE Computer Society Press, May 1993.
[6] R. Brooks. Towards a theory of the comprehension of computer programs. International Journal of Man-Machine Studies, 18:543-554, 1983.
[7] P. F. Brown, S. F. Chen, S. A. D. Pietra, V. J. D. Pietra, A. S. Kehler, and R. L. Mercer. Automatic speech recognition in machine-aided translation. Computer Speech and Language, 8:177-187, August 1994.
[8] B. A. Burton, R. W. Aragon, S. A. Bailey, K. Koelher, and L. A. Mayes. The reusable software library. In W. Tracz, editor, Software Reuse: Emerging Technology, pages 129-137. IEEE Computer Society Press, 1987.
[9] G. Caldiera and V. R. Basili. Identifying and qualifying reusable software components. IEEE Computer, pages 61-70, 1991.
[10] G. Canfora, A. Cimitile, and M. Munro. RE2: Reverse engineering and reuse re-engineering. Journal of Software Maintenance - Research and Practice, 6:53-72, 1994.
[11] E. Chikofsky and J. H. Cross II. Reverse engineering and design recovery: A taxonomy. IEEE Software, 7(1):13-17, Jan 1990.
[12] J. Conklin and M. L. Begeman. gIBIS: a hypertext tool for exploratory policy discussion. ACM Transactions on Office Information Systems, pages 303-331, April 1988.
[13] Rational Software Corporation. Unified Modeling Language, Version 1.0. 1997.
[14] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley Series in Telecommunications, John Wiley & Sons, New York, NY, 1992.

[15] M. Federico, M. Cettolo, F. Brugnara, and G. Antoniol. Language modelling for efficient beam-search. Computer Speech and Language, 9:353-379, 1995.
[16] R. Fiutem and G. Antoniol. Identifying design-code inconsistencies in object-oriented software: A case study. In Proceedings of the International Conference on Software Maintenance, pages 94-102, Bethesda, Maryland. IEEE Computer Society Press, November 1998.
[17] W. B. Frakes and R. Baeza-Yates. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Englewood Cliffs, NJ, 1992.
[18] W. B. Frakes and B. A. Nejmeh. Software reuse through information retrieval. In Proceedings of the 20th Annual HICSS, Kona, HI, pages 530-535, January 1987.
[19] M. J. Fyson and C. Boldyreff. Using application understanding to support impact analysis. Journal of Software Maintenance - Research and Practice, 10:93-110, 1998.
[20] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, Reading, MA, 1995.
[21] R. Kuhn and R. D. Mori. Learning speech semantics with keyword classification trees. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Computer Society Press, April 1993.
[22] K. Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys, 24(4):377-439, 1992.
[23] D. A. Lamb. IDL: Sharing intermediate representations. ACM Transactions on Programming Languages and Systems, 9(3):297-318, July 1987.
[24] D. Lea and C. K. Shank. ODL: Language report. Technical Report Draft 5, Rochester Institute of Technology, Nov 1994.
[25] Y. Maarek, D. Berry, and G. Kaiser. An information retrieval approach for automatically constructing software libraries. IEEE Transactions on Software Engineering, 17(8):800-813, August 1991.
[26] A. V. Mayrhauser and A. M. Vans. Identification of dynamic comprehension processes during large scale maintenance. IEEE Transactions on Software Engineering, 22(6):424-437, 1996.
[27] E. Merlo, I. McAdam, and R. D. Mori. Source code informal information analysis using connectionist models. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 1339-1344, Chambery, France, Sept 1993.
[28] R. D. Mori. Spoken Dialogues with Computers. Academic Press, Orlando, FL, 1998.
[29] OMG. The Common Object Request Broker: Architecture and Specification. OMG Document 91.12.1, OMG, December 1991.
[30] N. Pennington. Stimulus structures and mental representations in expert comprehension of computer programs. Cognitive Psychology, 19:295-341, 1987.
[31] F. A. C. Pinheiro and J. A. Goguen. An object-oriented tool for tracing requirements. IEEE Software, pages 52-64, March 1996.
[32] B. Potter, J. Sinclair, and D. Till. An Introduction to Formal Specification and Z. Prentice-Hall, Englewood Cliffs, NJ, 1991.

[33] B. Ramesh and V. Dhar. Supporting systems development using knowledge captured during requirements engineering. IEEE Transactions on Software Engineering, pages 498-510, June 1992.
[34] C. Rich and R. Waters. The Programmer's Apprentice. Addison-Wesley, Reading, MA, 1990.
[35] R. Rosenfeld. Adaptive Statistical Language Modeling: A Maximum Entropy Approach. PhD thesis, School of Computer Science, Carnegie Mellon University, Apr 1994.
[36] S. Rugaber and R. Clayton. The representation problem in reverse engineering. In Proceedings of the Working Conference on Reverse Engineering, pages 8-16. IEEE Computer Society Press, 1993.
[37] E. Soloway and K. Ehrlich. Empirical studies of programming knowledge. IEEE Transactions on Software Engineering, SE-10(5):595-609, 1984.
[38] R. J. Turver and M. Munro. An early impact analysis technique for software maintenance. Journal of Software Maintenance - Research and Practice, 6:35-52, 1994.
[39] I. H. Witten and T. C. Bell. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Trans. Inform. Theory, IT-37(4):1085-1094, 1991.
