Tracing Object-Oriented Code into Functional Requirements
G. Antoniol, G. Canfora, G. Casazza, A. De Lucia, E. Merlo
{gerardo.canfora, gec, delucia}@unisannio.it
University of Sannio, Faculty of Engineering - Piazza Roma, I-82100 Benevento, Italy
University of Naples "Federico II", DIS - Via Claudio 21, I-80125 Naples, Italy
Department of Electrical and Computer Engineering, Ecole Polytechnique, C.P. 6079, Succ. Centre Ville, Montreal, Quebec, Canada
Abstract
Software system documentation is almost always expressed informally, in natural language and free text. Examples include requirement specifications, design documents, manual pages, system development journals, error logs and related maintenance reports. We propose an approach to establish and maintain traceability links between source code and free text documents. A premise of our work is that programmers use meaningful names for program items, such as functions, variables, types, classes, and methods. We believe that the application-domain knowledge that programmers process when writing the code is often captured by the mnemonics for identifiers; therefore, the analysis of these mnemonics can help to associate high level concepts with program concepts, and vice-versa. In this paper, the approach is applied to software written in an object-oriented language, namely Java, to trace classes to functional requirements.
Keywords: redocumentation, traceability, program comprehension, object orientation
1. Introduction
The research reported in this paper addresses the problem of establishing traceability links between the free text documentation associated with the development and maintenance cycle of a software system and its source code. These links help program comprehension in several ways. Existing cognition models share the idea that program comprehension can occur in a bottom-up manner [19], [18], a top-down manner [4], [23], or some combination of the two [11], [13], [14], [15]. They also agree that programmers use different types of knowledge during program comprehension, ranging from domain specific knowledge to general programming knowledge [4], [22], [25]. Traceability links between areas of code and related sections of free text documents, such as an application domain handbook, a specification document, a set of design documents, or manual pages, aid both top-down and bottom-up comprehension. In top-down comprehension, once a hypothesis has been formulated, the traceability links provide hints on where to look for beacons that either confirm or refute it. In bottom-up comprehension the main role of the traceability links is to assist programmers in the assignment of a concept to a chunk of code and in the aggregation of chunks into hierarchies of concepts. Traceability links between code and other sources of information are also a valuable help in performing the combined analysis of heterogeneous information and, ultimately, in constructing a mental model of the software under consideration.

At WCRE'99 [1] we presented a method to establish and maintain traceability links between code and free text documents. A premise of the method is that developers use meaningful names for program items, in particular classes, variables, methods, and method parameters. The underlying assumption is that the application-domain knowledge programmers process when writing the code is captured by the mnemonics for identifiers. The method exploits probabilistic information retrieval techniques to estimate a language model for each document or document section and applies Bayesian classification to score the sequence of mnemonics extracted from a selected area of code against the language models. Higher scores suggest the existence of links between the area of code from which a particular sequence of mnemonics is extracted and the document that generated the language model.
The paper presented the application of the method to establish traceability links between C++ source classes and manual pages and discussed the results of a case study on a C++ class library, namely LEDA (Library of Efficient Data types and Algorithms), for which both the source code and the manual pages can be obtained from the net. One of the WCRE'99 attendees commented that, since the LEDA manual pages are derived from the code, they often contain the exact names of classes, variables, methods, and parameters. He hypothesized that this circumstance could explain the encouraging results obtained and suggested the need for further trials. He also suggested applying the method to systems, rather than libraries, as the latter tend to be better documented.

We have applied the method to a hotel management system, named Albergate, to recover traceability links between the functional requirements, as specified in the requirement document, and the Java source code. The goal was to bind each functional requirement to the classes where it is implemented. This paper discusses the results of this further case study, which were initially not as encouraging as in the LEDA case study, and the analysis we made to identify the reasons for the poor performance. We identified affecting factors essentially related to the text processing rules used to extract the sequences of mnemonics from the code and to prepare the documents for language model estimation. The paper shows how the traceability link recovery method has been improved according to these findings. The improvements essentially concern a transformation step to split the mnemonics for code identifiers into their component words, if any, and a normalization step applied to both the document text and the sequences of transformed mnemonics. Text normalization involves the transformation of all capital letters into lower case letters, the introduction of a list of stop words, and the application of morphological analysis. These simple text-processing steps have considerably improved the results of the case study, which now confirm and strengthen the promises shown in [1].

The remainder of the paper is organized as follows. Section 2 introduces the traceability link recovery method and discusses its key issues. The set of tools we have developed to implement the method is also described in Section 2, while Section 3 introduces the Albergate case study and discusses its results. Related research work is addressed in Section 4. Finally, Section 5 summarizes the work and gives some concluding remarks.

2. Traceability Process

Unlike most reverse engineering problems, recovering traceability links between free text documentation and source code components cannot be simply based on techniques derived from the compilers' field, because of the difficulty of applying syntactic analysis to natural language sentences. Our approach exploits techniques widely used in other areas, such as information retrieval, information theory, and speech recognition. Indeed, the problem of identifying the documents related to a given source code component can be seen as a typical information retrieval task. Information retrieval is mainly concerned with the retrieval of unstructured information [7]. Such information might include text, images, audio, etc.; we restrict our focus to text-based information retrieval systems. Typically, an information retrieval system operates with elements (in our case free text documents containing the system's functional requirements) which are inherently unstructured. The set of all documents is known as the document space. The system accesses the document space to select a particular set of documents relevant to a particular user query, according to a given retrieval algorithm. In our case the user query is a representation of the source code component. Documents in the document space are indexed by a set of features; in a text-based system features could include words, phrases, or manually controlled vocabulary items. A widely used approach to retrieve documents from the document space is ranked retrieval, which returns a ranked list of documents. Ranked retrieval models take advantage of the fact that some features are better discriminators than others, due to their occurrence statistics. There are two widely used classes of ranked retrieval models: probabilistic models and vector space models [9]. The probabilistic models follow the probability ranking principle: documents are ranked according to the probability of being relevant to the user query, computed on a statistical basis. On the other hand, the vector space models treat documents and queries as vectors of weights in an $n$-dimensional space, where $n$ is the number of indexing features. The document vectors are ranked according to some distance function with respect to the query vector. Several vector space models use the cosine similarity, or some variant of it, as distance function. In our study we have used a probabilistic model, whose details are presented in the next subsection.
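For illustration, the following sketch shows how a vector space model of the kind mentioned above could rank documents against a query, using cosine similarity over raw term-frequency weights. It is a minimal example of the alternative family of models, with names and weighting chosen by us; it is not the model adopted in this paper.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = sqrt(sum(w * w for w in a.values()))
    nb = sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_by_cosine(query_words, documents):
    """Rank documents (name -> list of words) against a query by cosine similarity."""
    q = Counter(query_words)  # raw term-frequency weights
    scores = {name: cosine(q, Counter(words)) for name, words in documents.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```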
2.1 Probabilistic Model
This model attempts to formalize the process of ranked retrieval by interpreting the ranking score as the probability $P(D_j \mid Q_i)$ that a system's functional requirement $D_j$ is concerned with the source code component $Q_i$. Applying Bayes' rule, this conditional probability can be transformed into:

$$P(D_j \mid Q_i) = \frac{P(Q_i \mid D_j) \, P(D_j)}{P(Q_i)} \qquad (1)$$

For a given source code component $Q_i$, $P(Q_i)$ is a constant, and we can further simplify the model by assuming that all the system's functional requirements have the same prior probability. Therefore, for a given source code component $Q_i$, all documents are ranked by the conditional probability $P(Q_i \mid D_j)$.

[Figure 1. Traceability Recovery Process. Source code components go through Identifier Extraction, Identifier Separation, and Text Normalization to produce a query; software documents go through Text Normalization and Language Model Estimation; a Bayesian Classifier scores the query against each language model and produces a scored document list.]

In our experiment these probabilities are computed by estimating a stochastic language model [16] for each document $D_j$. Indeed, the source code component is represented as a sequence of words $w_1, w_2, \ldots, w_m$ of length $m$ (the identifiers of the source code component) defined on the same vocabulary as the documents. We assume that programmers use meaningful names (i.e., names derived from the application and problem domain) for their identifiers and/or that these identifiers are preprocessed to extract names that share the semantics of the requirements (e.g., by splitting the sequence of words contained in a single identifier). For each document the conditional probability above can then be written as¹:

$$P(w_1, w_2, \ldots, w_m \mid D_j) = \prod_{k=1}^{m} P(w_k \mid w_1 \cdots w_{k-1})$$

However, when $m$ increases, the conditional probabilities involved in the above product quickly become difficult to estimate for any possible sequence of words in the vocabulary. A simplification can be introduced by conditioning each word only on the last $n-1$ words (with $n \le m$):

$$P(w_1, w_2, \ldots, w_m \mid D_j) \simeq \prod_{k=1}^{m} P(w_k \mid w_{k-n+1} \cdots w_{k-1})$$

This n-gram approximation, which formally assumes a time-invariant Markov process [6], greatly reduces the statistics to be collected in order to compute $P(w_1, \ldots, w_m \mid D_j)$; clearly, this introduces an imprecision. However, n-gram models are still difficult to estimate, because if $V$ is the size of the vocabulary, all the possible $V^n$ sequences of words in the vocabulary have to be considered; this can be an issue even for a 2-gram (bigram) model. Moreover, the occurrence of any sequence of words in a document is a rare event, as it generally occurs only a few times and most of the sequences will never occur (the problem of the sparseness of data). Therefore, in our approach we have considered a unigram approximation ($n = 1$):

$$P(w_1, w_2, \ldots, w_m \mid D_j) \simeq \prod_{k=1}^{m} P(w_k)$$

which corresponds to considering all the words $w_k$ to be independent. Therefore, each document is represented by a language model where unigram probabilities are estimated for all the words in the vocabulary.

¹ For the sake of simplicity, the probability $P(w \mid D_j)$ for a fixed document $D_j$ is also written as $P(w)$.
2.2 Unigram Estimation and the Zero-Frequency Problem
Unigram estimation is based on the term frequency of each word in a document. However, using the simple term frequency would turn the product $\prod_{k=1}^{m} P(w_k \mid D_j)$ to zero whenever any word $w_k$ is not present in the document $D_j$. This problem, known as the zero-frequency problem [26], can be avoided using different approaches (see [16]). The approach we have adopted consists of smoothing the unigram probability distribution by computing the probabilities as follows:

$$P(w \mid D_j) = \begin{cases} \dfrac{c(w, D_j) - \beta}{T_j} & \text{if } w \text{ occurs in } D_j \\ \lambda_j & \text{otherwise} \end{cases}$$

where $c(w, D_j)$ is the number of occurrences of the word $w$ in the document $D_j$ and $T_j$ is the total number of words in the document $D_j$. The interpolation term is:

$$\lambda_j = \frac{\beta \, R_j}{T_j \, (V - R_j)}$$

where $V$ is the size of the vocabulary and $R_j$ is the number of different words of the vocabulary occurring in the document $D_j$. The value of the parameter $\beta$ is computed according to Ney [17] as follows:

$$\beta = \frac{n_1}{n_1 + 2 n_2}$$

where $n_r$ is the number of words occurring $r$ times in the document $D_j$.
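To illustrate the estimation step, a possible rendering of the smoothed unigram model in code is shown below. It is a sketch under the assumptions stated above (absolute discounting with Ney's estimate of the discount), with hypothetical function names and a fallback value of ours; it is not the authors' implementation.

```python
from collections import Counter

def smoothed_unigram(doc_words, vocabulary):
    """Smoothed unigram model for one document, following Section 2.2:
    P(w|D) = (c(w,D) - beta) / T            if w occurs in D
           = beta * R / (T * (V - R))        otherwise (interpolation term)."""
    counts = Counter(doc_words)
    T = sum(counts.values())      # total number of words in the document
    R = len(counts)               # number of distinct vocabulary words occurring in the document
    V = len(vocabulary)           # size of the whole vocabulary
    n1 = sum(1 for c in counts.values() if c == 1)   # words occurring once
    n2 = sum(1 for c in counts.values() if c == 2)   # words occurring twice
    beta = n1 / (n1 + 2 * n2) if (n1 + 2 * n2) > 0 else 0.5  # Ney's estimate; 0.5 fallback is ours
    lam = beta * R / (T * (V - R)) if V > R else 0.0          # mass reserved for unseen words
    return {w: ((counts[w] - beta) / T if w in counts else lam) for w in vocabulary}
```

The discounted mass taken from the $R_j$ seen words is exactly what the $V - R_j$ unseen words receive, so the probabilities still sum to one.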
2.3 Process Description
Our traceability recovery process exploits the probabilistic approach to document retrieval described in the previous sections. Figure 1 depicts the overall process. For each software document a unigram language model is estimated. Language model estimation is preceded by a text normalization phase performed at three levels of accuracy:

1. at the first level all capital letters are transformed into lower case letters;

2. at the second level all stop words (such as articles, punctuation, numbers, etc.) contained in the text are identified and removed;

3. at the third level a morphological analysis is used to convert plurals into singulars and to bring all the inflected verbs back to their infinitive form.

On the other hand, for each source code component a query is extracted, consisting of a sequence of words derived from the component identifiers. Query extraction consists of three phases: identifier extraction, identifier separation, and text normalization. First, the source code is parsed to extract a sequence of identifiers. In the second phase, the identifiers composed of two or more words are split into single words (for example, the identifier ReserveRoom is split into the words Reserve and Room). In the third phase, text normalization is applied (as described previously) and the sequence of words representing the query is produced. Finally, Bayesian classification consists of scoring each query against the estimated language models; in other words, for each source code component a ranked list of documents is produced, ordered according to the probabilities $P(Q_i \mid D_j)$ that the documents are relevant to the component.

2.4 Tool Support

A number of tools have been developed to automate the process shown in Figure 1. A top-down recursive parser was developed to analyze Java source code. Once the parse tree is available, it is traversed and the required information is stored in support files. For each class, the comments and the identifiers of attributes, methods, and method parameters are stored in separate files. For the present study comments were disregarded: the entire traceability link recovery process relies on the attributes, methods, and parameters of each class. A minimal tool suite to assist textual information processing for the Italian language was also developed. In particular, identifier separation is performed in two steps: the first step is completely automated and recognizes words separated by underscores and sequences of words starting with capital letters. The second step is semi-automatic: the tool exploits spelling facilities to prompt the software engineer with the words that might be separated, leaving the final decision to the user. The first two steps of text normalization (letter transformation and stop word removal) have also been completely automated. Moreover, a semi-automatic stemmer, based on thesaurus facilities, has been implemented to help the software engineer bring inflected words back to their roots (e.g., verbs are reduced to their infinitive form²). Finally, we have implemented the language model estimators and a Bayesian classifier that computes, for each language model, the probability $P(Q_i \mid D_j)$ for the given input text. As already stated, due to text sparseness, in the current implementation only unigrams are retained, thus the probability is actually computed as the product of the unigram probabilities:

$$P(Q_i \mid D_j) = \prod_{k=1}^{m} P(w_k \mid D_j)$$

² It is worth noting that this step is much more difficult when applied to languages such as Italian, as in the case of the documentation of the case study presented in this paper.
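The automated part of identifier separation and the first two normalization levels are simple enough to sketch in code. The fragment below is our own illustration (the stop word list and function names are hypothetical), not the semi-automatic tool suite described above, and it leaves the morphological analysis out.

```python
import re

# Hypothetical, tiny stop word list; the real one would cover articles,
# prepositions, numbers, punctuation, and so on (in Italian for Albergate).
STOP_WORDS = {"the", "a", "of", "and", "to", "in"}

def separate_identifier(identifier):
    """Split an identifier on underscores and on capitalized word boundaries,
    e.g. 'ReserveRoom' -> ['Reserve', 'Room'], 'check_in_date' -> ['check', 'in', 'date']."""
    parts = []
    for chunk in identifier.split("_"):
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [p for p in parts if p]

def normalize(words):
    """Normalization levels 1 and 2: lower-case everything and drop stop words and numbers."""
    lowered = [w.lower() for w in words]
    return [w for w in lowered if w not in STOP_WORDS and not w.isdigit()]

def query_from_identifiers(identifiers):
    """Build the query word sequence for one class from its identifiers."""
    words = []
    for ident in identifiers:
        words.extend(separate_identifier(ident))
    return normalize(words)

print(query_from_identifiers(["ReserveRoom", "check_in_date", "guestName"]))
# -> ['reserve', 'room', 'check', 'date', 'guest', 'name']  ('in' removed as a stop word)
```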
3. Case Study

A traceability link recovery process can be seen as an information retrieval process. Thus, performance can be evaluated using two well known and widely accepted information retrieval metrics, namely precision and recall [7]. Recall is the ratio of the number of relevant documents retrieved for a given query over the total number of relevant documents for that query in the "database". Precision is the ratio of the number of relevant documents retrieved over the total number of documents retrieved. We believe that other process performance characteristics, such as memory requirements, time complexity, and portability, are not a major issue for the problem addressed in this paper, in that the process can be easily accomplished on a PC with the usual file and text processing tools (e.g., detex, Perl, Microsoft Word).

In a previous experiment [1] we applied our traceability link recovery approach to release 3.4 of a freely available C++ library of foundation classes, called LEDA (Library of Efficient Data types and Algorithms), developed and distributed by Max-Planck-Institut für Informatik, Saarbrücken, Germany (freely available for academic research and teaching from http://www.mpisb.mpg.de/LEDA/). In this case, traceability links between classes and manual pages were recovered: a precision of 38.94% and a recall of 82.65% were achieved when considering the best match between classes and manual pages [1]. It is very important to highlight that with a moderately higher number of retained candidates we were able to recover all the traceability links (100% recall). The low value of precision may well pay off; higher precision values should not be privileged: we believe the time spent in discarding a false candidate is considerably lower than the time required to recover a missed link. However, it could be argued that such a high value for the recall was due to the nature of LEDA, i.e., a C++ library whose manual pages contain a high number of identifiers that also appear in the source code. Actually, the LEDA team generated the manual pages with scripts that extract comments and special sections from the source files. Manual sections are identified by a kind of markup language: whenever the documentation is reconstructed, scripts parse the source files and LaTeX sections or HTML documents are extracted.

To assess our traceability recovery process and verify the feasibility of using it in more realistic case studies, we applied it to a more complex software system developed following the Boehm waterfall model [3]. For this system all the prescribed documentation was available (e.g., requirement documents, design documents, test cases, etc.). The analyzed software system, namely Albergate, was developed in Java by a team of final year students at the University of Verona (Italy). Albergate is a software system designed to implement all the operations required to administer and manage a small/medium size hotel (room reservation, bill calculation, etc.). The system was developed from scratch on the basis of 16 functional requirements specified (as well as all the other system documentation) in the Italian language. Albergate exploits a relational database for most of the operations, therefore the system size is relatively low (about 20 KLOC with 95 classes). Non-functional requirements as well as other documentation such as user manuals, UML use cases, and architecture and detailed design documents were not used in the present study. We focused on the 60 classes implementing the user interface of the software system. To validate the results of our traceability recovery process, the original developers were asked to provide a 16 x 60 traceability matrix linking each requirement to the classes implementing it. Most of the functional requirements were implemented by a low number of classes: on the average, a requirement is implemented by about 4 classes, with a maximum of 10.

To make the Albergate results directly comparable with the results obtained with the LEDA case study, we first applied a simplified version of the process shown in Figure 1. Indeed, in the LEDA case study the Identifier Separation and the Text Normalization activities were performed in a much simpler way than in the present case study. In particular, in the LEDA case study Identifier Separation only consisted of splitting identifiers containing underscores: for example, Reserve_Room was split into Reserve and Room. Similarly, Text Normalization was performed only at the first level of accuracy, i.e., the transformation of capital letters into lower case letters. Table 1 shows the results of applying such a simplified process to the Albergate software system. Average precision and recall are considered at different levels of retained top ranked candidates (documents); for example, if the first 3 documents are considered for each class, only 34% of recall is achieved, compared with the 94% obtained in the LEDA case study [1].

Table 1. Albergate baseline results.

Top ranking   Precision   Recall
1             28.3 %      25.8 %
2             16.0 %      29.3 %
3             12.5 %      34.4 %
4              9.4 %      39.6 %
5              7.5 %      48.2 %
6              4.9 %      51.7 %

The figures of Table 1 are not as encouraging as those of the LEDA case study [1]. However, it is worth noting that the Italian language has a complex grammar: verbs have many more forms than English verbs, plurals are almost always irregular, and adverbs and adjectives have irregular forms too. Furthermore, in the Albergate case study the relative distance between the source code (namely, the classes) and the documents (namely, the functional requirements) is much higher than in the LEDA case study. For example, words common to requirements and classes are quite infrequent in the Albergate system: unlike the LEDA documents (manual pages of the class library), the Albergate documents (functional requirements) were produced in the early phases of the software development life cycle.

Several factors affecting the quality of the recovered traceability links depend on the way semantic concepts of the domain are encoded when programmers map the problem domain into requirements, requirements into design, and design into code. Mapping concepts from the problem domain into requirements and from requirements into design and code is an activity requiring the comprehension of the problem domain and the instantiation of some form of relation between entities at different conceptual levels. We believe that programmers face this problem with a strategy based on the use of the same semantic concepts at the different levels of abstraction: semantic concepts that are carried by words, sentences (e.g., comments), or sequences of words. Despite the presence of such a strategy, a recovery process may be affected by the way programmers encode the semantics into requirements, design, and code. For example, a programming language imposes a syntax on identifiers. These encoding schemes may be regarded as affecting factors for the traceability recovery process. Besides language specific features, the following affecting factors were recognized as fundamental:

- code identifiers were produced by concatenating (without any underscores) two or more words;

- sometimes the same concept was referred to by different (semantically equivalent) words in the requirements and/or in the code;

- Java, as well as most programming languages, is case-sensitive, and programmers are encouraged to create identifiers by concatenating words and capitalizing the first letter of each word.

The affecting factors are not listed in order of relevance; further work is required to assess their relative influence. To mitigate the influence of these factors, we have improved the Identifier Separation and Text Normalization activities as specified in the previous sections. As shown in Table 2, a dramatic improvement was achieved: the corresponding figures of Table 1 increased by about 100%. All traceability links were recovered by considering the first six documents for each class.

Table 2. Albergate results with text processing.

Top ranking   Precision   Recall
1             54.7 %      50.0 %
2             38.6 %      70.6 %
3             28.3 %      77.5 %
4             21.2 %      87.9 %
5             16.9 %      98.2 %
6             12.4 %      100 %

These measures undoubtedly support the effectiveness of our approach to recovering a mapping between a functional requirement and the classes implementing it, provided that an adequate text processing is applied (e.g., handling composite words, stop words, plural and singular forms of the same word, etc.). As already stated, the Italian language is much more complex than the English language, thus it is likely that even better results could be obtained if the same process were applied to a system with requirements (and code identifiers) written in English. In the current process implementation Identifier Separation and Text Normalization are not completely automated; rather, they are supported by tools such as a spelling checker, a thesaurus, and Perl filters. However, it is worth noting that it took a software engineer without any knowledge of the application domain only half a day to accomplish this task with the help of the semi-automatic tools. In our opinion it would take more time to manually analyze all possible links between 95 source code classes (for a total of 20 KLOC) and 16 functional requirements. Further work will be devoted to analyzing the gain in effort induced by the use of our approach.
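Since precision and recall at a given cut-off drive the whole evaluation above, a small worked example may help. The computation below is our own illustration of the metric definitions used in this section, on made-up data; it does not reproduce the Albergate figures.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision and recall when only the top-k retrieved documents are kept.
    'retrieved' is the ranked list of requirement ids for one class,
    'relevant' is the set of requirement ids the class actually implements."""
    top = retrieved[:k]
    hits = sum(1 for d in top if d in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical class: ranked requirements vs. the two it really implements.
ranked = ["FR3", "FR7", "FR1", "FR9"]
truth = {"FR3", "FR1"}
print(precision_recall_at_k(ranked, truth, 3))  # (0.666..., 1.0)
```

The figures reported in Tables 1 and 2 are averages of such per-class values over all the classes considered, at each cut-off level.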
4. Related Work
To the authors' knowledge, the issue of recovering traceability links between code and free text documentation has not yet been well understood and investigated. This study is an extension of a previous work [1] where a C++ library and its documentation were considered. However, in order to apply our approach to a complete project and its documentation, we were forced to introduce a more sophisticated text processing step. We believe this step is not only tied to the Italian language, as the affecting factors we identified are likely to be common to all languages. Apart from our previous work, the literature does not offer many other contributions. Most of the related work is in the area of impact analysis. For example, Turver and Munro [24] assume the existence of some form of ripple propagation graph describing relations between software artifacts, including both code and documentation, and focus on the prediction of the effects of a maintenance change request not only on the source code but also on the specification and design documents. TOOR [20], IBIS [10], and REMAP [21] are a few examples of software development tools able to build and maintain traceability links among various software artifacts. However, these tools are focused on the development phase; furthermore, they require human intervention to define links or they force the adoption of naming conventions. An approach inspired by information retrieval was proposed in [12] for automatically assembling software libraries based on a free text indexing scheme. This approach uses attributes automatically extracted from the natural language documentation of the IBM RISC System/6000 AIX 3 to build a browsing hierarchy which accepts queries expressed in natural language. Indexing is performed by computing Lexical Affinities (LAs), which stand for the correlation of the common appearance of words in the utterances of the language, and by computing their "resolving power", defined as a function of the quantity of information of the probability of co-occurrence of words within a sliding window of ±5 words. Classification against a query is performed by ranking documents based on the "resolving power" of their LAs. Another indexing approach based on free text analysis is the RSL system [5], which extracts free-text single-term indices from comments in source code files, looking for reuse keywords like "author", "date created", etc. REUSE [2] is an information retrieval system which stores software objects as textual documents. CATALOG [8] is an information retrieval system to store and retrieve C components, each of which is individually characterized by a set of single-term indexing features automatically extracted from the natural language headers of C programs.

5. Conclusions

We have presented an approach to recover traceability links between software written in an object-oriented language, namely Java, and high level documentation: the approach relies on a probabilistic mapping to construct a many-to-many relation between classes and functional requirements. In a previous experiment on a library of C++ foundation classes we demonstrated the feasibility of recovering traceability links between manual pages and classes. The present work extends and validates the previous results on a more complex and difficult case study. Our process was inspired by information retrieval and stochastic language models: a language model is estimated from each functional requirement. Classes are represented by the class identifiers (i.e., attributes, methods, and method parameters). The class representation is thought of as a query: each class is bound to the ordered list of documents according to the computed probability.

By carefully analyzing the preliminary results, we identified affecting factors that are likely to be language independent. Taking into account those affecting factors produced a dramatic increase in the precision and recall of the traceability recovery process. Future work will be devoted to extending the investigation to vector space models, to comparing different model families, and to assessing the relative influence of the affecting factors.

6. Acknowledgements

We would like to thank the Albergate programmer team: Colombari Andrea, Danzi Francesca, Girelli Daria, Martini Roberto, Meneghini Matteo, and Vincenti Paola, who kindly provided the source code and the documentation of the system and gave us feedback on the outcomes of our analyses. Special thanks to Martini Roberto for the helpful hints and for providing the requirement-to-code traceability matrix.
References
[1] G. Antoniol, G. Canfora, A. De Lucia, and E. Merlo. Recovering code to documentation links in OO systems. In Proceedings of the IEEE Working Conference on Reverse Engineering, Atlanta, Georgia, pages 136–144. IEEE Comp. Soc. Press, October 1999.
[2] S. P. Arnold and S. L. Stepoway. The REUSE system: Cataloging and retrieval of reusable software. In W. Tracz, editor, Software Reuse: Emerging Technology. IEEE Comp. Soc. Press, 1987.
[3] B. W. Boehm. Software Engineering Economics. Prentice-Hall, Englewood Cliffs, NJ, 1981.
[4] R. Brooks. Towards a theory of the comprehension of computer programs. International Journal of Man-Machine Studies, 18:543–554, 1983.
[5] B. A. Burton, R. W. Aragon, S. A. Bailey, K. Koelher, and L. A. Mayes. The reusable software library. In W. Tracz, editor, Software Reuse: Emerging Technology, pages 129–137. IEEE Comp. Soc. Press, 1987.
[6] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley Series in Telecommunications, John Wiley & Sons, New York, NY, 1992.
[7] W. B. Frakes and R. Baeza-Yates. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Englewood Cliffs, NJ, 1992.
[8] W. B. Frakes and B. A. Nejmeh. Software reuse through information retrieval. In Proceedings of the 20th Annual HICSS, Kona, HI, pages 530–535, January 1987.
[9] D. Harman. Ranking algorithms. In Information Retrieval: Data Structures and Algorithms, pages 363–392. Prentice-Hall, Englewood Cliffs, NJ, 1992.
[10] J. Conklin and M. Begeman. gIBIS: a hypertext tool for exploratory policy discussion. ACM Transactions on Office Information Systems, 6(4):303–331, October 1988.
[11] S. Letovsky. Cognitive processes in program comprehension. In Empirical Studies of Programmers: First Workshop, E. Soloway and S. Iyengar, editors. Ablex Publishers, 1986.
[12] Y. Maarek, D. Berry, and G. Kaiser. An information retrieval approach for automatically constructing software libraries. IEEE Transactions on Software Engineering, 17(8):800–813, 1991.
[13] A. von Mayrhauser and A. M. Vans. From program comprehension to tool requirements for an industrial environment. In Proceedings of the IEEE Workshop on Program Comprehension, pages 78–86, Capri, Italy, 1993. IEEE Comp. Soc. Press.
[14] A. von Mayrhauser and A. M. Vans. Dynamic code cognition behaviours for large scale code. In Proceedings of the IEEE Workshop on Program Comprehension, pages 74–81, Washington, DC, USA, 1994. IEEE Comp. Soc. Press.
[15] A. von Mayrhauser and A. M. Vans. Identification of dynamic comprehension processes during large scale maintenance. IEEE Transactions on Software Engineering, 22(6):424–437, 1996.
[16] R. De Mori. Spoken Dialogues with Computers. Academic Press, Orlando, Florida, 1998.
[17] H. Ney and U. Essen. On smoothing techniques for bigram-based natural language modelling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume S12.11, pages 825–828, Toronto, Canada, 1991.
[18] N. Pennington. Comprehension strategies in programming. In Empirical Studies of Programmers: Second Workshop, G. M. Olson, S. Sheppard, and E. Soloway, editors. Ablex Publishers, Norwood, NJ, 1987.
[19] N. Pennington. Stimulus structures and mental representations in expert comprehension of computer programs. Cognitive Psychology, 19:295–341, 1987.
[20] F. A. C. Pinheiro and J. A. Goguen. An object-oriented tool for tracing requirements. IEEE Software, 13(2):52–64, March 1996.
[21] B. Ramesh and V. Dhar. Supporting systems development using knowledge captured during requirements engineering. IEEE Transactions on Software Engineering, 18(6):498–510, June 1992.
[22] B. Shneiderman and R. Mayer. Syntactic/semantic interactions in programmer behavior: A model and experimental results. International Journal of Computer and Information Sciences, 8(3):219–238, March 1979.
[23] E. Soloway and K. Ehrlich. Empirical studies of programming knowledge. IEEE Transactions on Software Engineering, 10(5):595–609, 1984.
[24] R. J. Turver and M. Munro. An early impact analysis technique for software maintenance. Journal of Software Maintenance: Research and Practice, 6(1):35–52, 1994.
[25] I. Vessey. Expertise in debugging computer programs: A process analysis. International Journal of Man-Machine Studies, 23:459–494, 1985.
[26] I. H. Witten and T. C. Bell. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37:1085–1094, 1991.