Automatic Indexing of Software Artifacts

M. R. Girardi and B. Ibrahim
University of Geneva - CUI
CH-1211 Geneve 4, Switzerland
E-mail: {girardi, bertrand}@cui.unige.ch

Abstract

This paper describes the classification mechanism of ROSA, a software reuse system based on the processing of the natural language descriptions of software artifacts. The system supports the automatic indexing of components by acquiring lexical, syntactic and semantic information from software descriptions. Major goals of the approach are cost-effectiveness and retrieval precision. Some results from a first experiment with the classification system are discussed.

1 Introduction

Two main problems limit the practice of software reuse: the lack of mechanisms to produce reusable software artifacts that are robust and easy to adapt, and the lack of mechanisms to retrieve software effectively according to user requirements.

Despite the scarcity of software specially designed for reuse, a large amount of software that was not designed with reuse in mind can nevertheless be reused. Source code is available in different programming languages, and different specification languages and formalisms are used for other development products. The natural language documentation that normally accompanies these products provides a uniform description of the software artifacts that can be used to classify and retrieve them.

Most on-line information systems currently offer some searching capability, but it tends to be very primitive (e.g. 'man -k', WAIS, Archie, Gopher/Veronica, WWW). More precision is needed for the identification of software artifacts.

Information retrieval techniques have frequently been used to construct and evaluate software retrieval systems and, following the research in that area, earlier reuse systems were keyword-based. In these systems, catalogues are usually constructed manually and components are retrieved by specifying a set of keywords under a software classification scheme, using either a free vocabulary or a controlled one. But constructing software libraries by indexing software components manually is difficult and expensive. Automatic indexing is required to make software retrieval systems cost-effective. On the other hand, the effectiveness of traditional keyword-based retrieval systems is limited by the expertise of users as well as by the size of the software libraries. Moreover, the effectiveness of these systems is limited by the so-called "keyword barrier", i.e. the systems are unable to improve retrieval effectiveness beyond a certain performance threshold [14]. This situation is particularly critical in software retrieval, where users require high precision, i.e. they expect to retrieve only the best components for reuse, rather than having to select the best one from a list containing many irrelevant components.

The availability of machine-readable dictionaries and of mature techniques for language analysis has made possible the application of natural language processing techniques to information retrieval systems [2] [3] [11] [12] [18] [21]. The main purpose of this work is to experiment with these techniques in the domain of software descriptions and to evaluate their potential to construct more effective reuse systems.

This paper describes an approach for the automatic indexing of software components from their descriptions in natural language. Some results from an initial experiment with the classification system are also discussed. The classification approach is being used in ROSA (Reuse Of Software Artifacts), a software reuse system [6] [7] [8] [9] [10] that aims at being cost-effective and domain independent while providing good retrieval performance. The retrieval mechanism of ROSA is only briefly described in this paper. A detailed description of this mechanism, with the similarity measures applied in retrieval operations, is given in [9]. Some results on retrieval evaluation are also discussed in [10].

The paper is organized as follows. Section 2 summarizes the main approaches for software classification. Section 3 outlines the main mechanisms in the current version of ROSA. Section 4 describes the classification formalism used to identify, in a software description, the knowledge needed to catalogue the component in a library. Section 5 introduces the mechanisms defined for the analysis of both queries and component descriptions (morpholexical, syntactic and semantic analysis) and the semantic structure of the Frame Base. Section 6 discusses the results of an initial experiment with the classification system and section 7 introduces some results on retrieval evaluation. Section 8 concludes the paper with some remarks on retrieval results [10] and on the advantages and limitations of the proposed approach.

2 Classification of software

This section summarizes the major approaches to software classification: approaches that propose a classification scheme according to which software artifacts are catalogued in a software base by a manual indexing procedure [17][22][24], and more recent proposals that take advantage of the lexical, syntactic and semantic information available in the natural language documentation of software artifacts both to improve retrieval effectiveness and to support the automatic indexing of components [1][13][16].

2.1 Manual indexing

Wood and Sommerville [24] propose a frame-based software component catalogue that aims at capturing the meaning of software components. The catalogue consists of a frame for the basic function that a software component performs, with slots for the objects manipulated by the component.

Prieto-Diaz [17] proposes a faceted classification scheme for cataloguing software components. This scheme has also been adopted by other reuse proposals, like the REBOOT project [20]. The scheme uses facets as descriptors of software components; these facets have a meaning similar to some of the semantic cases defined in this work. The scheme consists of four facets: the function performed by the component, the object manipulated by the function, the data structure where the function takes place (medium or location), and the system to which the function belongs.

NLH/E (Natural Language Help/English) [22] is a question-answering system that accepts queries in English. The authors propose a frame-based catalogue for software components, organized as a case grammar. The catalogue is created manually but queries are parsed using the case grammar. The catalogue is specific to the Common Lisp domain.

PROTEUS [5] is a software reuse system that supports different classification methods for software components (simple keyword-based, faceted, enumerated graph and attribute value) in order to allow a comparative empirical evaluation of those methods.

2.2 Automatic indexing

Systems that provide automatic indexing of software components can be classified into two basic groups: systems that work only at the lexical level [13] and systems that include syntactic and semantic analysis of software descriptions [1][16].

Lexical processing. The GURU system [13] classifies software components according to attributes automatically extracted from their natural language documentation, using an indexing scheme based on the notions of lexical affinity and quantity of information. The retrieval system accepts queries in free-style natural language, and the same technique used to index software components is applied to queries. Even without attempting any understanding of the descriptions and without using any kind of syntactic or semantic knowledge, the system exhibits better precision than single keyword-based systems.

Syntactic and semantic processing. These systems perform some kind of syntactic and semantic analysis of natural language specifications without attempting a complete understanding of the descriptions. They rely on a knowledge base which stores semantic information about the application domain and about natural language itself. These systems appear more powerful than traditional keyword retrieval systems, but they usually require enormous human resources because knowledge bases are created manually for each application domain. Some of the systems that follow this approach are described below.

LaSSIE (Large Software System Information Environment) [1] is composed of a knowledge base, a semantic analysis algorithm based on formal inference, and a user interface incorporating a graphical browser and a natural language parser. LaSSIE helps programmers search for useful information about a large software system and also supports the retrieval of components for reuse. The construction of the knowledge base is currently a manual process, and future directions of the project point to the definition of knowledge acquisition mechanisms to automate that process.

Nie [16] proposes a software retrieval system for components documented in French. Both input sentences and software descriptions are parsed using a semantic grammar. A knowledge base for a restricted application domain has been created to assist the parsing process. The knowledge consists of a set of classes modelling data and operations in the domain and a set of relationships among concepts in the classes.

3 Overview of ROSA

The current version of the ROSA reuse system consists of a classification mechanism and a retrieval mechanism.

The classification system catalogues the software components in a software base through their descriptions in natural language. An acquisition mechanism automatically extracts from software descriptions the knowledge needed to catalogue them in the software base. The system extracts lexical, syntactic and semantic information, and this knowledge is used to create a frame-based internal representation of the software component. The interpretation mechanism used for the analysis of a description does not attempt to understand the meaning of the description; it attempts to automatically acquire enough information to construct useful indexing units for software components. Syntactic and semantic analysis of descriptions follow the rules of a classification formalism. The formalism consists of a case system, constraints and heuristics to perform the translation of the description into an internal representation. Public domain lexicons are used to obtain the lexical and semantic information needed during the parsing process. The WordNet [15] lexicon is used to obtain morphological information, grammatical categories of terms and lexical relationships between terms. The Software base is a base of frames where each software component has a set of associated frames containing the internal representation of its description, along with other information associated with the component (source code, executable examples, reuse attributes, etc).

An analysis mechanism similar to the one applied to software descriptions is used to map a query in free text to a frame-based internal representation. The retrieval system [9][10] looks for and selects components from the Software base, based on the closeness between the frames associated with the query and those associated with the software descriptions. Closeness measures [9] are derived from the semantic formalism and from a conceptual distance measure between the terms in the frames under comparison. Software components are scored according to their closeness to the user query; the ones with a score higher than a controlled threshold become the retrieved candidates.

As a first step, the system deals with imperative sentences for both queries and software descriptions. Imperative sentences describe simple actions that are performed by a software component, the object manipulated by the action, the manner by which the action is performed and other semantic information related to the action.
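The threshold-based selection described above can be pictured with a small piece of Prolog. This is a hedged sketch only, not ROSA's code: the predicate component/1 (enumerating catalogued components) and closeness/3 (standing for the similarity measures of [9]) are illustrative assumptions.

:- use_module(library(lists)).

% Sketch of threshold-based candidate selection: keep the components whose
% closeness to the query exceeds the controlled threshold, ranked by score.
retrieved_candidates(Query, Threshold, Candidates) :-
    findall(Score-Component,
            ( component(Component),
              closeness(Query, Component, Score),   % assumed similarity measure
              Score >= Threshold ),
            Scored),
    keysort(Scored, Ascending),                     % sort by score, lowest first
    reverse(Ascending, Ranked),                     % highest closeness first
    findall(C, member(_-C, Ranked), Candidates).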

4 The classification formalism

This section describes the formalism used to represent software descriptions. Through a case system for simple imperative sentences, the formalism establishes the rules to generate the internal representation of both queries and natural language descriptions of software components. The formalism is inspired by the linguistic case theory of Fillmore [4].

4.1 The case system

Definition 1 - A case system for imperative sentences. The case system for an imperative sentence consists basically of a sequence of one or more semantic cases:

Imperative_sentence --> Σ_{j ∈ C} c_j

where C = {j | j is a semantic case} and c_j is the term in the sentence associated with the semantic case 'j'.

Semantic cases are associated with some syntactic compounds of an imperative sentence. An imperative sentence consists of a verb (representing an action) followed by a noun phrase (representing the direct object of the action) and perhaps some embedded prepositional phrases. For instance, the sentence 'search a file for a string' consists of the verb 'search', in the infinitive form, followed by the noun phrase 'a file', which represents the object manipulated by the action, and followed by the prepositional phrase 'for a string', which represents the goal of the 'search' action. In the example, the semantic cases 'Action', 'Location' and 'Goal' are respectively associated with the verb, the direct object and the prepositional phrase of the sentence:

'search a file for a string' --> c_Action + c_Location + c_Goal

Definition 2 - Semantic cases. Semantic cases show how noun phrases are semantically related to the verb in a sentence. For instance, in the sentence 'search a file for a string', the semantic case 'Goal' associated with the prepositional phrase 'for a string' indicates the target of the action 'search'. We have defined a basic set of semantic cases for software descriptions by analyzing the short descriptions of Unix commands in manual pages (Table 1). These semantic cases describe the basic functionality of the component (the action, the target of the action, the medium or location, the mode by which the action is performed, etc.).

Definition 3 - Case generators. A semantic case consists either of a verb (the 'Action' semantic case) or of a case generator (possibly omitted) followed by a noun phrase (1):

c_i --> (Case generator) + Noun phrase

A case generator reveals the presence of a particular semantic case in a sentence. Case generators are mainly prepositions. For instance, in the sentence 'search a file for a string', the preposition 'for' in the prepositional phrase 'for a string' suggests the 'Goal' semantic case.

Semantic case   Description
Action          The main verb in an imperative sentence
Agent           The software component that implements an action (implicit)
Comparison      The entity compared with the main object of an action
Condition       The premise upon which an action is performed
Destination     The destination of the object
Duration        How long an action lasts
Goal            The manipulated object, the target or result of an action
Instrument      The entity with which an action is performed
Location        Where an action occurs or the entity manipulated by an action
Manner          The mode by which an action is performed
Purpose         The reason for an action
Source          The origin of the object manipulated by an action
Time            When an action occurs

Table 1 - Major semantic cases found in imperative sentences describing software components

Table 2 shows some generators of the semantic cases considered in this work. They have been identified by analysing the short descriptions in manual pages of Unix commands, the meaning of prepositions, and the relationships of the prepositions with the defined semantic cases. Some case generators suggest more than one semantic case. For instance, the preposition 'for' suggests all three semantic cases 'Goal', 'Purpose' and 'Duration'. The semantic analysis of descriptions tries to disambiguate these different interpretations by applying a set of heuristics (based on both syntactic and semantic knowledge) to the associated prepositional phrase that follows in the sentence.

(1) The 'Purpose' semantic case is an exception. It consists of a case generator followed by a verbal phrase.

Semantic case   Case generator
Comparison      as, between, like, with
Condition       as, at, except, how, unless, until
Destination     to
Duration        during, for, through
Goal            for, to
Instrument      'by using', with, using
Location        at, in, into, on, through, within
Manner          by
Purpose         for, to
Source          from
Time            at, in, into, on, through

Table 2 - Semantic cases and some of their case generators
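Part of Table 2 can be read directly as a small relation between prepositions and the semantic cases they suggest. The following is an illustrative encoding as Prolog facts (the predicate name generator/2 is our assumption, not ROSA's); an ambiguous preposition such as 'for' simply yields several facts, and 'in' shows the overlap between 'Location' and 'Time'.

% Case-generator facts taken from Table 2 (illustrative subset).
generator(for, 'Goal').
generator(for, 'Purpose').
generator(for, 'Duration').
generator(to, 'Destination').
generator(to, 'Goal').
generator(from, 'Source').
generator(with, 'Instrument').
generator(by, 'Manner').
generator(during, 'Duration').
generator(in, 'Location').
generator(in, 'Time').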

Definition 4 - Case system constraints and heuristics. The following constraints and heuristics govern the case system. They state syntactic and semantic rules to identify semantic cases in a sentence.

1. No semantic case may appear twice.
2. The 'Action' semantic case is obtained from the main verb in the sentence.
3. The 'Location' or 'Goal' semantic cases are generated from the direct object in the sentence if there is no prepositional phrase in the sentence containing a case generator for the semantic case.
4. Other semantic cases are obtained from prepositional phrases in the sentence containing a case generator for the semantic case.

The case system, constraints and heuristics supporting the semantic formalism are implemented in the grammar used for the analysis of the descriptions.
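As an illustration of how such a grammar can encode the case system and the constraints above, here is a minimal definite clause grammar sketch in the spirit of the ROSA prototype. It is not the actual grammar: the tiny lexicon (verb_word/1, noun/1) is an assumption, and a single generator/2 fact from the table above is repeated so the fragment runs on its own. For simplicity the direct object is always mapped to 'Location' (constraint 3, simplified).

:- use_module(library(lists)).

% Illustrative lexicon and case-generator facts (see Table 2).
verb_word(search).
noun(file).
noun(string).
generator(for, 'Goal').

% An imperative sentence: a verb, a direct object and optional
% prepositional phrases (constraints 2-4 of Definition 4, simplified).
imperative(['Action'-V, 'Location'-Obj | Cases]) -->
    verb(V), noun_phrase(Obj), prep_phrases(Cases).

prep_phrases([Case-NP | Rest]) --> prep_phrase(Case, NP), prep_phrases(Rest).
prep_phrases([]) --> [].

% A prepositional phrase whose preposition is a case generator.
prep_phrase(Case, NP) --> [Prep], { generator(Prep, Case) }, noun_phrase(NP).

noun_phrase(N) --> opt_det, [N], { noun(N) }.
opt_det --> [D], { member(D, [a, an, the]) }.
opt_det --> [].

verb(V) --> [V], { verb_word(V) }.

% ?- phrase(imperative(Cases), [search, a, file, for, a, string]).
% Cases = ['Action'-search, 'Location'-file, 'Goal'-string]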

5 The analysis of the descriptions

Morpholexical, syntactic and semantic analysis of software descriptions are performed to map a description to a frame-based internal representation.

The purpose of morpholexical analysis is to process the individual words in a sentence to recognize their standard forms (e.g. 'search' as the standard form of 'searching'), their grammatical categories (verb, noun, adjective, adverb), and their semantic relationships with other words in a lexicon (such as synonyms, hypernyms and hyponyms). Morpholexical analysis also handles collocations and idioms, i.e. cases where two or more individual words are used in usual association (multi-word lexical entries).
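As a rough illustration of the first of these tasks, the fragment below reduces an inflected word to a candidate standard form by stripping a few common suffixes and checking the result against a lexicon. This is a hedged sketch only; lexicon/2 stands in for the information ROSA actually obtains from WordNet.

:- use_module(library(lists)).

% Illustrative lexicon entries (ROSA obtains these from WordNet [15]).
lexicon(search, verb).
lexicon(file, noun).
lexicon(string, noun).

% A word is already in standard form if it is in the lexicon; otherwise try
% stripping a few common inflectional suffixes and look the stem up.
standard_form(Word, Word, Cat) :-
    lexicon(Word, Cat).
standard_form(Word, Base, Cat) :-
    member(Suffix, [ing, ed, es, s]),
    atom_concat(Base, Suffix, Word),
    lexicon(Base, Cat).

% ?- standard_form(searching, Form, Cat).
% Form = search, Cat = verb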

Two semantic relations between terms are currently considered: synonymy and hyponymy/hypernymy. The predicate synonym(x,y,c) means that the term 'y' is a synonym of the term 'x' in the lexical category 'c' (e.g. synonym('computer', 'data processor', noun)). The predicate hyponym(x,y,d,c) means that the term 'y' is a hyponym (a specialization) of the term 'x' at a distance 'd' in a thesaurus, in the lexical category 'c' (e.g. hyponym('computer', 'PC', 1, noun)). The predicate hypernym(x,y,d,c) means that the term 'y' is a hypernym (a generalization) of the term 'x' at a distance 'd' in a thesaurus, in the lexical category 'c' (e.g. hypernym('computer', 'machine', 1, noun)).

After morpholexical analysis, both syntactic and semantic analysis of software descriptions are performed using a definite clause grammar that supports the case system. The defined grammar implements a subset of the grammar rules for imperative sentences in English [23]. This parsing phase generates a set of derivation trees, corresponding to the different syntactic interpretations of the description, with the semantic constraints imposed by the case system. Further semantic analysis of the generated derivation trees applies some constraints of the case system and some heuristics to disambiguate different interpretations of a description. A set of semantic structures, used as the internal representation of the description, is generated as a result of this analysis phase.

These semantic structures are created according to the classification scheme of Figure 1. The classification scheme consists of a hierarchical structure of generic frames ('IS-A-KIND-OF' relationship). Frames that are instances of these generic frames ('IS-A' relationship) implement the indexing units of software descriptions. Major generic frames of the Frame Base are shown in Figure 2. They model semantic structures associated with verb phrases, noun phrases and the information associated with software components, like name, description, source code, executable examples, etc. Semantic cases are represented as slots in the frames. 'Facets' are associated with each slot in a frame, describing either the value of the case or the name of the frame where the value is instantiated ('value' facet), the type of the frame that describes its internal structure ('domain' facet), or the lexical category of the case ('category' facet). For instance, the 'Location' slot in the verb phrase frame has a 'domain' facet indicating that its constituents are described in a frame of type 'noun phrase'. Through a parsing process, the interpretation mechanism maps the verb, the direct object and each prepositional phrase in a sentence into a semantic case, based on both syntactic features and identified case generators.

Case_frame --> FRAME Frame_name Hierarchical_link CASES Case_list.
Hierarchical_link --> IS_A Frame_name | IS_A_KIND_OF Frame_name
Case_list --> Case (Case_list)
Case --> Case_name Facet
Case_name --> Semantic_case_name | Other_case_name
Semantic_case_name --> Action | Agent | Comparison | Condition | Destination | Duration | Goal | Instrument | Location | Manner | Purpose | Source | Time
Other_case_name --> Modifier | Head | Adjective_modifier | Participle_modifier | Noun_modifier
Facet --> VALUE Value | DOMAIN Frame_name | CATEGORY Lexical_category
Value --> string | Frame_name
Lexical_category --> verb | adj | noun | adv | component_id | text.
Frame_name --> string.

Figure 1 - A classification scheme for software artifacts

Some case generators suggest more than one semantic case. For instance, the preposition 'for' suggests all the semantic cases 'Goal', 'Purpose' and 'Duration'. Some interpretations are disambiguated at the syntactic level (like 'Purpose'); the others are disambiguated at the semantic level by applying a set of heuristics to the corresponding prepositional phrase in the sentence. For instance, the rule:

Duration(noun_phrase) --> hypernym(noun_phrase, 'time period')

states a precondition for the 'Duration' semantic case: the term in the 'Duration' semantic case must be a specialization of the term 'time period'.
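This kind of precondition translates almost literally into executable form. The sketch below is illustrative rather than ROSA's implementation: direct_hypernym/3 is a hypothetical base of direct thesaurus links from which the hypernym/4 relation of Section 5 is derived, and case_for_preposition/3 applies the 'Duration' test to the head noun of the prepositional phrase, falling back to 'Goal' (the 'Purpose' reading being resolved at the syntactic level).

% Hypothetical base of direct hypernym links (in ROSA these come from WordNet).
direct_hypernym(week, 'time period', noun).
direct_hypernym('time period', measure, noun).

% hypernym(X, Y, D, C): Y is a generalization of X at distance D in category C
% (cf. the hypernym/4 predicate of Section 5), obtained by chaining links.
hypernym(X, Y, 1, C) :- direct_hypernym(X, Y, C).
hypernym(X, Y, D, C) :-
    direct_hypernym(X, Z, C),
    hypernym(Z, Y, D1, C),
    D is D1 + 1.

% Disambiguation of the preposition 'for': 'Duration' only if the head noun
% of the prepositional phrase specializes 'time period', otherwise 'Goal'.
case_for_preposition(for, Head, 'Duration') :-
    hypernym(Head, 'time period', _Distance, noun), !.
case_for_preposition(for, _Head, 'Goal').

% ?- case_for_preposition(for, week, Case).    % Case = 'Duration'
% ?- case_for_preposition(for, string, Case).  % Case = 'Goal'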

Figure 3 shows the semantic structures generated for the short description of the grep family of UNIX commands, 'search a file for a string or regular expression'. Note that the discontinuity in frame numbering is due to frames that have been discarded during the disambiguation of the semantic cases appearing in the sentence. On the other hand, the ambiguity introduced by the 'or' conjunction remains, producing the following alternative interpretations of the description:

1. search a file for a string
2. search a file for a regular expression
3. search a file for a string expression

Even if the third interpretation is not a good one, we allow the ambiguity, considering that it will not substantially affect retrieval effectiveness. Ambiguity is also introduced by the word 'regular', which is both an adjective and a noun.

FRAME verb_phrase
  IS_A_KIND_OF root_frame
  CASES
    Action       CATEGORY verb
    Agent        DOMAIN component
    Comparison   DOMAIN noun_phrase
    Condition    DOMAIN noun_phrase
    Destination  DOMAIN noun_phrase
    Duration     DOMAIN noun_phrase
    Goal         DOMAIN noun_phrase
    Instrument   DOMAIN noun_phrase
    Location     DOMAIN noun_phrase
    Manner       DOMAIN noun_phrase
    Purpose      DOMAIN verb_phrase
    Source       DOMAIN noun_phrase
    Time         DOMAIN noun_phrase.

FRAME noun_phrase
  IS_A_KIND_OF root_frame
  CASES
    Adjective_modifier   CATEGORY adj
    Participle_modifier  CATEGORY verb
    Noun_modifier        CATEGORY noun
    Head                 CATEGORY noun.

FRAME component
  IS_A_KIND_OF root_frame
  CASES
    Name         CATEGORY component_id
    Description  CATEGORY text
    . . . {Other information associated to the component, e.g. source code, executable examples, reuse attributes, etc}

Figure 2 - Some generic frames of the Frame Base

frame('grep_verb_phrase_15',IS_A,verb_phrase).
case('grep_verb_phrase_15',agent,VALUE,'grep_component_15').
case('grep_verb_phrase_15',action,VALUE,'search').
case('grep_verb_phrase_15',location,VALUE,'grep_noun_phrase_30').
case('grep_verb_phrase_15',goal,VALUE,'grep_noun_phrase_31').
frame('grep_component_15',IS_A,component).
case('grep_component_15',name,VALUE,'grep').
case('grep_component_15',description,VALUE,'search a file for a string or regular expression').
frame('grep_noun_phrase_30',IS_A,noun_phrase).
case('grep_noun_phrase_30',head,VALUE,'file').
frame('grep_noun_phrase_31',IS_A,noun_phrase).
case('grep_noun_phrase_31',adj_modifier,VALUE,'regular').
case('grep_noun_phrase_31',head,VALUE,'expression').
frame('grep_verb_phrase_16',IS_A,verb_phrase).
case('grep_verb_phrase_16',agent,VALUE,'grep_component_15').
case('grep_verb_phrase_16',action,VALUE,'search').
case('grep_verb_phrase_16',location,VALUE,'grep_noun_phrase_30').
case('grep_verb_phrase_16',goal,VALUE,'grep_noun_phrase_33').
frame('grep_noun_phrase_33',IS_A,noun_phrase).
case('grep_noun_phrase_33',noun_modifier,VALUE,'regular').
case('grep_noun_phrase_33',head,VALUE,'expression').
frame('grep_verb_phrase_17',IS_A,verb_phrase).
case('grep_verb_phrase_17',agent,VALUE,'grep_component_15').
case('grep_verb_phrase_17',action,VALUE,'search').
case('grep_verb_phrase_17',location,VALUE,'grep_noun_phrase_30').
case('grep_verb_phrase_17',goal,VALUE,'grep_noun_phrase_35').
frame('grep_noun_phrase_35',IS_A,noun_phrase).
case('grep_noun_phrase_35',noun_modifier,VALUE,'string').
case('grep_noun_phrase_35',head,VALUE,'expression').
frame('grep_verb_phrase_23',IS_A,verb_phrase).
case('grep_verb_phrase_23',agent,VALUE,'grep_component_15').
case('grep_verb_phrase_23',action,VALUE,'search').
case('grep_verb_phrase_23',location,VALUE,'grep_noun_phrase_30').
case('grep_verb_phrase_23',goal,VALUE,'grep_noun_phrase_47').
frame('grep_noun_phrase_47',IS_A,noun_phrase).
case('grep_noun_phrase_47',head,VALUE,'string').

Figure 3 - The internal representation of the description 'search a file for a string or regular expression'

6 A classification experiment

A first prototype, implementing the basic functions of the classification system, was constructed in BIM-Prolog on a SUN platform. Using this prototype, an initial experiment to evaluate the effectiveness of the classification mechanism has been conducted. The experiment aimed at measuring, among other things:
• the behavior of the defined procedures for morpholexical, syntactic and semantic analysis of software descriptions;
• the frequency of occurrence and relevance of the defined semantic cases as descriptors of software components;
• the correctness of the generated internal representations of descriptions and the effect of encoding ambiguity.
An arbitrary corpus of 80 names and short descriptions of UNIX commands in man pages was used as a testbed. A small corpus was selected in order to analyze the interpretation of descriptions individually. The WordNet lexicon [15] was used both for morpholexical and semantic analysis of the descriptions.
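Since the frames of Figure 3 are ordinary Prolog facts, the correctness of the generated interpretations (one of the points measured in the experiment) can be inspected with simple queries. The following is a hedged illustration: the predicate goal_of/3 is ours, not part of ROSA, and it assumes the facts of Figure 3 are loaded.

% goal_of(ComponentName, Action, GoalHead): succeeds once per generated
% interpretation, following the frame links of Figure 3. The facet argument
% of case/4 is left unconstrained.
goal_of(Component, Action, GoalHead) :-
    case(ComponentFrame, name, _, Component),
    case(VerbPhrase, agent, _, ComponentFrame),
    case(VerbPhrase, action, _, Action),
    case(VerbPhrase, goal, _, GoalNP),
    case(GoalNP, head, _, GoalHead).

% ?- goal_of(grep, Action, Head).
% Action = search, Head = expression ;  ...  ;  Action = search, Head = string.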

6.1 Analysis of results

The next sections discuss the results obtained in each phase of analysis. The results are only partial, given the small and domain-specific corpus used as a testbed and the limitations of the prototype used for the experiment.

Morpholexical analysis. Figure 4 shows some results of the morpholexical analysis phase. 52% of the descriptions were accepted in this phase. These results were expected, given both the experimental state of the WordNet dictionary and the use of only one lexicon in the experiment. The distribution of the reasons for rejection was the following:

Figure 4 - Morpholexical analysis results and reasons for description rejection (52% accepted; rejections: 37% absent terms, 34% technical terms/proper nouns, 29% composed modifiers)

Figure 5 - Syntactic analysis results and reason of rejection (83% accepted; all rejections due to conjunctions)

• Absent terms: 37% of the rejected descriptions contain common words that should be present in the lexicon, either general words or usual words of the computer domain.
• Technical terms: 34% of the rejected descriptions include technical terms, specific to the UNIX domain.
• Composed modifiers: 29% of the rejected descriptions include composed modifiers (e.g. window-based) not present in the lexicon.

These percentages of rejection can be considerably reduced because:
• New versions of WordNet will certainly be more complete, and a complementary lexicon (like Webster) can be used in the classification process.
• Morpholexical analysis can identify and automatically acquire the unknown words of a description.
• Morpholexical analysis can identify and process composed modifiers that are absent from the lexicon.

Syntactic analysis. Figure 5 shows the results of the syntactic analysis phase. 83% of the descriptions passing morpholexical analysis were accepted in this phase. The only reason for rejection was the presence of conjunctions (and/or) in the descriptions, which were not yet processed by the classification system at the time the experiment was carried out.

Semantic analysis and frame construction. An average of 3.94 different interpretations per description was generated. This was also an expected result, considering that:
• The semantic analysis did not yet support the disambiguation of all semantic cases when this experiment was carried out (see, for instance, the high frequency (7%) of the 'Time' semantic case in Figure 6, which was not disambiguated).

• At least two different interpretations ('Goal' and 'Location') are generated from the direct object of a description if these semantic cases are absent from the prepositional phrases.
• The remaining ambiguities are due to the lexical ambiguity of some words (e.g. words that are both noun and adjective modifiers).

Figure 6 shows the distribution of semantic cases in the main verbal phrase frames created through the semantic analysis process.

Figure 6 - Distribution of semantic cases found in verbal frames

• The 'Action' semantic case appears in 52% of the created frames. Thus, we observe that simple nominal phrases (in the direct object position), sometimes followed by prepositional ones, are very frequent in UNIX descriptions. This type of phrase is often used to describe generic tools, while purely imperative sentences generally describe commands that perform specific actions.
• The 'Goal' and 'Location' semantic cases appear in 51% and 43% of the created frames, with 5% and 15% of these cases, respectively, appearing in prepositional phrases (PP). The 'Goal' semantic case was identified erroneously in one description because of lexical ambiguity. All other interpretations were correct according to the defined meaning.
• The 'Destination' semantic case appears in 7% of the created frames. The semantic case was identified erroneously in one description because of lexical ambiguity. All other interpretations were correct according to the defined meaning.
• The 'Purpose' semantic case appears in 27% of the created frames. About one half appears in descriptions where the main 'Action' semantic case is absent (typically of the form 'a program to do something'). In these cases, all the functionality is in fact expressed in the prepositional phrase associated with 'Purpose', and the direct object gives only information about the implementation or environment support. New support for this particular feature must be considered.
• All occurrences of the 'Time' semantic case were erroneous, because of the absence of semantic rules to disambiguate it from the 'Location' semantic case, with which it has common case generators.
• No occurrence of the 'Duration' semantic case was found, because it was correctly disambiguated from the 'Goal' semantic case, with which it has common case generators.
• No case generator was found for the semantic cases 'Comparison', 'Condition', 'Instrument', 'Manner' and 'Source'.

7 Some retrieval results

A description of the retrieval mechanism of ROSA is out of the scope of this paper. The main features of this mechanism, along with the similarity measures for component retrieval, are described in [9]. A detailed analysis of retrieval effectiveness, from some experiments with the system, is given in [10]. However, considering that the effectiveness of a classification mechanism is, in part, revealed through retrieval effectiveness, we introduce below some results of an experiment evaluating retrieval effectiveness.

Table 3 shows ROSA versus WAIS (Wide Area Information Server) retrieval effectiveness results, computed with a test collection composed of 20 queries and a knowledge base containing 418 general purpose UNIX commands. The knowledge base was populated automatically through the classification mechanism described in this paper. Effectiveness was evaluated with the measures of recall and precision for ranked systems [19]. An important improvement of ROSA over WAIS in both precision (+97.78%, from 0.45 to 0.89) and recall (+35.62%, from 0.73 to 0.99) was observed.

System                       Recall Avg.   Precision Avg.
ROSA                         0.99          0.89
WAIS                         0.73          0.45
Improvement ROSA over WAIS   +35.62%       +97.78%

Table 3 - WAIS and ROSA comparative results

8 Final remarks

A classification system for software artifacts has been proposed. Through automatic indexing, the system aims at reducing the cost of constructing software libraries. By building software descriptors with lexical, syntactic and semantic information extracted from software descriptions, the system intends to provide good retrieval effectiveness.

Initial experimental results are encouraging. The classification approach is feasible and good retrieval effectiveness has been obtained. More experimentation is needed with larger corpora from diverse software collections (e.g. the Smalltalk library, the ASSET collection, etc).

The NLP approach presents clear advantages, from the point of view of cost, over both manual indexing approaches and knowledge-based systems where knowledge bases are constructed manually. On the other hand, the retrieval effectiveness of our approach (and, in general, of all approaches for automatic indexing) depends not only on the effectiveness of the classification mechanism and the retrieval similarity measures; it also depends on the quality of the software descriptions that are available. Bad descriptions will necessarily produce poor retrieval results. Descriptions that characterize components with both specificity and exhaustivity are needed to ensure reasonable retrieval effectiveness.

Finally, the effectiveness of our approach depends on the quality of the available dictionaries and thesauri. The WordNet lexicon has been very useful for our experiments. However, many word senses were not found, as well as many semantic relationships between terms. Considering the large number of WordNet users, we can expect that future releases will be more complete.

References

[1] P. Devanbu, R. Brachman, P. Selfridge and B. Ballard, "LaSSIE: A Knowledge-based Software Information System," CACM, vol. 34, no. 5, May 1991, pp. 34-49.
[2] D. A. Evans, K. Ginther-Webster, M. Hart, R. G. Leferts and I. A. Monarch, "Automatic Indexing Using Selective NLP and First-Order Thesauri," Proceedings of the Conference on Intelligent Text and Image Handling, Barcelona, April 1991, pp. 624-643.
[3] J. L. Fagan, "The effectiveness of a nonsyntactic approach to automatic phrase indexing for document retrieval," Journal of the American Society for Information Science, vol. 40, no. 2, Sept. 1989, pp. 115-139.
[4] C. Fillmore, "The case for case," Universals in Linguistic Theory, E. Bach and R. Harms (Eds.), Rinehart and Winston, New York, 1968, pp. 1-88.
[5] W. B. Frakes and T. P. Pole, "Proteus: A Software Reuse Library System that Supports Multiple Representation Methods," SIGIR Forum, vol. 24, no. 3, 1990, pp. 43-55.
[6] M. R. Girardi and B. Ibrahim, "New Approaches for Reuse Systems," Position Paper Collection of the Second International Workshop on Software Reuse, E. Guerrieri (Ed.), Lucca, March 24-26, 1993.
[7] M. R. Girardi and B. Ibrahim, "A software reuse system based on natural language specifications," Proceedings of the 5th International Conference on Computing and Information (ICCI'93), O. Abou-Rabia, C. Chang and W. Koczkodaj (Eds.), IEEE Computer Society, Sudbury, May 27-29, 1993, pp. 507-511.
[8] M. R. Girardi and B. Ibrahim, "An approach to improve the effectiveness of software retrieval," Proceedings of the 3rd Irvine Software Symposium (ISS'93), D. J. Richardson and R. N. Taylor (Eds.), IRUS, University of California, Irvine, April 30, 1993, pp. 89-100.
[9] M. R. Girardi and B. Ibrahim, "A Similarity Measure for Retrieving Software Artifacts," Proceedings of the Sixth International Conference on Software Engineering and Knowledge Engineering (SEKE'94), Jurmala, Latvia, June 21-23, 1994, pp. 478-485.
[10] M. R. Girardi and B. Ibrahim, "Using English to Retrieve Software," The Journal of Systems and Software, Special Issue on Software Reusability (in revision process).
[11] P. S. Jacobs and L. F. Rau, "SCISOR: Extracting Information from On-line News," CACM, vol. 33, no. 11, Nov. 1990, pp. 88-97.
[12] R. Krovetz and W. B. Croft, "Word Sense Disambiguation Using Machine-Readable Dictionaries," Proceedings of the 12th International SIGIR Conference on Research and Development in Information Retrieval, N. J. Belkin and W. B. Croft (Eds.), Boston, 1989, pp. 127-136.
[13] Y. Maarek, D. Berry and G. Kaiser, "An Information Retrieval Approach For Automatically Constructing Software Libraries," IEEE Transactions on Software Engineering, vol. 17, no. 8, Aug. 1991, pp. 800-813.
[14] M. L. Mauldin, Conceptual Information Retrieval - A Case Study in Adaptive Partial Parsing, Kluwer, Boston, 1991.

[15] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross and K. Miller, "Five Papers on WordNet," CSL Report 43, Cognitive Science Laboratory, Princeton University, July 1990.
[16] J. Nie, F. Paradis and J. Vaucher, "Using Information Retrieval for Software Reuse," Proceedings of the 5th International Conference on Computing and Information (ICCI'93), O. Abou-Rabia, C. Chang and W. Koczkodaj (Eds.), IEEE Computer Society, Sudbury, Ontario, Canada, May 27-29, 1993, pp. 448-45.
[17] R. Prieto-Diaz, "Implementing Faceted Classification for Software Reuse," CACM, vol. 34, no. 5, May 1991, pp. 88-97.
[18] G. Salton and M. Smith, "On the Application of Syntactic Methodologies in Automatic Text Analysis," Proceedings of the Twelfth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'89), Cambridge, Massachusetts, June 25-28, 1989, pp. 137-150, Special Issue of SIGIR Forum.
[19] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, New York, 1983.
[20] G. Sindre and E. Karlsson, "A Method for Software Reuse through Large Component Libraries," Proceedings of the 5th International Conference on Computing and Information (ICCI'93), O. Abou-Rabia, C. Chang and W. Koczkodaj (Eds.), IEEE Computer Society, Sudbury, Ontario, Canada, May 27-29, 1993, pp. 464-468.
[21] A. Smeaton, "SIMPR: Using Natural Language Processing Techniques for Information Retrieval," Proceedings of RIAO'91, Huddersfield, April 1990, pp. 152-160.
[22] W. Tichy, R. Adams and L. Holter, "NLH/E: A natural language help system," Proceedings of the 11th ICSE, Pittsburgh, May 1989, pp. 364-374.
[24] M. Wood and I. Sommerville, "An Information Retrieval System for Software Components," SIGIR Forum, vol. 22, no. 3-4, 1988.