to select the best one from a list that can contain many irrel-. evant components ..... from using 'man -k' to search for commands with keywords. extracted from Q1.
A Similarity Measure for Retrieving Software Artifacts M. R. Girardi and B. Ibrahim University of Geneva, Centre Universitaire d’Informatique CH-1211 Geneve 4, Switzerland E-mail: {girardi, bertrand}@cui.unige.ch Abstract This paper introduces the main features and the retrieval mechanism of ROSA, a software reuse system based on the processing of the natural language descriptions of software artifacts. The system supports the automatic indexing of components by acquiring lexical, syntactic and semantic knowledge from software descriptions. The retrieval mechanism is based on a similarity analysis that provides good retrieval effectiveness through partial matching of descriptions, processing of synonyms, generalizations and specializations of terms and considering the syntactic and semantic information available in the descriptors of software artifacts.
1 I n t rodu ct ion Reuse systems that index software components manually are difficult and expensive to set up. Automatic indexing is required to turn software retrieval systems cost-effective. On the other hand, the effectiveness of traditional keywordbased retrieval systems is limited by the so-called "keyword barrier" [7], i.e. these systems are unable to improve retrieval effectiveness beyond a certain performance threshold. This situation is particularly critical in software retrieval where users require high precision, i.e. they expect to retrieve only the best components for reuse, avoiding to have to select the best one from a list that can contain many irrelevant components. Natural language processing techniques, for the acquisition of lexical, syntactic and semantic information from software descriptions, are potentially useful to improve retrieval effectiveness and to reduce the cost of creating and maintaining software libraries. This paper describes the retrieval mechanism of ROSA (Reuse Of Software Artifacts), a software reuse system (earlier described in [2][3][4]) based on the processing of the natural language descriptions of software artifacts. The main features of its classification mechanism [5] are also outlined. The paper is organized as follows. Section 2 summarizes the main mechanisms in the current version of the reuse system. Section 3 outlines the semantic formalism used to identify, in a software description the knowledge needed to catalogue a component in a software base. Section 4 introduces the defined mechanisms for the analysis of descriptions (morpholexical, syntactic and semantic analysis) and
the semantic structure of the Software Base. Section 5 presents the mechanism for query processing and retrieval with the measures used for the similarity analysis of the indexing structures. Section 6 describes an experiment conducted to evaluate the effectiveness of the proposed approach. Section 7 summarizes related work in the area of reuse systems. Section 8 concludes the paper with some remarks on planned experiments with the system and further research.
2 O v erv i ew o f th e reu s e s y s tem Figure 1 shows an overview of the current version of the reuse system. The system consists of a classification mechanism and a retrieval mechanism. The classification system catalogues the software components in a software base through their descriptions in natural language. An acquisition mechanism automatically extracts from software descriptions the knowledge needed to catalogue them in the software base. The system extracts lexical, syntactic and semantic information and this knowledge is used to create a frame-like internal representation for the software component. The interpretation mechanism used for the analysis of a description does not pretend to understand the meaning of a description but to automatically acquire enough information to construct useful indexing units for software components. Semantic analysis of descriptions follows the rules of a semantic formalism. The formalism consists of a case system, constraints and heuristics to perform the translation of the description into an internal representation. Both syntactic and semantic rules are implemented in a grammar to parse descriptions into a set of frames. The semantic formalism is based on some semantic relationships between noun phrases and the verb in a sentence. These semantic relationships provide that similar software descriptions have similar internal representations. A classification scheme for software components derives from the semantic formalism, through a set of generic frames. The internal representation of a description constitutes the indexing unit for the software component, constructed as an instance of these generic frames. The WordNet [8] lexicon is used to obtain morphological information, grammatical categories of terms and lexical relationships between terms.
The Knowledge base is a base of frames where each software component has a set of associated frames containing the internal representation of its description along with other information associated to the component (source code, executable examples, reuse attributes, etc). Software descriptions and components
Query
Query classification
Lexicon
Internal representation of the query
Soft. classification
Knowledge base
Retrieval
Candidate(s)
Figure 1 - Overview of the reuse system
A similar analysis mechanism to the one applied to software descriptions is used to map a query in free text to an internal representation. The retrieval system uses the set of frames generated for the query to identify similar ones in the Knowledge base. The retrieval system looks for and selects components from the Knowledge base, based on the closeness between the frames associated to the query and software descriptions. Closeness measures are derived from the semantic formalism and from a conceptual distance measure between the terms in compared frames. Software components are scored according to their closeness value with the user query. The ones with a score higher than a controlled threshold become the candidates to retrieve. As a first step, the system deals with imperative sentences for both queries and software component descriptions.
3 The s em antic for malism This section outlines the semantic formalism used to represent software descriptions [5]. The formalism establishes the rules to generate the internal representation of both queries and natural language descriptions of software components. The formalism consists of a case system for simple imperative sentences with some constraints and heuristics that are used to map a description into a frame-like internal representation. The case system basically consists of a sequence of one or more semantic cases. Semantic cases are associated to some syntactic compounds of an imperative sentence. An imperative sentence consists of a verb (representing an action) possibly followed by a noun phrase (representing the direct
object of the action) and perhaps some embedded prepositional phrases. For instance, the sentence ‘search a file for a string’ consists of the verb ‘search’, in the infinitive form, followed by the noun phrase ‘a file’, which represents the object manipulated by the action, and followed by the prepositional phrase ‘for a string’, which represents the goal of the ‘search’ action. In the example, the semantic cases ‘Action’, ‘Location’ and ‘Goal’ are respectively associated to the verb, direct object and prepositional phrase of the sentence. Semantic cases show how noun phrases are semantically related to the verb in a sentence. For instance, in the sentence ‘search a file for a string’, the semantic case ‘Goal’ associated to the noun phrase ‘for a string’ shows the target of the action ‘search’. We have defined a basic set of semantic cases for software descriptions by analysing the short descriptions of Unix commands in manual pages. These semantic cases describe basically the functionality of the component (the action, the target of the action, the medium or location, the mode by which the action is performed, etc.). A semantic case consists of a case generator (possibly omitted) followed by a nominal or verbal phrase. A case generator reveals the presence of a particular semantic case in a sentence. Case generators are mainly prepositions. For instance, in the sentence ‘search a file for a string’, the preposition ‘for’ in the prepositional phrase ‘for a string’ suggests the ‘Goal’ semantic case.
4 T h e a n a l y s i s o f th e d es cri p ti o n s Morpholexical, syntactic and semantic analysis of software descriptions is performed to map a description to a framelike internal representation. The purpose of morpholexical analysis is to process the individual words in a sentence to recognize their standard forms, their grammatical categories and their semantic relationships with other words in a lexicon. Morpholexical analysis also performs the processing of collocations and idioms. Two semantic relations between terms are currently considered: synonymy and hyponymy/ hypernymy. The predicate synonym(x,y) means that the term ‘y’ is a synonym of the term ‘x’ in a particular lexical category. The predicate hyponym(x,y,d) means that the term ‘y’ is an hyponym (a specialization) of the term ‘x’ at a d-distance in a thesaurus in a particular lexical category. The predicate hypernym(x,y,d) means that the term ‘y’ is an hypernym (a generalization) of the term ‘x’ at a d-distance in a thesaurus in a particular lexical category. Just after morpholexical analysis, both syntactic and semantic analysis of software descriptions are performed interactively by using a definite clause grammar. The defined grammar implements a subset of the grammar rules for im-
perative sentences in English [12] and is considered broad enough for our initial experimental purposes. The grammar supports the case system and states domain-independent knowledge of the English language through a set of syntactic and semantic rules. The classification mechanism uses the grammar to parse software descriptions. A set of semantic structures is generated as a result of the parsing process, representing the internal structures of software descriptions. A language for modelling these semantic structures is shown in Figure 2. Case_frame--> FRAME Frame_name Hierarchical_link CASES Case_list. Hierarchical_link--> IS_A Frame_name | IS_A_KIND_OF Frame_name Case_list --> Case (Case_list) Case --> Case_name Facet Case_name --> Semantic_case | Other_case Semantic_case --> Action | Agent | Comparison | Condition | Destination| Duration | Goal | Instrument | Location | Manner| Purpose| Source | Time Other_case --> Modifier | Head | Adjective_modifier | Participle_modifier | Noun_modifier Facet --> VALUE Value | DOMAIN Frame_name | CATEGORY Lexical_category Value --> string | Frame_name Lexical_category--> verb | adj | noun | adv |component_id | string
Figure 2 - A language to model the semantic formalism
The language defines a frame-like classification scheme for software components based on the defined semantic cases. The classification scheme consists of a hierarchical structure of generic frames (‘IS-A-KIND-OF’ relationship). Frames that are instances of these generic frames (‘IS-A’ relationship) implement the indexing units of software descriptions. Major generic frames for the Knowledge Base are shown in Figure 3. FRAME verb_phrase IS_A_KIND_OF root_frame CASES Action CATEGORY verb Agent DOMAIN component Comparison DOMAIN noun_phrase Condition DOMAIN noun_phrase Destination DOMAIN noun_phrase Duration DOMAIN noun_phrase Goal DOMAIN noun_phrase Instrument DOMAIN noun_phrase Location DOMAIN noun_phrase Manner DOMAIN noun_phrase Purpose DOMAIN verb_phrase Source DOMAIN noun_phrase Time DOMAIN noun_phrase. FRAME noun_phrase IS_A_KIND_OF root_frame CASES Adjective_modifier CATEGORY Participle_modifier CATEGORY Noun_modifier CATEGORY Head CATEGORY
adj verb noun noun.
FRAME component IS_A_KIND_OF root_frame CASES Name CATEGORY component_id Description CATEGORY string . . {Other information associated to . the component, e.g. source code, executable examples, reuse attributes, etc}
Figure 3 - Some generic frames of the Knowledge Base
The generic frames model semantic structures associated to verb phrases, noun phrases and the information associated to software components, like name, description, source
code, executable examples, etc. Semantic cases are represented as slots in the frames. ‘Facets’ are associated to each slot in a frame, describing either the value of the case or the name of the frame where the value is instantiated (‘value’ facet); the type of the frame that describes its internal structure (‘domain’ facet) or the lexical category of the case (‘category’ facet). For instance, the ‘Location’ slot in the verb phrase frame has a ‘domain’ facet indicating that its constituents are described in a frame of type ‘noun phrase’. Through the parsing process, the interpretation mechanism maps the verb, the direct object and each prepositional phrase in a sentence into a semantic case, based on both syntactic features and identified case generators. Figure 4 shows the indexing structure for the ‘grep’ family of Unix commands built from the description ‘search a file for a string’. An instance of the verb_phrase frame is generated by instantiating the slots corresponding to the semantic cases identified in the description (’Action’, ‘Location’ and ‘Goal’). These cases have an associated ‘value’ facet indicating either the value of the slot (as ‘search’ for the ‘Action’ case) or the name of the instance frame with its value (grep_component, grep_noun_phrase_1 and grep_noun_phrase_2 for the semantic cases ‘Agent’, ‘Location’ and ‘Goal’ respectively). FRAME verb_phrase_1 IS_A verb_phrase CASES Agent VALUE grep_component Action VALUE ‘search’ Location VALUE grep_noun_phrase_1 Goal VALUE grep_noun_phrase_2. FRAME grep_noun_phrase_1 IS_A noun_phrase CASES Head VALUE ‘file’. FRAME grep_noun_phrase_2 IS_A noun_phrase CASES Head VALUE ‘string’. FRAME grep_component IS_A component CASES Name VALUE ‘grep’ Description VALUE ‘search a file for a string’.
Figure 4 - An indexing structure for the “grep” command: “search a file for a string”
5 S i mi l a ri ty a n a l y s i s A similar analysis mechanism to the one applied to software descriptions is used to map the user’s query (in the form of a verbal or nominal phrase) to a frame-like internal representation (Figure 1). The internal representation is then compared with the ones associated to software descriptions in the Knowledge base. A similarity analysis is performed and matching components are scored according to their closeness with the user query. Two mechanisms of retrieval are proposed: a major retrieval mechanism, based on the similarities of the semantic structures, and a complementary mechanism, based on the matching of the noun phrases in the semantic structures. The main retrieval mechanism provides good level of pre-
cision by reducing the number of irrelevant components that are retrieved. The retrieval is based on the detection of similar semantic structures, i.e. structures that share the same semantic cases and where there is some lexical relationships between the terms in the shared semantic cases. Precision is controlled by basing the similarity analysis on the syntactic and semantic information available on the internal representation of both query and software descriptions; by scoring retrieved candidates and establishing a threshold for the required similarity. The threshold controls the number of retrieved documents and thus, the level of precision. Recall is increased by allowing partial matching of the semantic structures and considering synonyms, hyponyms and hypernyms of the terms in the semantic cases. The complementary retrieval mechanism is based on the simple matching of noun phrases identified in both query and software descriptions. The approach is independent of the semantic formalism we used in the main retrieval mechanism but makes use both of the instance frames associated to noun phrases and the closeness measures defined for the main retrieval mechanism. This mechanism has been defined to be used in comparative experimental work with the main retrieval mechanism, to evaluate the expected improvements in precision results, obtained through the processing of semantic information, in relation to the processing of mere syntactic information, through the matching of noun phrases.
5 .1
sc_closeness (cqj, cdj) = np_closeness (cqj, cdj) closeness (cqj, cdj) SCase (cqj, cdj)
Similarity analysis is performed by comparing the internal representation of the query with the ones of software components descriptions in the Knowledge Base and computing a similarity measure (Figure 1). The function SCase (Fq, Fd) below measures the similarity between the frame Fq (the internal representation of the user query Q) and the frame Fd (the internal representation of a description D of a software component). The measure is based both on the number of the semantic cases that have a match and on the closeness among the terms in the matched cases. Semantic cases are weighed by a factor (wj) that represents the relative importance of a semantic case in describing the functionality of the component. Closeness between semantic cases is a function of the conceptual distance between the terms in the semantic cases, according to its ‘domain’ or ‘category’ facet (single terms, noun phrases or verb phrases). S Case ( F q, F d ) =
∑
∀j ∈ SC q
∑
w j • sc_closeness ( c qj, c dj )
∀j ∈ SC q
wj = 1
if if if
DOMAIN (j) = noun_phrase j = Action DOMAIN (j) = verb_phrase
SCq = {j| j is a semantic case in the frame of Q} cqj = the term of the semantic case ‘j’ in the frame of Q cdj = the term of the semantic case ‘j’ in the frame of D wj = the weight of the semantic case ‘j’
5.1.1
Cl o s en es s b etween s i n g l e ter m s
Definition 2 Conceptual distance between single terms
The function dist(x,y) measures the conceptual distance between the single terms or collocations ‘x’ and ‘y’, by considering the distance of the terms in a lexicon, according to the lexical relations of synonymy, hyponymy and hypernymy described in section 4 (see example in Table 1). 0 if x = y d if hyponym (x,y,d) ∞ otherwise
dist(x,y) =
or or
synonym(x,y) hypernym (x,y,d)
Definition 3 Closeness between single terms
The function closeness(x,y) measures the closeness between the single terms or collocations ‘x’ and ‘y’. The function is inversely proportional to the conceptual distance between the terms in Definition 2 (see example in Table 1). closeness (x,y) = vdist(x,y) dist
closeness v = 0.5
‘PC’
0
1
‘desktop computer’ ‘laptop’
1
0.5
2
0.25
‘digital computer’ ‘computer’
1
0.5
2
0.25
S im ilarity c omputation
Definition 1 Similarity between software descriptions
with,
where,
y
with
0