Case-Based Retrieval of Software Components CARMEN FERNÁNDEZ-CHAMIZO, PEDRO A. GONZÁLEZ-CALERO, LUIS HERNÁNDEZ-YÁÑEZ AND ALVARO URECH-BAQUÉ Dep. Informática y Automática – Escuela Superior de Informática Universidad Complutense de Madrid 28040 Madrid, Spain e-mail:
[email protected]
Abstract–A major problem concerning the reusability of software is the retrieval of software components. Different approaches have been followed to solve this problem. In this paper we present the Reuse Assistant, a hybrid approach to support the retrieval of software components from a library of object classes. The Reuse Assistant consists of two subsystems that follow two different approaches: information retrieval techniques based on statistical methods, and knowledge-based techniques using some of the representation and indexing mechanisms found in case-based systems. The Information Retrieval approach grants system extendibility, and permits the use of a natural language interface. The Case-Based approach enables reasoning about concepts, allowing the retrieval of “approximate” components. Both subsystems can be operated from a common interface, where free-text and form-filling queries can be posed.
1.
Introduction
Software reuse is widely believed to be one of the most promising technologies for improving software quality and productivity (Biggerstaff & Richter 1987). Object-oriented languages constitute an important step in the way to reusability (Meyer 1987). However, the reusability of objects, although promising, must face the problem of classification, storage and retrieval of reusable components (classes and methods). By considering the natural language documentation of software components, we could apply some of the general purpose Information Retrieval (IR) techniques (Salton & McGill 1983) to the retrieval of software components. IR techniques have proved to be useful in certain domains. However, the software domain has been found quite different from some of the usual application domains of IR techniques, not only due to the nature of documents but also due to the special requirements of the user. Some of the special characteristics of the software domain, that make IR techniques ill-suited for retrieving software components, are described below. This work is supported by the Spanish Committee of Science & Technology (CICYT TIC92-0058)
Much research on information retrieval assumes that the required information can be fully specified a priori, to then concentrate on retrieval efficiency. Studies in software design have shown, however, that the integration of problem setting and problem solving is necessary to articulate the adequate queries. Therefore, tools are needed for incrementally constructing queries and exploring an information space that can support the evolution of an information need. These tools should aid in comprehension, as well as location, by helping users comprehend the relevance of retrieved information (Henninger 1991). In order to evaluate the relevance of the retrieved components it is necessary a careful review of their code and documentation. If the number of retrieved components is too high, it can discourage users and promote programming from scratch instead of reusing software. Therefore, to achieve a high precision in retrieving is crucial in the software domain. Object-oriented programming, while improving the whole software development process, adds its own drawbacks to the software retrieval problem (Wilde & Huitt 1992). The use of polymorphism and inheritance introduces a large number of dependencies between components. Dynamic binding increases the number of implementations to be examined, and the dispersion of functionality into differ-
ent components makes difficult the global understanding. All these factors contribute to reduce the effectiveness of the retrieval. This paper is concerned with the retrieval of software components. In particular, our work has concentrated on the problem of retrieval of Object-Oriented Programming (OOP) classes. Section 2 summarizes related work in this area, classifying the existing systems in two basic groups: automatic indexing systems and knowledge-based systems. Section 3 analyzes the advantages and disadvantages of both approaches and justifies the need of a hybrid approach. The Reuse Assistant, a system based in this hybrid approach, is also introduced in this section by describing its two subsystems along with the functionality of the user interface. In Section 4 a usage scenario is described. Section 5 is concerned with the evaluation of the retrieval effectiveness in the Reuse Assistant. Section 6 concludes the paper with some remarks on current and further research.
2.
The Retrieval of Software Components: Related Work
Traditional approaches to software retrieval fall into two complementary categories: low-level cross-reference tools, which facilitates browsing at the code level, and high-level classification techniques, that emphasize retrieval by software category. Most of the low-level tools generate automatically a database of relationships between the different program entities, and respond to user queries according to the database information. Examples of such tools are CScope (Steffen 85) and CIA (Chen & Ramamoorthy 86), which work on C programs. More recent tools as Trellis (O'Brien, Halbert & Killian 87) and CIA++ (Grass & Chen 90) work on object-oriented programs, managing the complex relations established between classes. The high-level classification of software components can follow basically two approaches. We can extract the classification information from the component itself (code, documentation) using automatic indexing methods. Or, on the other hand, we can implement a knowledge-based classification scheme using information about the components that lies outside of them. Both high-level approaches are described in the following sections.
2.1.
Automatic indexing approach
Due to the increasing size of the natural language descriptions of software components in recent libraries, information retrieval (IR) techniques based on statistical methods (Salton & McGill 1983) are becoming more usual in com-
ponent retrieval. This approach extracts information from the natural-language documentation of the software components. It does not use any semantic knowledge, nor intend to understand the documentation. The goal of this approach is to characterize each component by a set of indices that are automatically extracted from its natural language documentation. The system proposed in (Frakes & Nejmeh 1987) uses an existing IR system, CATALOG, for storing and retrieving C software components. Each component is characterized by a set of single-term indices that are automatically extracted from the natural-language headers of C programs. The GURU system (Maarek et al. 1991) classifies the software components according to attributes automatically extracted from their natural language documentation using an index scheme based on pairs of related words. This kind of indices along with some clustering methods used by the GURU system have proved to be successful when applied to the documentation of the Unix command set. GURU has also been applied to an object-oriented class library (Helm & Maarek 1991) by complementing it with domain specific approaches based on code analysis. Some recent proposals (Girardi & Ibrahim 1993) include a partial natural language analysis of the component descriptions in order to improve the retrieval effectiveness. As the information provided by IR tools is derived automatically, this approach presents advantages in cost, transportability and scalability. Statistical methods, however, cannot be a substitute for meaning.
2.2.
Knowledge-Based approach
There is a growing interest in the potential contributions of Artificial Intelligence to Software Engineering. Development of knowledge-based tools for software reusability is one of the most promising research topics in this area. The key feature of this approach is that it draws semantic information about software components from a human expert. Knowledge-based systems are often very sophisticated. Unfortunately, as a tradeoff, they require domain analysis and a great deal of pre-encoded, manually provided semantic information. Prieto-Díaz (Prieto-Díaz & Freeman 1987) created a classification scheme based on the library science. He proposes a set of six facets: three related to the functionality of the component and three related to its environment. The different values a facet can have are called terms. These terms are organized in a conceptual graph that represents manually encoded knowledge about the domain. The system proposed in (Wood & Sommerville 1988) uses Conceptual Dependency (Schank 1972) to represent knowledge about software components. This knowledge is
encoded in the component descriptor frames that represent the function performed by the component and the objects manipulated by the function. Embley and Woodfield (Embley & Woodfield 1987) define a knowledge structure for a software library consisting of abstract data types (ADTs). This knowledge structure supports different relations among ADTs, and it includes natural language descriptions and keywords to assist in finding and browsing activities. The LaSSIE system (Devanbu et al. 1991) embodies a frame-based knowledge representation. Software components are described in terms of the operations they perform. Each of these actions is described by providing its actor, object, recipient, agent, environment, and so on. These relationships are coded in a specialized knowledge representation system which classifies them into a conceptual hierarchy. All these knowledge-based systems have several things in common. They all represent the knowledge in frame-like structures with similar sets of slot-fillers. They all organize the knowledge around the functions performed by the components, although some of them also include frame representations of the objects involved. In every single case, the characterization of the components with slot-fillers is done manually, following a pre-established model of the domain. Another approach is undertaken in CODEFINDER (Henninger 1991). This system faces a slightly different goal. It explores the problem of finding program examples which are relevant to a design task. CODEFINDER uses an associative spreading activation method for locating software objects. Curtis (Curtis 1989) has analyzed different software indexing methods used by knowledge-based systems. He postulates that the effective use of a reusable library will require an indexing scheme similar to the cognitive structures held by most experts in the application area. The identification and representation of those cognitive structures still remain an open issue. Although it is generally assumed that semantic retrieval may lead to a more effective software retrieval, present knowledge-based systems are being questioned due to the high cost of building knowledge bases.
3.
The Reuse Assistant: a Hybrid Approach
Currently, in terms of retrieval efficiency it is not easy to decide which is the best approach, automatic-indexing or knowledge-based, because there are not comparable empirical results about the performance of systems based on those approaches. The differences come from the effort necessary during the knowledge acquisition process, the availability of
additional knowledge (not included explicitly in the components), and the type of interface imposed by the selected representation. Both approaches have advantages and drawbacks. Knowledge-based systems make use of a deep knowledge about the component design and implementation, and interrelate the components through a rich set of relationships expressing system architecture, design decisions and all the information needed to use and, more important, to reuse (adapt) the components. Usually, this information is presented to ease query construction by reformulation, in a form-filling interface or even in a restricted natural language interface (as in LaSSIE), and serves as the basis for some kind of browsing system where dependencies among components can be explored. The major drawback of this approach comes from the need of a manual, high-cost knowledge acquisition process, which handicaps the scalability and extendibility of these systems. On the other hand, automatic indexing systems make use of the readily available information, performing some kind of analysis on the text (and, sometimes, also on the code) associated with the components. Therefore, extendibility is granted since this process may be automatically applied to any new component added to the software library. The most sophisticated of these systems use advanced techniques, such as clustering, and text or code understanding, to impose a more conceptual structure on the representation, trying to resemble that of a hand-coded knowledge base. But the actual state of development of these technologies for information extraction is still far from being capable of extracting the knowledge that can be easily elicited from an expert. Usually, these systems use a natural language interface, where queries are processed in the same way that the text associated with the components. In this paper we present the Reuse Assistant, a hybrid approach to support the retrieval of software components from a library of objects classes (Fernández-Chamizo et al. 1993). The Reuse Assistant consists of two subsystems: an information retrieval (IR) module based on statistical methods, and a knowledge base, indexed with the techniques used in Case-based (CB) systems (Riesbeck & Schank 1989). We intend to maintain the advantages of both approaches while minimizing the drawbacks, and, as stated in (Callan & Croft 1993), we believe that a combination of both approaches is superior to either one. In our system, every component is indexed using both techniques. In this way, the system offers a richer interface, where it is possible to pose queries in natural language (treated by the IR module), to incrementally fill a form specifying the user needs (access to the knowledge base), and to use any retrieved component as entry point to browse through the knowledge base.
Our work has concentrated on general purpose software libraries for object oriented languages. The commercially available systems (Smalltalk/80, Smalltalk/V, C++ and Eiffel environments, for example) include some basic libraries with general purpose components for graphical interfacing, data structure manipulation, communication with the host system and some utilities considered of general interest. When building applications for a particular domain, these components are a crucial part of any implementation. We believe that the components in the basic libraries cannot be considered at the same level, in terms of reusability, than the modules implemented by the users. And, due to their higher reuse potential, we believe that a complete handcoded representation of those components is well justified, and it will report enough benefit to be a profitable effort. User defined components are indexed through the IR module by a statistical analysis of the comments in the code, so that they are partially included in the retrieval system, which is, to some extent, scalable and extendable. It would be desirable that the users could add the representations of their own components to the knowledge base, but this would cause a number of problems concerning consistency and knowledge quality. The control of the component descriptions introduced in the system by the users as they create new components is still an open issue in our system. Next subsections describe with more detail the IR and CB modules, and the user interface.
3.1.
The IR subsystem
We have developed a subsystem for class retrieval based on IR techniques (Salton &McGill 1983). This tool is a modified version of a previously developed prototype (Fernández et al. 1993), and it uses an indexing scheme based on pairs of related words according to the notions of lexical affinity (LA) and quantity of information proposed in (Maarek et al. 1991). A LA between two units of language stands for a correlation of their common appearance. When two words appear frequently in a single sentence (separated by at most five other words) a potential LA (or lexical relation between the two words, e.g. “system administrator”, “remote computer”) is identified. In order to reduce the influence of words appearing too often in a given context, only those LA's carrying a big quantity of information (high frequency in a document, low frequency in the rest of documents) are selected as indices. The IR subsystem takes the class library documentation in natural language as input and produces the component profiles. A component profile is a set of LA-based indices characterizing the component. These indices are ordered in decreasing order by the quantity of information associated
to them. Profiles are obtained by performing a statistical analysis of the distribution of words in the document describing the component and in the whole set of documents (the library documentation). In this process, names of methods and objects are given a higher weight because in an object oriented programming environment they have been carefully selected to transmit as much information as possible. The analysis of the class library documentation is performed only once, and the results of this analysis, the component profiles, are stored in auxiliary files to be used during the retrieval stage. When retrieving a component, the user expresses a query in free style natural language. The system obtains the query profile, applying the same method used to obtain the component profiles. The query profile is then compared with the component profiles and a ranked list of potentially relevant components is obtained. Relevance is assessed as a function of common LA's between the query and the component, and taking into account the quantity of information associated with the required LA's in the different component profiles. The ranked list is shown to the user as described in Section 3.3.
3.2.
The CB subsystem
Human programmers perform many programming tasks by “reusing” previously acquired mental schemes, rather than by reasoning from first principles (Rich and Waters 1988) (Steier 1991). Therefore, Case-Based techniques constitute a natural approach to represent the experience acquired in the design of past applications. This assert is even more relevant when dealing with object-oriented design where, very often, the new programs are built from components available in the class library. Descriptions of these components will constitute the case base in our system. First, it is necessary to specify a mechanism to represent cases in the knowledge base. We use a classification-based knowledge representation system (Fernández-Valmayor & Fernández-Chamizo, 1992) which is based on a subset of the KL-ONE system (Brachman & Schmolze, 1985). This system has the ability of automatically classifying a structured concept with respect to a taxonomy of other concepts. That is, on the basis of its structure and the structure of concepts already in the taxonomy, a new concept can be automatically inserted at the correct position. The same classification algorithm is used for retrieval. The classifier can determine the position in the concept hierarchy where a given description should be located. The knowledge base is implemented as a semantic network, where every node is a frame-like structure representing a concept. Concepts are
),*85($SDUWLDOYLHZRIWKHFRQFHSWKLHUDUFK\
restricted by a number of slots that relate them to other nodes in the network. In order to build the knowledge base, a domain analysis of the software library has to be made. The representation is organized around the operations performed by the routines in the library, and the objects involved in those operations. Therefore, the domain analysis must identify the operations and the objects needed to characterize the components, along with dependencies among them. This analysis leads to an inheritance hierarchy containing general purpose programming concepts (operations, objects, and relations) at higher levels, concepts specific to the library, and the actual descriptions of the components (i.e., the cases) at a lower level of the hierarchy. The root concepts of the knowledge base are ACTION and OBJECT. Operations are represented as specializations of the ACTION concept, while objects manipulated by those operations are represented as specializations of the OBJECT concept. In Object-Oriented Languages (OOLs) the basic components are classes. A class consists of data structures and routines (in Smalltalk terminology, a routine is called a method, which is the term we will use). For every method and every class in the software library, a description is included in the knowledge base, containing the following information: Methods. A frame representing a method consists of an action name, and three required slots specifying the owner class of that method, the actual implementation of the method, and the identifier of the method. Additionally, the frame may include some slots to connect the method with other methods that play an important role in its functionality. And, finally, there will be some
slots to represent the specific modifiers of the action, for example, a search will have an object searched and a medium where the search is accomplished. Classes. We create for each class a frame containing information about its structural description, its code level relationships with other classes (i.e., inheritance, clientelism), its conceptual relationships with other classes and abstract concepts in the knowledge base, and the links to all its methods. The conceptual relationships have to be hand-coded but the code level relationships are automatically inferred from the source code. The structural description defines a set of values for the class, i.e. a list of other classes needed to construct instances of the current class. For example, the structural description of the Stream class in Smalltalk consists of an IndexedCollection, and two integers representing the position and the readLimit in the stream. Class descriptions are represented as OBJECT concepts, since they are the objets manipulated by the methods. The descriptions of the actual components included in the class library are defined as specializations of more abstract concepts detected in the domain analysis. Figure 1 shows a partial view of a portion of the concept hierarchy. In the figure appears the abstract concept Collection which describes any aggregation of objects. Collections are classified attending to the following criteria (extracted from the domain analysis): ordered, fixed or variable size, allows duplicates, and accessible by an external key. These criteria are further specialized to reflect more specific constraints. The figure also shows two concepts describing actual classes, Indexed_Collection and Dictionary, connected to their corresponding abstractions.
3.3.
The User Interface
The interface reflects the two sources of information available in the system: natural language queries are processed through the IR module, and a form-filling interface along with a concept browser are supported by the knowledge base. First, the user must state his or her needs by describing the class or the method(s) s/he requires, using one of the following query modes: Natural language. To retrieve a component, the user constructs a free-text query which is indexed as if it were the documentation associated with the components. A profile of this query is produced and then it is compared with the profiles obtained from the components in the software library. As a result of this comparison, the system produces an ordered list of candidate classes, ranked by degree of similarity with the profile of the query. Inside each candidate class, methods related with the functionality requested by the user are activated. Studies have shown that people are often very ungrammatical when communicating in a problem solving setting. Therefore, keywords are also allowed as queries. Although keyword matching is generally not an effective retrieval mechanism, its simplicity and ease of use holds an advantage for query construction. Keywords provided by the user are taken as a query profile by the IR subsystem, in the same way that profiles obtained from free-text queries. Form-filling. The user can also construct the query by using a form-filling interaction mode. This interaction is driven by the frames stored in the knowledge base. The user has to select, either a verb describing the action the component performs, or a noun representing an object manipulated by the component. From the user input the system displays a skeleton frame corresponding to the action or the object. For each frame slot the system prompts the user for the corresponding filler. The user can ask for help regarding appropriate responses but it is not necessary for every slot to be filled. Both query modes, natural language and form-filling, allow to access the adequate frames in the knowledge base. In form-based queries the access is conducted by the classifying mechanism of the knowledge base. For free-text queries the access is through the IR indices. Components whose profiles are similar to the query profile are selected as candidate components, being these components previously linked to their corresponding frames in the knowledge-base. Therefore, no matter the query mode, the frames that encode the knowledge about the retrieved components are accessed.
Figure 2 shows the interface of the Reuse Assistant. This figure is the result of a query in natural language. The user has posed the query, then has selected one of the retrieved classes, and finally has selected one of the methods activated by the query in that class. At the upper right corner the description associated with that method is shown. At the bottom, the actual code (and comments) of the method appears. If the answer to an initial query is unsatisfactory, the user can reformulate the query or browse amongst functionally related classes. The concept hierarchy browser in our system lacks a graphical interface, so the interaction must be accomplished in text mode. The user may ask for more abstract or more specific concepts than the one selected, or follow any link from that frame to any of its fillers, since these are also concepts in the knowledge base. These requests will result in the system displaying the corresponding frame(s). Retrieval by reformulation (Devanbu et al. 1991) considers retrieval as an incremental process of retrieval cue construction. With this technique, the user and the computer system cooperate, being the former able to incrementally improve his or her query according to the results of the previous queries. In order to enhance the retrieval process, traditional relevance feedback systems ask the user for assigning relevance ratings to the items retrieved by the initial query. From this values, the system automatically reformulate the query. New retrievals are made without any need of user intervention. On the other hand, retrieval by reformulation sets the user in control of query formation. We believe this is more efficient because it is easier for the user to understand the knowledge representation mechanism when there is some kind of mapping between the natural language query and its corresponding frame. In the Reuse Assistant, after a query, a number of component description frames are retrieved and the users can use them as a cue for refining their query. This descriptions will show concrete information about the retrieved component that may help users refine their information needs and express their query in the system's terms. The code of the components along with the associated comments (shown on screen at the same time than the frames) may help to novice users of the system to understand the more cryptic framebased description. Traditional information retrieval systems assume that users know exactly what they are searching for. At the other end, browsing systems assume that users are exploring the information space, taking the risk of getting lost. Our system can close the gap between exact queries and browsing. In addition to finding software components that match the query, we can explore the conceptual hierarchy to obtain near inexact matches. This provides a means to browse
),*85(7KHXVHULQWHUIDFH
amongst functionally related components instead of browsing components related by the inherited code, as browsers usually supplied with class libraries do.
4.
A Usage Scenario
Let us assume that a user is in the process of implementing a method for “extracting words from a string”. This would be the initial query specified in the natural language query mode. This query is indexed by the same statistical method used to index the classes. The profile of lexical affinities detected in the query allows the access to the classes with
similar profiles. If the requested actions are present in the method profiles, some methods of these classes may also be activated. Figure 2 shows the user interface after processing this query. The closest retrieved class is String. The selection of this class in the list pane displays a list with the String methods activated by the query. The first of these methods is the asArrayOfSubstrings method. As the figure shows, the frame associated with this method is displayed when the method is selected. This method divides a string into substrings at the occurrences of one or more space characters. At first glance, this may look as a good solution for the problem at hand, but the “words” extracted by this method will be any sequence of
characters different from the space character, including punctuation characters. If the “words” that the user meant to extract were sequences of alphanumeric characters then it will be necessary to find a different method. The frame description for this method has two action-specific slots: obj and condition. The obj slot identifies the collection that is traversed, a character sequence, and the condition slot specifies the selection criteria, as not being the space character. The user must recognize that the condition slot is the point of disagreement between his specification and the retrieved method. If the user detects this divergence then s/he can query the system about concepts related to spacechar to find out if there is some other concept that could serve as character selection condition. Doing so the user will find the concept alphanumeric-char which is connected to the method isAlphaNumeric implemented in the Character class. It is also possible to follow the link, through the concept hierarchy browser, from the condition slot to the space-char concept and look in its surroundings. The concept alphanumeric is close to it. With this information the user can build a new query using the action that appears in the retrieved frame together with the alphanumeric-char concept: WUDYHUVH FRQGLWLRQDOO\VHOHFW REMFKDUVHTXHQFH FRQGLWLRQDOSKDQXPHULFFKDU
Using the classification mechanism implemented in the knowledge base the system will retrieve the closest cases, having the same action and as many compatible slots as possible. The closest description is the one associated with the nextWord method implemented in the Stream class. This method answers a string containing the next word in the receiver stream, where a word starts with a letter, followed by a sequence of letters and digits. This method does not constitute a complete solution because the goal is to extract characters from a string and not from a stream, so it is necessary to find a method to transform strings into streams. A query such as “transform a string into a stream” will retrieve some frame descriptions specializing the action Classconversion, where the specific action slots are: origin and destination. With this information it is possible to build the form: FODVVFRQYHUVLRQ RULJLQVWULQJ GHVWLQDWLRQVWUHDP
And the classification facility of the knowledge base, using the subsumption relationships among concepts, will retrieve the following frame: FODVVFRQYHUVLRQ RULJLQLQGH[HGFROOHFWLRQ
GHVWLQDWLRQVWUHDP LPSOHPHQWRU6WUHDP VHOHFWRURQ
This method can be used to transform strings into streams because the filler of the origin slot, indexedcollection, is a more abstract concept than string, the filler of that slot in the query. The example shows some of the benefits achieved by the hybrid approach to software component retrieval. This approach results in a more flexible interface, where IR allows the use of informal specifications expressed in natural language, and the underlying knowledge representation permits a more fine-grained search. Using only IR techniques, it is necessary to pose the “right” query to retrieve the required component. This is not a trivial issue, because the way in which object oriented libraries are organized depends on code inheritance criteria, and it may not be intuitive for the reuser. A useful component may be located in an unexpected place in the class library, or it may be less specific than what is needed. Some functional relationships among components are easily detected by a statistical analysis of their corresponding documentation. However, detecting other relevant relationships would demand an indepth understanding of the library documentation. This is the kind of knowledge that can be easily identified by an expert, and it is very unlikely that an automatic indexing technique could produce similar results.
5.
Evaluation
The IR module and the interface of our system have been implemented in C for a SUN SPARCStation platform. The knowledge base is implemented in LISP for the same platform. Our work has concentrated on building the descriptions of a subset of the Smalltalk class library, along with the general purpose programming concepts needed to articulate this representation. The system needs to be completed with the descriptions of the rest of the class library, and the related programming concepts. In order to evaluate the performance of our system we concentrate on the retrieval effectiveness. Usually, the performance of an information retrieval system is measured in terms of recall and precision. Recall measures the proportion of relevant material that is actually retrieved. Precision measures the proportion of retrieved material that is relevant. In general, relevance is a subjective notion, but in the retrieval of software components, subjectivity is increased due to the acceptability of close matches. Therefore, the evaluation of this kind of systems is particularly difficult.
In other domains, there are several test collections in order to evaluate and compare the retrieval effectiveness of the systems being developed. In the software domain, however, these test collections are practically inexisting. To our knowledge, there is a test collection of 30 queries for the AIX 3 system (Maarek et al. 1991), but no test collection is available for any object-oriented library. Ideally, in order to have a valid test collection, we would need a large number of independent users, selected at random, accessing the system. Collecting the queries actually issued by these users, along with the relevance judgments made by several Smalltalk experts, we would obtain a test collection for our system. For the moment, we are preparing a list of queries selected from real questions posed by our undergraduate Smalltalk students. This test collection can not be considered as ideal but it will be useful to evaluate the IR subsystem until the whole system is completed and it can be tested in different sites. The evaluation of the CB subsystem is even more difficult, because of the inherent interactive nature of the formulate-retrieve-reformulate cycle. In this case, it would be necessary to measure the effort expended by the user to locate the required component with and without the Reuse Assistant. Due to the difficulty of monitoring this kind of situations, we believe that it is better to analyze the retrieval effectiveness in a qualitative manner like in (Wood & Sommerville 1988). In this way, we are analyzing the improvements to precision we can obtain using the CB approach in addition to the IR subsystem. Examples like the one shown in the usage scenario are being carefully analyzed. We have tested the prototype with some groups of Smalltalk students. The first results are optimistic. After some initial difficulties (students tend to overestimate the system potentiality and they ask for classes that directly solve their problems) they find that the Reuse Assistant is a useful tool. Usually, they begin with a free-text query and after the corresponding frame is shown, they modify the slots to reformulate the query instead of introducing a new free-text query. Our preliminary results show that, in general, free-text queries are preferred when the requirements for the component are not well established. Form-filling is directly used when the requirements are clear and a certain level of experience with the system has been attained.
6.
Conclusions and Future Work
We have presented a hybrid approach to the problem of indexing and retrieving software components. In this approach, CB techniques are used to represent the knowledge about components, and, simultaneously, statistical indexing methods are used to facilitate the access to the components through natural language queries. The Reuse Assistant, a system following this approach, has also been introduced. The goal of our approach is to maintain the advantages of both techniques, while minimizing their drawbacks. Automatic indexing methods provided by the IR subsystem allow the users to easily incorporate new component profiles, granting the system extendibility. The conceptual organization of the case base provides a means to browse amongst functionally related components, when exact retrieval is not achieved. The high cost of hand-coding the component descriptions for a basic class library is compensated by its high reuse potential. At the moment, only a subset of the Smalltalk classes has been represented in the knowledge base. In order to evaluate the system in actual environments, the knowledge base is going to be completed. We are also studying the construction of knowledge bases for other object oriented languages (C++ and Eiffel), using the same general purpose programming concepts (the upper levels of the concept hierarchy). Also, we plan to enhance the two subsystems, incorporating natural language processing techniques into the IR module, and new reasoning mechanisms into the CB module that assess the applicability of the retrieved components. The Reuse Assistant is part of a wider environment that will embody several tools for helping and teaching the usage of object oriented class libraries. This will be a casebased environment because we believe that CB approaches are particularly useful when teaching how to reuse software components. By considering the components in the library as cases, the programming experience embodied in those cases can be of use to novices.
References Biggerstaff, T. J. & Richter, C. (1987). Reusability Framework, Assessment, and Directions. IEEE Software, vol.4, 2. Brachman, R.J. & Schmolze, J. G. (1985). An Overview of the KL-ONE Knowledge Representation System. Cognitive Science 9, 2. Callan, J. & Croft, B. (1993). An approach to incorporating CBR concepts in IR systems. Symposium on Case-Based Reasoning
and Information Retrieval, Standford University, AAAI Spring Symposium Series, March 1993. Chen, Y.F. & Ramamoorthy, C.V. (1986). The C Information Abstractor. COMPSAC, Chicago, Oct. 1986. Curtis, B. (1989). Cognitive issues in reusing software artifacts, in Software Reusability, volume II, Applications and Experience, (Biggerstaff, T.J. & Perlis, A.J., eds.), ACM Press, AddisonWesley Publishing Company. Devanbu, P., Ballard, B.W., Brachman, R.J. & Selfridge, P.G. (1991). LaSSIE: A Knowledge-Based Software Information System, in Automating Software Design (Lowry, M.R. & McCartney, R.D., eds.), AAAI Press/ The MIT Press. Embley, D.W. & Woodfield, S. N. (1987). A Knowledge Structure for Reusing Abstract Data Types, Procs. of the Ninth International Conference on Software Engineering, ACM, 360-368. Fernández, B., Buenaga, M. & Vaquero, A. (1993). An On Line Intelligent Assistant, World Conference on Educational Multimedia and Hypermedia (ED-MEDIA '93), Orlando, USA. Fernández-Chamizo, C., Hernández-Yáñez, L., González- Calero, P. A. & Urech-Baqué, A. (1993). A Case-Based approach to Software Component Retrieval. Symposium on Case-Based Reasoning and Information Retrieval, Standford University, AAAI Spring Symposium Series, March 1993. Fernández-Valmayor, A. & Fernández Chamizo, C. (1992). Educational and Research Utilization of a Dynamic Knowledge Base. Computers & Education 18, No.1-3, 51-61. Frakes, W. B. & Nejmeh, B. A. (1987). Software Reuse through Information Retrieval. Procs. of the 20th Annual HICSS, Kona, HI. Girardi, M. R. & Ibrahim, B. (1993). A Software Reuse System Based on Natural Language Specifications. Procs. of International Conference on Computing and Information (ICCI '93), Sudbury, Ontario, Canada, May 27-29, 1993, 507-511. Grass, J.E. & Chen, Y.F. (1990). The C++ Information Abstractor. Procs. USENIX C++ Conf. 1990.
Helm, R., Maarek, Y. S. (1991). Integrating Information Retrieval and Domain Specific Approaches for Browsing and Retrieval in Object-Oriented Class Libraries. Procs. OOPSLA-91. Henninger, S. (1991). Retrieving Software Objects in an ExampleBased Programming Environment. Procs. of the Fourteenth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval. Maarek, Y.S, Berry, D. M. & Kaiser, G. E. (1991). An Information Retrieval Approach for Automatically Constructing Software Libraries. IEEE Transactions on Software Engineering, vol. 17, 8. Meyer, B. (1987). Reusability: The Case for Object-Oriented Design. IEEE Software, vol.4, 2. O'Brien, P.D., Halbert D.C. & Kilian, M.F. (1987). The Trellis Programming Environment. Procs. OOPSLA '87. Prieto-Díaz, R. & Freeman, P. (1987). Classifying Software for Reusability. IEEE Software, January 1987. Rich, C. & Waters, R.C. (1988). Automatic Programming: Myths and Prospects. Computer, August 1988, 40-51. Riesbeck, C.K. & Schank, R.C. (1989). Inside Cased-Based Reasoning. Hillsdale, N.J. Lawrence Erlbaum Associates. Salton, G. & McGill, M. J. (1983). Introduction to Modern Information Retrieval, McGraw-Hill, New York. Schank, R.C. (1972). Conceptual Dependency: A Theory of Natural Language Understanding. Cognitive Psychology, 552-631. Steffen, J.L. (1985). Interactive examination of a C program with CScope. Procs. USENIX Association Winter Conf. 1985. Steier, D. (1991). Automating Algorithm Design within a General Architecture for Intelligence. Automating Software Design (Lowry, M.R. & McCartney, R.D., eds.), AAAI Press/ The MIT Press. Wilde, N. & Huitt, R. (1992). Maintenance Support for ObjectOriented Programs. IEEE Transactions on Software Engineering, vol. 18, 12. Wood, M. & Sommerville, I. (1988). An Information Retrieval System for Software Components. ACM SIGIR, vol. 22, 3,4.