Approximate Component Retrieval: An Academic Exercise or a Practical Concern?

Lamia Labed Jilani
Regional Institute for Research in Computing and Telecommunications
Cite Montplaisir, Belvedere 1002 Tunisia
Tel: (216) 1 787 757, Fax: (216) 1 787 827
Email: [email protected]

Rym Mili
School of Engineering and Computer Science
University of Texas at Dallas, Richardson, TX 75028, USA
Tel: (972) 883-2091, Fax: (972) 883-2349
Email: [email protected]

Ali Mili
Department of Computer Science, University of Ottawa
Ottawa, Ont. K1N 6N5, Canada
Tel: (613) 562 5800 X 6714, Fax: (613) 562 5187
Email: [email protected]
Abstract
When one uses informal methods to retrieve a component that satisfies some requirements out of a software reuse library, one cannot distinguish between the retrieved components that do satisfy the requirements and those that merely approximate the requirements (i.e. almost satisfy them). On the other hand, if one uses formal retrieval methods based on precise specifications of components and queries and on formal matching criteria, then one can clearly distinguish between two retrieval methods: exact retrieval, which seeks to identify components that are proved to satisfy the requirements at hand; and approximate retrieval, which is content with components that do not necessarily satisfy but approximate the requirements at hand. In this paper we advocate the need to make the distinction between these two families of methods, and introduce a possible approach thereto.
Keywords: Component based software, software component storage and retrieval, software libraries, software reuse, formal specifications, information retrieval, measures of distance between specifications.
Workshop Goals: Learning; networking; assessing the pertinence of our work; advocating the need for scientifically based methods.
Working Groups: components based software, formal methods, reuse libraries.
1 Background
Software reuse libraries are repositories in which reusable software components are stored and from which they are retrieved. They play a crucial role in determining the success of a software reuse policy, because they have a profound impact on the practice of software reuse in an organization:
A library that is poorly stocked (few components, or few relevant components) may impose a significant overhead on the development process, while seldom producing reusable components.
A library whose retrieval method has poor recall causes users (programmers) to miss reuse opportunities, when such opportunities do exist.
A library whose retrieval method has poor precision causes users (programmers) to be unnecessarily distracted by components that are retrieved but prove to be irrelevant.
The weight of these problems increases with the size of the library, and there is every indication that reuse libraries keep growing. In order to ensure that library components remain relevant to the application domain, one has to define precise criteria for inclusion in the reuse library. Also, in order to ensure that the library maintains good recall, one has to design a retrieval method that is as exhaustive as possible (one that visits all the entries, or at least skips an entry only if it knows it to be irrelevant). Finally, in order to ensure that the library maintains good precision, one must define a storage and retrieval method which provides precise descriptions of components and queries, and formally defined matching criteria.
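To make the notions of recall and precision concrete, the following minimal Python sketch computes both for a single query. The component names and the contents of the two sets are hypothetical illustrations, not data from any actual library; the convention adopted for empty sets is also an assumption.

```python
def recall_and_precision(relevant, retrieved):
    """Return (recall, precision) for one query over a component library.

    relevant  -- set of components that truly satisfy the query
    retrieved -- set of components returned by the retrieval method
    """
    hits = relevant & retrieved
    # Assumed convention: an empty denominator yields 1.0.
    recall = len(hits) / len(relevant) if relevant else 1.0
    precision = len(hits) / len(retrieved) if retrieved else 1.0
    return recall, precision


# Hypothetical query outcome: two of three relevant components are found,
# along with two irrelevant ones.
relevant = {"C3", "C4", "C7"}
retrieved = {"C3", "C4", "C9", "C10"}
recall, precision = recall_and_precision(relevant, retrieved)
```

Here poor recall corresponds to the missed component C7 (a lost reuse opportunity), and poor precision corresponds to C9 and C10 (distractions for the programmer).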
2 Position
In light of the foregoing observations, one may think that formal methods of software component storage and retrieval are widely used in practice. Yet despite the abundance of such methods, and despite the wide range of cost versus quality trade-offs that these methods offer [1, 2, 3, 4, 5, 6, 7], they are mostly ignored by industry, in favor of traditional, low-tech solutions that are inspired by information retrieval or by library science [8]. We submit the position that both kinds of methods are needed to do a satisfactory job in component storage and retrieval: traditional retrieval techniques are most useful in the early stages of the search process, when large chunks of the library can be excluded by simple keyword matches; mathematically based techniques are most effective in the later stages of the search process, when a great deal of precision is required to discriminate between several candidates which differ only slightly from each other.
One of the key differences between informal retrieval methods and formal retrieval methods is the ability to distinguish between exact retrieval and approximate retrieval. Because informal methods focus on matching component descriptions with user queries, they do not support the idea of correctness: a component may well match the query in all its detail but still fail to be correct (due to a mismatch between the library manager's interpretation of a feature, and the user's); also, a component may fail to match a query but still be correct with respect to the query (the component does satisfy a required feature, but the library manager neglected to record it). Hence, with informal retrieval methods, all retrievals are approximate retrievals: the decision of whether a component is correct (and can be used verbatim), is not correct but is close enough (and can be used after modification), or is not correct and costs too much to modify (and must be discarded) is taken after the retrieval operation, rather than as part of it.
We have investigated a formal method of component retrieval [3], based on formal specifications and program correctness, and have discussed in turn exact retrieval then approximate retrieval under this method. In this paper, we briefly introduce our main results on approximate retrieval.
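The division of labor advocated in this position can be sketched as a two-stage search. The Python code below is only an illustration under strong assumptions: a formal specification is modeled as a plain set of feature names, a component is taken to satisfy a query when it provides every required feature, and the `Component` class, its fields, and all sample entries are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    keywords: frozenset  # informal description, used by stage 1
    features: frozenset  # toy stand-in for a formal specification


def keyword_filter(library, query_keywords):
    """Stage 1 (informal): drop components sharing no keyword with the query."""
    return [c for c in library if c.keywords & query_keywords]


def exact_matches(candidates, required):
    """Stage 2 (formal, toy model): keep components providing every required feature."""
    return [c for c in candidates if required <= c.features]


# Hypothetical library entries.
library = [
    Component("sort_a", frozenset({"sort"}), frozenset({"ordered", "stable"})),
    Component("sort_b", frozenset({"sort"}), frozenset({"ordered"})),
    Component("hash_a", frozenset({"hash"}), frozenset({"uniform"})),
]
candidates = keyword_filter(library, frozenset({"sort"}))
result = exact_matches(candidates, frozenset({"ordered", "stable"}))
```

The keyword stage cheaply excludes `hash_a`; the second stage then discriminates between two candidates that differ by a single feature.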
3 Discussion
In [9], Mili defines four measures of distance between specifications; we review these measures in turn and see how they can be used to perform approximate retrieval. Basically, for a given measure of distance, say $\delta$, we consider a reuse library L and a query K, and we seek to identify all the components C of L that minimize the distance $\delta(C, K)$.
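The retrieval principle stated above amounts to a minimization over the library. The sketch below assumes, for simplicity, that the distance returns a totally ordered value such as a number; the function names, the toy distance, and the sample library are all hypothetical.

```python
def approximate_retrieval(library, query, distance):
    """Return the names of all components whose distance to the query is minimal."""
    if not library:
        return []
    scores = {name: distance(spec, query) for name, spec in library.items()}
    best = min(scores.values())
    return sorted(name for name, d in scores.items() if d == best)


def missing_features(spec, query):
    """Hypothetical toy distance: number of required features the component lacks."""
    return len(query - spec)


# Hypothetical library: specifications are modeled as sets of feature names.
library = {"C1": {"a"}, "C2": {"a", "b"}, "C3": {"a", "c"}, "C4": {"d"}}
query = {"a", "b", "c"}
closest = approximate_retrieval(library, query, missing_features)
```

No component satisfies the query exactly, yet C2 and C3 each lack only one feature and are returned together as the closest candidates.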
3.1 Functional Consensus
The first measure of distance is what we call functional consensus. The rationale for this measure can be summarized as follows: Given a component C and a query K, we consider that C is close to K if C and K have plenty of information in common. Among all the components of the library, this measure will select the one that has the most information in common with the query.
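As an illustration of functional consensus, the following sketch models a specification as a set of feature names, so that the information two specifications have in common is their intersection. This set model and the counting of shared features are simplifying assumptions, and the library contents are hypothetical.

```python
def functional_consensus(c, k):
    """Toy model: the information that component c and query k have in common."""
    return c & k


def most_consensual(library, query):
    """Return the names of the components sharing the most information with the query."""
    sizes = {name: len(functional_consensus(spec, query))
             for name, spec in library.items()}
    if not sizes:
        return []
    best = max(sizes.values())
    return sorted(name for name, size in sizes.items() if size == best)


# Hypothetical library and query.
library = {
    "C1": {"parse", "typecheck"},
    "C2": {"parse", "typecheck", "optimize"},
    "C3": {"parse", "link"},
}
query = {"parse", "typecheck", "optimize", "debug"}
selected = most_consensual(library, query)
```

Note that under this measure closeness grows with the amount of shared information, so the selected component is the one that maximizes the consensus.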
3.2 Refinement Difference
Given two specifications C and K such that K refines C (i.e. all the requirements information of C is recorded in K), the refinement difference between K and C is the smallest functional increment that we must add to C to obtain K. The rationale of this measure is the following. Given a component C and a query K, we consider that C is close to K if the amount of functionality of K that is not satisfied by C is small. Note that unlike all other measures of distance presented in this section, the measure of refinement difference is not symmetric.
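The asymmetry of the refinement difference can be seen in a small sketch. It assumes the same toy model in which a specification is a set of feature names, K refines C when every feature of C also appears in K, and the smallest increment is a set difference; none of these modeling choices come from [9].

```python
def refines(k, c):
    """Toy model: k refines c when all the information of c is recorded in k."""
    return c <= k


def refinement_difference(k, c):
    """Smallest increment that must be added to c to obtain k (toy model)."""
    if not refines(k, c):
        raise ValueError("refinement difference is only defined when K refines C")
    return k - c


# Hypothetical component and query.
c = {"parse"}
k = {"parse", "typecheck"}
increment = refinement_difference(k, c)

# Swapping the arguments is not meaningful here, since c does not refine k.
try:
    refinement_difference(c, k)
    swapped_is_defined = True
except ValueError:
    swapped_is_defined = False
```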
3.3 Refinement Distance
Given two specifications K and C, the refinement distance between K and C reflects all the functional information of K that is not recorded in C and all the requirements information of C that is not in K. We denote this measure by $\rho(K, C)$. The rationale of this measure of distance is the following. The refinement distance reflects two terms: the functional requirements of K that C does not satisfy; and the functional properties of C that K does not need. Ideally, we want to minimize both of these terms: we minimize the first term in order to have fewer additional features to add to C; and we minimize the second term in order to have fewer irrelevant features of C to deal with when we are modifying C to satisfy K.
[Figure 1: A Database of Pascal Compilers; nodes C0 through C11]
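The two terms of the refinement distance can be sketched as follows, again under the assumption that a specification is modeled as a set of feature names. Returning the two terms as a pair is a choice made for this illustration only, and the sample specifications are hypothetical.

```python
def refinement_distance(k, c):
    """Toy model of the two terms of the refinement distance.

    missing -- requirements of the query k that the component c does not satisfy
    surplus -- properties of the component c that the query k does not need
    """
    missing = k - c
    surplus = c - k
    return missing, surplus


# Hypothetical query and component.
k = {"parse", "typecheck", "optimize"}
c = {"parse", "typecheck", "debug"}
missing, surplus = refinement_distance(k, c)
```

Swapping the two arguments merely swaps the two terms, which is consistent with the symmetry of this measure: taken together, the terms form the symmetric difference of the two sets.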
3.4 Functional Distance
The rationale of functional distance is the following. Given two specifications A and B, the distance between A and B is reflected by two features: the amount of requirements information that A and B have in common, which is reflected by the functional consensus of A and B (denoted by $\gamma(A, B)$); and the amount of requirements information that sets them apart, which is reflected by the refinement distance $\rho(A, B)$. Consequently, we define the functional distance between A and B as the vector denoted by
$$\vec{\phi}(A, B) = \begin{pmatrix} \gamma(A, B) \\ \rho(A, B) \end{pmatrix}.$$
4 Experimentation: A Library of Compilers
In order to illustrate how these distances can be used to perform approximate retrieval in a database of software components, we have considered the library of compilers that is presented in [3] and a user query K that no element of the library satisfies. Figure 1 gives a graphic representation of these compilers, where the nodes are ordered by means of the refinement relation. For each measure of distance (say $\delta$), we consider all the entries of the original database and compare them with respect to their distance to specification K. Specifically, whenever component $C_i$ is $\delta$-closer to K than component $C_j$, we draw $C_i$ higher than $C_j$ in the new graph; also, whenever two components $C_i$ and $C_j$ have the same distance to K (i.e. $\delta(K, C_i) = \delta(K, C_j)$), we represent them at the same node in the new graph. The graphs that we obtain for functional consensus, refinement difference, refinement distance and functional distance are given in Figure 2. On each graph, the specifications that minimize the measure of distance (hence are prime candidates in an approximate retrieval) are those that appear at the top of the graph.
[Figure 2: Graphs derived from Measures of Distance. Panels: Functional Consensus, Refinement Difference, Refinement Distance, Refinement Ratio]
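The construction of such a graph can be sketched in the special case where the distance is a number, so that the graph reduces to a stack of levels. The function name, the toy distance, and the sample components below are hypothetical and do not reproduce the compiler library of [3].

```python
from collections import defaultdict


def distance_levels(library, query, distance):
    """Group components by their distance to the query, closest level first.

    Components at the same distance share a node (one inner list).
    """
    groups = defaultdict(list)
    for name, spec in library.items():
        groups[distance(spec, query)].append(name)
    return [sorted(groups[d]) for d in sorted(groups)]


def differing_features(spec, query):
    """Hypothetical numeric distance: count of features not shared by both."""
    return len(spec ^ query)


# Hypothetical library: specifications modeled as sets of feature names.
library = {
    "comp_0": set(),
    "comp_1": {"a"},
    "comp_2": {"b"},
    "comp_3": {"a", "b"},
}
query = {"a", "b", "c"}
levels = distance_levels(library, query, differing_features)
```

The first level plays the role of the top of the graph: it holds the prime candidates for approximate retrieval, and components at equal distance appear together.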
References
[1] R. Hall, "Generalized behaviour-based retrieval," in Proceedings, 16th Int. Conf. on Soft. Eng., (Sorrento, Italy), IEEE Computer Society Press, May 1994.
[2] J. Jeng and B. Cheng, "Formal methods applied to reuse," in Proceedings, 5th Workshop on Software Reuse, (Palo Alto, CA), University of Maine, November 1992.
[3] R. Mili, R. Mittermeir, and A. Mili, "Storing and retrieving software component: A refinement based approach," in Proceedings, 16th Int. Conf. on Soft. Eng., (Sorrento, Italy), IEEE Computer Society Press, May 1994.
[4] D. Perry and S. Popovich, "Inquire: Predicate-based use and reuse," in Proceedings, Knowledge Based Software Engineering Conference, (Chicago, IL), IEEE Computer Society Press, September 1993.
[5] A. Podgurski and L. Pierce, "Behaviour sampling: a technique for automated retrieval of reusable components," in Proceedings, 14th International Conference on Software Engineering, (Melbourne, Victoria, Australia), pp. 300-304, IEEE Computer Society Press, September 1992.
[6] A. M. Zaremski and J. M. Wing, "Signature matching: A tool for using software libraries," ACM Transactions on Software Engineering and Methodology, vol. 4, pp. 146-170, April 1995.
[7] A. M. Zaremski and J. M. Wing, "Specification matching of software components," in Proceedings, SIGSOFT '95: Third ACM SIGSOFT Symposium on the Foundations of Software Engineering, (New York, NY), ACM Press, April 1995.
[8] W. Frakes and T. Pole, "An empirical study of representation methods for reusable software components," IEEE Transactions on Software Engineering, vol. 20, pp. 617-630, August 1994.
[9] R. Mili, "Assessing the reuse worthiness of a component: Empirical and analytical approaches," Tech. Rep. PhD Dissertation, University of Ottawa, 1996.
5 Biographies
Lamia Labed Jilani holds an Engineering degree in Computer Engineering from the University of Tunis II; she is a PhD candidate at the University of Tunis II and is a researcher with the Regional Institute for Research in Computing and Telecommunications in Tunis, Tunisia.
Rym Mili holds a Doctorate in Computer Science from the University of Tunis and a PhD in Computer Science from the University of Ottawa; she is an Assistant Professor of Computer Science at the University of Texas at Dallas.
Ali Mili holds a PhD from the University of Illinois and a Doctorat d'Etat from the University of Grenoble; he is Professor of Computer Science at the University of Ottawa.