Retrieving Software Components That Minimize Adaptation Effort

Lamia Labed Jilani
Regional Institute for Research in Computing and Telecommunications
Cité Montplaisir, Belvédère 1002, Tunisia
Fax: (216) 1 787 827

Marc Frappier
Département de Mathématiques et d'Informatique
Université de Sherbrooke
Sherbrooke, PQ J1K 2R1, Canada
Fax: (819) 821 8200

Jules Desharnais
Département d'Informatique
Université Laval
Québec, PQ G1K 7P4, Canada
Fax: (418) 656 3424

Ali Mili
Department of Computer Science
University of Ottawa
Ottawa, Ont. K1N 6N5, Canada
Fax: (613) 562 5187

Rym Mili
School of Engineering and Computer Science
University of Texas at Dallas
Richardson, TX 75028, USA
Fax: (972) 883-2349

July 29, 1997

Abstract

Given a software library whose components are represented by formal specifications, we distinguish between two types of retrieval procedures: exact retrieval, whereby, given a query $K$, we identify all (and only) the library components that are correct with respect to $K$; and approximate retrieval, which is invoked in case exact retrieval fails, and which (ideally) identifies the library components that minimize the required adaptation effort (once such a component is retrieved, the effort of adapting it to satisfy query $K$ is minimal over the set of all the components of the library). To this effect, we define four measures of functional distance between specifications, and devise algorithms that minimize these measures over a structured set of components; then we discuss to what extent these measures can be used as predictors of adaptation effort.

Correspondence author.


1 Exact Retrieval and Approximate Retrieval

Software libraries are repositories where software components can be stored, indexed and retrieved; while software reuse is the most common application of software libraries, it is not necessarily their only application [47, 48]. Retrieval procedures are critically dependent, of course, on the representation of software components in the library. When components (and queries) are represented by formal specifications, it is possible to distinguish between two types of retrieval procedures: exact retrieval, whose purpose is to identify all the components that are correct with respect to the query; and approximate retrieval, which is called if exact retrieval fails to return components, and which identifies all the components that come closest to the query at hand. In the context of software reuse, approximate retrieval is a phase of white-box reuse, whereby the retrieved components are known not to satisfy the query, and must be modified prior to being used. In order to formalize the process of identifying the closest components, we introduce measures of distance between specifications and we formulate the approximate retrieval procedure as the identification of those components of the library that minimize a measure of distance to the query.

It is possible to distinguish [22, 23] between two measures of distance between specifications: syntactic (or structural) distance, which reflects to what extent the two specifications look alike (i.e. have similar representations); and semantic (or functional) distance, which reflects to what extent the specifications act alike (have similar functional features). In this paper we focus our investigation on semantic/functional distances, because these lend themselves to a mathematical formulation. Specifically, we have defined four measures of functional distance, which are: functional consensus, refinement difference, refinement distance and refinement ratio.

In sections 2, 3, 4 and 5, we consider in turn these four measures, discuss how they can be used for approximate retrieval, and illustrate them on a running example. Finally, in section 8 we briefly present a summary of our results and our prospects, then compare our work to others. For the sake of readability, we do not devote a section to mathematical background; rather we introduce mathematical definitions (as lightly as possible) as we go. We do assume that the reader has some familiarity with elementary set theory and relations theory [4].

2 Functional Consensus

2.1 Definition

In this paper, we represent program specifications with homogeneous relations on a space $S$. For $S = \mathbb{R}$ (the real numbers), we consider the following pair of relations:

$$R = \{(s, s') \mid s - 1 \le s' \le s + 1\}, \qquad R' = \{(s, s') \mid s - 2 \le s' \le s + 2\}.$$

One may observe that $R$ reflects a stronger requirement than $R'$ (by imposing a stronger condition on outputs). We consider another pair of relations:

$$R = \{(s, s') \mid s - 1 \le s' \le s + 1\}, \qquad R' = \{(s, s') \mid s \ge 0 \land s - 1 \le s' \le s + 1\}.$$

One may observe that $R$ reflects a stronger requirement than $R'$ (by imposing the same condition on a larger set of inputs). The following definition generalizes these two examples:

Definition 1 Relation $R$ is said to refine (or be a refinement of) relation $R'$ if and only if

$$RL \cap R'L \cap (R \cup R') = R',$$

where $L$ is the universal relation on $S$ ($L = S \times S$) and $RL$ represents the relational product of $R$ by $L$. This relation is denoted by $R \sqsupseteq R'$ or, equivalently, $R' \sqsubseteq R$.

We have found in [3] that the refinement relation is an ordering relation; furthermore, it has lattice-like properties, whereby any two relations $R$ and $R'$ have a meet, which is defined by

$$R \sqcap R' = RL \cap R'L \cap (R \cup R').$$

Most importantly, we have found that the meet of $R$ and $R'$ can be interpreted as the requirements information that $R$ and $R'$ have in common; this serves as the basis of our first measure of distance.

Definition 2 Given two specifications $R$ and $R'$ (represented by relations), the functional consensus of $R$ and $R'$ is the relation denoted by $\gamma(R, R')$ and defined by

$$\gamma(R, R') = R \sqcap R'.$$
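The two examples and Definition 1 can be checked mechanically on a small finite space. The sketch below is only an illustration: it represents a relation as a Python set of `(s, s')` pairs, uses a handful of integers as a stand-in for the reals, and the names `dom`, `meet`, `refines` and `SPACE` are assumptions of the sketch rather than notation from the paper.

```python
from itertools import product

def dom(r):
    """Inputs on which relation r is defined (the domain of r)."""
    return {s for s, _ in r}

def meet(r, rp):
    """RL ∩ R'L ∩ (R ∪ R'): the union of r and rp, restricted to their common domain."""
    common = dom(r) & dom(rp)
    return {(s, t) for (s, t) in r | rp if s in common}

def refines(r, rp):
    """Definition 1: r refines rp iff RL ∩ R'L ∩ (R ∪ R') = R'."""
    return meet(r, rp) == rp

# Finite stand-in for the real line (an assumption made to keep the sets small).
SPACE = range(-3, 4)
within1 = {(s, t) for s, t in product(SPACE, SPACE) if abs(s - t) <= 1}
within2 = {(s, t) for s, t in product(SPACE, SPACE) if abs(s - t) <= 2}
within1_nonneg = {(s, t) for (s, t) in within1 if s >= 0}

print(refines(within1, within2))          # True: stronger condition on outputs
print(refines(within2, within1))          # False
print(refines(within1, within1_nonneg))   # True: same condition, larger set of inputs
print(meet(within1, within2) == within2)  # True: the meet is the weaker specification
```

Reading the meet as "union on the common domain" makes the consensus interpretation concrete: an input is covered only if both specifications cover it, and an output is allowed if either specification allows it.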

2.2 Usage

Given a set of components $C = \{C_i\}$, $1 \le i \le n$, in a software library, and given a specification (relation) $K$ that represents a user query, we wish to identify all the components $C_i$ that maximize (with respect to the refinement ordering) the functional consensus $\gamma(K, C_i)$. The rationale for using the measure of functional consensus for approximate retrieval is given below.

Rationale. The functional consensus of $K$ and $C_i$ represents the requirements information of $K$ that $C_i$ satisfies; the greater this measure, the fewer functional features we have to add to $C_i$ to satisfy $K$. The measure $\gamma(K, C_i)$ represents in effect the savings achieved (in satisfying $K$) by reusing $C_i$; the rationale of this measure is to maximize the savings.

One may argue that it is something of a misnomer to refer to the functional consensus as a measure of distance since, unlike traditional measures of distance, functional consensus increases (becomes more refined) as $R$ and $R'$ grow closer. Had the lattice of specifications been complemented, we would have defined this measure as the complement of the meet; because the lattice is not complemented, we have to live with this slight anomaly.
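As a minimal sketch of this retrieval step, the code below computes the consensus of a query with each library component and keeps the components whose consensus is not strictly refined by that of another component. The three-element library and the names `Ca`, `Cb`, `Cc` and `best_by_consensus` are hypothetical; relations are again sets of `(s, s')` pairs on a tiny space.

```python
def dom(r):
    """Inputs on which relation r is defined."""
    return {s for s, _ in r}

def meet(r, rp):
    """RL ∩ R'L ∩ (R ∪ R'): the union, restricted to the common domain."""
    common = dom(r) & dom(rp)
    return {(s, t) for (s, t) in r | rp if s in common}

def refines(r, rp):
    """r refines rp iff their meet equals rp."""
    return meet(r, rp) == rp

def best_by_consensus(query, library):
    """Names of the components whose consensus with the query is maximal."""
    consensus = {name: meet(query, spec) for name, spec in library.items()}
    best = []
    for name, g in consensus.items():
        # A component is beaten if another consensus strictly refines its own.
        beaten = any(refines(h, g) and not refines(g, h)
                     for other, h in consensus.items() if other != name)
        if not beaten:
            best.append(name)
    return sorted(best)

# Hypothetical query and library on the space {0, 1, 2}.
K = {(0, 0), (1, 1), (2, 2)}
LIBRARY = {
    "Ca": {(0, 0)},
    "Cb": {(0, 0), (1, 1)},
    "Cc": {(0, 1), (1, 1)},
}
print(best_by_consensus(K, LIBRARY))  # ['Cb']
```

Because refinement is only a partial ordering, the result is a set of maximal elements rather than a single winner; several incomparable components may be returned.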


2.3 Illustration

For the sake of illustration, we introduce a running example, which we borrow from Mili et al [20]. This example deals with a library of twelve Pascal compilers, defined as follows:

C0: accepting simple Pascal programs (where simple refers to the property that it does not include record and pointer types nor user-defined types), yielding complete P-code (where complete refers to the property that it uses all 16 registers and has an increment instruction);
C1: accepting simple Pascal programs, yielding reduced P-code (where reduced refers to the property that it uses only 8 registers, and that it does not include increment instructions);
C2: accepting simple Pascal programs, yielding peepholed medium P-code (where medium refers to the property that it uses all 16 registers, but does not include increment instructions);
C3: accepting standard Pascal programs, yielding complete P-code;
C4: accepting standard Pascal programs, yielding peepholed medium P-code if the input is in simple Pascal, and reporting an unavailable feature otherwise;
C5: accepting any string of Pascal terminal symbols, yielding peepholed medium P-code if the input is in simple Pascal, reporting an unavailable feature if the input is in standard Pascal, and an error message otherwise;
C6: accepting standard Pascal, yielding reduced P-code;
C7: accepting simple Pascal, yielding peepholed reduced P-code;
C8: accepting standard Pascal, yielding peepholed medium P-code as output;
C9: accepting standard Pascal, yielding globally optimized reduced P-code;
C10: accepting standard Pascal, yielding peepholed reduced code if the input is in simple Pascal, reporting an unavailable feature otherwise;
C11: finally, accepting any string of Pascal terminal symbols, yielding peepholed reduced code if the input is in simple Pascal, reporting an unavailable feature if the input is in standard Pascal, and an error message if the input is not in standard Pascal.

Figure 1 shows how these components are ordered by the refinement relation; this ordering structure is used in [20] to orient the retrieval operations. We consider the following user query, whose specification we denote by $K$:

We are looking for a compiler that accepts any string of Pascal terminal symbols and returns medium P-code if the input is in standard Pascal, or an error message otherwise.

[Figure 1: A Database of Pascal Compilers]

Figure 2 shows how the components of the database are ordered with respect to their proximity to $K$, as measured by functional consensus. Specifically, we have

$$C_i \succeq C_j \Leftrightarrow \gamma(K, C_i) \sqsupseteq \gamma(K, C_j).$$

According to this measure, the components that come closest to $K$ are $C_5, C_{11}$ (that are equally close) and $C_6, C_8, C_9$ (that are also equally close); these components are optimal because they appear at the top of the functional consensus graph (figure 2).

3 Refinement Difference

3.1 Definition

In [3] we have found that in addition to the meet operation, we can also define a join operation for the refinement ordering. Specifically, whenever two relations $R$ and $R'$ satisfy the condition

$$RL \cap R'L = (R \cap R')L$$

(which we call the consistency condition), they have a join, which is defined by

$$R \sqcup R' = (\overline{R'L} \cap R) \cup (\overline{RL} \cap R') \cup (R \cap R').$$

Most importantly, we have found that the join of $R$ and $R'$ is the specification that captures all the functional features of $R$ and all the functional features of $R'$; hence in effect the join performs the addition of the specifications at hand. Also, we have found that not all pairs of specifications can be added: in order for two specifications to be added, they must meet the consistency condition, which can be interpreted as the condition under which the specifications add to each other's information (rather than contradict each other's information).

[Figure 2: Graph derived from Functional Consensus]

From the definition of (requirements) addition, we will attempt to derive the notion of (requirements) subtraction. Given two specifications $R$ and $R'$ such that $R \sqsupseteq R'$, it is natural to think of the difference between $R$ and $R'$ as the smallest (least refined) specification $X$ whose join with $R'$ yields $R$. The equation

$$R' \sqcup X \sqsupseteq R$$

admits a feasible solution, which is $X = R$; the difference between $R$ and $R'$ is the minimal element (if it exists) of the (non-empty) set of feasible solutions.

Definition 3 Given specifications $R$ and $R'$ such that $R \sqsupseteq R'$, the refinement difference between $R$ and $R'$ is the relation that is denoted by $R \ominus R'$ and defined as the least refined solution $X$ (if it exists) of the equation

$$R \sqsubseteq R' \sqcup X.$$

By analogy, observe that given two real numbers $a$ and $b$ such that $a \ge b$, the difference between $a$ and $b$ is the smallest number $x$ such that $a \le b + x$.

The following proposition, which we give without proof (see [22, 23]), establishes the existence and unicity of the optimal solution, and provides an explicit expression thereof.

Proposition 1 Given relations $R$ and $R'$ such that $R \sqsupseteq R'$, the refinement difference between $R$ and $R'$ is given by the formula:

$$R \ominus R' = (R \cap \overline{R'L}) \cup \overline{\overline{(R' \cap \overline{R})L} \cup \overline{(R \cup \overline{R'})}}.$$

Understanding the details of this formula is not crucial for the purpose of readability. Suffice it to observe that this formula is easily converted to a formula of first order logic, and that comparisons between distances can be formulated as theorems of first order logic, well within the range of theorems that today's automated theorem provers can handle [46].
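Definition 3 can be explored by brute force on a very small space. The sketch below is an illustration under stated assumptions: relations are sets of pairs, `join` follows the join formula and is only meaningful when `consistent` holds, and the closed form inside `difference` is an assumption of this sketch (keep the part of R outside the domain of R', and on inputs where R' allows strictly more than R, allow every output that is in R or outside R'). The function `check_least_refined` then tests that assumed form against the definition itself, by enumerating every relation on a two-element space.

```python
from itertools import chain, combinations, product

def dom(r):
    return {s for s, _ in r}

def times_l(r, space):
    """RL: every pair whose input lies in the domain of r."""
    return {(s, t) for s in dom(r) for t in space}

def meet(r, rp):
    common = dom(r) & dom(rp)
    return {(s, t) for (s, t) in r | rp if s in common}

def refines(r, rp):
    return meet(r, rp) == rp

def consistent(r, rp):
    """Consistency condition RL ∩ R'L = (R ∩ R')L, compared on domains."""
    return (dom(r) & dom(rp)) == dom(r & rp)

def join(r, rp, space):
    """(~(R'L) ∩ R) ∪ (~(RL) ∩ R') ∪ (R ∩ R'); call only when consistent."""
    return (r - times_l(rp, space)) | (rp - times_l(r, space)) | (r & rp)

def difference(r, rp, space):
    """Assumed closed form of the refinement difference, for r refining rp."""
    universe = set(product(space, space))
    changed = times_l(rp - r, space)  # inputs where rp allows more than r
    return (r - times_l(rp, space)) | (changed & (r | (universe - rp)))

def all_relations(space):
    pairs = list(product(space, space))
    subsets = chain.from_iterable(
        combinations(pairs, n) for n in range(len(pairs) + 1))
    return [set(c) for c in subsets]

def check_least_refined(space):
    """True if `difference` is feasible and refined by every feasible X."""
    rels = all_relations(space)
    for r, rp in product(rels, rels):
        if not refines(r, rp):
            continue
        x = difference(r, rp, space)
        if not (consistent(rp, x) and refines(join(rp, x, space), r)):
            return False
        for y in rels:
            feasible = consistent(rp, y) and refines(join(rp, y, space), r)
            if feasible and not refines(y, x):
                return False
    return True

print(difference({(0, 0)}, {(0, 0), (0, 1)}, (0, 1)))  # {(0, 0)}
print(check_least_refined((0, 1)))                     # True
```

The exhaustive check covers only the two-element space, so it is evidence for the sketch rather than a proof; larger spaces grow too quickly to enumerate this way.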

3.2 Usage

Given a set $C$ of components and given a specification $K$ that represents a user query, we are interested in identifying the components of $C$ for which the functional features of $K$ that are left unfulfilled are minimal. For a component $C_i \in C$ and a query $K$, the functional features of $K$ that are fulfilled by $C_i$ can be represented by $K \sqcap C_i$; hence the functional features of $K$ that are left unfulfilled by $C_i$ can be represented as $K \ominus (K \sqcap C_i)$; this expression is defined since $K \sqsupseteq (K \sqcap C_i)$ (a trivial lattice identity). If we denote this expression by $\theta(K, C_i)$, then we are interested in identifying those components $C_i \in C$ that minimize the expression $\theta(K, C_i)$.

Rationale. The refinement difference between $K$ and $K \sqcap C_i$ measures the amount of functional features that must be added to $C_i$ in order to make it satisfy $K$. We are interested in components that minimize this quantity, because they minimize the amount of functional features that must be added to them to satisfy $K$. Whereas the functional consensus of $K$ and $C_i$ measures the savings achieved by using $C_i$ to satisfy $K$, the refinement difference of $K$ and $K \sqcap C_i$ measures the expenditures that one must incur in order to adapt $C_i$ to satisfy $K$.

3.3 Illustration

By using the same software library as above, and the same user query, we consider minimizing the refinement difference between the query and the library components. Figure 3 shows the graph that is derived from this distance. This graph represents the following ordering relation:

$$C_i \preceq C_j \Leftrightarrow \theta(K, C_i) \sqsupseteq \theta(K, C_j),$$

where $\theta(K, C_i)$ stands for the refinement difference $K \ominus (K \sqcap C_i)$. From this graph, it appears, again, that $C_5, C_6, C_8, C_9, C_{11}$ are the optimal components, with $C_5$ and $C_{11}$ being equally close, and $C_6, C_8, C_9$ being equally close.

[Figure 3: Graph derived from Refinement Difference]

4 Refinement Distance

4.1 Definition

We consider two specifications $R$ and $R'$, and we are interested in measuring the amount of requirements information (or functional properties) that discriminates between them. We find that this information can be articulated as the sum (or, in terms of the lattice of specifications, the join) of two components:

• The information of $R$ that $R'$ does not have; this is reflected, as we discussed in section 3, by the expression $R \ominus (R \sqcap R')$.
• The information of $R'$ that $R$ does not have, which is given by the expression $R' \ominus (R \sqcap R')$.

Hence the following definition.

Definition 4 The refinement distance between specifications $R$ and $R'$ is denoted by $\delta(R, R')$ and defined by the following expression, when it exists (because it is a join, it may fail to exist):

$$\delta(R, R') = \big(R \ominus (R \sqcap R')\big) \sqcup \big(R' \ominus (R \sqcap R')\big).$$

We have a simple proposition regarding this measure.

Proposition 2 The refinement distance is defined if and only if $R$ and $R'$ have a join. Then its expression is given by

$$\delta(R, R') = (R \sqcup R') \ominus (R \sqcap R').$$

Observe the analogy with real numbers, modulo the $\le$ ordering: the distance between two numbers, say $x$ and $x'$, which is the absolute value of $(x - x')$, can be written as

$$|x - x'| = \max(x, x') - \min(x, x').$$

If we consider that Max and Min are (resp.) the join and meet operations of the lattice defined by $\le$, we see the analogy between the two definitions. Interestingly, the measure of refinement distance satisfies the traditional distance axioms, interpreted over the set of relations:

• $\delta(R, R') \sqsupseteq \emptyset$.
• $\delta(R, R') = \delta(R', R)$.
• $\delta(R, R') = \emptyset \Leftrightarrow R = R'$.
• $\delta(R, R') \sqcup \delta(R', R'') \sqsupseteq \delta(R, R'')$.
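A small worked example may help to see the two terms of the refinement distance at work. It is only a sketch: the space $S = \{0, 1\}$ and the two relations are hypothetical, $\sqcap$, $\sqcup$ and $\ominus$ stand for meet, join and refinement difference, $\delta$ is used here for the refinement distance, and each difference is computed directly from its definition as the least refined solution.

```latex
% Hypothetical example on S = {0, 1}
\begin{align*}
R &= \{(0,0)\}, \qquad R' = \{(0,0),\,(0,1),\,(1,1)\} \\
R \sqcap R' &= \{(0,0),\,(0,1)\}
  && \text{union, restricted to the common domain } \{0\} \\
R \sqcup R' &= \{(0,0),\,(1,1)\}
  && \text{the pair is consistent: } R \cap R' \text{ is defined on } \{0\} \\
R \ominus (R \sqcap R') &= \{(0,0)\}
  && \text{what } R \text{ adds: on input } 0 \text{ the output must be } 0 \\
R' \ominus (R \sqcap R') &= \{(1,1)\}
  && \text{what } R' \text{ adds: input } 1 \text{ is handled} \\
\delta(R, R') &= \{(0,0)\} \sqcup \{(1,1)\} = \{(0,0),\,(1,1)\}
  && = (R \sqcup R') \ominus (R \sqcap R')
\end{align*}
```

In this instance both expressions of the distance (the join of the two differences, and the difference between join and meet) yield the same relation.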

4.2 Usage

Given a library $C$ of components, and given a query $K$, we wish to identify all the components $C_i \in C$ that minimize the refinement distance to $K$, $\delta(K, C_i)$.

Rationale. The measure $\delta(K, C_i)$ includes two terms, which must both be minimized:

• The term $K \ominus (K \sqcap C_i)$ represents the functional features of $K$ that are not covered by $C_i$; these must be minimized so that we have fewer features to add to $C_i$ to satisfy $K$.
• The term $C_i \ominus (K \sqcap C_i)$ represents the functional features of $C_i$ that are irrelevant to $K$; these must be minimized so that we have fewer irrelevant features of $C_i$ to deal with (work around) as we modify $C_i$ to satisfy $K$.

4.3 Illustration

We consider the compilers library and the user query $K$ submitted in section 2. Figure 4 shows how the components of $C$ are ranked by their proximity to $K$, using the refinement distance to assess proximity. In other words, the graph of figure 4 represents the following relation:

$$C_i \preceq C_j \Leftrightarrow \delta(K, C_i) \sqsupseteq \delta(K, C_j),$$

where $\delta$ denotes the refinement distance. Even though the graph is different, it still shows the same set of optimal elements as the distances we have seen so far, which are: $C_5, C_6, C_8, C_9, C_{11}$.

[Figure 4: Graphs derived from Refinement Distance]

5 Refinement Ratio

5.1 Definition

Given two specifications $R$ and $R'$ we wish to measure their distance by assessing the information that they have in common as well as the information that discriminates between them; we define the refinement ratio accordingly.

Definition 5 The refinement ratio of two relations $R$ and $R'$ is the vector denoted by $\rho(R, R')$ and defined by

$$\rho(R, R') = \begin{pmatrix} \gamma(R, R') \\ \delta(R, R') \end{pmatrix},$$

whose first term is the functional consensus of $R$ and $R'$ and whose second term is their refinement distance. We consider that the refinement ratio (as a measure of distance) increases whenever the first term increases and the second term decreases; for this reason, we refer to the first term as the numerator of the ratio, and refer to the second term as the denominator.

5.2 Usage

For a given query $K$, the ideal component $C$ is one that has most functional features in common with $K$ (as little as possible to change), for which the features to add are minimal (to minimize modification effort) and whose irrelevant features are minimal (to minimize distraction by irrelevant information); this argument is articulated below.

Rationale. We want to identify the library components that have most in common with the query, and have as little as possible to set them apart from the query.

[Figure 5: Graphs derived from Refinement Ratio]

5.3 Illustration

Figure 5 shows how the refinement ratio ranks the components of our sample library according to their proximity to the query $K$. This graph represents the following ordering relation among components:

$$C_i \succeq C_j \Leftrightarrow \gamma(K, C_i) \sqsupseteq \gamma(K, C_j) \land \delta(K, C_j) \sqsupseteq \delta(K, C_i),$$

where $\gamma$ is the numerator of the ratio (functional consensus) and $\delta$ its denominator (refinement distance). Interestingly, this graph shows a different list of optimal elements, which are: $C_3$, $C_5$, $C_6$, $C_8$, $C_9$, $C_{11}$. In addition, the graph that we derive from this measure of distance is markedly different from all the previous graphs.
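The comparison of two refinement ratios can be sketched as a pair of refinement checks. In the code below, a ratio is a `(numerator, denominator)` tuple of relations; the three ratio values are hypothetical and are given directly rather than computed from a query, and the helper names are assumptions of the sketch.

```python
def dom(r):
    return {s for s, _ in r}

def meet(r, rp):
    """RL ∩ R'L ∩ (R ∪ R'): the union, restricted to the common domain."""
    common = dom(r) & dom(rp)
    return {(s, t) for (s, t) in r | rp if s in common}

def refines(r, rp):
    return meet(r, rp) == rp

def at_least_as_close(ratio_i, ratio_j):
    """True if component i ranks at or above component j:
    its numerator refines j's numerator (more in common with the query)
    and j's denominator refines its denominator (less setting it apart)."""
    (num_i, den_i), (num_j, den_j) = ratio_i, ratio_j
    return refines(num_i, num_j) and refines(den_j, den_i)

# Hypothetical (consensus, distance) pairs for three components.
ratio_a = ({(0, 0)}, {(1, 1)})
ratio_b = ({(0, 0), (0, 1)}, {(0, 0), (1, 1)})
ratio_c = ({(0, 0)}, {(0, 0), (1, 1)})

print(at_least_as_close(ratio_a, ratio_b))  # True
print(at_least_as_close(ratio_b, ratio_a))  # False
print(at_least_as_close(ratio_a, ratio_c))  # True
print(at_least_as_close(ratio_b, ratio_c))  # False
```

Since both conditions must hold at once, many pairs of components end up unrelated, which is consistent with this measure producing a graph that differs markedly from the others.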

6 Further Experimentation

The running experiment presented in the previous sections gives us a mixed feeling: while we find the diversity of graphs encouraging, we are concerned with the fact that most measures produce (more or less) the same optimal elements. We have hypothesized that the reason why all measures of distance produce (almost) the same optimal elements can be explained by the fact that the query we had submitted ($K$) is very high in the lattice of the original component library (see figure 1), so that being near $K$ is the same as being high in the lattice. In order to test our hypothesis, we have run a number of experiments, varying $K$ to fit in several places throughout the lattice, but always making sure that $K$ is not solved by any component of the library. In this section, we briefly present our results for five values of $K$, which we introduce below in prose form.

K1: We are looking for a compiler that accepts simple Pascal code and produces correct reduced P-code, and accepts full Pascal code, for which it produces complete peephole-optimized P-code.
K2: We are looking for a compiler that accepts standard Pascal code and produces medium P-code, and returns an error message if the input stream is a full Pascal program that is not in standard format.
K3: We are looking for a compiler that accepts full (but not simple) Pascal code and produces reduced globally optimized P-code.
K4: We are looking for a component that parses full and not simple Pascal code and produces an error message.
K5: We are looking for a compiler that accepts full (but not standard) Pascal code and returns complete globally optimized P-code.

For each query $K_i$, $1 \le i \le 5$, and for each measure of distance, say $\mu$, we scan all the components $C_j$, $0 \le j \le 11$, and formulate the distance between $K_i$ and $C_j$ as an Otter (© Argonne National Laboratory) formula. Then for each pair of components, say $C_k$ and $C_h$, we compare (by the refinement ordering) the distances $\mu(K_i, C_k)$ and $\mu(K_i, C_h)$; this produces two predicates, which we formulate as Otter theorems, namely

$$\mu(K_i, C_k) \sqsupseteq \mu(K_i, C_h) \quad \text{and} \quad \mu(K_i, C_h) \sqsupseteq \mu(K_i, C_k).$$

We submit these two theorems to Otter and observe their outcome; the following table indicates how the outcome of these theorems determines how we draw components $C_k$ and $C_h$ in the graph induced by the measure of distance $\mu$.

μ(Ki, Ck) ⊒ μ(Ki, Ch)    μ(Ki, Ch) ⊒ μ(Ki, Ck)    Representation
Proved                   Proved                   Ch and Ck are the same node.
Proved                   Not Proved               Ch drawn above Ck.
Not Proved               Proved                   Ck drawn above Ch.
Not Proved               Not Proved               Ch and Ck unrelated.

Here μ stands for the measure of distance under consideration. For the sake of space, we do not show the four graphs that each of these queries produces; rather, we content ourselves with presenting in figure 6 the optimal components retrieved by each distance for each query. For the sake of precision, and perhaps at the expense of some loss of recall, we select synthetic criteria that refine the selection produced by these measures of functional distance. Specifically, we consider two synthetic criteria, which we present below:

• Universal Optimal. A component is said to be universal optimal if and only if it is optimal by all four measures of distance.
• Majority Optimal. A component is said to be majority optimal if and only if it is optimal by at least two measures of distance.
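The drawing rule tabulated above amounts to a four-way case analysis on the two proof outcomes. A minimal sketch follows; the function name `placement` and its boolean arguments are assumptions, and the outcomes are supplied by hand here rather than obtained from a theorem prover.

```python
def placement(ck_over_ch_proved: bool, ch_over_ck_proved: bool) -> str:
    """How Ck and Ch are drawn, given the outcome of the two theorems.

    ck_over_ch_proved: the distance to Ck was proved to refine the distance to Ch.
    ch_over_ck_proved: the distance to Ch was proved to refine the distance to Ck.
    """
    if ck_over_ch_proved and ch_over_ck_proved:
        return "Ch and Ck are the same node"
    if ck_over_ch_proved:
        return "Ch drawn above Ck"   # Ck is at least as far from the query
    if ch_over_ck_proved:
        return "Ck drawn above Ch"
    return "Ch and Ck unrelated"

for outcome in [(True, True), (True, False), (False, True), (False, False)]:
    print(outcome, "->", placement(*outcome))
```

Note that "Not Proved" is treated the same as "false" in this sketch; with a real prover, a failed proof attempt may only mean that the search ran out of resources.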

[Figure 6: Optimal Components by Query and Distance. Rows: queries K1 to K5; columns: Functional Consensus, Refinement Difference, Refinement Distance, Refinement Ratio.]

Figure 7 gives the selected components that are returned for each query by the two synthetic criteria. From this preliminary experiment, it does appear that the universal optimal criterion is adequate for most cases: it gives two to three components that are known to optimize all measures of functional distance.
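The two synthetic criteria reduce to an intersection and a count over the per-measure sets of optimal components. The sketch below uses hypothetical sets (they are not the entries of figure 6), and the function names and the `threshold` parameter are assumptions.

```python
from collections import Counter

def universal_optimal(optimal_by_measure):
    """Components that are optimal by every measure of distance."""
    return set.intersection(*optimal_by_measure.values())

def majority_optimal(optimal_by_measure, threshold=2):
    """Components that are optimal by at least `threshold` measures."""
    counts = Counter(c for optimal in optimal_by_measure.values() for c in optimal)
    return {c for c, n in counts.items() if n >= threshold}

# Hypothetical optimal sets for one query, one entry per measure.
optimal = {
    "functional consensus": {"C5", "C9"},
    "refinement difference": {"C5", "C9"},
    "refinement distance": {"C5"},
    "refinement ratio": {"C3", "C5"},
}
print(sorted(universal_optimal(optimal)))  # ['C5']
print(sorted(majority_optimal(optimal)))   # ['C5', 'C9']
```

Every universal optimal component is also majority optimal, so the first criterion favours precision and the second recall.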

[Figure 7: Optimal Components by Query and Synthetic Criterion. Rows: queries K1 to K5; columns: Universal Optimal, Majority Optimal.]

7 Adaptation Effort

When we perform approximate retrieval, it is with the hope that retrieved components are easy to adapt to satisfy the query at hand. All the measures of functional distance that we have presented in this paper are justified by the fact that they help us predict adaptation effort. In this section, we discuss an experiment that we are currently running to determine whether there is some correlation between the measures of distance we have and some empirical estimates of adaptation effort. In order to assess adaptation effort in the specific context of our experiment, we have adopted the following procedure:

1. We consider the decomposition of a compiler into a set of modules, namely: the lexical analyzer, the syntactic analyzer, the semantic analyzer, the global optimizer, the code generator, and the peephole optimizer.

2. For each pair made up of an available compiler $C$ and a query $K$, we consider all the modules of $C$ that are affected by an adaptation of $C$ to satisfy $K$. For example, if $C$ handles simple Pascal and $K$ requires full Pascal, then most modules will likely be affected: the lexical analyzer, the syntactic analyzer, the semantic analyzer, the code generator, and perhaps also the optimizers.

3. For each module that we have selected for modification, we consider a five-rating scale that reflects the extent of the required modification; this may range from minor modification to a complete rework. To account for the non linear effects of program modification effort, we consider the curve of figure 8 (due to Selby et al [2, 40]), in which we divide the Y axis (which measures the ratio of modification effort) into five equal parts, and see how the X axis (which measures ratios of modified code) is divided. The following table shows how ratings are assigned as a function of the amount of code modified:

   Rating                        Minor          Small          Medium         Major          Complete
                                 Modification   Modification   Modification   Modification   Rework
   Percentage of Code Modified   5 %            10 %           40 %           80 %           100 %

4. For each pair $(K, C_i)$, where $C_i$ is a compiler of the library, we establish a table that indicates to what extent each module of $C_i$ must be modified to accommodate $K$.

[Figure 8: Non Linear Reuse Effects. Vertical axis: relative cost; horizontal axis: amount adapted.]

5. For each pair of components $(C_i, C_j)$, we consider that $C_i$ is harder to adapt to $K$ than $C_j$ if and only if for each entry of the table, the rating of $C_i$ is higher than or equal to the rating of $C_j$.

6. For a given $K$, we rank all the components of the library according to the ordering defined in the previous item; this is, clearly, a partial ordering.

Once we have established an ordering between components, we consider the orderings induced by all the measures of functional distance and check whether one measure of distance is a good predictor of adaptation effort by checking whether the ordering induced by adaptation effort looks like one of the graphs produced by each measure of distance. To assess how much two graphs look alike, we may use one of two measures:

• The number of edges that are in one graph and not in the other.
• The ratio of common edges over the total number of edges.

This work is currently under way, at the time of writing. In addition to running this experiment on the compilers example, by varying $K$, we also wish to run it on other examples, using the same principle as above to order adaptation efforts.
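The procedure above can be sketched in a few functions. Several points are assumptions of the sketch rather than statements of the paper: the percentages of the rating table are read as upper bounds of each band, an untouched module gets rating 0, the "total number of edges" is taken to be the size of the union of the two edge sets, and all component names and figures in the example are hypothetical.

```python
# Upper bound (percent of code modified) of each rating band; assumed reading of the table.
BANDS = [(5, "minor modification"), (10, "small modification"),
         (40, "medium modification"), (80, "major modification"),
         (100, "complete rework")]

def rating(percent_modified: float) -> int:
    """Rating 1..5 for a modified-code percentage (0 for an untouched module)."""
    if percent_modified == 0:
        return 0
    for level, (upper_bound, _label) in enumerate(BANDS, start=1):
        if percent_modified <= upper_bound:
            return level
    raise ValueError("percentage must lie between 0 and 100")

def harder_or_equal(ratings_i, ratings_j):
    """Item 5: Ci is harder to adapt than Cj iff every module rating is >=."""
    return all(ratings_i[m] >= ratings_j[m] for m in ratings_i)

def edge_difference(edges_a, edges_b):
    """Number of edges that are in one graph and not in the other."""
    return len(edges_a ^ edges_b)

def common_edge_ratio(edges_a, edges_b):
    """Common edges over all edges of both graphs (union assumed as the total)."""
    total = len(edges_a | edges_b)
    return len(edges_a & edges_b) / total if total else 1.0

# Hypothetical per-module estimates for two components and one query.
c1 = {"lexical analyzer": rating(5), "code generator": rating(60)}
c2 = {"lexical analyzer": rating(5), "code generator": rating(30)}
print(c1, c2)
print(harder_or_equal(c1, c2), harder_or_equal(c2, c1))  # True False

# Hypothetical edge sets of an effort graph and a distance graph.
effort_graph = {("C0", "C1"), ("C1", "C5")}
distance_graph = {("C0", "C1"), ("C0", "C5")}
print(edge_difference(effort_graph, distance_graph))    # 2
print(common_edge_ratio(effort_graph, distance_graph))  # one common edge out of three
```

Because `harder_or_equal` requires every module rating to dominate, two components whose ratings cross (one harder on the lexical analyzer, the other on the code generator) stay unrelated, which is what makes the resulting ordering partial.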

8 Conclusion

8.1 Summary and Assessment

In this paper, we have considered four measures of functional distance between specifications, and have discussed how they can be used to perform approximate retrieval from a software library in a systematic manner: for each measure of distance, we formulate the approximate search as the identification of those library components that minimize the distance to the query. Further, we have illustrated the approximate retrievals derived from our four measures of distance on a running example that involves twelve software components; for each measure of distance, we show how the components of the library are ordered according to their proximity to the user query. We are encouraged by the observation that overall, the various measures of distance produce distinct graphs, reflecting the fact that they portray different aspects of functional distance. We are also encouraged by the observation that, as we vary $K$ to range over the whole storage structure, the selection of optimally close components yields distinct candidates, reflecting the fact that measures of distance do indeed recognize which components are closest to the query (rather than systematically returning the most feature-loaded components). Recognizing that, ultimately, the purpose of a procedure of approximate retrieval is really to minimize adaptation effort, we have discussed the design of an experiment which is currently under way to correlate the measures of functional distance to an empirical measure of adaptation effort.


8.2 Comparisons

Even though it is not the most crucial success factor in software reuse, the issue of component storage and retrieval has received widespread attention to date [1, 5, 13, 15, 19, 26, 27, 31, 32, 35, 36, 38, 39, 43, 45, 47, 48]. Part of the reason, perhaps, is the scientific interest of the problem, and the technical challenge inherent in its solution. Other reasons for the high profile of this issue include the fact that software libraries are useful for other purposes than software reuse [47], as well as the expectation that software reuse libraries of the future will be quite large [9], hence will require adequate storage and retrieval structures.

In a recent survey [21], Mili et al divide methods of components storage and retrieval into six families, which are: information retrieval methods, which apply traditional information retrieval technology to the problem of software libraries; descriptive methods, which use specialized library science-like techniques and tools; operational methods, which exploit the operational semantic features of software components to perform storage and retrieval; denotational methods, which use the denotational semantics of components to perform classification and retrieval; topological methods, which perform retrieval by minimizing some measure of (semantic or syntactic) distance; and finally structural methods, which match components against queries on the basis of the components' structure rather than their function. The method we present here can be classified as a denotational method (because it relies on a denotational semantic definition of components) and can be classified as a topological method, because it is based on minimizing some measure of distance. We view it primarily as a topological method, and will briefly compare it to other such methods.

In [27] (1992) Ostertag, Hendler, Prieto-Diaz and Braun present an AI based library system called AIRS (AI based Reuse System). The AIRS library contains components and/or packages.
Both assets and queries are represented using features. A feature represents a classification criterion and is defined by a finite set of related values called terms. The retrieval method is based on the computation of similarity metrics that make it possible to compare either components or packages. The first metric, used to compare components, is based on the subsumption and closeness relations. The second metric, called the package distance, measures the effort required to implement a target package given a candidate package. This distance is computed as the sum of the distances between the feature terms, plus the distance between the member sets. It is computed by mapping each component in a target member set to its best reuse component in a candidate member set, as defined by either the closeness or the subsumption relation, and then summing up the distances between these pairs of components. Ostertag et al. present a prototype of the AIRS system using the EVB GRACE and the CTC CCIS libraries. The first library includes Ada packages that implement data structures; the second contains C modules that implement basic functionalities of command, control and information systems. The authors describe the classification models used for both libraries.

In [41, 42] (1993), Spanoudakis and Constantopoulos introduce a conceptual modeling language (under the name TELOS) which is well adapted to the representation of software artifacts, specifically within the context of object-oriented analysis and design. To this effect, the language supports the representation of object classes, and distinguishes between attributes and entities. Spanoudakis and Constantopoulos use this language to represent queries and assets, and define a measure of structural distance between queries and assets on the basis of an analysis of their TELOS representations. The distance they introduce is a weighted linear combination of four functions, which reflect whether relevant entities in the query and the asset are identical and to what extent the query and the asset have common attributes via their shared subclasses and their shared superclasses. Spanoudakis and Constantopoulos built a prototype of their software library on the basis of the foregoing definitions of distance, and integrated it with the existing Semantic Index System.

In [11, 12] (1994), Girardi and Ibrahim introduce a structural measure of distance between software assets, and use it to perform retrieval in a software library. Library assets are represented using case-frame-like representations that are derived from a declarative definition of the asset in natural language; queries are derived in a similar fashion from an imperative definition of the desired requirements. The distance between a query and a candidate asset is defined by a linear combination of weighted terms, where each term corresponds to a slot of the case-frame. The term associated with a given slot is the product of two factors: a weight, which reflects the relative importance of the slot in defining the function of the asset; and a similarity index, which reflects to what extent the slot of the query and the slot of the candidate asset are similar. The weight can be determined by the domain analyst who stores the asset in the library, while the similarity index can be retrieved from specialized natural language thesauri.
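As an illustration only, the weighted combination over case-frame slots described above might be sketched as follows in Python. The slot names, the weights and the `toy_similarity` function are hypothetical stand-ins (the similarity index would in practice come from a thesaurus), and the treatment of missing slots is an assumption, not something stated in [11, 12].

```python
from typing import Callable, Dict

# Hypothetical case-frame: slot name -> filler term.
CaseFrame = Dict[str, str]


def weighted_slot_combination(
    query: CaseFrame,
    asset: CaseFrame,
    weights: Dict[str, float],
    similarity: Callable[[str, str], float],
) -> float:
    """Sum, over the weighted slots, of weight * similarity index."""
    total = 0.0
    for slot, weight in weights.items():
        # Assumption: a slot missing on either side contributes nothing.
        if slot in query and slot in asset:
            total += weight * similarity(query[slot], asset[slot])
    return total


def toy_similarity(a: str, b: str) -> float:
    """Stand-in for a thesaurus lookup: 1 if equal, 0.5 if listed as related."""
    if a == b:
        return 1.0
    related = {frozenset({"sort", "order"})}
    return 0.5 if frozenset({a, b}) in related else 0.0


# Illustrative weights, as a domain analyst might assign them.
weights = {"action": 0.6, "object": 0.4}
query = {"action": "sort", "object": "list"}
asset = {"action": "order", "object": "list"}

score = weighted_slot_combination(query, asset, weights, toy_similarity)
print(score)
```

With these toy values the two frames agree fully on the object slot and only partially on the action slot, so the combined value is about 0.7. How such a value is turned into a ranking of candidate assets is left open in this sketch.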
Using existing natural language technology, Girardi and Ibrahim produce a large-scale prototype that supports their retrieval method; they call it ROSA (Reuse Of Software Artifacts).

In [28, 29] (1995), Penix and Alexander discuss a formal specification-based method for the storage and retrieval of software components in a software library. In order to improve the recall of their method, they make provisions for approximate retrieval whenever exact retrieval fails to produce assets, or produces too few. Their approximate retrieval operates in a similar manner to their exact retrieval (by matching query features against asset features), with the difference that query features are generalized, so that the range of assets that satisfy them is widened. Generalization is achieved by weakening the criteria of feature identity or by deleting features from the query. The proximity of a candidate asset to the query is then assessed by considering the number of query features that are satisfied by the asset, as well as the extent to which they are satisfied; in this sense, the approximate retrieval of Penix and Alexander can be viewed as a topological method.

In [7] (1996), Faustle et al. propose a classification and retrieval method for object-oriented repositories, based on the use of fuzzy logic. The approach is developed within the Ithaca application development environment [14]. The library, called Software Information Base (SIB), is organized according to the Telos knowledge representation language [17]; SIB entries are Telos classes. A software description consists of a set of keyword pairs, also called features, which describe the behavior of the asset. The number of features is unbounded. Each feature is weighted with a fuzzy value called the relevance index. Queries have the same structure as assets; they are represented by class attributes and software descriptions. Similarity between a query and a candidate asset is assessed in terms of a Confidence Value (CV); CV is a function that takes two software descriptions as inputs and returns a value between 0 and 1, called the similarity index. Faustle et al. present a prototype that is used to perform an evaluation of their approach. The experiment presented is based on 87 C++, Smalltalk and Eiffel assets.

Our work differs from that of Ostertag et al. [27] and Girardi et al. [11, 12], who use artificial intelligence and natural language techniques to measure distances and assess proximity. It also differs from that of Spanoudakis et al. [41, 42] and Faustle et al. [7], who use TELOS as their representation language, along with the conceptual modeling technique that the language supports. Like us, Penix and Alexander [28, 29] represent queries and components by formal specifications; unlike us, however, they perform approximate retrieval by running an exact match with a weakened specification.
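The topological view that the methods surveyed in this section share, namely retrieval by minimizing some measure of distance to the query, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: specifications are modeled as plain feature sets, and `toy_distance` (size of the symmetric difference) is a hypothetical placeholder, not one of the functional distances defined in this paper nor the metric of any cited system.

```python
from typing import Callable, Dict, FrozenSet, List

# Assumption: a specification is modeled as a set of feature names.
Spec = FrozenSet[str]


def approximate_retrieve(
    query: Spec,
    library: Dict[str, Spec],
    distance: Callable[[Spec, Spec], float],
) -> List[str]:
    """Return the names of all components at minimal distance from the query."""
    if not library:
        return []
    scored = {name: distance(query, spec) for name, spec in library.items()}
    best = min(scored.values())
    return sorted(name for name, d in scored.items() if d == best)


def toy_distance(a: Spec, b: Spec) -> float:
    """Placeholder distance: number of features in exactly one of the two sets."""
    return float(len(a ^ b))


# Hypothetical library and query, for illustration only.
library = {
    "stack": frozenset({"push", "pop", "top"}),
    "queue": frozenset({"enqueue", "dequeue", "front"}),
    "deque": frozenset({"push", "pop", "enqueue", "dequeue"}),
}
query = frozenset({"push", "pop", "peek"})

print(approximate_retrieve(query, library, toy_distance))
```

Passing the distance as a parameter keeps the retrieval loop independent of how proximity is measured, so any of the measures discussed above could be substituted for the placeholder.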

References

[1] S.P. Arnold and S.L. Stepoway. The reuse system: cataloguing and retrieval of reusable software. Proceedings, COMPCON'87, pp 376-379. IEEE Computer Society, 1987.
[2] B. Boehm, B. Clark, E. Horowitz, C. Westland, R. Madachy, and R. Selby. Cost models for future software life cycle processes. Annals of Software Engineering, Special Volume on Software Process and Product Measurement, 1995.
[3] N. Boudriga, F. Elloumi, and A. Mili. The lattice of specifications: Applications to a specification methodology. Formal Aspects of Computing, 4:544-571, 1992.
[4] Ch. Brink, W. Kahl and G. Schmidt (editors). Relational Methods in Computer Science. Wien: Springer Verlag, 1997.
[5] P. Devanbu, R. Brachman, P. Selfridge, and B. Ballard. LaSSIE: A knowledge-based software information system. Communications of the ACM, Vol 34, No 5, pp 34-49 (May 1991).
[6] R. Di Cosmo. Type isomorphisms in a type-assignment framework. Proceedings, 19th Annual Symposium on Principles of Programming Languages. New York, NY: ACM Press, 1992.
[7] S. Faustle, M.G. Fugini, and E. Damiani. Retrieval of reusable components using functional similarity. Software Practice and Experience, 26(5):491-530, May 1996.
[8] W.B. Frakes and B.A. Nejmeh. An information system for software reuse. Proceedings, 10th Minnowbrook Workshop on Software Reuse, 1987.
[9] W.B. Frakes and T.P. Pole. IEEE Transactions on Software Engineering, Vol 20, No 8 (August 1994), pp 617-630.
[10] M.G. Fugini and S. Faustle. Retrieval of reusable components in a development information system. In Proc. 2nd Intl. Workshop on Software Reusability, pages 89-98, Lucca, Italy, March 1993.

[11] R. Girardi and B. Ibrahim. Automatic indexing of software artifacts. In W.B. Frakes, editor, Third International Conference on Software Reuse, Rio de Janeiro, Brazil, November 1994.
[12] R. Girardi and B. Ibrahim. A similarity measure for retrieving software artifacts. In W. Berztiss, editor, International Conference on Software Engineering and Knowledge Engineering, Jurmala, Latvia, June 1994.
[13] R.J. Hall. Generalized behaviour-based retrieval. In Proceedings, International Conference on Software Engineering, Baltimore, MD, May 1993.
[14] Ithaca. Integrated toolkit for highly advanced computer applications. Technical Report 2705, EEC-Esprit II Project, December 1993.
[15] J.J. Jeng and B.H.C. Cheng. Formal methods applied to reuse. Proceedings, 5th Workshop on Software Reuse. Palo Alto, CA, 1992.
[16] S. Katz, C.A. Richter, and T-S. The. Paris: A system for reusing partially interpreted schemas. In Proc. 9th Intl. Conf. on Software Engineering, pages 377-385, Monterey, CA, April 1987.
[17] M. Koubarakis. Telos: Features and formalization. Technical Report KRR-TR-89-1, University of Toronto, 1989.
[18] Y.S. Maarek and D.M. Berry. The use of lexical affinities in requirements extraction. Proceedings, 5th International Workshop on Software Specification and Design. Pittsburgh, PA, May 1989.
[19] Y.S. Maarek, D.M. Berry, and G.E. Kaiser. An information retrieval approach for automatically constructing software libraries. IEEE Trans. on Soft. Eng., Vol 17, No 8 (August 1991), pp 800-813.
[20] R. Mili, R. Mittermeir, and A. Mili. Storing and retrieving software components: A refinement based approach. In Proceedings, International Conference on Software Engineering, 1994.
[21] A. Mili, R. Mili and R. Mittermeir. A Survey of Software Storage and Retrieval. University of Ottawa, Ottawa, Ont., May 1997.
[22] R. Mili. Assessing the Reusability of a Software Component: Empirical and Analytical Approaches. PhD Dissertation. University of Ottawa, 1996.
[23] R. Mili, J. Desharnais, M. Frappier and A. Mili. Measures of Syntactic and Semantic Distance Between Specifications. University of Texas at Dallas, School of Engineering and Computer Science, 1996.
[24] R.T. Mittermeir and E. Kofler. Layered specifications to support reusability and integrability. Journal of Systems Integration, 3(3):273-302, Sept. 1993.

[25] Th. Moineau and M.C. Gaudel. Software reusability through formal specifications. In Proc. 1st Intl. Workshop on Software Reusability, number Memo Nr 57, Dortmund, 1991.
[26] F. Nishida, S. Takamatsu, Y. Fujita, and T. Tani. Semi-automatic program construction from specifications using library modules. IEEE Transactions on Software Engineering, Vol 17, No 9, pp 853-870 (September 1991).
[27] E. Ostertag, J. Hendler, R. Prieto-Diaz, and C. Braun. Computing similarity in a reuse library system: An AI-based approach. ACM TOSEM, 1(3):205-228, July 1992.
[28] J. Penix and P. Alexander. Design representation for automating software component reuse. In Proceedings, Fifth International Workshop on Knowledge Based Systems for the (Re)Use of Software Libraries, November 1995.
[29] J. Penix and P. Alexander. Efficient specification based component retrieval. Technical report, University of Cincinnati, Knowledge Based Software Engineering Laboratory, ECECS, July 1996.
[30] D.E. Perry. The Inscape environment. Proceedings, 11th International Conference on Software Engineering, pp 2-12. IEEE Computer Society Press, 1989.
[31] D.E. Perry and S.S. Popovich. Inquire: Predicate-based use and reuse. Proceedings, Knowledge Based Software Engineering Conference. Chicago, IL, September 1993.
[32] A. Podgurski and L. Pierce. Behaviour sampling: A technique for automated retrieval of reusable components. In Proceedings, 14th International Conference on Software Engineering, pp 300-304. New York, NY: ACM Press, 1992.
[33] A. Podgurski and L. Pierce. Retrieving reusable software by sampling behavior. ACM TOSEM, 2(3):286-303, July 1993.
[34] R. Prieto-Diaz and P. Freeman. Classifying software for reusability. IEEE Software, 4(1):6-16, 1987.
[35] R. Prieto-Diaz. Classification of reusable modules. Software Reusability, volume 1: Concepts and Models. T.J. Biggerstaff and A.J. Perlis, editors. New York, NY: ACM Press, 1989.
[36] M. Rittri. Using types as search keys in function libraries. Conference on Functional Programming Languages and Computer Architectures. Reading, MA: Addison Wesley, 1989.
[37] W. Rossak and R.T. Mittermeir. A DBMS based repository for reusable software components. In Proc. 2nd Intl. Workshop Software Engineering and Its Applications, pages 501-518, Toulouse, France, 1989.


[38] C. Runciman and I. Toyn. Retrieving reusable software components by polymorphic type. Conference on Functional Programming Languages and Computer Architectures. Reading, MA: Addison Wesley, 1989.
[39] M. Sitaraman. Performance parameterized reusable software components. International Journal of Software Engineering and Knowledge Engineering, Vol 2, No 4, pp 567-587, 1992.
[40] R.W. Selby. Empirically analysing software reuse in a production environment. In W. Tracz, editor, Software Reuse: Emerging Technology. IEEE Computer Society Press, Los Alamitos, California, 1988.
[41] G. Spanoudakis and P. Constantopoulos. Similarity for analogical software reuse: A conceptual modelling approach. In Proceedings, CAiSE '93, LNCS vol 685, June 1993.
[42] G. Spanoudakis and P. Constantopoulos. Measuring similarity between software artifacts. In W. Berztiss, editor, International Conference on Software Engineering and Knowledge Engineering, Jurmala, Latvia, June 1994.
[43] R.A. Steigerwald. Reusable component retrieval with formal specifications. In Proceedings of the 5th Annual Workshop on Software Reuse, October 1992.
[44] D.W.J. Stringer-Calvert. Signature matching for Ada software reuse. Master's thesis, University of York, York, UK.
[45] B.W. Weide, W.F. Ogden, and S.H. Zweben. Reusable software components. In M.C. Yovits, editor, Advances in Computers, pages 1-65. Academic Press, 1991.
[46] L. Wos, R. Overbeek, E. Lusk, and J. Boyle. Automated Reasoning: Introduction and Applications. McGraw Hill, New York, NY, 1992.
[47] A. Moormann Zaremski and J.M. Wing. Signature matching: A tool for using software libraries. ACM Transactions on Software Engineering and Methodology, 4(2):146-170, April 1995.
[48] A. Moormann Zaremski and J.M. Wing. Specification matching of software components. In Proceedings, SIGSOFT '95: Third ACM SIGSOFT Symposium on the Foundations of Software Engineering. New York, NY: ACM Press.
