Clustering Support for Automated Tracing Chuan Duan and Jane Cleland-Huang Center for Requirements Engineering DePaul University 243 S. Wabash Avenue, Chicago IL 60604 312-362-8863

{duanchuan,jhuang}@cs.depaul.edu

ABSTRACT

Automated trace tools dynamically generate links between various software artifacts such as requirements, design elements, code, test cases, and other less structured supplemental documents. Trace algorithms typically utilize information retrieval methods to compute similarity scores between pairs of artifacts. Results are returned to the user as a ranked set of candidate links, and the user is then required to evaluate the results by performing a top-down search through the list. Although clustering methods have previously been shown to improve the performance of information retrieval algorithms by increasing understandability of the results and minimizing human analysis effort, their usefulness in automated traceability tools has not yet been explored. This paper evaluates and compares the effectiveness of several existing clustering methods to support traceability; describes a technique for incorporating them into the automated traceability process; and proposes new techniques based on the concepts of theme cohesion and coupling to dynamically identify optimal clustering granularity and to detect cross-cutting concerns that would otherwise remain undetected by standard clustering algorithms. The benefits of utilizing clustering in automated trace retrieval are then evaluated through a case study.

Categories and Subject Descriptors

D.2.7 [Distribution, Maintenance, and Enhancement]: Restructuring, reverse engineering, and reengineering.

General Terms

Management, Documentation.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ASE'07, November 5-9, 2007, Atlanta, Georgia, USA. Copyright 2007 ACM 978-1-59593-882-4/07/0011...$5.00.

1. INTRODUCTION

Software traceability supports a wide variety of activities throughout the software development life cycle. For example, traces are used to validate and verify requirements, support impact analysis and regression test selection, understand requirements rationales, and retrieve supplemental information from supporting documents [15]. Traceability links are typically stored in a trace matrix that is constructed manually by analysts during system development. Unfortunately, building and maintaining complete and accurate trace matrices is arduous and effort-consuming, and so practitioners often fail to implement consistent and effective traceability processes [4,15].

One solution to this problem is to utilize information retrieval methods such as the vector space model (VSM) [17], the probabilistic network model (PN) [3,4], or latent semantic indexing (LSI) [1,23] to automatically generate traceability links. These methods compute similarity scores between pairs of artifacts, and those scoring above a certain threshold value are considered candidate links. Numerous studies have been conducted to evaluate the effectiveness of automated trace construction techniques. Although results are strongly affected by the quality of the dataset, in most cases recall of 90% is achievable at precision levels of 5-30% [1,3,4,11,15,17,23], where recall measures the number of correct links retrieved over the total number of correct links, and precision measures the number of correct links retrieved over the total number of retrieved links [13]. Several tools have been created to support trace generation, including Poirot [22], RETRO [17], and ADAMS [7]. In each of these tools, the user issues a trace query and the tool generates a set of candidate links which are presented to the user for evaluation. However, in any non-trivial system the number of candidate links can be extensive, meaning that although automated trace tools alleviate the effort of the tracing task, the analyst is still required to evaluate a significant number of links. As Domges et al. previously observed, an excessive number of links can lead to an 'unwieldy tangle' that is hard for an analyst to understand or use effectively [9].

Most traceability algorithms and tools assume a straightforward and simple model in which all related artifacts are retrieved and displayed to the user as a sequential list, typically ranked in order of similarity score [7,17,22]. This simple model has serious drawbacks because the retrieved list is unstructured, hard to understand, and, more importantly, isolates the retrieved artifacts from their context. These limitations hinder comprehension and evaluation of the generated links. In fact, humans naturally attempt to compensate for inherent limitations in perception and memorization by creating categories and hierarchies.

This research therefore investigates the use of clustering within automated traceability, and evaluates its usefulness for improving trace performance by increasing understandability and reducing the human effort needed to evaluate a set of candidate links. The remainder of the paper is laid out as follows. Section 2 provides a survey of related work. Section 3 describes three standard clustering algorithms and evaluates them against the requirements of three different projects.

Section 4 describes several techniques for measuring the quality of the generated clusters and for determining optimal cluster granularity. A new technique for identifying ideal cluster granularity based on the concept of a dominant theme is introduced. Section 5 introduces the concept of cross-cutting concerns, which represent themes that cut across the dominant clustering of the system and are largely ignored by standard clustering algorithms. A technique for mining these concerns and promoting them to the cluster level is described. Section 6 then illustrates and evaluates the overall benefits of cluster-based tracing through a case study implemented in our web-based tracing tool named Poirot [22]. Finally, Section 7 concludes with an overall analysis of cluster-based tracing and a discussion of future work.

2. CLUSTERING IN INFORMATION RETRIEVAL

The idea of using clustering to improve the performance of general information retrieval (IR) algorithms is certainly not a new one. It has been employed to address several issues such as retrieval performance improvement [19], document browsing [6], topic discovery [12], organization of search results [30], and concept decomposition [8]. Clustering has been used to improve both the quality of an IR query and the organization of the results. On the query side, clustering can be particularly useful for addressing the well-known "short query" problem, in which users enter insufficient information to yield a rich set of results. In this context, historical query data is collected, and queries are placed into clusters. When a new query is received, it is compared to existing query clusters and associated with the one it is most similar to. Terms from the cluster are then used to expand and enhance the current query. Sometimes previously validated results are associated with each of the standard query clusters [29].

On the result side, several different techniques have been used to cluster documents either around keywords and phrases, or according to external factors such as shared references or incoming links [21]. For document clustering, a hierarchical approach is generally preferred because of its natural fit with the hierarchy found in many documents. Clusters are constructed from the bottom up by progressively identifying the most similar elements and placing them into a shared cluster. Steinbach et al. [27] compared agglomerative hierarchical clustering and standard partitional K-means against bisecting K-means, a top-down approach that starts with a single cluster and progressively bisects the "poorest" cluster at each stage into a pair of sub-clusters by applying K-means. They found that the bisecting approach outperformed the others in generating better clusters [31].

Leuski [21] conducted a series of experiments that demonstrated the value of clustering results from information retrieval queries. He compared the performance of six variations of the agglomerative hierarchical clustering algorithm on TREC data sets [28] and designed a search strategy that improved recall and precision based on retrieval feedback. Bellot and El-Beze integrated agglomerative hierarchical clustering and K-means clustering to achieve better clustering structure but found insignificant improvement in retrieval precision [2]. They performed regression analysis to identify the optimal relationship between cluster number and query size. Clustering has also made its way into web-based search engines such as Clusty [5] and Quintura, where it is used primarily to organize query results.

Although clustering methods have been used effectively in various IR contexts, they have not yet been systematically applied to improve automated trace generation. In fact, the context of trace generation is significantly different from that of a more typical IR query for several reasons. First, a trace query is generally not as short and precise as a typical IR query. Instead of containing keywords that are designed specifically for searching, a trace query is composed from the text of a requirement or other software artifact. Although ideally every requirement is concise, primitive, and unambiguous, the reality is that requirements can often be relatively long and may also contain superfluous information. This problem tends to reduce the precision of the retrieved results.

Second, in the trace problem the searchable documents are composed of individual requirements, classes, or test cases that tend to be significantly smaller than typical documents targeted in a web search or online library search. This makes it feasible to cluster entire sets of documents instead of just clustering the retrieved ones, meaning that additional contextual information can be presented to the user. Furthermore, unlike documents on the web or in a library, software engineering artifacts rarely contain explicitly defined keywords that have been inserted to support document retrieval. These differences create new challenges for trace retrieval and mean that clustering techniques must be fine-tuned to support the tracing environment.

Finally, it is interesting to note that the ultimate goal of a trace query is subtly different from that of an IR query. Whereas for many IR-style queries it is sufficient to find a good subset of relevant documents, in automated traceability it is essential to retrieve close to 100% of the related artifacts. Undiscovered links can lead to serious problems such as failing to fully understand the impact of a change or falsely implying that a requirement is incompletely implemented in the design.

3. CLUSTERING IN TRACEABILITY

In order to apply clustering to the traceability problem, it is useful to first evaluate the effectiveness of several standard clustering algorithms applied in a typical tracing environment. This section therefore briefly describes three clustering algorithms and evaluates their abilities to cluster requirements from three different datasets. The three algorithms are agglomerative hierarchical clustering, K-means, and bisecting divisive clustering.

3.1 Computing Similarity

All three of these algorithms utilize a term-based approach to compute the similarity between artifacts. For the experiments described in this paper, these similarity scores were computed using a probabilistic network algorithm which has been deployed in our traceability tool Poirot [22] and validated through numerous experiments [3,5,22]. Each artifact is preprocessed to eliminate 'stop' words, which represent commonly occurring words such as conjunctions, adverbs, and pronouns that are too widespread to be useful in computing similarity scores. The remaining words are then stemmed to their root forms using Porter's stemming algorithm to remove suffixes and prefixes [13]. These stemmed words are labeled 'terms'. Each artifact is then represented as a vector in a metric space R^m determined by an ordered set of terms T, where T = {t_1, t_2, ..., t_m}.
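As an illustration of the preprocessing pipeline described in Section 3.1, the sketch below builds term-frequency vectors from raw artifact text. The stop-word list is an illustrative subset, and the suffix-stripping `stem` function is a toy stand-in for Porter's full algorithm; both are assumptions for demonstration only.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "or", "shall", "to", "of", "be", "is", "are", "when"}  # illustrative subset

def stem(word):
    # Toy suffix-stripping stand-in for Porter's stemming algorithm.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Lowercase, tokenize, drop stop words, and stem the remainder into 'terms'.
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]

def term_vector(text, terms):
    # Frequency of each term of the ordered term set T in this artifact.
    counts = Counter(preprocess(text))
    return [counts[t] for t in terms]

artifacts = [
    "The system shall update the thermal map when new road sensors are added",
    "Road sensors shall report freezing conditions to the weather station",
]
terms = sorted({t for a in artifacts for t in preprocess(a)})   # ordered term set T
vectors = [term_vector(a, terms) for a in artifacts]
```

Each resulting vector (f_i1, ..., f_im) is the artifact's representation in the term space described in the text.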

The vector for artifact a_i comprises the frequencies with which each term occurs in a_i, written as (f_{i1}, f_{i2}, ..., f_{im}). The similarity score between any pair of artifacts a_1 and a_2 is then computed based on the similarity s(a_1, a_2) of their vectors as follows:

s(a_1, a_2) = pr(a_2 | a_1) = \sum_{i=1}^{m} pr(a_2 | t_i) \, pr(a_1, t_i) / pr(a_1)    (1)

pr(a_2 | t_i) is computed as pr(a_2 | t_i) = f_{2i} / \sum_{k} f_{2k} and estimates the extent to which the information in t_i describes the concept a_2 with respect to the total number of terms in the artifact, while pr(a_1, t_i) is computed as pr(a_1, t_i) = f_{1i} / n_i, where n_i represents the number of artifacts in the collection containing the term t_i. Finally, pr(a_1) is computed as pr(a_1) = \sum_{i} pr(a_1, t_i) and represents the extent to which the term t_i describes the concept a_1 [3,4]. During the clustering phase, similarity scores are computed for each potential pair of artifacts.

3.2 Agglomerative hierarchical clustering

Agglomerative hierarchical clustering [18] provides a bottom-up approach in which smaller clusters are iteratively combined into larger ones. Initially, each object is placed into an individual cluster. At each step in the process, the two clusters exhibiting the highest similarity score are combined into a higher-level cluster.

There are several variants to this approach. For example, cluster proximity could be computed as the distance between the two closest inter-cluster elements or as the distance between the two most distant elements. One of the most common methods, known as the Unweighted Pair Group Method with Arithmetic Mean (UPGMA), has been shown to generally outperform other approaches [18]. UPGMA calculates the average similarity between all pairs of inter-cluster artifacts in clusters c_i and c_j as:

S(c_i, c_j) = \frac{1}{|c_i| |c_j|} \sum_{a_1 \in c_i, a_2 \in c_j} s(a_1, a_2)    (2)

Agglomerative hierarchical clustering has been widely used for textual clustering due to its simplicity and because it has been shown to produce acceptable clusterings [21]. However, it has a high time complexity of O(n^2), and once an element is placed into a cluster there is no way to backtrack and place it into a more appropriate cluster that may surface later in the clustering process.

3.3 K-means clustering

The partitional clustering paradigm provides an alternative approach in which a search is made for the best way to group a set of artifacts D = {a_1, a_2, ..., a_n} into clusters C = {c_1, c_2, ..., c_k} in order to optimize a global objective function E expressed over C. In text-based mining, an intuitive and widely used objective function is computed as the sum of cohesion:

E = \sum_{i=1}^{k} \sum_{a \in c_i} s(a, m_i)

where m_i is the centroid of cluster c_i, and s(a, m_i) represents the similarity score between artifact a and m_i. In K-means clustering a predefined number of centroids are first created, each one representing a cluster. The distance from each artifact to each of these centroids is computed, and each artifact is placed into the cluster for which it exhibits the closest proximity. Centroids are then adjusted to represent the center of each cluster.

The initial selection of centroids can have a significant impact on the final clustering, and the algorithm performs best when initial centroids are as dissimilar to each other as possible. For experimental purposes the following heuristics were adopted:

1. Select the optimal number of clusters k. (Note this is discussed in Section 4.)
2. Initialize a set of centroids M = {m_1, m_2, ..., m_k} for clusters {c_1, c_2, ..., c_k}. To avoid poor quality clusters, pick k artifacts from D to serve as initial centroids that exhibit as little mutual similarity as possible.
3. For each artifact a_i, compute the similarity scores between a_i and each centroid. Identify the centroid m_j that is most similar to a_i, and assign or reassign a_i to cluster c_j.
4. For each cluster c_j, recompute the newly formed center m_j as the mean of all the artifacts contained in c_j.
5. Repeat steps 3 and 4 until no membership reassignment occurs.

K-means clustering exhibits a time complexity of O(kn), which is linear in the number of artifacts, meaning that it converges quickly; however it suffers from the problem that no foolproof technique exists for selecting the optimal initial set of centroids.

3.4 Bisecting divisive clustering

The bisecting divisive clustering algorithm relies on K-means clustering (with K=2, specifically) to consecutively bisect a larger cluster into two smaller ones. It runs as follows:

1. Start by assigning all the artifacts to a single cluster.
2. For each c_i in the present clustering C, bisect it using 2-means clustering and then compute the score of the objective function E over the resulting clusters (where E is defined in Section 3.3).
3. Select the cluster c_p whose bisection exhibits the highest E score. Commit the splitting of c_p by removing c_p and adding the two new clusters into the clustering.
4. Repeat steps 2 and 3 until the desired clustering granularity has been achieved.

This algorithm exhibits a time complexity of O(n), which is equivalent to K-means clustering. However, contrary to K-means clustering, it tends to produce relatively uniformly sized clusters, which is a desirable property when considering usability of the clusters for supporting tasks such as trace retrieval. In bisecting divisive clustering, the E score progressively increases; however this optimization tends to slow down as an increasing number of clusters are formed. As will be discussed in Section 4, it is non-trivial to determine the optimal number of clusters purely through analyzing the E score.

3.5 Analysis of clustering algorithms for tracing

These three techniques were each evaluated against the requirements of three different data sets, simulating the situation of tracing to requirements from artifacts such as higher-level business goals, design elements, or code. Clustering quality is typically evaluated against an answer set; however, our subjective examination suggested that it would be infeasible to find an ideal requirements clustering solution against which to compare the results. Any evaluation would therefore be biased according to our subjective decisions. Although clustering results can also be evaluated using standard metrics, such as those described in Section 4 of this paper, similar metrics are in fact used in the objective functions of each clustering process.
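A minimal sketch of the bisecting divisive algorithm of Section 3.4 is shown below. Cosine similarity stands in for the paper's probabilistic network similarity, the dissimilar-pair seeding heuristic is a simplification of the centroid-initialization heuristic, and the split-selection criterion approximates the E-score comparison; all of these substitutions are assumptions for illustration.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cohesion(cluster):
    # Objective E restricted to one cluster: sum of similarities to the centroid.
    centroid = mean_vector(cluster)
    return sum(cosine(v, centroid) for v in cluster)

def two_means(vectors, iters=20):
    # 2-means seeded with the most mutually dissimilar pair of artifacts.
    n = len(vectors)
    i0, j0 = min(((i, j) for i in range(n) for j in range(i + 1, n)),
                 key=lambda p: cosine(vectors[p[0]], vectors[p[1]]))
    centroids = [vectors[i0], vectors[j0]]
    groups = [[vectors[i0]], [vectors[j0]]]
    for _ in range(iters):
        groups = [[], []]
        for v in vectors:
            groups[0 if cosine(v, centroids[0]) >= cosine(v, centroids[1]) else 1].append(v)
        if not groups[0] or not groups[1]:
            return [vectors[:1], vectors[1:]]  # degenerate split fallback
        centroids = [mean_vector(g) for g in groups]
    return groups

def bisecting(vectors, k):
    clusters = [list(vectors)]
    while len(clusters) < k:
        # Steps 2-3: pick the cluster whose bisection scores highest on E.
        best = max((c for c in clusters if len(c) > 1),
                   key=lambda c: sum(cohesion(g) for g in two_means(c)))
        clusters.remove(best)
        clusters.extend(two_means(best))
    return clusters

artifact_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusters = bisecting(artifact_vectors, 2)
```

On the toy vectors above, the first bisection separates the two obvious groups; iteration continues until the requested number of clusters is reached.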

As a result, each metric is biased towards the clustering technique that adopts it. For experimental purposes we therefore decided to first answer the simpler question of whether the three standard clustering techniques produced significantly different or similar results. The more difficult question of assessing the "goodness" of a trace clustering is deferred until Section 4.

The three datasets included in the experiment were the Ice Breaker System (IBS), Event Based Traceability (EBT), and Public Health Watcher (PHW). IBS was initially described in [25] and then enhanced with requirements mined from documents obtained from the public work departments of Charlotte, Colorado; Greeley, Colorado; and the Region of Peel, Ontario. IBS manages de-icing services to prevent ice formation on roads. It receives inputs from a series of weather stations and road sensors within a specified district, and uses this information to forecast freezing conditions and manage dispersion of de-icing materials. The system consists of 180 functional requirements, 72 classes, and 18 packages. EBT, which was initially developed at the International Center for Software Engineering at the University of Illinois at Chicago, provides a dynamic traceability infrastructure based on the publish-subscribe scheme for maintaining artifacts during long-term change maintenance. It is composed of 54 requirements, 60 classes, and 8 packages. Finally, PHW represents a health complaint system developed to improve the quality of the services provided by health care institutions [26]. The specification is mainly structured as use cases, and in this paper, each use-case step is extracted as a requirement, resulting in 241 requirements.

3.6 Results

For this initial series of experiments, clusterings were generated at varying levels of granularity ranging from 2-34 clusters for the PHW and IBS systems, and from 2-18 for the smaller EBT system. A pairwise correlation analysis was performed for each dataset to compare the performance of the three algorithms. Given the artifact set D = {a_1, a_2, ..., a_n} and two clustering structures C_i and C_j, two membership vectors v_i and v_j associated with C_i and C_j respectively were constructed. The similarity or correlation between C_i and C_j was defined as the cosine score of v_i and v_j:

Sim(C_i, C_j) = \frac{v_i \cdot v_j}{|v_i| |v_j|}    (3)

Scores approaching 1 represent greater similarity between the two clusterings. Table 1 depicts the correlation results, indicating that at reasonable clustering granularity, of 5-6 clusters or higher, no significant difference was observed between the clusterings achieved by the three approaches. It was also observed that the agglomerative hierarchical approach tended to leave a number of unclustered artifacts, while existing clusters tended to grow increasingly large. Although additional steps can be added to the process to encourage more uniform clustering, this was seen as a significant disadvantage of the approach. K-means clustering suffers from the problem of finding a suitable set of initial centroids, and can therefore tend to perform poorly under certain circumstances. On the positive side, the bisecting approach has a significantly faster runtime than most other approaches and was reported in [21] to outperform K-means and agglomerative hierarchical clustering in precision. Based on these observations, the bisecting divisive clustering approach was adopted throughout the remainder of the experiments described in this paper.

Table 1. Correlation between three clustering algorithms at various numbers of clusters.

            PHW                  IBS                  EBT
 |C|   B,K   B,A   K,A     B,K   B,A   K,A     B,K   B,A   K,A
  2   1.00  0.53  0.53    0.41  0.15  0.15    0.41  0.12  0.08
  3   0.84  0.75  0.71    0.84  0.21  0.16    0.77  0.56  0.63
  4   0.84  0.74  0.76    0.79  0.23  0.20    0.74  0.74  0.72
  5   0.85  0.81  0.81    0.83  0.66  0.59    0.81  0.79  0.81
  6   0.87  0.84  0.90    0.87  0.81  0.81    0.84  0.83  0.84
  8   0.93  0.92  0.93    0.90  0.83  0.82    0.89  0.88  0.91
 10   0.94  0.94  0.95    0.92  0.88  0.89    0.91  0.90  0.92
 14   0.95  0.94  0.97    0.94  0.93  0.93    0.95  0.95  0.96
 18   0.97  0.96  0.98    0.95  0.95  0.95    0.97  0.96  0.97
 22   0.97  0.97  0.98    0.96  0.96  0.96     -     -     -
 26   0.97  0.97  0.98    0.96  0.97  0.96     -     -     -
 30   0.98  0.97  0.98    0.97  0.97  0.97     -     -     -
 34   0.98  0.97  0.98    0.97  0.97  0.98     -     -     -

B = Bisecting divisive clustering; K = K-means; A = Agglomerative hierarchical

4. EVALUATING CLUSTER QUALITY

The previous section compared three different techniques and determined that for the three datasets there was no significant difference in the quality of clusterings produced once clustering reached a reasonable level of granularity. In this section, the more difficult issue of evaluating the quality of an evolving clustering, and determining how many clusters to generate, is addressed. Intuitively the objective is to maximize the cohesion of individual clusters while simultaneously minimizing inter-cluster coupling. Jain [18] evaluated several metrics for validating cluster quality and for identifying the ideal granularity level. He concluded that there is no single ideal metric capable of consistently producing optimal results across multiple datasets. He also demonstrated that popular metrics returned different results when applied to the same datasets.

Figure 1. Hubert index against three datasets.

In this section, two common metrics are tested against the three datasets, and a new metric is then introduced.

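The clustering-correlation measure of Equation (3) in Section 3.6 can be sketched as follows. Note that, taken literally, the cosine of raw membership-label vectors depends on how cluster labels happen to be assigned; this sketch simply follows the formula as stated, and the label-assignment convention is an assumption.

```python
import math

def membership_vector(clustering, n):
    # Label each artifact 0..n-1 with the index of the cluster that contains it.
    v = [0] * n
    for label, cluster in enumerate(clustering, start=1):
        for artifact in cluster:
            v[artifact] = label
    return v

def clustering_similarity(c1, c2, n):
    # Equation (3): cosine score of the two membership vectors.
    v1, v2 = membership_vector(c1, n), membership_vector(c2, n)
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm

# Identical clusterings score 1.0; disagreeing clusterings score lower.
score = clustering_similarity([[0, 1], [2, 3]], [[0, 2], [1, 3]], 4)
```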

This third metric was designed specifically to support granularity decisions when clustering software engineering artifacts to support trace generation.

4.1 Hubert Index

The first standard metric that was evaluated is a variant of the well-known Hubert index [16] and is defined as follows:

H(C) = \frac{1}{M} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} s(i, j) \, Q(i, j)    (4)

where M represents the number of artifact pairs n(n-1)/2, s(i,j) denotes the similarity score between artifacts i and j, and Q(i,j) is the similarity score between the two clusters containing items i and j respectively. The Hubert index takes both cohesion and distinctiveness into consideration. The results for the Hubert metric applied at various clustering levels are shown in Figure 1. Halkidi et al. have shown that optimal granularity points are found at the various knees in the curve [16]. For example, the IBS dataset exhibits knees at 11, 22, and 56 iterations representing optimal coarse-grained, medium-grained, and fine-grained clustering options.

4.2 CC-Ratio

As previously stated, the two primary clustering objectives are to maximize cluster cohesion and to minimize inter-cluster coupling. These factors are combined in the CC-Ratio metric, where cluster cohesion (CH) is defined as:

CH = \frac{1}{n} \sum_{i=1}^{k} \sum_{a \in c_i} s(a, m_i)    (5)

where n represents the total number of artifacts. This formula computes the mean of the similarities between each point in a cluster and the cluster's centroid. As clusters are formed by placing similar items together, clusters with points that are closer to a centroid are intuitively more cohesive than clusters with more dispersed elements. Inter-cluster coupling is minimized when the terms in each cluster are distinct from those in other clusters. CC-Ratio measures this through a metric called inter-cluster similarity (IS), defined as follows:

IS = \frac{\sum_{c_i, c_j \in C \wedge c_i \neq c_j} s(AG(c_i), AG(c_j))}{k(k-1)/2}    (6)

This formula uses the function AG(c_i) to merge all of the artifacts within a cluster into a single consolidated artifact, and then computes the mean of the similarity scores between each pair of resulting consolidated artifacts.

The CC-Ratio is then defined as CC = CH / IS. As CH tends to increase and IS to decrease as the number of clusters increases, the CC curve tends to grow continuously. Potentially optimal cluster granularities are again found by identifying the knee(s) of the curve. The results for each of the three datasets are depicted in Figure 2. It is interesting to note that the ideal cluster numbers discerned by the CC-Ratio are significantly different from those found by the Hubert index. For example, for the IBS dataset Hubert suggested clustering at 11, 22, or 56 iterations, while CC-Ratio proposed clustering at 18, 28, or 54 iterations. Similarly, for the PHW system, the Hubert index suggested 6 or 20 iterations while CC-Ratio suggested 11, 20, 33, or 40. As the peaks and troughs of each graph tend to be relatively small and frequent, this approach proved difficult to apply to the requirements tracing problem. In fact these techniques have primarily been used in applications that require coarser-grained clusterings.

4.3 Theme cohesion and coupling

The quality of the generated clusters was manually assessed by the researchers for each of the three datasets at various clustering granularities. During this process it was observed that requirements often tend to contain supplemental information in addition to the dominant theme. As an example, consider the IBS requirement stating that "When new road sensors are added, the thermal map shall be updated to reflect the new weather data." In this case, the requirement could logically have been placed into a cluster focusing on thermal maps, road sensors, or even a weather data theme. Unfortunately, neither the Hubert metric nor the CC-Ratio metric considers the possibility of primary and secondary themes within and across clusters. Based on our observations of clusters that were subjectively evaluated as either 'good' or 'bad' by human analysts, we define a good cluster as one that contains a dominant or cohesive theme running throughout its member artifacts. A user should be able to look at a good cluster and intuitively understand why its members were placed together.

Based on this observation, two new metrics were designed, named theme cohesion (TCH) and theme coupling (TCP). Both metrics rely on the underlying concept of a dominant term, which is defined for a given cluster as a term that occurs across a significant number of artifacts in that cluster.
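The CH, IS, and CC-Ratio computations of Section 4.2 can be sketched as follows. Cosine similarity again stands in for the probabilistic similarity s, and AG is implemented here as element-wise summation of term vectors, which is one plausible consolidation; both choices are assumptions for illustration.

```python
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def ch(clusters):
    # Equation (5): mean similarity between each artifact and its cluster centroid.
    n = sum(len(c) for c in clusters)
    return sum(cosine(a, mean_vector(c)) for c in clusters for a in c) / n

def aggregate(cluster):
    # AG(c): merge a cluster's artifacts into one consolidated term vector.
    return [sum(col) for col in zip(*cluster)]

def inter_similarity(clusters):
    # Equation (6): mean pairwise similarity between consolidated clusters.
    k = len(clusters)
    return sum(cosine(aggregate(ci), aggregate(cj))
               for ci, cj in combinations(clusters, 2)) / (k * (k - 1) / 2)

def cc_ratio(clusters):
    # CC = CH / IS: high cohesion and low coupling drive the ratio upward.
    return ch(clusters) / inter_similarity(clusters)

clusters = [[[1.0, 0.0], [1.0, 0.2]], [[0.0, 1.0], [0.2, 1.0]]]
ratio = cc_ratio(clusters)
```

On the toy clusters above, cohesion is near 1 while inter-cluster similarity is low, so the ratio is well above 1, as expected for a clean partition.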

0.8

40 33

20

0.6 0.5

0.9

0.8

0.8

0.7

CC-ratio Value

0.7

0.9

11

0.4 0.3

CC-ratio Value

0.9

CC-ratio Value

1

1

1

54

0.6 0.5

28

0.4

18

0.3

1

6

11

Iterations

16

21

26

31

36

41

46

a. Public Health Watcher System.

51

56

61

66

0.5 0.4

6

0.1 0

0

0

13

0.6

0.2

0.1

0.1

0.7

0.3

0.2

0.2

CH: Cohesion IS: Intercluster similarity CC-ratio

1

6

11

16

21

26

31

36

41

46

51

56

61

66

b. Ice Breaker System.

Figure 2. CC-Ratio results for three datasets.

248

1

6

11

16

c. Event Based Traceability

Potential stopping point

dti for which

NR (dti ) ≥ p , where NR(dti) is the number of NR

0

requirements in the cluster containing term dti; NR represents the total number of artifacts in the cluster; and 0 ≤ p ≤ 1. For experimental purposes p was set to 0.5. A subset of dominant terms taken from the Public Health Watcher system and organized hierarchically are shown in Figure 3.

Figure 3. Dominant terms in the Health Watcher System.

TCH is designed to measure the extent to which all of the artifacts within a cluster are associated with its dominant theme. It is computed as follows:

TCH = NR(I(Dt) ≥ α) / NR

where NR(I(Dt) ≥ α) represents the number of requirements containing at least a percentage α of all dominant terms. A reasonable value for α was determined statistically by analyzing over 200 clusters as they evolved during various iterations of the clustering process, and setting α equal to the average ratio of dominant terms contained in each individual artifact. The α values for PHW, IBS, and EBT were computed at 0.8, 0.8, and 0.7 respectively.

Like the Intercluster Similarity (IS) metric defined by the CC-Ratio technique, the second theme-based metric also attempts to measure intercluster coupling. However, instead of considering all of the terms in a cluster, it looks only at the dominant terms. If Dti and Dtj represent the dominant terms of clusters ci and cj respectively, then a term space comprised of the union of Dti and Dtj can be constructed. Each of the clusters is then represented as a vector of terms, and TCP is computed as the sum of all pairwise coupling scores normalized by the number of cluster pairs, namely:

TCP = [ Σ_{ci,cj ∈ C} (Dti · Dtj) / (|Dti| |Dtj|) ] / [ k(k − 1) / 2 ]

Once TCH and TCP have been computed, a combined theme metric TM is computed as TM = λ·TCH + (1 − λ)(1 − TCP). For experimental purposes λ was set at 0.5. Results for the three datasets are shown in Figure 4. Unsurprisingly, TM metrics do not increase consistently as the number of clusters increases. This is explained by the fact that theme cohesion and theme coupling were not used by the objective function during the clustering process. Informal experiments suggested that using TCH and TCP to drive the clustering process is problematic and can lead to erratic results, because these metrics do not take all terms in an artifact into consideration. However, these metrics do provide a feasible technique for evaluating cluster quality in a way that mimics the approach a human analyst might take to evaluate a set of clusters.

4.4 Determining granularity
As previously discussed, the Hubert index and CC-Ratio suggest several possible granularity levels; however, they do not take human usability of the clusters into consideration. In 1956, George Miller wrote a seminal paper entitled "The Magical Number Seven, Plus or Minus Two" [24], in which he showed that the average person can handle around seven chunks of information in working memory at a time. We therefore established the rule of thumb that an average cluster should contain from five to nine items. Granularity was then determined as follows:
1. Compute the maximum cluster count (MaxCC). Based on Miller's 7 ± 2 thesis, the minimum cluster size should be 5; therefore MaxCC = Total number of artifacts / 5.
2. Compute the minimum cluster count (MinCC). Based on Miller's 7 ± 2 thesis, the maximum cluster size should be 9; therefore MinCC = Total number of artifacts / 9.
3. Compute Theme Cohesion and Theme Coupling for all clusterings between MinCC and MaxCC, and select the clustering in this range with the highest TM value.

Based on these heuristics, the PHW should contain from 26 to 48 clusters. As shown in Figure 4a, TM values within this range peak at 48 clusters. The IBS requirements should contain from 20 to 36 clusters, and TM values peak at 25 clusters. Finally, EBT should contain 6 to 11 clusters. Two equal peaks were found at 9 and 11 clusters respectively, and so 9 was chosen as this was closer to the optimal cluster count (OptCC), computed as OptCC = Total number of artifacts / 7.

Figure 4. Theme cohesion and coupling for three datasets showing the ideal cluster window: a. Public Health Watcher System; b. Ice Breaker System; c. Event Based Traceability.

One notable observation from the TM results is that TCH values start off high for the EBT dataset and significantly lower for the other two datasets. This observation is confirmed by the fact that all modules in EBT focus around the rather narrow theme of tracing in an event-based environment.

5. FINDING RECESSIVE CLUSTERS
All of the clustering algorithms described in this paper cluster artifacts around dominant themes but fail to identify less obvious clusters that may appear as cross-cutting concerns [10]. As an example, consider the following clusters, which were generated from requirements in the PHW dataset. The cluster names were created manually for pedagogical purposes.

Cluster name: 1123 – Send login and password to server
- The login and password are sent to the server.
- The login and password are sent to the server.

Cluster name: 1092 – Show Message or Screen
- The system shows the specific screen for each type of complaint.
- The system shows the login screen.
- Error message should be showed.
- Show a message informing the employee of the missing/incorrect data.

Cluster name: 1126 – Present results to user on local display
- The result of the login attempt is presented to the employee on their local display.
- The result of the login attempt is presented to the employee on their local display.
- The query results are formatted and presented to the user on their local display.

Cluster name: 1124 (C3) – Retrieve data using unique identifier
- The system retrieves the employee details using the login as a unique identifier.
- The system retrieves the employee details using the login as a unique identifier.
- The unique identifier is used to retrieve the complaint entry.
- The unique identifier is used to retrieve the disease type to query.
- The unique identifier is used to retrieve the list of health units which are associated with the selected specialty.
- The unique identifier is used by the system to search the repository for the selected health unit.

In this example four distinct clusters are shown. Each of these clusters contains information related to logging into the system, yet this login information is dispersed across four distinct clusters. This occurs because many of the login requirements, highlighted in gray, also contain terms that are unrelated to login, and which dominated the clustering process. For example, in the second cluster the requirements are clustered around the dominant theme of "showing" either a screen or message, while in the fourth cluster requirements are clustered around retrieving data using unique identifiers. Logins are just one type of unique identifier in the PHW system. In this example the "login" theme cross-cuts several other clusters, and so we term it a recessive theme.

For traceability purposes, if the user is generating traces for the higher-level requirement that "Only authorized users shall access the system", then it would be helpful if the recessive login theme could be elevated and displayed as an additional cluster. In an earlier workshop paper we defined a technique for detecting and extracting such recessive themes [10]. First, dominant terms are identified in each of the existing clusters. These terms can then be removed, and the remnants of each requirement used to form new clusters. As previously stated, a dominant term is defined as one that occurs across a specified percentage of the artifacts in a cluster. For example, in cluster C3 above, in which the percentage is fixed at 50%, dominant terms are underlined, standard stopwords (i.e. words to ignore) are italicized, and the remaining bold-faced terms are considered recessive.

Once dominant terms and stop words have been removed, requirements with no remaining terms are eliminated from further consideration, as it is assumed that their concepts have been fully represented by the existing clusters. The second clustering phase then forms clusters using both the initial set of requirements plus skeletal requirements containing only recessive terms. For example, the requirement shown above in cluster C3, "The system retrieves the employee details using the login as a unique identifier", would be stripped of dominant and stop terms to form "employee details login."

The clustering process is then re-executed and an additional set of clusters is formed. In this example the following cluster was created, which clearly illustrates how previously dispersed and undetected concerns can be brought together into a cohesive cluster. Note that duplicates occur in this cluster because they also occurred in the original use case specification.

Recessive Cluster name: C4 – login
- The system retrieves the employee details using the login as a unique identifier.
- The system retrieves the employee details using the login as a unique identifier.
- The system shows the login screen.
- The login and password are sent to the server.
- The login and password are sent to the server.
- The employee provides the login and password.
- The employee provides the login and password.
- The result of the login attempt is presented to the employee on their local display.
- The result of the login attempt is presented to the employee on their local display.
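The stripping step behind this second clustering phase can be sketched as follows. This is a simplified illustration: the function name and the hard-coded term sets are our assumptions, and stemming is omitted.

```python
def skeletal_requirements(artifacts, dominant, stop_words):
    """Strip dominant and stop terms from each artifact, keeping only
    recessive terms; artifacts left with no terms are eliminated from
    the second clustering phase."""
    skeletons = []
    for tokens in artifacts:
        recessive = [t for t in tokens
                     if t not in dominant and t not in stop_words]
        if recessive:  # fully-covered artifacts drop out
            skeletons.append(recessive)
    return skeletons

# Example based on cluster C3: with the cluster's dominant terms and the
# stop words removed, the requirement reduces to "employee details login".
artifacts = [["the", "system", "retrieves", "the", "employee", "details",
              "using", "the", "login", "as", "a", "unique", "identifier"]]
dominant = {"system", "retrieves", "retrieve", "unique", "identifier"}
stop_words = {"the", "using", "as", "a"}
print(skeletal_requirements(artifacts, dominant, stop_words))
# → [['employee', 'details', 'login']]
```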


Figure 5. Trace-based clustering prototyped in Poirot.

In fact, it is possible to repeat the process multiple times if requirements are more verbose and contain multiple themes. A more complete evaluation of the process for identifying recessive themes is described in [10].

6. TRACING WITH CLUSTERS

The work in this paper focused on document-side clustering as opposed to query-side clustering, which seemed less useful given that trace queries do not tend to suffer from the short-query problem. Document-side clustering was used to organize traceable documents in order to increase understandability of the results and minimize the effort needed to retrieve and evaluate all of the correct links. Although clusters could have been organized around fixed predefined categories, this would have required significant upfront human effort and would also have rendered a less flexible approach. An unsupervised clustering approach, which required no predefined categorization, was therefore adopted. The proposed trace-based clustering technique includes the following steps:

1. Create individual clusterings for each set of artifacts, i.e. separately cluster all requirements, Java classes, test cases, etc.
   - Utilize the bisecting divisive clustering technique.
   - Determine the correct number of clusters using the technique described in section 4.3 of this paper.
2. When a query is issued against a set of artifacts, compute probability scores utilizing the standard probabilistic network algorithm described in formula 1, and discussed in detail in numerous papers [3,4,20,22].
3. Group candidate links according to previously generated clusters.
4. Generate a meaningful name for each cluster.
5. Rank the clusters according to their relevance to the query.
6. Identify recessive clusters using the approach described in section 5.
7. Display the results to the user.

For the clusters to be useful to the analyst they must be meaningfully named. Unfortunately, finding representative names or abstractions for text is a non-trivial open research question [14], and so clusters were named according to their dominant themes. Significant improvement in naming quality was observed by determining the dominant theme based only on the candidate links in each cluster. Following a series of informal experiments, the clusters were ranked as follows:

R(ci) = β · |links(ci)| / L + (1 − β) · [ Σ_{a ∈ links(ci)} (L − order(a)) / Σ_{i=1}^{L} i ]    (7)

where L represents the number of retrieved links, links(ci) is the set of retrieved links in cluster ci, order(a) is the ordered position of artifact a within the list of retrieved links, and β is the weight assigned to the first component. Intuitively, this formula takes two primary factors into account. The first component computes the proportion of total retrieved links that are found in cluster ci, while the second component takes into consideration the ranking of those links, so that a cluster with more highly ranked links is considered before one with only lowly ranked links.

The approach is illustrated through a high-fidelity prototype developed for our automated traceability tool named Poirot [22].
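Formula (7) translates directly into code. This sketch uses illustrative names (it is not Poirot's implementation) and assumes 1-based rank positions:

```python
def cluster_rank(cluster_links, all_links, beta=0.5):
    """Rank cluster c_i per formula (7): a weighted sum of the share of
    retrieved links the cluster holds and of how highly ranked they are.
    cluster_links: 1-based positions order(a) of c_i's links in the
    retrieval list; all_links: L, the total number of retrieved links."""
    L = all_links
    share = len(cluster_links) / L                      # |links(c_i)| / L
    rank_weight = (sum(L - order for order in cluster_links)
                   / (L * (L + 1) / 2))                 # sum_{i=1}^{L} i
    return beta * share + (1 - beta) * rank_weight

# A cluster holding the top two of four retrieved links:
print(cluster_rank([1, 2], 4))  # → 0.5
```

As the formula intends, a cluster whose links sit higher in the retrieval list scores above one with the same number of lowly ranked links.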


In the prototype screenshot depicted in Figure 5, the user issued a trace query from a high-level business requirement stating that "Temperature, precipitation, and wind chill data will be received from weather stations and roadside sensors" to the set of lower-level requirements.

Table 2. Evaluation of Cluster-Based Traceability

Goal | Description | Candidate links | True links | Total clusters | Clusters with true links | Decision points (clicks) | Effort reduction
BG1 | Roads shall be closed and appropriate bodies notified in extreme weather conditions when roads become non-navigable. | 28 | 6 | 14 | 3 | 15 | 46%
BG2 | Operational costs shall be minimized through efficient route planning. | 11 | 5 | 6 | 4 | 8 | 27%
BG3 | Operational costs shall be minimized through efficient management of de-icing inventories. | 23 | 9 | 11 | 2 | 11 | 52%
BG4 | The system shall coordinate the maintenance of a fleet of trucks. | 44 | 8 | 14 | 4 | 22 | 50%
BG5 | District maps shall be imported and updated from external sources. | 14 | 4 | 8 | 2 | 11 | 21%
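The effort-reduction column in Table 2 appears consistent with comparing the clustered decision points against a one-decision-per-link baseline, i.e. (candidate links − decision points) / candidate links. This reading is an inference from the reported values, not a formula stated explicitly; a quick check against the table:

```python
# (candidate links, decision points) per business goal, taken from Table 2.
rows = {"BG1": (28, 15), "BG2": (11, 8), "BG3": (23, 11),
        "BG4": (44, 22), "BG5": (14, 11)}
for goal, (links, clicks) in rows.items():
    print(f"{goal}: {(links - clicks) / links:.0%}")
# → BG1: 46%, BG2: 27%, BG3: 52%, BG4: 50%, BG5: 21%
```

These match the effort-reduction figures reported for all five business goals.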

6.1 Evaluating Trace-Based Clustering
As trace-based clustering algorithms retrieve exactly the same set of links as non-clustered techniques, recall and precision metrics are unaffected by this approach. However, there are several significant benefits to clustering trace results. First, as depicted in Figure 5, traces are presented to the analyst as part of a meaningful group. The analyst has three options: accept all the candidate links in the cluster, reject all of the candidate links, or make individual decisions. From a usability perspective this enables the analyst to assess similar artifacts at the same time, which can reduce overall 'thinking' time, and to coordinate the physical inspection of supplemental documents needed to make a decision. As an added bonus, the number of decision points can be reduced when the analyst is able to evaluate and accept or reject an entire cluster as a consolidated group. For example, the first three clusters in the query depicted in Figure 5 include the following candidate links:

Cluster: 14 – Temperature readings ✗ (Reject all)
- The system shall report on precipitation readings. (2nd)
- Road temperature readings shall be recorded. (6th)
- The system shall report on road temperature readings. (8th)
- The system shall report on air temperature readings. (9th)

Cluster: 26 – Updates
- Road maps shall be updated by importing data from an external source. (4th) ✗
- Data received from the road sensors shall be updated regularly. (5th) ✓
- Weather forecasts shall be updated as received by the weather bureau. (10th) ✗
- When new road sensors are added, the thermal map shall be updated to reflect the new weather data. (25th) ✗
- Weather forecast update. (26th) ✗

Cluster: 43 – Transmission ✓ (Accept all)
- The system shall validate that all road sensors are functioning correctly and transmitting data. (19th)
- The system shall receive transmission from road sensors. (3rd)

Correct links are highlighted in gray, and each requirement's ranking in the candidate list is also shown. In this example the user can reject all artifacts in cluster 14, accept all those in cluster 43, and make individual decisions for those in cluster 26.

An additional benefit of clustering is that candidate links can be displayed within the context of other related artifacts. To evaluate the effectiveness of trace-based clustering with respect to trace queries issued against requirements, a small analysis was performed in which five business requirements were traced to the lower-level requirements of the Ice Breaker System. Results are shown in Table 2. As an example, consider the first business goal (BG1), for which 28 candidate links were generated. Six of these links represented true links while 22 represented false links, yielding a precision value of 21.4%. Fourteen clusters were presented to the user, each of which contained one or more candidate links. Correct links, as validated through an answer set, were found in three of these clusters. In fact, for this particular query, four of the good links were found in a single cluster entitled "Road Closings", and that cluster contained no incorrect candidate links. By simply examining the cluster title and glancing over the requirements contained in the cluster, the analyst could quickly determine the correctness of the links and accept them as a group with a single click of a button. The final column in Table 2 analyzes the minimum number of decisions (estimated by the number of clicks) needed to accept and reject all of the candidate links. The number of clicks is compared to the number that would have been needed to evaluate traces in a non-clustered results set. As reported in Table 2, the number of decision points, reflected in the column entitled "effort reduction", is significantly reduced in each of the queries when clustering is used to organize the results.

7. CONCLUSIONS
The techniques described in this paper have all been fully automated in Java prototypes and are now being integrated into Poirot in order to support industry-level pilot studies with Siemens. The results from the reported experiments demonstrated that even though standard clustering algorithms can be utilized to generate meaningful requirements clusters, customized techniques were needed to identify optimal clustering granularities. The theme-based approach in combination with target cluster sizes was shown to be an effective method and produced clusters that were well able to support tasks related to trace generation.

This paper also proposed a technique for identifying recessive or cross-cutting clusters; however, additional work is needed to evaluate the usefulness of such clusters in the trace evaluation process. Future work will therefore include evaluating cluster-based tracing on a broader set of artifact types, and on larger and more complicated datasets.


8. ACKNOWLEDGMENTS The work described in this paper was partially funded by NSF grants CCR-0306303 and CCR-0447594, and through a grant from Siemens Corporate Research. We also would like to thank Monika Antos for her help in developing the GUI prototype.

9. REFERENCES
[1] Antoniol, G., Canfora, G., Casazza, G., De Lucia, A., and Merlo, E. Recovering Traceability Links between Code and Documentation. IEEE Transactions on Software Engineering, 28, 10 (2002), 970-983.
[2] Bellot, P., and El-Beze, M. A clustering method for information retrieval. Technical Report IR-0199, (1999), Laboratoire d'Informatique d'Avignon, France.
[3] Cleland-Huang, J., Settimi, R., BenKhadra, O., Berezhanskaya, E., and Christina, S. Goal Centric Traceability for Managing Non-Functional Requirements. Intn'l Conf. on Software Engineering, (ICSE'05), (St Louis, USA, May 2005), ACM Press, 362-371.
[4] Cleland-Huang, J., Berenbach, B., Clark, S., Settimi, R., and Romanova, E. Best Practices for Automated Traceability. IEEE Computer, 40, 5, (June 2007), 24-32.
[5] Clusty, the Clustering Search Engine, URL http://clusty.com
[6] Cutting, D. R., Karger, D.R., Pedersen, J.O., and Tukey, J. W. Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections. Conf. on Research and Development in Information Retrieval, (Copenhagen, Denmark, 1992), 318-329.
[7] De Lucia, A., Fasano, F., Oliveto, R., and Tortora, G. ADAMS: advanced artefact management system. 10th European Conference on Software Maintenance and Reengineering, (CSMR'06), (2006), 349-350.
[8] Dhillon, I. S. and Modha, D. S. Concept decompositions for large sparse text data using clustering. Machine Learning, 42, 1/2, (Jan. 2001), 143-175.
[9] Domges, R., and Pohl, K. Adapting Traceability Environments to Project Specific Needs. Communications of the ACM, 41, 12, (1998), 55-62.
[10] Duan, C., and Cleland-Huang, J. A Clustering Technique for Early Detection of Dominant and Recessive Cross-Cutting Concerns. Early Aspects at ICSE 2007.
[11] Egyed, A. and Grünbacher, P. Identifying Requirements Conflicts and Cooperation: How Quality Attributes and Automated Traceability Can Help. IEEE Software, 21, 6, (November/December 2004), IEEE Comp. Soc. Press, 50-58.
[12] Ertz, L., Steinbach, M., and Kumar, V. Finding Topics in Collections of Documents: A Shared Nearest Neighbor Approach. Text Mine '01 at SIAM Intn'l. Conf. on Data Mining, (Chicago, IL, 2001).
[13] Frakes, W.B., and Baeza-Yates, R. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Englewood Cliffs, NJ, 1992.
[14] Goldin, L. and Berry, D.M. AbstFinder, A Prototype Natural Language Text Abstraction Finder for Use in Requirements Elicitation. Automated Software Engineering, 4, 4, (October 1997), 375-412.
[15] Gotel, O. C. Z. and Finkelstein, A. C. W. An Analysis of the Requirements Traceability Problem. Proc. of the 1st Intn'l Conf. on Requirements Engineering (ICRE '94), (Colorado Springs, CO, 1994), IEEE Computer Society Press, 94-101.
[16] Halkidi, M., Batistakis, Y., and Vazirgiannis, M. On Clustering Validation Techniques. Journal of Intelligent Information Systems, 17, 2-3, (Dec. 2001), 107-145.
[17] Huffman Hayes, J., Dekhtyar, A., and Karthikeyan Sundaram, S. Advancing Candidate Link Generation for Requirements Tracing: The Study of Methods. IEEE Transactions on Software Engineering, 32, 1, (2006), IEEE Computer Society Press, 4-19.
[18] Jain, A.K., and Dubes, R.C. Algorithms for Clustering Data. Prentice Hall, 1988.
[19] Kowalski, G. Information Retrieval Systems – Theory and Implementation. Kluwer Academic Publishers, 1997.
[20] Laurent, P., Cleland-Huang, J., and Duan, C. Towards Automated Requirements Triage. IEEE Requirements Engineering Conference, (September 2007), (New Delhi, India). To appear.
[21] Leuski, A. Evaluating document clustering for interactive information retrieval. In Proc. of the Tenth Intn'l Conf. on Information and Knowledge Management, (2001), (Atlanta, Georgia), 33-40.
[22] Lin, J., Lin, C.C., Cleland-Huang, J., Settimi, R., Amaya, J., Bedford, G., Berenbach, B., Ben Khadra, O., Duan, C., and Zou, X. Poirot: A Distributed Tool Supporting Enterprise-Wide Traceability. IEEE International Conference on Requirements Engineering, (September 2006), 356-357.
[23] Marcus, A., Maletic, J. I., and Sergeyev, A. Recovery of Traceability Links Between Software Documentation and Source Code. International Journal of Software Eng. and Knowledge Eng., 15, 4, (October 2005), World Scientific Publishing Co., 811-836.
[24] Miller, G.A. The Magical Number Seven, Plus or Minus Two: Some Limits on our Capacity for Processing Information. The Psychological Review, 63, (1956), 81-97.
[25] Robertson, S., and Robertson, J. Mastering the Requirements Process. Addison-Wesley, 1999.
[26] Soares, S., Laureano, E., and Borba, P. Implementing Distribution and Persistence Aspects with AspectJ. In Proc. of Object Oriented Programming, Systems, Languages, and Applications, (OOPSLA'02), (November 2002), 174-190.
[27] Steinbach, M., Karypis, G., and Kumar, V. A comparison of document clustering techniques. Workshop on Text Mining at Intn'l Conf. on Knowledge Discovery and Data Mining, (2000).
[28] TREC Data collection from Text REtrieval Conference (TREC), URL http://trec.nist.gov/.
[29] Wu, W., Xiong, H., and Shekhar, S. (Eds.) Clustering and Information Retrieval. Kluwer, 2003, ISBN 1-4020-7682-7.
[30] Zamir, O., Etzioni, O., Madani, O., and Karp, R.M. Fast and Intuitive Clustering of Web Documents. In Proc. of the Intn'l Conf. on Knowledge Discovery and Data Mining, (August 14-17, 1997), 287-290.
[31] Zhao, Y., and Karypis, G. Evaluation of hierarchical clustering algorithms for document datasets. In Proceedings of the Intn'l Conf. on Information and Knowledge Management, (McLean, Virginia, Nov 4-9, 2002), 515-524.
