A Consensus Based Approach to Constrained Clustering of Software Requirements

Chuan Duan+, Jane Cleland-Huang+, Bamshad Mobasher*
Systems and Requirements Engineering Center+, Center for Web Intelligence*
School of Computing, DePaul University
243 S. Wabash, Chicago IL 60604
+1 (312) 362-8863

{duanchuan, jhuang, mobasher}@cs.depaul.edu

ABSTRACT
Managing large-scale software projects involves a number of activities such as viewpoint extraction, feature detection, and requirements management, all of which require a human analyst to perform the arduous task of organizing requirements into meaningful topics and themes. Automating these tasks through the use of data mining techniques such as clustering could potentially increase both the efficiency of performing the tasks and the reliability of the results. Unfortunately, the unique characteristics of this domain, such as the high-dimensional, sparse, noisy data sets that result from short and ambiguous expressions of need, as well as the need for the interactive engagement of stakeholders at various stages of the process, present difficult challenges for standard clustering algorithms. In this paper, we propose a semi-supervised clustering framework, based on a combination of consensus-based and constrained clustering techniques, which can effectively handle these challenges. Specifically, we provide a probabilistic analysis for informative constraint generation based on a co-association matrix, and utilize consensus clustering to combine multiple constrained partitions in order to generate high-quality, robust clusters. Our approach is validated through a series of experiments on six well-studied TREC data sets and on two sets of user requirements.

Categories and Subject Descriptors H.3.3 [Information Search and Retrieval]: Clustering, Information filtering.

General Terms Clustering, Documentation, Requirements.

Keywords Clustering, requirements, semi-supervised clustering

1. INTRODUCTION
Software development projects include a number of human-intensive activities that can benefit significantly from automated support. For example, activities such as feature detection,
requirements elicitation [7], and certain types of automated traceability [14] all rely upon a human analyst to organize an extensive set of requirements into meaningful topics and themes. This is illustrated by the requirements elicitation process, in which stakeholders document their needs as short, unstructured statements that must then be manually reviewed, analyzed, and classified. The challenge of performing these tasks in a large project can be daunting. However, data mining techniques such as clustering can be used to organize and manage stakeholders' feature requests in order to increase efficiency, provide scalable software engineering processes, and improve the reliability of the results. Unfortunately, the unique characteristics of this domain, such as the high-dimensional, sparse, noisy data sets that result from short and ambiguous expressions of need, present difficult challenges for standard clustering algorithms. Furthermore, the context in which the clusters will be used dictates the need to create extremely fine-grained, high-quality clusters, sometimes containing as few as 10-20 requirements. Although high quality is clearly a goal of all clustering algorithms, it is especially important in the requirements domain because project stakeholders will directly interact with and scrutinize the generated clusters. Clusters that lack a clear, dominant theme, or that contain even a few misplaced requirements, may cause project stakeholders to lose trust in the automated approach and lead to poor adoption of related tools and processes.

In prior studies, we broadly investigated the use of standard clustering techniques such as K-means, agglomerative hierarchical clustering, bisecting K-means, and probabilistic techniques to determine whether any of these basic approaches could consistently return requirements clusters of the quality needed to support the proposed software engineering activities [15]. Each of the algorithms was evaluated against several requirements data sets, using both standard coupling and cohesion metrics and comparison of the generated clusters to known answer sets. We also conducted a subjective user analysis of the answer sets, because this provided more insight into the strengths and weaknesses of the clustering process. For example, one of the smaller data sets we evaluated represented a set of 366 feature requests gathered from MS students describing their needs for an Amazon-like student-centric web portal. A subjective analysis found that the generated clusters included very few highly cohesive ones, that almost all clusters contained misfits, and that a significant number of clusters had no obviously dominant theme. As a result of this extensive study, which included multiple algorithms and data sets, we concluded that fully automated

single-technique clustering algorithms do not appear to produce sufficiently high-quality results to adequately support the targeted software engineering tasks.

In this paper, we propose a semi-supervised clustering framework, based on a combination of consensus-based and constrained clustering techniques, which can effectively handle the challenges described above. The new approach takes advantage of the high levels of interactive user feedback expected in the requirements elicitation task to constrain future clusterings in an ensemble clustering framework. The quality of the initial baseline clustering in the ensemble is also significantly improved through a consensus-based approach: each clustering is generated by selecting a sub-sample of needs and then using the generated clusters to classify the remaining needs. This ensemble is then used to identify a set of constraints that maximizes the benefits obtained from the costly constraint collection process. The framework is tested against six TREC data sets, which have been used in related work [23], and two sets of feature requests, all of which are discussed in greater detail later in the paper.

The remainder of the paper is laid out as follows. Sections 2 and 3 provide a background discussion of constrained and consensus clustering, both of which are adopted in our proposed framework. Section 4 then introduces our consensus-based constrained clustering framework, and Section 5 reports on a series of experiments we conducted to validate it within the requirements domain. Section 6 concludes with an overall analysis of the results. Notation used throughout the remainder of the paper is defined in Table 1.

2. CONSTRAINED CLUSTERING
A number of researchers have investigated the use of semi-supervised clustering techniques in which the clustering process is guided by prior knowledge or constraints collected through expert user feedback. These constraints can be broadly classified as cluster-level or instance-level, where cluster-level constraints dictate global rules, such as not permitting empty clusters, and instance-level constraints specify something about the relationship between a pair of elements. The most commonly used instance-level constraints are pair-wise Must-Link (ML) and Cannot-Link (CL) constraints, indicating respectively whether a pair of instances must be placed in the same cluster or in separate clusters. Due to inconsistencies that arise when constraints are gathered from real users, for example $ML(x_1, x_2)$ and $ML(x_2, x_3)$ while $CL(x_1, x_3)$, the constraints cannot be treated as hard-and-fast rules. This paper, as with most other work on semi-supervised clustering, utilizes ML/CL constraints rather than other types of instance constraints, primarily because, from a practical perspective, this simple form of constraint is the easiest to design and collect from users.

Semi-supervised clustering has been investigated across a wide variety of algorithms, including hierarchical clustering [10], non-negative matrix factorization [21,28], and partitional clustering, especially the variants based on K-means, which have been shown to be very efficient. Constrained K-means variants can be further categorized into constraint-enforcement, distance-metric-learning, seeding, violation-penalty, and hybrid approaches. One representative constraint-enforcement technique is COPK-means [25], which strictly enforces both ML and CL constraints during the cluster assignment stage.

Table 1. Notations

Symbol(s)     Description
N             number of instances
K             number of clusters
X             set of instances
C             set of centroids
E1            1st-phase clustering ensemble
E2            2nd-phase constrained clustering ensemble
O             original similarity matrix between instances
M             co-association matrix derived from a clustering ensemble
ML            set of must-link constraints
CL            set of cannot-link constraints
l             a pair-wise link
w = [a, b]    a window for bounded constraint generation

The algorithm proposed by Xing et al. [26] tries to learn a diagonal or full covariance matrix from the constraints, and then calculates Mahalanobis distances between points to reflect the impact of the constraints. Seeded-KMeans, proposed in [2], utilizes cluster labeling information to initialize the centroids and constrain the cluster assignment. PCK-means [3], a violation-penalty algorithm, modifies the K-means objective function by adding a penalty in the form of a weighted number of constraint violations. MPCK-means [6] is a metric-learning-enhanced violation-penalty algorithm which combines the ideas of [26] and [4]. Additionally, model-based and probabilistic partitioning algorithms incorporating pair-wise constraints have been studied extensively in [4,9,29].

As large-scale constraint collection is extremely expensive in practice, it is important to maximize the potential usefulness of the constraint set, namely the degree to which the set of constraints can improve clustering quality. Besides the feasibility criterion of finding a partitioning that can satisfy all of the ML and CL constraints, discussed in [11,12,13], Davidson et al. proposed the two utility metrics of informativeness, which measures the amount of information in a constraint set that the clustering algorithm could not have determined on its own, and coherence, which represents the amount of agreement within the constraints themselves with respect to a given distance metric.

Although much of the early work on constrained clustering focused on low-dimensional data, the current need to effectively cluster large volumes of textual data has led to a new emphasis on using constrained approaches for high-dimensional data. Constrained clustering is particularly pertinent in the requirements domain because the high level of interaction with stakeholders makes feedback gathering quite attractive. Tang et al. [23] proposed a hybrid method named SCREEN which, like our framework, was designed specifically for constrained document clustering. SCREEN projects the instance vectors using an orthonormal matrix derived from the constraints, so that in the new feature space the similarity between instances involved in an ML constraint is maximized and the similarity between instances in a CL constraint is minimized. However, our analysis of this approach indicates that the experiments Tang et al. reported against six TREC data sets used a far-from-optimal baseline against which to compare their results: essentially, they did not include an important optimization step in spherical K-means. The details and impact of this are discussed further in section 5.5.1.

Figure 1. The framework of consensus-based semi-supervised clustering.

One of the weaknesses of these well-known approaches to constrained clustering is that the pairs of instances for which ML and CL constraints are generated tend to be randomly selected from an answer set. Typically, for experimental purposes, a pair of documents is randomly selected and an ML or CL constraint is generated according to the true assignment of the documents within the answer set. However, as our results will show, the random approach does not perform well on requirements data sets, which are typically characterized by large numbers of finely grained clusters and composed of documents that tend to be very short and ambiguous. A few researchers have investigated active learning techniques for constraint generation [3,19], as this approach can improve the informativeness of future constraints based on feedback gathered incrementally from the user. However, active constraint generation assumes the existence of an oracle that can respond to incremental feedback, which may not always be feasible in practice. This paper therefore explores a constraint generation technique that can produce more informative constraints without the benefit of active learning. The following section provides a brief description of consensus clustering methods; we then introduce a new technique for constraint selection based upon a co-association matrix generated by a consensus algorithm. As our results will show later in this paper, this approach is particularly well suited to constrained clustering of software requirements.

3. CONSENSUS CLUSTERING
One difficulty with many clustering algorithms, such as K-means, spectral clustering, and model-based probabilistic methods, is that their output is not deterministic, so the quality of the generated clustering depends upon the initial configuration. Consensus clustering partially addresses this problem by generating an ensemble of multiple clusterings and then combining the results through a voting mechanism. In this way, a higher-quality and more robust set of final clusters can be generated. There are many ways to build a clustering ensemble: in addition to varying initialization parameters, multiple clusterings can be obtained by using different feature selection or instance sampling methods.

For ensemble integration, many consensus clustering algorithms are based on the concept of an N x N co-association matrix M. Let $X = \{x_1, \ldots, x_N\}$ be the set of instances to be clustered, and let a clustering ensemble $E = \{P_1, \ldots, P_R\}$ represent $R$ partitionings of $X$, where each partitioning $P_r$ is a set of clusters whose union is $X$. Each element of the co-association matrix then represents a voting score between a pair of instances, $M_{ij} = n_{ij}/R$, where $n_{ij}$ is the number of times the instance pair $(x_i, x_j)$ is assigned to the same cluster over the ensemble. The underlying assumption of using a co-association matrix is that instances that should be placed together are very likely to appear in the same cluster across multiple clusterings. Usually either hierarchical clustering or graph partitioning algorithms are used to generate the final partitioning from M. Hierarchical clustering was adopted in [16,18,24], which used single-link or average-link agglomerative hierarchical clustering algorithms [20] over the co-association matrix, while graph partitioning was used in [17,22], which transformed the co-association matrix into a weighted graph and then partitioned the graph into K parts by finding K disjoint clusters of vertices with the objective of minimizing the multi-cut. The results reported in these papers have demonstrated the effectiveness and robustness of consensus clustering.
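To make the voting mechanism concrete, the following minimal Python sketch (our own illustration, not the paper's implementation; the normalization by ensemble size is our assumption, matching voting scores in [0,1]) accumulates a co-association matrix from an ensemble of label vectors:

```python
import numpy as np

def co_association(ensemble, n):
    """M[i, j] = fraction of clusterings placing instances i and j
    in the same cluster; ensemble[r][i] is instance i's label in
    the r-th clustering."""
    M = np.zeros((n, n))
    for labels in ensemble:
        labels = np.asarray(labels)
        # vote +1 for every pair co-assigned by this clustering
        M += (labels[:, None] == labels[None, :]).astype(float)
    return M / len(ensemble)

# toy usage: three clusterings of five instances
ensemble = [[0, 0, 1, 1, 2], [0, 0, 0, 1, 1], [1, 1, 2, 2, 2]]
M = co_association(ensemble, 5)
print(M[0, 1])  # 1.0: instances 0 and 1 are co-clustered in every partition
```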

4. A CONSENSUS-BASED CONSTRAINED CLUSTERING FRAMEWORK
To address the problem of generating more informative constraints and improving the quality of constrained clustering, especially for software requirements, we propose a unified consensus-based constrained clustering framework. This framework, depicted in Figure 1, comprises two main phases: constraint generation and constrained clustering. In phase 1, an initial clustering ensemble and the related co-association matrix M are generated and used to identify a set of constraints. In phase 2, these constraints are used to generate a second, improved clustering ensemble, over which a third clustering step is performed to create the final output P*. In related work, Yan et al. [27] used a consensus-based approach to cluster genes: they divided the feature space into random subsets and then applied distance-metric-learning-based constrained clustering to each feature subset, producing multiple subspace clusterings. Our approach adopts some of the same techniques as Yan et al., but is customized to our domain through our selection of algorithms for building the ensemble, selecting constraints, and generating the final partitioning. These phases are described in more detail below.

4.1 Constraint Selection
The 1st ensemble in our framework is created by randomly selecting and clustering subsets of the documents from the full feature space, and then generating complete partitions through classification. The co-association matrix M is then produced by summing the frequency with which each pair of instances occurs together across all of the clusterings in the ensemble. To generate the 2nd ensemble, our framework does not choose constraints randomly; instead, pair-wise ML/CL constraints are selected only from the subset of instance pairs in M whose voting scores fall within a given interval window [a, b]. This bounding method is statistically justified in section 4.2 and empirically validated in section 5; for now we simply provide an intuitive explanation of the underlying idea. As previously stated, it is important to select constraints that maximize the benefits of constrained clustering, and intuitively the greatest benefits are likely to be obtained by identifying borderline relationships. Pairs of instances exhibiting either very high or very low proximity scores already tend to be correctly placed together or apart by the unsupervised clustering algorithm, so asking a user to categorize them as ML or CL provides less useful information than asking for feedback on pairs of instances with intermediate proximity values.

4.2 Bounded Constraint Selection
In this section we provide a probabilistic analysis for bounding the voting scores of target constraints in order to obtain more informative constraints. In the general case, a random pair of instances is selected from the entire voting score range [0,1]. In the restricted case, the lower limit is increased so as to eliminate as many obvious CLs as possible, and the upper limit is decreased in order to eliminate as many obvious MLs as possible; the remaining interval is expected to contain a large number of border constraints. Formally, we define a constraint $l$ with voting score $v$ in the co-association matrix as a border constraint if the probability that $l$ belongs to ML is close to the probability that $l$ belongs to CL, namely $P(l \in ML \mid VS(l)=v) \approx P(l \in CL \mid VS(l)=v)$. Even though the exact forms of these two probabilities are hard to attain, if we assume there exists a range where most border constraints occur, we can reason about an important property of such a range. Let $w = [a,b]$ be a window over which border constraints will be drawn. Since each draw from $w$ has an approximately equal probability of being ML or CL, the probability of drawing an ML $m$ times out of $n$ total draws follows the binomial distribution

$$P(m) = \binom{n}{m} \left(\tfrac{1}{2}\right)^n,$$

which has mode $m = n/2$. The problem of identifying border constraints can therefore be relaxed to finding a window $w = [a,b]$ over which there is an approximately equal probability of drawing either an ML or a CL, namely $P_{a,b}(l \in ML) \approx P_{a,b}(l \in CL)$, or equivalently $|P_{a,b}(l \in ML) - P_{a,b}(l \in CL)| \approx 0$. For a specific data set, we can calculate $P_{a,b}(l \in ML)$ and $P_{a,b}(l \in CL)$ within window $w$ by counting the numbers of MLs and CLs whose voting scores lie between $a$ and $b$, and then dividing each count by the total number of possible constraints.

Table 2. The probability difference between drawing ML and CL within two windows, [0,1] and [0.1,0.5].

Data set    w = [0,1]    w = [0.1,0.5]
Tr11        0.6281       0.2450
Tr12        0.6663       0.4317
Tr23        0.4311       0.5206
Tr31        0.7218       0.4699
Tr41        0.652        0.1338
Tr45        0.5013       0.2867
STUDENT     0.9160       0.5599
SUGAR       0.9169       0.4890

In a purely random generation of constraints, where the window w is set to [0,1], the probability differences are listed in the first data column of Table 2. For all of the data sets, and especially for SUGAR and STUDENT, the difference between $P_{0,1}(l \in ML)$ and $P_{0,1}(l \in CL)$ is significant. In contrast, by narrowing the window w and moving it away from 0, the difference will usually be lower than in the random case. As can be seen in the second data column of Table 2, when we applied the narrower bounded window [0.1, 0.5], $P_{a,b}(l \in ML)$ and $P_{a,b}(l \in CL)$ are closer for most data sets. It can therefore be hypothesized that bounded constraint selection will generate more informative constraints in most cases; we validate this hypothesis empirically in section 5.5.3. Finally, it should be noted that this bounded, co-association-based strategy for constraint generation is generic and should therefore be applicable across different data types and proximity metrics.
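As an illustration, the sketch below (our own code with hypothetical names; it assumes the co-association matrix M from the earlier sketch and, for the Table 2 style analysis, a labelled answer set) selects candidate pairs inside the window and computes the ML/CL probability gap:

```python
import numpy as np

def bounded_candidates(M, a=0.1, b=0.5):
    """Instance pairs whose voting score falls inside [a, b];
    these borderline pairs are the ones shown to the user for
    ML/CL labelling."""
    n = M.shape[0]
    iu, ju = np.triu_indices(n, k=1)                 # all pairs i < j
    inside = (M[iu, ju] >= a) & (M[iu, ju] <= b)
    return list(zip(iu[inside], ju[inside]))

def window_gap(M, labels, a, b):
    """|P_ab(l in ML) - P_ab(l in CL)| for a labelled answer set:
    count MLs and CLs with scores in [a, b], each divided by the
    total number of possible pairs (as in Table 2)."""
    labels = np.asarray(labels)
    iu, ju = np.triu_indices(len(labels), k=1)
    inside = (M[iu, ju] >= a) & (M[iu, ju] <= b)
    same = labels[iu] == labels[ju]                  # true MLs
    total = len(iu)
    p_ml = np.sum(inside & same) / total
    p_cl = np.sum(inside & ~same) / total
    return abs(p_ml - p_cl)
```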

4.3 Constrained Clustering
In our framework, once constraints are selected and defined as either ML or CL, another set of partitions is generated using a collection of constrained clustering algorithms based on the constraints generated from M. Although it is possible to use constraints to supervise almost any basic clustering technique, Wagstaff [25] and Davidson et al. [11] demonstrated that most constrained clustering algorithms are sensitive to the order in which constraints are applied and instances are assigned to clusters. In fact, some orderings may lead to clusterings that are worse than those produced by a fully unsupervised algorithm. By using consensus integration of multiple constrained partitions, incorrect placements in one clustering can be outvoted by correct placements in others. To enhance the diversity of the clustering ensemble, it is helpful to permute both the order of instances and the order of constraints before applying a constrained algorithm. In our framework, a new co-association matrix is derived from the constrained ensemble E2, and average-link agglomerative hierarchical clustering [20] is then applied to that co-association matrix to obtain the final clustering P*. Although several choices exist for clustering a co-association matrix, such as variants of hierarchical agglomerative clustering, spectral clustering, and graph partitioning algorithms, the average-link algorithm was chosen because our experiments showed it to be very stable across ensembles of various sizes and characteristics.
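A minimal sketch of this final consensus step, assuming SciPy is available and treating 1 - M as a distance for average-link agglomeration (the paper does not prescribe a particular implementation):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def consensus_partition(M, k):
    """Final clustering P*: average-link agglomerative clustering
    over the co-association matrix, cut into k clusters."""
    D = 1.0 - M
    np.fill_diagonal(D, 0.0)            # squareform requires a zero diagonal
    Z = linkage(squareform(D, checks=False), method='average')
    return fcluster(Z, t=k, criterion='maxclust')   # labels in 1..k
```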

5. EXPERIMENTAL VALIDATION
A series of experiments was conducted to evaluate the effectiveness of both consensus clustering and our constrained consensus-based framework for clustering requirements documents. Normalized Mutual Information (NMI) was used to measure agreement between the consensus or constrained clustering and the reference clustering. NMI measures the extent to which knowledge of one clustering reduces uncertainty about the other. Formally, NMI first calculates the mutual information between two partitions $P^a$ and $P^b$:

$$I(P^a, P^b) = \sum_{i=1}^{k_a} \sum_{j=1}^{k_b} \frac{n_{ij}}{n} \log \frac{n \cdot n_{ij}}{n_i^a \, n_j^b}$$

where $k_a$ and $k_b$ are the cluster numbers of the two partitions, $n$ is the total number of instances, $n_{ij}$ is the number of shared instances in cluster $C_i^a$ of clustering $P^a$ and cluster $C_j^b$ of clustering $P^b$, and a similar explanation applies to $n_i^a$ and $n_j^b$, the sizes of clusters $C_i^a$ and $C_j^b$. The result is then normalized using an arithmetic mean, re-scaling it to the range [0,1]:

$$NMI(P^a, P^b) = \frac{I(P^a, P^b)}{\left( H(P^a) + H(P^b) \right)/2}$$

where $H(P) = -\sum_i \frac{n_i}{n} \log \frac{n_i}{n}$ is the entropy of a clustering [18].
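For reference, here is a direct Python transcription of the formulas above (our own sketch; scikit-learn's normalized_mutual_info_score offers an equivalent off-the-shelf implementation, with the normalization selectable via its average_method parameter):

```python
import numpy as np

def nmi(labels_a, labels_b):
    """NMI between two flat partitions, normalized by the
    arithmetic mean of the entropies as defined above."""
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    n = len(a)
    mi = 0.0
    for i in np.unique(a):
        for j in np.unique(b):
            nij = np.sum((a == i) & (b == j))       # shared instances
            if nij > 0:
                mi += (nij / n) * np.log(n * nij / (np.sum(a == i) * np.sum(b == j)))
    def entropy(x):
        _, counts = np.unique(x, return_counts=True)
        p = counts / n
        return -np.sum(p * np.log(p))
    denom = (entropy(a) + entropy(b)) / 2
    return mi / denom if denom > 0 else 1.0          # degenerate single-cluster case
```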

Section 5.1 describes the data used throughout the experiments, section 5.2 reports the experimental results of consensus clustering, and sections 5.3, 5.4, and 5.5 describe the setup and results of the experiments using consensus-based constrained clustering.

5.1 Data Sets
The experiments were conducted using six well-known TREC data sets (Tr11, Tr12, Tr23, Tr45, Tr41, and Tr31) [23], as well as two domain-representative sets of requirements named STUDENT and SUGAR. STUDENT is a small collection of 366 feature requests created by 36 graduate-level students for an Amazon-like student web-portal system. A reference set, which defined an "ideal" clustering, was developed by two of the researchers in the SAREC lab. SUGAR comprises 1000 feature requests mined from SugarCRM, an open source customer relationship management system that supports campaign management, email marketing, lead management, marketing analysis, forecasting, quote management, case management, and many other features. The feature requests were contributed by 523 different stakeholders over a two-year period, and were distributed across 309 threads. For the SUGAR data, a reference set was constructed by reviewing and modifying the natural discussion threads created by the SugarCRM users. Modifications included merging singleton threads with other relevant ones, manually re-clustering large mega-threads into smaller, more cohesive ones, and manually reassigning misfits to new clusters.

In preprocessing, for each of the eight data sets, stop words are eliminated, the remaining words are stemmed to their root forms, words that occur fewer than three times are eliminated, and the remaining terms are weighted using tf-idf. Each requirement is then represented as a vector of real values in which each position corresponds to a term in the document space. Finally, for better performance in calculating the inner product of vector pairs, each vector is normalized to unit length.

Figure 2. Results of consensus clustering versus spherical K-means.
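A sketch of this preprocessing pipeline using scikit-learn and NLTK (the paper does not name its tools; note that min_df=3 drops terms appearing in fewer than three documents, which is only an approximation of the frequency cut described above):

```python
import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

stemmer = PorterStemmer()

def tokenize(doc):
    # stop-word removal followed by stemming to root forms
    words = re.findall(r"[a-z]+", doc.lower())
    return [stemmer.stem(w) for w in words if w not in ENGLISH_STOP_WORDS]

def vectorize(docs, min_count=3):
    """tf-idf vectors for a list of requirement strings; norm='l2'
    yields the unit-length vectors needed so that cosine similarity
    reduces to an inner product."""
    vec = TfidfVectorizer(tokenizer=tokenize, lowercase=False,
                          min_df=min_count, norm='l2')
    return vec.fit_transform(docs)
```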

5.2 Experimental Evaluation and Analysis of Consensus Clustering
An experiment was conducted to evaluate the improvements obtainable by using consensus clustering on the six TREC data sets and the two requirements data sets STUDENT and SUGAR. A clustering ensemble of size R was produced by repeating the following sub-sampling steps R times: a proportion ρ of the whole data set was randomly extracted and partitioned into K clusters using spherical K-means (SPK), which is described in section 5.3, and the remaining instances were then classified into the most closely related clusters. Based on extensive experiments applied to several TREC document data sets, we determined that, for data sets with several thousand data points, a quality ensemble containing viable yet dissimilar clusterings could be generated by setting R to 200 and ρ to a value in the range [0.5, 0.8]. The experiment compared the NMI scores of each of these data sets when clustered using two-stage SPK versus the consensus algorithm described in the previous section, which used average-linkage hierarchical clustering to partition the co-association matrix. Because SPK generates different results depending upon the initial seeding, the algorithm was run 200 times for each data set, and minimum, maximum, and mean scores are reported.
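The sub-sampling loop might look like the following sketch (our own illustration; spherical_kmeans is assumed to behave like the two-stage algorithm of Figure 4 below, returning labels and unit-length centroids, and X holds dense unit-length rows):

```python
import numpy as np

def build_ensemble(X, k, R=200, rho=0.7, seed=None):
    """Phase-1 ensemble: R times, cluster a random subsample and
    classify the held-out instances to their most similar centroid."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    ensemble = []
    for _ in range(R):
        sample = rng.choice(n, size=int(rho * n), replace=False)
        sub_labels, centroids = spherical_kmeans(X[sample], k)
        labels = np.empty(n, dtype=int)
        labels[sample] = sub_labels
        rest = np.setdiff1d(np.arange(n), sample)
        # cosine similarity reduces to a dot product for unit vectors
        labels[rest] = np.argmax(X[rest] @ centroids.T, axis=1)
        ensemble.append(labels)
    return ensemble
```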

[Figure 3. Score differentiations between ML and CL instances in the co-association versus original matrices. One column per data set: Tr11 (NMI 0.7), Tr12 (0.67), Tr23 (0.43), Tr31 (0.52), Tr41 (0.68), Tr45 (0.76). Rows show, from top to bottom, the ML and CL voting-score distributions in the co-association matrix (M) and the ML and CL similarity-score distributions in the original matrix (O).]

These results, which are depicted in Figure 2, show that the consensus clustering results were always above the mean obtained using SPK, though usually just below the maximum. An interesting exception is that for the SUGAR data, representing a very realistic medium-sized project in our target domain, the consensus clustering scores were higher than the maximum obtained using SPK. Additional work is needed to build more domain-related answer sets so that we can investigate this phenomenon further. In general, these results are highly significant because they demonstrate that consensus clustering is more consistent and robust than spherical K-means.

The concept of the co-association matrix M was introduced in section 3. For the remainder of this discussion, pair-wise values in M are referred to as voting scores (VS), while the values in the original similarity matrix O are referred to as similarity scores, represented by the cosine values between pairs of instance vectors. To understand why consensus clustering can improve clustering quality, the distributions of voting scores and similarity scores for the six TREC data sets Tr11, Tr12, Tr23, Tr45, Tr41, and Tr31 were analyzed with respect to knowledge of MLs and CLs from the published answer sets. In Figure 3, each of these TREC data sets is represented by a single column. The graphs in the first row show the voting-score distributions in M of pairs of documents assigned together in the answer clustering (namely MLs), while the second row shows the voting scores of pairs that are not assigned together in the answer clustering (CLs). The third and fourth rows show the corresponding distributions of cosine values in O. NMI scores, depicting the similarity between the consensus clustering and the published reference clustering, are shown in parentheses next to each data set name.

These results clearly indicate that all of the data sets had similar distributions of MLs and CLs in the original proximity matrices, with very high concentrations near 0. In contrast, the scores in the co-association matrix provided a clearer differentiation between MLs and CLs by boosting the ML scores. This implies that the pair-wise voting scores compiled by the consensus algorithm tend to approach the "true" similarity scores. It also explains why proximity-based algorithms such as agglomerative hierarchical clustering give poor results on the similarities taken directly from O but perform much better on M. It can be observed that Tr11 and Tr45 scored the highest NMI values while Tr23 scored the lowest, and these differences correspond to the different shapes of the ML/CL distributions shown in Figure 3: the easy-to-cluster Tr11 and Tr45 have ML concentrations approaching one (i.e., towards the right of the graph) and CL concentrations towards 0, while the harder-to-cluster Tr23 does not exhibit such strong differentiation, meaning that even in the co-association matrix there was still significant disagreement about the MLs.

Algorithm: Two-stage spherical K-means clustering
Input: unlabelled instances X, number of clusters K, initial centroids I, convergence condition.
Output: crisp K-partition {C_1, ..., C_K}.
Steps:
1. Initialization: initialize the centroids using I.
2. Batch instance assignment and centroid update:
   2a. assign each instance x to the cluster i with the largest cosine similarity to centroid c_i;
   2b. update each centroid to the normalized sum of its members;
   2c. repeat from 2a until convergence.
3. Incremental optimization of the objective function:
   3a. randomly select an instance x and move it to the cluster that maximizes the gain of the objective function caused by the move;
   3b. update the affected centroids as in 2b;
   3c. repeat from 3a until convergence.

Figure 4. Two-stage spherical K-means clustering.
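The following Python sketch is our compact reading of Figure 4, under stated simplifications: the input rows are dense unit-length vectors, and the incremental stage performs one randomized sweep that moves an instance to its most similar centroid rather than computing the exact objective gain of step 3a.

```python
import numpy as np

def spherical_kmeans(X, k, max_iter=100, seed=None):
    """Batch spherical K-means followed by one sweep of incremental
    moves; returns (labels, unit-length centroids)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centroids = X[rng.choice(n, size=k, replace=False)].copy()
    labels = np.full(n, -1)

    def update(labels):
        C = np.vstack([X[labels == i].sum(axis=0) for i in range(k)])
        norms = np.linalg.norm(C, axis=1, keepdims=True)
        norms[norms == 0] = 1.0           # guard against empty clusters
        return C / norms

    # stage 1: batch assignment and centroid update until convergence
    for _ in range(max_iter):
        new = np.argmax(X @ centroids.T, axis=1)   # largest cosine similarity
        if np.array_equal(new, labels):
            break
        labels = new
        centroids = update(labels)

    # stage 2: incremental optimization (one randomized sweep shown)
    for x in rng.permutation(n):
        best = int(np.argmax(X[x] @ centroids.T))
        if best != labels[x]:
            labels[x] = best
            centroids = update(labels)
    return labels, centroids
```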

5.3 Choice of Constrained Algorithms
To increase the variability of the constrained partitionings used in phase two of our framework, two constrained partitional clustering algorithms were selected for the ensemble: COPK-means [25] and PCK-means [3]. Since both are built on the basic K-means algorithm, we first describe our version of spherical K-means and then point out the different ways in which COPK-means and PCK-means enforce constraints. A two-stage spherical K-means approach was adopted for experimental purposes (Figure 4), which appends an incremental optimization of the objective function after the usual batch instance assignment of K-means. This enhancement is critical, especially when cluster sizes are small: in a series of experiments, this simple optimization consistently improved NMI values, as will be shown in section 5.5.1.

Table 3. Performance of SPK and SPK-I. N is the number of documents, K is the number of clusters, |d| is the average number of distinct terms per document, and Δ is the difference in NMI score between SPK and SPK-I.

Data set    N      K    |d|    SPK     SPK-I    Δ
STUDENT     366    29   8      0.42    0.55     0.13
SUGAR       1000   40   27     0.43    0.55     0.12
Tr11        414    9    46     0.54    0.70     0.16
Tr12        313    8    39     0.48    0.67     0.19
Tr23        204    6    34     0.30    0.43     0.13
Tr31        927    7    132    0.52    0.55     0.03
Tr41        878    10   87     0.60    0.67     0.07
Tr45        690    10   69     0.55    0.73     0.18

The original COPK-means and PCK-means are based on the K-means algorithm shown in Figure 4 and differ only in their methods for constrained instance assignment, which corresponds to step 2a of the algorithm. COPK-means enforces a hard constrained assignment in which an instance $x$ is assigned to the nearest cluster $C_i$ only if the assignment results in no conflict, namely, for each $(x, y) \in ML$, $y \in C_i$, and for each $(x, y) \in CL$, $y \notin C_i$. PCK-means, on the other hand, applies a soft constrained assignment in which instance $x$ is assigned to whichever cluster $C_i$ maximizes

$$\mathrm{sim}(x, c_i) - w^* \sum_{(x,y) \in ML} \mathbb{1}[l_y \neq i] - w^* \sum_{(x,y) \in CL} \mathbb{1}[l_y = i]$$

where $l_y$ is the cluster label of the constraint-involved instance $y$, $\mathbb{1}[\cdot]$ is a binary function that returns 0 if the boolean condition is false and 1 otherwise, and $w^*$ is a penalty parameter used to adjust the "hardness" of the constrained assignment, set to 0.001 in our experiments [3]. Unlike many previous experiments [3,23], our implementation of COPK-means and PCK-means applies constrained instance assignment not only in the usual batch assignment stage (step 2a in Figure 4), but also in the incremental optimization stage (step 3a in Figure 4). Other techniques such as metric-learning-enhanced constrained clustering, including HMRFK-means [4] and MPCK-means [6], were not used to build the ensemble: they are computationally expensive, and using a covariance matrix to calculate Mahalanobis distances is only a coarse approximation when clustering high-dimensional data. Furthermore, as the experiments in [23] empirically showed, these approaches did not introduce significant quality improvements when applied to document clustering.
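The soft assignment can be sketched as follows (our own illustration, not the authors' code; X rows and centroids are unit vectors, ml/cl are lists of index pairs, and labels holds the current assignments):

```python
import numpy as np

def pck_assign(x_idx, X, centroids, labels, ml, cl, w=0.001):
    """Place instance x_idx into the cluster maximizing cosine
    similarity minus w per violated constraint, following the
    penalized objective above."""
    sims = X[x_idx] @ centroids.T            # cosine similarities
    penalty = np.zeros(len(centroids))
    for i, j in ml:
        if x_idx in (i, j):
            other = j if i == x_idx else i
            # penalize every cluster except the ML partner's cluster
            penalty += w
            penalty[labels[other]] -= w
    for i, j in cl:
        if x_idx in (i, j):
            other = j if i == x_idx else i
            penalty[labels[other]] += w      # penalize co-placement
    return int(np.argmax(sims - penalty))
```

COPK-means would instead reject outright any cluster whose current members violate a constraint with the instance being assigned.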

5.4 Constraint Generation and Processing
Pair-wise ML/CL constraints were generated using the bounded consensus method described in the previous section, with the selection window [a, b] set to [0.1, 0.5]. For comparison purposes, the results were also compared to random constraint generation. This choice of window needs some explanation: in a series of experiments we found that narrow windows spanning less than 0.2, or windows approaching either 0 or 1, did not return good results. These results support our earlier supposition that performance would improve if constraints were selected from the area of low confidence in the center of the scoring range. Optimal results were found using either the window [0.1, 0.5] or any other intermediate window with a span greater than 0.2.

MLs were preprocessed before being used in the constrained clustering algorithms. Specifically, connected components were identified among the instances involved in MLs, CL constraints were propagated along these connected components, and each connected component was then treated as a weighted input instance to the clustering algorithms. In this way, only CLs needed to be handled by the constrained clustering algorithms COPK-means and PCK-means. A total of 20 clusterings was generated, ten using COPK-means and ten using PCK-means. Constraints were applied in random order and instances were also processed in random order so as to enhance the diversity of the ensemble. The entire experiment was repeated ten times, and the average scores are reported.
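One way to implement this preprocessing (our own sketch; weighting each super-instance by its component size is left to the caller):

```python
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path compression
            i = self.parent[i]
        return i
    def union(self, i, j):
        self.parent[self.find(i)] = self.find(j)

def preprocess_constraints(n, ml, cl):
    """Collapse ML connected components into super-instances and
    propagate CLs onto the components, so the constrained algorithms
    only need to enforce CLs."""
    uf = UnionFind(n)
    for i, j in ml:
        uf.union(i, j)
    comp = {}                                  # root -> member list
    for i in range(n):
        comp.setdefault(uf.find(i), []).append(i)
    index_of = {root: c for c, root in enumerate(comp)}
    cl_comp = set()
    for i, j in cl:
        a, b = index_of[uf.find(i)], index_of[uf.find(j)]
        if a != b:
            cl_comp.add((min(a, b), max(a, b)))
    return list(comp.values()), sorted(cl_comp)

# toy usage: ML chain {0,1,2} becomes one component; CL (0,3)
# propagates to the whole component
components, cl2 = preprocess_constraints(5, ml=[(0, 1), (1, 2)], cl=[(0, 3)])
# components -> [[0, 1, 2], [3], [4]]; cl2 -> [(0, 1)]
```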

5.5 Experimental Results
5.5.1 Effects of incremental optimization of K-means
Two hundred rounds of spherical K-means were run with and without incremental optimization, labeled SPK-I and SPK respectively; average performances are shown in Table 3. These results show that a substantial difference exists between the two versions of spherical K-means, and that the post-processing was quite effective at improving cluster quality. Moreover, smaller average cluster sizes lead to more significant differences, with a strong negative linear correlation of -0.83. This could explain why the results reported in [23] often show a steep rise in performance given only 10 constraints: those constraints recover some of the improvement that a more fully optimized clustering algorithm would have achieved on its own.

5.5.2 Performance with small numbers of constraints
Bounded constraints ranging from 10 to 100, in intervals of 10, were generated and applied to three approaches: COPK-means, PCK-means, and consensus clustering over the constrained ensemble, denoted as B-COPK, B-PCK, and B-Cons respectively in Figure 5. The performance was then compared with the SCREEN approach reported by Tang et al. [23]. As there was insufficient detail to re-implement SCREEN for additional data sets, the comparison was made only against the six TREC data sets, for which results were extrapolated from the published graphs [23]. The results are shown in Figure 5. Consensus integration over a constrained partition ensemble (B-Cons) significantly outperformed the base algorithms COPK-means and PCK-means, and was generally comparable to SCREEN. Furthermore, excluding STUDENT and SUGAR, the data sets on which consensus constrained clustering outperforms SCREEN are Tr12, Tr23, and Tr45, which are in fact the three data sets containing the shortest documents among all six. This suggests that our proposed approach may be highly applicable for clustering requirements documents, which tend to be relatively short.

5.5.3 Bounded versus random constraints
This experiment investigated the difference between bounded consensus-based constraint generation and random constraint generation over a much larger number of constraints, ranging in intervals of 50 up to 1000 constraints. As COPK-means tends to suffer from a feasibility problem when presented with a large number of CLs, and because it has been shown to have comparable performance to PCK-means, it was not used in any of the additional experiments.

[Figure 5. Comparison of various constrained clustering algorithms with 10 to 100 constraints. One panel per data set (Tr11, Tr12, Tr23, Tr45, Tr41, Tr31, STUDENT, SUGAR), each plotting NMI against the number of constraints for B-COPK, B-PCK, B-Cons, SCREEN (TREC data sets only), and the unsupervised SPK baseline.]

The results of this experiment are shown in Figure 6, in which PCK-means and consensus clustering using bounded and random constraint generation, as well as the baseline spherical K-means, are denoted as B-PCK, B-Cons, R-PCK, R-Cons, and SPK respectively. The TREC data sets returned rather mixed results. The bounded approach appeared to do better on Tr11, Tr41, and Tr45, but did not do well on Tr23 and Tr31; furthermore, despite a good start on Tr12, there was a significant drop-off after about 600 constraints. An initial observation of these results suggested that the bounded approach did best when applied to data sets with a larger number of small clusters, meaning that there could be more borderline cases. For example, Tr11, Tr41, and Tr45, which performed well, have 9 or 10 clusters each, while Tr23 and Tr31 have only 6 clusters each; Tr12, which started off well and then dropped off, has 8 clusters. It should be noted, however, that data sets can be clustered at any level of granularity, and that an ideal granularity exists for each data set with respect to a given task; our observations are therefore based on fixed levels of granularity. In the software engineering domain, tasks such as feature extraction and requirements management typically require fine levels of granularity in order to create manageable clusters which stakeholders can use to perform their tasks [7,8,14,15]. As reported in our prior work, SUGAR is clustered at a granularity of 40 clusters, resulting in an average cluster size of 25, while STUDENT has 29 clusters containing an average of 13 feature requests each. These fine granularities suggest that requirements documents are potentially highly suited to bounded constraint generation. In fact, the results for these two requirements data sets, also depicted in Figure 6, show a marked improvement obtained through bounded constraint generation.

Figure 7 provides a more subjective view of our results by depicting twelve feature requests from the STUDENT data set clustered around the general topic of security. The cluster was generated using the consensus-based constrained framework with 1000 constraints. This particular cluster contained 21 requirements, 20 of which were judged to be security-related, with only one conceptual misfit that referred to customer transactions but had no reference to security. A subjective analysis of the clusters produced by the constrained consensus-based framework showed a significant improvement in the cohesiveness of each of the clusters. These observations were supported by the increase in NMI scores achieved during the experimental analysis.

6. CONCLUSION AND FUTURE WORK
This paper has described a new framework for clustering high-dimensional data sets such as requirements documents. The framework adopts a hybrid model which combines consensus and constrained clustering techniques, and in which the constraints selected are those expected to maximize the supervisory potential for improving cluster quality.

[Figure 6. Comparison of bounded and random constraint generation with 50 to 1000 constraints. One panel per data set (Tr11, Tr12, Tr23, Tr45, Tr41, Tr31, STUDENT, SUGAR), each plotting NMI against the number of constraints for B-PCK, B-Cons, R-PCK, R-Cons, and the SPK baseline.]

The reported experimental results demonstrated the effectiveness of this approach, especially for clustering short documents into finely grained partitions. These characteristics closely match those of the targeted requirements domain, and in fact the clustering results were especially promising for the SUGAR and STUDENT data sets. In future work we intend to build a far more extensive set of requirements-related data sets and corresponding answer sets, so that we can further assess and fine-tune the usefulness of our framework.

The work in this paper was primarily motivated by our research in automating and scaling up components of the requirements process, and by our subsequent observation that rudimentary clustering techniques did not produce sufficiently cohesive clusters to support our intended tasks. The clustering improvements obtained through the framework described in this paper have significantly mitigated this problem, to the extent that they are anticipated to support future research and tool development that will enable us to move towards higher levels of automation in the requirements engineering domain.

7. ACKNOWLEDGMENTS
The work described in this paper was partially funded by NSF grants CCR-0306303, CCR-0447594, and IIS-0430303.

8. REFERENCES
[1] Banerjee, A. and Ghosh, J. 2002. Frequency sensitive competitive learning for clustering on high-dimensional hyperspheres. In Proc. IEEE Int. Joint Conf. on Neural Networks (Honolulu, Hawaii, May 2002), 1590-1595.
[2] Basu, S., Banerjee, A., and Mooney, R. J. 2002. Semi-supervised Clustering by Seeding. In Proceedings of the Nineteenth International Conference on Machine Learning (July 8-12, 2002). Morgan Kaufmann Publishers, San Francisco, CA, 27-34.
[3] Basu, S., Banerjee, A., and Mooney, R. J. 2004. Active semi-supervision for pairwise constrained clustering. In Proc. of the 4th SIAM International Conference on Data Mining (Orlando, FL, 2004), 333-344.
[4] Basu, S., Bilenko, M., and Mooney, R. J. 2004. A probabilistic framework for semi-supervised clustering. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Seattle, WA, USA, August 22-25, 2004). KDD '04. ACM, New York, NY, 59-68.
[5] Bennett, K. P., Bradley, P. S., and Demiriz, A. 2000. Constrained K-Means Clustering. Microsoft Technical Report, May 2000.

(1) The system shall protect stored confidential information.
(2) System must encrypt purchase/transaction information.
(3) A privacy policy should be implemented to describe in detail to the users how their information is stored and used.
(4) Transmission of personal information should be encrypted.
(5) Transmission of financial transactions should be encrypted.
(6) The system must use encrypt & decrypt in some fields.
(7) Allow the user to view their previous transactions.
(8) Databases should use the TripleDES encryption standard for database security. AES is still new and has had compatibility issues with certain types of databases (namely SQL Server express edition).
(9) The site should ensure that payment information is kept confidential and any credit card transaction will be encrypted to prevent any hackers to the system from retrieving any information.
(10) Because our system will be used buy books, then we should focus on the security part of the system, and we must consider transaction control in the architecture used to build the system.
(11) Correct usage of cryptography techniques should be applied in Amazon portal system to protect student's sensitive information from not just outsiders but from those on staff who could potentially acquire the information if not correctly protected.
(12) Sessions that handle payment transactions have to be encrypted.

Figure 7. An example of a partial cluster containing security-related requirements, generated after gathering 1000 constraints on the STUDENT dataset.

[6] Bilenko, M., Basu, S., and Mooney, R. J. 2004. Integrating constraints and metric learning in semi-supervised clustering. In Proceedings of the Twenty-First International Conference on Machine Learning (Banff, Alberta, Canada, July 4-8, 2004). ICML '04, vol. 69. ACM, New York, NY, 11.
[7] Castro-Herrera, C., Duan, C., Cleland-Huang, J., and Mobasher, B. 2008. Using Data Mining and Recommender Systems to Facilitate Large-Scale, Open, and Inclusive Requirements Elicitation Processes. Short paper, IEEE Conf. on Requirements Eng. (Barcelona, Spain, Sept. 2008).
[8] Cleland-Huang, J. and Mobasher, B. 2008. Using Data Mining and Recommender Systems to Scale up the Requirements Process. ACM Int'l Workshop on Ultra-Large Software Systems (Leipzig, Germany, May 2008), 3-6.
[9] Cohn, D., Caruana, R., and McCallum, A. 2003. Semi-supervised clustering with user feedback. Technical Report TR2003-1892, Cornell University.
[10] Davidson, I. and Ravi, S. S. 2005. Hierarchical clustering with constraints: theory and practice. In Proc. 9th European Conf. on Principles and Practice of Knowledge Discovery in Databases (PKDD '05), Porto, Portugal, 59-70.
[11] Davidson, I. and Ravi, S. S. 2006. Identifying and Generating Easy Sets of Constraints for Clustering. In Proc. 21st AAAI Conference on Artificial Intelligence.
[12] Davidson, I., Wagstaff, K., and Basu, S. 2006. Measuring Constraint-Set Utility for Partitional Clustering Algorithms. In Proceedings of ECML/PKDD 2006.
[13] Davidson, I. and Ravi, S. S. 2007. Intractability and clustering with constraints. In Proceedings of the 24th International Conference on Machine Learning (Corvallis, Oregon, June 20-24, 2007). ICML '07, vol. 227. ACM, New York, NY, 201-208.
[14] Duan, C. and Cleland-Huang, J. 2007. Clustering support for automated tracing. In Proceedings of the Twenty-Second IEEE/ACM International Conference on Automated Software Engineering (Atlanta, Georgia, USA, November 5-9, 2007). ASE '07. ACM, New York, NY, 244-253.
[15] Duan, C. 2008. Clustering and its Application in Requirements Engineering. Technical Report #08-001, School of Computing, DePaul University, February 2008.
[16] Fern, X. Z. and Brodley, C. E. 2003. Random projection for high dimensional data clustering: A cluster ensemble approach. In Proc. of ICML '03, Washington, DC, 186-193.
[17] Fern, X. Z. and Brodley, C. E. 2004. Solving cluster ensemble problems by bipartite graph partitioning. In Proceedings of the Twenty-First International Conference on Machine Learning (Banff, Alberta, Canada, July 4-8, 2004). ICML '04, vol. 69. ACM, New York, NY, 36.
[18] Fred, A. L. and Jain, A. K. 2005. Combining Multiple Clusterings Using Evidence Accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 6, 835-850, June 2005.
[19] Greene, D. 2007. Constraint Selection by Committee: An Ensemble Approach to Identifying Informative Constraints for Semi-Supervised Clustering. In Proc. 18th European Conf. on Machine Learning (ECML '07), Springer, 140-151.
[20] Jain, A. K. and Dubes, R. C. 1988. Algorithms for Clustering Data. Prentice-Hall, Inc.
[21] Li, T., Ding, C., and Jordan, M. I. 2007. Solving Consensus and Semi-supervised Clustering Problems Using Nonnegative Matrix Factorization. In Proc. ICDM 2007, 577-582.
[22] Strehl, A. and Ghosh, J. 2003. Cluster ensembles: a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 3 (Mar. 2003), 583-617.
[23] Tang, W., Xiong, H., Zhong, S., and Wu, J. 2007. Enhancing semi-supervised clustering: a feature projection perspective. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Jose, California, USA, August 12-15, 2007). KDD '07. ACM, New York, NY, 707-716.
[24] Topchy, A., Jain, A. K., and Punch, W. 2003. Combining Multiple Weak Clusterings. In Proceedings of the Third IEEE International Conference on Data Mining (November 19-22, 2003). ICDM '03. IEEE Computer Society, Washington, DC, 331.
[25] Wagstaff, K., Cardie, C., Rogers, S., and Schrödl, S. 2001. Constrained K-means Clustering with Background Knowledge. In Proceedings of the Eighteenth International Conference on Machine Learning (June 28 - July 1, 2001). Morgan Kaufmann Publishers, San Francisco, CA, 577-584.
[26] Xing, E. P., Ng, A. Y., Jordan, M. I., and Russell, S. 2003. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems 15. MIT Press.