JOURNAL OF COMPUTER SCIENCE AND ENGINEERING, VOLUME 3, ISSUE 2, OCTOBER 2010

Automated Constraint Induction in Semi-supervised Clustering

Ákos Dudás and Saikat Mukherjee

Abstract—Organizations are collecting enormous amounts of structured and unstructured data with a view towards improving their product and service offerings. The data is often mined to discover interesting patterns, and clustering is one of the most frequently used analytical techniques in this process. Due to the impurity and noise present in data collected from diverse heterogeneous sources, traditional completely unsupervised clustering techniques typically do not yield high accuracy. Of late, there has been research on using limited supervision in the form of must-link and cannot-link pairwise data constraints to improve the clustering. However, while such constrained clustering techniques significantly increase the accuracy, their application to real life data has been limited due to the expensive manual effort required in generating these constraints. In this paper, we investigate techniques for automatically inducing must-link and cannot-link constraints from the data. Our techniques are based on self-training and co-training concepts in semi-supervised learning and leverage the insight that real life data often have redundancy in their content. We experimentally demonstrate the effectiveness of these techniques on benchmark datasets as well as a real life customer notification dataset and show that, optionally starting from a small manual seed constraint set, they are able to automatically induce constraints and generate clusters with accuracy similar to or higher than completely manual constrained clustering.

Index Terms—clustering, constrained clustering, semi-supervised learning


1 INTRODUCTION

There has been significant growth in the amount of data collected by various organizations. Marketing groups are interested in user visits to web sites, customer service groups are collecting data on product complaints and notifications, while operational groups are collecting information on the functional usage of equipment, to name a few. Driving this trend to gather data are advanced analytical techniques which mine the information to detect patterns and use them for better product and service offerings, such as predictive maintenance or customer segmentation and profiling. However, the nature of these data sources presents new challenges for analytical techniques. For instance, not only can the data be in loosely structured or completely unstructured free-text format, but it can also be noisy, since it is typically collected from diverse heterogeneous sources. Hence, it is not just the scale of the data, but also its format and quality, that presents challenges.

Clustering has been one of the most common data mining techniques used for analysis. Different kinds of clustering, utilizing a variety of distance metrics, have been used to discover common issues within the data. While it has been widely adopted in industry, primarily due to the unsupervised nature of the technique, its accuracy in segmenting noisy data has been limited. For instance, real life unstructured text data typically contain

typos; ambiguous, incomplete, and incorrect sentences; and domain jargon. Furthermore, a significant portion of these text data sets lie on either end of the sparsity spectrum: some are very verbose, where sometimes even human experts have difficulty agreeing on the topic described in the data, while others simply contain a few words. While text data is particularly difficult, these challenges are manifested in other, non-textual datasets as well, where vanilla clustering does not result in good performance.

In order to address the limitations of vanilla clustering techniques on these datasets, researchers have proposed constrained clustering algorithms where the user provides limited supervision to achieve a higher degree of accuracy. In particular, must-link and cannot-link pairwise constraints specify which data instances should or should not belong together in the same cluster, respectively. Numerous techniques for constrained clustering have been proposed [1] and they have been experimentally demonstrated to yield significant improvement over completely unsupervised clustering. However, despite these improvements, the application of constrained clustering to real life data mining has been limited. A major stumbling block to widespread adoption is the necessity of providing constraints, which is usually manual and, hence, time consuming and expensive. Often, the number of constraints required is of the same order as the number of data instances, which makes it very challenging to scale these techniques to practical industry data sets.

————————————————
• Á. Dudás is a Ph.D. student at the Department of Automation and Applied Informatics, Budapest University of Technology and Economics, 1111 Budapest, Goldmann Gy. tér 3, Hungary.
• S. Mukherjee is with Intuit Inc., 2475 Garcia Avenue, Mountain View, CA.

© 2010 JCSE http://sites.google.com/site/jcseuk/


In this paper, we investigate automatic ways of generating virtual constraints from the data which, along with a small seed set of manual constraints, achieve the twin objectives of a high degree of accuracy as well as limited user supervision. These virtual constraints are also of the must-link and cannot-link form, indicating pairwise cluster membership. The key insight behind our technique is that even though the overall dataset may be noisy, there might be certain projections of the dataset which have redundancy in information and hence can be used to identify constraints. These constraints can then be fed into the process and the entire clustering redone to identify even more constraints. The entire process can be bootstrapped with a small set of manually derived seed constraints to guide the initialization, after which successive iterations automatically generate virtual constraints.

Our techniques are motivated by research in semi-supervised learning. Note that the problem of inducing constraints in unsupervised clustering is similar to acquiring labeled training data in supervised classification [2][3][4][5], to cite a few. In particular, the projection of the dataset with redundant information from which new constraints can be induced could be either on the data instances or on the feature space. The former is similar to self-training [4], whereby the redundancy in information among certain data instances is used to generate new labels in a classification scenario. The latter, on the other hand, is similar to co-training [6][7][8], whereby the redundancy in information in the feature space is exploited to train dual classifiers which iteratively improve each other's labels. In our work, we have applied these principles from semi-supervised classification to improve constrained clustering frameworks.
The fundamental contribution of our work is in making existing constrained clustering techniques applicable to real life problems where providing a large number of manual constraints is prohibitively expensive. To this end, we have leveraged ideas from semi-supervised learning and demonstrate that, starting from a small seed of manual constraints, it is possible to automatically generate virtual constraints which can be used in existing constrained clustering frameworks. Thus, the goal of higher clustering accuracy is met without the expensive manual effort typical of constrained clustering techniques.

The rest of the paper is organized as follows. Section 2 presents the related literature and gives a brief introduction to the clustering techniques used in this paper. In Section 3, we present our techniques for automatically inducing constraints from data. Section 4 describes the datasets used in our experiments, including multiple benchmark datasets from http://archive.ics.uci.edu/ml/datasets/ as well as a real life customer notification dataset. Experimental results are presented in Section 5 and we conclude in Section 6.

2 RELATED WORK

Traditional clustering approaches typically use the vector-space model [9][10] to represent instances. In the case of

text documents, where a document can be a web page, an email, or any block of unstructured text in general, the words of the documents (called "features") are treated like a "bag of words" [9][11], where their order and other syntactic cues such as sentence and paragraph boundaries are ignored. The weight of each feature can be either binary (denoting presence or absence) or a real value computed from the frequency and other feature weighting techniques such as the tf-idf model [9][12].

The KMeans [13] clustering algorithm is one of the most common clustering techniques used in practice. The instances are represented by their feature vectors and a distance metric is used to measure the similarity between them. KMeans iterates between an assignment and a re-estimation step: in the first, it assigns every item to the closest cluster, represented by its centroid (center of mass vector); in the second, it updates the centroids. The iteration stops when no item is moved to a new cluster in the assignment step, or when a given maximum number of iterations is reached. KMeans is usually initialized with random items selected as the initial cluster centroids. A variety of modifications to the basic KMeans, as well as a host of distance metrics, have been proposed in the clustering community.

In the family of model-based probabilistic clustering, the EM (expectation maximization) technique using Gaussian mixture models [14] is widely used. This technique assumes that the items are generated by Gaussian distributions and tries to estimate the parameters of these distributions. Instead of hard assignments (every item belongs to one and only one cluster), this algorithm takes a soft-assignment approach, that is, items are assigned to clusters with a certain probability. This, too, is an iterative algorithm; it repeats assignment and estimation steps until a certain exit criterion is met.
The criterion is usually a maximum number of iterations or no change in the squared error of the model. Similar to KMeans, a number of different distributions and distance measures have been proposed in the EM family of algorithms over the years. The accuracy of these techniques on noisy data, particularly text, is limited, and there is not much improvement even after incorporating various feature selection and weighting techniques. In contrast to being completely unsupervised, we investigate incorporating limited user supervision to improve the clustering accuracy.

Constrained clustering is a recent development in the research community (see [1] for a survey) for using background knowledge in a clustering framework. The work in [15] was an early pioneer in this direction, and subsequently a whole body of research [1][16][17], to cite a few, developed in this area. Most of these techniques rely on instance-level must-link and cannot-link rules, specifying that certain data items should or should not be, respectively, in the same cluster. While initial techniques interpreted constraints in a hard way (they must be met), later techniques introduced softer constraints with a penalty for not satisfying them. It is shown in [18] that satisfying every constraint can be intractable, and that not strictly enforcing the constraints but rather using them as guidelines to
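To make the procedure concrete, the KMeans loop described above can be sketched in a few lines. This is a minimal NumPy illustration of the textbook algorithm, not the implementation evaluated in this paper; the `init` parameter for fixing the initial centroids is our own addition for reproducibility:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0, init=None):
    """Minimal KMeans: alternate an assignment step (each item to its
    nearest centroid) and a re-estimation step (centroids recomputed as
    the mean of their members) until no item changes cluster."""
    rng = np.random.default_rng(seed)
    if init is None:
        # Random items as the initial cluster centroids, as is customary.
        init = rng.choice(len(X), size=k, replace=False)
    centroids = X[np.asarray(init)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Assignment step: distance of every item to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # no item moved: converged
            break
        labels = new_labels
        # Re-estimation step: update each non-empty cluster's centroid.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```

A constrained variant such as MPC-KMeans modifies the assignment step (and the metric itself) to penalize constraint violations, but the skeleton remains this two-step iteration.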


learn the distance metric produces better results. Most algorithms expect all constraints to be provided at the same time; [19] discusses techniques for when the constraints are provided incrementally as feedback.

Advanced algorithms, such as MPC-KMeans [17], extend the idea of using constraints by learning the distance metric to satisfy the constraints. The algorithm is derived from KMeans; it incorporates constraints in the distance metric and applies penalties when they are violated. It also updates the distance metric at the end of every iteration to minimize the number of unsatisfied constraints. A similar approach of using constraints for initialization and for metric learning, based on Hidden Markov Random Fields, is presented in [16]. HMRF-KMeans is an EM-based iterative algorithm using a probabilistic framework. This method also allows non-Euclidean distances, such as the Cosine metric [20], which has been shown to perform better on text data.

Basu et al. in [21] developed a method to select the most informative constraints. Given an oracle that knows the true labels of the instances, active learning selects the most informative subset of points, which are then turned into constraints. Their experimental results show that actively selecting constraints produces better results than randomly generating constraints by choosing pairs of instances and comparing their labels. Davidson et al. in [22] demonstrated that (randomly chosen) constraint sets vary significantly in how useful they are for constrained clustering and may even decrease performance. They proposed two external constraint set measures, namely informativeness and coherence, and showed that gains in performance can be attributed to constraint sets of large informativeness and coherence. Intuitively, informativeness captures the amount of information that the algorithm cannot determine on its own, while coherence measures the agreement within the constraint set.
Table 1 summarizes the effectiveness of two of these constrained clustering techniques, namely MPC-KMeans [17] and HMRF-KMeans [16], compared to traditional KMeans. The effectiveness is measured in terms of F-measure (the harmonic mean of recall and precision) and NMI (normalized mutual information); further details of these measures are described in Section 5.

TABLE 1. MPC-KMEANS AND HMRF-KMEANS: THE REPORTED RESULTS.

algorithm    | evaluated on              | measure | improvement
MPC-KMeans   | UCI Ionosphere & UCI Iris | F-Meas. | 0.6 to 0.62 and 0.64 to 0.92, respectively
HMRF-KMeans  | UCI 20 newsgroups         | NMI     | 0.05 to 0.8, 0.04 to 0.34, 0.02 to 0.35 (different data subsets)
It is worth noting that the above techniques focus primarily on advanced clustering models for incorporating constraints with different distance metrics. The implicit assumption is that constraints are manually provided by users, which, in practice, turns out to be a bottleneck for widespread adoption of these techniques. To that end,

our work is complementary to these techniques. We focus on automatically inducing constraints, starting from a small manual seed set of constraints or no constraints at all, and the induced constraints can subsequently be used with any of the above clustering methods.

Note that the problem of mining must-link constraints is different from frequent itemset mining [23][24]. In frequent itemset mining, typically a set of transactions, each with its feature set, is given, and the algorithm discovers the frequently occurring feature subsets among these transactions. When naively applied to constraint mining, this might discover associations between commonly occurring features which are true but not of much value, since they do not reveal any information beyond what can be obtained by the clustering algorithm's similarity metric. It is also not clear how frequent itemset mining techniques can be modified to mine cannot-link constraints.

3 CONSTRAINT INDUCTION

Semi-supervised clustering uses a small amount of supervised data (i.e., constraints) to aid unsupervised learning. It has been shown that semi-supervised algorithms produce clusters of better quality than unsupervised methods. However, most works assume that the constraints are given, while in reality the cost of obtaining constraints might be high; hence, clustering should use as few constraints as possible. This section presents two ideas for inducing constraints automatically, without having to resort to manual work.

3.1 Self-training with Constraints

Our self-training based constraint induction is motivated by the work of Nigam, McCallum, Thrun and Mitchell in [4], which describes an EM framework for training a classifier from a set of labeled samples and a larger set of unlabeled samples. The algorithm trains an initial classifier using the labeled dataset and then probabilistically labels the unlabeled items with this classifier. In successive iterations, the classifier is refined using the original labeled samples as well as high confidence newly labeled samples from the previous iteration's classifier.

Given a small set of constraints, our self-training approach extracts additional constraints to enhance the clustering. The algorithm starts as a regular constrained clustering algorithm (e.g. MPC-KMeans or HMRF-KMeans) and clusters the data using the initially provided constraints. Using the clusters formed by this initial iteration, new "virtual" constraints are extracted (see below) and the constrained clustering is executed again, now using both the initial constraints and the new virtual constraints. The iteration continues as long as there are changes in the assignment of the items (or until a maximum number of iterations is reached). The key to the algorithm is in choosing the "virtual" constraints. The term "virtual" denotes that these constraints do not necessarily represent the ground truth.
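The loop just described can be sketched as follows. The callables `constrained_cluster` (standing in for e.g. MPC-KMeans or HMRF-KMeans) and `extract_virtual` (the virtual constraint selection described below) are placeholder names of our own; only the control flow comes from the description above:

```python
def self_training_cluster(X, k, seed_constraints, constrained_cluster,
                          extract_virtual, max_iter=10):
    """Self-training with constraints: cluster with the seed constraints,
    extract virtual constraints from the result, re-cluster with seed +
    virtual constraints, and repeat until the assignment stops changing
    (or max_iter is reached)."""
    labels = constrained_cluster(X, k, seed_constraints)
    for _ in range(max_iter):
        virtual = extract_virtual(X, labels)
        new_labels = constrained_cluster(X, k, seed_constraints + virtual)
        if list(new_labels) == list(labels):  # no assignment changed
            break
        labels = new_labels
    return labels
```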


These virtual constraints are extracted from the clusters created in any given iteration and are used to seed the clustering in the next iteration. Items that are in the same cluster can be translated into must-link constraints, and items in different clusters can be translated into cannot-link constraints. However, not every pair of points should be used to create constraints. Only decisions that the algorithm is confident enough about should be used: items close to the cluster centroids and assigned to the same cluster are very likely to belong together, while items assigned to different clusters, each close to its own cluster centroid, are indeed likely to belong to different clusters. These virtual constraints reinforce the decisions of the clustering algorithm, while still allowing the assignment of data points to change in successive iterations.

We present two approaches for selecting pairs of items to be used as constraints. The first approach is based on random probabilistic selection: an instance pair is chosen with probability inversely proportional to the distances of its members from their respective cluster centroids, and the weight of the constraint is the inverse of this distance. This weight measures how confident we are about the constraint. Points very close to the centroids will be chosen with high probability, reflecting that these decisions are the most likely to be true, and their weight will be high. The second approach ranks the pairs of points according to the previously mentioned distance and chooses the first n pairs. The weight of the constraint is the same as in the previous case.
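The second, distance-ranked approach can be illustrated with the following sketch. This is our own simplification: pairs are ranked by the sum of the members' distances to their cluster centroids, the first n pairs become constraints, and the weight is an inverse-distance confidence (the exact weighting formula here is an assumption, as the text only specifies "the inverse of this distance"):

```python
import math
from itertools import combinations

def extract_virtual_constraints(X, labels, centroids, n):
    """Rank item pairs by their summed distance to their own cluster
    centroids (a confidence proxy) and turn the n most confident pairs
    into must-link / cannot-link constraints."""
    dist = [math.dist(X[i], centroids[labels[i]]) for i in range(len(X))]
    pairs = sorted(combinations(range(len(X)), 2),
                   key=lambda p: dist[p[0]] + dist[p[1]])
    constraints = []
    for i, j in pairs[:n]:
        # Same cluster -> must-link; different clusters -> cannot-link.
        kind = 'must-link' if labels[i] == labels[j] else 'cannot-link'
        weight = 1.0 / (1.0 + dist[i] + dist[j])  # inverse-distance confidence
        constraints.append((i, j, kind, weight))
    return constraints
```

The probabilistic variant would replace the deterministic top-n selection with sampling proportional to the same inverse-distance scores.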

3.2 Co-training with Constraints

The use of constraints improves the quality of the clusters. However, these constrained clustering algorithms often require (depending on the dataset) approximately as many constraints as items (based on results shown in [17][25]). If such information is not available, or is difficult to obtain, a different approach is required. This section presents a new approach that allows the use of constrained clustering algorithms without a single manual constraint provided in advance.

This technique is motivated by the co-training [7] body of work in the semi-supervised classification literature. Co-training uses two classifiers, each of them separately trained on labeled examples from two distinct views, and refines each classifier using results from the other. An example of co-training is described in [6] for classifying web pages, where the words occurring on a web page and the words occurring on links that point to it were used as two distinct feature sets on which separate classifiers were trained and iteratively refined.

Our technique follows a similar approach. Given two views of the dataset, the first view is clustered using KMeans. Using the same constraint extraction methods as the ones presented in Section 3.1, constraints are extracted from the clusters (the number of constraints is a configurable parameter of the algorithm). Then, using these constraints, the second view is

clustered with a constrained clustering algorithm. If the initial clustering algorithm is KMeans, the constrained clustering algorithm could be MPC-KMeans or HMRF-KMeans. The algorithms should have the same or similar target functions (e.g. the minimization of the distortion) so that they do not optimize different objectives. This is not an iterative algorithm; constraint extraction and re-clustering are executed only once. It could be executed iteratively, just as discussed for self-training, but according to our observations multiple iterations did not change the final result significantly while incurring additional computational time.

This approach works best if the two views of the dataset are complementary in some sense: one of the two views results in clusters that we are not confident enough about to report as the final result, but confident enough about to use its decisions to seed the clustering of the second view. This assumption about the views of the dataset and their unequal information content, although specific, occurs frequently in real life datasets. For instance, documents, emails, and research papers often have titles, subjects, and other metadata which can be filled out by the author of the main text or by others, but which nevertheless contain information that is sometimes orthogonal to the main text.
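The co-training procedure of this section reduces to three steps, sketched below with placeholder callables of our own (`cluster` standing in for plain KMeans on the first view, `constrained_cluster` for e.g. MPC-KMeans or HMRF-KMeans on the second):

```python
def cotrain_cluster(view1, view2, k, cluster, constrained_cluster,
                    extract_constraints, n_constraints):
    """Co-training with constraints: cluster the first view without
    supervision, extract constraints from the result, then cluster the
    second view using those constraints. Executed once, not iteratively."""
    labels1 = cluster(view1, k)                        # step 1: first view
    constraints = extract_constraints(view1, labels1,  # step 2: constraints
                                      n_constraints)
    return constrained_cluster(view2, k, constraints)  # step 3: second view
```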

4 DATASETS USED FOR EVALUATION

We first describe the datasets used in the experimental evaluation before presenting the results. The following datasets were chosen based on their use in the cited constrained clustering works.

The first dataset is UCI Iris. All UCI datasets [26] are available at http://archive.ics.uci.edu/ml/. This dataset has 150 instances, four numerical features, and three true clusters. Simple KMeans creates clusters of good quality on this dataset. The second dataset is UCI Ionosphere, with 300 instances, 34 numerical features, and two classes. These two datasets are used for evaluating self-training.

The first dataset used for co-training is a subset of the 20 newsgroups dataset. As described in [16], 100 random instances have been selected from each of the following categories: comp.graphics, comp.os.ms-windows.misc, comp.windows.x. Following the removal of very frequent and very infrequent words, a regular tf-idf weighting scheme is applied. This dataset, called 20 newsgroups similar3, has 300 instances, three classes, and approximately 3500 features. It is more challenging, with a larger feature set, and simple KMeans generates clusters of low quality.

The second dataset used for co-training, Notifications, is a subset of a real life customer service text log with 89 instances and 633 features. The true clusters for the instances were determined manually and revised by experts, resulting in 27 clusters. The labeling process, which is the equivalent of defining constraints manually, gave us firsthand experience of how laborious acquiring the constraints can be. This is a text dataset too. Preprocessing includes tokenization, stop word removal,


fixing typographical errors using spell checkers, and finally, translation of non-English entries into English, followed by stemming. The feature vectors combine the regular tf-idf scheme with the domain specificity of the words, assigning higher weights to more domain specific features and lower weights to the rest. (Domain specificity is calculated using a lexicon of "common" words compiled from a large corpus of generic documents.) The challenge in this dataset lies in the large number of true clusters and in the relatively large size of the feature set.

Co-training typically works better when there are complementary feature sets such that decisions from one feature set can be fed to the other for iterative refinement. The 20 newsgroups dataset contains email messages with headers; the first view of the features is the subject and keyword fields extracted from the header and the second view is the body of the email. The Notifications dataset contains customer service entries. Each record has a long text description of an event (second view) and a short, few-word summary of the previous field (first view). While these feature sets are not strictly complementary, since the same person writes both the body of the email or the long text of the notification as well as the email subject or notification summary, respectively, we nevertheless found them to be different enough for applying the co-training principle.
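As an illustration of the combined weighting, the following sketch multiplies a standard tf-idf weight by a boost for terms absent from the common-word lexicon. The boost factor and its exact form are our own assumption; the paper does not give the formula:

```python
import math
from collections import Counter

def weighted_tfidf(docs, common_lexicon, boost=2.0):
    """Standard tf-idf, with each term's weight multiplied by `boost`
    when the term is NOT in the lexicon of "common" words, i.e. when it
    is presumed domain specific. `docs` is a list of token lists."""
    n = len(docs)
    # Document frequency of each term.
    df = Counter(w for d in docs for w in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        vec = {w: f * math.log(n / df[w]) *
                  (boost if w not in common_lexicon else 1.0)
               for w, f in tf.items()}
        vectors.append(vec)
    return vectors
```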

5 EVALUATION

The ideas presented in Section 3, both self-training and co-training, have been implemented and evaluated on the above datasets. The results are discussed in this section.

In our experiments, F-Measure and Normalized Mutual Information (NMI) [27] are used to evaluate the clustering. These two metrics were chosen following previous works in the field of constrained clustering [16][17]. Besides evaluating the final clusters, the quality of the virtual constraints is also of interest. A very simple metric was chosen to quantify the "goodness" of the virtual constraints: the number of constraints which represent the ground truth divided by the number of constraints extracted. Since the true labels are available, this metric can be evaluated simply by checking the labels. For the self-training technique, which is iterative and hence extracts constraints multiple times, the metric is calculated over the total number of constraints extracted across all iterations.
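Since the true labels are known, this quality metric is straightforward to compute; a minimal sketch (the constraint tuple format is our own illustrative choice):

```python
def constraint_quality(constraints, true_labels):
    """Fraction of extracted constraints that agree with the ground
    truth. A constraint is a tuple (i, j, kind, ...) where kind is
    'must-link' or 'cannot-link'."""
    good = 0
    for i, j, kind, *_ in constraints:
        same = true_labels[i] == true_labels[j]
        # A must-link is correct iff the items share a true label;
        # a cannot-link is correct iff they do not.
        if (kind == 'must-link') == same:
            good += 1
    return good / len(constraints)
```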

5.1 Self-training

Self-training is evaluated using MPC-KMeans and HMRF-KMeans as the underlying constrained clustering algorithms. Plain MPC-KMeans and HMRF-KMeans are compared to self-training with MPC-KMeans and self-training with HMRF-KMeans, respectively. Both the probabilistic virtual constraint extraction (denoted by random) and the method using the best distances (denoted by distance) are tested.

Every point along the plotted results corresponds to a different number of extracted virtual constraints. The number of initially provided manual constraints is in every case equal to the number of virtual constraints (the two numbers were chosen to be equal to reduce the number of test cases). This means, for instance, that when 100 constraints are used in the result graphs, an equal number of 100 virtual constraints are extracted and used in the F-Measure or NMI numbers for MPC-KMeans or HMRF-KMeans with virtual constraints. Every test scenario is an average over 50 independent executions of the algorithms.

Figures 1 and 2 show our results on the Iris dataset, and Figures 3 and 4 on the Ionosphere dataset, respectively. For Iris, a 1%-13% increase can be observed in both NMI and F-Measure in the lower to middle regions when the best virtual constraints were used; the random constraint selection is not considerably different from the original algorithms. For Ionosphere, MPC-KMeans shows an even more considerable gain in the middle regions, while HMRF-KMeans shows no change. In the region with a high number of constraints, no difference can be observed in either case; however, this is most likely because the high number of initial constraints overshadows the effect of the virtual constraints.

Fig. 1. MPC-KMeans with virtual constraints on the Iris dataset


Fig. 2. HMRF-KMeans with virtual constraints on the Iris dataset

Fig. 3. MPC-KMeans with virtual constraints on the Ionosphere dataset

Fig. 4. HMRF-KMeans with virtual constraints on the Ionosphere dataset

These experiments show that self-training using distance-based constraint selection can achieve a measurable and, in some cases, significant gain over a constrained clustering algorithm.

5.2 Co-training

Co-training is evaluated using the MPC-KMeans and HMRF-KMeans constrained clustering algorithms with constraint

extraction based on distances. It must be emphasized that no initial constraints are required for this method; hence, co-training should only be compared to non-constrained clustering algorithms, such as standard KMeans. The measured points along the plotted results correspond to different numbers of constraints extracted by our method. In every test scenario, an average over 50 independent executions of the algorithms is reported.


Fig. 5. Co-training on the Notifications dataset

Fig. 6. Co-training on the 20 newsgroups - similar3 dataset

Figures 5 and 6 show the co-training results obtained on the Notifications and 20 newsgroups - similar3 datasets, respectively. Co-training with MPC-KMeans is able to increase F-Measure from approximately 0.4 to 0.5 for 20 newsgroups - similar3 and from 0.07 to 0.15 for Notifications. The change in NMI is less stable, but quite significant, especially for 20 newsgroups - similar3. HMRF-KMeans shows varying behavior; mostly it oscillates around KMeans.

The performance of the baseline KMeans on Notifications might require some explanation, as an F-Measure value of 0.07 is very low. The true label of any data point in this dataset is strongly influenced by the presence or absence of certain key words, while the data point itself is quite verbose. Hence, there is significant noise in the data. This is expected in verbose customer notification data, where the problem is typically described in detail and very often touches upon many different issues related to the root cause (the true label), which makes identifying the root cause difficult even for subject matter experts. Consequently, there is significant improvement in performance on this dataset with even a small number of constraints. It is important to notice that the results are, up to a certain point, mostly independent of the number of constraints used for co-training; even 50 constraints help.

The results are significant particularly given that most constrained clustering algorithms, when not used in co-training, would require about 100-300 constraints for datasets of these sizes. The low number of constraints is also beneficial when considering the performance of constrained clustering algorithms, which can slow down significantly with an increasing number of constraints. Thus, co-training is a simple and inexpensive method which can provide clusters of better quality than standard non-constrained clustering algorithms.

5.3 Quality of the virtual constraints

The quality of the virtual constraints (how closely they mirror valid constraints) is plotted in Figure 7. We selected self-training on the Ionosphere dataset and co-training on the Notifications dataset to illustrate the power of the virtual constraint extraction algorithms presented in this paper. The first conclusion is that the random probabilistic virtual constraint selection performs worse than the selection based on distances. This confirms why the quality of the clusters obtained with the former method (during self-training in Section 5.1) is worse than with the latter. The second observation is that the quality of the virtual


Fig. 7. Quality of the virtual constraints on the Ionosphere (left) and Notifications (right) dataset

constraints extracted from MPC-KMeans is better than from HMRF-KMeans for both test cases. This is because HMRF-KMeans produces clusters of lower quality on these datasets compared to MPCKMeans. Consequently, the constraints obtained from HMRFKMeans are not as good as MPC-KMeans. This experiment also shows the dependency of the automated constraint generation technique on the underlying constraint clustering method. And finally, and most importantly, the quality of the virtual constraints does not change (significantly) with the number of constraints extracted. This means a large number of virtual constraints can indeed be used in many situations
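The quality measure plotted in Figure 7 can be read as the fraction of induced constraints that are consistent with the true labels. A minimal sketch (the helper name is ours):

```python
def constraint_quality(must_links, cannot_links, y):
    """Fraction of virtual constraints that agree with the true labels y:
    a must-link is valid iff the pair shares a true label, and a
    cannot-link is valid iff it does not."""
    checks = [y[a] == y[b] for a, b in must_links]
    checks += [y[a] != y[b] for a, b in cannot_links]
    if not checks:
        return 0.0
    return sum(checks) / len(checks)
```

Because this ratio stays roughly flat as more constraints are extracted, extracting many virtual constraints does not dilute their reliability.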

6 CONCLUSION

Despite their success in the research community, constrained clustering techniques have not yet been widely deployed for real life clustering. One of the principal bottlenecks to broader adoption is the manual effort required to generate constraints. Often, the number of constraints required to achieve a high degree of accuracy is of the same order as the number of data instances, which makes it prohibitive to apply these techniques to large real life data sets. In this paper, we have addressed this limitation by demonstrating that self-training and co-training algorithms can be used to automatically generate constraints and thus achieve the high accuracy of constrained clustering without significant manual effort in creating the constraints. We have empirically validated our techniques on standard datasets as well as a real life customer notification dataset and demonstrated their effectiveness. In the future, we plan to evaluate the techniques on much larger datasets, where the scale-up and the consequent cost savings in constraint generation can be better observed. It would also be of interest to evaluate the quality of the virtual constraints according to [22], namely their informativeness and coherence, and to investigate modifying the formulation to accommodate cannot-link constraints. Another direction would be adding measures of uncertainty to the virtual constraints mined from the data, which would make the techniques more robust to noise in the automatically induced constraints.

ACKNOWLEDGMENT

This work is connected to the scientific program of the "Development of quality-oriented and cooperative R+D+I strategy and functional model at BUTE" project. This project is supported by the New Hungary Development Plan (Project ID: TÁMOP-4.2.1/B-09/1/KMR-2010-0002).

REFERENCES

[1] S. Basu, I. Davidson, and K. Wagstaff, Eds., Constrained Clustering: Advances in Algorithms, Theory, and Applications. Chapman & Hall/CRC, 2008.
[2] B. Shahshahani and D. Landgrebe, "The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon," IEEE Transactions on Geoscience and Remote Sensing, vol. 32, no. 5, pp. 1087-1095, 1994.
[3] D. Miller and H. Uyar, "A mixture of experts classifier with learning based on both labelled and unlabelled data," Advances in Neural Information Processing Systems, vol. 9, pp. 571-577, 1997.
[4] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell, "Text classification from labeled and unlabeled documents using EM," Machine Learning, vol. 39, no. 2-3, pp. 103-134, 2000.
[5] N. Chawla and G. Karakoulas, "Learning from labeled and unlabeled data: An empirical study across techniques and domains," Journal of Artificial Intelligence Research, vol. 23, no. 1, pp. 331-366, 2005.
[6] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in Proceedings of the Eleventh Annual Conference on Computational Learning Theory, 1998, pp. 92-100.
[7] X. Zhu, "Semi-supervised learning literature survey," Technical Report, 2008. Available: http://www.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf
[8] S. Goldman and Y. Zhou, "Enhancing supervised learning with unlabeled data," in Proceedings of the Seventeenth International Conference on Machine Learning, 2000, pp. 327-334.


[9] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval. New York, NY, USA: McGraw-Hill, 1986.
[10] G. Salton, A. Wong, and C. S. Yang, "A vector space model for automatic indexing," Communications of the ACM, vol. 18, no. 11, pp. 613-620, 1975.
[11] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing and Management, vol. 24, pp. 513-523, 1988.
[12] K. Spärck Jones, "A statistical interpretation of term specificity and its application in retrieval," Journal of Documentation, vol. 28, no. 1, pp. 11-21, 1972.
[13] S. Lloyd, "Least squares quantization in PCM," IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129-137, 1982.
[14] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B (Methodological), vol. 39, no. 1, pp. 1-38, 1977.
[15] K. Wagstaff, C. Cardie, and S. Schroedl, "Constrained k-means clustering with background knowledge," in Proceedings of the Eighteenth International Conference on Machine Learning, 2001, pp. 577-584.
[16] S. Basu, M. Bilenko, and R. J. Mooney, "A probabilistic framework for semi-supervised clustering," in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '04). New York, NY, USA: ACM, 2004, pp. 59-68.
[17] M. Bilenko, S. Basu, and R. J. Mooney, "Integrating constraints and metric learning in semi-supervised clustering," in Proceedings of the Twenty-First International Conference on Machine Learning (ICML '04). New York, NY, USA: ACM, 2004, p. 11.
[18] I. Davidson and S. S. Ravi, "Intractability and clustering with constraints," in Proceedings of the 24th International Conference on Machine Learning (ICML '07). New York, NY, USA: ACM, 2007, pp. 201-208.
[19] I. Davidson, S. S. Ravi, and M. Ester, "Efficient incremental constrained clustering," in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '07). New York, NY, USA: ACM, 2007, pp. 240-249.
[20] S.-H. Cha, "Comprehensive survey on distance/similarity measures between probability density functions," International Journal of Mathematical Models and Methods in Applied Sciences, vol. 1, no. 4, pp. 300-307, 2007.
[21] S. Basu, A. Banerjee, and R. J. Mooney, "Active semi-supervision for pairwise constrained clustering," in Proceedings of the 2004 SIAM International Conference on Data Mining (SDM '04), 2004, pp. 333-344.
[22] I. Davidson, K. L. Wagstaff, and S. Basu, "Measuring constraint-set utility for partitional clustering algorithms," in Proceedings of the Tenth European Conference on Principles and Practice of Knowledge Discovery in Databases. Springer, 2006, pp. 115-126.
[23] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. Verkamo, "Fast discovery of association rules," Advances in Knowledge Discovery and Data Mining, vol. 12, pp. 307-328, 1996.
[24] J. Han, J. Pei, Y. Yin, and R. Mao, "Mining frequent patterns without candidate generation: A frequent-pattern tree approach," Data Mining and Knowledge Discovery, vol. 8, no. 1, pp. 53-87, 2004.
[25] K. Wagstaff and C. Cardie, "Clustering with instance-level constraints," in Proceedings of the Seventeenth International Conference on Machine Learning, 2000, pp. 1103-1110.

[26] A. Asuncion and D. Newman, "UCI machine learning repository," 2007. Available: http://archive.ics.uci.edu/ml
[27] B. E. Dom, "An information-theoretic external cluster-validity measure," Technical Report, 2001.

Ákos Dudás is a Ph.D. student at the Department of Automation and Applied Informatics, Budapest University of Technology and Economics, Hungary.

Saikat Mukherjee is currently with Intuit Inc., Mountain View, CA; he previously worked for Siemens Corporate Research in Princeton, NJ.
