WITHIN-CLASS AND UNSUPERVISED CLUSTERING IMPROVE ACCURACY AND EXTRACT LOCAL STRUCTURE FOR SUPERVISED CLASSIFICATION BY DMITRIY FRADKIN
A dissertation submitted to the Graduate School—New Brunswick Rutgers, The State University of New Jersey in partial fulfillment of the requirements for the degree of Doctor of Philosophy Graduate Program in Computer Science Written under the direction of Casimir Kulikowski and approved by
New Brunswick, New Jersey January, 2006
© 2006
Dmitriy Fradkin
ALL RIGHTS RESERVED
ABSTRACT OF THE DISSERTATION
Within-Class and Unsupervised Clustering Improve Accuracy and Extract Local Structure for Supervised Classification
by Dmitriy Fradkin Dissertation Director: Casimir Kulikowski
Deterministic clustering methods at different levels of granularity, such as within classes, at the class level, and across classes, are investigated for their effect on classification performance in a series of empirical studies. Specifically, I have found that clustering within classes, by extracting local structure, can improve supervised learning performance in many cases. This approach, and unsupervised clustering across entire datasets, are usually better for prediction than clustering or grouping within-class clusters across classes, or even global classifier approaches. These conclusions are supported by more than 3000 experiments on four benchmark datasets, using some of the most powerful automated classification methods, such as regularized logistic regression and Support Vector Machines (SVMs). I have used a simple combination of unsupervised clustering and classification methods to search for local structure in the data and detect locally significant features. The approach is illustrated by an analysis of lung cancer survival data from records of 200,000 patients.
Acknowledgements
I would like to thank my advisor, Dr. Casimir Kulikowski, for the support and advice that he provided over the years, and for his patience in imparting it. I am extremely grateful to Dr. David Madigan for his guidance in my first successful research project and for involving me in the Monitoring Message Streams (MMS) project at DIMACS, where I gained a lot of experience, particularly from working with Dr. Kantor. Both Dr. Kantor and Dr. Madigan also generously gave their time and energy as members of my dissertation committee, and my thesis is significantly better for it. I also thank Dr. Tong Zhang for agreeing to be the external committee member and for his valuable feedback. This dissertation would not have been possible without the guidance of Dr. Ilya Muchnik, who has played an invaluable role in my graduate education and growth as a researcher. That, and the life wisdom he has tried to impart to me, will always remain in my memory. My work on the MMS project, though not directly related to my thesis work, was a valuable and pleasant experience, and I would like to thank all involved in it whom I did not already mention: Dr. Fred Roberts, Dr. Alex Genkin, Dr. David D. Lewis, Dr. Michael Littman, Dr. Muthu Muthukrishnan, Andrei Anghelescu, Suhrid Balakrishnan and Aynur Dayanik. I would like to thank my fellow students who contributed to many of my learning experiences. In particular, Andrei Anghelescu, while peacefully sharing an office with me for more than 3 years (no mean feat in itself), has delivered me from many hours of misery with his programming skills and Linux knowledge. I am grateful to all members of the Machine Learning Reading group (Suhrid Balakrishnan, Paul Kry, Lihong Li, Ofer Melnik, Chris Mesterharm, Akshay Vashisht) for sharing a journey through the jungle of advanced topics on the subject. My thanks also go to the many others who have helped me along the way: students, staff and faculty of the Computer Science Department, and visitors and members of DIMACS. Last, but definitely not least, I want to thank my family and my friends, who always were, and still are, my source of comfort and strength.
Dedication
To my parents, with deep appreciation of their love, patience and support.
Table of Contents
Abstract . . . . . . . . ii
Acknowledgements . . . . . . . . iii
Dedication . . . . . . . . iv
1. Introduction . . . . . . . . 1
1.1. Two Uses of Classification and Prediction Methods . . . . . . . . 3
1.2. Classification Methods and Data Analysis . . . . . . . . 3
1.3. Fuzzy and Deterministic Clustering in Classification . . . . . . . . 4
1.4. Our Approaches . . . . . . . . 6
1.5. Methods . . . . . . . . 8
1.6. The Dissertation Structure . . . . . . . . 9
1.7. Notation . . . . . . . . 9
2. Clustering and Hierarchical Classification . . . . . . . . 11
2.1. Clustering Inside the Classes . . . . . . . . 11
2.1.1. The Main Idea . . . . . . . . 11
2.1.2. Relevant Literature . . . . . . . . 13
2.1.3. Our Approach . . . . . . . . 14
2.2. Error-Based Class Aggregation . . . . . . . . 15
2.2.1. The Main Idea . . . . . . . . 15
2.2.2. Existing Work on Error-Based Class Aggregation . . . . . . . . 17
2.2.3. Other Relevant Literature . . . . . . . . 19
2.2.4. Our Approach . . . . . . . . 20
Generalized K-Means Criterion . . . . . . . . 20
Cluster Analysis of Confusion Matrix . . . . . . . . 22
Constructing a Hierarchical Classifier Based on Meta-Classes . . . . . . . . 26
2.3. Combining Clustering Inside Classes with Error-Based Class Aggregation . . . . . . . . 26
2.4. Input-Partitioning Hierarchical Classifiers . . . . . . . . 27
2.4.1. Existing Work . . . . . . . . 28
2.4.2. Our Approach . . . . . . . . 30
2.5. Summary . . . . . . . . 33
3. Experimental Work on Benchmark Datasets . . . . . . . . 35
3.1. Introduction . . . . . . . . 35
3.2. Datasets . . . . . . . . 36
3.3. Feature Normalization . . . . . . . . 37
3.4. Methods . . . . . . . . 38
3.4.1. Multiclass Classification . . . . . . . . 39
3.5. Statistical Significance Testing . . . . . . . . 40
3.6. Empirical Evaluation of the Clustering Inside Classes . . . . . . . . 41
3.7. Clustering Inside Classes on Synthetic Data . . . . . . . . 44
3.8. Empirical Evaluation of the Error-Based Class Aggregation . . . . . . . . 47
3.8.1. Experiment on Synthetic Data . . . . . . . . 47
3.8.2. Experiment on the Benchmark Data . . . . . . . . 48
3.9. Empirical Evaluation of the CIC+EBCA . . . . . . . . 52
3.10. Empirical Evaluation of the HGC Method . . . . . . . . 53
3.11. A Brief Look at Class-wise Performance . . . . . . . . 57
3.12. Evaluation of Factors Affecting CIC and HGC performance . . . . . . . . 59
3.12.1. CIC results . . . . . . . . 59
3.12.2. HGC results . . . . . . . . 63
3.13. Conclusions . . . . . . . . 63
3.14. Further Work . . . . . . . . 64
4. Classification Methods in Epidemiology . . . . . . . . 65
4.1. Model Design and Analysis in Epidemiology . . . . . . . . 66
4.2. Feature Significance Estimation . . . . . . . . 66
4.3. Our Approach: Interesting Subsets . . . . . . . . 69
4.3.1. Related Data Mining Approaches . . . . . . . . 71
4.3.2. Analyzing Final Classifiers . . . . . . . . 71
4.4. Summary . . . . . . . . 72
5. Constructing a Global Model . . . . . . . . 73
5.1. SEER Data Preparation . . . . . . . . 73
5.1.1. SEER Data Format . . . . . . . . 73
5.1.2. Feature Conversion . . . . . . . . 74
5.1.3. Data Processing . . . . . . . . 75
5.1.4. Descriptive Analysis . . . . . . . . 77
5.1.5. Missing Value Analysis . . . . . . . . 77
5.2. The Baseline Result . . . . . . . . 80
5.3. Effect of Changing Coding . . . . . . . . 82
5.4. Capturing Variable Interactions . . . . . . . . 83
5.5. Comparison with SVM . . . . . . . . 84
5.6. Summary of Classifier Effectiveness Comparisons . . . . . . . . 86
5.7. Analysis of Variable Importance in the Global Classifier Model . . . . . . . . 87
5.8. Summary . . . . . . . . 89
6. Local Models in SEER data . . . . . . . . 91
6.1. Predictive Quality of the Hierarchical Models . . . . . . . . 91
6.2. Analysis of Local Feature Importance . . . . . . . . 95
6.3. Conclusions . . . . . . . . 100
7. Epilogue . . . . . . . . 101
7.1. Summary . . . . . . . . 101
7.2. Contributions . . . . . . . . 102
7.3. Directions for Future Work . . . . . . . . 104
7.3.1. Future Work on Local and Global Models . . . . . . . . 104
7.3.2. Future Work on Parameter Tuning . . . . . . . . 104
7.3.3. Future Work on Applications . . . . . . . . 105
Appendix A. K-Means . . . . . . . . 106
A.1. Data Clustering . . . . . . . . 106
A.2. K-Means Criterion . . . . . . . . 106
A.3. Batch, Iterative and Adaptive K-Means . . . . . . . . 107
A.4. Theoretical View of K-Means . . . . . . . . 110
A.4.1. Convergence of K-Means . . . . . . . . 110
A.4.2. Quality of Solution . . . . . . . . 110
A.4.3. Computational Complexity . . . . . . . . 112
A.5. Addressing Weaknesses of K-Means . . . . . . . . 113
A.5.1. Improving Efficiency of K-Means . . . . . . . . 114
A.5.2. Role of Initial Conditions . . . . . . . . 114
A.5.3. Other Algorithms for Optimizing K-Means criterion . . . . . . . . 117
A.5.4. Alternative Criteria . . . . . . . . 120
A.6. Conclusions . . . . . . . . 121
Appendix B. Logistic Regression . . . . . . . . 122
B.1. Linear Regression . . . . . . . . 122
B.2. Regularization . . . . . . . . 123
B.3. Logistic Regression . . . . . . . . 124
B.4. Multinomial Logistic Regression . . . . . . . . 124
B.5. Bayesian Logistic Regression . . . . . . . . 125
Appendix C. Support Vector Machines for Classification . . . . . . . . 127
C.1. Introduction . . . . . . . . 127
C.2. Linear Classifiers . . . . . . . . 128
C.2.1. Margin and VC dimension . . . . . . . . 129
C.2.2. The Maximal Margin in Separable Case . . . . . . . . 131
C.2.3. Extension for the non-separable case . . . . . . . . 132
C.2.4. Multi-class classification . . . . . . . . 133
C.2.5. Computational Issues . . . . . . . . 134
C.3. The "Kernel Trick" . . . . . . . . 134
C.4. Conclusion . . . . . . . . 136
Appendix D. Logistic Regression and SVM . . . . . . . . 137
Appendix E. Complete Results of Experiments on Benchmark Datasets . . . . . . . . 138
E.1. Complete Results of the CIC and EBCA Approaches . . . . . . . . 138
E.2. Complete Results of the HGC Method . . . . . . . . 143
E.3. Other Measures of Performance . . . . . . . . 145
Appendix F. SEER Features . . . . . . . . 148
Appendix G. SEER Data Clusters Descriptions . . . . . . . . 158
References . . . . . . . . 175
Vita . . . . . . . . 182
Chapter 1 Introduction
The idea of using unsupervised clustering as a supplement to supervised classification seems obvious and has been used informally by the pattern recognition community since the early 1960s [118]. Intuitively, clusters which consist of examples of only one class suggest a classification rule. On the other hand, clusters containing data points from several different classes might point to "difficult" regions in the classification space. However, a literature review has shown that systematic approaches to such combinations were largely ignored until the early 1990s. Only then did the machine learning community start actively investigating local learning, classifier ensembles, mixtures of experts and input partitioning [59, 124, 98, 110]. Related approaches such as clusterwise regression [108, 23] appeared in the statistics literature shortly before or at about the same time. Moreover, in almost all cases these methods propose grouping the training data into several clusters by fuzzy clustering (or soft partitioning) methods, and not by well-known deterministic clustering methods.1 In other words, the kinds of improvements or benefits that might be obtained by combining deterministic clustering and supervised learning have not been reported. Our goal in the present work is to fill in this omission by systematically examining ways of using deterministic (hard) clustering methods together with supervised classification methods. We experiment with clustering at four different levels: 1. examples belonging to one class; 2. examples belonging to different classes; 3. clusters belonging to different classes; and 4. classes as a whole.
1 In deterministic clustering each point is assigned to exactly one cluster. Thus the cluster membership function for a point takes values in {0,1}, with only one cluster having value 1. In fuzzy clustering, each point is associated with every cluster in the set with different weights. The weights are usually constrained to sum to 1 for every point. In other words, a point may "partially" belong to a cluster.
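To make the footnote's distinction concrete, here is a minimal sketch contrasting hard and soft cluster membership. It is an illustration only: the toy data, the choice of two clusters, and the use of scikit-learn's KMeans (hard assignments) and a Gaussian mixture (as a stand-in for fuzzy, weight-based membership) are assumptions, not methods or data from the dissertation.

```python
# Illustrative sketch (not from the dissertation): hard vs. fuzzy cluster membership.
# Assumes scikit-learn is available; the toy data and k = 2 are arbitrary choices.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [2.5, 2.5]])

# Deterministic (hard) clustering: each point gets exactly one cluster label,
# i.e. a membership vector with a single 1 and 0s elsewhere.
hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("hard labels:", hard.labels_)

# Fuzzy/soft view: every point carries a weight for every cluster, summing to 1.
soft = GaussianMixture(n_components=2, random_state=0).fit(X)
print("soft memberships:\n", soft.predict_proba(X).round(3))
```

The middle point illustrates the practical difference: the hard partition forces it into one cluster, while the soft memberships spread its weight across both.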
While many methods for clustering individual points are widely used [60], methods of clustering classes are unusual and involve rather different measures of similarity, such as the frequencies of errors made by a given classifier [44].2 In this work we will use such an approach for clustering classes.
The four types of object groups we will be dealing with are:
• classes - groups into which the training data is partitioned, and to which we want to assign new points using the classification methods;
• clusters - groups of similar objects that are obtained by applying a clustering method to the data;
• metaclusters - groups of clusters;
• metaclasses - groups of known classes.
The process of clustering the data is followed by the process of constructing and evaluating classifiers on different elements of the partition. We also need to construct a method for assigning a new point to an appropriate region of the space. Thus the approach proposed here is essentially hierarchical. We demonstrate that combinations based on clustering inside of classes and on unsupervised clustering result in better predictive performance than (a) those based on grouping classes or inside-class clusters via confusion matrix analysis or (b) the global classifier approach. These conclusions are supported by more than 3000 experiments on 4 benchmark datasets, using some of the most powerful current classification methods such as regularized logistic regression and Support Vector Machines (SVMs). Using deterministic clustering has a number of advantages over using fuzzy clustering methods. It can lead to classifiers that are easier to apply and to interpret. For many applications, particularly in health-related fields, such properties can be as important as, or more important than, the predictive power of the classifier alone. We suggest a particularly simple combination of unsupervised clustering and classification to search for local structure in the data and detect locally significant features. The approach is illustrated with an analysis of lung cancer survival data containing records for 200,000 patients. It is necessary to emphasize that this investigation is primarily empirical. We rely on experimental validation of the proposed procedures. In Section 1.1 we discuss two uses of classification methods. The potential for applying classification methods in data analysis is described in Section 1.2. In Section 1.3 we discuss the relation of our approach
2 Of course, similarity between classes can also be estimated by values of a distance function calculated between all possible pairs of samples from a pair of classes, or between class centroids, as in [116], but the measure that utilizes the numbers of mistakes is of particular interest because it is strongly dependent on the particular classification method and is meaningful in real applications.
to hierarchical and piece-wise classification. Section 1.5 describes the clustering and classification methods used in this work. Section 1.6 describes the structure of this dissertation. Finally, Section 1.7 introduces the notation used.
1.1 Two Uses of Classification and Prediction Methods
There are two distinct goals that users of classification methods typically aim to achieve: (i) obtaining an accurate predictor and (ii) obtaining a good descriptive model. These goals are different and their relative importance is largely application dependent. Consider the task of automated character recognition for sorting mail based on its destination. While humans can do this well when alert, possibly with better accuracy than most automated systems, they are slow and tire easily, becoming prone to errors. So an automated system is likely to be more accurate and less expensive in the long run. This is a particular example of a situation where the purpose of an automated system is to relieve humans of routine and tedious tasks. With such a goal in mind, it does not matter exactly how the system works, provided that it does so with sufficient speed and accuracy. The system may be a "black box", as are many automated character recognition approaches (see examples in [114]). On the other hand, consider an epidemiologist fitting a logistic regression model to data on disease occurrence. While logistic regression can be used as a tool for prediction, the goal is to quantitatively describe how characteristics of patients are related to their susceptibility to a disease. Finding and controlling important factors may prevent a future outbreak. An "oracle" with 100% accuracy that does not explain how its predictions are made would not be satisfactory in this case. Of course, a model is of interest only if it has high accuracy, but interpretability is the driving force in the choice of method or model type. The proposed combination of clustering with supervised classification methods aims to address both of these goals at the same time. The combination can improve on simple global classifiers in terms of accuracy while retaining, and even improving, the interpretability of such classifiers.
1.2 Classification Methods and Data Analysis
The amount of data becoming available in many scientific disciplines significantly outstrips our ability to “digest” it. There are not sufficiently many specialists to provide interpretations and find patterns in the data, or even to provide numbers of annotated examples that would allow development of validated
automated classification methods. Methods capable of extracting (or "learning") information from the data in an unsupervised way, in addition to the feedback provided by humans, would be of great value to the scientific community and for commercial applications. In the application part of this dissertation we describe an example of applying our methods to the analysis of epidemiological data. In epidemiological analysis a classifier such as logistic regression is constructed not with the aim of classifying new data, but as a model that describes, in numerical form, relations between the input features and the outcome. Such a model may have a causal interpretation. We use classifiers constructed by our methods in a similar fashion, with the added benefit of being able to observe and analyze clusters in the data. An alternative view of our approach is that the classifiers are used to validate the results of the cluster analysis. Interesting, or "high quality", clusters are those where a good local classification rule is different from, yet has no less predictive power than, the global model. Alternative models may suggest new hypotheses and offer novel insights into the data. Since the classifier is used for developing hypotheses or providing explanations of phenomena, there is a concern about being able to interpret the results and the classifier behavior. Interpretability is an important reason for focusing on simple classification methods.
1.3 Fuzzy and Deterministic Clustering in Classification
Classical methods of supervised classification view a classifier for two classes as given by a single separating surface of pre-selected parametric form, such as linear or quadratic. However, in the early 1990’s it became clear to the machine learning community that more complex architectures can perform better in practice. Classical decision tree methods such as [12, 95] partition the data at each node based on the value of one attribute, and at the leaves use majority rule (assign a new point to the majority class of training points at that leaf). Newer variants [13, 40] can use more complicated tests at the nodes or classifiers at the leaves. However, the ways in which the data is partitioned by such methods aim to decrease entropy or optimize some other measure with respect to the class labels and ignore the spatial structure of the data. As a result, the partition does not produce homogeneous groups of points. The “ensembles of classifiers” approaches explicitly [98] or implicitly [59, 124] use a soft membership function to describe similarity between points and regions. This naturally leads to weighted combination of classifiers, where predictions or scores for a point representing a new case are combined with weights
proportional to that point's proximity (given by the cluster membership function) to the classifiers' regions of expertise. Researchers have not used deterministic clustering methods for this purpose, even though deterministic methods are more developed than fuzzy ones. The primary reason for this is the main underlying assumption about the nature of deterministic clustering methods - that they will work only with data distributions that consist of several well-separated regions with large data density. This assumption is frequently violated by real data. Fuzzy clustering, on the other hand, assumes that the structure of the data arises from a mixture of distributions. Therefore the boundaries between clusters are relative, and can be set, when needed, by specifying a threshold on the constructed "cluster membership functions". The advantage of using fuzzy or overlapping clustering methods is that they provide an intuitive way of classifying points in the central areas of the clusters as well as in the cluster overlap areas. However, the fact that every point is associated with every cluster can be a disadvantage in practice, since interpretations of such structures require the introduction of thresholds on the membership function values, and criteria for estimating the quality of the resulting partitions often are not specified. At the same time, deterministic clustering methods can work with real data, with automatically constructed boundaries between the clusters removing the need for thresholds. We now argue that a deterministic clustering approach can lead to improved classification. Let us assume that the space chosen for classifier design is sufficiently informative to enable a standard supervised classification method to obtain reasonably good predictive accuracy. We will refer to the classifier constructed on all the available training data as "global". While such a classifier may not have the best achievable performance, it can capture the trends of class separation in the space as a whole. Such a classifier may be good enough "on average", but in some particular regions may work with low accuracy. If these regions are related to the regions with high density of observations (that is, modes of the distribution of observations), one could try to build special "local" classifiers to "zoom in" on these regions. We can ask whether a global classifier can be improved by replacing its predictions with the predictions of local classifiers in the high-density areas. This takes us back to the idea of using deterministic clustering methods for hierarchical classifier design. Indeed, deterministic methods, by optimizing their criteria, find separation boundaries around high-density regions. Unlike with fuzzy methods, we don't need to manually select or adjust thresholds. The properties of the clustering method guarantee that building an intermediate classifier to assign new points to appropriate clusters is an easy problem: such a classifier has to demonstrate high accuracy since the clusters are as far from each other as possible
according to the criterion optimized by the clustering method. Thus our proposal is an alternative to the "mixture of classifiers" approach. We will develop hierarchical classifiers using deterministic clustering methods.
1.4 Our Approaches
Our approach is closely related to various hierarchical and piece-wise classification approaches. The idea of piece-wise approaches is well-known in machine learning and statistics. It allows for a combination of simplicity and flexibility, expressed in methods such as splines [22, 54], decision trees [12, 95] and mixtures of local experts [59, 124]. Fitting a number of simple models, each in its own region, may be much simpler computationally, and more effective, than trying to fit a more sophisticated global model. This has to do both with the number of parameters to be estimated and with the mathematical optimization problem to be solved. We use deterministic clustering methods to partition the training data into homogeneous groups of closely located objects, and, indirectly, the space into regions. (This assumes that the clustering method is in principle capable of finding some local structure in the feature space. Otherwise the resulting clusters will not be meaningful.) After that, classification models are built separately in each region, using only the data points belonging to the region. We call such models "local". The final classifier therefore has a hierarchical structure: a new point has first to be assigned to an appropriate region, and then the local model assigns it a label. It is possible that for some reason (lack of data, too much noise) a local classifier cannot be trained or has poor performance. In such a case we discard the local classifier and use the global classifier for that cluster. We use a "first-level" classifier to assign new points to clusters. If a local classifier is defined on a cluster, we use it to obtain a label. Otherwise, the global classifier is used. We will call our approach "hierarchy with a global classifier" (HGC). An important problem for getting high predictive accuracy is to recognize when it is best to use the global classifier, and when to use a cluster-specific local classifier instead. We will describe several different approaches to realize this. The modularity of such a scheme can have additional benefits. For example, if additional training data for a region lacking a local classifier becomes available, it can be incorporated by introducing a new local classifier. This clearly will not affect other local classifiers, or the global classifier. Two of the kinds of hierarchies that are examined in this work are shown in Figure 1.1 and Figure 1.2.
Figure 1.1: Hierarchical Scheme with clusters: Training data is partitioned into clusters and classifiers are trained locally in each cluster. At classification time, a point is first assigned to a cluster, and then labeled using a classifier trained on the cluster.
Figure 1.2: Hierarchical Scheme with metaclasses: Classes as a whole are grouped together into metaclasses (based on the analysis of a confusion table for a global classifier). New classifiers are built inside each metaclass. At classification time, a point is first assigned to a metaclass, and then labeled using a classifier trained to distinguish classes in the metaclass.
In both schemes a new point is first assigned to some cluster i using a first-level classifier R1 and is then labeled with an appropriate local classifier R2i. The difference is in the partition. The scheme in Figure 1.1 uses clusters obtained by clustering objects, while the scheme in Figure 1.2 has metaclasses created on the basis of a confusion matrix by clustering classes. (We will also later demonstrate how the second scheme can be extended by clustering inside-class clusters instead of classes.)
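To make the hierarchy-with-a-global-classifier (HGC) idea concrete, here is a minimal prediction-time sketch. It is not the dissertation's implementation: the function name, the use of scikit-learn-style estimators, and the exact fallback rule for clusters without a local model are illustrative assumptions.

```python
# Minimal HGC prediction sketch (illustrative; not the dissertation's code).
# Assumes: `first_level` assigns points to clusters, `local_models` is a dict
# mapping a cluster id to a classifier trained only on that cluster's points
# (entries may be missing where no reliable local model could be built), and
# `global_model` is a classifier trained on all of the data.
import numpy as np

def hgc_predict(X, first_level, local_models, global_model):
    cluster_ids = first_level.predict(X)          # step 1: assign to a region
    y_pred = global_model.predict(X)              # default: global classifier
    for c in np.unique(cluster_ids):
        local = local_models.get(c)
        if local is not None:                     # step 2: local model, if any
            mask = cluster_ids == c
            y_pred[mask] = local.predict(X[mask])
    return y_pred
```

In use, `first_level` would be trained to predict the K-Means cluster labels of the training data, one local model would be fit per cluster, and `global_model` would serve as the fallback described in the text above.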
1.5 Methods
Finally, a few words need to be said about the specific methods used in our procedures. In order to keep the work manageable, we restricted ourselves to using only two types of classifiers, Support Vector Machines (SVM) and Bayesian Multinomial Regression (BMR), and one clustering method, K-Means. Since our work does not involve modifying these well-known approaches, the detailed description of their theoretical background and common use has been placed in the Appendices for convenience of the presentation. Here we briefly mention the main reasons for the selection of these methods. K-Means is a well-known and widely used method. It produces deterministic partitions, is computationally efficient and is straightforward to implement. SVMs have been used in many applications in recent years. They are powerful classifiers, giving state-of-the-art performance on many problems. They also have a strong theoretical foundation in the form of statistical learning theory [114]. Linear SVMs, the particular form of SVM that we will use, produce a decision rule that is linear in the input features. It is therefore interpretable in terms of the weights assigned to the individual features. On the other hand, the decision rule can be seen as a linear combination of support vectors - points of both classes from the training set that lie on or near the decision boundary. These points can be interpreted as "borderline" cases, while points that are not support vectors are more typical of their classes. Finally, reliable packages implementing SVM are publicly available. The BMR software3 (together with its two-class version, BBR) has recently been developed at DIMACS4. With different settings of its parameters it is equivalent to regularized logistic regression with an L1 or L2 penalty on the parameters. Thus, the choice of SVM, BMR and K-Means was based on:
• reliability and power of these methods,
• potential for providing interpretable models and results, and
• availability of software or ease of developing or modifying it.
These points are also important in light of the potential applications of our procedures. If they are to be used for analysis of epidemiological data, it is important for the individual components to be well-understood and accessible.
3 http://www.stat.rutgers.edu/~madigan/BMR/
4 http://dimacs.rutgers.edu
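As a rough illustration of how these building blocks might be instantiated today, the sketch below uses publicly available stand-ins; the choice of scikit-learn's KMeans, LinearSVC, and a regularized LogisticRegression (as a stand-in for BBR/BMR-style penalized logistic regression) is an assumption for illustration, not the software actually used in the dissertation.

```python
# Illustrative stand-ins for the three components used throughout this work
# (assumed scikit-learn equivalents; the dissertation used BBR/BMR and an SVM package).
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)       # deterministic partitions
svm = LinearSVC(C=1.0)                                          # linear SVM classifier
bmr_like = LogisticRegression(penalty="l2", C=1.0,
                              solver="lbfgs", max_iter=1000)    # L2-regularized logistic regression
```

An L1 penalty (for example `penalty="l1"` with the `liblinear` solver) corresponds to the sparsity-inducing option mentioned above.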
1.6 The Dissertation Structure
This dissertation is structured in the following way. Chapter 2 discusses the methods we propose together with the current state of the art as described in the literature. Chapter 3 describes and analyses the results of experimental evaluation of our methods on four benchmark datasets. It also contains a discussion of potential future work on these methods. Chapters 4-6 will discuss an epidemiological application of our methods, in an analysis of risk factors for survival of lung cancer patients based on the data provided by the Surveillance, Epidemiology and End Results (SEER) Program of the National Cancer Institute.5 This work was done in collaboration with Dr. Dona Schneider. The data analyzed includes information about 200,000 patients. Chapter 7 concludes the dissertation by providing a summary of significant results and future work. The dissertation is augmented with several appendices providing an overview of areas that are actively used but not directly investigated in our work. In particular, Appendices A, B and C discuss K-Means clustering, logistic regression and Support Vector Machines (SVM) respectively. Appendix D briefly mentions interesting connections between SVM and penalized logistic regression methods. Full experimental results and data preparation are similarly presented in the appendices in order to avoid cluttering the main text.
1.7 Notation
Before proceeding further we would like to specify our notation. We describe some of the more important symbols in more detail here, while a more complete listing is given in Table 1.1. The upper-case K is the number of classes (identified by labels) in a training set W, and is therefore a constant for a given set. The class labels are denoted by L1, . . . , LK. The lower-case k denotes the number of clusters and is usually an argument to the K-Means procedure. In a number of methods discussed here, the value of k is also important for constructing classifiers.
5 http://seer.cancer.gov/about/
Symbol            Meaning
W                 Dataset, i.e. the set of all points
N = |W|           Size of the dataset
K, k              Number of classes, clusters in W
L1, . . . , LK    Classes of W
S, Si             A cluster in W: a set of points of W
s, si             Center of cluster S, Si
d                 Dimensionality of the data
x, y ∈ X          Data points, in a space X
lj                Label of the jth point
lj(x)             An indicator function of class j
||x||             L2 norm of vector x
w, wh             A vector of parameters or feature weights (for class h)
Rij(x)            Classifier j at level i of the hierarchy
v(Rij)            Number of support vectors used by Rij
m                 Number of meta-classes or meta-clusters
At                The transpose of matrix A

Table 1.1: Notation
The letter R is used to denote classifiers. Essentially, a classifier is a function that, given a point x, returns a label. R0 will be used to denote a global classifier. It always returns a label belonging to the set of class labels of W: R0(x) ∈ {L1, . . . , LK}. The notation R1 is used to denote a first-level classifier. This classifier is used to assign points to the appropriate clusters (i.e. regions of the space or groups of classes). Therefore the labels returned by this classifier do not necessarily correspond to the class labels. Another notation is R2i - the ith classifier on the second level. The output of a second-level classifier is either a class label, or a label that can be uniquely mapped to some class label. We use the words "vector" and "point" interchangeably, depending on which is more suitable in the context of the discussion. Therefore, unless otherwise specified, the phrases "point x" and "vector x" refer to the same case x, in the same space.
Chapter 2 Clustering and Hierarchical Classification
In this chapter we describe a number of algorithms that use cluster analysis to build piece-wise classifiers. The resulting classifiers usually have a hierarchical (2-level) structure. Let us consider cluster analysis of a set of points belonging to several classes. The clustering algorithm can be applied at four levels: 1. examples belonging to one class, 2. examples belonging to different classes, 3. clusters belonging to one class, and 4. classes as a whole. Below we shall consider different schemes involving clustering at these different levels. While supervised learning methods involving cluster analysis at each of these levels have been previously proposed in multiple publications (see below), this is the first work to unify these approaches in a single conceptual framework.
2.1 Clustering Inside the Classes
2.1.1 The Main Idea
An intuitive motivation for the Clustering Inside Classes (CIC) approach can be seen in Figure 2.1. This is a well-known example of a dataset that is not linearly separable - we can't separate these two classes using a single linear classifier in the plane. Here clearly each class consists of two "subclasses." If we knew which training point belongs to which subclass, we would be able to construct accurate simple classifiers for each subclass (indicated by the lines in Figure 2.1), and therefore be able to classify new points with respect to the original two classes. The apparent problem is that the class labels do not contain information about the subclasses. We can attempt to extract such information by applying methods of cluster analysis to find groups of similar points in each class (clusters, or "subclasses"). Once such
Figure 2.1: The two classes, represented on a plane with red and white circles, each consist of two clusters/subclasses. The lines demonstrate that each cluster/subclass can be easily separated from the others, while the classes themselves are not linearly separable.
Figure 2.2: The two classes (white circle and red shape) are not linearly separable. If we cluster points of the red class, we’ll get two parts, each of which is easily separable from the white circle. Clustering points of both classes together will likely produce clusters with a mix of points from both classes.
clusters are found we can treat them as distinct classes for the purpose of constructing classifiers. Thus, surprisingly, by increasing the number of classes we can get more accurate classification performance. Figure 2.2 gives another reason to use cluster analysis inside the classes: here applying cluster analysis to the whole dataset is likely to produce clusters with a mix of points from different classes, while we want to have clusters belonging completely to one class. These examples suggest that two kinds of situations where CIC may be particularly useful are when (a) a class consists of disconnected components, or (b) a class has an odd shape. The experiments with synthetic data (Section 3.6) suggest that this is true. However, we will also see that even when we cannot be certain that these conditions hold (for example due to high dimensionality of the data), the CIC approach can improve performance.
2.1.2 Relevant Literature
CIC is sometimes used in the initialization of supervised learning algorithms, such as Generalized Learning Vector Quantization (GLVQ). GLVQ [105] finds a number of prototypes (representative vectors) for each class. Let the prototypes for a class Li be denoted by {s_j^(i)}, j = 1, . . . , ki. The number of prototypes for each class, ki, is a user-specified parameter. For a point x of class Li define the measure

µ(x) = (||x − w1|| − ||x − w2||) / (||x − w1|| + ||x − w2||),   (2.1)

where w1 = argmin_j ||x − s_j^(i)|| is the prototype of its own class nearest to x, and w2 = argmin_{m ≠ i, j = 1,...,km} ||x − s_j^(m)|| is the nearest prototype of a class other than its own. Intuitively, µ(x) has a small value when x is close to a prototype of its own class and far from the prototypes of other classes. To find good prototypes, the GLVQ method minimizes the following criterion:

Σ_{i=1}^{N} f(µ(x_i)),   (2.2)

where f(µ) = 1 / (1 + exp(−µ)). This is done by a gradient descent procedure:
1. First, a user-specified number of prototype vectors are initialized in each class. This can be done with K-Means clustering inside each class [29], using the cluster centers as initial values for the prototypes.
2. Then the procedure iterates over the points in the training set and adjusts the prototypes. Let x be the point considered at time t, and let w1^t and w2^t be the nearest prototype of its own class and the nearest prototype of a different class, respectively. These prototypes are updated as follows:

w1^(t+1) ← w1^t + αt (df/dµ) ||x − w2^t|| / (||x − w1^t|| + ||x − w2^t||)² (x − w1^t),   (2.3)
w2^(t+1) ← w2^t − αt (df/dµ) ||x − w1^t|| / (||x − w1^t|| + ||x − w2^t||)² (x − w2^t),   (2.4)

where αt is the learning rate at time t. All other prototypes are not adjusted at time t.

This procedure converges to a set of final prototypes [105]. These prototypes are representative points for the corresponding classes. A new point can be classified by finding the nearest prototype and assigning its class label to the point.
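For illustration, a minimal sketch of the quantities in (2.1)-(2.4) follows. It is not code from [105] or from this dissertation: the data layout (a dictionary mapping each class to an array of its prototype vectors), the fixed learning rate, and the explicit sigmoid derivative are simplifying assumptions.

```python
# Illustrative GLVQ step (assumed layout: prototypes[class] -> array of shape (k_i, d)).
import numpy as np

def glvq_step(x, y, prototypes, alpha=0.05):
    # Nearest prototype of x's own class (w1) and of any other class (w2).
    d_own = {j: np.linalg.norm(x - p) for j, p in enumerate(prototypes[y])}
    j1 = min(d_own, key=d_own.get)
    w1, d1 = prototypes[y][j1], d_own[j1]
    others = [(c, j, np.linalg.norm(x - p))
              for c, ps in prototypes.items() if c != y
              for j, p in enumerate(ps)]
    c2, j2, d2 = min(others, key=lambda t: t[2])
    w2 = prototypes[c2][j2]

    mu = (d1 - d2) / (d1 + d2)                       # relative difference, eq. (2.1)
    f = 1.0 / (1.0 + np.exp(-mu))                    # sigmoid used in criterion (2.2)
    dfdmu = f * (1.0 - f)                            # its derivative df/dmu
    denom = (d1 + d2) ** 2
    prototypes[y][j1] = w1 + alpha * dfdmu * d2 / denom * (x - w1)    # eq. (2.3)
    prototypes[c2][j2] = w2 - alpha * dfdmu * d1 / denom * (x - w2)   # eq. (2.4)
    return mu
```

Classification after training simply assigns a new point the class of its nearest prototype, as described above.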
The purpose of using CIC to initialize the prototypes is to obtain good starting positions for the prototypes that will lead to faster convergence and a better final classifier. Another CIC method is discussed in [61] under the name of "unsupervised output separation" for binary classification:
• During the training stage these steps are repeated 5 times:
1. Each class is independently partitioned into a specified number of clusters k (k = 3, 5 are considered).
2. A multi-class classifier (decision tree C5.0) is trained with the cluster labels (rather than with class labels) to distinguish between the clusters.
• In the classification stage, the predictions from the classifiers built at each repetition of the training stage are combined. Note that while the classifiers predict one of the cluster labels, these are converted into class labels. In other words, if class 1 has clusters 1, 2 and 3 and class -1 has clusters 4, 5 and 6, then regardless of which of 1, 2 or 3 is predicted by C5.0, the point is assigned to class 1.
The two combination methods considered are a non-weighted combination, where the number of times a particular class was predicted is counted, and a weighted combination, where the probability estimates produced by C5.0 trained on different partitions are added together for each class. Experiments on 5 datasets from the UCI Repository [9] showed improved classification accuracy on 4 of them. According to [61], the motivation for having multiple training stages and combining them is to compensate for possibly poor clustering. Our method is related to the approach of [61], as will be seen from the discussion below. However, in our work we aim to obtain a good clustering by selecting the best clustering result out of multiple runs, and then building a single multiclass classifier. Our results on multiclass problems, described in Chapter 3, show that this approach, which is simpler than that of [61], leads to improved performance.
2.1.3 Our Approach
Figures 2.1 and 2.2 show simple cases where partitioning the classes would lead to better classification. While it does not follow that such an approach is always beneficial, it deserves consideration. Our CIC classification scheme, inspired by the above discussion, is described in Algorithm 1.

Algorithm 1 Training and Classification with CIC
Require: A set W, an integer k ≥ 2. {Training with CIC}
1: for j = 1, . . . , K do
2:   Partition class Lj into k clusters.
3: end for
4: Train classifier R1 using all training data to recognize all clusters.
Require: A point x. {Classification with CIC}
1: Let i = R1(x), i = 1, . . . , k · K.
2: Return the class of cluster i.

During the training stage each class is partitioned into k clusters (lines 1-3), and then classifier R1 is trained to classify a new point into one of the resulting clusters. At classification time, a new point is assigned to some cluster i (line 1 of the classification stage), which corresponds to a single class label that is returned as the prediction (line 2 of the classification stage). Note that we partition each class into the same number of clusters. This is clearly not necessary. Using too many clusters for a class may lead to artificial clusters that have few points and for which a good classifier cannot be built (due to lack of training examples and proximity to other clusters). Using too few clusters may not adequately partition the class. In fact, it may be better to try to automatically determine an appropriate number of clusters in each class. However, this is a well-known and difficult problem for which a solution is yet to be found.1 Therefore in the present work we restrict ourselves to the simpler scheme where the number of clusters to be used in every class is a free parameter, to be specified by the user. As our experimental results show (Section 3.6), using a slightly greater value for this parameter does not hurt performance.
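For concreteness, a minimal sketch of Algorithm 1 follows. It is an illustration, not the code used in the dissertation: scikit-learn's KMeans and LinearSVC stand in for the clustering method and the classifier R1, and a single clustering run replaces the best-of-multiple-runs selection described above.

```python
# Illustrative CIC sketch (Algorithm 1): cluster inside each class, train one
# multiclass classifier on the cluster labels, then map predicted clusters back
# to class labels. Assumes scikit-learn; k is the per-class cluster count.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def cic_train(X, y, k=2, random_state=0):
    cluster_ids = np.empty(len(y), dtype=int)
    cluster_to_class = {}
    next_id = 0
    for label in np.unique(y):                        # lines 1-3: cluster each class
        mask = y == label
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        local = km.fit_predict(X[mask])
        cluster_ids[mask] = local + next_id
        for c in range(k):
            cluster_to_class[next_id + c] = label
        next_id += k
    r1 = LinearSVC().fit(X, cluster_ids)              # line 4: recognize clusters
    return r1, cluster_to_class

def cic_predict(X, r1, cluster_to_class):
    return np.array([cluster_to_class[c] for c in r1.predict(X)])
```

On data like Figure 2.1 (each class made of two well-separated blobs), this construction lets a linear classifier solve a problem the original two-class formulation cannot.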
2.2 Error-Based Class Aggregation
2.2.1 The Main Idea
Let us assume that a multi-class classifier R0 for a K-class problem is built and then tested. The result of testing can be represented as a K × K matrix A (a confusion table), where the entry aij contains the fraction (or the exact number) of points of class i in the validation set that were classified as belonging to class j. Assuming that the classifier was tested on a representative sample of the data space, we know which classes are confused by the classifier we built, and to what degree. In the event that all classes are confused with all other classes with approximately the same error rate, this information is not of much use. However, it is reasonable to consider the situation (which often
1 A great many heuristics or methods developed within specific frameworks are described in the literature, for example [80, 123, 97, 35, 49, 104].
Figure 2.3: There are 5 classes. They can be partitioned into a group of 3 and a group of 2 without any mistakes between the groups. The two resulting problems may become easier - for example the points marked with X next to them are no longer in danger of being misclassified.
occurs in practice [45]), where each class is confused only with a few other classes, or where all classes are recognized well except for a few that are confused with all others. In other words, it is reasonable to expect a confusion matrix to have some structure that reflects relations between the classes. If we could combine classes that are confused with each other into a small number of groups ("metaclasses"), it is possible that our classifier R0 would make no or few mistakes distinguishing between these metaclasses, and almost all mistakes would be inside them. Figure 2.3 illustrates such a situation. Here there are potential mistakes between different pairs of classes. However, using the classifier indicated with a bold line we can partition the data into two groups of classes (metaclasses) without any mistakes. Furthermore, the resulting two subproblems may become easier as a result. Thus, while our classifier possibly makes a lot of mistakes distinguishing between all the classes, it can be very good at distinguishing between metaclasses. If that is the case, it makes sense to try to build classifiers that would distinguish only between the classes belonging to the same metaclass. Each new classifier is likely to be better at this task than the initial classifier because it is solving a simpler problem, with fewer classes and points. We thus arrive at a two-level architecture for classification, which is graphically represented in Figure 2.4. It remains to specify a method for grouping classes based on the confusion matrix. We use the name Error-Based Class Aggregation (EBCA) to refer to the approaches following this idea. It is worth noting that the success of this approach depends on the extent to which the structure of the confusion matrix accurately reflects the difficulty of separating different classes. If the confusion matrix is not indicative of the distribution of errors on unseen data, new points would frequently be assigned to the wrong metaclasses, decreasing the quality of the classifier. Another difficulty is that separating metaclasses may be a difficult problem in itself. We will propose
Figure 2.4: Error-Based Class Aggregation (EBCA) Hierarchical Scheme: analysis of confusion matrix leads to partition of classes into metaclasses. A new point is first assigned to a metaclass, and then labeled using a classifier trained to distinguish classes in the metaclass. (The number of metaclasses is of course not limited to just two.) a way of avoiding this issue by utilizing the global classifier.
2.2.2 Existing Work on Error-Based Class Aggregation
The EBCA approach has been suggested in [45, 44]. The authors work with text classification problems. They use confusion matrix analysis to create a two-level hierarchical classifier. First they train a Naive Bayes (NB) classifier on all classes and obtain a confusion matrix of its performance on a validation set. Each class is represented by a normalized version of the corresponding row in the confusion matrix A. Then Ward's method [33] is used to obtain a hierarchical clustering (a tree) of the classes. Ward's method is an agglomerative clustering method, where at each step the two closest clusters are merged, with the distance function between clusters Si and Sj defined as:

d(Si, Sj) = (|Si| |Sj|) / (|Si| + |Sj|) · ||si − sj||²,   (2.5)

where |S| is the number of elements in the cluster S, and si is the average vector of Si. It can be shown, directly from (2.5), that at each step the clusters to be merged are chosen so that the increase in the inside-cluster dispersion is as small as possible [83]. Once the tree is constructed, it is clipped at a point where distances between clusters begin to increase sharply [44]. Thus the number of clusters is determined automatically. The resulting partition gives the desired metaclasses. Then an SVM classifier is trained to distinguish between metaclasses.
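A rough sketch of this confusion-matrix-driven grouping is given below. It is an illustration rather than the exact procedure of [45, 44] or of this dissertation: scikit-learn and SciPy's Ward linkage stand in for the original implementation, and cutting the tree at a fixed number of metaclasses replaces the automatic "clipping" described above.

```python
# Illustrative sketch: each class is represented by its row-normalized confusion-
# matrix row, and Ward's method clusters the classes into metaclasses.
import numpy as np
from sklearn.metrics import confusion_matrix
from scipy.cluster.hierarchy import ward, fcluster

def build_metaclasses(y_true, y_pred, n_metaclasses=2):
    A = confusion_matrix(y_true, y_pred).astype(float)
    A /= A.sum(axis=1, keepdims=True)                  # normalize each row
    Z = ward(A)                                        # agglomerative (Ward) tree of the classes
    groups = fcluster(Z, t=n_metaclasses, criterion="maxclust")
    return {cls: g for cls, g in enumerate(groups)}    # class index -> metaclass id
```

Here `y_true` and `y_pred` would come from evaluating the global classifier R0 on a validation set, as in the description above.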
Figure 2.5: There are 4 classes. Analysis of the confusion matrix of a multiclass classifier will construct one metaclass by combining A with B, and another by combining C and D. However, the two metaclasses are not linearly separable, and thus a new linear classifier trained to distinguish between them would have poor performance.
Also, lower-level SVM classifiers are trained to separate the classes inside each metaclass. The authors also explore several other two-level combinations (NB on both levels, NB on the first level and SVM on the second one) and compare them with global NB and SVM classifiers. Experiments are also done to determine the accuracy on each level separately (for the second level this refers to accuracy on the data classified correctly by the first level). While the accuracy at each level of the two-level architecture (with SVM on the second level) was higher than for any of the global classifiers, the compounding of errors led to worse results than for the global "one-vs-all"2 SVM approach (which showed the best accuracy among all examined methods). We would like to make note of several factors. The goal of constructing metaclasses is to find groups with fewer errors between them. Since the metaclasses are built using a confusion table generated by a global classifier R0, it is important for the first-level classifier in the final construction to make mistakes in the same way as R0 would (i.e. the distribution of mistakes should be approximately the same as in the confusion matrix), so that most of the errors still fall into the same metaclass and can be corrected on the second level. But this can be achieved simply by using R0 as the first-level classifier - there is no need to train a new classifier to distinguish between metaclasses. In fact, the resulting metaclasses may be difficult to separate, as Figure 2.5 demonstrates. The fact that this reasoning was not followed in
2 In one-vs-all classification with K classes, K binary classifiers are constructed, each treating one class as positive and the rest as negative. At classification time, the new point is assigned to the class corresponding to the classifier with the highest score.
19 the approach of [45] may be responsible for its unsatisfactory performance. In Section 2.2.4, we suggest a ReUsing First Level (RUFL) approach that addresses this concern. Another comment concerns the method used to derive meta-classes. Treating the rows of A as vectors representing classes takes into consideration only one-way errors, i.e. the row representing the class Li contains the points misclassified as Lj (in cell aij ), but no information on how many points of Lj were misclassified as Li , since this information is in another row (in cell aji ). Values aij and aji need to be combined to obtain similarity between classes Li and Lj . Also, minimizing distances between rows in a cluster does not directly minimize the number of errors between metaclasses. It could group together classes that are both mistaken for another class, but never with each other. A method for computing metaclasses that takes these factors into account is desirable, and suggested further below.
2.2.3
Other Relevant Literature
Many methods for constructing hierarchical structures from unlabeled data are described in the clustering literature [126]. A number of papers suggest taking advantage of a known hierarchical class structure by partitioning a single multi-class classification problem into several problems with fewer classes [71, 32]. For example, in [32] a classifier is constructed to distinguish the first-level categories, and then a second-level classifier is trained to distinguish the second-level categories inside each first-level category. This approach is shown to be more effective than building a classifier to distinguish between the second-level categories directly. For applications such as text classification it also allows the classifiers to use far fewer features [32]. There are fewer papers discussing the automatic generation of class hierarchies for classification. Clearly, analysis of the confusion matrix [45] is one possible approach. A recent work [116] uses cluster analysis to reformulate a multi-class classification problem with K classes as a binary decision tree with K − 1 nodes, each corresponding to a binary classification problem. The authors refer to this approach as Divide-by-2 (DB2). One method (referred to as Method 1 in [116]) is to represent each class by the centroid of its points and partition the centroids into two clusters using K-Means. An SVM classifier is then constructed to distinguish between these two groups of classes. The method is applied recursively within each of the clusters until all leaves contain only one centroid. [116] also discusses other variants of this approach.
2.2.4
Our Approach
Above we noted some problems in the analysis of the confusion matrix and in the construction of the classifiers described in [45]. Here we suggest two methods for analyzing the confusion matrix and a different way of obtaining the first-level classifier in the EBCA scheme.

Grouping classes based on the distances between the rows of the confusion matrix does not directly minimize the number of errors between the metaclasses. Additionally, since the confusion matrix A is not symmetric, the information on the similarity between classes i and j is spread between different rows (entries aij and aji). The latter problem can easily be amended by considering the symmetric matrix

B = \|b_{ij}\| = \frac{1}{2}\left(A + A^T\right) = \left\|\tfrac{1}{2}(a_{ij} + a_{ji})\right\|.

Here the entry bij describes the similarity between classes i and j and is proportional to the number of mistakes between the two classes.
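As a concrete illustration, the symmetrization step is a one-line matrix operation. The following minimal Python sketch (not part of the original experiments; the toy confusion matrix is invented for illustration) shows how B is obtained from a confusion matrix A.

```python
import numpy as np

# Hypothetical 4-class confusion matrix: A[i, j] = number of points of
# class i that were classified as class j (off-diagonal entries are errors).
A = np.array([[50,  3,  0,  1],
              [ 4, 47,  2,  0],
              [ 0,  1, 55,  6],
              [ 1,  0,  8, 44]], dtype=float)

# Symmetric similarity between classes: b_ij = (a_ij + a_ji) / 2.
B = (A + A.T) / 2.0
print(B)
```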
We would like to group the classes into metaclasses so as to minimize the number of mistakes outside of the metaclasses. There exist many methods that can be applied to the analysis of the symmetric matrix B (or even the original matrix A), such as spectral clustering [66, 88, 127, 24] and combinatorial optimization [85, 84]. It would be impossible to consider all of them here; therefore, as in the rest of this work, we focus on K-Means-like methods. (It is interesting to note that connections have recently been discovered between K-Means and spectral clustering methods [127, 24]; some details are provided in Appendix A.) We note that the standard K-Means algorithm cannot be applied directly in this case because the notion of a centroid is not defined, and thus the two-stage process of the K-Means algorithm cannot be executed.

One of the proposed methods maximizes a particular criterion E1, the total mean within-cluster similarity (i.e. the average number of mistakes per class). As we will show, for a positive-definite B, E1 is exactly the kernel K-Means criterion [42] for some implicit vector representation of the classes. We will therefore refer to this method as Kernel K-Means (KKM). However, E1 is meaningful even if B is not positive definite. The other method attempts to directly maximize the similarity of the classes in a metaclass to a prototype class (and thus to reduce the number of mistakes outside it). We will refer to it as Representative-Based Clustering (RBC).

Generalized K-Means Criterion

Let B be a K × K symmetric matrix with entries bij = bji. Consider the following criterion function:

E_1(S_1, \ldots, S_m) = \sum_{u=1}^{m} \frac{1}{|S_u|} \sum_{i,j \in S_u} b_{ij}.    (2.6)
This function can be described as the total mean within-cluster similarity. If B is positive-definite, E1 is equivalent to the criterion for Kernel K-Means described in [42]. A more general form of E1 (not limited to a symmetric positive-definite matrix) was also mentioned in [82] in the context of single-cluster partitional algorithms (where one cluster at a time is identified and removed from further consideration).

It is well known that the entries of a K × K positive-definite matrix can be interpreted as inner products of vectors φi, i = 1, . . . , K, in some Euclidean space H:

b_{ij} = \langle \phi_i, \phi_j \rangle.    (2.7)

We now show that maximizing E1(S1, . . . , Sm) in this case is equivalent to minimizing the K-Means criterion in the space H, given by:

E(S_1, \ldots, S_m) = \sum_{u=1}^{m} \sum_{i \in S_u} \Big\| \phi_i - \frac{1}{|S_u|} \sum_{j \in S_u} \phi_j \Big\|^2.

This is a direct consequence of the following theorem.

Theorem. Given any two partitions of the indices i = 1, . . . , K into m clusters, {S1, . . . , Sm} and {S'1, . . . , S'm}:

E(S_1, \ldots, S_m) > E(S'_1, \ldots, S'_m) \iff E_1(S_1, \ldots, S_m) < E_1(S'_1, \ldots, S'_m).    (2.8)

The proof consists of demonstrating that

E(S_1, \ldots, S_m) = \sum_{i=1}^{K} b_{ii} - E_1(S_1, \ldots, S_m)    (2.9)

for any partition S1, . . . , Sm. Since the quantity \sum_{i=1}^{K} b_{ii} does not depend on the partition, the conclusion follows.

The correctness of (2.9) follows from this sequence of transformations:

\begin{aligned}
E(S_1, \ldots, S_m) &= \sum_{u=1}^{m} \sum_{i \in S_u} \Big\| \phi_i - \frac{1}{|S_u|} \sum_{j \in S_u} \phi_j \Big\|^2 \\
&= \sum_{u=1}^{m} \sum_{i \in S_u} \Big( \langle \phi_i, \phi_i \rangle - \frac{2}{|S_u|} \sum_{j \in S_u} \langle \phi_i, \phi_j \rangle + \frac{1}{|S_u|^2} \sum_{j,h \in S_u} \langle \phi_h, \phi_j \rangle \Big) \\
&= \sum_{i=1}^{K} \langle \phi_i, \phi_i \rangle - \sum_{u=1}^{m} \Big( \frac{2}{|S_u|} \sum_{i,j \in S_u} \langle \phi_i, \phi_j \rangle - \frac{1}{|S_u|^2} \sum_{i \in S_u} \sum_{j,h \in S_u} \langle \phi_h, \phi_j \rangle \Big) \\
&= \sum_{i=1}^{K} \langle \phi_i, \phi_i \rangle - \sum_{u=1}^{m} \Big( \frac{2}{|S_u|} \sum_{i,j \in S_u} \langle \phi_i, \phi_j \rangle - \frac{1}{|S_u|} \sum_{j,h \in S_u} \langle \phi_h, \phi_j \rangle \Big) \\
&= \sum_{i=1}^{K} \langle \phi_i, \phi_i \rangle - \sum_{u=1}^{m} \frac{1}{|S_u|} \sum_{i,j \in S_u} \langle \phi_i, \phi_j \rangle \\
&= \sum_{i=1}^{K} b_{ii} - E_1(S_1, \ldots, S_m).
\end{aligned}
This completes the proof. A similar argument was used in [42].

Cluster Analysis of the Confusion Matrix

Above we introduced the criterion E1 and showed that maximizing E1 on a symmetric positive-definite matrix B is equivalent to performing K-Means clustering on vectors in some feature space. However, the heuristic algorithm for maximizing E1 that we propose below does not require that B be positive-definite, and the criterion E1 itself is meaningful for any symmetric square matrix.

For matrices with a small number of rows and a small desired number of metaclasses, the optimal partition may be computed by a brute-force algorithm that examines all possible partitions. However, this rapidly becomes infeasible as the number of rows and the number of metaclasses increase. Below we describe an efficient heuristic approach, KKM. Surprisingly, we did not find a description of this method in the literature, despite its simplicity and the fact that the criterion E1 is well known [82].

The method starts with a random initial partition of the points (rows of the matrix B) into m clusters and cycles through all points i = 1, . . . , K. In each cycle, for each point i ∈ Su, it virtually moves i to each Sv, v = 1, . . . , m, v ≠ u, and finds the Sv for which moving i would lead to the greatest increase in the value of E1. If no move that increases E1 can be performed, i remains in Su. If no point is moved in a cycle, the algorithm terminates. Multiple random initial partitions can be used to find a better final partition.

Description of KKM: Some care is needed in the implementation of KKM for it to be computationally efficient. Let us make the following definitions:

\psi(i, S_u) = b_{ii} + \sum_{j \in S_u, j \neq i} b_{ij}, \quad u = 1, \ldots, m, \quad i = 1, \ldots, K,    (2.10)

\delta(S_u) = \frac{1}{|S_u|} \sum_{i \in S_u} \psi(i, S_u), \quad u = 1, \ldots, m.    (2.11)

Using these definitions we can re-express E1 as:

E_1 = \sum_{u=1}^{m} \delta(S_u).    (2.12)

We note the following properties of ψ(i, Su), where Su − j and Su + j denote the cluster Su with the point j removed or added, respectively:

\psi(i, S_u - j) = \psi(i, S_u) - b_{ij}, \quad \forall i \neq j,    (2.13)

\psi(i, S_u + j) = \psi(i, S_u) + b_{ij}, \quad \forall i \neq j.    (2.14)

Our algorithm works by pre-computing a K × m matrix of partial sums for the initial partition:

C = \|c_{ij}\| = \|\psi(i, S_j)\|.    (2.15)

The columns u and v of C are easily updated every time a point is moved, via equations (2.13) and (2.14). This matrix C, together with B, is used to evaluate the utility of moving points without recomputing E1 for each candidate move. Let Δ(j, Su, Sv) be the change in the value of E1 resulting from moving j from Su to Sv. It is easy to see from (2.12) that:

\Delta(j, S_u, S_v) = \delta(S_u - j) + \delta(S_v + j) - \delta(S_u) - \delta(S_v).    (2.16)

Algorithm 2 Partitioning the Confusion Matrix: Method KKM
Require: A symmetric K × K matrix B with non-negative values; m ≥ 2, the desired number of metaclasses.
1: Obtain an initial partition of B into m clusters S1, . . . , Sm.
2: Compute the K × m matrix C = ||cij|| = ||ψ(i, Sj)||.
3: while at least one object changed clusters do
4:   for i = 1, . . . , K do
5:     if |Su| > 1, where i ∈ Su, then
6:       Find the cluster j = argmax_{v=1,...,m} Δ(i, Su, Sv)
7:       if j ≠ u then
8:         Move i to Sj
9:         Update cxu and cxj, ∀x = 1, . . . , K, x ≠ i.
10:      end if
11:    end if
12:  end for
13: end while

We can now describe our implementation of KKM (the pseudocode is given in Algorithm 2). First we obtain an initial partition (line 1) and precompute the matrix C (line 2). Then we cycle through the rows of B (lines 3-12). For every point (row of B), unless it is the only point in its cluster, we consider the effect of moving it to every cluster (line 6). The move that leads to the greatest increase in the value of E1, if one exists, is made (lines 7-10), updating the matrix C in the process (line 9) using (2.13) and (2.14). Note that the effect of a move can be computed efficiently, as described above. If no point can be moved, the algorithm terminates.
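A minimal Python sketch of the KKM heuristic is given below. It follows Algorithm 2 but, for clarity, recomputes E1 from scratch when evaluating a candidate move rather than maintaining the matrix C of partial sums; an efficient implementation would use the incremental updates (2.13)-(2.16). The function names and the random test matrix are illustrative assumptions, not the original experimental code.

```python
import numpy as np

def e1(B, labels, m):
    """Total mean within-cluster similarity (criterion E1)."""
    total = 0.0
    for u in range(m):
        idx = np.where(labels == u)[0]
        if len(idx) > 0:
            total += B[np.ix_(idx, idx)].sum() / len(idx)
    return total

def kkm(B, m, n_restarts=10, seed=0):
    """Greedy maximization of E1 by single-point moves (KKM heuristic)."""
    rng = np.random.default_rng(seed)
    K = B.shape[0]
    best_labels, best_val = None, -np.inf
    for _ in range(n_restarts):
        labels = rng.permutation(np.arange(K) % m)   # every cluster starts non-empty
        improved = True
        while improved:
            improved = False
            for i in range(K):
                u = labels[i]
                if np.sum(labels == u) == 1:
                    continue                          # do not empty a cluster
                current = e1(B, labels, m)
                best_move, best_gain = u, 0.0
                for v in range(m):
                    if v == u:
                        continue
                    labels[i] = v                     # virtually move i to cluster v
                    gain = e1(B, labels, m) - current
                    if gain > best_gain:
                        best_move, best_gain = v, gain
                    labels[i] = u                     # undo the virtual move
                if best_move != u:
                    labels[i] = best_move
                    improved = True
        val = e1(B, labels, m)
        if val > best_val:
            best_val, best_labels = val, labels.copy()
    return best_labels, best_val

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.integers(0, 10, size=(8, 8)).astype(float)   # toy confusion counts
    B = (A + A.T) / 2.0
    print(kkm(B, m=3))
```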
We now show that the proposed algorithm converges in a finite number of steps.

Theorem. The computational complexity of KKM is O(cmK²), where c is a finite positive integer.

In order to prove this theorem we will need the following lemma.

Lemma 1. Δ(j, Su, Sv) can be computed from C and B in O(K) time.

Proof: The four terms on the right-hand side of (2.16) can be computed from the matrices C and B. The negative terms are computed directly from equation (2.10) in O(K). As for the positive terms:

\begin{aligned}
\delta(S_u - j) &= \frac{1}{|S_u|-1} \sum_{i \in S_u, i \neq j} \psi(i, S_u - j)
 = \frac{1}{|S_u|-1} \Big( \sum_{i \in S_u, i \neq j} \psi(i, S_u) - \sum_{i \in S_u, i \neq j} b_{ij} \Big) \\
&= \frac{1}{|S_u|-1} \Big( \sum_{i \in S_u} \psi(i, S_u) - \psi(j, S_u) - (\psi(j, S_u) - b_{jj}) \Big)
 = \frac{1}{|S_u|-1} \Big( \sum_{i \in S_u} \psi(i, S_u) - 2\psi(j, S_u) + b_{jj} \Big).
\end{aligned}    (2.17)

Similarly:

\delta(S_v + j) = \frac{1}{|S_v|+1} \sum_{i \in S_v + j} \psi(i, S_v + j)
 = \frac{1}{|S_v|+1} \Big( \sum_{i \in S_v} \psi(i, S_v) + 2\psi(j, S_v) - b_{jj} \Big).    (2.18)

Thus both δ(Su − j) and δ(Sv + j) can be computed from the entries of the matrices C and B in O(K). QED.

We can now prove the Theorem.

Proof: The matrix C can be precomputed in O(K²), since all the partial sums for a single point together require O(K). Each of the four terms in (2.16) can be computed from C in O(K) summations (Lemma 1). In order to find the best move for a point i we examine m clusters. Updating the matrix C after a move requires updating the entries of two columns: O(K). Therefore, the complexity of a single cycle is K(O(mK) + O(K)) = O(mK²).

During each cycle in which a move is made, the value of E1 increases, i.e. Δ(j, Su, Sv) > 0. Since the matrix B is finite, there can only be finitely many moves resulting in a positive Δ(j, Su, Sv). Therefore, if ε is the infimum of Δ(j, Su, Sv) over all possible Su, Sv and j with Δ(j, Su, Sv) > 0, it follows that ε > 0. Since

0 < E_1 \leq \sum_{i,j=1}^{K} b_{ij},    (2.19)

and each cycle of KKM increases E1 by at least ε, there can be no more than c = \frac{1}{\varepsilon} \sum_{i,j=1}^{K} b_{ij} cycles. So KKM is guaranteed to terminate in a finite number of steps, with total complexity O(cmK²). QED.
Therefore we have an efficient method for updating the partial sums ψ, and, consequently, for estimating the effect of moving a point ∆(j, Su , Sv ). These lead to an efficient method, KKM, described in Algorithm 2, for maximizing E1 .
Since the final partition depends on the initial conditions, the KKM algorithm converges to a local optimum, just as standard K-Means does.

RBC Method: We now mention a different approach to the analysis of the matrix B, inspired by the optimal matching literature. If the entries bij are pay-offs, or rewards, we would like to group rows and columns so as to maximize the sum of the rewards received. The optimization criterion to be maximized can be described as:

E_2 = \sum_{u=1}^{m} \max_{i \in S_u} \sum_{j \in S_u} b_{ij}.    (2.20)

Let us call su a center (or representative class) of the cluster Su if it satisfies:

s_u = \arg\max_{i \in S_u} \sum_{j \in S_u} b_{ij}.    (2.21)

Algorithm 3 Partitioning the Confusion Matrix: Method RBC
Require: A symmetric K × K matrix B with non-negative values; m ≥ 2, the desired number of metaclasses.
1: Obtain an initial partition of B into m clusters
2: for u = 1, . . . , m do
3:   Compute su, the center of Su
4: end for
5: while at least one object changed clusters do
6:   for i = 1, . . . , K do
7:     if |Su| > 1, where i ∈ Su, then
8:       Find the cluster j = argmax_{v=1,...,m} b_{i,sv}
9:       if j ≠ u then
10:        Move i to Sj
11:        Recompute the centers of Su and Sj
12:      end if
13:    end if
14:  end for
15: end while

The algorithm RBC for optimizing E2 works like an iterative K-Means algorithm (see Appendix A). The pseudocode for RBC is given in Algorithm 3. We start with a random initial partition (line 1) and find the center of each cluster (lines 2-4). Then, for each point that is not the only point in its cluster, we find the cluster whose center is most similar to it (line 8). If the point does not belong to that cluster, we move it and update the affected centers (lines 9-12). The algorithm terminates when no point can be moved. It can be run multiple times with different initial conditions to find a better solution.

This section described methods of analyzing a confusion matrix in order to obtain metaclasses. Once the partition into metaclasses is available, we can construct a classifier as described earlier in Section 2.2.2.
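A corresponding Python sketch of RBC is shown below. As with the KKM sketch, the function names and the toy similarity matrix are illustrative assumptions, and the loop follows Algorithm 3 directly rather than being optimized.

```python
import numpy as np

def cluster_center(B, idx):
    """Index of the class in idx that is most similar to the rest of its cluster."""
    sums = B[np.ix_(idx, idx)].sum(axis=1)
    return idx[int(np.argmax(sums))]

def rbc(B, m, seed=0):
    """Representative-Based Clustering of a symmetric similarity matrix B."""
    rng = np.random.default_rng(seed)
    K = B.shape[0]
    labels = rng.permutation(np.arange(K) % m)      # every cluster starts non-empty
    centers = [cluster_center(B, np.where(labels == u)[0]) for u in range(m)]
    moved = True
    while moved:
        moved = False
        for i in range(K):
            u = labels[i]
            if np.sum(labels == u) == 1:
                continue                             # do not empty a cluster
            # assign i to the cluster whose representative is most similar to it
            j = int(np.argmax([B[i, centers[v]] for v in range(m)]))
            if j != u:
                labels[i] = j
                for v in (u, j):                     # recompute the two affected centers
                    centers[v] = cluster_center(B, np.where(labels == v)[0])
                moved = True
    return labels, centers

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    A = rng.integers(0, 10, size=(8, 8)).astype(float)   # toy confusion counts
    B = (A + A.T) / 2.0
    print(rbc(B, m=3))
```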
Constructing a Hierarchical Classifier Based on Meta-Classes

As noted previously in Section 2.2.2, the approach of [45] involves training a new classifier to distinguish between the metaclasses. We call this the ReTraining First Level (RTFL) approach. Consider instead a ReUsing First Level (RUFL) approach, in which the classifier used to obtain the confusion matrix is reused at classification time as the first-level classifier. The errors that this classifier makes are more likely to correspond to the confusion matrix from which the metaclasses were constructed than the errors of a newly trained classifier. The metaclass is uniquely determined by the class prediction of this classifier. Thus we expect this approach to give better accuracy than the RTFL approach, and our experiments confirm this (Chapter 3). This approach also has the benefit of making the hierarchy construction process less expensive, since a new first-level classifier does not have to be constructed.

Algorithm 4 Training and Classification with EBCA: RUFL Method
Require: A set W, the number of metaclasses m ≥ 2.
{Training with EBCA: RUFL}
1: Train a K-class classifier R0.
2: Use cross-validation to obtain the confusion matrix A.
3: Partition the classes into m metaclasses based on B = (A + Aᵀ)/2
4: for i = 1, . . . , m do
5:   Train R2i to separate the classes that fall into metaclass i
6: end for
Require: A point x.
{Classification with EBCA: RUFL}
1: Let i = R0(x), i ∈ {1, . . . , K}.
2: Let j be the metaclass to which class i belongs.
3: Let u = R2j(x), u ∈ {1, . . . , K}.
4: Return u.

The training and classification pseudocode for the RUFL approach is given in Algorithm 4. The training stage consists of training the classifier R0 (line 1) and obtaining the confusion matrix A using cross-validation (line 2). The classes are then partitioned into metaclasses (line 3) and a separate classifier is trained inside each metaclass (lines 4-6). A new point is first assigned to a metaclass based on the prediction of R0 (lines 1-2 of classification), and is then labeled with the corresponding second-level classifier (lines 3-4).
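To make the RUFL construction concrete, the sketch below wires together cross-validated predictions, the symmetrized confusion matrix, a metaclass partition and per-metaclass second-level classifiers using scikit-learn. All names are assumed: LogisticRegression stands in for the SVM/BMR classifiers used in the experiments, and partition_fn can be any function (for example a wrapper around the KKM sketch above) that maps B to an array of metaclass labels. It is a schematic illustration under these assumptions, not the original experimental code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

def train_rufl(X, y, m, partition_fn):
    """Train the EBCA:RUFL hierarchy; partition_fn maps B to metaclass labels."""
    classes = np.unique(y)
    r0 = LogisticRegression(max_iter=1000).fit(X, y)                    # global classifier R0
    y_cv = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=5)
    A = confusion_matrix(y, y_cv, labels=classes).astype(float)
    B = (A + A.T) / 2.0
    meta_of_class = np.asarray(partition_fn(B, m))                      # metaclass of each class
    second_level = {}
    for meta in range(m):
        members = classes[meta_of_class == meta]
        if len(members) > 1:
            mask = np.isin(y, members)
            second_level[meta] = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
    return r0, classes, meta_of_class, second_level

def classify_rufl(x, r0, classes, meta_of_class, second_level):
    """First level: reuse R0; second level: the classifier of the predicted metaclass."""
    c = r0.predict(x.reshape(1, -1))[0]
    meta = meta_of_class[np.where(classes == c)[0][0]]
    if meta in second_level:
        return second_level[meta].predict(x.reshape(1, -1))[0]
    return c   # singleton metaclass: R0's prediction is already final
```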
2.3
Combining Clustering Inside Classes with Error-Based Class Aggregation
The two approaches that we discussed (Clustering Inside Classes and Error-Based Class Aggregation) seem to work in exactly opposite ways. While one approach attacks the classification problem by partitioning the classes, the other method tries to simplify the problem by combining the classes into groups.
Despite this, or perhaps because of it, CIC and EBCA can be combined in a very natural way. We partition each class into k clusters, train a first-level classifier on the resulting k · K clusters, and use cross-validation to obtain a cluster-vs-cluster confusion matrix. Ideally, this approach will cause each class to split into different regions based on their similarity to other clusters, possibly those belonging to other classes, and the resulting clusters will be easier to group than the original classes. We then train classifiers to separate the clusters that fall into the same group. Classifying a new point proceeds in the same way as in the usual EBCA scheme, except that the final label is not that of a class but of a subclass, and has to be mapped to one of the original classes (which is straightforward).

Algorithm 5 Training and Classification with the Combination of CIC and EBCA: RUFL Method
Require: A set W, the number of clusters k ≥ 2, the number of metaclasses m > 0.
{Training with CIC+EBCA: RUFL}
1: for j = 1, . . . , K do
2:   Partition class Lj into k clusters.
3: end for
4: Train a k · K-class classifier R0.
5: Use cross-validation to obtain the cluster confusion matrix A.
6: Partition the clusters into m metaclusters based on B = (A + Aᵀ)/2
7: for i = 1, . . . , m do
8:   Train R2i to separate the clusters that fall into metacluster i
9: end for
Require: A point x.
{Classification with CIC+EBCA: RUFL}
1: Let i = R0(x), i ∈ {1, . . . , k · K}.
2: Let j be the metacluster to which cluster i belongs.
3: Let u = R2j(x), u ∈ {1, . . . , k · K}.
4: Return the class of cluster u.

These ideas are summarized in Algorithm 5 for the RUFL approach. Note that when k = 1, the classifier is the same as the EBCA classifier described in Section 2.2.2 and Algorithm 4. In fact, Algorithm 5 is exactly the same as Algorithm 4 except for lines 1-3 of the training stage, which partition each class into clusters. After that, a global classifier is trained to separate the clusters, and cross-validation is used to obtain a confusion matrix (lines 4-5). The clusters are grouped into metaclusters based on analysis of the confusion matrix (line 6). Finally, a classifier is trained inside each metacluster (lines 7-8). Classification of a new point is done exactly as in Algorithm 4.
2.4
Input-Partitioning Hierarchical Classifiers
In this section we discuss methods for building classifiers that involve partitioning the data based on the distances or similarities between the individual points (i.e. cluster analysis).
Figure 2.6: Partitioning the data into three clusters leads to three easy classification problems. The lines denote the local classifiers.
Many hierarchical approaches (also known as ensembles of classifiers, local experts, etc.) involve partitioning the training data and training separate classifiers on different parts of it. The intuition behind this is that using several "local" classifiers (classifiers trained on the points in a small part of the input space) will result in simpler decision rules and better performance than a single globally trained classifier. Once the local classifiers are built, classification of a new point proceeds as follows. The new point is first assigned to one or more clusters and classified separately by the classifiers trained on these clusters. The results from the different classifiers are then combined using a majority vote or some form of weighted voting, where the vote of a local classifier is weighted by the distance of the point to that classifier's cluster. Examples of voting rules used in this context are given in [34, 98]. Figure 2.6 shows an example of a situation where such an approach is clearly beneficial. The original classification problem (separating the two kinds of circles) is clearly non-linear. However, partitioning the data into 3 clusters results in 3 easy linear classification problems. Notice that using clustering inside classes here would be more complicated: it would require at least three clusters for each class and, in the case of K-Means and many other clustering algorithms, appropriate initial conditions in order to obtain the kind of partition we would like. After that we would still have to construct a classifier for the 6 resulting classes. With input partitioning, we have only 3 clusters, and in each we construct a single binary classifier.
2.4.1
Existing Work
Here we discuss in some detail a typical paper [98] dealing with input partitioning. The approach consists of several stages.
During the first stage a fuzzy clustering of the data is obtained. The input density is assumed to be a mixture of k Gaussian densities, whose parameters are estimated from the data. The number of resulting clusters k is taken to be the number of components in the mixture. Once this is determined, the mean mj and covariance matrix Σj, together with the prior probability µj of the cluster, are estimated for each cluster j. The local input density function is assumed to have the form:

g_j(x) = \frac{\exp\!\big(-\tfrac{1}{2}(x - m_j)' \Sigma_j^{-1} (x - m_j)\big)}{(2\pi)^{n/2} |\Sigma_j|^{1/2}}.    (2.22)

The global input density is:

g(x) = \sum_{j=1}^{k} \mu_j g_j(x), \qquad \mu_j \geq 0, \quad \sum_{j=1}^{k} \mu_j = 1.    (2.23)

During the second stage, the sets for training the local classifiers corresponding to each cluster are formed. Each point x in the training set is assigned to all clusters j such that gj(x) > Θ, where Θ is a threshold parameter. A local classifier fj(x) is then trained on the points of the jth group. The resulting classifier F(x) combines the outputs of the local classifiers fj in the following way:

F(x) = \frac{1}{\sum_{j=1}^{k} \alpha_j g_j(x)} \sum_{i=1}^{k} \alpha_i g_i(x) f_i(x).    (2.24)
The parameters αi can be chosen in a number of different ways. Setting αi = µi leads to weighted averaging. A different combination method, called adaptive, adjusts the αi to minimize some error measure on a subset of the training data that was not used in training the local classifiers (sometimes called the validation set [98]).

The local classifier in the above scheme can be a Support Vector Machine (SVM). An SVM finds a linear classifier with discriminant function f(x) = wx + b, where w is the vector of parameters computed on the training set, that maximally separates the two classes. SVMs are discussed in detail in Appendix C. Another choice of local classifier examined in [98] is the Multi-Layer Perceptron (MLP) [8].

The performance of the hierarchical methods described above was compared with that of a single global classifier (also an SVM or MLP) and of a stacked combination [122] of global classifiers. (In a stacked combination the training set is randomly partitioned into k equal subsets and classifiers of the same type are built on each subset. The final classifier has the form

F(x) = \sum_{i=1}^{k} \lambda_i f_i(x), \qquad \sum_{i=1}^{k} \lambda_i = 1, \quad \lambda_i \geq 0 \;\; \forall i = 1, \ldots, k,    (2.25)

where the weights λi are computed on a validation set to minimize an error measure [98].) Experimental results on two benchmark datasets (Phonemes from the ELENA project, and the Vowel dataset [9]) indicate that hierarchical classifiers may perform better than either a single global classifier or a stacked combination.
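A compact sketch in the spirit of this fuzzy input-partitioning ensemble, using scikit-learn's GaussianMixture for the density estimate and linear SVMs as local experts, is shown below. It follows (2.22)-(2.24) with αi = µi (weighted averaging); the threshold value and all function names are illustrative assumptions, the labels are assumed binary in {−1, +1}, and each thresholded training set is assumed to contain both classes.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def component_densities(gmm, X):
    """Per-component Gaussian densities g_j(x) recovered from the fitted mixture."""
    resp = gmm.predict_proba(X)                 # p(j | x)
    total = np.exp(gmm.score_samples(X))        # p(x) = sum_j mu_j g_j(x)
    return resp * total[:, None] / gmm.weights_[None, :]

def train_fuzzy_local_svms(X, y, k, theta=1e-4):
    """Fuzzy partition by a Gaussian mixture plus one linear SVM per component."""
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    g = component_densities(gmm, X)
    experts = []
    for j in range(k):
        mask = g[:, j] > theta                  # points assigned to cluster j
        experts.append(SVC(kernel='linear').fit(X[mask], y[mask]))
    return gmm, experts

def classify(X_new, gmm, experts, alphas=None):
    """Weighted combination F(x) of the local decision functions, in the style of (2.24)."""
    alphas = gmm.weights_ if alphas is None else alphas   # alpha_i = mu_i: weighted averaging
    g = component_densities(gmm, X_new)
    w = alphas[None, :] * g                                # alpha_i * g_i(x)
    f = np.column_stack([e.decision_function(X_new) for e in experts])
    F = (w * f).sum(axis=1) / w.sum(axis=1)
    return np.sign(F)
```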
The work in [29] and [34] has a similar structure to that described in [98]. First a (fuzzy) partition of the training set is obtained, then the training sets corresponding to each cluster are formed and local classifiers are trained on these sets. The local classifiers are combined using a weighting scheme based on the distance from a point to the cluster centers. We also note that, in addition to fuzzy partitioning approaches, [34] considered a deterministic partitioning scheme. Each training point was assigned to exactly one cluster, and a decision tree classifier (C4.5) was trained on each cluster. During testing, a point was first assigned to the cluster with the closest center, and then classified with the corresponding local classifier. According to the results reported in [34], this approach did not improve on the global classifier. We note several drawbacks of such approaches:
• When fuzzy assignment is used, all local classifiers are needed at classification time, increasing the computational complexity.
• The methods used to assign weights to a point are based on distances to the clusters and do not directly take into account the separation between the points.
• A particular cluster/local training set may be inadequate for training a local classifier due to a lack of training data or the presence of noise.
2.4.2
Our Approach
We suggest a new architecture that addresses some of the drawbacks of the approaches described in the literature. First, we use only deterministic clustering algorithms: each point belongs to exactly one cluster. This allows the clustering and classification stages to proceed more efficiently and simplifies interpretation of the behavior of the resulting classifier. Second, we use a classifier at the first level to assign new points to regions. This takes into account the actual separation between the points of different clusters, which is ignored by distance-based methods. For example, in the case of K-Means, assigning a new point to a cluster based on its distances to the centroids ignores the distribution of the points in the clusters: it is possible that the points of a cluster with a more distant centroid are much closer than those of the cluster with the closer centroid. A classification method would detect such a situation (Figure 2.7 illustrates this).

Figure 2.7: Point X is closer to the center of the small cluster, and a centroid-based method would assign it to that cluster. A classifier such as an SVM, however, would place a separating line (shown) between the boundary points of the clusters rather than between the centers, assigning X to the large cluster. The latter decision seems more appropriate, given the dispersion of the two clusters.

Finally, the method that we suggest is capable of choosing whether to use a local classifier or the global one based on the quality of the region and/or of the local classifier. More specifically, this method uses cross-validation [70] to estimate the quality of the local and global classifiers in a region. If the expected quality of the local classifier is worse than that of the global one, then the global classifier is used instead. Note that in our approach only two classifiers are active when labeling a particular point (first, the point is assigned to a cluster, and then it is classified with an appropriate local classifier). This can be an advantage in situations where fast decision-making is needed. A graphical illustration of this scheme is given in Figure 2.8.

We now discuss several ways to determine the quality of a local rule. First, if the cluster contains very few points, we may not even be able to construct a rule; in such a region we have to resort to using the global classifier. A general way of estimating the quality of a classifier is cross-validation [70]. Let α(R|S) be the estimated accuracy of classifier R (trained on a set W) on a set S ⊆ W (i.e. the fraction of points in S that it classifies correctly). Also, let α(R|S, L) be the accuracy of classifier R on set S with respect to a particular class L (i.e. the probability that a point x ∈ S belonging to class L is classified correctly by R). One approach is to compare the quality of the local classifier to that of the global one in the region:

q_1(i) = \alpha(R_{2i} \mid S_i) - \alpha(R_0 \mid S_i).
Figure 2.8: Hierarchical scheme with clusters: a point is first assigned to a cluster, and then labeled using a classifier trained on that cluster. For clusters with poor local classifier quality (Cluster 3) or insufficient data (Cluster 5), the global classifier is used instead.

A more sophisticated approach would take into consideration the probability that a particular point does not reach cluster Si because of a mistake of the first-level classifier R1, or that it reaches cluster Si incorrectly. In such a case we need to weight the local accuracy and the accuracy of the global classifier (since the other local classifiers should be at least as good in their own regions) by the probability of correctly assigning a point to the cluster if it belongs to it (or to its complement, if it does not). The difference due to using the local classifier for cluster Si is then given by:

\begin{aligned}
q_2(i) = \; & p(S_i)\,\alpha(R_1 \mid W, S_i)\,\alpha(R_{2i} \mid S_i) + (1 - p(S_i))\,\alpha(R_1 \mid W, W/S_i)\,\alpha(R_0 \mid W/S_i) \\
& + p(S_i)\,(1 - \alpha(R_1 \mid W, S_i))\,\alpha(R_0 \mid S_i) + (1 - p(S_i))\,(1 - \alpha(R_1 \mid W, W/S_i))\,\alpha(R_{2i} \mid W/S_i) \\
& - \alpha(R_0 \mid W).
\end{aligned}

Here the first term is the probability of the local classifier correctly labeling a point that was correctly assigned to the cluster, and the second term is the probability of the global classifier correctly labeling a point that was correctly assigned to the complement of the cluster. The third and fourth terms are the probabilities of correct classification, by the global and local classifiers respectively, of points that were misrouted at the first level (i.e. incorrectly assigned to the cluster or to its complement). It follows from this formula that as the accuracy of the first-level classifier R1 approaches 1, i.e. α(R1|W, Si) → 1 and α(R1|W, W/Si) → 1, the third and fourth terms vanish and the value of
q2(i) approaches q1(i). In this case the difference between the hierarchical scheme and the global scheme reduces to the difference in accuracy between the local and global classifiers inside the cluster.

Since "actual cluster labels" do not exist for a new point, it is impossible to determine whether such a point truly belongs to a particular cluster; thus the first-level classifier in some sense cannot make a mistake. However, given a number of examples from the training set, it is possible to build a classifier for the task and to speak of its accuracy as estimated on the training set (for example, by cross-validation). High accuracy of the first-level classifier then indicates the presence of well-separated groups of points. Low accuracy indicates that the clusters found by the clustering method are artificial and do not correspond to the actual structure of the data.

Algorithm 6 HGC Training and Classification
Require: A set W, the number of clusters k ≥ 2, a quality measure qs.
{Training with HGC}
1: Train a global classifier R0
2: Partition W into k clusters
3: Train a first-level classifier R1 to distinguish between the k clusters
4: for i = 1, . . . , k do
5:   Train a local classifier R2i to recognize only the classes present in cluster Si
6:   Compute the quality of R2i: qs(i)
7: end for
Require: A point x.
{Classifying with HGC}
1: Let i = R1(x), i ∈ {1, . . . , k}.
2: if qs(i) < 0 then
3:   Let j = R0(x).
4: else
5:   Let j = R2i(x).
6: end if
7: Return j.

Training and classification using the HGC approach (with either of the quality tests for deciding when to use the local classifier) are described in Algorithm 6. Briefly, the global classifier is trained using all the data (line 1) and the dataset is partitioned into clusters (line 2). Then a first-level classifier R1 is trained to distinguish between the clusters (line 3). Finally, a local classifier is trained inside each cluster, and its quality is estimated with cross-validation (lines 4-7). A new point is first assigned to a cluster (line 1) and then, depending on the quality of the local classifier in that cluster, is labeled with either the global classifier or the local one (lines 2-6).
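The sketch below puts the HGC pieces together with K-Means, linear SVMs and a cross-validated estimate of the q1 quality measure. It is a simplified illustration under these assumptions (q1 only, scikit-learn components instead of LIBSVM, and only loose handling of degenerate clusters); the global accuracy inside a cluster is estimated on training data rather than by full cross-validation, which is optimistic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def train_hgc(X, y, k):
    """Hierarchy with a Global Classifier: R0, R1 and per-cluster local classifiers."""
    r0 = LinearSVC().fit(X, y)                                        # global classifier R0
    clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    r1 = LinearSVC().fit(X, clusters)                                 # first-level classifier R1
    locals_, quality = {}, {}
    for i in range(k):
        Xi, yi = X[clusters == i], y[clusters == i]
        _, counts = np.unique(yi, return_counts=True)
        if len(counts) < 2 or counts.min() < 3:
            quality[i] = -1.0                                         # too little data: fall back to R0
            continue
        locals_[i] = LinearSVC().fit(Xi, yi)
        acc_local = cross_val_score(LinearSVC(), Xi, yi, cv=3).mean() # estimated local accuracy
        acc_global = (r0.predict(Xi) == yi).mean()                    # global accuracy in the region
        quality[i] = acc_local - acc_global                           # a simple stand-in for q1(i)
    return r0, r1, locals_, quality

def classify_hgc(X_new, r0, r1, locals_, quality):
    clusters = r1.predict(X_new)
    out = np.empty(len(X_new), dtype=object)
    for n, i in enumerate(clusters):
        use_local = quality.get(i, -1.0) >= 0 and i in locals_
        model = locals_[i] if use_local else r0
        out[n] = model.predict(X_new[n:n + 1])[0]
    return out
```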
2.5
Summary
In the preceding sections we proposed four ways of using deterministic clustering in building hierarchical classifiers.
CIC: Clustering Inside Classes.
EBCA: Error-Based Class Aggregation (grouping classes together on the basis of the number of errors between them).
RUFL: ReUsing the First Level classifier (the classifier used to obtain the confusion matrix is re-used as the first-level classifier in the EBCA scheme).
RTFL: ReTraining the First Level classifier (when classes are grouped together, a new classifier is trained to distinguish these metaclasses).
HGC: Hierarchy with a Global Classifier (a hierarchical approach where the global classifier may be used in some regions instead of local classifiers).
KKM: Kernel K-Means (K-Means clustering on a matrix of similarities).
RBC: Representative-Based Clustering (grouping data based on similarity to a point selected as a cluster representative).

Table 2.1: Abbreviations of the methods introduced in this chapter.
One approach, CIC, involves cluster analysis of the points belonging to the same class. EBCA, on the other hand, combines together whole classes. These two approaches can be combined in a simple fashion: first the classes are partitioned into clusters, and then the clusters are aggregated into "metaclusters", which can span several classes. The approach of Section 2.4 obtains such class-spanning clusters directly, by performing cluster analysis on the points of all classes together. Thus the approaches discussed in this chapter involve clustering at all four levels. In the next chapter we discuss the results of an experimental evaluation of these approaches.
Chapter 3 Experimental Work on Benchmark Datasets
3.1
Introduction
This Chapter covers experiments conducted to assess the improvements in classification accuracy that can be obtained with our methods when compared to a single global classifier. In a number of cases we also demonstrate that our approach is better than related methods proposed in the literature. These experiments were mainly conducted on a set of 4 benchmark problems from the UCI repository [9], with several additional experiments on synthetic data. To evaluate the proposed methods we constructed more than 3000 classifiers for these 4 problems, by varying the parameters of the methods, the ways of combining local classifiers, and the amount of data in the training set. The detailed tables of results are presented in Appendix E, while here we provide a summary and the highlights of this set of experiments.

Briefly, our experiments show the practical usefulness of clustering within (or inside) the classes (CIC) and of the "input partitioning" schemes. As will be shown below, Error-Based Class Aggregation (EBCA) did not yield better results than a global classifier (and combining it with CIC did not improve on the results obtained with CIC alone), even though our suggestions led to improvements over previously described EBCA approaches. We find that clustering inside classes and unsupervised clustering lead to better results than using global classifiers or combining clusters or classes together.

In this Chapter we consider accuracy as a single-number measure of performance. Appendix E contains a selection of additional results demonstrating that the improvements achieved by the CIC and HGC approaches are obtained either by an overall improvement in the accuracy with respect to each class, or by a strong improvement for several classes at the cost of slight losses on the other classes.

This Chapter has the following structure. We first describe the datasets used in our experiments and the preprocessing applied to them. We then discuss the software and settings that we used. After that we analyze the results of each suggested approach. We end with conclusions and a discussion of future work.
Class   Training Set   Test Set
1          1,072          461
2            479          224
3            961          397
4            415          211
5            470          237
6          1,038          470

Table 3.1: Class sizes in the Satimage dataset.
3.2
Datasets
For the experiments we used four well-known datasets from the UCI repository [9]. These particular datasets were chosen because they are all multi-class classification problems with real-valued variables, and they represent a reasonable set for a first evaluation of our methods.

1. The Image Segmentation dataset involves 2,310 observations (210 train and 2,100 test), evenly divided between 7 classes. The points correspond to 3x3 pixel regions randomly drawn from 7 outdoor images (with the name of the image serving as the label), and are represented by 19 numeric features computed from the color intensities and coordinates of the pixels in the regions. The images are brick-face, sky, foliage, cement, window, path and grass.

2. The Pendigit dataset consists of 10,992 points represented by 16 features and a label (a digit from 0 to 9). This dataset was created by collecting 250 samples of digits written on a pressure-sensitive tablet from each of 44 writers. The features were obtained by spatial resampling of the points in the plane and are integers in the range [0, 100]. The training set has 7,494 points (collected from 30 writers), while the test set has 3,498 points (from the other 14 writers). The class sizes are approximately equal in both the training and the test sets.

3. The Satellite Image (Satimage) dataset consists of 6 classes, 36 features and 6,435 observations (4,435 train and 2,000 test). The points correspond to multi-spectral values of pixels in 3x3 regions of a satellite image (4 spectral bands for each of the 9 pixels). The features are integers in the range [0, 255]. The labels are soil types. The class sizes vary, with class 4 being the smallest and class 1 the largest. Table 3.1 gives the training and test set sizes for each class.

4. The fourth dataset is the Vowel dataset. There are 990 points, corresponding to 11 English vowel sounds, represented by 10 features and a label. The sounds are those from the words "heed", "hid", "head", "had", "hard", "hud", "hod", "hoard", "hood", "who'd" and "heard". The features are derived from analysis of windowed segments of the speech signal and are real-valued. The data was gathered from 15 speakers (8 males and 7 females), each repeating a particular sound 6 times. The training set consists of 528 points (data from 4 males and 4 females), and the test set has 462 points
(the other 4 males and 3 females). The classes are represented equally in the training and test sets.

The properties of these datasets are summarized in Table 3.2.

Dataset    Classes   Dimensions   Training Set Size   Test Set Size
Image         7          19              210               2,100
Pendigit     10          16            7,494               3,498
Satimage      6          36            4,435               2,000
Vowel        11          10              528                 462

Table 3.2: Summary of the datasets.

It is worth noting that the training/test splits used here are unusual from the standpoint of pattern recognition. For example, in the Image dataset the training set is 10 times smaller than the test set. In the Vowel dataset the training set contains only slightly more data than the test set. Furthermore, the training sets of the Image and Vowel datasets are small (30 and 48 points per class, respectively). For the initial experiments we decided to use the splits described above, as given in the UCI repository [9], for two reasons. Firstly, this makes our results comparable to those described in the past literature and allows future comparisons. Secondly, from a practical point of view, a user rarely has control over the size of the available training data, and it would not be surprising if a classifier had to handle a hundred times more data than was available for training. A labeled training set with even 30-50 examples per class may be rather expensive to obtain for a real problem. Further on we will evaluate selected methods (those that perform well in the initial experiments) using multiple training/test splits, while controlling for the size of the training set.
3.3
Feature Normalization
In machine learning, data is frequently preprocessed to address possible problems with the features, including those of scaling and heterogeneity of distributions. The scales of individual features can differ drastically. Such disparities are often caused by the use of specific units of measurement and are not an inherent characteristic of the data. However, they can present problems for many machine learning methods, since features with different scales require very different weights. Another possibility is that, while the features have approximately the same scale, the distributions of their values differ considerably in their means, variances and possibly higher-order moments. This can lead to problems similar to those of having features on different scales. For clustering algorithms the issue of feature weighting or rescaling becomes even more important than
for supervised learning methods. There are many methods for addressing these issues (for example [81, 43]). The choice of a particular method may have a great effect on the performance of the machine learning methods, and the best method is frequently problem-specific. However, since the exploration of different normalization methods is not the subject of this work, we used the conventional statistical normalization, where each feature is independently transformed to have zero mean and unit variance:

x_{ij} = \frac{x_{ij} - \bar{x}_j}{\sigma_j},    (3.1)

where \bar{x}_j is the mean value of the jth coordinate and σj is its standard deviation, estimated on the training set. The normalization is applied both to the training and the test sets, using \bar{x}_j and σj computed on the training set. We conducted all the experiments on data that underwent feature normalization according to (3.1).
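In practice this is the standard z-score transformation; a short sketch with scikit-learn's StandardScaler (fit on the training set only and then applied to both sets, as in (3.1)) is shown below. The toy arrays are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 220.0], [3.0, 180.0]])  # toy training data
X_test = np.array([[2.5, 210.0]])                               # toy test data

scaler = StandardScaler().fit(X_train)   # means and standard deviations from the training set
X_train_n = scaler.transform(X_train)    # zero mean, unit variance per feature
X_test_n = scaler.transform(X_test)      # same training-set parameters reused on the test set
```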
3.4
Methods
In Chapter 2 we proposed several schemes requiring partitioning of the data (using a clustering algorithm) and building global and local classifiers. In striving for generality, we deliberately did not specify particular methods for these steps in some of the proposed schemes. Now, however, we need to make these choices in order to proceed to the experimental evaluation of our methods.

The problem of partitioning a set of points into clusters, or decomposing it into a hierarchical structure, is covered in the clustering literature ([60] provides an overview). Clustering methods differ in the assumptions they make about the nature of the data, in the partitioning and clustering constraints, and in the similarity measures that they use. Different methods may also lead to rather different results. Thus the choice of an appropriate clustering method for a dataset depends on the nature of the data and the goals of the user.

In discussing the Hierarchy with a Global Classifier (HGC) scheme we stressed that the partitioning/clustering method has to be deterministic, or "hard". Similarly, for the CIC and CIC+EBCA approaches, a deterministic partitioning method, where each point is assigned to exactly one cluster, is the natural choice. For these reasons, and because of our desire for simplicity in implementation and interpretation, we decided to use a popular and well-studied deterministic clustering method, K-Means. (Various aspects of K-Means are discussed in Appendix A.) We developed our own implementation in order to better control its application.

The choice of the number of clusters, a parameter that has to be specified for K-Means and other
clustering algorithms, is a well-known problem. While many methods have been proposed [97, 104], there is no standard approach, since the choice is usually problem-specific. Therefore, in our work we experiment with a number of different values of this parameter.

We experimented with two classifiers, Support Vector Machines (SVM) and Bayesian Multinomial Regression (BMR). SVM is a popular and powerful method; its background is discussed in Appendix C. The particular implementation that we used was LIBSVM v2.71 [14], with a linear kernel (parameter settings "-s 0 -t 0"). We modified LIBSVM to allow repeated stratified cross-validation using the code provided on the LIBSVM website (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/). The value of the hyperparameter C to be used for each dataset was selected by 5-fold cross-validation on the training set. The selected values are shown in Table 3.3. The values considered were of the form 2^p, with p = −2, −1, 0, 1, . . . , 12. (Such a set of values was previously used in [55].)

Dataset     Image   Pendigit   Satimage   Vowel
C (SVM)      2^4        2        2^{-1}     2^3
σ² (BMR)      36       25         2.25     20.25

Table 3.3: Hyperparameter values selected with cross-validation on the training sets.

BMR (http://mms-01.rutgers.edu/~ag/BMR/index.html) was recently developed at DIMACS. The binary classification version of BMR has shown results competitive with SVM on text classification problems [41]. Appendix B discusses logistic regression in more detail. In our experiments we used a Gaussian prior, which leads to ridge logistic regression (i.e. the penalization term is the L2 norm of the vector of parameters). BMR has an internal cross-validation procedure for selecting the hyperparameter (the prior variance σ²) from a list of possible values, based on maximizing the log-likelihood of the training data. We specified 10-fold cross-validation to select the parameter from the following list of values: 0.1, 1, 2.25, 4, 6.25, 9, 12.25, 16, 20.25, 25, 30.25, 36, 42.25, 49, 56.25, 64, 100. We had to write a script for external cross-validation to obtain predictions on the training set.
3.4.1
Multiclass Classification
We would like to note that LIBSVM and BMR take different approaches to multiclass classification. BMR implements a natural extension of binary logistic regression to multiclass problems, which can be
thought of as a one-vs-all approach with a constraint that the scores sum to one (i.e. the score for each class is a probability); see Appendix B for details. LIBSVM, on the other hand, implements a one-vs-one approach [38], where a classifier is constructed for each pair of classes and votes for one of the two; a test case is assigned to the class that receives the largest number of votes. Because LIBSVM and BMR use different approaches to multiclass classification, they may be affected differently by a change in the number of classes in a problem. However, as will be seen from the discussion that follows, the results of the proposed schemes (CIC, EBCA, HGC, etc.) are qualitatively similar in almost all of the experiments regardless of whether LIBSVM or BMR is used as the global classifier. (There is only one minor difference, with the EBCA approach, which will be discussed below.) For this reason we will not analyze the effect of the differences between these multiclass approaches on the performance of our methods.

                R correct   R incorrect
R0 correct        n11          n10
R0 incorrect      n01          n00

Table 3.4: Agreement between two classification results, produced by algorithms R0 and R.
3.5
Statistical Significance Testing
As mentioned above, in these experiments we focus on accuracy - the percentage of cases in the test set that are assigned to the correct class - as the main measure of performance. In particular, we would like to know whether the proposed methods give better accuracy than a standard global classifier. One question that arises when comparing the accuracy (or other measures of performance) of different methods is whether the observed differences are important or whether they could be due to chance alone. Statistical significance testing is one way of answering such questions. We formulate the null hypothesis H0, which is that, given a training and a test set, the performance of a particular algorithm R (with parameter settings fixed) is not different from the performance of the global classifier R0. The hypothesis H1 is that the two performances are different. The literature [25] recommends the McNemar test as having a low probability of Type I error (i.e. rejecting the null hypothesis when it is correct) when multiple resampling or cross-validation experiments would be too expensive. The McNemar test looks at the cases where only one of R0 and R makes a mistake - entries n10 and n01 in Table 3.4. The following statistic is then approximately distributed as χ² with 1 degree of
freedom:

s = \frac{(|n_{10} - n_{01}| - 1)^2}{n_{10} + n_{01}},    (3.2)

where n10 and n01 are the numbers of points on which the classifiers disagree. If the null hypothesis is correct, then P(s > χ²_{1,α}) < 1 − α. Therefore, if s > χ²_{1,α}, we say that the difference is significant at the α level. We considered three levels of significance: results that are different at the 0.95, 0.99 and 0.999 confidence levels. Appendix E contains the results of these comparisons for the CIC, EBCA and HGC approaches. In this Chapter we refer to some of the results described there in the course of the discussion.

Figure 3.1: Accuracy (y-axis) of CIC with SVM (left) and BMR (right), as a function of the number of clusters per class, k (x-axis), for all four datasets.
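The statistic (3.2) and the corresponding significance check can be computed directly; the small Python sketch below uses scipy's chi-square distribution, with invented disagreement counts for illustration.

```python
from scipy.stats import chi2

def mcnemar(n10, n01, alpha=0.95):
    """Continuity-corrected McNemar statistic (3.2) and a significance flag."""
    s = (abs(n10 - n01) - 1) ** 2 / float(n10 + n01)
    threshold = chi2.ppf(alpha, df=1)        # chi^2_{1,alpha}
    return s, s > threshold

# Example: the two classifiers disagree on 30 + 12 test points.
print(mcnemar(30, 12, alpha=0.999))
```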
3.6
Empirical Evaluation of the Clustering Inside Classes
In Clustering Inside Classes (CIC) each class is partitioned into a number of clusters, k, and the clusters are treated as separate classes when training a classifier. During classification, the classifier produces a cluster label for each point, which is then converted to the corresponding class label.
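A sketch of this train/classify loop with K-Means and a linear SVM (standing in for the LIBSVM/BMR classifiers actually used in the experiments) is given below; the function names are assumptions, and the cluster-to-class mapping is the only bookkeeping required.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def train_cic(X, y, k):
    """Split every class into k clusters and train one classifier on the cluster labels."""
    cluster_labels = np.empty(len(y), dtype=int)
    cluster_to_class = {}
    next_id = 0
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        sub = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[idx])
        for j in range(k):
            cluster_to_class[next_id + j] = c
        cluster_labels[idx] = sub + next_id
        next_id += k
    clf = LinearSVC().fit(X, cluster_labels)
    return clf, cluster_to_class

def classify_cic(X_new, clf, cluster_to_class):
    """Predict a cluster label and map it back to the original class."""
    return np.array([cluster_to_class[c] for c in clf.predict(X_new)])
```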
The graphs in Figure 3.1 show the accuracy on each dataset as a function of the number of clusters per class, k = 1, . . . , 4. It is clear from the graphs that on all datasets except Image this method improves classification accuracy. With SVM, the results on the Image dataset with k = 3, 4 are significantly worse (at the 0.999 level) than those of the global classifier. On the other hand, the accuracy of CIC is significantly better at the 0.999 level for all values of k on the Pendigit dataset and for k = 4 on the Satimage dataset, and is somewhat better (0.95 level) for k = 3, 4 on the Vowel dataset. The results with BMR are qualitatively similar (significant improvements on the Pendigit and Satimage datasets, no significant difference on the Vowel dataset, and worse results on the Image dataset).

Notice that the value of k for which the best results are obtained appears related to the number of points per class in the training set. On the Image dataset this number is on average 30, and the best results are for k = 1 (the global classifier). For the other datasets, which are larger, the best results are obtained with k = 4. Intuitively, increasing the number of clusters leads to smaller clusters; and when the clusters are small they cannot be representative of the data distribution, making it difficult to train good classifiers for identifying them. However, as we will show later, these differences in performance are not completely due to the training set size, but are also affected by the intrinsic structure of the classes. In other words, the Image dataset has classes that consist of a single component, while the other datasets have classes with a more complicated structure.

We would also like to comment on the possibility of learning about the data from such an analysis. The Vowel dataset is particularly appropriate for demonstrating this, since we know that it has additional structure besides the class labels. Specifically, each class can be partitioned into 2 groups (based on the sex of the speaker) or into 8 groups (corresponding to the individual speakers). Indeed, looking at Table 3.7, we can see that when clustering into two clusters the points corresponding to a single person tend to stay together (Persons 0 and 1 are the main exceptions: their points are split in several of the classes). The tendency of the points corresponding to one person to stay together is retained when increasing the number of clusters, k. To measure this quantitatively, we computed an optimal matching between the clustering solution and the known partition and checked the fraction of the points that lie on the diagonal, as well as the Rand criterion [96, 57]. The Rand criterion is the fraction of object pairs that are correctly assigned to the same or to different clusters; a completely random partition of two classes into two clusters, compared to the true partition, will have a Rand criterion of 0.5.

Class   Diagonal   Rand
1        0.875    0.777
2        1        1
3        1        1
4        0.92     0.844
5        0.813    0.689
6        0.813    0.689
7        0.75     0.617
8        0.75     0.617
9        0.875    0.777
10       0.979    0.958
11       0.813    0.689

Table 3.5: Diagonal and Rand [96, 57] measures of matching between the class partition (k = 2) of the training data and the partition by the sex of the speakers.

Class   Diagonal   Rand
1        0.917    0.972
2        0.833    0.961
3        0.833    0.961
4        1        1
5        1        1
6        0.979    0.990
7        0.875    0.968
8        0.917    0.972
9        0.958    0.981
10       1        1
11       0.958    0.982

Table 3.6: Diagonal and Rand [96, 57] measures of matching between the class partition (k = 8) of the training data and the partition by the individual speakers.
[Table 3.7: Cluster structure of the Vowel dataset produced with CIC, k = 2. For each of the 11 vowel classes (heed, hid, head, had, hard, hud, hod, hoard, hood, who'd, heard) and each of the 8 speakers in the training set (Persons 0-3 male, Persons 4-7 female), the table gives the number of that speaker's 6 points falling into each of the two within-class clusters; for most speaker/class combinations all 6 points fall into a single cluster, with Persons 0 and 1 as the main exceptions.]
Figure 3.2: There are two classes (solid white and shaded). In the figure on the left, partitioning the white class into three clusters (with the separation denoted by lines) leads to an easily solvable multiclass problem. However, moving all the points closer to each other results in a partition of the white class in which cluster 3 is non-convex (figure on the right). This makes it impossible to linearly separate cluster 3 from the other class without mistakes.
With k = 2 the clusters match the partition based on the sex of the speaker (Table 3.5) in most of the classes. With k = 8 the partitions inside the classes closely match the partitions by individual speakers (Table 3.6), with the Rand index always above 0.9. The accuracy of the CIC scheme with k = 8 is, however, lower (51.52) than with k = 4 (56.71) or even k = 2, 3, probably due to the small amount of data in each class and/or overlap between the clusters and classes. This example suggests that the partitions obtained by CIC can correspond to the intrinsic underlying structure of the data.
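Both matching measures are straightforward to compute. The sketch below pairs scipy's Hungarian matching (for the diagonal fraction under an optimal cluster-to-group assignment) with a direct pair-counting Rand index; the toy labelings and function names are invented for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def diagonal_measure(clusters, groups):
    """Fraction of points on the diagonal after optimally matching clusters to groups."""
    cs, gs = np.unique(clusters), np.unique(groups)
    M = np.array([[np.sum((clusters == c) & (groups == g)) for g in gs] for c in cs])
    rows, cols = linear_sum_assignment(-M)          # maximize the matched counts
    return M[rows, cols].sum() / float(len(clusters))

def rand_index(clusters, groups):
    """Fraction of object pairs assigned consistently (same/same or different/different)."""
    n = len(clusters)
    agree = 0
    for i in range(n):
        for j in range(i + 1, n):
            agree += (clusters[i] == clusters[j]) == (groups[i] == groups[j])
    return agree / (n * (n - 1) / 2.0)

clusters = np.array([0, 0, 1, 1, 1, 0])   # toy within-class cluster labels
groups   = np.array([0, 0, 1, 1, 0, 0])   # toy "true" group labels (e.g. speaker sex)
print(diagonal_measure(clusters, groups), rand_index(clusters, groups))
```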
3.7 Clustering Inside Classes on Synthetic Data
We conducted a small experiment on synthetic data to study the relationships between k, the structure of the data and classification accuracy. The idea for the experiment is described in Figure 3.2. We construct a dataset with two classes, where one class has three distinct components of non-spherical shape. By moving these components closer together we make the problem more difficult, because K-Means will construct clusters that overlap the components. This setup also allows us to examine the role of the parameter k in greater detail. The data was generated as follows. Class 1 consists of three normal distributions.
Global (k = 1)
d     s=1     s=2     s=3
0.5   80.57   72.73   70.47
1.0   44.13   50.90   53.40
1.5   46.57   43.83   50.70
2.0   63.33   55.87   47.87
2.5   66.43   62.80   58.23
3.0   66.67   65.90   62.57

k = 2
d     s=1     s=2     s=3
0.5   82.67   83.60   84.37
1.0   85.27   84.30   83.17
1.5   87.43   85.93   85.33
2.0   89.17   88.10   86.23
2.5   88.73   88.87   88.77
3.0   88.50   89.37   89.43

k = 3
d     s=1     s=2     s=3
0.5   79.83   82.83   87.17
1.0   87.13   86.17   86.07
1.5   95.40   92.53   88.63
2.0   96.07   94.90   92.60
2.5   96.23   95.63   94.97
3.0   96.50   96.27   95.70

k = 4
d     s=1     s=2     s=3
0.5   76.07   82.67   85.40
1.0   87.33   85.07   87.63
1.5   94.43   90.57   89.60
2.0   95.27   94.37   93.23
2.5   95.40   95.73   94.67
3.0   95.20   96.23   96.47
Table 3.8: Average Results of CIC on synthetic data for different values of k.

The means are µ11 = (−d, d), µ12 = (d, d) and µ13 = (0, 0). The covariance matrices for these distributions were

\Sigma_{11} = \frac{1}{2}\begin{pmatrix} 1+s^2 & 1-s^2 \\ 1-s^2 & 1+s^2 \end{pmatrix}, \qquad \Sigma_{12} = \frac{1}{2}\begin{pmatrix} 1+s^2 & s^2-1 \\ s^2-1 & 1+s^2 \end{pmatrix},

and Σ13 = I. Here s is a parameter controlling the "stretchingness" of the ellipsoids in Figure 3.2, while d controls the positions of the means and thus the distance between the components. Another way to think about components 1 and 2 is that they come from a standard normal distribution that has been stretched along the y-axis by a factor s, rotated by 45 or −45 degrees, and then moved based on the value of d. Class 2 has µ2 = (0, 1) and Σ2 = I. The training set had 100 points for each component of class 1 and 300 points for class 2, i.e. the classes were balanced. The test set had the same distribution of points. The experiments were repeated 10 times for each combination of s and d, with the data randomly generated every time. For simplicity, in these experiments we used only SVM with c = 1, and only class 1 was partitioned into clusters. Preliminary experiments showed that partitioning class 2 into the same number of clusters leads to similar results. The results of the global classifier and of CIC with k = 3 clusters for class 1 are in Table 3.8. Initially, for d = 0.5 there is a large overlap between the classes. In such a situation, a larger s makes
Figure 3.3: Positions of the cluster centers for 10 randomly generated datasets with d = 2, s = 2 (left) and d = 1, s = 3 (right). In the left plot the cluster centers are positioned around the real centers (-2,2) and (2,2). In the right plot, the cluster centers are further away from the origin than the real centers (-1.42,1.42) and (1.42,1.42) because of the influence of the center at the origin.
the problem easier for CIC by moving some points farther from the second class. For the global classifier the situation is reversed, since a larger s moves more points of class 1 above the boundary between the classes (somewhere between class 2 and cluster 3). However, as d increases, the overall quality of the CIC results also increases, and having a larger s makes the resulting clusters non-convex and more difficult to separate from class 2. The performance of the global classifier, on the other hand, decreases as the classes become less linearly separable. In all cases CIC performs better than the global classifier, as one would expect. It is also intuitively clear that CIC does so well because it is capable of finding (at least approximately) the true structure of class 1. Figure 3.3 shows the distribution of the centers across 10 experiments for d = 2, s = 2 and d = 1, s = 3. It is interesting to compare the results obtained with CIC, k = 3, to those obtained with CIC using k = 2, also given in Table 3.8. For d = 0.5 and s = 1, 2, the results with k = 2 are better, since the 3 components of class 1 are difficult to separate, and using two clusters better describes the data. As d increases, the three components start to separate. The results improve somewhat, but much more slowly than with k = 3, though they are still better than the global results. The results with k = 4 are better than with k = 2 and global, but are comparable to those with k = 3. This suggests that having a somewhat greater number of clusters than the number of intrinsic components does not hurt the performance (provided that there is a sufficient amount of data). In fact, when s is large, the k = 4 results are best on average, though the difference is not significant. Having fewer clusters than intrinsic components, however, may lead to poor results.
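A minimal sketch of how such a dataset could be generated in Python/NumPy is given below; the function name is illustrative, and the covariance forms are the ones reconstructed above, so they should be read as an assumption rather than the original generator:

import numpy as np

def make_synthetic(d, s, n_comp=100, n_class2=300, seed=None):
    """Class 1: three Gaussian components; Class 2: a single Gaussian at (0, 1)."""
    rng = np.random.default_rng(seed)
    sigma11 = 0.5 * np.array([[1 + s**2, 1 - s**2], [1 - s**2, 1 + s**2]])
    sigma12 = 0.5 * np.array([[1 + s**2, s**2 - 1], [s**2 - 1, 1 + s**2]])
    x1 = np.vstack([
        rng.multivariate_normal([-d, d], sigma11, n_comp),
        rng.multivariate_normal([d, d], sigma12, n_comp),
        rng.multivariate_normal([0, 0], np.eye(2), n_comp),   # Sigma_13 = I
    ])
    x2 = rng.multivariate_normal([0, 1], np.eye(2), n_class2)  # class 2
    X = np.vstack([x1, x2])
    y = np.concatenate([np.ones(3 * n_comp, dtype=int), np.zeros(n_class2, dtype=int)])
    return X, y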
Figure 3.4: Plots of the simulated data for d = 1, 1.5, 2.
3.8 Empirical Evaluation of the Error-Based Class Aggregation
In Error-Based Class Aggregation (EBCA), a first-level classifier is used to produce a confusion matrix between classes. This matrix is converted into symmetric form and analyzed using one of the two algorithms described in Chapter 2 to produce a partition of the set of classes into metaclasses. Then second-level classifiers are trained to distinguish between the classes falling into the same metaclass. At classification time, the first-level classifier is used to assign a point to a metaclass (based on the label assigned to it), and then a classifier trained on this metaclass produces the final label. Our approach differs from that of [45], which trained a new first-level classifier to distinguish between metaclasses rather than reusing the one that was already built. Here we compare the approach of [45] with ours, using both the KKM (Kernel K-Means clustering) and RBC (Representative-Based Clustering) partitioning approaches, and varying the number of metaclasses m = 1, 2, 3, 5. In order to obtain the confusion matrix A we used 5-fold cross-validation on the training set.
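A minimal sketch of this construction is given below (scikit-learn; spectral clustering stands in for the KKM/RBC partitioning of Chapter 2, and a linear SVM stands in for the first-level classifier, so both are assumptions rather than the exact methods used):

import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
from sklearn.cluster import SpectralClustering
from sklearn.svm import LinearSVC

def metaclasses_from_confusion(X, y, m):
    """Estimate the confusion matrix A by 5-fold CV, symmetrize it, and group
    the classes into m metaclasses by clustering the confusion affinities."""
    y_pred = cross_val_predict(LinearSVC(), X, y, cv=5)   # first-level confusion estimate
    A = confusion_matrix(y, y_pred).astype(float)
    S = A + A.T                                           # symmetric confusion between classes
    np.fill_diagonal(S, 0.0)
    labels = SpectralClustering(n_clusters=m, affinity="precomputed",
                                random_state=0).fit_predict(S + 1e-6)
    return labels                                         # metaclass label for each class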
3.8.1 Experiment on Synthetic Data
Before proceeding to the analysis of results on the benchmark data, we discuss a simple experiment on synthetic data illustrating the advantage of the approach we propose. For this experiment data was generated from 4 normal distributions, each corresponding to a class. The means are: µA = (0, 2), µB = (2, 0), µC = (d, 2 + d) and µD = (2 + d, d). The covariance matrices are:
\Sigma_A = \Sigma_C = \begin{pmatrix} 1 & 0 \\ 0 & s^2 \end{pmatrix}, \qquad \Sigma_B = \Sigma_D = \begin{pmatrix} s^2 & 0 \\ 0 & 1 \end{pmatrix}.
d        0.5     1.0     1.5     2.0     2.5     3.0
Global   79.95   93.23   97.32   98.00   97.78   98.08
RUFL     79.92   93.25   97.32   98.00   97.78   98.08
RTFL     79.83   90.85   91.70   96.20   97.28   97.97
Table 3.9: Average Results of EBCA, with KKM and m = 2.

Figure 3.4 shows plots of the data for d = 1, 1.5 and 2. The experiments, with the number of metaclasses m = 2, were repeated 10 times, using SVM. The average results of using the global classifier, and of EBCA with the RUFL (reuse first-level classifier) and RTFL (retrain first-level classifier) approaches, for different values of d, are given in Table 3.9. The results are the same with KKM and RBC. For d = 0.5, 1 there are errors between all classes. The metaclasses are {A, C} and {B, D}. For d ≥ 1.5, the metaclasses in almost all cases are {A, B} and {C, D}. When d = 0.5, all three approaches perform comparably. However, for d = 1, 1.5, 2 global and RUFL perform better than RTFL. The reason is that RTFL attempts to train a new classifier to separate the metaclasses, but they cannot be well separated for these values of d. When d = 2.5, 3, the classes C, D move sufficiently far from A and B and RTFL becomes comparable to the global classifier and the RUFL approach. This experiment also suggests that the RUFL approach is often going to give results similar to the global classifier, since that is how it will make first-level predictions. The difference in predictions will come from the second-level classifier. This is not the case for RTFL, which could make very different predictions from the global classifier and tended to perform worse.
3.8.2 Experiment on the Benchmark Data
The graphs of accuracy for each benchmark problem using SVM (left) and BMR (right) are presented in Figure 3.5. Several things can be observed from these graphs. One is that reusing the first-level classifier (RUFL) usually leads to better performance than retraining it (RTFL). The results on the Image dataset with m = 5 turn out to be an exception. Also, on the Vowel dataset, which is somewhat larger than the Image dataset but smaller than the other two, RTFL does better than RUFL with SVM for m = 2, 3 and with BMR for m = 5. This is likely because the small size of the training data makes the original confusion matrix a poor estimator of the actual confusion between the classes. On the Pendigit and
Satimage datasets RUFL outperforms RTFL in almost all cases, except for Satimage with m = 5. These observations once again highlight the importance of using a value of k that results in sufficiently large clusters. Altogether, out of 24 experiments with SVM (4 datasets, 2 partition methods and m = 2, 3, 5), RUFL does better than RTFL in 14 cases, while performing worse in 6. When using BMR, RUFL is better in 13 cases and worse in 7. Note also that the RTFL results can be much worse than the RUFL results, for example for Pendigit with SVM, while the opposite does not happen. In fact, in all the EBCA experiments, the RUFL results are never significantly different from those of a global classifier (both with SVM and with BMR), while the RTFL results are significantly worse in a number of cases on the Pendigit and Satimage datasets. For these reasons we will not consider RTFL further. Another thing to note is that the results obtained by our method are mostly comparable to the global classifier with SVM. However, using EBCA with BMR leads to small improvements (not statistically significant) over the global classifier on the Pendigit and Vowel datasets, and gives comparable results on the Image and Satimage datasets. This is the only time we encountered qualitative differences in the behavior of SVM and BMR. We believe the reason for the varying effect of EBCA is the difference in the approaches to multi-class classification implemented by LIBSVM and BMR. As mentioned previously, LIBSVM implements the one-vs-one approach, while BMR essentially uses a one-vs-all approach since it builds a model for each class. The former is not very sensitive to the presence of additional classes (since each individual classifier is only concerned with a particular pair of classes). The latter, however, seems to benefit from reducing the number of classes, since this leads to smaller problems with a smaller possibility of mistake for each classifier. This reasoning is supported by the results we reported in [37], where using EBCA with a different kind of one-vs-all multiclass SVM (the Crammer and Singer method [18]) also resulted in some improvement over the global classifier. As suggested before, the metaclasses determined by the EBCA method may reveal the similarity between certain classes. Consider for example the Image dataset. Based on the class names only, we would expect images of foliage and of grass to be similar to each other. Also, we would expect cement to be distinguishable from the other classes. The results of EBCA conform to our expectations. Table 3.10 shows how the classes were assigned to metaclasses using the KKM and RBC methods with m = 5. Table 3.11 shows the same for the Vowel dataset.
Figure 3.5: Accuracy (y-axis) of EBCA with SVM and BMR, as a function of the number of metaclasses, m (x-axis).
Name         Class   Metaclass Labels
                     SVM             BMR
                     KKM     RBC     KKM     RBC
brick-face   1       1       1       1       3
sky          2       3       1       5       5
foliage      3       3       3       5       5
cement       4       4       4       3       4
window       5       5       5       2       2
path         6       2       2       4       1
grass        7       3       3       5       5
Table 3.10: Images Dataset: Metaclasses produced by EBCA, m = 5. Classes “cement” and ”path” are always alone in the metaclass, while “foliage” and “grass” are always in the same metaclass (with class “sky” in the same metaclass in 3 cases).
Word/Sound   Class   Metaclass Labels
                     SVM             BMR
                     KKM     RBC     KKM     RBC
heed         1       1       1       3       2
hid          2       3       1       3       2
head         3       3       1       3       2
had          4       2       4       5       1
hard         5       2       5       1       4
hud          6       2       4       1       1
hod          7       4       5       1       4
hoard        8       5       2       4       3
hood         9       3       2       4       3
who'd        10      3       3       4       3
heard        11      2       4       2       5
Table 3.11: Vowel Dataset: Metaclasses produced by EBCA, m = 5. Note how some classes group together despite different classifier and partition method (one group is classes 8-10, another is 1-3).
Figure 3.6: Accuracy (y-axis) of CIC+EBCA with SVM and BMR, as a function of the number of metaclasses, m (x-axis). Note that setting m = 1 is equivalent to the CIC scheme. Therefore the comparison of interest is with the m = 1 results.
3.9 Empirical Evaluation of the CIC+EBCA
The combination of EBCA and CIC was described in Section 2.3 of Chapter 2. We conducted experiments with k = 2, 3, 4 and m = 2, 3, 5, 10. (Note that setting k = 1 or m = 1 would result in the same classification as either the EBCA or the CIC scheme alone.) The tables with all results are given in the appendix. Here we describe only a few of these. As noted before, CIC provides an improvement over the global classifier. Therefore the question of interest here is whether the CIC+EBCA combination performs better than CIC alone. Based on our experiments we have to conclude that the answer is "no". Figure 3.6 shows results on the Pendigit and Vowel datasets for k = 4. Results on other datasets and with k = 2, 3 are similar: the combination performs worse than CIC (m = 1) alone. Another observation that can be made on the basis of these plots is that RBC tends to give better results than KKM for m ≥ 3 (i.e. when there are more metaclasses).
3.10 Empirical Evaluation of the HGC Method
In the Hierarchy with Global Classifier (HGC) approach we first train a global classifier on all the data. We then partition the data into a pre-specified number of clusters, k, and train a classifier on each cluster. It remains to specify where a local classifier is used and where a global one is used, and also how to assign a new point to a cluster. In the deterministic partitioning method of [34] a new point was assigned to the cluster with the closest center. We suggest training a classifier to distinguish between the clusters. There are also many possible ways to determine whether a local or a global classifier should be used. In Section 2.4.2 we described two ways, q1 and q2, of comparing global and local classifiers based on their estimated accuracies. A simple alternative is to always use a local classifier if there are sufficiently many points in the cluster. In our experiments we tried this approach, q0, picking a local classifier if the cluster has 4 or more points:

q0(i) = |Si| − 4.

While a cluster of four points appears to be small, should all of the points belong to the same class, the cluster membership would become a good indicator of the class membership. For larger datasets, all clusters have more than four points, and the above criterion effectively leads to using local classifiers everywhere. Thus, we experiment with three criteria for selecting a local classifier in a region Si:
• f1 = 1 iff q0(i) ≥ 0
• f2 = 1 iff q0(i) ≥ 0 and q1(i) > 0
• f3 = 1 iff q0(i) ≥ 0 and q2(i) > 0
where fi = 1 indicates that a local classifier rather than a global one will be used in the i-th cluster; q1 is the difference in performance of the local and global classifiers in the clusters, while q2 takes into account mistakes at the first level (i.e. not assigning a point to the cluster when it should be there, and vice versa). We first compare our approach for assigning a point to a cluster with that of [34]. The results of using these two methods, with all local classifiers (SVM and BMR) selected according to f1, are given in Figure 3.7. In most cases using a special first-level classifier is somewhat better than assigning points to clusters based on the distances to cluster centers, though the differences are small. Both methods, however, tend
to do better than the global classifier, except on the Image dataset. The results are significantly worse (at least at the 0.95 level) on the Image dataset with k = 5, 10 for both SVM and BMR. They are significantly better (at least at the 0.99 level) for all k on Pendigit (both SVM and BMR), and on Satimage with BMR. The differences are not significant on the Vowel dataset with either SVM or BMR for any k. Overall, a special first-level classifier is at least competitive with the nearest-center assignment, and is slightly better in a number of cases. In the following discussion we shall therefore focus on the former approach. We now turn to examining the effect of using the global classifier in the regions where we expect the local classifier to be inferior. The results are plotted in Figure 3.8. Almost all results on the Pendigit dataset, and many on the Satimage dataset, are significantly better (at the 0.99 level) than those of the global classifier. On the Image dataset, the results become significantly worse as k increases. The number of clusters has a strong effect on the results. For each dataset, a particular value of k gives better performance with almost all the methods: for Image it is k = 5, for Pendigit it is k = 10, for Satimage it is k = 20, for Vowel it is k = 5 with SVM and k = 5, 10 with BMR. Our experiments show that using the global classifier in certain regions can lead to a much better performance than using local models everywhere. For example, on the Satimage and Vowel datasets, with SVM and k = 10, methods f2 and f3 perform better than f1. In other cases, however, such as Satimage with k = 20 (both SVM and BMR), f1 performs better. Therefore we do not have a clear answer to the question of which is the best approach. This will have to be settled by additional experimentation. The intermediate results of HGC (the partitions into clusters) can be used to look for structure in the data, just as with the other methods we discussed. The information on which classes have points falling into the same clusters can be used in a way similar to EBCA to obtain metaclasses. Points of the same class that fall into different clusters can be interpreted as different sub-classes, as in the CIC approach. So unsupervised clustering can provide us with a similar type of information to that provided by the CIC and EBCA approaches. Additionally, we can use the first-level and local classifiers to determine which features distinguish the clusters, and the classes in the local models. In Chapter 6 we will illustrate such an approach by using a simplified HGC scheme to analyze lung cancer survival data.
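A minimal sketch of the HGC training and prediction logic under the simple f1 rule is given below (scikit-learn; logistic regression is used as a stand-in for the SVM and BMR classifiers of the experiments, and the guard against one-class clusters is an added assumption):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def train_hgc(X, y, k, min_points=4):
    """HGC with the f1 rule: a local model is used in every cluster
    with at least `min_points` training points."""
    global_clf = LogisticRegression(max_iter=1000).fit(X, y)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    first_level = LogisticRegression(max_iter=1000).fit(X, km.labels_)  # assigns new points to clusters
    local = {}
    for c in range(k):
        mask = km.labels_ == c
        if mask.sum() >= min_points and len(np.unique(y[mask])) > 1:
            local[c] = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
    return global_clf, first_level, local

def predict_hgc(X, global_clf, first_level, local):
    clusters = first_level.predict(X)
    out = global_clf.predict(X)           # default: global classifier
    for c, clf in local.items():
        sel = clusters == c
        if sel.any():
            out[sel] = clf.predict(X[sel])  # local model where one exists
    return out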
Figure 3.7: Accuracy (y-axis) of HGC with SVM (left) and BMR (right), as a function of the number of clusters. Comparison between the global classifier, and HGC with f1 and with the f1-center rule, where "-center" indicates that a point is assigned to a cluster based on the distance to the cluster centers, as in [34].
Figure 3.8: Accuracy (y-axis) of HGC with SVM (left) and BMR (right), as a function of the number of clusters. Comparison of the global classifier with 3 approaches for choosing between local and global models.
Figure 3.9: Class-wise accuracy (y-axis) of several schemes with SVM (left) and BMR (right).
3.11 A Brief Look at Class-wise Performance
Above we have discussed and compared all the results exclusively in terms of the accuracy on the test data. The accuracy alone, however, does not always provide the full picture. It gives no indication as to whether all points of one class are assigned to another class, or whether the errors are evenly spread out across the classes. In cases with a strong imbalance in the class sizes, high accuracy scores may be particularly misleading. Tables E.17-E.20 (Appendix E) show the accuracy with respect to each class for a subset of the methods discussed. Figure 3.9 shows these numbers for the Pendigit and Vowel datasets graphically. These results demonstrate that when improvement is achieved, it comes either from improving the accuracy over most of the classes, as on the Pendigit and Satimage datasets, or from strongly improving some classes at the cost of a slight deterioration of accuracy on the other classes, as on the Vowel dataset. (There is little if any improvement over the global classifier on the Image dataset.)
SVM
f     k=1           k=2           k=3           k=4           k=5
0.7   95.56 ± 0.53  95.53 ± 0.64  95.54 ± 0.59  95.40 ± 0.65  95.89 ± 0.56
0.5   95.28 ± 0.69  95.31 ± 0.57  95.32 ± 0.63  95.13 ± 0.60  95.33 ± 0.91
0.3   94.86 ± 0.71  94.56 ± 0.62  94.72 ± 0.54  94.27 ± 0.77  94.08 ± 0.66
0.1   91.98 ± 1.05  90.73 ± 1.13  90.55 ± 1.19  89.68 ± 0.84  89.85 ± 0.86
BMR
f     k=1           k=2           k=3           k=4           k=5
0.7   94.49 ± 0.78  94.59 ± 1.03  94.54 ± 0.98  94.24 ± 1.14  94.44 ± 0.71
0.5   94.40 ± 0.76  94.33 ± 0.83  93.72 ± 1.13  93.48 ± 1.10  93.91 ± 0.78
0.3   93.90 ± 0.61  93.67 ± 0.42  93.36 ± 0.69  92.82 ± 0.77  93.04 ± 0.80
0.1   91.59 ± 0.91  90.30 ± 1.61  89.78 ± 1.19  89.08 ± 1.23  89.20 ± 1.17
Table 3.12: Image Dataset, CIC Results: sample mean and standard deviation of accuracy (from 10 random splits, using fraction f of data for training).

Pendigit
SVM
f      k=1           k=2           k=3           k=4           k=5
0.7    98.22 ± 0.20  99.19 ± 0.08  99.36 ± 0.10  99.49 ± 0.09  99.45 ± 0.06
0.5    97.97 ± 0.10  99.07 ± 0.08  99.35 ± 0.07  99.39 ± 0.09  99.36 ± 0.09
0.3    97.68 ± 0.17  98.95 ± 0.12  99.14 ± 0.11  99.19 ± 0.11  99.12 ± 0.11
0.1    96.65 ± 0.26  98.14 ± 0.20  98.22 ± 0.28  98.27 ± 0.22  98.29 ± 0.24
0.03   94.04 ± 0.58  95.85 ± 0.68  95.73 ± 0.65  95.72 ± 0.57  95.67 ± 0.55
BMR
f      k=1           k=2           k=3           k=4           k=5
0.7    95.41 ± 0.25  98.18 ± 0.27  98.30 ± 0.22  98.20 ± 0.22  98.19 ± 0.27
0.5    95.16 ± 0.14  98.07 ± 0.24  98.30 ± 0.20  98.17 ± 0.28  98.06 ± 0.25
0.3    94.93 ± 0.20  97.66 ± 0.24  97.77 ± 0.22  97.73 ± 0.27  97.63 ± 0.36
0.1    93.60 ± 0.38  96.70 ± 0.28  96.77 ± 0.39  96.92 ± 0.36  96.94 ± 0.35
0.03   90.78 ± 0.97  94.18 ± 0.77  94.23 ± 0.73  94.14 ± 0.64  94.01 ± 0.53

Satimage
SVM
f      k=1           k=2           k=3           k=4           k=5
0.7    86.74 ± 0.62  86.61 ± 0.69  87.68 ± 0.61  88.61 ± 0.95  88.86 ± 0.73
0.5    86.78 ± 0.40  86.52 ± 0.50  87.63 ± 0.45  88.75 ± 0.31  88.87 ± 0.40
0.3    86.32 ± 0.37  86.22 ± 0.58  87.14 ± 0.53  88.08 ± 0.43  88.52 ± 0.39
0.1    85.17 ± 0.31  84.60 ± 0.53  85.46 ± 0.54  86.44 ± 0.58  86.59 ± 0.38
0.03   83.18 ± 0.53  82.85 ± 0.78  83.82 ± 0.77  83.70 ± 0.68  83.72 ± 1.19
BMR
f      k=1           k=2           k=3           k=4           k=5
0.7    85.48 ± 0.67  85.77 ± 0.88  86.61 ± 0.65  87.50 ± 0.79  87.66 ± 0.70
0.5    85.41 ± 0.43  85.75 ± 0.43  86.45 ± 0.42  87.48 ± 0.21  87.63 ± 0.31
0.3    85.20 ± 0.34  85.24 ± 0.35  85.86 ± 0.60  86.72 ± 0.45  87.23 ± 0.49
0.1    83.89 ± 0.61  83.53 ± 0.57  84.09 ± 0.73  84.61 ± 0.99  84.72 ± 1.13
0.03   81.53 ± 1.06  81.79 ± 1.05  81.89 ± 1.42  81.97 ± 1.38  81.64 ± 1.63

Table 3.13: Pendigit and Satimage Datasets, CIC Results: sample mean and standard deviation of accuracy (from 10 random splits, using fraction f of data for training).
3.12 Evaluation of Factors Affecting CIC and HGC Performance
The two approaches that performed best in the initial experiments were CIC and HGC. In this section we describe experiments that aimed to show that the CIC and HGC results described above were not artifacts of the partitions used, and to examine the role of the training set size (both in absolute terms and as a proportion of the dataset). Of the different HGC methods we focused on the simpler one, f1, that uses local models in all clusters with 4 or more points. In the initial experiments this approach frequently performed as well as the more complicated methods. For each dataset we repeated the following steps 10 times, for values of f = 0.03, 0.1, 0.3, 0.5, 0.7:
1. select a fraction f of the available data as a training set, maintaining class proportions (see the sketch below);
2. train the global classifier, CIC with k = 2, 3, 4, 5, and HGC with k = 5, 10, 20 and with f1 local model selection (i.e. local models used in all clusters with 4 or more points) on this subset;
3. evaluate the resulting classifiers on the remaining points.
We did not use f = 0.03 on the Image dataset because of its small size. We also did not conduct these experiments on the Vowel dataset, because it is small (it has only 90 points for each class, making experiments with small f and large f uninterpretable) and because the data has internal structure that would be difficult to maintain (there are 15 distinct sources of the data, i.e. speakers, and each should be kept completely either in the training or the test set). Note that the Image dataset (with 2310 points and 7 classes) has 330 points per class, the Pendigit dataset has approximately 1099 points per class, and the Satimage dataset (6435 points and 6 classes) has on average 1073 points per class, with the smallest class of 626 points. This means that the number of points per class is approximately the same for the Pendigit and Satimage datasets for the same values of f. For the Image dataset, however, with f = 0.3 the number of points per class is approximately the same as for Pendigit and Satimage with f = 0.1; and f = 0.1 is comparable to Pendigit and Satimage with f = 0.03.
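Step 1 can be implemented as a stratified split; a minimal sketch using scikit-learn (the function name is illustrative) is:

from sklearn.model_selection import train_test_split

def stratified_subsample(X, y, f, seed=0):
    """Select a fraction f of the data for training while maintaining class
    proportions; the remaining points are used for evaluation."""
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, train_size=f, stratify=y, random_state=seed)
    return (X_train, y_train), (X_rest, y_rest)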
3.12.1 CIC Results
The CIC results on the datasets, both with SVM and BMR, are given in Tables 3.12-3.13. CIC improves accuracy on the Satimage and Pendigit datasets when a fraction f = 0.3, 0.5, 0.7 of the data is used for training. The improvement is small on Satimage for f = 0.03, 0.1, though it is still large on the Pendigit dataset.
SVM                    With R1                                Distance to Centers
f     Global (k=1)     k=5         k=10        k=20           k=5         k=10        k=20
0.7   95.56±0.53       95.94±0.80  95.51±0.36  95.76±0.47     95.91±0.68  95.59±0.40  95.47±0.53
0.5   95.28±0.69       95.58±0.55  94.72±0.88  94.98±0.66     95.61±0.53  94.64±0.79  94.78±0.55
0.3   94.86±0.71       94.83±0.53  93.66±0.39  93.88±0.61     94.73±0.60  93.47±0.35  93.64±0.87
0.1   91.98±1.05       90.93±1.41  89.27±1.14  89.07±1.04     90.77±1.37  89.17±1.17  88.84±1.04
BMR
f     Global (k=1)     k=5         k=10        k=20           k=5         k=10        k=20
0.7   94.49±0.78       95.39±0.54  95.36±0.49  96.06±0.51     95.29±0.53  95.30±0.46  95.81±0.57
0.5   94.40±0.76       95.28±0.76  94.42±0.78  95.08±0.68     95.20±0.68  94.45±0.53  94.67±0.85
0.3   93.90±0.61       94.43±0.72  93.26±0.63  93.70±0.58     94.24±0.77  93.14±0.67  93.62±0.79
0.1   91.59±0.91       90.94±1.16  89.40±1.15  88.66±1.42     90.49±1.25  89.22±1.00  88.72±1.30
Table 3.14: Image Dataset, HGC Results (local models used in all clusters with more than 4 points): sample mean and standard deviation of accuracy (from 10 random splits, using fraction f of data for training).

The best results for both datasets are obtained with k = 5. The results on the Satimage dataset take a slight dip at k = 2 before improving, just as in the previous experiments. For the Image dataset, there is little effect with f = 0.3, 0.5, 0.7, and the performance clearly deteriorates when f = 0.1. The best results, both with SVM and BMR, are obtained with k = 2, 3 for f = 0.7, 0.5, and with the global classifier, k = 1, for smaller f. Another, not very surprising, observation is that with smaller f the results are less stable, as indicated by increases in the standard deviation of the accuracy. This effect becomes more pronounced on the smaller datasets. These results confirm the observations made previously. CIC does better than the global classifier on two datasets and does comparably on another one (Image), as long as the training set is large. Since the Image dataset with f = 0.1, 0.3 is comparable (in terms of points per class) to Pendigit and Satimage with f = 0.03, 0.1, but CIC performs worse than the global classifier on the Image dataset and not on the other datasets for these values of f, the training set size is not the issue. It would appear that for the Image dataset the intrinsic number of components in each class is on average close to 1 (and so CIC gives no improvement), while for the other two datasets it is larger.
SVM                    With R1                                Distance to Centers
f      Global (k=1)    k=5         k=10        k=20           k=5         k=10        k=20
0.7    98.22±0.20      99.12±0.13  99.16±0.11  99.25±0.08     99.12±0.14  99.13±0.16  99.13±0.17
0.5    97.97±0.10      98.99±0.08  99.02±0.11  99.07±0.10     98.96±0.07  98.95±0.10  98.87±0.12
0.3    97.68±0.17      98.76±0.07  98.74±0.13  98.87±0.12     98.69±0.10  98.56±0.14  98.50±0.14
0.1    96.65±0.26      97.63±0.35  97.61±0.26  97.82±0.12     97.52±0.39  97.20±0.22  97.03±0.26
0.03   94.04±0.58      95.08±0.54  94.85±0.71  94.90±0.69     94.59±0.65  94.06±1.10  93.66±0.50
BMR
f      Global (k=1)    k=5         k=10        k=20           k=5         k=10        k=20
0.7    95.41±0.25      98.60±0.13  98.87±0.19  98.91±0.12     98.62±0.14  98.89±0.19  98.97±0.12
0.5    95.16±0.14      98.37±0.15  98.75±0.13  98.81±0.10     98.41±0.15  98.74±0.15  98.80±0.13
0.3    94.93±0.20      98.05±0.15  98.32±0.20  98.39±0.21     98.04±0.16  98.28±0.21  98.39±0.18
0.1    93.51±0.37      96.84±0.21  97.07±0.21  97.09±0.22     96.75±0.27  96.86±0.23  96.84±0.42
0.03   90.78±0.97      94.43±0.57  94.07±0.93  94.37±0.59     93.99±0.65  93.76±1.03  93.90±0.61
Table 3.15: Pendigit Dataset, HGC Results (local models used in all clusters with more than 4 points): sample mean and standard deviation of accuracy (from 10 random splits, using fraction f of data for training).
SVM                    With R1                                Distance to Centers
f      Global (k=1)    k=5         k=10        k=20           k=5         k=10        k=20
0.7    86.74±0.62      86.87±0.71  87.67±0.88  88.53±0.80     86.93±0.69  87.64±0.82  88.34±0.84
0.5    86.78±0.40      87.13±0.38  87.78±0.57  88.66±0.54     87.18±0.41  87.74±0.53  88.70±0.52
0.3    86.32±0.37      86.67±0.52  87.20±0.57  87.78±0.50     86.63±0.49  87.16±0.60  87.61±0.62
0.1    85.17±0.31      85.20±0.57  85.73±0.69  85.98±0.64     85.07±0.63  85.71±0.54  85.92±0.66
0.03   83.18±0.53      83.45±0.81  83.77±0.48  83.70±0.84     83.49±0.71  83.42±0.77  83.61±0.70
BMR
f      Global (k=1)    k=5         k=10        k=20           k=5         k=10        k=20
0.7    85.48±0.67      86.96±0.74  87.29±0.89  88.13±0.66     86.90±0.70  87.29±0.80  88.56±0.52
0.5    85.41±0.43      87.02±0.55  87.49±0.59  88.36±0.47     86.96±0.52  87.64±0.49  88.59±0.50
0.3    85.20±0.34      86.39±0.43  86.92±0.54  87.53±0.52     86.35±0.41  87.01±0.40  87.94±0.50
0.1    83.89±0.61      84.85±0.54  85.22±0.79  85.48±0.97     84.61±0.85  85.38±0.70  85.92±0.81
0.03   81.53±1.06      82.34±1.57  83.20±1.28  82.14±1.89     82.12±1.69  82.79±1.13  83.34±1.19
Table 3.16: Satimage Dataset, HGC Results (local models used in all clusters with more than 4 points): sample mean and standard deviation of accuracy (from 10 random splits, using fraction f of data for training).
3.12.2 HGC Results
The HGC results on these datasets, both with SVM and BMR, are given in Tables 3.14-3.16. Using the first-level classifier R1 leads to somewhat better results than using the distance to centers in almost all cases on all datasets. (Results on Satimage with BMR are an exception.) While the differences are usually small (less than 1% accuracy), they appear consistently for different values of k and f. On the Image dataset, the best results for f = 0.7, 0.5, 0.3 are with k = 5, and with k = 1 (global classifier) for f = 0.1. On the other datasets, the best results are with k = 20 for larger values of f and with k = 5 for f = 0.03. The HGC approach results in improvements over the global classifier on the Pendigit and Satimage datasets for all values of f and k. On the Image dataset, the results of HGC are slightly better than those of the global classifier with large training sets (f = 0.7, 0.5) and become worse for f = 0.3, 0.1. These observations suggest that the best value of k (the one leading to the best results) depends partly on the amount of available training data. In other words, with less data smaller values of k are appropriate.
3.13 Conclusions
We have shown that Clustering Inside Classes (CIC) is a good way of improving classification accuracy when the intrinsic number of components in the classes is greater than one. Using Error-Based Class Aggregation to build a two-level classifier improves accuracy in some cases. However, the improvement was small and did not appear consistently. When using EBCA it is necessary to reuse the first-level classifier from which the confusion matrix was obtained, rather than train a new classifier to separate the metaclasses. We experimentally demonstrated that the reusing approach leads to better performance. Our attempt to combine CIC with EBCA did not meet with success. While the combination in many cases performs better than the global classifier, it usually does worse than CIC alone with the same parameter k. Therefore this approach cannot be recommended. The HGC approach outperforms the global classifier on all four datasets (with different values of k and local classifier selection methods). The value of k does have a strong effect on the absolute performance, and seems to depend on the amount of training data available. This method is well suited for parallelization, since after the clustering is done the training of the local classifiers is independent, and only one local classifier is required at classification time. Trying to determine where a global classifier
should be used instead of a local one can be beneficial, but the approach also works well if local classifiers are used in all clusters. Using a classifier on the first level of HGC does provide a slight improvement over determining cluster membership based on the distance to the center.
3.14 Further Work
The hyperparameter values selected for a particular dataset were used for all classifiers on that dataset. In other words, we did not tune hyperparameters for the local methods because of the large number of additional computations that would be required. Looking for heuristic ways of selecting appropriate hyperparameters is one direction for further improving the performance of our methods. As previously mentioned, the choice of the parameter k has a strong effect on the results of both the CIC and HGC approaches. Thus, one direction for further work is to experiment with clustering methods capable of automatically choosing the appropriate number of clusters. Similar work can be done for automatically determining the parameter m (the number of metaclasses or metaclusters) in the EBCA scheme. While using the KKM and RBC partitioning methods for the analysis of confusion tables (the EBCA scheme) did not result in improvements over the global classifier, these methods are of interest for clustering data based on a matrix of similarities alone. Our results suggest that using some method to remove poor local classifiers is often advantageous, as expected. However, it is not clear whether there is a single method for identifying poor local classifiers that is best in most situations. Rather than comparing accuracies of local and global classifiers, statistical significance tests should be used to determine if the local classifier improves on the global one.
Chapter 4 Classification Methods in Epidemiology
The purpose of a classification method is usually to construct a rule for assigning a point to a particular class. This is particularly important for automating various tasks and processing large amounts of data. However, in a number of applications, and in particular in epidemiology, the ability to make a prediction is less important than understanding the phenomenon. Being able to determine and understand the risk factors for a disease may be more important than predicting its occurrence. For such applications the goal of machine learning, and of the related field of data mining, is to generate hypotheses about the importance of various factors. The number of possible hypotheses extracted from data is frequently very large. Many such hypotheses are trivial or mistaken. The ones that appear interesting are still subject to real-world validation before their usefulness is confirmed. The goal of computer scientists should be to provide a small number of high-quality hypotheses for testing. We propose using our Hierarchy with Global Classifier (HGC) methods for finding local models that may be more appropriate than global ones, while simultaneously finding clusters of related instances. The comparison between selected local models and the global model, and identifying the characteristics of the clusters, may allow epidemiologists to obtain new insights and directions for further work. Such information could be used to conduct risk factor analysis for the local model exclusively, providing improved understanding of the relations between the factors and the disease. In this chapter we first describe how models are constructed and analyzed in the epidemiological literature (Section 4.1). We also discuss some measures of feature significance (Section 4.2). Section 4.3 proposes a simplified hierarchical scheme that can be used to conduct a more sophisticated analysis, and briefly mentions related data-mining approaches. We illustrate our approach in the subsequent chapters by analyzing a dataset describing lung cancer survival in the USA in the years 1988-2001.[1]
[1] We are very grateful to Dona Schneider for her help in data preparation and interpretation.
4.1 Model Design and Analysis in Epidemiology
In epidemiology, a standard way of constructing models is to conduct univariate analysis of independent variables, followed by manual selection and refinement of the feature set by fitting a multivariate logistic model to the data [28]. The model constructed on the final set of variables is seen as the final model. The main measure driving the model selection process is therefore a goodness of fit criterion on a given dataset. While this measure indicates how well the model fits the data, it has little relation to the predictive accuracy of the model and therefore may not generalize beyond the given dataset. However, from the point of view of prevention, it is extremely important to identify the factors responsible for the presence or absence of a disease or a symptom on data that has not yet been observed. Model validation (estimation of predictive performance) is not frequently done in human epidemiology and medicine, as noted for example in [100, 117], despite it being recognized as an important part of establishing the usefulness of a model. One reason for this is that epidemiologists frequently do not try to predict occurrence of a disease. This is something that can be affected by many random factors outside of human control and for which good predictors may not be known. Rather, epidemiologists try to determine the risk factors: properties that contribute to the spread or severity of the disease. For that purpose they build descriptive (rather than predictive) models of the data and study the significance of individual factors. However, cross-validation is important for obtaining reliable estimates of the coefficient weights, just as it is for estimating the predictive accuracy of the model.
4.2 Feature Significance Estimation
Machine learning classification is the task of estimating a dependence between a set of available input features and the output feature (the label). This dependence is derived from the training data - a set of examples of the relation between values of the set of features and the label of the corresponding class. The quality of the classifier (and thus of the estimated dependence relation) can be evaluated based on the quality of its predictions. The input features define the classification space in which the classifier will be represented as a decision surface which splits the space into regions associated with different classes. The classifier can be constructed in the full feature space. Alternatively, the classifier may be constructed in a subspace, either implicitly (when the separating surface does not involve some features or has lower dimensionality) or explicitly, when some of the features have been removed from consideration prior to constructing the
model. It is frequently the case that a "better" classifier can be built using only a subset of the features (e.g. [47]). The problem of finding the best classification subspace is known as "feature selection". Many different approaches to this problem are described in the literature ([46] provides an overview). A somewhat different, though related, problem is estimating "feature significance" (or feature importance) for a specific classifier. The purpose is not to reduce the number of features or to construct new ones, but to assess how the values of the features in the model affect the predictions. There is a large literature on the subject (see discussions in [107, 63]). Feature selection in general provides a rather coarse measure of significance: features are either retained or removed. Some feature selection methods involve computing intermediate feature weights, which can be interpreted as measures of feature significance. When the variable of interest takes real values, feature significance is often estimated by the partial derivative of the prediction with respect to the feature. This is the standard meaning in automation control theory [30] and mathematical economics [99, 103]. Such a view emphasizes the effect of a small change of the feature on the model prediction when all other features are fixed. This means that such a significance coefficient, generally speaking, is a function of the point at which the partial derivative is measured. If one wants to estimate an average significance over all regions of consideration, one can, for example, integrate the magnitude of the derivative over the region. When the region boundary is not given analytically, but is represented by a set of training samples, the integration can be approximated by a summation of the derivative values across the whole training data. It is easy to adapt feature significance based on partial derivatives to our case, when the prediction takes values in a small set {1, . . . , K}. The significance would then be indicative of the effect on classification accuracy. Indeed, let us consider a decision function f(x_1, . . . , x_d), defined on the feature (classification) space X, x = (x_1, . . . , x_d) ∈ X, and a classification function l_f(x) ∈ {1, . . . , K} that assigns a point x to a class based on the values of f(x). If ∆x_i is a chosen differential approximation for the feature x_i, then the corresponding differential ∆f can be defined by the formula:

\Delta f(x_1, \dots, x_d \mid \Delta x_i) = \begin{cases} 0, & \text{if } l_f(x_1, \dots, x_i, \dots, x_d) = l_f(x_1, \dots, x_i + \Delta x_i, \dots, x_d) \\ 1, & \text{otherwise} \end{cases}
If W is the set of training samples, we can define the average significance as

C_\Delta^i = \frac{1}{|W|} \sum_{p=1}^{|W|} \Delta f(x_1^p, \dots, x_d^p \mid \Delta x_i) \qquad (4.1)
Alternatively, because in classifier design it is customary to use not only training data but also a particular test data, the coefficient C_Δ^i can be calculated based on the test data.
Note that if the feature space is Boolean, then ∆x_i = |x_i − x_i'| represents the substitution

x_i = 1 \to x_i' = 0 \quad \text{and} \quad x_i = 0 \to x_i' = 1, \qquad (4.2)

making the coefficient C_Δ^i straightforward to compute.
It is possible to derive a simpler form of the significance measure (4.1) for the case of linear discriminant functions, i.e. functions of the form

f(x) = \max_{l \in L} \left\{ \sum_{i=1}^{d} w_i^l x_i + w_0^l \right\}, \qquad (4.3)

where M_l = \sum_{i=1}^{d} w_i^l x_i + w_0^l is a "membership" function which shows how similar the point x is to the class l.

The simple significance coefficient is obtained directly from the weights w_i^l as

C^i = \sum_{l=1}^{k} |w_i^l|. \qquad (4.4)
In the case of two classes, this significance coefficient is exactly the absolute value of the feature coefficient: C^i = |w_i|. The main difference between the two feature significance measures discussed above, (4.1) and (4.4), is that in this case the latter can be seen as measuring the effect of a perturbation of feature i on the classification score:

\delta f_i = |f(x_1, \dots, x_i, \dots, x_d) - f(x_1, \dots, x_i + \Delta x_i, \dots, x_d)|, \qquad (4.5)

while the former measures a change in the accuracy of the prediction:

\Delta f_i = |l_f(x_1, \dots, x_i, \dots, x_d) - l_f(x_1, \dots, x_i + \Delta x_i, \dots, x_d)|. \qquad (4.6)
Such a measure of feature significance is related to feature selection methods that focus on the accuracy of prediction, rather than on a feature's influence on the outcome.
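For binary features, the coefficient C_Δ^i can be computed by flipping each feature and counting how often the predicted label changes; a minimal sketch (NumPy, assuming a fitted classifier with a scikit-learn-style predict method) is:

import numpy as np

def flip_significance(clf, X):
    """Estimate C_Delta^i of Eq. (4.1) for 0/1 features: the fraction of samples
    whose predicted label changes when feature i is flipped."""
    base = clf.predict(X)
    scores = np.empty(X.shape[1])
    for i in range(X.shape[1]):
        X_flipped = X.copy()
        X_flipped[:, i] = 1 - X_flipped[:, i]     # the substitution of Eq. (4.2)
        scores[i] = np.mean(clf.predict(X_flipped) != base)
    return scores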
Figure 4.1: Global Classifier approach and SCH
4.3 Our Approach: Interesting Subsets
An intuitive understanding of what constitutes an "interesting" subset of data (with respect to some property) is that it is a subset where a general model of the data fails to capture the target relation, or does so in an overly complicated way, or where an alternative explanation is possible. Such subsets are of particular interest when they have (relatively) simple descriptions. We adapt our hierarchical approach to finding such regions. Recall that in our hierarchical schemes (in Chapter 3) we partition the data into clusters and then use classifiers trained locally on these clusters. Here we are going to use a similar approach, except that, for simplicity of presentation and interpretation, we will only consider one cluster at a time. We will partition the data into clusters and, for each cluster, will evaluate the performance of a local classifier on the cluster, comparing it to the performance of the global classifier on the same points. If using a local model leads to better performance than that of the global classifier, or if the local model is different from the global one (and its performance is comparable), we argue that this cluster is "interesting". We call this scheme a Single Cluster Hierarchy approach (or SCH). The pseudocode for SCH is given by Algorithm 7. Figure 4.1 may be helpful in understanding this scheme.
70
Algorithm 7 SCH Pseudocode
Require: A set W, cluster Si ⊆ W.
{SCH, training stage}
1: Train a global classifier R0 to distinguish between the classes of W.
2: Train a classifier R1i to separate Si (class 1) and W \ Si (class 0).
3: Train a classifier R2i on the points in Si.
4: Return R0, R1i and R2i.
Require: Classifiers R0, R1i and R2i; a point x.
{SCH, test stage}
1: Let c = R1i(x).
2: if c = 1 then
3:   Return R2i(x).
4: else
5:   Return R0(x).
6: end if
Algorithm 8 Algorithm for Finding Interesting Clusters
Require: A training set W, a validation set W', the number of clusters k > 0.
1: Train a global classifier R0.
2: Evaluate R0 on W'.
3: Compute α(R0|W, Lh), h = 1, . . . , K.
4: Partition W into k clusters using the K-Means procedure.
5: for i = 1, . . . , k do
6:   Construct SCH using cluster Si.
7:   Evaluate SCH on W'.
8:   Cluster Si is interesting if SCH has performance not worse than R0 and the local model R2i is different from R0.
9: end for
on what features distinguish these regions from the rest of the data, and in what way the classification model becomes different in these regions. Cross-validation estimates of the coefficients can play an important role, by indicating the stability of the models constructed.
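A minimal sketch of Algorithms 7-8 in Python/scikit-learn is given below; logistic regression stands in for the classifiers actually used, and the check that the local model differs from the global one is omitted for brevity, so this is an illustration rather than the exact procedure:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def find_interesting_clusters(X, y, X_val, y_val, k):
    """For each K-Means cluster, build an SCH (Algorithm 7) and flag the cluster
    as interesting if the SCH is at least as accurate as the global model R0."""
    r0 = LogisticRegression(max_iter=1000).fit(X, y)
    base_acc = accuracy_score(y_val, r0.predict(X_val))
    clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    interesting = []
    for i in range(k):
        in_cluster = clusters == i
        if len(np.unique(y[in_cluster])) < 2:
            continue                                   # no local classifier can be trained
        r1 = LogisticRegression(max_iter=1000).fit(X, in_cluster.astype(int))     # Si vs the rest
        r2 = LogisticRegression(max_iter=1000).fit(X[in_cluster], y[in_cluster])  # local model
        # SCH test stage: route points predicted to lie in Si to the local model.
        use_local = r1.predict(X_val) == 1
        pred = r0.predict(X_val)
        if use_local.any():
            pred[use_local] = r2.predict(X_val[use_local])
        if accuracy_score(y_val, pred) >= base_acc:
            interesting.append((i, r1, r2))
    return r0, interesting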
4.3.1 Related Data Mining Approaches
The use of machine learning methods for interpreting the data or finding patterns in the data has received a lot of attention in recent years, under the name of data mining ([121] provides a good introduction). An approach that is closely related to our proposal was discussed in [120]. It consists of three stages: (i) using an unsupervised clustering algorithm to partition the data and obtain cluster labels for each point, (ii) using supervised learning (C4.5 or decision rules) to obtain descriptions of the clusters, and finally (iii) examining each cluster and its description with the aim of finding those of importance for a particular task. The evaluation can be seen as feedback from domain experts, or some measure of the quality of the cluster description. As can be seen from Section 4.3, there are strong similarities between our approach and that of [120]. However, the latter was intended to be applied to unlabeled data. We, on the other hand, make use of the labels in stage (iii) by evaluating clusters based on the difference between the quality of the global classifier and of the local classifier on the particular cluster. In some applications local models are used for regression prediction. For example, [125] examines the use of clustering and local models for predicting auto insurance claim costs. K-Means was used to partition insurance policy holders into clusters of size less than 20,000, based on 13 variables. (The clusters that were larger than 20,000 were repeatedly partitioned until they satisfied the size requirement.) The resulting partition had 30 clusters. The cost of insurance claims in each cluster was estimated with the average cost of the cases in that cluster. This approach was compared against a heuristic method of grid partitioning with 3 variables, each split into 5 classes. Using the clustering approach gave more accurate predictions while using fewer clusters. Clearly, linear predictive models can be considered in each cluster.
4.3.2 Analyzing Final Classifiers
As mentioned previously, the main goal of building a model in epidemiology is to obtain insight into feature interactions and the effect of the features on the outcome. Here we focus on linear models for a two-class dataset. Some measures appropriate for such data have been discussed in Section 4.2. One is the absolute value of the coefficient corresponding to a
particular feature: |w_i|. It is convenient to work with the relative coefficient weight, given by:

r_j = \frac{|w_j|}{\sum_h |w_h|}. \qquad (4.7)

Intuitively, this indicates how much effect, compared to the other variables, a particular variable has on the classification prediction. Another measure, C_Δ^i, describing the effect of perturbing the feature on the accuracy, was given by (4.1). Friedman and Popescu [39] suggested the following measure of variable relevance:

I_j = |w_j| \sigma_j. \qquad (4.8)
Here the feature weight is scaled by the standard deviation of the variable. One way to interpret this measure is that, between two features with the same weight, the one with the larger standard deviation is more useful since it would tend to be more informative. Another interpretation comes from a regression analogy: in a linear regression model the value I_j would correspond to the coefficient of variable j if it was normalized as a preprocessing step. In other words, if all variables are scaled to have the same variance, then their coefficients would be ordered in the same way as the values I_j for the respective j. Another approach is to examine the effect of changing variable values in the test set, as described in Section 4.2. This involves computing the coefficient C_Δ^i. If a variable is not significant, then changing its value will not have an effect on the prediction, but if it is, then the prediction should change. The difficulty of this approach lies in its computational complexity: for a dataset with d features, a classifier has to be run d times. Thus, for a global classifier we will discuss the measures r_j, I_j and C_Δ^i. However, we will focus on r_j as the computationally simpler and more traditional one when discussing local models.
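A minimal sketch of computing r_j and I_j from the weight vector of a fitted linear model (NumPy; the function name is illustrative):

import numpy as np

def coefficient_measures(w, X):
    """Relative coefficient weights r_j (Eq. 4.7) and the Friedman-Popescu
    relevance I_j = |w_j| * sigma_j (Eq. 4.8) for a linear two-class model."""
    w = np.asarray(w, dtype=float)
    r = np.abs(w) / np.abs(w).sum()
    I = np.abs(w) * np.asarray(X).std(axis=0)
    return r, I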
4.4 Summary
We propose an approach, Single Cluster Hierarchy, for finding and interpreting interesting patterns in the data. In the next two Chapters, we will describe an application of this approach to the dataset describing lung cancer survival in the USA.
Chapter 5 Constructing a Global Model
The first task is to prepare the data. This requires multi-stage preprocessing, a process that is not well formalized and is frequently application-specific. We describe it below in Section 5.1. Having selected the features we turn to the task of building the model. There are several aspects to this task. One is the choice of representation - as we shall see below even in the simple case of a fixed set of binary features there are different ways of representing the features numerically. Second is the possibility of expanding the set of features with functions of several features, in order to better represent possible interactions. Additionally, closely interwoven with these choices is the choice of the method for constructing the model. In general, different methods have different learning biases and may perform better with different representations. Thus we have to consider the choices of representation and of the method together. We shall evaluate several different representations with BBR software (an implementation of penalized logistic regression). We will also experiment with the penalization parameter of BBR. We then compare BBR results with those of linear SVM. The purpose of these experiments is to make sure that we are not missing a simple way of producing a better model.
5.1 SEER Data Preparation
5.1.1 SEER Data Format
The Surveillance, Epidemiology and End Results (SEER) Program of the National Cancer Institute is an authoritative source of information about cancer incidence and survival in the United States. In our work we used SEER data for the years 1973-2001, released in April 2004. The data were collected from 12 population-based cancer registries. All the following text, in particular when describing the fields and data format, refers to that version. Data are stored in SEER in rows of fixed width (166 characters), containing 77 fields of fixed length. Each patient is uniquely identified by the combination of “SEER registry” and “case number” fields. A
patient may have several different entries. These are distinguished by "record numbers." Information for each patient can be partitioned into two sets: demographic and medical. Demographic information includes fields such as age, sex, race/ethnicity, place of birth, etc. Medical information includes location of the disease, its type (morphology, histology) and extent, and also the types of treatment (radiation, surgery) and cause of death (COD) where applicable. The SEER database has evolved over time and therefore certain kinds of information available in recent years are not present in older records. The year 1988 seems particularly significant, with the introduction of several new fields (such as extent of the disease) and of detailed schemes for several other fields.
5.1.2 Feature Conversion
The fields in a SEER record can be grouped into 3 types: categorical, ordinal and numeric. A categorical field, such as race, with m possible values can be represented by m binary variables (or fewer, depending on which categories are of interest) where xi has value 1 only if the i-th category occurred in the field. Many of the fields in a SEER record are ordinal, i.e. the values in these fields can be ordered but they don’t have a defined distance function. One example is field 18, ”Grade”. Possible values are: Grade I, Grade II, Grade III, Grade IV and some others. While there is a clear ordering to the possible values, it is not known how much worse Grade II is than Grade I, and how this relates to the difference between Grade III and Grade II. Consider an ordinal variable v taking values {1, . . . , m}. It can be represented by an m-tuple of binary variables vi ,i = 1, . . . , m in the following way: vi = 1 ⇐⇒ v ≥ i
(5.1)
Such representations allow a model, such as a logistic regression model, to automatically determine the difference (in the effect on the outcome) between the levels by assigning different coefficients to different variables vi.
Certain fields have integer values, such as age (in years). Since the exact values are not likely to be significant (i.e. it probably does not matter whether a person is 31 or 35), these can be considered ordinal variables with values in m specific intervals, and can then be converted into m binary variables using equation (5.1).[1]
[1] There exist many approaches for analyzing ordinal variables ([6, 1, 64, 113]). These approaches vary in the assumptions made about the data, and in the ways the data is processed (see [65] for a discussion).
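A minimal sketch of this ordinal coding (equation 5.1) in Python/NumPy; the function name is illustrative:

import numpy as np

def thermometer_encode(v, m):
    """Encode an ordinal value v in {1, ..., m} as m binary variables
    with v_i = 1 iff v >= i (equation 5.1)."""
    return (np.asarray(v)[..., None] >= np.arange(1, m + 1)).astype(int)

# e.g. thermometer_encode(3, 4) -> array([1, 1, 1, 0])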
The process of constructing variables requires consultation with an epidemiologist who can point out features of interest based on medical significance, reporting practices and coding methods. For example, similar information on the stage of the disease can be reported by different sources using somewhat different coding. Some of these sources may be more reliable or provide more detailed information. An outsider would not normally be aware of such distinctions. The variable construction process may also require multiple iterations. We give one example of a situation where the necessity of recoding became clear only after the completion of the initial experiments. Some SEER fields, such as extension, appear to be coded as ordinal variables (i.e. the larger codes appear to correspond to more serious conditions). However, we found that this is not the case, and that the effect is not monotonic. For this reason we coded the extension SEER field using 6 separate binary variables, rather than the 6-bit scale described above. In some cases, a field may potentially have many values (i.e. there are many codes specified) but in practice only a few of them occur in the data. In such cases only the values that occur are of interest. This can only be determined by looking at the distribution of the values for a particular field. The long tables in Appendix F describe which specific features were used in this analysis and how they were coded. Our representation consists of 98 binary variables, either demographic or medical. Demographic features are related to the individual's social and demographic description and are independent of the individual's medical condition. They are described in Table F.1. The other set of features describes the patients' medical status. The coding for these features is given in Table F.2. All the features are binary and are described below as having values 1/0. However, they can be coded as 1/0 or as 1/-1. We experimentally analyze the difference between the two coding schemes in Section 5.3.
5.1.3 Data Processing
From the files “RESPIR.TXT” in directories “yr1973 2001.seer9” and “yr1992 2001.sj la ak” of the SEER CD release we extracted all entries with a diagnosis of cancer of lung or bronchus (ICD Codes C34.0-C34.9). There were 436,022 such records. These entries contain multiple records (at most 4) for a small group of people (6,374). In order to keep the analysis simple, we removed all records pertaining to such people, leaving us with 423,078 records. We chose to use only the data from years 1988 and later. There were several reasons for this. One reason is that SEER contains more information on cases starting in 1988, and there is less missing data.
Figure 5.1: Plot of the median (line) and third quartile, 75%, (points) survival time as a function of the date of diagnosis.
Figure 5.2: Plot of survival rates for the cases diagnosed in 1988+, with a zoom-in on the first 24 months. The x-axis shows the number of months, the y-axis shows the number of people surviving that time. (The plot is not cumulative.)

The other reason is that the median survival time (8 months) has not changed from 1985 until now, while the third-quartile survival increased slightly from 19 to 22 months. Thus there were no drastic changes in survival time caused by advances in diagnosis or treatment in the longitudinal data. (The plots of median and 75% survival times are given in Figure 5.1, while the distribution of the survival time for cases diagnosed in 1988 or later is shown in Figure 5.2.)

As the cut-off between short-term survival (class 1) and long-term survival (class 0) we chose the median survival time, 8 months. The label was determined as follows:

1. If the survival time is unknown, or if a person was diagnosed in 2001, is not dead, and has survival of less than 8 months, the case has to be discarded (since we cannot assign it to either class).
2. If a person is dead, the cause of death was cancer of the lung or bronchus, and the survival time is less than 8 months, we assign that person to class 1.

3. If a person survived longer than 8 months, then he is assigned to class 0.

4. Otherwise, the person has to be removed since a label cannot be determined (for example, if a person died within 8 months of diagnosis but not from lung cancer).

Since we are not gathering statistics, but attempting to construct a case-based model, we only want to include cases where we are certain of the relationship between the label and the features. The selected data from years 1988+ were split approximately evenly into a training set (up to and including 1995) and a test set (1996-2001). The training set consists of 122,613 cases, and the test set consists of 100,292 cases. The distribution of the feature values on the training set for each class is described in Tables F.3-F.4.
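The labeling rules above can be summarized in a short sketch. The field names and the cause-of-death constant are hypothetical stand-ins for the corresponding SEER fields, and the handling of a survival time of exactly 8 months follows the wording of the rules literally:

```python
CUTOFF = 8                  # months: the median survival time
LUNG_OR_BRONCHUS = "C34"    # stand-in for the lung/bronchus cause-of-death codes

def assign_label(survival_months, dead, cause_of_death, diagnosed_in_2001):
    """Return 1 (short-term survival), 0 (long-term survival) or None (discard)."""
    if survival_months is None:
        return None                              # rule 1: survival time unknown
    if diagnosed_in_2001 and not dead and survival_months < CUTOFF:
        return None                              # rule 1: follow-up too short
    if dead and cause_of_death == LUNG_OR_BRONCHUS and survival_months < CUTOFF:
        return 1                                 # rule 2: died of lung cancer within 8 months
    if survival_months > CUTOFF:
        return 0                                 # rule 3: survived longer than 8 months
    return None                                  # rule 4: label cannot be determined
```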
5.1.4 Descriptive Analysis
Before converting all data into binary form we examined the relation between some features and survival time on the training set. The boxplots in Figure 5.3 show the distribution of survival time (in months) for particular disease stage, grade and extension codes, and conditional on whether surgery was performed or recommended. The correlation of survival time with age at diagnosis is weak (-0.1472). Features such as registry code, race or sex of the patient also do not appear to have an effect on survival time. The two features that show a strong relation to survival are the stage code and "performed surgery". These two features will play an important role in the classification models we construct.
5.1.5 Missing Value Analysis
The SEER data contains a large number of missing values in different fields, and a number of rare values that lead to "constant" features. These factors have to be taken into account before constructing the classifier. In general, the handling of missing data is an important aspect of practical applications, and many methods of varying complexity have been described in the literature [102, 4, 87]. We chose to discard cases and variables with a large number of missing values, and to use a simple missing value imputation method for the remaining data. Thus we try to balance the conflicting goals of retaining data for analysis and avoiding the introduction of strong bias into the data. More specifically, our handling of missing and constant data consists of the following steps:
Figure 5.3: Plots of the relation between lung cancer survival time and some of the medical features in the SEER database: survival in months against grade (9 denotes a missing value), stage code (99 denotes a missing value), extension code, and surgery status (performed, not performed, recommended, unknown). The thickness of the boxes is proportional to the square root of the number of observations in the group.
Class              0    1
Value 0:           a    b
Value 1:           c    d
Missing values:    x    y
Table 5.1: The above matrix represents the frequency of values of some feature on the training set. The missing values will be filled using the most frequent value in the most frequently missing class. For example, if x > y, then missing values will be set to 1 if c > a and to 0 otherwise.

1. If the value of a feature is missing in more than 25% of cases, it is removed.

2. If a feature has the same value in 95% or more of the cases where the value is not missing, it is removed (constant feature).

3. Those cases that are missing more than 25% of the feature values on the remaining features are removed as well.

All the statistics on the features are computed exclusively on the training set. For each feature we find the most frequent value v in the class with the most missing values for that feature. All the missing values are imputed with this value v. Table 5.1 explains this in detail.

After this processing, 45 features are left. The training set retains 120,318 cases, while the test set now consists of 97,240 cases (in other words, 2,295 training examples and 3,052 test examples were removed). All the experiments discussed below are performed starting with these data.

We would like to comment on the stability of the above procedure. Varying the cut-off for missing values in the interval [15%, 38%], while keeping the other parameters fixed, did not change the number of variables selected. In other words, there is a large gap in the frequency of missing values for the variables (see Table 5.2). Changing the parameter that controls elimination of the "constant" features has a greater effect. Setting it to 99%, rather than 95%, leaves 64 variables in the model. Additionally, somewhat fewer examples are skipped (1,770 in training and 2,080 in test). The sensitivity and specificity of a model built on these data are slightly worse than those of a model built on the 45 variables. While this result is achieved on a slightly larger dataset, it is unlikely that these additional variables are important for predicting survival. Increasing the threshold for the acceptable number of missing features in an example from 25% to 30% leads only to a minor increase in the number of retained cases: 1,751 training cases and 2,055 test cases are skipped, as opposed to 2,295 and 3,052 respectively for the 25% threshold. This change does not significantly affect the model or the results.

Missing Rate (% of all cases):  46.34  41.18  38.54  38.34  14.70  13.77  10.63  10.00  5.23  4.92  1.98  0.41  0.23  0.18  0.08
Number of Features:                 9      4      1      1      6      6      9     10     2     1     1     4     2     1     2

Table 5.2: Frequency of missing features with a particular missing rate (out of the initial 98). One can easily see the large gap [15%, 38%].

σ2:     10^-4   10^-3   10^-2   10^-1   1       10      10^2    10^5    10^8
CV LL:  -6526   -6508   -6495   -6490   -6490   -6489   -6489   -6489   -6489

Table 5.3: Cross-validation (CV) test log-likelihood (LL) for different values of the prior variance.

We believe that the observations and experiments described in this subsection demonstrate that the parameters used in the data cleaning stage are reasonable. While the experimentation and the methodology could have been more extensive, this would take us away from the main subject of this work.
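The cleaning and imputation steps described in this subsection can be sketched as follows. This is a simplified illustration assuming the raw features are in a pandas DataFrame with the labels in a Series, not the code actually used for the thesis:

```python
import pandas as pd

def clean_and_impute(train: pd.DataFrame, y: pd.Series,
                     max_missing_col=0.25, max_constant=0.95, max_missing_row=0.25):
    """Drop features with too many missing values, drop near-constant features,
    drop cases missing too many features, then impute each remaining feature with
    its most frequent value in the class with the most missing values (Table 5.1)."""
    X = train.copy()
    # step 1: drop features with more than 25% missing values
    X = X.loc[:, X.isna().mean() <= max_missing_col]
    # step 2: drop "constant" features (same value in >= 95% of non-missing cases)
    keep = [c for c in X.columns
            if X[c].dropna().value_counts(normalize=True).iloc[0] < max_constant]
    X = X[keep]
    # step 3: drop cases missing more than 25% of the remaining features
    rows = X.isna().mean(axis=1) <= max_missing_row
    X, y = X.loc[rows].copy(), y.loc[rows].copy()
    # imputation: most frequent value in the most frequently missing class
    for c in X.columns:
        missing_by_class = X[c].isna().groupby(y).sum()
        worst_class = missing_by_class.idxmax()
        fill = X.loc[y == worst_class, c].mode().iloc[0]
        X[c] = X[c].fillna(fill)
    return X, y
```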
5.2 The Baseline Result
As a baseline result we consider the performance of BBR with a Laplace prior (a setting favoring feature selection), using the 45 binary {0, 1} variables. The hyperparameter of BBR (the variance of the prior) was selected by internal 10-fold cross-validation (the choices were 10^-6, 10^-5, ..., 10^0). The prior variance selected was 1. Since this is an endpoint, we examined larger values of the hyperparameter, and discovered that increasing it further has little effect (see Table 5.3). In other words, the global maximum is likely achieved at σ2 = ∞, but the value at σ2 = 1 is already very close to it. For this reason we decided to keep using σ2 = 1. Despite the use of the sparseness-inducing prior, no features were removed. The threshold tuning parameter of BBR was set to minimize the sum of errors on the training set (this is an option available in the BBR software). The training (including cross-validation) takes approximately 30 minutes for the whole training set.

The baseline approach gives a sensitivity (percentage of class 1 that is identified correctly) of 72.10 and a specificity (percentage of class 0 that is identified correctly) of 72.50 on the test set. Estimates obtained with 2 runs of 5-fold cross-validation on the training set are 71.50 and 72.87 respectively. This demonstrates that the training set is representative of the test set and that the model built on it should generalize. Table 5.4 shows the top 12 variables according to the measure (4.7). Together they have 50% of the total weight. The performance using only these variables has a higher sensitivity of 76.78 and a lower specificity of 67.56.

It is interesting that the features with large negative weights (73 - surgery was performed, 76 - radiation and surgery were performed) are associated with treatment procedures. This can be related to the effectiveness of these procedures, or to the fact that these procedures are used when they will extend the patient's life and thus are "proxy" variables for other factors. The features with large positive weights (66, 64, 94, 97) indicate the extension into different organs and the stage of the disease. Histology code "804*", Carcinoma Not Otherwise Specified (NOS) - variable 83 - may be a relatively less dangerous or more controllable type of cancer. While most of the top features are medical, several features relating to the place of birth (31, 32, 33) also have large weights. This could be a reflection of some cultural factors influencing the behavior of the patients.

We experimented with decreasing the prior variance in BBR to perform variable selection. Setting the prior variance to 0.0001 resulted in 24 variables with non-zero coefficients. The most important are given in Table 5.5. All of these were present with large weights in Table 5.4. Demographic variables here are not among the most important ones. The performance of this model is 74.62 (slightly better than that of the global model) and 70.57 (slightly worse than that of the global model) - comparable overall. Setting the prior variance to 10^-6 resulted in 11 variables with non-zero coefficients. All of them are given in Table 5.6. The main variables are still the same - related to the treatment procedures (73, 74) and to the stage and extension of the cancer (61, 66, 96). The only demographic variable present (with a small positive weight) is 24 - age 75 or greater - implying that older people are less likely to survive 8 months after the diagnosis. The performance of this model is 74.84 and 68.26 - somewhat worse than that of the baseline or of the model obtained with variance 0.0001.

ID j   Relative importance rj   Coefficient wj   Description
73     0.080                    -1.197           Surgery was performed
76     0.063                    -0.950           No radiation sequence with surgery
66     0.048                     0.717           Extension code 80-85
83     0.040                    -0.599           Histology code 804*
64     0.040                     0.598           Extension code 71-76
75     0.038                    -0.573           Radiation
94     0.036                     0.542           Stage code 10 or higher
31     0.034                     0.512           Born in East South Central region
97     0.034                     0.511           Stage code 32 or higher
74     0.034                    -0.501           Surgery recommended
33     0.033                     0.501           Born in Mountain region
32     0.026                     0.391           Born in South West Central region

Table 5.4: Features (from BBR), sorted by importance, that add to 50% of the total weight.

ID j   Relative importance rj   Coefficient wj   Description
73     0.150                    -1.164           Surgery was performed
76     0.090                    -0.698           No radiation sequence with surgery
66     0.083                     0.643           Extension code 80-85
83     0.078                    -0.600           Histology code 804*
75     0.065                    -0.506           Radiation
74     0.062                    -0.479           Surgery recommended

Table 5.5: Features (from BBR with σ2 = 0.0001), sorted by importance, that add to 50% of the total weight.

ID j   Relative importance rj   Coefficient wj   Description
73     0.249                    -0.858           Surgery was performed
61     0.182                    -0.629           Extension code 10-30
66     0.135                     0.467           Extension code 80-85
96     0.099                     0.342           Stage code 31 or higher
74     0.093                    -0.322           Surgery recommended
75     0.072                    -0.247           Radiation
83     0.061                    -0.210           Histology code 804*
97     0.044                     0.152           Stage code 32 or higher
24     0.035                     0.119           Age 75 or greater
70     0.020                    -0.068           Site specific surgery performed
76     0.010                    -0.036           No radiation sequence with surgery

Table 5.6: All features retained by BBR with σ2 = 10^-6, sorted by importance.
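BBR itself is a standalone program, but the type of model fit here can be illustrated with L1-penalized ("lasso") logistic regression from scikit-learn, where the regularization strength C plays a role loosely analogous to the prior variance; the relative importance rj reported in the tables above is the absolute coefficient normalized by the total absolute weight. A sketch, not the procedure actually used:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def fit_lasso_logistic(X, y, c_grid=(1e-3, 1e-2, 1e-1, 1, 10, 100)):
    """Pick the regularization strength by 10-fold cross-validated log-likelihood,
    then refit on the full training set and compute relative importances."""
    scores = {c: cross_val_score(
                  LogisticRegression(penalty="l1", solver="liblinear", C=c),
                  X, y, cv=10, scoring="neg_log_loss").mean()
              for c in c_grid}
    best_c = max(scores, key=scores.get)
    model = LogisticRegression(penalty="l1", solver="liblinear", C=best_c).fit(X, y)
    w = model.coef_.ravel()
    r = np.abs(w) / np.abs(w).sum()   # relative importance r_j = |w_j| / sum_k |w_k|
    return model, best_c, r
```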
5.3 Effect of Changing Coding
An alternative representation of the data is as {−1, +1}, rather than {0, 1}. In the {−1, +1} representation the inner product between two vectors counts the number of entries on which they agree minus the number on which they disagree, and therefore determines the Euclidean distance between them, while in the {0, 1} representation it counts only the number of ones they have in common. However, the change of representation did not drastically alter the quality of the result or the variable significance. The sensitivity and specificity were 71.26 and 73.09 respectively. The features that add up to 50% of the importance are given in Table 5.7. Comparing Tables 5.4 and 5.7, one can see that most of the variables are present in both tables, with the same signs and similar relative importance. The variables that disappeared from this list are 97 and 74. The new variables are 30, 28 and 26, all of which are place-of-birth indicators.

ID j   Relative importance rj   Coefficient wj   Description
73     0.083                    -0.641           Surgery was performed
66     0.054                     0.417           Extension code 80-85
76     0.051                    -0.391           No radiation sequence with surgery
83     0.042                    -0.322           Histology code 804*
64     0.040                     0.306           Extension code 71-76
31     0.037                     0.282           Born in East South Central region
94     0.036                     0.282           Stage code 10 or higher
33     0.036                     0.281           Born in Mountain region
75     0.036                    -0.280           Radiation
32     0.029                     0.223           Born in South West Central region
30     0.028                     0.218           Born in New England
28     0.028                     0.217           Born in East North Central
26     0.028                     0.214           Born in South Atlantic

Table 5.7: Features (from BBR) on the {−1, 1} representation that add to 50% of the total weight, sorted by importance.
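A small numerical check of the relation between the two codings (purely illustrative, not part of the original pipeline):

```python
import numpy as np

X01 = np.array([[1, 0, 1, 0],
                [1, 1, 0, 0]])
Xpm = 2 * X01 - 1          # recode {0,1} -> {-1,+1}

x, y = Xpm
d = len(x)
agreements = np.sum(x == y)
# For {-1,+1} vectors the inner product counts agreements minus disagreements
# and fixes the Euclidean distance: <x,y> = 2*agreements - d, ||x-y||^2 = 2d - 2<x,y>.
assert x @ y == 2 * agreements - d
assert np.sum((x - y) ** 2) == 2 * d - 2 * (x @ y)
# In the {0,1} coding the inner product only counts shared ones:
assert X01[0] @ X01[1] == np.sum((X01[0] == 1) & (X01[1] == 1))
```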
5.4 Capturing Variable Interactions
In an attempt to capture feature interactions, we computed frequency information for all triples of features (out of the 45 remaining ones). An example of a triple is "feature 23 = 1 and feature 62 = 0 and feature 95 = 1", which can be interpreted as "a man above 65 years of age, with tumor code 10-30 (i.e. localized) and stage 20 or higher." Triples include all interactions of degree 3 or lower (i.e. pairs as well). Such triples can provide a compact description of particular groups of patients, and in the event that some triples are important, they are easy to interpret. The triples were selected based on two quantities: support (the total number of times that a particular triple occurs) and "strength" (the ratio of frequencies of the triple between the two classes). In other words, if a triple t occurs in n−(t) cases of the negative class and in n+(t) cases of the positive class, then the support of t is

    u(t) = n−(t) + n+(t),    (5.2)

and the strength of t is

    s(t) = max( n−(t)/n+(t), n+(t)/n−(t) ).    (5.3)
There are 112,179 triples that occur at least once in the training set. We aimed to select a small number of important interactions. We discovered that the "strength" of triples from the negative class was greater. Setting the support threshold at 25,000 and the strength threshold at 10, we obtained 4 triples. For the positive class (class 1), setting strength to 3 and support to 12,000 produced 5 triples. It is interesting to note the support of these triples (combined) on the training set. The 4 triples covered 23,056 cases of the negative class and only 2,227 cases of the positive class. On the other hand, the 5 triples covered 23,807 cases of the negative class but 9,198 of the positive class. Thus these triples are useful indicators of the class, despite low combined support.

By combining these 9 triples with the 45 original variables we can expand our space to 54 variables. Building a classifier with BBR in this space resulted in 72.42 sensitivity and 72.43 specificity. This is essentially the same result as without the triples. Furthermore, the triples did not have high weights in the classifier. Decreasing the prior variance left few triples in the model, all with low coefficients. Thus we have to conclude that selectively introducing interactions of up to 3rd order did not lead to improved classification or to more interesting models.

An alternative approach is to consider all pair-wise AND features. This has the effect of increasing the potential dimensionality of the space to up to 45 + 45 ∗ 44/2 = 1,035 features. The prior variance 0.1 was selected by cross-validation. The resulting model had a test set accuracy of 72.87 and sensitivity/specificity of (74.02, 71.91), which is better than the baseline result. However, this classifier involved 696 features with non-zero weights, and would therefore be difficult to interpret, especially considering the nature of many of the features. Specifying a lower variance, 10^-4, resulted in a model with 72.62 accuracy and (72.91, 72.38) sensitivity/specificity, based on 89 features. While the reduction in dimensionality is large, and the results are still better than the baseline, such a classifier is again difficult to interpret.
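A naive sketch of the support and strength computation of equations (5.2)-(5.3). This is illustrative only; enumerating all triples this way over the full training set would be slow, and it is not the code used in the thesis:

```python
from itertools import combinations
from collections import Counter

def triple_stats(X, y):
    """Count support u(t) and strength s(t) for every triple of
    (feature, value) conditions that occurs in the binary data X."""
    pos, neg = Counter(), Counter()
    for row, label in zip(X, y):
        conds = [(j, row[j]) for j in range(len(row))]
        for t in combinations(conds, 3):
            (pos if label == 1 else neg)[t] += 1
    stats = {}
    for t in set(pos) | set(neg):
        n_plus, n_minus = pos[t], neg[t]
        u = n_plus + n_minus                                    # support, eq. (5.2)
        s = float("inf") if min(n_plus, n_minus) == 0 else \
            max(n_minus / n_plus, n_plus / n_minus)             # strength, eq. (5.3)
        stats[t] = (u, s)
    return stats

# e.g. keep triples with support >= 25000 and strength >= 10:
# selected = {t: v for t, v in triple_stats(X, y).items() if v[0] >= 25000 and v[1] >= 10}
```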
5.5 Comparison with SVM
For SVM, we tried a linear kernel with the hyperparameter set to 1. SVM turned out to be very slow, taking hours to train (possibly because the particular software we used did not provide special techniques for linear SVM). Therefore, we evaluated SVM performance only in several cases. The sensitivity and specificity obtained were 72.45 and 68.69 respectively. This is somewhat worse than the baseline BBR result. However, the variable coefficients are rather different from those of the BBR model. Two variables accounted for more than 99% of the relative importance. They are given in Table 5.8. These are the only variables with high weights - all other weights are insignificant in comparison. If we ignore these two weights, the remaining variables have more similar coefficients. The top several are displayed in Table 5.9. All of these are features that have high weights in BBR models; however, in BBR models their weights are comparable to the largest coefficients.

ID j   Relative importance rj   Coefficient wj   Description
64     0.499                    1.999            Extension code 71-76
97     0.499                    1.998            Stage code 32 or higher

Table 5.8: Top 2 features from SVM.

ID j   Relative importance rj   Coefficient wj   Description
66     0.127                     0.0009          Extension code 80-85
63     0.058                    -0.0004          Extension code 60-70
74     0.056                    -0.0004          Surgery recommended
96     0.049                     0.0004          Stage code 31 or higher
61     0.043                    -0.0003          Extension code 10-30

Table 5.9: Relative importance of the top 5 features from SVM without features 64 and 97.

The structure of the SVM model makes one wonder whether we can obtain the same level of accuracy simply by considering the values of variables 64 and 97. Table 5.10 shows the frequencies of all possible values of these variables. As can be seen from that table, using the table of co-occurrence of these two features and the class label gives the same performance as building an SVM classifier on all features. Another interesting question is what would happen if these two features were removed. It turns out that removing these two features results in an SVM classifier with another two dominant features: 63 (extension code 60-70) and 96 (stage 20 or greater) (see Table 5.11). It is interesting that this time one of the coefficients (for feature 96) is negative, meaning that this feature is associated with the negative class (longer life). The sensitivity and specificity on the test set using this classifier are 73.43 and 66.98, which is slightly lower than before. Notice that features 63 and 96 do not show up with high weights in BBR models, but features 64 and 97 do. This happens because, in the presence of 64 and 97, features 63 and 96 add little information. Similar behavior - a few variables having almost all of the relative weight - is observed when SVM is applied to the data projected onto subsets of variables, or when the coding is changed from {0, 1} to {−1, +1}. The results with the {−1, +1} encoding are comparable to the baseline, while with decreased dimensionality the quality of the models decreases.
Values of (64, 97):       (0, 0)   (0, 1)   (1, 0)   (1, 1)
Training Set, class 0:     47540    13799     5530       54
Training Set, class 1:     16642    29017     7552      184
Majority Prediction:           0        1        1        1
Test Set, class 0:         36303    12212     4281       53
Test Set, class 1:         12229    25532     6487      143

Table 5.10: Frequencies of features 64 and 97 on the training and test sets. Using the majority rule leads to (sensitivity, specificity) = (68.83, 71.04) on the training set, and (sensitivity, specificity) = (72.45, 68.69) on the test set.
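The majority-rule prediction behind Table 5.10 can be sketched as follows (x64 and x97 stand for the two binary feature columns; this is not the code used to produce the table):

```python
import numpy as np

def majority_rule(x64, x97, y, x64_test, x97_test):
    """For each of the four value combinations of the two binary features,
    predict the majority class observed on the training set."""
    table = {}
    for a in (0, 1):
        for b in (0, 1):
            mask = (x64 == a) & (x97 == b)
            table[(a, b)] = int(y[mask].sum() * 2 > mask.sum())   # 1 if class 1 is the majority
    return np.array([table[(a, b)] for a, b in zip(x64_test, x97_test)])
```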
ID j   Relative importance rj   Coefficient wj   Description
96     0.499                     1.998           Stage code 20 or higher
63     0.499                    -1.998           Extension code 60-70

Table 5.11: Top 2 features from SVM built after variables 64 and 97 were removed.

The results of this section suggest that SVM may produce a very different model for the data from that of logistic regression, and thus may provide an important perspective on the phenomenon in question. While we will not focus on SVM in the remainder of this work, whenever possible it is interesting to use this method in addition to logistic regression.
5.6 Summary of Classifier Effectiveness Comparisons
We summarize the results mentioned in the preceding sections here (Table 5.12). These results suggest that the binary {0, 1} representation using 45 input variables performs as well as or better than the alternatives examined. Inclusion of some 3rd degree interactions did not improve the performance or reveal interesting relations. Including ANDs of all pairs of features improves the performance, but drastically increases the dimensionality. SVM performed somewhat worse than BBR and required much more computational effort, making it impractical for studies of very large datasets. Therefore, we will focus on experiments with BBR using the 45 original variables.

Representation                       BBR test (Sens., Spec.)  Accuracy   SVM test (Sens., Spec.)  Accuracy
45 variables (BBR: σ2 = 1)           (72.10, 72.50)           72.32      (72.45, 68.69)           70.41
24 variables (BBR: σ2 = 10^-4)       (74.62, 70.57)           72.42      (72.45, 68.69)           70.41
11 variables (BBR: σ2 = 10^-6)       (74.84, 68.26)           71.26      (75.44, 63.27)           68.82
45 variables as {−1, +1}             (71.26, 73.09)           72.25      (72.45, 68.69)           70.41
45 variables (select triples)        (72.42, 72.43)           72.42      -                        -
45 variables + AND features          (74.02, 71.91)           72.87      -                        -

Table 5.12: Predictive performance (sensitivity, specificity and accuracy) using cross-validation and the test set. All the differences in accuracy between SVM and BBR are significant at the 0.99 level.
            BBR |wj|   BBR C∆j   BBR Ij    SVM |wj|   SVM C∆j   SVM Ij
BBR |wj|    1.00000    0.97499   0.81381   0.20495    0.19396   0.19647
BBR C∆j     0.97499    1.00000   0.87078   0.21783    0.21307   0.21797
BBR Ij      0.81381    0.87078   1.00000   0.24009    0.25770   0.27057
SVM |wj|    0.20495    0.21783   0.24009   1.00000    0.98247   0.97739
SVM C∆j     0.19396    0.21307   0.25770   0.98247    1.00000   0.99948
SVM Ij      0.19647    0.21797   0.27057   0.97739    0.99948   1.00000

Table 5.13: Correlations between the magnitudes of the coefficients of SVM and BBR, the effect of changing a variable (C∆j) and a relevance measure (Ij).
5.7 Analysis of Variable Importance in the Global Classifier Model
Once a model is constructed, it has to be analyzed. One important criterion is the predictive accuracy of the model. A model with poor predictive performance is of no practical interest. Even a model that perfectly describes the training data should be discarded if it has poor performance on the test set or in cross-validation. However, model analysis extends further than just the accuracy of a model. The next step, one of great importance for epidemiology, is to analyze the significance of individual features in terms of their effect on survival.² Linear models (such as the ones produced by SVM and logistic regression) are traditionally popular precisely because their parameters can be interpreted as indicators of feature significance. There are, however, different ways to interpret them. We shall consider several measures described previously in Section 4.3.2.

One approach is to examine the effect of changing variable values in the test set. If a variable is not significant, then changing its value will not have an effect on the prediction, but if it is significant, then the prediction should change. The fraction of predictions that are changed, C∆j, is an indicator of the feature's significance. In our experiment with the baseline BBR the value of C∆j varied from 0.0057 to 0.391 on the test set. The correlation with the magnitude of the coefficients in the BBR model is large, as expected (see Table 5.13). The actual coefficients and the number of points for which predictions changed, together with the value Ij for each variable, are given in Table 5.14. Figure 5.4 shows scatter plots of these measures for BBR. Notice that while the relationship is strong, there are several cases where a feature with a large coefficient (by absolute value) affects fewer predictions than features with smaller coefficients, and the other way around. Two examples of this are features 64 (extension code 71-76) and 74 (surgery recommended).

² In epidemiology the purpose of building a model is not to construct a predictor for individual cases, but to provide a numerical description of the relation between the variables and the outcome.
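A sketch of how C∆j can be computed for binary features, assuming that "changing" a feature means flipping its value (which matches the {0, 1} coding); the model can be any classifier with a predict() method:

```python
import numpy as np

def prediction_change_fraction(model, X_test):
    """For each binary feature j, flip its value in every test case and record
    the fraction of cases whose predicted class changes (the measure C_Delta^j)."""
    base = model.predict(X_test)
    fractions = []
    for j in range(X_test.shape[1]):
        X_flip = X_test.copy()
        X_flip[:, j] = 1 - X_flip[:, j]          # flip the binary feature
        fractions.append(np.mean(model.predict(X_flip) != base))
    return np.array(fractions)
```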
Figure 5.4: Scatter plots of the feature importance measures (|wj|, C∆j and Ij) on the BBR model.
Figure 5.5: Scatter plots of the feature importance measures (|wj|, C∆j and Ij) on the SVM model.
Table 5.13 shows several other interesting relationships. While the measure Ij is strongly correlated with the coefficients and with C∆j, the correlation between |wj| and C∆j is stronger. Also, the measures computed with the SVM model have little correlation with any of the measures based on the BBR weights, but show even stronger correlations among themselves. The reason is that the SVM model has only 2 significant features, as was previously discussed. The scatter plots in Figure 5.5 clearly illustrate that.
5.8 Summary
We have discussed SEER data preparation and the construction of epidemiological models using modern machine learning methods (SVM and penalized logistic regression). While we did not pursue the use of SVM, our experiments suggest that it may provide an interesting and different perspective from more standard logistic regression approaches. Specialized implementations of linear SVM should make it comparable in computational complexity to logistic regression. We also considered several different feature importance measures. They are all strongly correlated within a particular method, but can also lead to different rankings for some variables.

A number of other aspects of this work may be of interest to the epidemiological community and can be developed further. For example, we have spent a great deal of effort preparing the data. While this process probably cannot be automated completely, there are ways of making it much faster and easier. Developing standard methods for data preparation would be of benefit to the medical and epidemiological community.

In the next chapter we apply a hierarchical classification scheme with local models to the analysis of epidemiological data. The approach we consider involves partitioning the data into clusters using the K-Means clustering algorithm. After that, for each cluster we build a classifier to distinguish this cluster from the rest of the data, and a local model within the cluster. This method will allow us to find "interesting" clusters in the data, where an interesting cluster is one consisting of a set of points that is distinguishable from the rest of the data and which allows for a different (better) local model of the phenomena.
ID   BBR wj   N·C∆j   BBR Ij   Description
46    0.019     552   0.069    Laterality: right
12   -0.040    1105   0.064    Registry: Los Angeles
63    0.046    1223   0.061    Extension code 60-70
70    0.049    1332   0.100    Site specific surgery (code 10 or higher)
95    0.060    1579   0.105    Stage code 20 or higher
85    0.064    1707   0.112    Histology code 814*
45   -0.078    1987   0.137    Laterality: left
1    -0.101    2558   0.108    Registry: San-Francisco
84    0.101    2627   0.135    Histology code 807*
7    -0.185    4692   0.145    Registry: Seattle
62    0.202    4991   0.120    Extension code 40-59
13   -0.206    5274   0.130    Place of birth: US
22    0.213    5310   0.153    Age 55 or greater
78   -0.219    5852   0.115    Radiation after surgery
17    0.236    5998   0.239    Sex: Male
15    0.252    6295   0.153    Race: Black
14    0.247    6551   0.181    Race: White
61   -0.237    6552   0.209    Extension code 10-30
23    0.249    6562   0.240    Age 65 or greater
82    0.266    6630   0.213    Histology code 801*
27    0.278    6976   0.131    Born in Mid Atlantic Region
5    -0.272    7237   0.170    Registry: Iowa
44    0.308    7617   0.128    Primary Site: Bronchus
3    -0.290    7885   0.206    Registry: Detroit
9    -0.295    7981   0.134    Registry: Atlanta
34    0.327    8197   0.193    Born in Pacific region
2    -0.319    8657   0.193    Registry: Connecticut
96    0.303    8679   0.275    Stage code 31 or higher
26    0.379    9414   0.182    Born in South Atlantic
30    0.385    9506   0.161    Born in New England
28    0.389    9583   0.218    Born in East North Central
32    0.391    9607   0.144    Born in South West Central region
29    0.384    9669   0.216    Born in West North Central
24    0.371   10182   0.268    Age 75 or greater
33    0.501   12026   0.152    Born in Mountain region
31    0.512   12292   0.152    Born in East South Central region
74   -0.501   15067   0.305    Surgery recommended
97    0.511   15549   0.343    Stage code 32 or higher
64    0.598   15837   0.243    Extension code 71-76
94    0.542   16182   0.297    Stage code 10 or higher
75   -0.573   16423   0.378    Radiation therapy
83   -0.599   16472   0.286    Histology code 804*
76   -0.950   19038   0.256    No radiation sequence with surgery
66    0.717   22367   0.401    Extension code 80-85
73   -1.197   37975   0.471    Surgery was performed

Table 5.14: BBR model coefficients wj, the corresponding Ij, and C∆j multiplied by the total number of points in the test set (N·C∆j), ordered by increasing C∆j.
Chapter 6 Local Models in SEER data
In the previous chapter we described the construction of a global classifier using BBR with a Laplace prior ("lasso" logistic regression). This classifier had a relatively high accuracy, but made use of all 45 features. The feature weights were rather spread out, with the top 12 variables adding up to 50% of the total feature weight.

Here we discuss the use of local models in the analysis of epidemiological data. We will use the Single Cluster Hierarchy (SCH), described in Chapter 4. The basic idea is that we conduct cluster analysis, and then evaluate one cluster at a time in terms of the accuracy of the local classifier compared to the global one, the separability of the cluster from the other points, and the features involved in the local and first-level classifiers for that cluster.

We obtained the initial partition for the clustering on the training data by taking the values of the first 8 features (7 registry values, and "born in the USA or not"). Thus this initial partition gives us 8 × 2 = 16 clusters. We then apply K-Means and run it for 10 iterations. Below we present, for each such cluster, the structure of the cluster and of the local and first-level classifiers, together with comparable statistics for the global classifier.
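A rough sketch of this setup, with scikit-learn's K-Means and logistic regression standing in for the clustering and for BBR; the initial centers (init_centers) could be, for example, the mean vectors of the 16 groups defined by the registry and place-of-birth features:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def single_cluster_hierarchy(X, y, init_centers, n_iter=10):
    """Refine an initial partition with K-Means (10 iterations), then for every
    cluster i fit (a) a first-level classifier separating the cluster from the
    rest of the data and (b) a local survival classifier inside the cluster."""
    km = KMeans(n_clusters=len(init_centers), init=np.asarray(init_centers),
                n_init=1, max_iter=n_iter).fit(X)
    models = {}
    for i in range(len(init_centers)):
        in_cluster = km.labels_ == i
        first_level = LogisticRegression(max_iter=1000).fit(X, in_cluster.astype(int))
        local = LogisticRegression(max_iter=1000).fit(X[in_cluster], y[in_cluster])
        models[i] = (first_level, local)
    return km, models
```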
6.1 Predictive Quality of the Hierarchical Models
The estimated quality of the individual classifiers on the training data is described in Table 6.1. One interesting observation is the high cross-validation accuracy of the first-level classifier on the training set (the 4th column of Table 6.1). The accuracy on the complement of the cluster is always above 99%, and the worst result with respect to the cluster points is 77%. This suggests that overall the clusters are well separated from the other points and thus are likely to correspond to structure in the data.

The results on the test data are in Table 6.2. The measures used are the average log-likelihood (ALL) of the data and the area under the ROC curve (AUC). These results suggest that the effect of the hierarchy on the overall classification accuracy is rather small: the ALL and AUC of the SCH scheme differ only slightly from those of the global classifier. The accuracy of the SCH scheme, and the specificity at a fixed sensitivity level, shown in the second and third columns of Table 6.3, support this conclusion.
i    α(R2i|Si, ∗)     α(R0|Si, ∗)      α(R1i|W, ∗)      Classes in Si    Classes in W/Si
1    (31.42, 86.08)   (31.80, 85.68)   (78.78, 99.60)   (1709, 2863)     (51686, 64060)
2    (70.54, 60.98)   (72.49, 61.24)   (96.85, 99.84)   (2770, 2486)     (50625, 64437)
3    (41.97, 84.33)   (44.86, 83.07)   (89.71, 99.52)   (2822, 4691)     (50573, 62232)
4    (64.74, 65.23)   (67.29, 59.92)   (77.64, 99.22)   (4014, 4303)     (49381, 62620)
5    (4.03, 99.66)    (0.17, 99.98)    (96.43, 99.52)   (1725, 17749)    (51670, 49174)
6    (18.18, 96.21)   (16.28, 96.81)   (99.44, 99.93)   (1741, 6684)     (51654, 60239)
7    (74.70, 59.26)   (77.59, 56.11)   (96.13, 99.76)   (5027, 4232)     (48368, 62691)
8    (95.99, 11.79)   (96.39, 10.53)   (94.96, 99.74)   (12657, 5443)    (40738, 61480)
9    (48.61, 75.95)   (47.22, 78.04)   (85.07, 99.41)   (2951, 4174)     (50444, 62749)
10   (58.87, 70.81)   (69.49, 59.10)   (88.74, 99.83)   (1144, 1055)     (52251, 65868)
11   (53.13, 76.98)   (46.29, 81.22)   (93.67, 99.82)   (2303, 3190)     (51092, 63733)
12   (63.64, 65.26)   (65.21, 65.34)   (78.62, 99.89)   (572, 567)       (52823, 66356)
13   (48.83, 78.20)   (49.94, 78.61)   (90.81, 99.56)   (1745, 2330)     (51650, 64593)
14   (95.08, 14.63)   (95.01, 13.84)   (97.08, 99.81)   (8996, 3998)     (44399, 62925)
15   (65.87, 62.37)   (65.57, 62.71)   (88.21, 99.50)   (2344, 2348)     (51051, 64575)
16   (66.97, 63.27)   (79.03, 50.62)   (94.72, 99.98)   (875, 810)       (52520, 66113)

Table 6.1: Predictive performance (sensitivity, specificity) of the local and global classifiers inside the cluster estimated on the training data (with cross-validation), together with cluster size (on the training set).
Cluster   SCH              (R2i|Si, ∗)       (R0|Si, ∗)       Classes in Si   Classes in W/Si
Id        LL      AUC      LL      AUC       LL      AUC
1         -0.537  79.81    -0.608  69.36¹    -0.606  69.78    (930, 1451)     (43461, 51398)
2         -0.537  79.82    -0.611  71.68     -0.611  71.78    (1971, 1588)    (42420, 51261)
3         -0.537  79.82    -0.602  72.08     -0.600  71.99    (1875, 2827)    (42516, 50022)
4         -0.537  79.85    -0.622  71.11¹    -0.626  70.51    (4126, 4287)    (40265, 48562)
5         -0.536  79.82    -0.212  77.26¹    -0.219  76.31    (1016, 14350)   (43375, 38499)
6         -0.536  79.85    -0.457  77.55¹    -0.467  76.25    (1637, 5310)    (42754, 47539)
7         -0.537  79.83    -0.598  72.04     -0.600  72.00    (3468, 2291)    (40923, 50558)
8         -0.537  79.83    -0.587  62.99     -0.589  63.04    (10607, 4476)   (33784, 48373)
9         -0.537  79.81    -0.612  69.43     -0.609  69.57    (1375, 2202)    (43016, 50647)
10        -0.537  79.81    -0.637  69.00¹    -0.633  69.36    (743, 757)      (43648, 52092)
11        -0.537  79.82    -0.616  71.42¹    -0.618  70.94    (1412, 1762)    (42979, 51087)
12        -0.537  79.82    -0.615  71.56¹    -0.616  71.83    (370, 309)      (44021, 52540)
13        -0.537  79.81    -0.634  69.48¹    -0.633  70.12    (1318, 1446)    (43073, 51403)
14        -0.537  79.84    -0.594  63.72     -0.597  63.57    (7633, 3405)    (36758, 49444)
15        -0.537  79.85    -0.622  71.04¹    -0.628  70.13    (2840, 2792)    (41551, 50057)
16        -0.537  79.83    -0.632  69.46     -0.635  69.54    (1327, 1151)    (43064, 51698)

Table 6.2: Average log-likelihood and area under the ROC curve (AUC) for the SCH scheme (over all test data) and for the local and global models on the clusters. The global model has an average log-likelihood of −0.537 and an AUC of 79.82. Superscript 1 indicates that the local model AUC on the cluster is significantly different (better or worse, at the 0.99 level) from the global model AUC on the same cluster.
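The per-cluster measures of Table 6.2 can be computed as in the following sketch, where y_true are the test labels, p_hat the predicted probabilities of class 1, and cluster_labels the cluster assignments (the names are ours):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def cluster_all_and_auc(y_true, p_hat, cluster_labels, cluster_id):
    """Average log-likelihood (ALL) and AUC of the predicted probabilities
    restricted to one cluster."""
    m = cluster_labels == cluster_id
    y, p = y_true[m], np.clip(p_hat[m], 1e-12, 1 - 1e-12)
    all_ = np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    auc = roc_auc_score(y, p)
    return all_, 100 * auc   # AUC is reported as a percentage in the table
```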
Figure 6.1: ROC curves for the global and local classifiers on clusters 1-8. Line 1 (circles) - global classifier, line 2 (crosses) - local classifier.
Figure 6.2: ROC curves for the global and local classifiers on clusters 9-16. Line 1 (circles) - global classifier, line 2 (crosses) - local classifier.
Cluster Id   Accuracy of SCH   Specificity of SCH   Break-even R0      Break-even R2i
1            72.34             72.54                (63.44, 63.47)     (63.48, 63.50)
2            72.31             72.52                (65.46, 65.44)     (66.01, 65.99)
3            72.33             72.54                (66.29, 66.29)     (66.29, 66.32)
4            72.33             72.64                (64.64, 64.64)     (65.54, 65.55)
5            72.32             72.50                (69.00, 68.98)     (69.96, 69.96)
6            72.36             72.54                (69.95, 69.96)     (71.66, 71.66)
7            72.31             72.52                (66.43, 66.42)     (65.92, 65.91)
8            72.35             72.56                (59.41, 59.41)     (59.14, 59.15)
9            72.30             72.50                (64.15, 64.12)     (64.10, 64.10)
10           72.31             72.51                (63.26, 63.28)     (63.06, 63.08)
11           72.37             72.57                (65.16, 65.15)     (65.79, 65.78)
12           72.31             72.51                (65.03, 64.92)     (66.49, 66.34)
13           72.33             72.53                (65.25, 65.21)     (64.26, 64.25)
14           72.35             72.51                (59.78, 59.78)     (59.41, 59.41)
15           72.35             72.56                (64.86, 64.86)     (65.42, 65.44)
16           72.34             72.56                (64.96, 64.99)     (64.68, 64.66)

Table 6.3: Second column - accuracy of SCH (the global accuracy is 72.32). In the third column, by fixing the SCH sensitivity to 72.10 (the sensitivity of the global classifier), we can observe the specificity of the SCH scheme compared to that of the global classifier (72.50). The table also shows the break-even points of sensitivity/specificity for the global and local models on the clusters (columns 4 and 5).

However, the results on the clusters may be rather different. As can be seen from Table 6.2, in clusters 4, 5, 6, 11 and 15 the AUC of the local classifier is larger than that of the global classifier. These differences are statistically significant (at the 0.99 level) according to the AUC comparison test described by Hanley and McNeil [50, 51]. The average log-likelihood is also higher for the local classifiers in these clusters. Also, the break-even sensitivity/specificity values on these clusters are better for the local models (Table 6.3).

Figures 6.1-6.2 contain ROC plots of the local and global classifiers on the clusters. These graphs show that in clusters 5, 6 and 15, for example, the ROC curve of the local model dominates that of the global model, supporting the idea that in appropriate regions these local models are better than the global model. The different predictive performance of the local classifiers also suggests that feature importance differs inside the clusters. We will examine this in more detail below.
6.2 Analysis of Local Feature Importance
Below we analyze the local models, and the models for separating a cluster from the rest of the data, in terms of their differences from the global model.
Table 6.4 shows the correlations between the relative coefficient weights for all pairs of the different types of classifiers with respect to each cluster.

Cluster   R0 and R2i   R0 and R1i   R2i and R1i
1          0.234        0.081       -0.175
2          0.610       -0.094        0.242
3          0.395        0.132       -0.142
4          0.351        0.091        0.022
5          0.083        0.128        0.200
6         -0.245        0.454       -0.245
7          0.654       -0.144       -0.147
8          0.643       -0.062       -0.151
9          0.147       -0.103       -0.390
10         0.333       -0.133        0.019
11         0.531       -0.024        0.036
12         0.670       -0.142        0.004
13         0.502        0.068       -0.188
14         0.656       -0.014       -0.106
15         0.600       -0.046       -0.159
16         0.324       -0.202       -0.057

Table 6.4: Correlations between the relative coefficient weights rj of the global classifier and those of the local and first-level classifiers.

These correlations are small in many clusters; in cluster 6 they are even negative, suggesting that globally important features are not important inside these clusters and vice versa. The correlations between the feature weights in the classification models and those in the first-level classifier separating a cluster from the rest of the data are all very small, and often negative.

Figures 6.3-6.4 provide an example of another way of looking at the same information. In each plot, the relative weight of the coefficient is plotted for the global, local and first-level classifiers. The features are ordered by decreasing global relative weight (thus the order is the same in all the plots). By observing to what extent this monotonicity is violated by the feature weights of the local and first-level classifiers, one can tell that features have drastically different importance in these types of classifiers.

Appendix G contains tables that compare the sets of important features (those with high relative weight that together add up to 50% of the total weight). We will not discuss each table in detail here, but will make some general observations about the structure of the data on the basis of those tables and the graphs above.

Based on the locally important features it is possible to group clusters. For example, consider the histology (82-85) and stage (94-97) features. Of these, 83 (histology code 804*), 94 (stage code 10 or higher) and 97 (stage code 32 or higher) are important in the global classifier. In clusters 2, 7, 8, 12, 14 and 15, out of all the histology and stage features only feature 83 is important. In clusters 4, 10 and 13 only some of the stage features are important, while clusters 3, 5, 9, 11 and 16 make use of both stage and histology features. Also, in clusters 1 and 6 none of these two groups of features is important. This grouping suggests that in different situations (i.e. different clusters), different features are important for predicting survival.
Figure 6.3: Plots of rj for the global, local and first-level classifiers in each of clusters 1-8. Line 1 - global, line 2 - local, line 3 - first-level classifier.
Figure 6.4: Plots of rj for the global, local and first-level classifiers in each of clusters 9-16 (continued). Line 1 - global, line 2 - local, line 3 - first-level classifier.
j 1 3 5 7 12 13 24 31 32 33 64 66 73 74 75 76 78 83 94 95 96 97
R0 1 1 1 1 1 1 1 1 1 1 1 1
R2i 1 1 1 1 1 1 1 -
R1i 1 1 1 1 1 1 1 -
µj 0.00 0.00 0.01 0.00 0.86 1.00 0.28 0.04 0.14 0.07 0.45 0.00 0.07 0.07 0.43 1.00 0.00 0.19 0.98 0.97 0.65 0.04
Description Registry: San-Francisco Registry: Detroit Registry: Iowa Registry: Seattle Registry: Los Angeles Place of birth: US Age 75 or greater Born in East South Central region Born in South West Central region Born in Mountain region Extention code 71-76 Extention code 80-85 Surgery was performed Surgery recommended Radiation therapy No radiation sequence with surgery Radiation after surgery Histology code 804* Stage code 10 or higher Stage code 20 or higher Stage code 31 or higher Stage code 32 or higher
Table 6.5: Feature information for Cluster 15 (only significant or discussed features are shown). Histology code "xyz*" means that the first three digits of the code are xyz.

It is of course interesting to consider what the clusters are. Consider Tables 6.5 and 6.6, which contain parts of the tables from Appendix G for clusters 15 and 16. Cluster 16 consists of cases in the Los Angeles area (feature 12) who were not born in the USA (feature 13). (This cluster does not necessarily include all such cases.) Cluster 15 consists only of cases that were born in the US, and of those 86% are in the Los Angeles area. The local model in cluster 16 suggests that variables 96 and 97 are important (stage code 31 or 32 or higher), and the frequency of both features is high in that cluster. In cluster 15, on the other hand, these features are not important, and the frequency of feature 97 is much lower than in cluster 16. Similar analysis can be conducted for other clusters, or focusing on other variables. These and other observations may suggest a number of hypotheses to an investigative team which could be evaluated by targeted follow-up studies.
j 12 13 24 31 32 33 64 66 73 74 75 76 78 83 94 96 97
R0 1 1 1 1 1 1 1 1 1 1 1 1
R2i 1 1 1 1 1 1 1
R1i 1 1 1 1 -
µj 0.99 0.00 0.29 0.00 0.00 0.00 0.22 0.44 0.06 0.06 0.46 1.00 0.00 0.15 0.98 0.74 0.46
Description Registry: Los Angeles Place of birth: US Age 75 or greater Born in East South Central region Born in South West Central region Born in Mountain region Extention code 71-76 Extention code 80-85 Surgery was performed Surgery recommended Radiation therapy No radiation sequence with surgery Radiation after surgery Histology code 804* Stage code 10 or higher Stage code 31 or higher Stage code 32 or higher
Table 6.6: Feature information for Cluster 16 (only significant or discussed features are shown). Histology code ”xyz*” means that the first three digits of the code are xyz.
6.3 Conclusions
We have shown how combining cluster analysis with supervised classification can be used to obtain a description of the data both in terms of the regions of the feature space that the points occupy and in terms of the features significant for prediction in each of the regions. Such analysis can be further automated to quickly detect interesting clusters by incorporating constraints on the cluster size and on the differences between the global and local models. Alternatively, analysis of the cluster structure as a whole can provide insight into the data and suggest hypotheses for further research. The analysis should be conducted with the participation of a domain expert, in this case an epidemiologist, since the interpretation of the results is going to be domain-specific.
Chapter 7 Epilogue
7.1 Summary
The major innovation of this work was to systematically examine ways of using deterministic clustering methods together with supervised classification methods. While such approaches seem intuitive, to our knowledge there has previously been no such study. I experimented with clustering at four different levels:

1. examples belonging to one class;
2. examples belonging to different classes;
3. clusters belonging to different classes; and
4. classes as a whole.

The process of clustering the data is followed by the process of constructing and evaluating classifiers on different elements of the partition. This approach also requires a method for assigning a new point to an appropriate region in the space. Thus it is essentially hierarchical. For the purpose of analysis, I have proposed 6 algorithms, not counting additional variants.

I have shown that combinations based on clustering inside classes and on unsupervised clustering result in better predictive performance (accuracy is higher by up to 5%) than those based on grouping classes or inside-class clusters via confusion matrix analysis, or than the global classifier approach. These conclusions are supported by more than 3000 experiments on 4 benchmark datasets.

I also suggested a simple combination of unsupervised clustering and classification to search for local structure in the data and detect locally significant features. In such analysis the classification model itself is seen only as an intermediate result on the path to understanding the phenomena. The approach was illustrated with an analysis of lung cancer survival data containing records for 200,000 patients.
7.2 Contributions
Some of the contributions of this work are summarized here:

1. Suggesting the Clustering Inside Classes (CIC) approach to improving classification accuracy. This approach uses cluster analysis to partition each class into a pre-specified number of clusters. Clearly, each cluster corresponds to exactly one class. These clusters are treated as classes when building a classifier. A new point is assigned to the class corresponding to the predicted cluster. This simple and efficient approach was shown to improve accuracy on 3 of the 4 benchmark datasets in the initial experiments. In most cases, the improvements are statistically significant. Additional experimentation indicated that these results were not an artifact of the training/test split used. CIC yields better results than a global classifier when classes have more than one intrinsic component. Even when each class is homogeneous, with sufficient training data CIC is no worse than the global classifier. To my knowledge only one publication [61] previously discussed a similar, but more complicated, method.

2. Suggesting two methods (KKM and RBC) for the analysis of confusion matrices. These two heuristic methods group together columns and rows of the error-based similarity matrix between classes or clusters by maximizing specific mathematical criteria. These criteria have, in some form, been studied in the literature. However, I do not know of a previous description of the proposed algorithms, despite their simplicity and close relation to many other methods. Due to the large amount of literature it is possible that these algorithms have been proposed before. Both of these methods can be used in more general situations than the one described in this work: they can be applied to the analysis of any symmetric similarity matrices.

3. Suggesting the ReUsing First Level Classifier approach to Error-Based Class Aggregation, and showing it to outperform the ReTraining First Level Classifier approach. The method proposed in [45, 44] involved training classifiers to distinguish between the metaclasses. That work suggested that the error rate of these classifiers, combined with that of the classifiers inside the metaclasses, made such a scheme ineffective. I suggested an alternative approach where the global classifier is used to unambiguously assign a point to a metaclass, rather than training new classifiers to separate metaclasses. My approach, ReUsing First Level Classifier, outperformed ReTraining First Level Classifier in almost all experiments. However, the resulting method failed to noticeably improve on the global classifier.
4. Investigating a combination of CIC and EBCA (the CIC+EBCA method). The first of these approaches partitions the classes into clusters, while the second combines classes into metaclasses. The combination of these two methods is natural: first each class is partitioned into clusters, and then the clusters are combined into metaclusters based on the analysis of the confusion matrix. It turned out, however, that this approach failed to improve on CIC alone.

5. Suggesting the Hierarchy With Global Classifier (HGC) method, which includes the following ideas:

• deterministic "input partitioning";
• use of a first-level classifier to separate clusters;
• several methods for automatic selection of the local or global classifier.

Using a first-level classifier rather than assigning points to clusters based on distances to cluster centers almost always led to slightly better results.

6. Demonstrating that the CIC and HGC approaches outperform the EBCA and CIC+EBCA methods for combining clusters or classes into metaclusters or metaclasses. In the large number of experiments that I conducted, CIC+EBCA was rarely able to perform better than CIC alone, and EBCA did not improve on the global classifier. However, both CIC and HGC improved on the global classifier in almost all cases.

7. Suggesting a simplified version of HGC, SCH, as a way to systematically examine the cluster structure of the data and find locally significant variables. This approach partitions the data into clusters and analyzes one cluster at a time. This is done by building a local predictive model and comparing its performance and structure to those of the global model, and also by examining the separator between the cluster and the rest of the data. This method allows us to find "interesting" clusters in the data, where an interesting cluster is one consisting of a set of points that is distinguishable from the rest of the data and that allows for a different (better) local model of the phenomena.

8. Applying the SCH scheme to the analysis of lung cancer data from the SEER database. This analysis provided an example of how to look for local structure and interpret the differences between the local, global and first-level models.
7.3 Directions for Future Work
A number of directions for further work have been discussed in the preceding chapters. Below is a brief summary of those, together with some additional ideas. A general characteristic of this work is that it was conducted using linear classification methods and clustering in the input space. It would be interesting to examine combinations of feature-space clustering, such as Kernel K-Means, with non-linear classification methods, for example SVM with Gaussian kernels.
7.3.1 Future Work on Local and Global Models
There is further work to be done exploring the criteria for using a local model, and the effect of the particular clustering method and of cluster quality on classification accuracy and interpretability. Our results suggest that using some method to remove poor local classifiers is often advantageous, as expected. However, it is not clear whether there is a single method for identifying poor local classifiers that is best in most situations. In general, the problem of selecting the best classifier in a particular region has been addressed in different contexts in the machine learning literature [73], [94] in ways similar to ours, i.e. using cross-validation. One approach for improving HGC results is to use a statistical significance test rather than a direct comparison of the accuracies of the local and global classifiers. One interesting direction I did not consider was partitioning each cluster into a central "core" and a boundary "tail" region, and applying local experts only to the cores, while leaving the tails to the global classifier. This is a different variation of the HGC approach from the one examined, and it could lead to interesting results both from the perspective of accuracy and from that of interpretation.
7.3.2 Future Work on Parameter Tuning
The hyperparameter values selected for a particular dataset were used for all classifiers on that dataset. In other words, the hyperparameters were not tuned for the local methods. It is possible that with appropriate tuning the methods proposed in this dissertation would perform better than the experimental results described here suggest. Finding ways of setting parameters locally is a potential direction for further improving the performance of our methods. As previously mentioned, the choice of the number of clusters (parameter k) has a strong effect on the results of both the CIC and HGC approaches. Having too few or too many clusters leads to a drop in performance, since the clusters then do not correspond to the high-density data regions. This is also
related to the amount of available data: the number of points in the training set is one important factor for choosing the number of clusters. Thus, one direction for further work is to experiment with clustering methods capable of automatically choosing the appropriate number of clusters. Similar work can be done for automatically determining the number of metaclasses or metaclusters in the EBCA scheme (parameter m). The problem of determining the number of clusters in the data is a well-known open problem in cluster analysis, with many heuristic solutions proposed in the literature. The most appropriate method is likely to depend on the clustering method used and to be application-dependent.
7.3.3 Future Work on Applications
It is important to find appropriate ways of interpreting the models and the decision rules they produce, for example by finding the feature values characterizing particular clusters. Developing fully or semi-automatic methods for this task is another direction for future work. In applying my method to the analysis of epidemiological data, all features were used in clustering. It may be advantageous to select, manually or automatically, a subset of features to be used in clustering. This may lead to more interpretable models, and could be a simple way of incorporating domain knowledge. Analysis of the cluster structure of the data can be further automated to quickly detect interesting clusters by incorporating constraints on the cluster size and on the differences between the global and local models. Simple examples include constraining clusters to be of a specific size, and introducing thresholds on the difference between local and global model performance. A number of other aspects of this work may be of interest to the epidemiological community and can be developed further. For example, a great deal of effort was spent on preparing the data. While this process probably cannot be automated completely, there are ways of making it much faster and easier. Developing standard methods for data preparation would be of benefit to the medical and epidemiological community.
Appendix A K-Means
A.1 Data Clustering
Clustering is an unsupervised classification of patterns into groups based on similarity. Different clustering methods are widely used in applications ranging from image analysis to bioinformatics. Clustering methods can be:
• hierarchical or partitional. Hierarchical methods produce a sequence of refined partitions, with the whole dataset as a single cluster at the topmost level. Partitional methods produce a single partition of the data, with every point belonging to some cluster.
• deterministic or fuzzy. In deterministic clustering methods each point belongs to a single cluster. In fuzzy methods a point can have degrees of membership in many clusters.
• agglomerative or divisive. This refers to the algorithmic operation of the method. Agglomerative methods start with individual points and repeatedly merge them into clusters. Divisive methods begin with all points in a single cluster and repeatedly split the clusters until a stopping criterion is met.
• iterative or batch. Iterative methods can work with large datasets by processing one object at a time. Batch methods need access to all data points to perform clustering.
A clustering algorithm can be described by a combination of the above labels and some additional ones (for example, whether the algorithm works only with points or can handle similarity/distance matrices as input). A good discussion of this taxonomy, as well as an overview of clustering methods, can be found in [60].
A.2 K-Means Criterion
Many partitional clustering methods attempt to find a partition optimizing some criterion.
Given a partition P = {S_i | i = 1, ..., K} of the set W = {x_1, ..., x_N} into K clusters (W = \bigcup_{i=1,\dots,K} S_i and S_i \cap S_j = \emptyset, \forall i \neq j), consider the set of functional criteria:

F(P) = \sum_{j=1}^{N} q(x_j, P)    (A.1)

where q(x_j, P) is a measure of how far x_j is from a corresponding cluster. We want all points to be well-clustered, i.e. to be inside one cluster and far from the others. Thus we want to find the partition P that minimizes F. The choice of the function q(x_j, P) is one of the two main factors defining a particular method. (The other is the procedure for minimizing F(P).)
K-Means (KM) is a popular clustering algorithm. It is widely used on its own, or as a part of more complex methods (to create initial partitions for Expectation Maximization (EM), or as a part of the Spectral Clustering method [88]). K-Means attempts to find a partition of the data into K clusters that minimizes the average intra-cluster dispersion:

E_0(P) = \sum_{i=1,\dots,K} |S_i| \sigma_i^2 = \sum_{j=1,\dots,N} \min_{i=1,\dots,K} \|x_j - s_i\|^2    (A.2)

where s_i is the mean vector and \sigma_i^2 the variance of cluster S_i. It is easy to see that E_0(P) is a special case of F(P) with q(x_j, P) = \min_{s_i} \|x_j - s_i\|^2.
A.3 Batch, Iterative and Adaptive K-Means
Procedures such as the one given in Algorithm 9 are used to find local minima of E_0(P). They alternate two steps - assigning all points to the clusters with the closest centers (lines 9-15) and recomputing the centers based on the partition (lines 5-8) - until the centers become stable and every point resides in the cluster with the closest center. This batch algorithm (ascribed alternatively to Forgy [36] or to Lloyd [76]) has complexity O(tdN), where N is the number of points, d is the dimensionality of the data and t is the number of iterations performed.
One potential source of problems is the occurrence of "empty clusters". This can happen when for some center there are no points for which it is the closest center. Some heuristic ways of dealing with this include:
• Keeping the empty cluster as it is, in the hope that later on it will gain points.
• Removing the empty cluster and splitting the cluster with the largest contribution to E_0(P) in two.
• Moving the empty cluster to the point that is farthest from its center.
• Keeping the closest point to the center in the cluster, even if there is another center to which that point is closer than to the current one.
Algorithm 9 Classical Batch K-Means Pseudocode
Require: Dataset W, consisting of N points in ℜ^d; k - number of clusters
1: create initial clusters S_1, ..., S_k
2: count = 1
3: while count > 0 do
4:   count = 0
5:   for i = 1, ..., k do
6:     n_i = |S_i|
7:     s_i = (1/n_i) Σ_{x∈S_i} x
8:   end for
9:   for j = 1, ..., N do
10:    h = argmin_{i=1,...,k} ||x_j − s_i||
11:    if x_j ∈ S_i and i ≠ h then
12:      Move x_j to S_h
13:      count = count + 1
14:    end if
15:  end for
16: end while
17: Return S_1, ..., S_k.
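For concreteness, a compact NumPy version of Algorithm 9 might look as follows. This is a sketch, not the implementation used in the experiments; it uses a random initial partition (the RA initialization) and reseeds empty clusters at random points, one of the simple heuristics listed above.

import numpy as np

def batch_kmeans(X, k, max_iter=100, rng=np.random.default_rng(0)):
    # Random initial assignment of points to k clusters.
    labels = rng.integers(0, k, size=len(X))
    for _ in range(max_iter):
        # Recompute centers; reseed an empty cluster at a random point.
        centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                            else X[rng.integers(len(X))] for i in range(k)])
        # Reassign every point to the cluster with the closest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    E0 = dists[np.arange(len(X)), labels].sum()  # criterion (A.2) at the final centers
    return labels, centers, E0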
An alternative approach, first described by MacQueen in [77], processes one point at a time. Pseudocode for it is given in Algorithm 10. The algorithm loops through all the points, keeping track of how many were moved. If it finds that a point is assigned to one cluster but is closest to the center of another cluster (line 11), it reassigns it to the second cluster (line 12) and immediately updates the two affected centers (lines 13-16). We call such an approach iterative. One advantage of this approach is its low memory requirements (not all of the data has to be kept in main memory). Also, this method does not suffer from the empty-cluster problem (since the last point in a cluster becomes the center and cannot be moved out). A number of variants of this procedure exist, differing in how the ordering of points is produced or how the point to be moved is chosen. Recently Tarsitano [111] compared a number of different relocation and swapping strategies; the best performance was obtained by the Transferring pass-Global best-improving (TGBI) method, where at each step the best single transfer over all points (the move that decreased the criterion the most) was selected.
The iterative method can be modified to cluster an infinite stream of data [77]. It is then known as adaptive K-Means [20]. Such a method can be used when the data is too large to be stored or only becomes available with time.
Algorithm 10 Iterative K-Means Pseudocode
Require: Dataset W, consisting of N points in ℜ^d; k - number of clusters
1: create initial clusters S_1, ..., S_k
2: for i = 1, ..., k do
3:   n_i = |S_i|
4:   s_i = (1/n_i) Σ_{x∈S_i} x
5: end for
6: count = 1
7: while count > 0 do
8:   count = 0
9:   for j = 1, ..., N do
10:    h = argmin_{i=1,...,k} ||x_j − s_i||
11:    if x_j ∈ S_i and i ≠ h then
12:      Move x_j to S_h
13:      s_i = (1/(n_i − 1))(n_i s_i − x_j)
14:      s_h = (1/(n_h + 1))(n_h s_h + x_j)
15:      n_i = n_i − 1
16:      n_h = n_h + 1
17:      count = count + 1
18:    end if
19:  end for
20: end while
21: Return S_1, ..., S_k.
Clearly it is impossible to minimize the criterion E_0(P) directly under these assumptions, since the partition is not kept. Let I_1, ..., I_k denote regions in the space, with average vectors s_i. The criterion being optimized by adaptive K-Means is [15]:

E_0'(I_1, \dots, I_k, s_1, \dots, s_k) = \sum_{i=1,\dots,k} \int_{I_i} P(x) \|x - s_i\|^2 \, dx    (A.3)

where P(x) is a probability distribution on x. E_0' is usually minimized by a gradient descent procedure (Algorithm 11) that at each step t updates the center closest to the current point x_t with some learning rate η_t (line 5). Different ways of choosing a learning rate are suggested in [15, 20]. Aside from briefly stating known convergence results for Algorithm 11 in Section A.4.1, we will only be interested in K-Means on a finite number of points (i.e. in batch or iterative K-Means).
Algorithm 11 Adaptive K-Means Pseudocode
Require: Data stream W, consisting of points in ℜ^d; k - number of clusters; η_t - learning rate
1: create initial centers s_1, ..., s_k
2: while the stopping criterion is not reached do
3:   x_t is the next point
4:   h = argmin_{i=1,...,k} ||x_t − s_i||
5:   s_h = s_h + η_t (x_t − s_h)
6:   t = t + 1
7: end while
8: Return s_1, ..., s_k.
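A single adaptive update (line 5 of Algorithm 11) is a one-line computation. The sketch below uses the learning rate η_t = 1/n_h (the running-mean schedule); this particular schedule is an assumption made for illustration, since [15, 20] discuss several alternatives.

import numpy as np

def adaptive_kmeans_step(centers, counts, x):
    """One online update: move the closest center toward the new point x."""
    h = np.argmin(((centers - x) ** 2).sum(axis=1))
    counts[h] += 1
    eta = 1.0 / counts[h]                      # decreasing learning rate
    centers[h] = centers[h] + eta * (x - centers[h])
    return centers, counts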
A.4 Theoretical View of K-Means
A.4.1 Convergence of K-Means
Proof of finite convergence of batch K-Means type algorithms (for any metric) on a finite set of points was given for the first time in [106]. When Euclidean distance is used, as in E_0(P), the algorithm always converges to a local optimum. The proof, for the Euclidean distance case, is based on the following remarks. The algorithm consists of repeating two steps: moving a point (or points) to a different cluster and updating the cluster centers. It is easy to show that each of these steps decreases the value of criterion (A.2). Since there is a finite number of points, the decrease can be bounded from below by some ε > 0. Since the initial value of E_0(P) cannot exceed the scatter of the data, and since each step decreases the value of E_0(P) by at least ε, the algorithm has to terminate in a finite number of steps. Similar reasoning can be used to show finite convergence to a local minimum for the iterative procedure. We briefly mention that the proof for adaptive K-Means is more complicated. For decreasing learning rates satisfying \sum_{t\to\infty} \eta_t \to \infty and \sum_{t\to\infty} \eta_t^2 \to C, where C is some constant, Algorithm 11 converges almost surely (in a probabilistic sense) to a locally optimal solution [20], [77].
A.4.2 Quality of Solution
Finding the globally optimal solution P* (for a finite set of points),

P^* = \arg\min_P E_0(P),    (A.4)

is known to be NP-hard [31]. Matoušek [78] showed that it is possible to find a (1 + ε)-approximately optimal k-clustering P' (meaning that E_0(P') ≤ (1 + ε) E_0(P*)) in time polynomial in N for fixed ε and k.
Theorem A.4.1 For any N points in ℜ^d, k > 2 and ε ∈ (0, 1), a (1 + ε)-approximately optimal partition into k clusters can be found in time

O(N (\log N)^k \, \varepsilon^{-2k^2 d}).    (A.5)

A somewhat better result can be obtained for k = 2, or if the cluster sizes can be bounded from below by some constant. We do not know of any implementation of the algorithm of [78]. While this algorithm is polynomial in N, the constant factors make it impractical for real applications [67]. Recently an algorithm with better asymptotic computational complexity was described in [52]; again, the described algorithm was not implemented and is not practical.
It is easy to construct an example of a partition P that is a local optimum but that is arbitrarily bad, in the sense that the ratio E_0(P)/E_0(P*) is not bounded [67]. Consider four ordered points on a line with coordinates −y, 0, z, z + x, satisfying x < y < z. For k = 3, the optimal partition puts the points z and z + x in the same cluster and has value x²/2. However, a partition that puts −y and 0 in the same cluster is a local optimum with value y²/2. The ratio between the values of these partitions can be made arbitrarily large by increasing the ratio y/x.
However, the quality of the partition can be bounded in other ways. Let X be an N × d matrix with the points x_i as its rows, and let Y be the centered version of X. Notice that E_0(P) is not affected by centering of the data, since centering does not affect inter-point distances. Clearly

E_0(P) \le \sum_{j=1,\dots,N} \|x_j - \bar{x}\|^2 = N\,\mathrm{Var}(X) = N\,\mathrm{Var}(Y).    (A.6)
Zha et al. showed in [127] that K-Means clustering can be formulated as a trace maximization problem. Observe that:

E_0(P) = \sum_{j=1,\dots,N} \|x_j\|^2 - \sum_{i=1,\dots,k} \frac{1}{|S_i|} \sum_{x_j, x_h \in S_i} x_j^T x_h.    (A.7)
The first of these terms is a constant. The second term is the sum of the diagonal block elements of the within-cluster similarity matrix (assuming an indexing of points such that points from the same cluster form a block). Thus, the problem of minimizing (A.7) is equivalent to:

\max_H \mathrm{trace}(H X X^T H^t)    (A.8)

under some special structural constraints on the k × N matrix H, which contains the partition information. More specifically, H = (h_{ij}), with h_{ij} = 1/\sqrt{|S_i|} if x_j ∈ S_i and h_{ij} = 0 otherwise (for j = 1, ..., N and i = 1, ..., k). While the exact solution respecting the constraints on H is difficult to find, the relaxed problem (ignoring the structure of H) can be easily solved by applying the Ky Fan theorem:

\max_H \mathrm{trace}(H X X^T H^t) = \lambda_1(X X^T) + \dots + \lambda_k(X X^T),    (A.9)

where \lambda_i(X X^T) is the i-th largest eigenvalue of X X^T. Also, the optimal H* is given by:

H^* = Q A_k,    (A.10)

where A_k is a k × N matrix whose row i is the i-th eigenvector of X X^T, and Q is any orthogonal matrix. This approach leads to the following bound:

N\,\mathrm{Var}(X) \ge E_0(P) \ge \sum_{i=k+1,\dots,\min(N,d)} \lambda_i(X X^T).    (A.11)
This work also suggests several algorithms for performing K-Means clustering by extracting partition information from H* = A_k. One possibility is to represent data points by the corresponding columns of A_k and perform K-Means (with randomly selected initial centers) in this low-dimensional space, where points from different clusters should be well separated. Another approach is to orthogonalize the columns of A_k and assign points to the cluster whose index matches the row index of the largest element (by absolute value) in the new representation (see the paper for details). Empirical results showed that both of these methods give better quality partitions than batch K-Means with random initial centers.
Ding and He [27] further developed the results of [127]. They demonstrated that:

N\,\mathrm{Var}(Y) \ge E_0(P) \ge N\,\mathrm{Var}(Y) - \sum_{i=1,\dots,k-1} \lambda_i(Y Y^T).    (A.12)

They also suggested improvements to the relaxation approach of [127]. Note that the methods of [127] and [27] require finding k eigenvectors of an N × N matrix, making the proposed algorithms at least O(kN²).
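For a given dataset and partition, the bounds in (A.11)-(A.12) can be checked numerically. The following NumPy sketch is my own illustration; it assumes the labels are integers 0, ..., k−1 and that the N × d data matrix fits in memory. It computes E_0(P) together with the upper bound N Var(Y) and the lower bound (A.12), using the fact that the eigenvalues of YY^T are the squared singular values of Y.

import numpy as np

def kmeans_spectral_bounds(X, labels, k):
    Y = X - X.mean(axis=0)                       # centering does not change E0(P)
    centers = np.array([X[labels == i].mean(axis=0) for i in range(k)])
    E0 = ((X - centers[labels]) ** 2).sum()      # criterion (A.2) for this partition
    sv = np.linalg.svd(Y, compute_uv=False)      # eigenvalues of Y Y^T are sv**2
    upper = (Y ** 2).sum()                       # N * Var(Y)
    lower = upper - (sv[:k - 1] ** 2).sum()      # bound (A.12) of Ding and He [27]
    return lower, E0, upper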
A.4.3 Computational Complexity
As mentioned previously, the computational complexity of batch K-Means is O(tdN), where N is the number of points, d is the dimensionality of the data and t is the number of iterations performed. While in practice t is usually small (several dozen iterations), in some cases it can be much larger. Several interesting results on the computational complexity of K-Means have recently been reported.
Let us define the spread of a set W as

\Delta = \frac{\max_{i,j \in W} \|x_i - x_j\|}{\min_{i,j \in W} \|x_i - x_j\|}.

Har-Peled et al. [53] analyze bounds on the number of iterations until convergence. They show that in the one-dimensional case it is possible to construct a set of 2n points with 2 initial centers that requires n iterations to converge. An upper bound shown for K-Means in one dimension is O(n\Delta^2). Another special case is points on a grid:
Theorem A.4.2 For N points on a grid \{1, \dots, M\}^d the number of steps is at most dN^5 M^2.
While the authors of [53] were not able to show bounds for batch K-Means, they provided results for what they called "SinglePnt" (essentially an iterative version of K-Means, as in Algorithm 10) and "Lazy K-Means". In order to describe Lazy K-Means we need the following definition. A point x ∈ S_i is (1 + ε)-misclassified for centers s_i and s_j if \|x - s_i\| > (1 + \varepsilon)\|x - s_j\| for some ε > 0. The Lazy K-Means algorithm operates like regular batch K-Means, but it only moves points that are (1 + ε)-misclassified; in other words, it only moves points that are strongly misclassified. The algorithm terminates when no point is (1 + ε)-misclassified. Regular batch K-Means can be thought of as Lazy K-Means with ε = 0. The two results for these algorithms are [53]:
Theorem A.4.3 On any input X ∈ ℜ^d, SinglePnt makes at most O(kN^2 \Delta^2) steps before termination.
Theorem A.4.4 On any input X ∈ ℜ^d, ε > 0, Lazy K-Means makes at most O(N^2 \Delta^2 \varepsilon^{-3}) steps before termination.
A.5 Addressing Weaknesses of K-Means
Over the years a number of arguments have been made against using K-Means. In particular:
(i) K-Means assumes spherical clusters;
(ii) k has to be specified;
(iii) many unnecessary distance computations are made in the FOR-loop part of Algorithm 9;
(iv) the rate of convergence and the quality of the result are strongly dependent on the initial conditions.
Weakness (i) is inherent in the choice of the K-Means algorithm with Euclidean distance. Using a different distance metric or representation will result in a different clustering; for example, in spectral clustering K-Means is done on a transformed matrix of similarities [88]. K-Means-like algorithms can also be used on matrices of distances or similarities. Weakness (ii) is important from a practical point of view and has received a lot of attention (for example [80, 123, 97, 35, 49, 104]), both in the context of K-Means and as a general issue for many clustering algorithms.
A.5.1 Improving Efficiency of K-Means
There have been a number of papers on improving the efficiency of K-Means by using kd-trees, range trees and other structures ([90], [115], [5]). The overall idea of these papers is to maintain some information on the spatial distribution of the points. This allows one to avoid computing all distances between points and centers, since in many cases these are known to be far away from each other. These suggestions improve the efficiency of the algorithm, but they do not provide a better result (a lower E_0(P)). An argument is frequently made that a more efficient algorithm can be run more times with different starting points and the best result can be taken. This is often done in practice, but it is rather inelegant and requires drastic improvements in efficiency in order to be effective. A different approach to speeding up (and improving the quality of) K-Means is to perform clustering on bootstrap samples of the data and then to average similar cluster centroids [21]. The benefits are due to (i) using only a fraction of the data each time, thus obtaining multiple estimates of a solution instead of a single solution on the whole data, and (ii) having fewer points in each run, which decreases the number of iterations to convergence.
A.5.2 Role of Initial Conditions
Since only convergence to a local optimum is guaranteed, the initial condition determines the quality of the resulting partition and the number of iterations until convergence. A number of approaches for selecting a good initial condition have been discussed in the literature. Peña et al. [91] empirically analyzed four classical initialization methods with iterative K-Means:
1. Random Approach (RA) - the dataset is randomly divided into k clusters.
2. Forgy Approach (FA) - proposed by Forgy in 1965 [36]; k instances are chosen at random to be cluster centers and the rest are assigned to the cluster with the closest center.
3. MacQueen Approach (MA) - proposed by MacQueen in 1967 [77]; k instances are also chosen at random to be cluster centers. However, unlike in the Forgy Approach, the remaining points are assigned to the closest center one at a time, and the center is updated every time it gets assigned a new point.
Algorithm 12 Kaufman et al. [69] Initialization
Require: Dataset W, consisting of N points x ∈ ℜ^d; k - number of clusters
1: Let v = argmin_{x∈W} ||x − (1/N) Σ_j x_j||
2: V = {v}
3: while |V| ≠ k do
4:   for x_i ∈ W\V do
5:     for x_j ∈ W\V, j ≠ i do
6:       d_ij = ||x_i − x_j||
7:       D_j = min_{v∈V} ||x_j − v||
8:       c_ij = max(D_j − d_ij, 0)
9:     end for
10:  end for
11:  v = argmax_{x_z ∈ W\V} Σ_j c_zj
12:  V = V ∪ {v}
13: end while
14: Return V.
4. Kaufman Approach (KA) - proposed in [69]. An initial clustering is obtained by successive selection of representative instances until k such instances have been found. Algorithm 12 presents the pseudocode for this method. The goal is to incrementally add points to the set of centers, so that each new center has many points near it but is far from the previously selected centers. The quantity Σ_j c_zj (in line 11) is indicative of this: a larger value indicates a greater difference between the distances of points to the already chosen centers and their distances to the candidate point x_z. Hence the point with the largest such value is selected as the new center.
According to the empirical results given in the above paper, both RA and KA lead to comparable values of E_0(P) and are more efficient and robust than FA or MA. Using KA results in somewhat faster convergence than using RA. According to results reported in [48], however, RA leads to better results than FA. It is useful to note the computational complexity of these methods: RA is O(N), while FA and MA have the same complexity as one iteration of K-Means, O(kN). The KA approach, however, is quadratic in the number of points, O(k²N²).
A great number of other initialization methods have been suggested in the literature. Binary splitting (BS) initialization methods were suggested in [75] and [56]: at each step all clusters are split into two. The papers differ in the method used to split a cluster. Notice that these approaches can only produce a power-of-two number of clusters.
More recently, [16] and [109] described incremental splitting approaches, where at each step the cluster with the largest sum of squared errors is split into two, until the desired number of clusters is obtained. That is, the cluster S_i to be split is:

\arg\max_{S_i \in P} \sum_{x \in S_i} \|x - s_i\|^2.    (A.13)
The two papers differ in how the splitting is done.

Algorithm 13 KKZ [68] Initialization
Require: Dataset W, consisting of N points x ∈ ℜ^d; k - number of clusters
1: Let v = argmax_{x∈W} ||x||²
2: V = {v}
3: while |V| ≠ k do
4:   v = argmax_{x∈W\V} min_{u∈V} ||x − u||
5:   V = V ∪ {v}
6: end while
7: Return V.

The initialization method of [68], the pseudocode for which is given by Algorithm 13, starts by selecting the vector with the largest norm as the first center (line 1). At each successive step the point that is most distant from the previously selected centers is added as a new center (lines 4-5), until the desired number of centers is obtained. The complexity of this process is O(k²N), somewhat larger than that of one iteration of K-Means. This method was empirically shown to be better than the binary splitting method of [75].
[3] constructed new initialization methods based on partitioning the input space. The centers are randomly chosen from the partition cells/volumes, with each cell receiving a number of centers proportional to the fraction of points that it contains. The differences between the suggested methods revolve around the construction of the space partitioning. These methods were compared to the method of [68], and the authors reported that the cell-based methods performed better.
Bradley and Fayyad suggested a method for refining initial points for K-Means [11]. The pseudocode is given by Algorithm 14. The method consists of choosing J small random subsamples of the data and clustering them with K-Means (lines 2-6 of Algorithm 14). This process results in J "k-centers" sets. Then K-Means clustering is done J times over the set of centers, each time with the initial condition given by the centers of a different sample (lines 7-9). The final set of centers that gives the smallest value of E_0(P) (selected in line 10) is taken as the initial condition for the whole data. This approach is also applicable to clustering large datasets, where K-Means cannot be applied to the whole dataset. The authors argue that the additional costs are small while the gains are significant.
Algorithm 14 Bradley and Fayyad [11] Initialization
Require: Dataset W, consisting of N points x ∈ ℜ^d; k - number of clusters; J - number of samples
1: C = ∅
2: for j = 1, ..., J do
3:   Let W_j be a random subsample of the data
4:   Let C_j be the set of centers returned by K-Means with random initial conditions on set W_j
5:   C = C ∪ C_j
6: end for
7: for j = 1, ..., J do
8:   Let F_j be the set of centers returned by K-Means with initial centers C_j on set C, and let E_j be the corresponding value of the partition
9: end for
10: h = argmin_{j=1,...,J} E_j
11: Return F_h.

The empirical results are shown for different synthetic datasets using the random initialization method, and refinement with J = 1 and J = 10, using 10% of the data for a single subsample. The accuracy is much better for J = 10 than for the other two methods.
As can be seen from this brief overview, there is a great variety of initialization methods. Each study described evaluates only a few of these methods at a time, different datasets are used for evaluation in almost all of these studies, and the results sometimes suggest different conclusions about the quality of the methods. Therefore it remains unclear which method is best, or whether there is in fact a single best method. The only general piece of information seems to be that starting with centers that are spread out leads to better results than having initial centers close to each other. This has been confirmed by a number of studies, for example [128].
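As an illustration, the farthest-first selection of Algorithm 13 can be written in a few lines of NumPy. This is a sketch with variable names of my own; as in the original, the first center is the point with the largest norm.

import numpy as np

def kkz_init(X, k):
    centers = [X[np.argmax((X ** 2).sum(axis=1))]]          # largest-norm point (line 1)
    d = ((X - centers[0]) ** 2).sum(axis=1)                  # squared distance to nearest chosen center
    while len(centers) < k:
        nxt = X[np.argmax(d)]                                # farthest point from the current centers
        centers.append(nxt)
        d = np.minimum(d, ((X - nxt) ** 2).sum(axis=1))
    return np.array(centers)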
A.5.3 Other Algorithms for Optimizing the K-Means Criterion
Kanungo et al. [67] suggested a method for trying to escape poor local solutions. The idea is to replace one of the centers by the nearest point in the dataset. When batch K-Means converges, this so-called single-swap heuristic is used to produce a new initial condition. This method was inspired by theoretical results obtained in [78]. Experimental evidence showed that this approach can be effective in improving the value of the final solution.
Zhang et al. [130] suggested combining iterative K-Means with Local Search (LKM). The decision to move a point x from one cluster to another is made not on the basis of its distance to the centers, but on the basis of the change in the K-Means criterion after the move (which can be computed in constant time given the centers and cluster sizes). Mathematically, iterative K-Means (Algorithm 10) moves a point if

\|x - s_j\|^2 > \min_{i=1,\dots,k} \|x - s_i\|^2,    (A.14)

while LKM moves a point if

\frac{n_j}{n_j - 1}\|x - s_j\|^2 > \min_{i=1,\dots,k} \frac{n_i}{n_i + 1}\|x - s_i\|^2,    (A.15)
where n_i = |S_i| is the size of the i-th cluster. The pseudocode for LKM is given in Algorithm 15. This version differs slightly from the one given in [130], to make the similarities and differences with Algorithm 10 stand out. Here, instead of comparing distances to the centers, the smallest of the quantities δ_t (lines 11-13) is compared with v (in line 15) to determine whether a particular point has to be moved. (In other words, (A.15) is used instead of (A.14).) LKM converges in a finite number of steps by the same reasoning as classical K-Means. The authors show that this modified algorithm can get stuck only in a subset of the local minima that can trap standard K-Means. They empirically demonstrate that the method achieves better clustering performance and requires fewer iterations than K-Means. This method can also be used with aggregated data, i.e. data where points have different weights.
Attempts have also been made at finding global solutions using genetic algorithms, e.g. [17], [7]. The population consists of multiple "individuals" or members, each representing a partition by encoding the cluster assignment of every point in the data [17] or the cluster centers [7]. Each member of the population is evaluated using a fitness function (which computes the corresponding value of the K-Means criterion). Based on this evaluation some members are removed from the population, while others are combined in some way to produce the next population (see the papers for the details). Both [17] and [7] report that genetic algorithms outperform K-Means. Neither paper, however, compared computational costs. Genetic algorithms have to be run for many generations, each consisting of multiple members, and the fitness function has to be computed for each member; this makes them more computationally expensive than batch K-Means with multiple random initial conditions.
Another global approach was described in [74]; the pseudocode for this method is given by Algorithm 16. The authors suggest solving the K-Means clustering problem recursively. Given an optimal solution for k − 1 clusters (line 4), a solution for k clusters is obtained by running K-Means N times, starting with the k − 1 optimal centers and putting the k-th center at each one of the data points (the loop in lines 5-9). Since the optimal center for k = 1 is known (it is the center of the whole dataset - lines 1-3), the procedure is well-defined. As an added bonus it produces good clusterings for fewer than k clusters. It is easy to see, however, that the number of calls to the K-Means procedure is kN, since for each 1 < k_0 ≤ k standard K-Means is called N times. In other words, there is an extra factor of kN in the running time.
Algorithm 15 LKM Pseudocode
Require: Dataset W, consisting of N points in ℜ^d; k - number of clusters
1: create initial clusters S_1, ..., S_k
2: for i = 1, ..., k do
3:   n_i = |S_i|
4:   s_i = (1/n_i) Σ_{x∈S_i} x
5: end for
6: count = 1
7: while count > 0 do
8:   count = 0
9:   for j = 1, ..., N do
10:    Let v = (n_i/(n_i − 1)) ||x_j − s_i||², where x_j ∈ S_i
11:    for t = 1, ..., k; t ≠ i do
12:      δ_t = (n_t/(n_t + 1)) ||x_j − s_t||²
13:    end for
14:    Let h = argmin_{t=1,...,k; t≠i} δ_t
15:    if δ_h − v < 0 then
16:      Move x_j to S_h
17:      s_i = (1/(n_i − 1))(n_i s_i − x_j)
18:      s_h = (1/(n_h + 1))(n_h s_h + x_j)
19:      n_i = n_i − 1
20:      n_h = n_h + 1
21:      count = count + 1
22:    end if
23:  end for
24: end while
25: Return S_1, ..., S_k.
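Because the centers and cluster sizes are maintained incrementally, the move test in (A.15) can be evaluated in constant time per candidate cluster. Below is a small NumPy sketch of that test only (not the full LKM loop); the function and argument names are my own.

import numpy as np

def lkm_should_move(x, centers, sizes, i):
    """LKM move test (A.15): return target cluster h, or None if x stays in cluster i."""
    n_i = sizes[i]
    if n_i <= 1:
        return None                                   # the last point of a cluster is never moved
    gain = n_i / (n_i - 1.0) * ((x - centers[i]) ** 2).sum()
    deltas = sizes / (sizes + 1.0) * ((centers - x) ** 2).sum(axis=1)
    deltas[i] = np.inf                                # do not compare the cluster with itself
    h = int(np.argmin(deltas))
    return h if deltas[h] < gain else None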
The authors of [74] suggest several heuristics for speeding up their algorithm. One heuristic involves computing an upper bound on the reduction of E_0(P) when adding a new center at point x, instead of running K-Means from the corresponding initial condition; K-Means is run only with the point that gives the largest estimate of error reduction. Another heuristic involves constructing k-d trees and using bucket centers as possible insertion points for the new center, instead of actual data points. Experimental results show that the two heuristic methods give results close to the full global K-Means method, and outperform regular K-Means with multiple restarts. However, the method remains much slower than regular K-Means. Inasmuch as [74] is based on obtaining good initial conditions for K-Means with k clusters (by recursively solving the clustering problems with 1 ≤ k_0 < k clusters), the description of this work would also be appropriate for Section A.5.2 on initial conditions.
Finally, the spectral relaxation / PCA-based methods of [127] and [27] have already been discussed in Section A.4.
Algorithm 16 Global K-Means Pseudocode
Require: Dataset W, consisting of N points in ℜ^d; k - number of clusters
1: if k == 1 then
2:   Return \bar{x} = (1/N) Σ_{x∈W} x
3: end if
4: V_0 = {s_1, ..., s_{k−1}} - centers returned by Global K-Means called on W with the number of clusters set to k − 1
5: for j = 1, ..., N do
6:   V' = V_0 ∪ {x_j}
7:   V_q = LocalKMeans(W, V'): i.e. run standard K-Means to convergence with the given set of centers as the initial condition
8:   q_j = E_0(P(V_q)): evaluate the quality of this solution
9: end for
10: h = argmin_{j=1,...,N} q_j
11: Return V_h.
To our knowledge, no direct comparison between the methods discussed in this section has been done. Therefore we cannot say how they compare in terms of partition quality and running time.
A.5.4 Alternative Criteria
It is interesting to note that there has been work on alternative clustering criteria whose optimization was empirically shown to lead to better partitions with respect to the criterion E_0. Zhang has developed an approach similar to K-Means, called K-Harmonic Means (KHM) [129]. It is based on minimizing the harmonic mean of the distances from points to centers. It is in the form of criterion (A.1), with:

q(x_j, P) = \frac{k}{\sum_{i=1,\dots,k} 1/\|x_j - s_i\|^p}.    (A.16)

In earlier versions of his work, Zhang used p = 2, but then moved to higher values of p that give better performance.
Consider a generalized clustering problem where the membership of a point in a cluster i is given by a function m_{KM}(x_j, S_i), while the point's contribution to performance (weight) is given by w_{KM}(x_j, P). (Such a framework has been discussed in [48].) In K-Means,

w_{KM}(x_j, P) = 1, \quad \forall j = 1, \dots, N,    (A.17)

and

m_{KM}(x_j, S_i) = 1 if i = \arg\min_{h=1,\dots,k} \|x_j - s_h\|, and 0 otherwise,    (A.18)

i.e. membership is hard (a point contributes to only one cluster) and the weights are constant.
The algorithm optimizing (A.16) can be seen as having a dynamic weighting function and a soft membership function. In KHM,

w_{KHM}(x_j, P) = \frac{\sum_{i=1,\dots,k} \|x_j - s_i\|^{-p-2}}{\left(\sum_{i=1,\dots,k} \|x_j - s_i\|^{-p}\right)^2}    (A.19)

and

m_{KHM}(x_j, S_i) = \frac{\|x_j - s_i\|^{-p-2}}{\sum_{h=1,\dots,k} \|x_j - s_h\|^{-p-2}}.    (A.20)

Experiments ([129], [48]) show KHM to perform better than KM, and to have little dependence on initialization. Empirical comparison suggests that KHM is better at finding good K-Means clusterings than the K-Means algorithms themselves. The work of [48] also compared KHM to Gaussian Expectation-Maximization (GEM) and to two hybrids of KM and KHM, using respectively hard membership with dynamic weights and soft membership with constant weights. The conclusions were that soft membership (fuzziness) leads to better clusterings, but dynamic weighting by itself can benefit hard-membership clusterings by "forcing" the algorithm to take care of points that are far away from any center. KHM gave the best performance, while GEM did worse than regular KM.
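The quantities (A.19) and (A.20) are straightforward to compute for a whole dataset at once. The NumPy sketch below does so; the value p = 3.5 and the small eps guard against zero distances are illustrative assumptions on my part.

import numpy as np

def khm_membership_and_weights(X, centers, p=3.5, eps=1e-12):
    """Soft memberships (A.20) and dynamic weights (A.19) for K-Harmonic Means."""
    d = np.sqrt(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)) + eps
    m = d ** (-p - 2)
    membership = m / m.sum(axis=1, keepdims=True)            # (A.20)
    weights = m.sum(axis=1) / (d ** (-p)).sum(axis=1) ** 2    # (A.19)
    return membership, weights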
A.6 Conclusions
In this section we have attempted to describe a number of different approaches to the practical usage and theoretical analysis of K-Means clustering. This review is by no means exhaustive. In our experiments we use batch K-Means, since the datasets we work with fit into machine memory. When no outside information was available we used the RA approach, randomly partitioning the data into k clusters. The clustering procedure is applied a number of times with different initial conditions and the best partition is retained. This approach was shown to give good results in [48] and [91]. Since there is no thorough comparison between the known initialization methods (a number of which are significantly more expensive than RA), and conducting one would be beyond the scope of this work, we decided that the use of this simple and popular initialization scheme is appropriate.
Appendix B Logistic Regression
Logistic regression is a well-known statistical method, used in many applications. We first briefly discuss linear regression, and then describe logistic regression as it is used in classification.
B.1 Linear Regression
Linear regression is a method for approximating a functional relation between inputs and outputs, described as a set of points (x_i, y_i), i = 1, ..., N, where x_i ∈ ℜ^d are real-valued input vectors and y_i ∈ ℜ are real-valued outputs [54]. This is achieved by finding parameters w, b minimizing the residual sum of squares between the estimates and the actual values y_i:

RSS(w, b) = \sum_{i=1}^{N} (y_i - (< w, x_i > + b))^2.    (B.1)

For convenience, let us add an extra dimension with value 1 to all input vectors, define an N × (d+1) matrix X consisting of the augmented input vectors, and include b in the vector of coefficients w' = (w, b) ∈ ℜ^{d+1}. The RSS can now be rewritten in matrix notation as:

RSS(w') = (y - Xw')^t (y - Xw'),    (B.2)

where y is the vector of outputs. The parameter vector \hat{w}' minimizing (B.2) is given by:

\hat{w}' = (X^t X)^{-1} X^t y    (B.3)

and the estimate of y is given by:

\hat{y} = X (X^t X)^{-1} X^t y.    (B.4)

If X^t X is singular, \hat{w}' can have multiple values. They all, however, lead to the same estimates \hat{y} via (B.4). To define a unique solution, additional specifications are necessary.
B.2 Regularization
The standard way of specifying additional constraints is by regularization [54]. One of the most popular ways of doing regularization is to use the following formula for \hat{w}', instead of (B.3):

\hat{w}' = (X^t X + \lambda I)^{-1} X^t y,    (B.5)

where I is the identity matrix and λ > 0. There are several ways in which this approach is advantageous:
• Adding a small multiple of the identity matrix to X^t X, i.e. replacing X^t X with (X^t X + λI), guarantees that an inverse exists and therefore leads to a unique solution and alleviates numerical complications.
• Regularization is also a way of controlling model complexity. It has been noted that models can overfit the data by having large coefficients. This led to the idea of ridge regression - explicitly penalizing large coefficients in the RSS:

minimize_{w'} RSS_r(w') = (y - Xw')^t (y - Xw') + \lambda \|w'\|_2^2.    (B.6)

It turns out that the minimum of (B.6) is given by precisely \hat{w}' in (B.5) (see [54]). Note that for large problems it is more efficient to optimize (B.6) using gradient descent type procedures, rather than matrix inverses [10, 131].
Another regularization method, called the lasso, was suggested in [112]. It uses an L1 penalty on the coefficients, resulting in the following criterion to be minimized:

RSS_l(w') = (y - Xw')^t (y - Xw') + \lambda \|w'\|_1.    (B.7)

There is no analytical solution to minimizing (B.7), and numerical methods have to be used to find the optimal vector of coefficients. Both the lasso and ridge regression lead to feature selection by shrinking feature weights towards zero. However, the lasso sets the coefficients of some variables to exactly 0, thus removing them from the model, while ridge regression shrinks coefficient values but usually does not remove them completely. In this sense the lasso is better for feature selection [112].
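As an illustration of (B.5), the ridge solution can be computed in a few lines of NumPy. This is a sketch that follows the convention used above of penalizing the full augmented coefficient vector, including the bias term.

import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Ridge solution (B.5): w' = (X^t X + lambda*I)^{-1} X^t y, with a column of ones for the bias."""
    Xa = np.hstack([X, np.ones((len(X), 1))])     # augment inputs with a constant 1
    d = Xa.shape[1]
    return np.linalg.solve(Xa.T @ Xa + lam * np.eye(d), Xa.T @ y)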
B.3 Logistic Regression
The aim of logistic regression is to model the conditional probabilities of the class given the data. For such a task linear regression cannot be used, because the outputs have to be restricted to the range [0, 1]. When dealing with a two-class problem (i.e. l ∈ {−1, +1}), this is achieved by applying the logistic transformation to the linear regression problem, leading to the following formula:

p(l|x) = \frac{1}{1 + \exp(-l[< w, x > + b])}.    (B.8)

Logistic regression can equivalently be interpreted as modeling the log-odds ratio with a linear function:

\ln \frac{p(l = 1|x)}{p(l = -1|x)} = < w, x > + b.    (B.9)

Logistic regression is most commonly trained by the Maximum Likelihood (ML) method, in other words by solving for the w, b that maximize the probability of seeing the data x_i, i = 1, ..., N, under the assumption that events related to different x's are independent:

maximize_w \prod_{i=1}^{N} p(l_i|x_i, w, b) = maximize_w \sum_{i=1}^{N} \ln p(l_i|x_i, w, b) = minimize_w \sum_{i=1}^{N} \ln(1 + \exp(-l_i[< w, x_i > + b])).

This optimization problem may be ill-conditioned (i.e. it may have multiple or no solutions). The usual way to address this is by adding a multiple of the identity matrix to the Hessian, which amounts to the following formulation:

minimize_w \; \lambda < w, w > + \sum_{i=1}^{N} \ln(1 + \exp(-l_i[< w, x_i > + b])).    (B.10)
The reasoning behind this is exactly the same as in “ridge” linear regression (B.6).
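For completeness, the penalized objective (B.10) and its gradient - which is what a gradient-descent-type optimizer needs - can be written as follows. This is a NumPy sketch with labels l_i ∈ {−1, +1}; the function name and interface are my own, not those of the software used in the experiments.

import numpy as np

def logreg_loss_grad(w, b, X, y, lam):
    """Objective (B.10): lam*<w,w> + sum_i log(1 + exp(-y_i(<w,x_i>+b))) and its gradient."""
    z = y * (X @ w + b)
    loss = lam * (w @ w) + np.logaddexp(0.0, -z).sum()
    s = -y / (1.0 + np.exp(z))        # derivative of log(1+exp(-z_i)) w.r.t. the margin, times y_i
    grad_w = 2 * lam * w + X.T @ s
    grad_b = s.sum()
    return loss, grad_w, grad_b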
B.4 Multinomial Logistic Regression
Logistic regression can be extended to multiclass problems (see for example [72]). For a point x_i, let y_i^h = 1 iff l_i = h, and y_i^h = 0 otherwise. Then:

p(y^h = 1|x) = \frac{\exp(< w_h, x > + b_h)}{\sum_{j=1}^{K} \exp(< w_j, x > + b_j)}    (B.11)

for h = 1, ..., K. The parameters w_h can be found by solving the following optimization problem:

minimize_{w_h, b_h} \sum_{i=1}^{N} \left[ \ln \sum_{h=1}^{K} \exp(< w_h, x_i > + b_h) - \sum_{h=1}^{K} y_i^h [< w_h, x_i > + b_h] \right].    (B.12)

Regularization can be included by adding a penalty term on the vectors w_h, similarly to the two-class case.
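The class probabilities (B.11) are the familiar softmax function. A small NumPy sketch follows; the subtraction of the row maximum is a standard numerical-stability detail not discussed above.

import numpy as np

def softmax_probs(X, W, b):
    """Class probabilities (B.11): p(y^h=1|x) = exp(<w_h,x>+b_h) / sum_j exp(<w_j,x>+b_j)."""
    scores = X @ W.T + b                           # W has one row per class, b has one entry per class
    scores -= scores.max(axis=1, keepdims=True)    # stabilize the exponentials
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)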
B.5 Bayesian Logistic Regression
A Bayesian approach to logistic regression is to specify a prior distribution for the coefficients w_j and then to find maximum a posteriori (MAP) estimates of them. Setting the prior distribution of each coefficient independently to be Gaussian with mean 0 and variance σ² leads to "ridge" logistic regression and is equivalent to (B.10) with λ = 1/(2σ²). To see this, note:

p(w) = \prod_{j=1}^{d} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{w_j^2}{2\sigma^2}\right).    (B.13)

Using Bayes' theorem leads to a formulation of Maximum Likelihood that includes the prior distribution of the parameters. The optimization problem then becomes:

maximize_w \prod_{i=1}^{N} p(l_i|x_i, w, b)\, p(w)    (B.14)
= maximize_w \sum_{i=1}^{N} \ln p(l_i|x_i, w, b) + \sum_{j=1}^{d} \left(-\ln\sigma - \frac{1}{2}\ln 2\pi - \frac{w_j^2}{2\sigma^2}\right)    (B.15)

which is equivalent to:

minimize_w \sum_{i=1}^{N} \ln(1 + \exp(-l_i(< w, x_i > + b))) + \frac{1}{2\sigma^2}\sum_{j=1}^{d} w_j^2    (B.16)
= minimize_w \; \lambda \|w\|_2^2 + \sum_{i=1}^{N} \ln(1 + \exp(-l_i[< w, x_i > + b])).    (B.17)

It is also possible, in a hierarchical fashion, to specify a prior distribution for the variances σ_j². Using an exponential prior with parameter γ for the parameter variances σ_j² leads to a Laplace prior for the coefficients, and the following optimization problem:

minimize_w \; \lambda \|w\|_1 + \sum_{i=1}^{N} \ln(1 + \exp(-l_i[< w, x_i > + b]))    (B.18)

with λ = \sqrt{2/\gamma}.
Recently, an efficient implementation of Bayesian logistic regression with Gaussian and Laplace priors was described in [41]. Software for binary1 and for multi-class2 classification is available online. This is the software that we use in our experiments.
1 2
http://www.stat.rutgers.edu/˜madigan/BBR/ http://www.stat.rutgers.edu/˜madigan/BMR/
127
Appendix C Support Vector Machines for Classification
C.1
Introduction
In many real-life situations we want to be able to assign an object to one of several categories based on some of its characteristics. For example, based on the results of several medical tests we want to be able to say whether a patient has a particular disease, should be recommended a specific treatment or is healthy. In computer science such situations are described as classification problems. A binary (two-class) classification problem can be described as follows: given a set of labeled points (xi , li ), i = 1, N , where xi ∈ ℜd are vectors of features and li ∈ {−1, +1} are class labels, construct a rule that correctly assigns a new point x to one of the classes. The vectors xi in this formulation correspond to objects, and the dimensions of the space are the features or characteristics of these objects. For example, a vector may represent a person, with individual features corresponding to the measurements given by some medical tests — blood pressure, cholesterol level, white cell count and so on. Using labels {0, . . . , K − 1} instead of {−1, +1} we can describe a multiclass problem with K classes. A classification method or algorithm is a particular way of constructing a rule, also called a classifier, from the labeled data and applying it to the new data. Support Vector Machines (SVM) recently became one of the most popular classification methods. They have been used in a wide variety of applications such as text classification [62], facial expression recognition [79], gene analysis[47] and many others. Support Vector Machines can be thought of as a method for constructing a special kind of rule, called a linear classifier, in a way that produces classifiers with theoretical guarantees of good predictive performance (the quality of classification on unseen data). The theoretical foundation of this method is given by statistical learning theory [114].
128 While no general method for producing non-linear rules with such properties is known, the socalled “kernel trick” can be used to construct special kinds of non-linear rules using the same SVM methodology.
C.2
Linear Classifiers
A classifier can frequently be represented using a discriminant function f (x) : ℜd → ℜ. In a two-class case, a point is assigned to the positive class if f (x) ≥ 0, and to the negative class otherwise. A classifier is linear if its discriminant function f (x) can be expressed as: f (x; w, b) =< w, x > +b
(C.1)
where w, b are parameters of the function and denotes the inner product of two vectors. A set S of points (xi , li ), i = 1, N , where li ∈ {−1, +1} are class labels, is called linearly separable if a linear classifier can be found so that li f (xi ) > 0, ∀i = 1, . . . , N . Algorithm 17 Online Perceptron Learning Algorithm [19] Require: A linearly separable set S, learning rate η ∈ ℜ+ 1: w0 = 0; b0 = 0; k = 0; 2: R = max ||xi || 1≤i≤N
3: 4: 5: 6: 7: 8: 9: 10: 11: 12:
while at least one mistake is made in the for loop do for i = 1, . . . , N do if li [< wk , xi > +bk ] ≤ 0 then wk+1 = wk + ηli xi bk+1 = bk + ηli R2 (updating bias1 ) k =k+1 end if end for end while Return wk , bk , where k is the number of mistakes
The use of linear classifiers in machine learning can be traced back to Rosenblatt’s work on the perceptron [101], though they have been used before that in statistics. The pseudocode for the perceptron learning, taken from [19], is given as Algorithm 17. It works by taking one instance at a time and predicting its class. If the prediction is correct, no adjustments are made. If the prediction is wrong, the parameters, describing a hyperplane, are moved in the direction of the point where the mistake occurred. A scalar value, η, called the learning rate determines how far the parameters are moved. 1 Usually this update is given as bk+1 = bk + ηli . However, such formulation makes the result of Novikov’s theorem, given below, dependent on the exact value of the bias, b∗ , of the separating hyperplane. This reflects the fact that R can be increased simply by moving all points away from the origin. The update used in the Algorithm 17 (suggested in [19]) compensates for that.
129
w
γ
−b
Figure C.1: Here (w, −b) define the separating hyperplane and γ is the size of the margin. The relation between w and γ is discussed in the text. The choice of a learning rate can significantly affect the number of iterations until convergence on a linearly-separable set.
C.2.1
Margin and VC dimension
The idea of margin (Figure C.1) has come to play an important role in the theory of statistical learning. A hyperplane < w∗ , x > +b∗ = 0, ||w∗ || = 1
(C.2)
is called γ-margin separating hyperplane if li [< w∗ , x > +b∗ ] ≥ γ
(C.3)
for all (xi , li ) in set S. Here γ (clearly γ > 0) is the margin. Any separating hyperplane can be converted into this form. Suppose li [< w, x > +b] ≥ 1. Then, by setting w∗ =
w ||w||
and b∗ =
b ||w|| ,
we obtain a γ-margin separating hyperplane with γ =
(C.4) 1 ||w|| .
The first result suggesting a relation between the margin and predictive ability of a classifier was Novikov’s theorem. Theorem C.2.1 (Novikov’s Theorem [89]) Let S, |S| = N be a training set, i.e. a set of points with class labels, and let R = max ||xi || 1≤i≤N
(C.5)
130 Suppose that there exists a γ-margin separating hyperplane (w, b) such that li [< w, xi > +b] ≥ γ, ∀ 1 ≤ i ≤ N . Then the number of mistakes made by the on-line perceptron algorithm on S is at most 2 2R (C.6) γ This theorem effectively proves that for a linearly separable set of points the perceptron algorithm finds a separating hyperplane after making a finite number of mistakes. The number of mistakes is directly proportional to the ratio of the volume of the data to the measure of separation of the classes, γ. Note that Novikov’s theorem shows convergence on a periodic training sequence. This result has been extended to an arbitrary infinite sequence of points (each belonging to one of two region that can be linearly separated) by Aizerman et. al. [2]. Novikov’s theorem bounds the number of errors in the training stage. But in classification we are interested in the accuracy of a classifier on unseen data. Such a number clearly cannot be computed exactly, but it turns out it can be bounded. Before proceeding let us define the notion of Vapnik-Chevronenkis dimension. Definition C.2.1 The Vapnik-Chervonenkis (VC) dimension of a set of classifiers is the maximum number h of points that can be separated into all possible 2h ways using classifiers in this set. If for any n there exists a set of n points that can be separated into two classes in all possible ways, the VC dimension of this set of functions is said to be infinite. Intuitively, VC dimension measures the complexity of the classifiers in the set. If the classifiers are simple, they have small VC dimension. If they are complicated, the VC dimension is large. For example, the VC dimension of hyperplanes in Rd is known to be d + 1. The following two results bound the VC dimension of the set of γ-margin separating hyperplanes and the probability of misclassifying an unseen instance with such a hyperplane chosen on the training data. Theorem C.2.2 Theorem 5.1 in [114] Let set S, |S| = N belong to sphere of radius R. Then the set of γ-margin separating hyperplanes has VC dimension h bounded by: 2 ! R h ≤ min ,d + 1 γ
(C.7)
Theorem C.2.3 Corollary to Theorem 5.1 in [114] With probability 1 − η the probability of a test example not being separated correctly by a γ-margin hyperplane has the bound ! r 4m m E 1+ 1+ + Perror ≤ N 2 NE
(C.8)
131 where E=4
η h(ln 2N h + 1) − ln 4 , N
(C.9)
m is the number of training examples not separated correctly by γ-margin hyperplane, and h is the bound of the VC dimension given in theorem C.2.3. The bound on the probability of making a mistake on unseen data is proportional to the VC dimension of the set of classifiers. In other words, everything else being equal, a classifier with a lower VC dimension is likely to be a better predictor. Notice that this result does not depend on any assumptions about the distribution of the test data — it is a “distribution-free” bound. Since the upper bound on VC dimension is inversely proportional to the margin, the strategy for building a good classifier is to have as large a margin as possible while keeping the number of errors on the training set low. This is somewhat similar to the idea regularization but is motivated from the perspective of the statistical learning theory. It can also be seen as a version of Occam’s razor: we want to use the simplest classifier that makes no (or fewest) mistakes on the training set.
C.2.2 The Maximal Margin in Separable Case
We have seen in the previous section that the probability of making a mistake is inversely proportional to the size of the margin. Thus we would like to find a classifier with the largest margin that still correctly separates the training points. The maximal-margin separating hyperplane can be found by solving the following optimization problem: minimizew,b < w, w >
(C.10)
li [< w, xi > +b] ≥ 1, ∀i = 1, . . . , N
(C.11)
subject to:
One method for solving optimization problems involves introducing Lagrange multipliers, αi , for the constraints [10]. In this case, the so-called Lagrangian function is given by: N
L(w, b, α) =
X 1 αi [li [< w, xi > +b] − 1] < w, w > − 2 i=1
(C.12)
132 Taking derivatives with respect to w and b gives: w=
N X
li αi xi
(C.13)
i=1
and 0=
N X
li αi
(C.14)
i=1
Note that w is given by a linear combination of the training points. We will return to this observation later on. Re-substituting these into the primal problem (C.12) gives a dual formulation: maximize W (α) =
N X i=1
αi −
N X
li lj αi αj < xi , xj >
(C.15)
i,j=1
subject to: N X i=1
li αi = 0, αi ≥ 0, ∀i = 1, . . . , N
Let α∗ be a vector of parameters optimizing W (α). The the weight vector w∗ =
(C.16) N P
i=1
maximal margin hyperplane, with margin γ=
1 . ||w∗ ||2
li α∗i xi is a
(C.17)
The parameter b is not present in the dual problem and has to be computed from the primal constraints (C.11). Notice that, because of (C.13), the classifier f (x) can be expressed as: f (x) =< w, x > +b =
N X
li αi < xi , x > +b
(C.18)
i=1
If αi = 0, then xi is not used in decision rule and can be discarded. Points xi such that αi 6= 0 lie on the margin and are called support vectors. They determine the decision boundary.
C.2.3 Extension for the non-separable case
Until now we have discussed only the linearly-separable case. However, similar machinery can be used to handle the non-separable case. The main idea is that those points that lie on the wrong side of the hyperplane are explicitly penalized by introducing slack variables, ξ, that control how far on the wrong side of a hyperplane a point lies.
133 The optimization problem becomes: N
X 1 minimizew,b < w, w > +C ξi 2
(C.19)
i=1
subject to: li [< w, xi > +b] ≥ 1 − ξi , ξi ≥ 0, ∀i = 1, . . . , N
(C.20)
where the parameter C, controlling the trade-off between the size of the margin and the training errors, is chosen by the user. The dual then becomes: maximize
N X i=1
αi −
N X
li lj αi αj < xi , xj >
(C.21)
i,j=1
subject to: N X i=1
li αi = 0, C ≥ αi ≥ 0, ∀i = 1, . . . , N
(C.22)
Once again the solution is given by a linear combination of inner products with support vectors: N P w= li αi xi . There are no general methods for choosing parameters b in a non-separable case - it is i=1
usually set to optimize some performance measure on a training or a validation set. The same approach is often taken for choosing a value of C.
C.2.4 Multi-class classification
Many methods exist for building a multiclass classifier system from binary classifiers (one-vs-all, one-vs-one [38], error-correcting output codes (ECOC) [26], Directed Acyclic Graph (DAG) [93]). In all of these approaches multiple binary classifiers are trained separately and their predictions are then combined. For example, in one-vs-all classification with K classes, K classifiers are constructed. Each recognizes points of one of the classes as positive and those of all others as negative. A new point is assigned to the class whose classifier gives it the highest score among all K classifiers. In one-vs-one, a classifier is trained for each pair of classes. Classification usually proceeds as follows: each of the K(K-1)/2 classifiers makes a prediction and the number of votes for each class is counted. The point is assigned to the class that has received the most votes. DAG and ECOC are more complicated in how classifiers are trained and how their predictions are combined.

Special multiclass methods have been developed for SVM. They usually involve solving a single optimization problem. In [119] the optimization problem to be solved (over K separating hyperplanes and NK slack variables) is:

\min_{w_j, b_j, \xi} \; \frac{1}{2} \sum_{j=1}^{K} \langle w_j, w_j \rangle + C \sum_{i=1}^{N} \sum_{j \neq l_i} \xi_i^j    (C.23)

subject to:

\langle w_{l_i}, x_i \rangle + b_{l_i} \geq \langle w_j, x_i \rangle + b_j + 2 - \xi_i^j, \quad \forall j \neq l_i    (C.24)

\xi_i^j \geq 0, \quad \forall i = 1, \ldots, N

A new point x is classified as follows:

f(x) = \arg\max_{j=1,\ldots,K} (\langle w_j, x \rangle + b_j)    (C.25)
Crammer and Singer [18] suggested a somewhat different formulation, requiring only N slack variables:

\min_{w_j, \xi} \; \frac{1}{2} \sum_{j=1}^{K} \langle w_j, w_j \rangle + C \sum_{i=1}^{N} \xi_i    (C.26)

subject to:

\langle w_{l_i}, x_i \rangle - \langle w_j, x_i \rangle \geq 1 - \xi_i    (C.27)

\forall i = 1, \ldots, N, \; j = 1, \ldots, K, \; j \neq l_i; \quad \xi_i \geq 0.

Note that this formulation does not involve the coefficients b_j. The resulting decision function is:

f(x) = \arg\max_{j=1,\ldots,K} \langle w_j, x \rangle    (C.28)
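To make the one-vs-one voting rule described above concrete, here is a minimal Python sketch. The pairwise decision functions are assumed to be given (they could be any binary classifiers); the toy scorers are illustrative only.

import numpy as np
from itertools import combinations

def one_vs_one_predict(x, pairwise_classifiers, K):
    """Combine the K(K-1)/2 binary classifiers by majority vote.
    pairwise_classifiers maps a pair (a, b), a < b, to a function returning a
    positive score for class a and a negative score for class b."""
    votes = np.zeros(K, dtype=int)
    for (a, b) in combinations(range(K), 2):
        score = pairwise_classifiers[(a, b)](x)
        votes[a if score >= 0 else b] += 1
    return int(np.argmax(votes))                        # class with the most votes

# Toy usage with K = 3 classes and hand-made linear scorers.
clfs = {(0, 1): lambda x: x[0] - x[1],
        (0, 2): lambda x: x[0] - 1.0,
        (1, 2): lambda x: x[1] - 1.0}
print(one_vs_one_predict(np.array([2.0, 0.5]), clfs, K=3))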
C.2.5
Computational Issues
We have shown that the problem of finding a maximal margin hyperplane can be formulated as a particular quadratic optimization problem. Many numerical methods for solving general quadratic optimization problems are known. For the special kind of optimization needed for SVM there exist particularly efficient algorithms such as Platt’s sequential minimal optimization (SMO) [92] capable of handling thousands of vectors in a thousand-dimensional space.
C.3
The “Kernel Trick”
The main idea behind the “kernel trick” is to map the data into a different space, called feature space, and to construct a linear classifier in this space. It can also be seen as a way to construct non-linear
classifiers in the original space. Below we explain what the kernel trick is and how these two views are reconciled.
Notice that in the dual problem (C.15) the training points enter only via their inner products. Also, as can be seen in (C.18), the classifier function f(x) can be expressed as a sum of inner products with support vectors. An important result, called Mercer's theorem, states that any symmetric positive semi-definite function K(x, z) is an inner product in some space (and vice versa). In other words, any such function K(x, z) implicitly defines a mapping into a so-called feature space, \phi : x \to \phi(x), such that K(x, z) = \langle \phi(x), \phi(z) \rangle. Such functions K are called kernels. The feature space can be high-dimensional or even have infinite dimension. However, we do not need to know the actual mapping, since we can use the kernel function to compute similarity in the feature space. Some examples of kernels include polynomial kernels

K(x, z) = (\langle x, z \rangle + 1)^p    (C.29)

and Gaussian kernels

K(x, z) = e^{-\frac{||x - z||^2}{2\sigma^2}}.    (C.30)
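For concreteness, the kernels (C.29) and (C.30) can be computed directly, as in the following NumPy sketch; the parameter values are arbitrary.

import numpy as np

def polynomial_kernel(x, z, p=2):
    """K(x, z) = (<x, z> + 1)^p, as in (C.29)."""
    return (np.dot(x, z) + 1.0) ** p

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2)), as in (C.30)."""
    diff = x - z
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(polynomial_kernel(x, z, p=3), gaussian_kernel(x, z, sigma=0.8))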
Many kernels have been developed for special applications such as sequence matching in bioinformatics [58]. General properties of kernels are described in many publications, including [19]. By replacing the inner product in the formulation of SVM by a kernel and solving for the Lagrange multipliers \alpha_i, we can obtain via (C.18) a maximal margin separating hyperplane in the feature space defined by this kernel. Thus choosing non-linear kernels allows us to construct classifiers that are linear in the feature space, even though they are non-linear in the original space. The dual problem in the kernel form is:

maximize  W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} l_i l_j \alpha_i \alpha_j K(x_i, x_j)    (C.31)

subject to

0 = \sum_{i=1}^{N} l_i \alpha_i, \quad \alpha_i \geq 0, \; \forall i = 1, \ldots, N    (C.32)

and the classifier is given by:

f(x) = \sum_{i=1}^{N} l_i \alpha_i K(x_i, x).    (C.33)
The idea of viewing kernels as implicit maps into a feature space was first suggested in [2]. However this approach was not widely used until the emergence of SVM. Many algorithms other than SVM have
now been “kernelized” — reformulated in terms of kernels rather than inner products. A survey of kernel-based methods is given by [86].
C.4
Conclusion
In this note we attempted to highlight the main ideas underlying the SVM method. For a detailed explanation of these, and many other ideas in statistical learning and kernel methods, the interested reader is referred to [19, 114]. While [19] provides a clear and broad introduction to the area, [114] goes into much greater depth. Both of these books provide many references to the relevant literature. Another valuable resource is the kernel-machines website (http://www.kernel-machines.org/), which contains information on books, articles and software related to SVM and other kernel-based methods. In our work we used the LIBSVM package [14]. The multi-class SVM method implemented in LIBSVM is one-vs-one.
Appendix D Logistic Regression and SVM
It turns out that SVM and regularized logistic regression are similar [132], in the sense explained below. Consider the following minimization problem:

\min_{w} \; \lambda \langle w, w \rangle + \sum_{i=1}^{N} g(l_i, x_i, w)    (D.1)
Regularized logistic regression (Appendix B, (B.10)) is clearly of this form, with:

g_\gamma(l_i, x_i, w) = \frac{1}{\gamma} \ln\left(1 + \exp(-\gamma (l_i [\langle w, x_i \rangle + b] - 1))\right), \quad \gamma = 1.    (D.2)
The soft-margin formulation of SVM (C.19) can be rewritten as:

\min_{w,b} \; \frac{1}{2} \langle w, w \rangle + C \sum_{i=1}^{N} \max\{0, 1 - l_i [\langle w, x_i \rangle + b]\},    (D.3)

which shows that it is a particular case of (D.1) with

g_{SVM}(l_i, x_i, w) = \max\{0, 1 - l_i [\langle w, x_i \rangle + b]\}    (D.4)

and \lambda = \frac{1}{2C}.
Furthermore, as pointed out in [132] and proved in [131]:

\lim_{\gamma \to \infty} g_\gamma = g_{SVM}    (D.5)

The same work also proves the convergence of the functions to be minimized (i.e., functions of the form (D.1) with g_\gamma converge to the function with g_{SVM}) and of their solutions.
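The convergence (D.5) is easy to check numerically. The following sketch (illustrative only) evaluates g_gamma and g_SVM as functions of the margin quantity m = l_i[<w, x_i> + b] and prints the maximum gap, which is attained at m = 1, equals ln(2)/gamma, and shrinks as gamma grows.

import numpy as np

def g_gamma(margin, gamma):
    # (1/gamma) * ln(1 + exp(-gamma*(margin - 1))), computed stably via logaddexp
    return np.logaddexp(0.0, -gamma * (margin - 1.0)) / gamma

def g_svm(margin):
    # hinge loss max{0, 1 - margin}, as in (D.4)
    return np.maximum(0.0, 1.0 - margin)

margins = np.linspace(-1.0, 3.0, 9)
for gamma in [1.0, 10.0, 100.0]:
    gap = np.max(np.abs(g_gamma(margins, gamma) - g_svm(margins)))
    print(f"gamma={gamma:6.1f}  max |g_gamma - g_svm| = {gap:.4f}")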
Appendix E Complete Results of Experiments on Benchmark Datasets
Below we present tables of complete results of the experiments on the benchmark datasets. We will use the following notation: superscripts 1, 2, 3 mark results that are significantly different (at the 0.95, 0.99 and 0.999 confidence levels, respectively) from the performance of the global classifier. Since the results can be significantly better or worse than those of the global classifier, the values in each table should be compared with the results of the appropriate global classifier. This notation is used for the results of the CIC, EBCA and HGC approaches. Significance testing was not done on the CIC+EBCA results because they were consistently not as good as those of CIC alone.
E.1
Complete Results of the CIC and EBCA Approaches
For convenience of presentation the results of CIC, EBCA and CIC+EBCA are presented together. Tables E.1-E.4 show the results of these methods on the four benchmark datasets with SVM as the base classifier, while Tables E.5-E.8 show the results for the BMR base classifier. We conduct experiments for the number of clusters per class k = 1, 2, 3, 4 and for the number of metaclasses m = 2, 3, 5, 10. Note that CIC with k = 1 is just standard multi-class classification. The combination CIC+EBCA can be described as (m > 1 and k > 1). CIC can also be seen as CIC+EBCA with m = K · k. In order to obtain the confusion matrix we use 5-fold cross-validation on the training set. For each combination of parameters k and m in EBCA and CIC+EBCA we tried both the KKM and RBC partitioning methods, starting with the same initial conditions. The KKM and RBC methods were applied to B = (1/2)(A + A^t) both in the RUFL and RTFL experiments. The best result(s) in each table are given in bold font, while the second-best set of results is given in italics. The Image dataset (Tables E.1 and E.5) has a small training set (210 points) and the results obtained on it are rather atypical. In particular, both with SVM and BMR, using CIC does not improve on the global classifier (i.e. k = 1). RTFL performs better than RUFL for a number of different choices of m
k CIC
EBCA k=1 KKM RBC 92.10 92.10
m=2 m=3 m=5 m = 10
92.10 92.10 91.95 NA
92.10 92.10 91.95 NA
m=2 m=3 m=5 m = 10
92.10 92.10 92.38 NA
92.10 92.10 92.38 NA
CIC+EBCA k=2 k=3 KKM RBC KKM RBC 91.71 91.71 89.383 89.383 RUFL 91.38 92.05 92.05 92.05 91.52 91.95 91.43 90.95 91.71 92.10 90.19 90.95 91.71 92.19 89.52 90.00 RTFL 88.52 92.05 92.48 92.48 88.71 92.76 89.29 90.76 89.90 92.05 91.62 90.90 90.62 92.33 89.86 89.81
k=4 KKM RBC 89.903 89.903 92.43 89.67 88.81 90.76
92.10 92.10 91.29 92.19
92.24 91.38 90.38 89.33
92.10 92.10 92.19 90.86
Table E.1: Image Dataset, SVM CIC+EBCA results (% accuracy). As mentioned previously, this is a small dataset and the results here are atypical. More specifically, CIC does not improve the predictive accuracy compared to the global classifier (k = 1). RTFL performs better than RUFL in a number of cases. In particular, for k = 1 and m = 5, RTFL beat the global classifier both with KKM and RBC. The best result is given by RTFL with k = 2, m = 2 and RBC.
k CIC
EBCA k=1 KKM RBC 95.85 95.85
m=2 m=3 m=5 m = 10
95.94 95.91 95.80 NA
95.91 95.85 95.77 NA
m=2 m=3 m=5 m = 10
95.432 95.143 93.403 NA
93.003 91.453 93.973 NA
CIC+EBCA k=2 k=3 KKM RBC KKM RBC 97.483 97.483 97.51 3 97.51 3 RUFL 96.14 96.34 95.88 96.86 96.34 97.03 96.00 97.03 96.43 97.17 96.20 96.63 97.40 97.51 97.11 97.34 RTFL 95.57 94.85 95.20 94.60 95.54 95.60 95.57 94.57 94.85 95.14 95.48 93.91 95.71 96.17 95.91 95.48
k=4 KKM RBC 98.003 98.003 96.08 96.25 95.85 96.25
95.63 96.11 96.97 97.54
95.48 95.45 94.54 95.11
95.11 94.88 94.00 95.77
Table E.2: Pendigits Dataset, SVM CIC+EBCA results (% accuracy). Here RUFL is systematically better than RTFL for all values of k and m. CIC+EBCA does better than the global classifier in most cases, but not better than CIC alone for the same value of k. Overall, for RUFL, increasing m seems to improve the CIC+EBCA results, while increasing k improves the CIC results. Clearly, however, increasing either k or m (or both) too much would result in a drop in performance.
k CIC
EBCA k=1 KKM RBC 86.05 86.05
m=2 m=3 m=5 m = 10
85.95 85.95 86.00 NA
86.05 85.95 86.05 NA
m=2 m=3 m=5 m = 10
85.402 85.55 83.753 NA
80.403 85.90 86.20 NA
CIC+EBCA k=2 k=3 k=4 KKM RBC KKM RBC KKM RBC 3 85.90 85.90 87.30 87.30 88.45 88.453 RUFL 86.80 85.00 85.95 86.70 86.45 86.70 86.65 85.00 86.45 86.95 86.40 86.00 86.55 84.65 87.15 87.65 87.00 86.40 85.90 85.90 87.00 87.65 86.75 86.75 RTFL 85.85 82.70 85.30 85.50 85.70 85.80 85.00 84.30 84.95 85.85 85.15 85.40 85.00 82.70 84.75 85.90 85.15 85.65 85.65 85.55 84.85 86.80 84.85 86.00
Table E.3: Satimage Dataset, SVM CIC+EBCA results (% accuracy). These results are qualitatively rather similar to those on the Pendigits dataset, with the exception of those for k = 2, where there is a general drop in the accuracy. For k = 3 and k = 4 however, the CIC accuracy is better than for k = 1. CIC+EBCA again does not improve on CIC alone with the exception of RBC for k = 3 with m = 5, 10, and KKM for k = 2 with m = 2, 3, 5. RTFL almost everywhere performs worse than RUFL.
EBCA k=1 KKM RBC 51.08 51.08
k=2 KKM RBC 53.25 53.25
m=2 m=3 m=5 m = 10
52.38 51.08 51.30 NA
50.22 50.65 50.65 NA
49.57 48.48 51.30 53.25
49.78 52.16 48.27 51.30
m=2 m=3 m=5 m = 10
51.08 51.73 52.16 NA
52.38 50.22 48.05 NA
50.87 49.57 50.43 47.19
49.57 49.35 41.77 49.35
k CIC
CIC+EBCA k=3 KKM RBC 1 56.06 56.06 1 RUFL 50.87 49.78 49.57 52.81 51.08 49.13 51.08 51.52 RTFL 50.87 49.57 49.35 50.00 49.57 48.48 50.22 49.13
k=4 KKM RBC 1 56.71 56.711 49.78 48.92 53.25 54.11
50.22 50.65 53.25 53.90
49.57 48.05 50.00 46.75
49.57 49.78 50.00 51.30
Table E.4: Vowel Dataset, SVM CIC+EBCA results (% accuracy). The Vowel dataset is twice as large as Image, but smaller than the other two. The results however are much more similar to the larger datasets. CIC does better than the global classifier. CIC+EBCA does worse than CIC. RTFL is consistently worse than RUFL for k > 1 (for k = 1 the results are mixed).
k CIC
EBCA k=1 KKM RBC 91.86 91.86
m=2 m=3 m=5 m = 10
91.57 91.57 92.10 NA
91.86 91.57 92.10 NA
m=2 m=3 m=5 m = 10
91.57 91.57 92.43 NA
91.86 91.57 92.43 NA
CIC+EBCA k=2 k=3 KKM RBC KKM RBC 91.43 91.43 89.243 89.243 RUFL 91.38 91.81 91.62 91.62 91.14 91.62 91.29 91.24 91.76 91.71 90.05 91.33 91.29 92.29 89.05 90.71 RTFL 91.81 91.81 91.86 91.86 86.57 87.71 89.29 91.29 90.05 90.81 88.48 91.19 90.90 91.90 88.00 92.81
k=4 KKM RBC 87.673 87.673 89.86 89.62 88.90 88.52
91.81 91.62 90.90 89.71
91.57 90.57 89.76 85.71
91.81 91.43 91.76 88.95
Table E.5: Image Dataset, BMR CIC+EBCA results (% accuracy). In general, BMR results follow the same trends as the SVM results. Here, as in Table E.1, CIC does worse than the global classifier, while EBCA with m = 5 improves on the global results for both RUFL and RTFL. Also, CIC+EBCA does better than CIC alone. RTFL gives the best results.
k CIC
EBCA k=1 KKM RBC 92.71 92.71
m=2 m=3 m=5 m = 10
92.82 92.68 92.88 NA
93.11 92.74 92.97 NA
m=2 m=3 m=5 m = 10
92.481 91.82 91.853 NA
91.973 88.743 91.223 NA
CIC+EBCA k=2 k=3 KKM RBC KKM RBC 95.713 95.713 95.973 95.973 RUFL 92.68 94.65 92.82 92.91 94.45 95.45 93.22 96.00 95.40 95.83 93.85 95.65 95.80 95.60 94.97 96.03 RTFL 92.40 90.25 92.45 91.60 94.14 93.65 93.08 93.88 94.77 94.43 93.40 92.85 95.03 94.91 94.17 95.28
k=4 KKM RBC 3 96.08 96.083 92.48 92.80 92.68 94.77
93.91 93.14 95.31 96.08
92.14 91.97 91.74 93.97
91.31 92.00 93.94 93.85
Table E.6: Pendigits Dataset, BMR CIC+EBCA results (% accuracy). Here CIC does better than the global classifier. EBCA improves slightly on the global classifier for m = 5. For m = 10, CIC+EBCA tends to perform somewhat worse than CIC alone in most cases. RTFL is consistently worse than RUFL.
EBCA k=1 KKM RBC 84.10 84.10
k=2 KKM RBC 84.70 84.70
m=2 m=3 m=5 m = 10
84.20 83.90 84.15 NA
84.10 83.90 84.25 NA
85.60 85.80 85.20 85.05
m=2 m=3 m=5 m = 10
84.15 84.20 83.30 NA
83.10 84.20 84.40 NA
84.20 84.10 83.50 85.60
k CIC
CIC+EBCA k=3 k=4 KKM RBC KKM RBC 85.952 85.952 87.953 87.953 RUFL 85.10 84.35 84.05 83.75 87.35 85.05 84.65 86.10 84.00 85.45 84.45 86.40 85.50 86.10 86.95 85.20 86.05 86.50 85.50 86.40 RTFL 84.10 83.60 84.10 83.60 86.10 84.65 83.40 84.90 83.25 84.15 83.40 83.75 84.70 83.50 85.20 85.40 82.70 85.95 82.80 85.45
Table E.7: Satimage Dataset, BMR CIC+EBCA results (% accuracy). Qualitatively, again, these results are similar both to the results on the Pendigits dataset (Table E.6) and to the SVM results (Table E.3).
k CIC
EBCA k=1 KKM RBC 47.40 47.40
m=2 m=3 m=5 m = 10
47.40 46.10 46.75 NA
47.40 45.24 49.78 NA
m=2 m=3 m=5 m = 10
46.54 44.16 47.84 NA
46.54 35.28 50.22 NA
CIC+EBCA k=2 k=3 KKM RBC KKM RBC 46.75 46.75 49.57 49.57 RUFL 49.13 50.65 46.97 46.97 45.67 52.60 47.62 46.10 45.67 45.89 47.62 44.81 48.48 46.75 44.59 49.35 RTFL 49.35 50.00 47.19 47.40 49.13 53.03 48.27 47.62 44.37 37.23 45.02 41.56 39.61 41.34 41.77 42.21
k=4 KKM RBC 49.78 49.78 46.32 46.97 43.72 45.24
46.32 47.40 51.30 52.81
47.40 46.32 41.13 37.88
47.40 47.62 48.48 50.00
Table E.8: Vowel Dataset, BMR CIC+EBCA results (% accuracy). Here the best results are given by RTFL, RBC for m = 3 and k = 2, and by RUFL RBC for m = 10 and k = 4. However all results in the nearby cells are much lower. RUFL is in most cases better than RTFL. With a few exceptions CIC+EBCA does worse than CIC. EBCA alone improves on the global classifier for some values of m. CIC does better than global for k = 3, 4.
              With R1                          Distance to Centers
Clusters      f1        f2        f3           f1        f2        f3
5             91.86     92.10     92.10        91.38     92.10     92.10
10            90.76^1   90.57^3   92.10        90.76^1   90.86^3   92.10
20            89.57^1   92.19     92.05        88.57^1   92.00     91.43^2

Table E.9: Image Dataset, Hierarchical Classification Results (SVM). R0 accuracy: 92.10.

              With R1                          Distance to Centers
Clusters      f1        f2        f3           f1        f2        f3
5             97.08^3   97.08^3   97.08^3      97.06^3   97.06^3   97.06^3
10            97.23^3   97.46^3   96.48^2      97.11^2   97.34^3   96.37^1
20            97.43^3   97.20^2   96.05        96.91^1   96.80^1   95.91

Table E.10: Pendigits Dataset, Hierarchical Classification Results (SVM). R0 accuracy: 95.85.

and k, and furthermore, EBCA with RTFL and m = 5 actually does better than the global classifier. The best overall results (both for SVM and BMR) are obtained with RTFL and RBC, for k = 2, m = 3. The results on the Pendigits and Satimage datasets are qualitatively rather similar, both with SVM and BMR (Tables E.2-E.3 and E.6-E.7). CIC leads to improvements over the global classifier, while EBCA does not. The combination CIC+EBCA may perform better than the global classifier, but does not do as well as CIC with the same parameter value for k. The best results on both datasets are obtained with k = 4 and CIC. RUFL is consistently better than RTFL in these experiments. The Vowel dataset is twice as large as the Image dataset and several times smaller than the other two. The results on the Vowel dataset, however, are similar to those on the larger datasets. The CIC method alone does best and RUFL outperforms RTFL. The results with BMR are somewhat different, with RUFL for k = 2, 3 with m = 2, 5 giving the best two results. These results are unusual in that the results for similar parameter values are much lower. Without these results, CIC with k = 4 is the best.
E.2
Complete Results of the HGC Method
For each dataset, we experimented with HGC, using R1 and using distance to centers, with k = 5, 10, 20. The local model selection functions f1 -f3 were described in Chapter 3. The same first level classifier R1
              With R1                          Distance to Centers
Clusters      f1        f2        f3           f1        f2        f3
5             85.95     86.40     86.40        85.95     86.45     86.40
10            86.70     87.10^1   87.20^2      86.60     86.95     87.15^2
20            89.15^3   89.10^3   88.55^3      89.05^2   89.10^3   88.45^3

Table E.11: Satimage Dataset, Hierarchical Classification Results (SVM). R0 accuracy: 86.05.
              With R1                          Distance to Centers
Clusters      f1        f2        f3           f1        f2        f3
5             54.55     54.55     54.55        55.63     55.63     55.63
10            52.38     54.11     54.11        52.60     54.11     54.11
20            49.57     49.35     49.57        52.60     52.60     52.38

Table E.12: Vowel Dataset, Hierarchical Classification Results (SVM). R0 accuracy: 51.08.
              With R1                          Distance to Centers
Clusters      f1        f2        f3           f1        f2        f3
5             92.24     91.86     91.86        91.43     91.86     91.86
10            90.00^2   91.86     91.86        89.86^2   91.86     91.86
20            89.14^3   91.10^1   91.86        88.86^3   91.14^1   91.86

Table E.13: Image Dataset, Hierarchical Classification Results (BMR). R0 accuracy: 91.86.
              With R1                          Distance to Centers
Clusters      f1        f2        f3           f1        f2        f3
5             95.54     95.54     95.54        95.43     95.43     95.43
10            97.20     97.20     97.20        97.14     97.14     97.14
20            97.17     97.23     97.17        96.97     96.97     96.91

Table E.14: Pendigits Dataset, Hierarchical Classification Results (BMR). R0 accuracy: 92.08. All results in this table are better than the global classifier at the 99.9% significance level.
              With R1                          Distance to Centers
Clusters      f1        f2        f3           f1        f2        f3
5             86.05^2   86.05^2   86.05^2      86.05^2   86.05^2   86.05^2
10            87.05^3   86.95^3   87.00^3      87.05^3   86.95^3   87.00^3
20            88.55^3   88.15^3   87.75^3      88.35^3   87.95^3   87.70^3

Table E.15: Satimage Dataset, Hierarchical Classification Results (BMR). R0 accuracy: 84.10.
              With R1                          Distance to Centers
Clusters      f1        f2        f3           f1        f2        f3
5             48.27     48.27     48.27        50.00     50.00     50.00
10            50.87     50.87     50.87        49.57     49.57     49.57
20            44.16     42.64     42.64        46.97     44.81     44.81

Table E.16: Vowel Dataset, Hierarchical Classification Results (BMR). R0 accuracy: 46.54.
was used for the same values of k. Results of the experiments with SVM are presented in Tables E.9-E.12, while the results of the BMR experiments are in Tables E.13-E.16. The first column indicates the number of clusters k. The second through fourth columns of results contain the HGC accuracy for f1, f2 and f3 when using the first-level classifier R1. The next three columns of results contain the accuracy of HGC when, on the first level, a point is assigned to the cluster with the nearest center. The best results in each table are given in bold font, while the second-best set of results is given in italics. We will not discuss each table in detail but make several observations. Note that the results are very symmetric between our method with the first-level classifier R1 and the method of nearest centers, in the sense that the relative performance of f1, f2 and f3 is similar in the two groups. For the Image and Vowel datasets, the best results are split between the two approaches, but for the Pendigits and Satimage datasets the best results are obtained using a special first-level classifier R1. The comparison between methods f1, f2 and f3 is inconclusive, since each of these methods produces the best results on some dataset. However, on the smaller datasets the results are very similar across these three methods, while on Pendigits and Satimage there is greater diversity in the results. Another difference between the smaller and the larger sets is that for the smaller sets the best results are for k = 5, 10, while for the larger ones the best results are with k = 10, 20. Finally, note that the majority of results in every table are better than the global classifier result. This is, again, more noticeable on the Pendigits and Satimage datasets.
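For clarity, the "distance to centers" first-level assignment amounts to the following rule. This is a minimal sketch: Euclidean distance is assumed here, and the centers are illustrative.

import numpy as np

def assign_to_nearest_center(x, centers):
    """Return the index of the cluster whose center is closest to x."""
    distances = np.linalg.norm(centers - x, axis=1)     # distance to each center
    return int(np.argmin(distances))

centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
print(assign_to_nearest_center(np.array([4.2, 4.8]), centers))   # -> 1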
E.3
Other Measures of Performance
In Tables E.17-E.20 we show the accuracy on each class for a subset of the methods discussed. These tables show that when an improvement is achieved, it comes either from improving the accuracy over most of the classes, as on the Pendigits and Satimage datasets, or from strongly improving some classes at the cost of a slight deterioration of accuracy on the others, as on the Vowel dataset. (There is little if any improvement over the global classifier on the Image dataset.)
SVM Global HGC, f1 , k HGC, f3 , k CIC, k = 4 BMR Global HGC, f1 , k HGC, f3 , k CIC, k = 4
=5 = 10
=5 = 10
1 96.00 96.00 96.00 99.67
2 80.33 90.33 80.33 76.00
3 91.00 88.33 91.00 86.33
Classes 4 5 99.00 100.00 99.00 96.67 99.00 100.00 99.33 100.00
93.33 96.00 93.33 90.33
89.00 88.33 89.00 74.67
87.00 86.67 87.00 82.00
99.00 99.00 99.00 99.00
100.00 99.00 100.00 99.33
All 6 100.00 100.00 100.00 100.00
7 78.33 72.67 78.33 68.00
92.10 91.86 92.10 89.90
100.00 100.00 100.00 99.67
74.67 76.67 74.67 68.67
91.86 92.24 91.86 87.67
Table E.17: Accuracy on Each Class and Overall Accuracy: Image Dataset
SVM Global HGC, f1 HGC, f3 CIC, k = 4 BMR Global HGC, f1 HGC, f3 CIC, k = 4
1 91.18 95.04 97.25 95.59
2 95.60 96.70 95.60 97.25
3 98.08 99.73 98.08 100.00
4 98.81 98.21 98.81 99.40
Classes 5 6 98.35 98.21 98.08 97.91 98.35 98.51 98.63 98.51
7 99.40 99.70 99.40 100.00
8 90.11 92.86 89.84 93.68
9 93.75 98.81 93.75 99.40
10 95.54 94.05 95.54 97.92
95.85 97.08 96.48 98.00
88.98 95.32 96.97 94.77
84.34 90.66 96.43 94.78
98.35 99.18 99.45 98.35
98.51 97.92 98.81 96.43
96.70 97.25 96.98 98.08
96.43 99.40 99.40 99.70
83.24 87.64 90.11 88.19
87.80 98.21 99.40 97.92
92.56 92.86 97.02 94.94
92.08 95.54 97.20 96.08
94.63 97.61 97.91 98.21
All
Table E.18: Accuracy on Each Class and Overall Accuracy: Pendigits Dataset (the number of clusters k used in HGC is 5 and 10, respectively, the same as in the other tables).
SVM Global HGC, f1 , k HGC, f3 , k CIC, k = 4 BMR Global HGC, f1 , k HGC, f3 , k CIC, k = 4
=5 = 10
=5 = 10
1 98.70 99.35 98.48 98.26
2 93.75 93.75 95.54 96.43
Classes 3 4 94.71 43.13 95.97 42.18 94.96 56.40 92.44 54.03
5 80.17 79.32 81.01 86.50
6 84.89 83.62 82.55 88.09
86.05 85.95 87.20 88.45
98.26 99.13 98.70 97.61
92.86 93.75 95.54 95.54
92.19 92.70 94.46 93.20
77.22 80.59 83.97 82.28
86.81 84.47 83.40 87.45
84.10 86.05 87.00 87.95
30.33 46.45 49.76 56.40
All
Table E.19: Accuracy on Each Class and Overall Accuracy: Satimage Dataset
SVM Global HGC HGC CIC BMR Global HGC HGC CIC
1 88.10 69.05 57.14 71.43
2 57.14 57.14 80.95 61.90
3 54.76 73.81 66.67 78.57
4 54.76 52.38 59.52 73.81
Classes 5 6 7.14 71.43 30.95 47.62 45.24 52.38 28.57 73.81
78.57 69.05 40.48 66.67
52.38 52.38 57.14 50.00
45.24 76.19 78.57 69.05
61.90 59.52 61.90 57.14
14.29 23.81 33.33 19.05
38.10 33.33 54.76 61.90
All 7 33.33 61.90 59.52 71.43
8 66.67 76.19 66.67 57.14
9 52.38 54.76 45.24 52.38
10 30.95 21.43 23.81 16.67
11 45.24 54.76 38.10 38.10
51.08 54.55 54.11 56.71
26.19 26.19 69.05 61.90
52.38 83.33 66.67 52.38
54.76 42.86 28.57 30.95
30.95 38.10 35.71 47.62
57.14 26.19 33.33 30.95
46.54 48.27 50.87 49.78
Table E.20: Accuracy on Each Class and Overall Accuracy: Vowel Dataset. Parameters k and fi used in HGC, and k used in CIC, are the same as in the other tables.
Appendix F SEER Features
Demographic Features FIELD NUMBER/NAME
f1
SEER ITEM NUMBER
1
SEER registry
SEER DESCRIPTION
2 digit code
COMPUTER REP.
1 - San Francisco - Oakland; 0 - otherwise
FIELD NUMBER/NAME
f2
SEER ITEM NUMBER
1
SEER registry
SEER DESCRIPTION
2 digit code
COMPUTER REP.
1 - Connecticut; 0 - otherwise
FIELD NUMBER/NAME
f3
SEER ITEM NUMBER
1
SEER registry
SEER DESCRIPTION
2 digit code
COMPUTER REP.
1 - Detroit; 0 - otherwise
FIELD NUMBER/NAME
f4
SEER ITEM NUMBER
1
SEER registry
SEER DESCRIPTION
2 digit code
COMPUTER REP.
1 - Hawaii; 0 - otherwise
FIELD NUMBER/NAME
f5
SEER ITEM NUMBER
1
SEER registry
SEER DESCRIPTION
2 digit code
COMPUTER REP.
1 - Iowa; 0 - otherwise
FIELD NUMBER/NAME
f6
SEER ITEM NUMBER
1
SEER registry
SEER DESCRIPTION
2 digit code
COMPUTER REP.
1 - New Mexico; 0 - otherwise
FIELD NUMBER/NAME
f7
SEER ITEM NUMBER
1
SEER DESCRIPTION
SEER registry
2 digit code
COMPUTER REP.
1 - Seattle; 0 - otherwise
FIELD NUMBER/NAME
f8
SEER ITEM NUMBER
1
SEER registry
SEER DESCRIPTION
2 digit code
COMPUTER REP.
1 - Utah; 0 - otherwise
FIELD NUMBER/NAME
f9
SEER ITEM NUMBER
1
SEER registry
SEER DESCRIPTION
2 digit code
COMPUTER REP.
1 - Atlanta; 0 - otherwise
FIELD NUMBER/NAME
f10
SEER ITEM NUMBER
1
SEER registry
SEER DESCRIPTION
2 digit code
COMPUTER REP.
1 - Alaska; 0 - otherwise
FIELD NUMBER/NAME
f11
SEER ITEM NUMBER
1
SEER registry
SEER DESCRIPTION
2 digit code
COMPUTER REP.
1 - San Jose - Montgomery; 0 - otherwise
FIELD NUMBER/NAME
f12
SEER ITEM NUMBER
1
SEER registry
SEER DESCRIPTION
2 digit code
COMPUTER REP.
1 - LA; 0 - otherwise
FIELD NUMBER/NAME
f13
SEER ITEM NUMBER
5
Place of birth
SEER DESCRIPTION
3 digit
COMPUTER REP.
1 - USA; 0 - otherwise
FIELD NUMBER/NAME
f14
SEER ITEM NUMBER
8
Race: White
SEER DESCRIPTION
2 digit
COMPUTER REP.
1 - White; 0 - otherwise
FIELD NUMBER/NAME
f15
SEER ITEM NUMBER
8
Race: Black
SEER DESCRIPTION
2 digit
COMPUTER REP.
1 - Black; 0 - otherwise
FIELD NUMBER/NAME
f16
SEER ITEM NUMBER
9
SEER DESCRIPTION
Spanish Surname or Origin
1 digit
COMPUTER REP.
1 - Spanish; 0 - otherwise
FIELD NUMBER/NAME
f17
SEER ITEM NUMBER
10
Sex
SEER DESCRIPTION
1 digit
COMPUTER REP.
1 - Male; 0 - female
FIELD NUMBER/NAME
f18-f25
SEER ITEM NUMBER
7
Age at diagnosis
SEER DESCRIPTION
3 digit
COMPUTER REP.
8-binary scale
COMPUTER REP.
(0-25),(25-34),(35-44),(45-54),(55-64),(65-74),(75-84),≥ 85
FIELD NUMBER/NAME
f26-f34
SEER ITEM NUMBER
5
Region of birth in US
SEER DESCRIPTION
3 digit
COMPUTER REP.
9 binary variables: New England, 0 - otherwise; Mid Atlantic, 0 - otherwise; East North Central, 0 - otherwise; West North Central, 0 - otherwise; South Atlantic, 0 - otherwise; East South Central, 0 - otherwise; South West Central, 0 - otherwise; Mountain, 0 - otherwise; Pacific, 0 - otherwise
FIELD NUMBER/NAME
f35-f43
Region of birth outside US (with more than 1000 cases)
SEER ITEM NUMBER
5
SEER DESCRIPTION
3 digit
COMPUTER REP.
9 binary variables: Canada, 0 - otherwise; Mexico, 0 - otherwise; UK, 0 - otherwise; Scandinavia, 0 - otherwise; Germanic Countries, 0 - otherwise; Romance language Countries, 0 - otherwise; Slavic Countries, 0 - otherwise; Indochina, 0 - otherwise; East Asia, 0 - otherwise
Table F.1: Demographic Features
Medical Features FIELD NUMBER/NAME
f44
SEER ITEM NUMBER
14
Primary Site
SEER DESCRIPTION
4 character
COMPUTER REP.
1 - bronchus; 0 - otherwise
FIELD NUMBER/NAME
f45-f47
SEER ITEM NUMBER
15
Laterality - left, right or both
SEER DESCRIPTION
1 digit
COMPUTER REP.
3 binary variables: 1 - left organ, 0 - otherwise; 1 - right organ, 0 - otherwise; 1 - both organs, 0 - otherwise
FIELD NUMBER/NAME
f48-f51
SEER ITEM NUMBER
18
Grade
SEER DESCRIPTION
1 digit
COMPUTER REP.
4-scale binary: 1000 - I 1100 - II 1110 - III 1111 - IV, 5-8
FIELD NUMBER/NAME
f52
SEER ITEM NUMBER
20
Primary Not Found
SEER DESCRIPTION
12 digit
COMPUTER REP.
1 - primary not found or not stated (000 or 999); 0 - otherwise
FIELD NUMBER/NAME
f53-f60
SEER ITEM NUMBER
20
Size
SEER DESCRIPTION
12 digit
COMPUTER REP.
8-binary scale 1 - 001 or 002 2 - 003 3 - 004-005 4 - 006-010 5 - 011-015 6 - 016-020
7 - 020-990 8 - 998 FIELD NUMBER/NAME
f61-f66
SEER ITEM NUMBER
20
Extension
SEER DESCRIPTION
12 digit
COMPUTER REP.
6 separate binary variables 10-30,40-59,60-70,71-76,77-79,80-85
FIELD NUMBER/NAME
f67
SEER ITEM NUMBER
20
Lymph Involvement
SEER DESCRIPTION
12 digit
COMPUTER REP.
1 - 1-8; 0 - otherwise
FIELD NUMBER/NAME
f68
SEER ITEM NUMBER
20
Lymph Regional
SEER DESCRIPTION
12 digit
COMPUTER REP.
1 - (1-6); 0 - 7
FIELD NUMBER/NAME
f69-f72
SEER ITEM NUMBER
21
Site specific surgery
(until 1998), 66 (1998+) SEER DESCRIPTION
2 digits
COMPUTER REP.
4-binary scale 1000 - 0 1100 - 10,20,30,40,90 1110 - 50 1111 - 60,70 9 or 99 - missing -99
FIELD NUMBER/NAME
f73
SEER ITEM NUMBER
22
Surgery performed
SEER DESCRIPTION
1 digit
COMPUTER REP.
1 - yes; 0 - no
FIELD NUMBER/NAME
f74
SEER ITEM NUMBER
22
SEER DESCRIPTION
Surgery recommended
1 digit
COMPUTER REP.
1 - yes, if was not performed (see prev.); 0 - otherwise
FIELD NUMBER/NAME
f75
SEER ITEM NUMBER
23
Radiation
SEER DESCRIPTION
1 digit
COMPUTER REP.
1 - performed (1-6); 0 - none, or refused (0,7);
FIELD NUMBER/NAME
f76
SEER ITEM NUMBER
25
Radiation with surgery
SEER DESCRIPTION
1 digit
COMPUTER REP.
0 - performed (2-9); 1 - not
FIELD NUMBER/NAME
f77
SEER ITEM NUMBER
25
Radiation prior to surgery
SEER DESCRIPTION
1 digit
COMPUTER REP.
1 - yes; 0 - no
FIELD NUMBER/NAME
f78
SEER ITEM NUMBER
25
Radiation after surgery
SEER DESCRIPTION
1 digit
COMPUTER REP.
1 - yes (2); 0 - no
FIELD NUMBER/NAME
f79
SEER ITEM NUMBER
25
Radiation before and after surgery
SEER DESCRIPTION
1 digit
COMPUTER REP.
1 - yes(3); 0 - no
FIELD NUMBER/NAME
f80
Behavior
SEER ITEM NUMBER
28
(recode of field 17)
SEER DESCRIPTION
1 digit
COMPUTER REP.
1 - ’3’ (malignant); 0 - ’2’ (in situ)
FIELD NUMBER/NAME
f81-f86
SEER ITEM NUMBER
27
Histology (most frequent codes > 10000)
SEER DESCRIPTION
4 digit
COMPUTER REP.
6 binary variables: 800*, 0 otherwise 801*, 0 otherwise 804*, 0 otherwise 807*, 0 otherwise 814*, 0 otherwise 825*, 0 otherwise
FIELD NUMBER/NAME
f87-f92
SEER ITEM NUMBER
27
Histology (medium frequency codes: > 1000)
SEER DESCRIPTION
4 digit
COMPUTER REP.
6 binary variables: 802*, 0 otherwise 803*, 0 otherwise 824*, 0 otherwise 826*, 0 otherwise 848*, 0 otherwise 856*, 0 otherwise
FIELD NUMBER/NAME
f93-f98
SEER ITEM NUMBER
34
SEER modified AJCC stage 3rd ed
SEER DESCRIPTION
2 digit
COMPUTER REP.
6-binary scale 00,10,20,31,32,40 Table F.2: Medical Features
Value Feature 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
0 59483 58836 55909 66153 60473 66009 59613 66995 63778 68319 66467 60376 6012 10733 61408 66003 29757 0 55 330 2242 9789 27533 53273 66492 52934 54962 50057 50548 54630 56504 55973 56603 50878 58297 58749 58781 59038 58733 58640 58600 58329 58222 65310 37788 28786 65418 0 2990
1 Missing Class 0 8918 0 9565 0 12492 0 2248 0 7928 0 2392 0 8788 0 1406 0 4623 0 82 0 1934 0 8025 0 53451 8938 57600 68 6925 68 2266 132 38644 0 68401 0 68346 0 68071 0 66159 0 58612 0 40868 0 15128 0 1909 0 6529 8938 4501 8938 9406 8938 8915 8938 4833 8938 2959 8938 3490 8938 2860 8938 8585 8938 821 9283 369 9283 337 9283 80 9283 385 9283 478 9283 518 9283 789 9283 896 9283 3091 0 28088 2525 37090 2525 651 2332 43140 25261 40150 25261
0 46819 47445 44987 52574 47585 51875 47201 52827 50883 54131 52658 47347 4831 8557 48416 52215 20503 0 7 111 1168 5634 17219 36821 50912 45792 47355 43067 42743 46946 48127 47619 47923 43634 49774 50121 50179 50356 50189 50112 50058 49968 49748 50456 29881 21585 49423 0 931
1 Missing Class 1 7393 0 6767 0 9225 0 1638 0 6627 0 2337 0 7011 0 1385 0 3329 0 81 0 1554 0 6865 0 46057 3324 45627 28 5768 28 1911 86 33709 0 54212 0 54205 0 54101 0 53044 0 48578 0 36993 0 17391 0 3300 0 5096 3324 3533 3324 7821 3324 8145 3324 3942 3324 2761 3324 3269 3324 2965 3324 7254 3324 692 3746 345 3746 287 3746 110 3746 277 3746 354 3746 408 3746 498 3746 718 3746 3756 0 20440 3891 28736 3891 1085 3704 28979 25233 28048 25233
Imputed Value 0 0 0 — 0 — 0 — 0 — — 0 1 1 0 — 1 — — — — 1 1 0 — 0 0 0 0 0 0 0 0 0 — — — — — — — — — 0 0 1 — — —
Table F.3: Frequencies of Feature Values in Each Class (Part I). The imputed values for the features remaining after data cleaning are filled in as described in Section 5.1 of Chapter 5.
Value Feature 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98
0 13486 33299 42845 130 365 401 517 1796 4908 6232 42928 32822 50474 50916 52000 57025 44773 22975 45952 0 15834 64279 65975 42602 42468 35207 7055 67786 62206 68277 74 66998 56441 57093 51213 49826 65659 67917 68268 66936 67887 67394 67407 74 21504 25433 33681 44254 58195
1 Class 0 29654 9841 130 42845 42610 42574 42458 41179 38067 36743 47 24780 7128 6686 5602 577 12829 24034 968 68123 52289 3844 2148 25646 25780 31800 61346 615 6195 124 68327 1403 11960 11308 17188 18575 2742 484 133 1465 514 1007 994 58121 36691 32762 24514 13941 0
Missing
0
25261 25261 25426 25426 25426 25426 25426 25426 25426 25426 25426 10799 10799 10799 10799 10799 10799 21392 21481 278 278 278 278 153 153 1394 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10206 10206 10206 10206 10206 10206
5168 20763 22724 100 342 355 390 734 1368 1520 22731 42524 44835 43185 39218 46237 18951 5762 26588 0 19463 52082 52520 49993 49927 28672 1950 54061 52518 54171 9 52241 39341 45572 42885 40164 53534 53626 54042 54023 54034 53491 53632 9 3206 3937 7595 18184 47530
1 Class 1 23811 8216 100 22724 22482 22469 22434 22090 21456 21304 93 4466 2155 3805 7772 753 28039 22829 1847 53992 34529 1910 1472 4087 4153 24508 52262 151 1694 41 54203 1971 14871 8640 11327 14048 678 586 170 189 178 721 580 47521 44324 43593 39935 29346 0
Missing
Imputed Value
25233 25233 31388 31388 31388 31388 31388 31388 31388 31388 31388 7222 7222 7222 7222 7222 7222 25621 25777 220 220 220 220 132 132 1032 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6682 6682 6682 6682 6682 6682
— — — — — — — — — — — 0 0 0 0 — 0 — — — 1 — — 0 0 0 1 — 0 — — — 0 0 0 0 — — — — — — — — 1 1 0 0 —
Table F.4: Frequencies of Feature Values in Each Class (Part II). The imputed values for the features remaining after data cleaning are filled in as described in Section 5.1 of Chapter 5.
Appendix G SEER Data Clusters Descriptions
Here we present for completeness tables describing feature importance in the local models and in separating clusters from the rest of the data. The importance of the j-th feature with respect to a classifier R is measured by the relative weight of its coefficient, r_j, in the classifier, as in formula (4.7). Tables G.1-G.16 provide feature importance information for each cluster. For ease of understanding, we indicate with "1" that a particular feature is in the set of top features with cumulative relative weight of 50%; otherwise a feature is marked with "-". Each table has the following structure:
• the first column contains the feature id;
• the second column indicates feature importance in the global classifier R0;
• the third column indicates feature importance in the local classifier R2i;
• the fourth column indicates feature importance in the first-level classifier R1i;
• the fifth column contains the mean feature value µj, which for binary variables is also the frequency of value 1;
• the last column provides the feature description.
It is interesting to note that columns 4 and 5 together provide a description of a cluster. The fourth column indicates whether a feature is important for separating the cluster from the rest of the data, while the fifth column can be used to determine the frequency of the feature occurring in the cluster with value "1" (alternatively, this is the mean value of the feature in the cluster).
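A minimal sketch of how such a 50% cut can be computed from a coefficient vector is given below. Formula (4.7) itself appears in Chapter 4; here the relative weight r_j is assumed to be the absolute value of the j-th coefficient normalized by the total absolute weight, and the toy vector is illustrative only.

import numpy as np

def top_features_by_cumulative_weight(w, threshold=0.5):
    """Return the indices of the smallest set of features whose relative weights
    r_j = |w_j| / sum_k |w_k| add up to at least `threshold`."""
    r = np.abs(w) / np.sum(np.abs(w))
    order = np.argsort(r)[::-1]                         # features by decreasing weight
    cumulative = np.cumsum(r[order])
    cutoff = int(np.searchsorted(cumulative, threshold)) + 1
    return set(order[:cutoff].tolist())

w = np.array([0.05, -1.2, 0.3, 0.9, -0.1])              # toy coefficient vector
print(top_features_by_cumulative_weight(w))             # -> {1, 3}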
j 1 2 3 5 7 9 12 13 14 15 17 22 23 24 26 27 28 29 30 31 32 33 34 44 45 46 61 62 63 64 66 70 73 74 75 76 78 82 83 84 85 94 95 96 97
R0 1 1 1 1 1 1 1 1 1 1 1 1
R2i 1 1 1 1 1 1 1 1 -
R1i 1 1 1 1 1 1 1
µj 0.01 0.82 0.02 0.02 0.00 0.01 0.01 0.87 0.97 0.03 0.62 0.93 0.77 0.38 0.42 0.27 0.04 0.01 0.02 0.02 0.02 0.03 0.01 0.08 0.80 0.20 0.22 0.11 0.16 0.12 0.00 0.77 0.06 0.06 0.53 1.00 0.00 0.19 0.19 0.32 0.21 0.85 0.82 0.23 0.00
Description Registry: San-Francisco Registry: Connecticut Registry: Detroit Registry: Iowa Registry: Seattle Registry: Atlanta Registry: Los Angeles Place of birth: US Race: White Race: Black Sex: Male Age 55 or greater Age 65 or greater Age 75 or greater Born in South Atlantic Born in Mid Atlantic Region Born in East North Central Born in West North Central Born in New England Born in East South Central region Born in South West Cental region Born in Mountain region Born in Pacific region Primary Site: Bronchus Laterality: left Laterality: right Extensions code 10-30 Extensions code 40-59 extension code 60-70 Extension code 71-76 Extension code 80-85 Site specific surgery (code 10 or higher) Surgery was performed Surgery recommended Radiation therapy No radiation sequence with surgery Radiation after surgery Histology code 801* Histology code 804* Histology code 807* Histology code 814* Stage code 10 or higher Stage code 20 or higher Stage code 31 or higher Stage code 32 or higher Table G.1: Feature information for Cluster 1.
j 1 2 3 5 7 9 12 13 14 15 17 22 23 24 26 27 28 29 30 31 32 33 34 44 45 46 61 62 63 64 66 70 73 74 75 76 78 82 83 84 85 94 95 96 97
R0 1 1 1 1 1 1 1 1 1 1 1 1
R2i 1 1 1 1 1 1 1 -
R1i 1 1 1 1 1 1 1 -
µj 0.00 0.98 0.00 0.00 0.00 0.00 0.00 1.00 0.98 0.02 0.57 0.88 0.64 0.27 0.96 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.08 0.01 0.98 0.10 0.04 0.11 0.09 0.42 0.75 0.05 0.05 0.49 0.99 0.00 0.25 0.21 0.20 0.25 0.95 0.94 0.63 0.45
Description Registry: San-Francisco Registry: Connecticut Registry: Detroit Registry: Iowa Registry: Seattle Registry: Atlanta Registry: Los Angeles Place of birth: US Race: White Race: Black Sex: Male Age 55 or greater Age 65 or greater Age 75 or greater Born in South Atlantic Born in Mid Atlantic Region Born in East North Central Born in West North Central Born in New England Born in East South Central region Born in South West Cental region Born in Mountain region Born in Pacific region Primary Site: Bronchus Laterality: left Laterality: right Extensions code 10-30 Extensions code 40-59 extension code 60-70 Extension code 71-76 Extension code 80-85 Site specific surgery (code 10 or higher) Surgery was performed Surgery recommended Radiation therapy No radiation sequence with surgery Radiation after surgery Histology code 801* Histology code 804* Histology code 807* Histology code 814* Stage code 10 or higher Stage code 20 or higher Stage code 31 or higher Stage code 32 or higher Table G.2: Feature information for Cluster 2.
j 1 2 3 5 7 9 12 13 14 15 17 22 23 24 26 27 28 29 30 31 32 33 34 44 45 46 61 62 63 64 66 70 73 74 75 76 78 82 83 84 85 94 95 96 97
R0 1 1 1 1 1 1 1 1 1 1 1 1
R2i 1 1 1 1 1 1 1 1 1 1
R1i 1 1 1 1 1 1 1 1 1 -
µj 0.00 0.00 0.96 0.01 0.00 0.00 0.00 0.96 0.75 0.25 0.61 0.87 0.57 0.13 0.01 0.04 0.59 0.01 0.06 0.16 0.03 0.00 0.00 0.07 0.43 0.57 0.25 0.09 0.17 0.33 0.00 0.41 0.05 0.06 0.59 1.00 0.00 0.17 0.21 0.32 0.22 0.85 0.83 0.49 0.02
Description Registry: San-Francisco Registry: Connecticut Registry: Detroit Registry: Iowa Registry: Seattle Registry: Atlanta Registry: Los Angeles Place of birth: US Race: White Race: Black Sex: Male Age 55 or greater Age 65 or greater Age 75 or greater Born in South Atlantic Born in Mid Atlantic Region Born in East North Central Born in West North Central Born in New England Born in East South Central region Born in South West Cental region Born in Mountain region Born in Pacific region Primary Site: Bronchus Laterality: left Laterality: right Extensions code 10-30 Extensions code 40-59 extension code 60-70 Extension code 71-76 Extension code 80-85 Site specific surgery (code 10 or higher) Surgery was performed Surgery recommended Radiation therapy No radiation sequence with surgery Radiation after surgery Histology code 801* Histology code 804* Histology code 807* Histology code 814* Stage code 10 or higher Stage code 20 or higher Stage code 31 or higher Stage code 32 or higher Table G.3: Feature information for Cluster 3.
j 1 2 3 5 7 9 12 13 14 15 17 22 23 24 26 27 28 29 30 31 32 33 34 44 45 46 61 62 63 64 66 70 73 74 75 76 78 82 83 84 85 94 95 96 97
R0 1 1 1 1 1 1 1 1 1 1 1 1
R2i 1 1 1 1 1 1 1 1 1 1 -
R1i 1 1 1 1 1 1 1 1 1 1
µj 0.15 0.03 0.21 0.12 0.09 0.04 0.09 0.90 0.92 0.05 0.54 1.00 0.98 0.83 0.02 0.09 0.17 0.14 0.05 0.05 0.07 0.13 0.11 0.04 0.32 0.68 0.33 0.08 0.03 0.11 0.00 0.16 0.00 0.01 0.25 1.00 0.00 0.53 0.07 0.13 0.11 0.68 0.67 0.13 0.01
Description Registry: San-Francisco Registry: Connecticut Registry: Detroit Registry: Iowa Registry: Seattle Registry: Atlanta Registry: Los Angeles Place of birth: US Race: White Race: Black Sex: Male Age 55 or greater Age 65 or greater Age 75 or greater Born in South Atlantic Born in Mid Atlantic Region Born in East North Central Born in West North Central Born in New England Born in East South Central region Born in South West Cental region Born in Mountain region Born in Pacific region Primary Site: Bronchus Laterality: left Laterality: right Extensions code 10-30 Extensions code 40-59 extension code 60-70 Extension code 71-76 Extension code 80-85 Site specific surgery (code 10 or higher) Surgery was performed Surgery recommended Radiation therapy No radiation sequence with surgery Radiation after surgery Histology code 801* Histology code 804* Histology code 807* Histology code 814* Stage code 10 or higher Stage code 20 or higher Stage code 31 or higher Stage code 32 or higher Table G.4: Feature information for Cluster 4.
j 1 2 3 5 7 9 12 13 14 15 17 22 23 24 26 27 28 29 30 31 32 33 34 44 45 46 61 62 63 64 66 70 73 74 75 76 78 82 83 84 85 94 95 96 97
R0 1 1 1 1 1 1 1 1 1 1 1 1
R2i 1 1 1 1 1 1 1 -
R1i 1 1 1 1 1 1 1 1 1 -
µj 0.13 0.13 0.16 0.11 0.13 0.07 0.13 0.93 0.87 0.08 0.55 0.88 0.62 0.20 0.08 0.06 0.12 0.11 0.06 0.04 0.05 0.05 0.12 0.01 0.43 0.57 0.82 0.17 0.00 0.00 0.00 0.99 0.89 0.90 0.04 0.99 0.00 0.08 0.04 0.29 0.35 0.13 0.03 0.00 0.00
Description Registry: San-Francisco Registry: Connecticut Registry: Detroit Registry: Iowa Registry: Seattle Registry: Atlanta Registry: Los Angeles Place of birth: US Race: White Race: Black Sex: Male Age 55 or greater Age 65 or greater Age 75 or greater Born in South Atlantic Born in Mid Atlantic Region Born in East North Central Born in West North Central Born in New England Born in East South Central region Born in South West Cental region Born in Mountain region Born in Pacific region Primary Site: Bronchus Laterality: left Laterality: right Extensions code 10-30 Extensions code 40-59 extension code 60-70 Extension code 71-76 Extension code 80-85 Site specific surgery (code 10 or higher) Surgery was performed Surgery recommended Radiation therapy No radiation sequence with surgery Radiation after surgery Histology code 801* Histology code 804* Histology code 807* Histology code 814* Stage code 10 or higher Stage code 20 or higher Stage code 31 or higher Stage code 32 or higher Table G.5: Feature information for Cluster 5.
j 1 2 3 5 7 9 12 13 14 15 17 22 23 24 26 27 28 29 30 31 32 33 34 44 45 46 61 62 63 64 66 70 73 74 75 76 78 82 83 84 85 94 95 96 97
R0 1 1 1 1 1 1 1 1 1 1 1 1
R2i 1 1 1 1 1 1 1 1 1 1 1 -
R1i 1 1 1 1 1 -
µj 0.13 0.14 0.21 0.12 0.12 0.06 0.11 0.91 0.84 0.11 0.60 0.78 0.47 0.11 0.10 0.07 0.16 0.13 0.07 0.04 0.05 0.04 0.14 0.03 0.43 0.56 0.33 0.18 0.15 0.06 0.22 1.00 1.00 1.00 0.97 0.00 0.92 0.14 0.04 0.30 0.38 0.86 0.66 0.35 0.23
Description Registry: San-Francisco Registry: Connecticut Registry: Detroit Registry: Iowa Registry: Seattle Registry: Atlanta Registry: Los Angeles Place of birth: US Race: White Race: Black Sex: Male Age 55 or greater Age 65 or greater Age 75 or greater Born in South Atlantic Born in Mid Atlantic Region Born in East North Central Born in West North Central Born in New England Born in East South Central region Born in South West Cental region Born in Mountain region Born in Pacific region Primary Site: Bronchus Laterality: left Laterality: right Extensions code 10-30 Extensions code 40-59 extension code 60-70 Extension code 71-76 Extension code 80-85 Site specific surgery (code 10 or higher) Surgery was performed Surgery recommended Radiation therapy No radiation sequence with surgery Radiation after surgery Histology code 801* Histology code 804* Histology code 807* Histology code 814* Stage code 10 or higher Stage code 20 or higher Stage code 31 or higher Stage code 32 or higher Table G.6: Feature information for Cluster 6.
j 1 2 3 5 7 9 12 13 14 15 17 22 23 24 26 27 28 29 30 31 32 33 34 44 45 46 61 62 63 64 66 70 73 74 75 76 78 82 83 84 85 94 95 96 97
R0 1 1 1 1 1 1 1 1 1 1 1 1
R2i 1 1 1 1 1 1 1 1 -
R1i 1 1 1 1 1 1 1 1 1 -
µj 0.00 0.00 0.00 0.98 0.00 0.00 0.00 1.00 0.99 0.01 0.66 0.91 0.69 0.29 0.00 0.00 0.01 0.97 0.00 0.00 0.00 0.00 0.00 0.08 0.41 0.56 0.09 0.05 0.11 0.15 0.45 0.80 0.04 0.04 0.57 1.00 0.00 0.20 0.24 0.25 0.24 0.94 0.94 0.71 0.48
Description Registry: San-Francisco Registry: Connecticut Registry: Detroit Registry: Iowa Registry: Seattle Registry: Atlanta Registry: Los Angeles Place of birth: US Race: White Race: Black Sex: Male Age 55 or greater Age 65 or greater Age 75 or greater Born in South Atlantic Born in Mid Atlantic Region Born in East North Central Born in West North Central Born in New England Born in East South Central region Born in South West Cental region Born in Mountain region Born in Pacific region Primary Site: Bronchus Laterality: left Laterality: right Extensions code 10-30 Extensions code 40-59 extension code 60-70 Extension code 71-76 Extension code 80-85 Site specific surgery (code 10 or higher) Surgery was performed Surgery recommended Radiation therapy No radiation sequence with surgery Radiation after surgery Histology code 801* Histology code 804* Histology code 807* Histology code 814* Stage code 10 or higher Stage code 20 or higher Stage code 31 or higher Stage code 32 or higher Table G.7: Feature information for Cluster 7.
j 1 2 3 5 7 9 12 13 14 15 17 22 23 24 26 27 28 29 30 31 32 33 34 44 45 46 61 62 63 64 66 70 73 74 75 76 78 82 83 84 85 94 95 96 97
R0 1 1 1 1 1 1 1 1 1 1 1 1
R2i 1 1 1 1 1 1 1 1 -
R1i 1 1 1 1 1 1 1 -
µj 0.15 0.06 0.24 0.03 0.17 0.07 0.13 0.94 0.84 0.12 0.60 0.84 0.57 0.22 0.02 0.09 0.20 0.06 0.08 0.06 0.07 0.06 0.18 0.05 0.00 0.95 0.00 0.00 0.00 0.00 0.98 0.63 0.04 0.04 0.48 0.99 0.00 0.26 0.23 0.14 0.27 1.00 1.00 1.00 1.00
Description Registry: San-Francisco Registry: Connecticut Registry: Detroit Registry: Iowa Registry: Seattle Registry: Atlanta Registry: Los Angeles Place of birth: US Race: White Race: Black Sex: Male Age 55 or greater Age 65 or greater Age 75 or greater Born in South Atlantic Born in Mid Atlantic Region Born in East North Central Born in West North Central Born in New England Born in East South Central region Born in South West Cental region Born in Mountain region Born in Pacific region Primary Site: Bronchus Laterality: left Laterality: right Extensions code 10-30 Extensions code 40-59 extension code 60-70 Extension code 71-76 Extension code 80-85 Site specific surgery (code 10 or higher) Surgery was performed Surgery recommended Radiation therapy No radiation sequence with surgery Radiation after surgery Histology code 801* Histology code 804* Histology code 807* Histology code 814* Stage code 10 or higher Stage code 20 or higher Stage code 31 or higher Stage code 32 or higher Table G.8: Feature information for Cluster 8.
j 1 2 3 5 7 9 12 13 14 15 17 22 23 24 26 27 28 29 30 31 32 33 34 44 45 46 61 62 63 64 66 70 73 74 75 76 78 82 83 84 85 94 95 96 97
R0 1 1 1 1 1 1 1 1 1 1 1 1
R2i 1 1 1 1 1 1 1 1 1 1 1 -
R1i 1 1 1 1 1 1 1 -
µj 0.69 0.00 0.00 0.01 0.00 0.00 0.00 0.99 0.75 0.13 0.58 0.84 0.56 0.15 0.02 0.06 0.06 0.06 0.01 0.04 0.16 0.10 0.42 0.08 0.33 0.65 0.15 0.09 0.27 0.23 0.00 0.70 0.04 0.04 0.65 1.00 0.00 0.19 0.19 0.28 0.27 0.95 0.93 0.49 0.03
Description Registry: San-Francisco Registry: Connecticut Registry: Detroit Registry: Iowa Registry: Seattle Registry: Atlanta Registry: Los Angeles Place of birth: US Race: White Race: Black Sex: Male Age 55 or greater Age 65 or greater Age 75 or greater Born in South Atlantic Born in Mid Atlantic Region Born in East North Central Born in West North Central Born in New England Born in East South Central region Born in South West Cental region Born in Mountain region Born in Pacific region Primary Site: Bronchus Laterality: left Laterality: right Extensions code 10-30 Extensions code 40-59 extension code 60-70 Extension code 71-76 Extension code 80-85 Site specific surgery (code 10 or higher) Surgery was performed Surgery recommended Radiation therapy No radiation sequence with surgery Radiation after surgery Histology code 801* Histology code 804* Histology code 807* Histology code 814* Stage code 10 or higher Stage code 20 or higher Stage code 31 or higher Stage code 32 or higher Table G.9: Feature information for Cluster 9.
j 1 2 3 5 7 9 12 13 14 15 17 22 23 24 26 27 28 29 30 31 32 33 34 44 45 46 61 62 63 64 66 70 73 74 75 76 78 82 83 84 85 94 95 96 97
R0 1 1 1 1 1 1 1 1 1 1 1 1
R2i 1 1 1 1 1 1 1 1 -
R1i 1 1 1 1 1 1 -
µj 0.81 0.01 0.01 0.01 0.00 0.01 0.00 0.03 0.35 0.03 0.66 0.91 0.73 0.43 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.00 0.00 0.06 0.37 0.61 0.09 0.05 0.13 0.26 0.23 0.53 0.03 0.03 0.52 1.00 0.00 0.23 0.11 0.22 0.34 0.95 0.94 0.67 0.30
Description Registry: San-Francisco Registry: Connecticut Registry: Detroit Registry: Iowa Registry: Seattle Registry: Atlanta Registry: Los Angeles Place of birth: US Race: White Race: Black Sex: Male Age 55 or greater Age 65 or greater Age 75 or greater Born in South Atlantic Born in Mid Atlantic Region Born in East North Central Born in West North Central Born in New England Born in East South Central region Born in South West Cental region Born in Mountain region Born in Pacific region Primary Site: Bronchus Laterality: left Laterality: right Extensions code 10-30 Extensions code 40-59 extension code 60-70 Extension code 71-76 Extension code 80-85 Site specific surgery (code 10 or higher) Surgery was performed Surgery recommended Radiation therapy No radiation sequence with surgery Radiation after surgery Histology code 801* Histology code 804* Histology code 807* Histology code 814* Stage code 10 or higher Stage code 20 or higher Stage code 31 or higher Stage code 32 or higher Table G.10: Feature information for Cluster 10.
j 1 2 3 5 7 9 12 13 14 15 17 22 23 24 26 27 28 29 30 31 32 33 34 44 45 46 61 62 63 64 66 70 73 74 75 76 78 82 83 84 85 94 95 96 97
R0 1 1 1 1 1 1 1 1 1 1 1 1
R2i 1 1 1 1 1 1 1 -
R1i 1 1 1 1 1 -
µj 0.00 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.96 0.03 0.58 0.89 0.67 0.26 0.02 0.05 0.08 0.17 0.02 0.02 0.06 0.08 0.42 0.04 0.38 0.62 0.14 0.08 0.19 0.23 0.00 0.78 0.04 0.04 0.61 1.00 0.00 0.21 0.21 0.27 0.24 0.93 0.92 0.42 0.03
Description Registry: San-Francisco Registry: Connecticut Registry: Detroit Registry: Iowa Registry: Seattle Registry: Atlanta Registry: Los Angeles Place of birth: US Race: White Race: Black Sex: Male Age 55 or greater Age 65 or greater Age 75 or greater Born in South Atlantic Born in Mid Atlantic Region Born in East North Central Born in West North Central Born in New England Born in East South Central region Born in South West Cental region Born in Mountain region Born in Pacific region Primary Site: Bronchus Laterality: left Laterality: right Extensions code 10-30 Extensions code 40-59 extension code 60-70 Extension code 71-76 Extension code 80-85 Site specific surgery (code 10 or higher) Surgery was performed Surgery recommended Radiation therapy No radiation sequence with surgery Radiation after surgery Histology code 801* Histology code 804* Histology code 807* Histology code 814* Stage code 10 or higher Stage code 20 or higher Stage code 31 or higher Stage code 32 or higher Table G.11: Feature information for Cluster 11.
j 1 2 3 5 7 9 12 13 14 15 17 22 23 24 26 27 28 29 30 31 32 33 34 44 45 46 61 62 63 64 66 70 73 74 75 76 78 82 83 84 85 94 95 96 97
R0 1 1 1 1 1 1 1 1 1 1 1 1
R2i 1 1 1 1 1 1 -
R1i 1 1 1 1 1 1 -
µj 0.00 0.07 0.09 0.03 0.69 0.02 0.00 0.02 0.82 0.01 0.51 0.92 0.74 0.40 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.04 0.21 0.77 0.07 0.05 0.14 0.26 0.22 0.80 0.05 0.05 0.48 1.00 0.00 0.21 0.23 0.19 0.27 0.97 0.96 0.66 0.29
Description Registry: San-Francisco Registry: Connecticut Registry: Detroit Registry: Iowa Registry: Seattle Registry: Atlanta Registry: Los Angeles Place of birth: US Race: White Race: Black Sex: Male Age 55 or greater Age 65 or greater Age 75 or greater Born in South Atlantic Born in Mid Atlantic Region Born in East North Central Born in West North Central Born in New England Born in East South Central region Born in South West Cental region Born in Mountain region Born in Pacific region Primary Site: Bronchus Laterality: left Laterality: right Extensions code 10-30 Extensions code 40-59 extension code 60-70 Extension code 71-76 Extension code 80-85 Site specific surgery (code 10 or higher) Surgery was performed Surgery recommended Radiation therapy No radiation sequence with surgery Radiation after surgery Histology code 801* Histology code 804* Histology code 807* Histology code 814* Stage code 10 or higher Stage code 20 or higher Stage code 31 or higher Stage code 32 or higher Table G.12: Feature information for Cluster 12.
j 1 2 3 5 7 9 12 13 14 15 17 22 23 24 26 27 28 29 30 31 32 33 34 44 45 46 61 62 63 64 66 70 73 74 75 76 78 82 83 84 85 94 95 96 97
R0 1 1 1 1 1 1 1 1 1 1 1 1
R2i 1 1 1 1 1 1 1 1 1
R1i 1 1 1 1 1 1 1 1 -
µj 0.01 0.05 0.05 0.01 0.01 0.85 0.01 0.99 0.53 0.47 0.69 0.81 0.51 0.16 0.00 0.03 0.03 0.00 0.81 0.07 0.01 0.00 0.00 0.08 0.37 0.61 0.17 0.09 0.20 0.18 0.13 0.78 0.06 0.06 0.56 1.00 0.00 0.28 0.19 0.30 0.16 0.90 0.88 0.52 0.16
Description Registry: San-Francisco Registry: Connecticut Registry: Detroit Registry: Iowa Registry: Seattle Registry: Atlanta Registry: Los Angeles Place of birth: US Race: White Race: Black Sex: Male Age 55 or greater Age 65 or greater Age 75 or greater Born in South Atlantic Born in Mid Atlantic Region Born in East North Central Born in West North Central Born in New England Born in East South Central region Born in South West Cental region Born in Mountain region Born in Pacific region Primary Site: Bronchus Laterality: left Laterality: right Extensions code 10-30 Extensions code 40-59 extension code 60-70 Extension code 71-76 Extension code 80-85 Site specific surgery (code 10 or higher) Surgery was performed Surgery recommended Radiation therapy No radiation sequence with surgery Radiation after surgery Histology code 801* Histology code 804* Histology code 807* Histology code 814* Stage code 10 or higher Stage code 20 or higher Stage code 31 or higher Stage code 32 or higher Table G.13: Feature information for Cluster 13.
j 1 2 3 5 7 9 12 13 14 15 17 22 23 24 26 27 28 29 30 31 32 33 34 44 45 46 61 62 63 64 66 70 73 74 75 76 78 82 83 84 85 94 95 96 97
R0 1 1 1 1 1 1 1 1 1 1 1 1
R2i 1 1 1 1 1 1 1 1 1 1 1 -
R1i 1 1 1 1 1 1 1
µj 0.14 0.15 0.21 0.03 0.15 0.07 0.11 0.93 0.85 0.11 0.59 0.86 0.60 0.23 0.11 0.08 0.17 0.05 0.08 0.06 0.07 0.06 0.15 0.07 1.00 0.00 0.00 0.00 0.00 0.00 0.97 0.64 0.04 0.04 0.48 0.99 0.00 0.25 0.24 0.16 0.27 1.00 1.00 1.00 1.00
Description Registry: San-Francisco Registry: Connecticut Registry: Detroit Registry: Iowa Registry: Seattle Registry: Atlanta Registry: Los Angeles Place of birth: US Race: White Race: Black Sex: Male Age 55 or greater Age 65 or greater Age 75 or greater Born in South Atlantic Born in Mid Atlantic Region Born in East North Central Born in West North Central Born in New England Born in East South Central region Born in South West Cental region Born in Mountain region Born in Pacific region Primary Site: Bronchus Laterality: left Laterality: right Extensions code 10-30 Extensions code 40-59 extension code 60-70 Extension code 71-76 Extension code 80-85 Site specific surgery (code 10 or higher) Surgery was performed Surgery recommended Radiation therapy No radiation sequence with surgery Radiation after surgery Histology code 801* Histology code 804* Histology code 807* Histology code 814* Stage code 10 or higher Stage code 20 or higher Stage code 31 or higher Stage code 32 or higher Table G.14: Feature information for Cluster 14.
j (feature index): 1 2 3 5 7 9 12 13 14 15 17 22 23 24 26 27 28 29 30 31 32 33 34 44 45 46 61 62 63 64 66 70 73 74 75 76 78 82 83 84 85 94 95 96 97
R0: 1 1 1 1 1 1 1 1 1 1 1 1
R2i: 1 1 1 1 1 1 1 -
R1i: 1 1 1 1 1 1 1 -
µj (listed in order of j): 0.00 0.00 0.00 0.01 0.00 0.01 0.86 1.00 0.78 0.19 0.56 0.89 0.67 0.28 0.02 0.09 0.11 0.06 0.02 0.04 0.14 0.07 0.16 0.11 0.41 0.58 0.10 0.06 0.19 0.45 0.00 0.77 0.07 0.07 0.43 1.00 0.00 0.20 0.19 0.24 0.26 0.98 0.97 0.65 0.04
Feature descriptions (listed in order of j): Registry: San Francisco; Registry: Connecticut; Registry: Detroit; Registry: Iowa; Registry: Seattle; Registry: Atlanta; Registry: Los Angeles; Place of birth: US; Race: White; Race: Black; Sex: Male; Age 55 or greater; Age 65 or greater; Age 75 or greater; Born in South Atlantic region; Born in Mid Atlantic region; Born in East North Central region; Born in West North Central region; Born in New England region; Born in East South Central region; Born in South West Central region; Born in Mountain region; Born in Pacific region; Primary Site: Bronchus; Laterality: left; Laterality: right; Extension code 10-30; Extension code 40-59; Extension code 60-70; Extension code 71-76; Extension code 80-85; Site-specific surgery (code 10 or higher); Surgery was performed; Surgery recommended; Radiation therapy; No radiation sequence with surgery; Radiation after surgery; Histology code 801*; Histology code 804*; Histology code 807*; Histology code 814*; Stage code 10 or higher; Stage code 20 or higher; Stage code 31 or higher; Stage code 32 or higher.
Table G.15: Feature information for Cluster 15.
j (feature index): 1 2 3 5 7 9 12 13 14 15 17 22 23 24 26 27 28 29 30 31 32 33 34 44 45 46 61 62 63 64 66 70 73 74 75 76 78 82 83 84 85 94 95 96 97
R0: 1 1 1 1 1 1 1 1 1 1 1 1
R2i: 1 1 1 1 1 1 1
R1i: 1 1 1 1 -
µj (listed in order of j): 0.00 0.00 0.00 0.00 0.00 0.00 0.99 0.00 0.64 0.01 0.65 0.86 0.63 0.29 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.08 0.40 0.58 0.07 0.04 0.09 0.22 0.44 0.73 0.06 0.06 0.46 1.00 0.00 0.22 0.15 0.19 0.31 0.98 0.97 0.74 0.46
Feature descriptions (listed in order of j): Registry: San Francisco; Registry: Connecticut; Registry: Detroit; Registry: Iowa; Registry: Seattle; Registry: Atlanta; Registry: Los Angeles; Place of birth: US; Race: White; Race: Black; Sex: Male; Age 55 or greater; Age 65 or greater; Age 75 or greater; Born in South Atlantic region; Born in Mid Atlantic region; Born in East North Central region; Born in West North Central region; Born in New England region; Born in East South Central region; Born in South West Central region; Born in Mountain region; Born in Pacific region; Primary Site: Bronchus; Laterality: left; Laterality: right; Extension code 10-30; Extension code 40-59; Extension code 60-70; Extension code 71-76; Extension code 80-85; Site-specific surgery (code 10 or higher); Surgery was performed; Surgery recommended; Radiation therapy; No radiation sequence with surgery; Radiation after surgery; Histology code 801*; Histology code 804*; Histology code 807*; Histology code 814*; Stage code 10 or higher; Stage code 20 or higher; Stage code 31 or higher; Stage code 32 or higher.
Table G.16: Feature information for Cluster 16.
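The µj column in Tables G.12 through G.16 lists, for each binary feature j, a value in [0, 1] that can be read as the fraction of patients in the cluster for whom that feature equals one. The following minimal sketch (not the code used for the dissertation experiments) shows how such within-cluster feature means can be computed from a 0/1 data matrix and a vector of cluster labels; the toy data, the 16-cluster assignment, and the function name are illustrative assumptions only.

```python
import numpy as np

# Minimal sketch, assuming mu_j is the within-cluster mean of binary feature j.
# The random toy data and the number of clusters are placeholders, not the
# SEER-derived data analyzed in the dissertation.

def cluster_feature_means(X: np.ndarray, labels: np.ndarray) -> dict:
    """Return {cluster_id: per-feature mean} for a 0/1 matrix X (one row per patient)."""
    return {c: X[labels == c].mean(axis=0) for c in np.unique(labels)}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(1000, 45))   # 45 binary features, as in the tables above
    labels = rng.integers(0, 16, size=1000)   # hypothetical assignment to 16 clusters
    mu = cluster_feature_means(X, labels)
    print(np.round(mu[12], 2))                # per-feature means for one cluster
```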
Vita

Dmitriy Fradkin

1999  B.A. in Mathematics, Computer Science from Brandeis University, Waltham, MA.

2006  Ph.D. in Computer Science from Rutgers University, New Brunswick, NJ.
Publications:

Dmitriy Fradkin and David Madigan. Experiments with Random Projections for Machine Learning. The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), August 2003.

Dmitriy Fradkin and Ilya B. Muchnik. Clusters With Core-Tail Hierarchical Structure And Their Applications To Machine Learning Classification. International Conference on Integration of Knowledge Intensive Multi-Agent Systems (KIMAS), October 2003.

Dmitriy Fradkin, Ilya B. Muchnik and Simon Streltsov. Image Compression in Real-Time Multiprocessor Systems Using Divisive K-Means Clustering. International Conference on Integration of Knowledge Intensive Multi-Agent Systems (KIMAS), October 2003.

Dmitriy Fradkin and Ilya B. Muchnik. A Study of K-Means Clustering for Improving Classification Accuracy of Multi-Class SVM. DIMACS Technical Report # 2004-02, February 2004.

Aynur Dayanik, Dmitriy Fradkin, Alex Genkin, Paul Kantor, David D. Lewis, David Madigan, Vladimir Menkov. DIMACS at the TREC 2004 Genomics Track. Proceedings of the 13th Text REtrieval Conference (TREC 2004), November 2004.

Sundara Venkataraman, Dimitris Metaxas, Dmitriy Fradkin, Casimir Kulikowski, Ilya Muchnik. Distinguishing Mislabeled Data from Correctly Labeled Data in Classifier Design. The 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), November 2004.

Dmitriy Fradkin and Paul Kantor. A Design Space Approach to Analysis of Information Retrieval Adaptive Filtering Systems. The 13th ACM Conference on Information and Knowledge Management (CIKM), November 2004.

Dmitriy Fradkin and Michael Littman. Exploration Approaches to Adaptive Filtering. DIMACS Technical Report # 2005-01, January 2005.

Dmitriy Fradkin and Paul Kantor. Methods for Learning Classifier Combinations: No Clear Winner. Proceedings of ACM Symposium on Applied Computing, March 2005.

David Madigan, Alexander Genkin, David D. Lewis and Dmitriy Fradkin. Bayesian Multinomial Logistic Regression for Author Identification. MaxEnt 2005.
David Madigan, Alexander Genkin, David D. Lewis, Shlomo Argamon, Dmitriy Fradkin and Li Ye. Author Identification on the Large Scale. CSNA 2005.

Dmitriy Fradkin, Dona Schneider and Ilya Muchnik. Machine Learning Methods in the Analysis of Lung Cancer Survival Data. DIMACS Technical Report # 2005-35, October 2005.