Doctoral Thesis
Knowledge Extraction And Data Mining Algorithms For Complex Biomedical Data
Dipl.-Inf. Claudia Plant
Private Universität für Gesundheitswissenschaften, Medizinische Informatik und Technik, Institute of Biomedical Engineering, Eduard Wallnöfer Zentrum 1, A-6060 Hall, Österreich/Austria, www.umit.at
to Chris and my parents
If you would be a real seeker of truth, it is necessary that at least once in your life you doubt, as far as possible, all things.
Eidesstattliche Erklärung (Statutory Declaration)

I hereby declare in lieu of an oath that I have written this dissertation independently. Information and/or content taken directly or indirectly from other sources is marked as such.
Hall, December 2006
Acknowledgement
There are many people who encouraged me during the development of this thesis over the last years. First of all, I want to express my warmest thanks to my supervisor Univ.-Doz. Dr. Christian Baumgartner. We had many fruitful discussions, and he always supported me in the best possible way in developing my ideas. In fact, he often believed more in me than I did.
I also want to thank Prof. Dr. Hans-Jörg Schek, who was my supervisor during the early phase of this thesis before he left UMIT. His outstanding experience and mathematical background gave me novel insights into some problems. In addition, I want to thank Prof. Dr. Schek for permitting me to do parts of this work during inspiring research visits to other universities.
My warmest thanks also go to Univ.-Prof. Dr. Bernhard Tilg, who gave me the opportunity to continue this work at his institute. In his group I found a very positive and supportive environment giving me much freedom for my research. Inspiring biomedical applications and the access to real-world data sets have been essential for the development of most of the techniques described in this thesis.
In addition, I want to thank the database group at the University of Munich and particularly Prof. Dr. Christian Böhm for the close cooperation during the last years. I also want to mention Prof. Dr. Hans-Peter Kriegel, Dr. Karin Kailing and Dr. Peer Kröger, the supervisors of my diploma thesis, who continued to support me during my early time at UMIT.
As mentioned, parts of the development work for some of the proposed techniques were done during research visits to Carnegie Mellon University (Pittsburgh, USA), the National University of Athens and the National University of Singapore. At Carnegie Mellon University I want to thank Prof. Dr. Christos Faloutsos and Dr. Jia-Yu Pan for many inspiring and encouraging discussions. My thanks also go to Prof. Dr. Yannis Ioannidis and Dr. Harry Dimitropoulos at the National University of Athens. Many thanks also to Prof. Dr. Beng Chin Ooi at the National University of Singapore for giving me the opportunity to work with his excellent group, especially with Ying Yan.
I also want to express many thanks to my current and past colleagues at UMIT, not to forget how much fun we had. In particular, I want to mention my former roommate Dragana Damjanovic, Manfred Wurz and Prof. Dr. Heiko Schuldt, who have left UMIT, and my current colleagues including Dr. Bernhard Pfeifer, Marc Breit and Fabio Cerqueira. My special thanks go to my master student and future colleague Melanie Osl. She persistently supported me with several tasks, such as implementation work, and also with discussing novel ideas.
Abstract
Due to the advances of high throughput technologies for data acquisition in biomedicine, an increasing amount of data is produced. This thesis proposes innovative algorithms for data mining on biomedical data, ranging from clustering via semi-supervised clustering to classification. Clustering aims at deriving a previously unknown grouping of a data set and can be applied, e.g., to investigate different stages of a disease. In contrast to conventional clustering algorithms, the proposed method RIC (Robust Information-theoretic Clustering) detects clusters of various data distributions and subspace clusters without requiring input parameters. In contrast to many other algorithms assuming a Gaussian data distribution, RIC succeeds in detecting non-Gaussian clusters on metabolic and other biomedical data sets. Often, for biomedical data, additional information in the form of class labels is given for at least some of the objects. Also in the presence of class labels, a cluster analysis can be very interesting, e.g. to detect multimodal classes. This thesis proposes an algorithm for hierarchical density based semi-supervised clustering called HISSCLU. This algorithm derives and visualizes a hierarchical cluster structure of the data set which is maximally consistent with the class labels. HISSCLU confirms the findings in the literature w.r.t. discovering similar classes on data on protein localization sites. The task of classification deals with predicting the class label of unlabeled instances. This thesis proposes LCF (Local Classification Factor), an approach towards enhancing instance-based classification with the information of local object density. In biomedical applications, classification is often used for diagnosis support, and therefore a high sensitivity in identifying diseased instances is especially important. LCF shows superior accuracy and, most importantly, superior sensitivity on metabolic data.

Not only novel data mining techniques are needed to support the knowledge discovery process on biomedical data. The selection of appropriate data sources and the choice of suitable data mining methods can lead to a substantial knowledge gain, as demonstrated on mining genotype-phenotype correlations in the Marfan Syndrome. Besides data mining, this thesis focuses on unsupervised and supervised feature selection to enable data mining on high dimensional data. In the area of unsupervised feature selection, SURFING ("Subspaces Relevant for Clustering") is proposed. This algorithm determines the most interesting subspaces for clustering without requiring sensitive parameter settings. SURFING outperforms competitive methods in finding interesting subspaces on gene expression data. For supervised feature selection, a feature selection framework for identifying biomarker candidates in proteomic spectra is proposed. This three-step method has been evaluated on mass spectrometry data on ovarian and prostate cancer. The framework is a hybrid method between filter and wrapper approaches. As a result, a set of highly selective features discriminating well between healthy and diseased instances has been identified on both data sets. Finally, this work proposes an efficient index structure to support k-nearest neighbor queries on data streams, thus supporting the use of data mining methods on this type of data. Data streams are of increasing importance in health monitoring. The proposed index structure provides exact answer guarantees for k-nearest neighbor queries on high throughput data streams with very limited memory usage.
Zusammenfassung
Due to advances in high-throughput technologies, ever more data is being produced in biomedicine. This thesis presents new and innovative data mining algorithms for clustering, semi-supervised clustering and classification, together with experimental results on complex biomedical data. Clustering algorithms pursue the goal of partitioning a data set into meaningful groups and can be used, for example, to identify different stages of a disease. In contrast to conventional clustering methods, the algorithm RIC (Robust Information-theoretic Clustering) presented here finds clusters of various data distributions and clusters in subspaces without requiring input parameters. RIC finds, for example, clusters in non-Gaussian high-throughput metabolic data. For biomedical data, additional information in the form of class labels is often available for at least some objects. This additional information can be very interesting for a cluster analysis, for example to find classes with a multimodal data distribution. For such problems, HISSCLU, an algorithm for hierarchical density based semi-supervised clustering, is presented. This algorithm derives a hierarchical cluster structure from the data that is as consistent as possible with the given class information.

A further focus of this thesis is classification. The method LCF (Local Classification Factor) presented here combines instance-based classification with the aspect of local object density in the data space. In biomedical applications, classification methods are frequently used for diagnosis support. Besides a high specificity, a high sensitivity is particularly important here. On metabolomic data, LCF shows clearly improved classification accuracy compared to established classification methods. For the knowledge discovery process in biomedicine, the right choice of data sources and algorithms is of particular importance. Using the Marfan syndrome as an example, a data mining approach for discovering genotype-phenotype correlations of high prognostic and diagnostic value is presented. In addition, this thesis presents new approaches to feature selection for clustering and classification. The algorithm SURFING ("Subspaces Relevant for Clustering") determines the subspaces most relevant for clustering without additional input parameters. On gene expression data, for example, SURFING finds more biologically meaningful subspaces than competing methods. For the identification of proteomic biomarker candidates in MS spectra, a three-step feature selection method is presented. This hybrid approach, combining filter and wrapper methods, identified highly selective peaks in the pre-classified spectra of patients with ovarian and prostate cancer compared to healthy controls. For the efficient processing of k-nearest neighbor queries on data streams, an index structure was developed. Data streams arise, for example, in the monitoring of patients' vital functions and are of increasing importance, e.g. for telemonitoring. In contrast to approximate methods, the presented index structure allows exact processing of k-nearest neighbor queries even with a very limited amount of memory.
List of Publications
Publications in peer-reviewed journals and conferences:

• Plant C, Böhm C, Tilg B, Baumgartner C (2006). Enhancing instance-based classification with local density: A new algorithm for classifying unbalanced biomedical data. Bioinformatics, 22, 981-988.

• Böhm C, Faloutsos C, Pan J-Y, Plant C. Robust Information-theoretic Clustering. Proc. 12th SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2006), pp. 65-75.

• Baumgartner C, Kailing K, Kriegel HP, Kröger P, Plant C. Subspace Selection for Clustering High-Dimensional Data. Proc. 4th IEEE International Conference on Data Mining (ICDM'04), pp. 11-18.

• Böhm C, Ooi BC, Plant C, Yan Y. Efficiently Processing Continuous k-NN Queries on Data Streams. IEEE International Conference on Data Engineering (ICDE 2007), accepted.

• Plant C, Osl M, Tilg B, Baumgartner C. Feature Selection on High Throughput SELDI-TOF Mass Spectrometry Data for Identifying Biomarker Candidates in Ovarian and Prostate Cancer. IEEE International Conference on Data Mining, Workshop on Data Mining in Bioinformatics (DMB 2006), accepted.
• Baumgartner C, Baumgartner D, Eberle M, Plant C, Mátyás G, Steinmann B. Genotype-phenotype correlation in patients with fibrillin-1 gene mutations. Proc. 3rd Int. Conf. on Biomedical Engineering (BioMED 2005), Innsbruck, Austria, pp. 561-566.

• Plant C, Böhm C, Tilg B, Baumgartner C. Enhancing instance-based classification on high throughput MS/MS data: Metabolic syndrome as an example. Gemeinsame Jahrestagung der Deutschen, Österreichischen und Schweizerischen Gesellschaft für Biomedizinische Technik (BMT 2006).

• Damjanovic D, Plant C, Balko S, Schek HJ. User-Adaptable Browsing and Relevance Feedback in Image Databases. Proc. DELOS Workshop on Future Digital Library Management Systems (System Architecture and Information Access), 2005.

Submitted publications:

• Plant C, Böhm C. HISSCLU: A Hierarchical Density Based Algorithm for Semi-supervised Clustering. Submitted to ACM Transactions on Knowledge Discovery from Data (TKDD), 2006.

• Böhm C, Faloutsos C, Pan J-Y, Plant C. Extensions of Robust Information-theoretic Clustering. Submitted to ACM Transactions on Knowledge Discovery from Data (TKDD), 2006.

Submitted patents:

• Plant C, Tilg B, Baumgartner C. Feature Selection on Proteomic Data for Identifying Biomarker Candidates. European Patent Office, 2006.
Contents
Part I  Preliminaries

1 Introduction
  1.1 The Knowledge Discovery Process
  1.2 Data Mining
  1.3 Outline of This Thesis

2 Biomedical Application Areas
  2.1 Introduction
  2.2 Genomics: Gene Expression Analysis
  2.3 Proteomics: Proteomic Spectra and Protein Localization
  2.4 Metabolomics: Metabolite Profiling
  2.5 Conclusion

3 Clustering and Classification
  3.1 The Clustering Task
  3.2 Partitioning Clustering
    3.2.1 K-Means
    3.2.2 DBSCAN
  3.3 Hierarchical Clustering
    3.3.1 Single Link
    3.3.2 Optics
  3.4 The Classification Task
  3.5 Classifiers
  3.6 Validation of the Classification Result
  3.7 Conclusion

Part II  Techniques for Mining Biomedical Data

4 Robust Information-theoretic Clustering
  4.1 Introduction
    4.1.1 Contributions
  4.2 Survey
  4.3 Proposed Method
    4.3.1 Goodness Criterion: VAC
    4.3.2 Robust Fitting (RF)
    4.3.3 Cluster Merging (CM)
  4.4 Experiments
    4.4.1 Results on Synthetic Data
    4.4.2 Results on Biomedical Data
  4.5 Conclusion

5 Semi-Supervised Clustering
  5.1 Introduction
  5.2 Survey
    5.2.1 Semi-Supervised Clustering
    5.2.2 Label Propagation
    5.2.3 Density Based Clustering
    5.2.4 Contributions
  5.3 Proposed Method
    5.3.1 Cluster Expansion
    5.3.2 Local Label-Based Distance Weighting
    5.3.3 Visualization
  5.4 Experiments
    5.4.1 Visualizing Class- and Cluster-Hierarchies
    5.4.2 Spatial- and Class-Coherent Cluster Assignment
    5.4.3 Making Use of Supervision
    5.4.4 Parameter Selection
  5.5 Conclusion

6 Instance Based Classification with Local Density
  6.1 Introduction
  6.2 Survey
    6.2.1 Density Based Outlier Detection
    6.2.2 Supervised Clustering
    6.2.3 Contributions
  6.3 Proposed Method
    6.3.1 Classification Method
    6.3.2 Parameters and Efficiency
  6.4 Experiments
    6.4.1 Data Sets
    6.4.2 Benchmark Classifiers, Validation and Parameter Settings
    6.4.3 Synthetic Data
    6.4.4 Metabolic Data
    6.4.5 Yeast Data
    6.4.6 E. Coli Data
    6.4.7 Biomedical UCI Data Sets
  6.5 Conclusion

7 Discovering Genotype-Phenotype Correlations in Marfan Syndrome
  7.1 Introduction
  7.2 Methods
    7.2.1 Molecular Genetic Analysis
    7.2.2 Marfan Phenotype According to the Gent Criteria
    7.2.3 Patients Data
    7.2.4 Data Mining Methods
  7.3 Results
  7.4 Conclusion

8 Subspace Clustering
  8.1 Introduction
  8.2 Survey
    8.2.1 Subspace Clustering
    8.2.2 Feature Selection for Clustering
    8.2.3 Contributions
  8.3 Proposed Method
    8.3.1 General Idea
    8.3.2 A Quality Criterion for Subspaces
    8.3.3 Algorithm
  8.4 Experiments
    8.4.1 Efficiency
    8.4.2 Effectiveness
  8.5 Conclusion

9 Feature Selection for Classification
  9.1 Introduction
  9.2 Survey
    9.2.1 Feature Selection for Classification
    9.2.2 Data Sets
    9.2.3 Contributions
  9.3 Proposed Method
    9.3.1 Step 1: Removing Irrelevant Features
    9.3.2 Step 2: Selecting the Best Ranked Features
    9.3.3 Step 3: Selecting the Best Region Representatives
  9.4 Results
  9.5 Comparison with Existing Methods
  9.6 Conclusion

10 High-Performance Data Mining on Data Streams
  10.1 Introduction
    10.1.1 Problem Specification
    10.1.2 Contributions
  10.2 Survey
    10.2.1 k-NN Queries on Data Streams
    10.2.2 Skylines and Query Monitoring
    10.2.3 Query Indexing
  10.3 Proposed Method
    10.3.1 Skyline Based Object Maintenance
    10.3.2 Object Delaying
    10.3.3 Query Indexing
    10.3.4 Result Reporting
    10.3.5 k-Nearest Neighbor Queries
  10.4 Experiments
    10.4.1 Data Sets and Methodology
    10.4.2 Query Indexing
    10.4.3 Buffer Size
    10.4.4 Scalability
  10.5 Conclusion

Part III  Conclusion

11 Summary and Outlook
  11.1 Summary
  11.2 Outlook
Part I Preliminaries
Chapter 1 Introduction
Medicine, biology and the life sciences are very data-intensive disciplines. All kinds of data are produced in tremendous amounts, e.g. text, semi-structured data, images, time series, data streams and often very high dimensional feature vectors. Modern devices for data acquisition make it possible to record more and more information. Modern high resolution mass spectrometry, for example, allows hundreds of metabolites or peptides to be measured and quantified from one biosample. At the level of raw mass spectra, even hundreds of thousands of features (m/z values) are measured per sample. Another application scenario is patient monitoring: modern wearable sensors record various parameters of vital functions and are suitable for long-term patient monitoring.
Efficient hardware and software solutions are required to support data storage, processing and exchange. However, the main challenge is the so-called knowledge discovery process, i.e. extracting as much useful knowledge as possible from the data. Proteomic spectra, for example, have shown the potential to yield better results for early-stage cancer detection than traditional biomarkers [1, 2]. However, only very few among several thousands of features are relevant for diagnosis. One challenge in the emerging field of proteomics is to identify biomarkers for various diseases, characterizing e.g. different types and stages of cancer. It is still a long way to the clinical use of such diagnostic tests, but one important goal is to identify significant features (biomarkers) from very high dimensional biological data sets. In the monitoring scenario, it is essential to efficiently identify unusual patterns in the streaming time series of sensor measurements. These suspicious observations can be shown to an expert for further analysis.
Going beyond the original intention of the data acquisition, biomedical data may contain valuable, previously unknown information which may even be outside the scope of the original study. The inspiration behind the research area called data mining is to reveal such information. High dimensional data, for example in proteomics and metabolomics, may exhibit various groups of instances, representing unknown sub-stages of a complex disease. In time series of sensor measurements there may be undiscovered correlations between patterns which are characteristic of an abnormal physiological process.
The following section illustrates the knowledge discovery process in more detail. Section 1.2 elaborates on data mining and Section 1.3 gives an outline of this thesis.
1.1 The Knowledge Discovery Process
To meet the requirements of huge data sets, the research areas of knowledge discovery in databases (KDD) and data mining have emerged in recent years, with multiple books, e.g. [3, 4, 5], numerous papers, e.g. [6, 7, 8, 9], and surveys and theses, e.g. [10, 11, 12], to mention a few. Often the terms data mining and knowledge discovery are used interchangeably; in the strict sense, however, data mining is one step in the KDD process, which is defined as follows in [6]:
Knowledge discovery in databases is the non-trivial process of identifying novel, potentially useful, and ultimately understandable patterns in data.

Figure 1.1: The KDD Process.
Figure 1.1 gives an overview of the KDD process, which comprises five major steps:

1. Data Selection. As a first step, the data needs to be carefully selected. Selection criteria include e.g. data availability, quality, type, format and semantics. The selection of high quality data semantically corresponding to the goal of the discovery process is essential for the following steps.

2. Preprocessing. The target data often requires preprocessing. Suitable strategies for scaling and normalizing the features and for handling missing attribute values have to be selected and applied.

3. Transformation. To reduce the dimensionality of the data, dimensionality reduction techniques which derive transformed representations of the original features, e.g. PCA [13] or ICA [14], can be applied. Alternatively, feature selection techniques can be used. Feature selection techniques reduce the dimensionality by identifying features which are useful for the goal of the discovery process.

4. Data Mining. Depending on the goal of the discovery process, a suitable data mining method is selected. The decision on the data mining algorithm is not easy, since for most tasks there is a huge variety of possibilities. The selected algorithm is applied to the preprocessed and transformed data. For most data mining algorithms it is also a non-trivial task to find appropriate parameter settings.

5. Interpretation and Evaluation. The results of the data mining algorithm are analyzed and interpreted. If the results are not satisfactory, it may be necessary to go back one or more steps. In fact, the KDD process is an iterative process.

In addition to feature selection (belonging to the transformation step), this thesis focuses on novel data mining methods suitable for biomedical data sets. A small end-to-end sketch of the five steps is given below. The next section gives an overview of the various kinds of data mining methods.
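As a concrete illustration of how the five steps interact, the following minimal sketch runs through them on a small synthetic data set. The use of scikit-learn, the random data and all parameter values are illustrative assumptions and not part of the KDD definition itself.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# 1. Data Selection: here simply a synthetic matrix of 200 samples x 50 features,
#    standing in for a carefully selected biomedical data set.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 50))
data[::10, 3] = np.nan                      # simulate missing attribute values

# 2. Preprocessing: handle missing values and normalize the features.
data = SimpleImputer(strategy="mean").fit_transform(data)
data = StandardScaler().fit_transform(data)

# 3. Transformation: reduce the dimensionality, e.g. with PCA.
reduced = PCA(n_components=5).fit_transform(data)

# 4. Data Mining: apply a suitable algorithm, here K-Means with K = 3.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(reduced)

# 5. Interpretation and Evaluation: inspect a quality measure; if it is poor,
#    go back and revise earlier steps (the KDD process is iterative).
print("silhouette coefficient:", silhouette_score(reduced, labels))
```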
1.2 Data Mining
As a crucial step of the KDD process, data mining requires the selection of a suitable data mining algorithm w.r.t. the goal of the discovery process. Following a common characterization [5], the diverse data mining methods can be categorized as follows:

• Clustering: Find a partitioning of the objects of the data set into groups (clusters) while maximizing the similarity of the objects in a common cluster and minimizing the similarity of the objects in different clusters.

• Outlier Detection: Find objects in the data set which are exceptional, i.e. which do not correspond to the general characteristics or model of the data.

• Classification: Learn a function, model or other method from a subset of the data objects to assign a data object to one of several predefined classes.

• Association Analysis: Find subsets of the attributes or subsets of attribute ranges which frequently occur together in the data set (called frequent item sets), and derive so-called association rules from the frequent item sets. The association rules describe common properties of the data.

• Evolution Analysis: Discover and describe regularities or trends for objects with properties that change over time.

• Characterization and Discrimination: Summarize general properties of the data set or of a set of features (characterization), and compare different subsets of the data to other subsets (discrimination).

This thesis mainly covers methods in the areas of clustering and classification. However, for classification, techniques originally designed for outlier detection are also successfully used, so there are no hard boundaries between the categories mentioned above.
Another common characterization is to distinguish between supervised and unsupervised data mining methods. Supervised data mining requires that class labels are given for all data objects. The class label of an object assigns it to one of several predefined classes. Labeling the data according to classes requires domain knowledge and is often done by human experts. The most prominent example of supervised data mining is the task of classification. A classifier is trained on a data set of labeled instances, the so-called training data set. From the training data set, the classifier extracts information to learn a method to predict the class label of a novel unlabeled instance.

Often, only a limited amount of labeled data is available compared to plenty of unlabeled objects. Consequently, the research area of semi-supervised data mining has attracted much attention in recent years, e.g. in [15, 16, 17]. Semi-supervised data mining methods use both the class labels of the labeled objects and the data distribution of the labeled and the unlabeled objects as a source of knowledge. Semi-supervised data mining methods can be further categorized into semi-supervised clustering and semi-supervised classification. For unsupervised data mining, no class information is required. The most familiar example is clustering. Clustering algorithms aim at finding unknown classes of the data set without any a priori knowledge. This thesis provides contributions to supervised, semi-supervised and unsupervised data mining.
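The three settings can be contrasted in a few lines of code. The scikit-learn classes used below, and the convention of marking unlabeled objects with -1, are illustrative assumptions and are not tied to the methods proposed in this thesis.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans
from sklearn.semi_supervised import LabelPropagation

X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# Supervised: all class labels are known and used to train a classifier.
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Unsupervised: no labels are used; clustering has to find a grouping itself.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Semi-supervised: only a few labels are known (the rest are marked -1);
# both the known labels and the distribution of all objects are exploited.
y_partial = np.full_like(y, -1)
y_partial[::20] = y[::20]                 # keep every 20th label
semi = LabelPropagation().fit(X, y_partial)
print(semi.transduction_[:10])            # inferred labels of the first objects
```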
An important issue, especially in the context of biomedical applications, is the performance of data mining methods. In the definition of data mining given in [6] the performance aspect is explicitly highlighted:
Data Mining is a step in the KDD process consisting of applying data analysis algorithms that, under acceptable efficiency limitations, produce a particular enumeration of patterns over the data.
The next section summarizes the major contributions of this work and provides a detailed outline.
1.3 Outline of This Thesis
In this thesis, novel algorithms in the fields of clustering, semi-supervised clustering and classification, and their application to biomedical data, are proposed. In addition, approaches to supervised and unsupervised feature selection, which reduce the dimensionality and enable the application of data mining algorithms, are presented. The major contributions of this thesis can be summarized as follows:
• A parameter-free clustering method which can deal with noisy data sets, subspace clusters and clusters of various data distributions.

• A technique for semi-supervised clustering which gives a concise visualization of the class and cluster structure of complex high dimensional data, evaluated e.g. on metabolic data sets.

• In the field of classification, instance based classification is improved with information on the local object density. The resulting classifier shows superior results on metabolome data.

• For unsupervised feature selection, a subspace selection algorithm which is free from sensitive input parameters is presented. This method finds meaningful feature subsets on gene expression data.

• In the area of supervised feature selection, a feature selection framework accustomed to the special properties of proteomic mass spectrometry data is proposed.

• An index structure for efficiently processing k-nearest neighbor queries on data streams rounds off this thesis. The k-nearest neighbor query is an important basis for many data mining algorithms.

The remainder of this thesis is organized as follows:
The rest of this part contains a biomedical introduction and an introduction to fundamental concepts of data mining.
Chapter 2 gives some background information on the biomedical application areas of the methods presented in the following chapters.
Chapter 3 introduces basic notions of clustering and classification and surveys some fundamental algorithms in more detail. This background information is needed in the following chapters, predominantly in Chapters 4 and 5, but also in Chapters 6, 8 and 9.
Part II (Chapters 4 to 10) is the main part of this thesis. It presents innovative approaches for mining complex biomedical data sets. Most of these ideas have been published recently ([18], [19], [20], [21], [22], [23], [24]).
Chapter 4 introduces "Robust Information-theoretic Clustering" (RIC), a parameter free clustering method, especially useful for noisy data sets and data sets exhibiting non-Gaussian data distributions. Experiments on metabolic data and on cat retina image data demonstrate the superior performance of RIC on high-dimensional non-Gaussian data sets [18].
Chapter 5 is dedicated to semi-supervised clustering and proposes HISSCLU, a method for hierarchical semi-supervised clustering. HISSCLU determines and visualizes complex class and cluster structures and has been evaluated on metabolic data and on various data sets representing localization sites.
Chapter 6 focuses on classifying unbalanced biomedical data sets. The proposed method, called "Local Classification Factor" (LCF), combines instance based classification with the information of local object density. On metabolic data this method shows superior performance and meets the special requirement of a high sensitivity for the identification of diseased subjects [19], [20].
Chapter 7 describes the application of clustering and classification algorithms to mine genotype-phenotype correlations in patients with the Marfan Syndrome. Accurate classification of the patients based on phenotype data makes it possible to identify early those patients who are at high risk of developing severe symptoms [21].
Chapter 8 addresses the task of unsupervised feature selection for clustering high dimensional data. A novel method to identify relevant subspaces for clustering (SURFING) is proposed and extensively evaluated on gene expression data sets [22].
Chapter 9 proposes a generic framework for supervised feature selection to identify biomarker candidates on proteomic data. The method is evaluated on SELDI-TOF MS data on ovarian and prostate cancer [23].
Chapter 10 describes a technique for efficiently processing k-nearest neighbor queries on data streams. The k nearest neighbor query is an important building block for many data mining algorithms, such as clustering and classification. The proposed technique offers an exact answer guarantee with very limited memory usage, even on high-throughput data streams as they occur in health monitoring applications [24].
Part III concludes the thesis. Chapter 11 summarizes the contributions of this thesis and points out some directions for future research.
Notations which are frequently used throughout this thesis are summarized in Table 1.1.
Symbol    Definition
DS        The data set.
n         The number of objects in the data set.
d         The dimensionality of the data set.
O         The set of objects in the data set.
~x        A data point.
x_i       The i-th attribute of the data point ~x.
~x_i      The i-th data point in DS.
dist      A metric distance function.
C         The clusters of a data set, C = {C_i | i = 1, ..., k}.
C_i       A cluster of data points.
L         The classes of a data set, L = {L_i | i = 1, ..., k}.
L(~x)     The class label of ~x.

Table 1.1: Table of Symbols and Acronyms
Chapter 2 Biomedical Application Areas
This chapter contains some biomedical background information on the data sets used in this thesis. After a brief general introduction in the next section, biomedical application domains belonging to the areas of genomics, metabolomics and proteomics are surveyed. The chapter ends with a summary pointing out fundamental requirements for data mining on the described types of data.
Figure 2.1: From DNA to Cell.
2.1 Introduction
In molecular biology, the fields of genomics, metabolomics and proteomics, in short often called the "omics", have attracted much attention in recent years. Most of the data sets discussed in this work derive from these research areas. Due to advances in high throughput technologies for data acquisition, e.g. DNA microarrays or mass spectrometry, data sets of very high dimensional feature vectors are produced. This data potentially contains much unexplored information and is thus very challenging for data mining.
Often, established data mining methods cannot be applied to "omics" data because of special characteristics of the data, first of all the high dimensionality. Developing new data mining algorithms accustomed to this kind of data, or modifying existing algorithms such that they meet the special requirements of the corresponding application area, is a major challenge. The emerging research area of systems biology aims at integrating genomics, metabolomics and proteomics to get a broader understanding of biological systems [25]. To achieve this goal, it is of fundamental importance to advance the knowledge discovery process on these data sources in terms of effectiveness and efficiency.
Figure 2.1 (figure from http://genomicsgtl.energy.gov/pubs/overview screen.pdf) gives an overview of the relationship between genomics, metabolomics and proteomics. The protein production process and the metabolism of the cell are highly complex processes; thus only a short general outline is given here. For a comprehensive survey see e.g. [26]. First, the gene coding for the requested protein is transcribed into an intermediate product called messenger RNA (mRNA). This process is called gene expression. The mRNA may then be translated into a protein. This transformation includes several intermediate steps, e.g. post-transcriptional modification and folding. Proteins are often packed together as so-called protein machines to accomplish functions in the working cell. One cell contains thousands of proteins and hundreds of characteristic metabolites.
2.2 Genomics: Gene Expression Analysis
Gene expression analysis deals with measuring the expression level of genes under different conditions. The expression level is a measure for the activity of a gene, i.e. for how frequently the gene is transcribed into its mRNA transcript. The expression level of thousands of genes can be measured simultaneously with the microarray technique. The DNA microarray technique measures gene expression indirectly by detecting the amount of different mRNAs to find out which genes are expressed. A widespread technique is the cDNA (complementary DNA) microarray [27].
Figure 2.2: cDNA Microarray Technique.

Roughly speaking, cDNA microarray data is generated as follows (cf. Figure 2.2): starting with the tissue samples to be analyzed (1), RNA is extracted from the samples (2). After cleansing and preprocessing steps, the RNA is transcribed into cDNA; the cDNA of an mRNA is its complementary counterpart. For detection, the cDNA is marked with fluorescent dyes (3), with different samples marked by different colors. The cDNA is then hybridized with the microarray, i.e. the marked cDNA molecules bind with their counterparts on the array (4). After washing, the fluorescent signal is scanned with a laser (5). The raw data is usually normalized and preprocessed, considering application specific effects, before analysis. The resulting data is often visualized by coding the different expression levels in different colors, cf. Figure 2.3.

Figure 2.3: Microarray Data.

Gene expression data is typically available in the form of a matrix where the rows represent the different genes and the columns the different samples. The entry (i, j) of the data matrix represents the expression level of gene i in sample j. The different samples may represent different tissues, such as healthy and cancer tissues. Often, the samples also represent different experimental conditions or different time slots. In the knowledge discovery process, it may be interesting to cluster the rows or the columns of the data matrix, depending on the goal of the discovery process. In Chapter 8 the focus is on clustering genes to find co-expressed genes. Such genes having similar expression levels are often functionally related. Since the function of many genes is still not known, studying co-expression gives important clues about the function of these genes. The information on which genes are co-expressed under which conditions or in which samples gives a deeper insight into the regulation of gene expression and the interaction of the produced proteins.
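The following small sketch illustrates the matrix view described above and how clustering genes (rows) differs from clustering samples (columns) simply by transposing the matrix. The random matrix and the SciPy calls are illustrative assumptions, not the methods used in Chapter 8.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Expression matrix: entry (i, j) is the expression level of gene i in sample j.
rng = np.random.default_rng(0)
expr = rng.normal(size=(100, 12))          # 100 genes measured in 12 samples

# Clustering the rows groups genes with similar expression profiles
# (candidate co-expressed, possibly functionally related genes).
gene_tree = linkage(expr, method="average", metric="correlation")
gene_groups = fcluster(gene_tree, t=5, criterion="maxclust")

# Clustering the columns (the transposed matrix) groups samples instead,
# e.g. healthy versus tumor tissues or different experimental conditions.
sample_tree = linkage(expr.T, method="average", metric="euclidean")
sample_groups = fcluster(sample_tree, t=2, criterion="maxclust")

print(gene_groups[:10], sample_groups)
```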
2.3 Proteomics: Proteomic Spectra and Protein Localization
Currently, the field of proteomics is in a rapid growth phase. This is partly due to the fact that the major efforts on sequencing the human genome and the genomes of other important species have been accomplished recently. Based on the available gene sequences, researchers focus on studying the proteome, i.e. the composition of all proteins expressed by the genome of an organism [28]. As most of the activities in a living cell are performed by proteins, it is essential to study the proteome to get insight into the pathways of the cell metabolism and to identify drug targets. However, the proteome of an organism is very complex; recall that a single cell contains thousands of proteins. It has turned out that there is no direct correlation between the gene sequence and the function of the produced protein. Different proteins can be encoded by one gene, where the alternatives originate from alternative splicing of the mRNA. In addition, proteins usually get modified by complex gene interactions, cellular events or environmental conditions [29].
Initially, two-dimensional polyacrylamide gel electrophoresis was the standard method to create so-called protein maps displaying the proteome under different experimental conditions or of different samples. This technique is especially suitable to answer general questions, such as the quantitative measurement of the global levels of proteins. However, recent advances in mass spectrometry technology have increased the precision of measurement and provide high resolution data. Mass spectrometry is a technique to measure the molecular weight of biomolecules based on their time of flight (TOF) in an electric or magnetic field. A mass spectrometer consists of three major parts: an ion source, which converts the sample molecules into ions, a mass analyzer, which separates the ions according to their mass-to-charge ratio (m/z), and a detector, which generates a signal output finally displayed as a mass spectrum. Mass spectrometers can be classified into different types w.r.t. the technique used for ionisation. Among the most widespread are MALDI (matrix assisted laser desorption ionisation), SELDI (surface enhanced laser desorption) and ESI (electro spray ionisation). The proteomic spectrum from a tissue or serum sample consists of intensity measurements of the contained peptides and proteins at distinct m/z values. With modern high resolution techniques, intensities can be measured at up to some hundred thousands of m/z values. An important application area is pattern discovery from proteomic spectra to detect biomarker candidates for different diseases or disease stages. In Chapter 9 we focus on supervised feature selection on SELDI-TOF MS data to find biomarker candidates for ovarian and prostate cancer.
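To make the data representation concrete, the sketch below turns a raw spectrum (intensity measured at many m/z positions) into a fixed-length feature vector by binning, which is one simple way to obtain the kind of high dimensional vectors used later for feature selection. The synthetic spectrum, the bin width and the aggregation by maximum are illustrative assumptions, not the preprocessing used for the data sets in Chapter 9.

```python
import numpy as np

# A synthetic raw spectrum: intensities measured at many m/z positions.
rng = np.random.default_rng(0)
mz = np.linspace(700.0, 12000.0, 150_000)          # m/z axis
intensity = rng.exponential(scale=1.0, size=mz.size)
intensity[60_000:60_050] += 40.0                   # a pronounced peak

# Binning: aggregate the intensities into fixed-width m/z bins so that every
# spectrum becomes a feature vector of the same dimensionality.
bin_width = 5.0
edges = np.arange(mz.min(), mz.max() + bin_width, bin_width)
features = np.zeros(edges.size - 1)
bin_index = np.clip(np.digitize(mz, edges) - 1, 0, features.size - 1)
np.maximum.at(features, bin_index, intensity)      # keep the maximum per bin

print("feature vector length:", features.size)
print("strongest bins:", np.argsort(features)[-3:])
```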
Another application domain of this thesis within the field of proteomics is determining the subcellular localization of proteins. Due to the complexity of the proteome, the functions and interaction partners of proteins within the cell are in large part unknown. One step towards a better understanding of the functions of proteins is to determine their subcellular location. In Chapters 5 and 6 we use data sets on the prediction of protein localization sites in yeast and E. coli [30, 31], two well-studied simple model organisms.
2.4 Metabolomics: Metabolite Profiling
Metabolomics deals with the investigation of normal and abnormal metabolic processes and the characteristics of involved biomolecules (metabolites).
The metabolome comprises all metabolites of an organism. The metabolism of cells, and even more the metabolism of organisms, is a very complex process which is not predefined by the genetic information. Organisms interact with their environment. Environmental factors, such as nutrition, medication or exercise, have an impact on the metabolism. Substances added to a cell or an organism, such as food, trigger complex biochemical reactions, the metabolic pathways. These pathways are regulated by enzymes. The absence or overexpression of the corresponding enzymes can cause severe metabolic disorders.
Due to the advances in high-throughput data acquisition technologies, metabolomics has a high potential in early stage disease detection and disease prevention. Metabolite profiling aims at measuring the metabolites contained in the cell or in tissues [32]. The dominant method of data acquisition is tandem mass spectrometry (MS/MS). This technique is a multi-step MS process which allows quantified amounts of hundreds of metabolites to be detected from a single sample using internal standards. Most of the metabolic data sets used in this thesis derive from a project on metabolite screening of newborns [33]. Metabolic disorders caused by genetic defects are rare but often life-threatening. However, many of these disorders respond well to medication or nutritional measures if diagnosed early. Figure 2.4 gives an overview of the metabolic disorders included in the newborn screening program. In Chapter 6 the focus is on classifying the metabolic profiles derived from this study, with special focus on a high sensitivity in identifying diseased individuals. Clustering of metabolic data is also an interesting task, e.g. to identify unknown disorders or subtypes of disorders. Chapters 4, 5 and 8 deal with clustering metabolic data.
2.5 Conclusion
In this chapter, biomedical background information on the data sets used in this thesis is given. Most of these biomedical data sets derive from the "omics" areas and consist of high dimensional feature vectors. As the performance of many conventional data mining algorithms tends to decline in terms of efficiency and effectiveness on high dimensional data, there is a need to develop novel solutions or to extend existing methods respecting the special characteristics of the concrete biomedical application. Besides the high dimensionality, most biomedical data sets, such as proteomic spectra or metabolic data, contain noise objects. In addition, complex data distributions, which need not be Gaussian, can complicate clustering. Multimodal class structures, i.e. classes distributed over several clusters of various object densities, can be observed in metabolic data. In general, it is often difficult to find appropriate parameter settings for clustering and classification. Most of the data sets used here also exhibit a huge set of attributes in relation to the number of instances. For proteomic spectra, the used data sets have about 15,000 attributes and only 200 instances. This disproportion makes it hard to train classifiers. Many biomedical data sets are also imbalanced in the number of instances per class. This is most evident on the used metabolic data sets: as inherited metabolic disorders are very rare, there are only very few instances representing the different metabolic diseases compared to a large amount of data representing the healthy control group. The general requirements emerging from the described application areas can be summarized as follows:

• high dimensionality,
• noise,
• complex data distributions,
• lack of instances and
• imbalanced classes.

In this thesis, some data mining algorithms are proposed with the goal to cope with these challenges. Of course, there is no general solution and no algorithm that performs well on all types of data. However, the proposed techniques contribute to facilitating the data mining process on biomedical data; e.g. a noise-robust and parameter free clustering method and a feature selection technique accustomed to the special characteristics of proteomic spectra are proposed.
A further important type of biomedical data used in the following and not mentioned above is phenotypic data. These data sets show the consequences of diseases, e.g. metabolic disorders, at a higher level. Instead of concentrations of molecules, these data sets typically consist of information on the manifestation of symptoms. The phenotypic data sets are described in detail in the corresponding chapters.
The next chapter gives a brief introduction to clustering and classification. Some fundamental algorithms are covered in more detail.
Disorder | Enzyme defect / affected pathway | Diagnostic metabolites | Symptoms if untreated
Phenylketonuria (PKU) | Phenylalanine hydroxylase or impaired synthesis of biopterin cofactor | PHE ↑, TYR ↓ | Microcephaly, mental retardation, autistic-like behavior, seizures
Glutaric acidemia, Type I (GA-I) | Glutaryl CoA dehydrogenase | C5DC ↑ | Macrocephaly at birth, neurological problems, episodes of acidosis/ketosis, vomiting
3-Methylcrotonylglycinemia deficiency (3-MCCD) | 3-methyl-crotonyl CoA carboxylase | C5OH ↑ | Metabolic acidosis and hypoglycemia, some asymptomatic
Methylmalonic acidemia (MMA) | Methylmalonyl CoA mutase or synthesis of cobalamin (B12) cofactor | C3 ↑, C4DC ↑ | Life-threatening/fatal ketoacidosis, hyperammonemia, later symptoms: failure to thrive, mental retardation
Propionic acidemia (PA) | Propionyl CoA carboxylase α or β subunit or biotin cofactor | C3 ↑ | Feeding difficulties, lethargy, vomiting and life-threatening acidosis
Medium chain acyl CoA dehydrogenase deficiency (MCADD) | Medium chain acyl CoA dehydrogenase | C6 ↑, C8 ↑, C10 ↑, C10:1 ↑ | Fasting intolerance, hypoglycemia, hyperammonemia, acute encephalopathy, cardiomyopathy
3-OH long chain acyl CoA dehydrogenase deficiency (LCHADD) | Long chain acyl CoA dehydrogenase or mitochondrial trifunctional protein | C16OH ↑, C18OH ↑, C18:1OH ↑ | Hypoglycemia, lethargy, vomiting, coma, seizures, hepatic disease, cardiomyopathy
Figure 2.4: Metabolic Disorders. Arrows indicate abnormally enhanced and diminished metabolite concentrations. Bold metabolites denote the established primary diagnostic markers. American College of Medical Genetics/American Society of Human Genetics Test and Technology Transfer Committee Working Group, 2000, [34].
Chapter 3 Clustering and Classification
This chapter starts with a general introduction to clustering. In addition, a brief survey of some examples of common partitioning (cf. Section 3.2) and hierarchical clustering algorithms (cf. Section 3.3) is provided. The second part of this chapter introduces the problem of classifying objects, cf. Section 3.4. Section 3.5 covers some selected classification algorithms in more detail and Section 3.6 deals with the validation of classification results and with quality measures for comparing classifiers. As there is a huge variety of clustering and classification algorithms, only methods which are fundamental for the techniques described in this thesis can be surveyed here. See e.g. [4] or [3] (in German) for a comprehensive introduction to data mining. The chapter ends with a short conclusion in Section 3.7.
3.1 The Clustering Task
Clustering is the task of grouping the objects of a data set into clusters while maximizing the intra-cluster similarity and minimizing the inter-cluster similarity, cf. Section 1.2. To determine the similarity between objects, an appropriate similarity measure is needed. For complex data objects, such as images, protein structures or text data, it is a non-trivial task to define domain specific similarity measures. In Chapter 7 a similarity measure for different phenotypes in patients with the Marfan Syndrome is introduced.
Whenever not otherwise specified, throughout this thesis object similarity means feature based similarity. For each data object of a data set DS of dimensionality d, d numerical features are given. The object is thus a point in the d-dimensional data space. The features or attributes of ~x are denoted by F = {x_1, ..., x_d}. To determine the similarity between two data objects ~x and ~y, a metric distance function dist is used. More specifically, let dist be one of the L_p norms, which is defined as follows for an arbitrary p ∈ N:

\[ \mathrm{dist}(\vec{x}, \vec{y}) = \sqrt[p]{\sum_{i=1}^{d} |x_i - y_i|^p} \]
Unless otherwise specified, the Euclidean distance is used, i.e. p = 2.
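A direct implementation of this family of distance functions might look as follows; the function name and the NumPy-based formulation are illustrative, not code from this thesis.

```python
import numpy as np

def lp_distance(x, y, p=2):
    """L_p (Minkowski) distance between two d-dimensional points;
    p = 2 gives the Euclidean distance, p = 1 the Manhattan distance."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]
print(lp_distance(a, b, p=2))   # Euclidean distance, here sqrt(14)
print(lp_distance(a, b, p=1))   # Manhattan distance, here 6.0
```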
Besides selecting a suitable similarity measure, the clustering task requires selecting an appropriate clustering algorithm for the data set. In general, clustering algorithms can be classified into partitioning and hierarchical methods. The following discussion is illustrated by the examples depicted in Figure 3.1, showing 2-d synthetic data sets with different properties for clustering:
Figure 3.1: Example Data Sets. a) A simple data set with three Gaussian clusters of equal size and approximately equal object density. b) A data set consisting of three Gaussian clusters of different sizes and densities. One of the two smaller and denser clusters is nested in the cluster of lower object density. c) A data set consisting of three different clusters of different sizes, densities and model distribution functions: one spherical Gaussian cluster, one cluster with a Laplacian data distribution and a small correlation cluster, whose coordinates are highly correlated. d) The same data set as (c) but with uniformly distributed noise points.
3.2 Partitioning Clustering
Partitioning clustering methods assign each data object to exactly one cluster. This section briefly surveys three prominent representatives of partitioning clustering methods and gives a short list of some related approaches. For a comprehensive survey on basic clustering algorithms see e.g. [35].
3.2.1 K-Means
The most common partitioning clustering algorithm is K-Means [36]. The user needs to specify the number K of clusters to be determined as an input parameter. The algorithm starts with an arbitrary initial partitioning of the data set into K clusters, which is usually far from optimal. The algorithm iteratively improves the initial clustering by minimizing the sum of squared distances of the data objects to the mean vector of the respective cluster. Formally, this sum, often called dispersion (D), is a measure for the compactness of the clustering:
$$D = \sum_{i=1}^{K} D(C_i), \quad \text{where} \quad D(C_i) = \sum_{\vec{x} \in C_i} dist(\vec{x}, \vec{\mu}_i)^2$$
Considering a single cluster C_i, the smaller D(C_i), the more closely the objects are concentrated around the mean vector ~µ_i of the cluster C_i. To obtain a partitioning into K clusters which is as compact as possible, the goal is to minimize D. K-Means accomplishes this goal by iterating the following two steps until convergence: a) Assign all points ~x to the closest cluster center ~µ. b) Update the cluster centers ~µ. The algorithm terminates when no changes in the cluster assignment of the objects occur from one iteration to the next. The main advantage of K-Means is its efficiency: it can be proven that convergence is reached in O(n) iterations. However, this method has three major drawbacks: First, the number of clusters K has to be specified in advance; secondly, the cluster compactness measure D, and thus the clustering result, is very sensitive to noise and outliers. In addition, K-Means implicitly assumes a Gaussian data distribution and is thus restricted to detecting spherically compact clusters. Assuming correct parameterization with K = 3, K-Means has no problem detecting the clusters
in the example data set depicted in Figure 3.1 (a). For this simple data set, K-Means is the best method because of its efficiency. However, K-Means is not suitable for the remaining example data sets: K-Means implicitly favors clusters of balanced sizes and will thus probably miss the smaller cluster which is embedded in the large cluster in example (b). K-Means also cannot deal with correlation clusters and clusters with non-Gaussian data distribution, and it is very sensitive to noise; it is therefore not a good choice for clustering the examples (c) and (d).
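To make the two iterated steps concrete, the following minimal NumPy sketch of the K-Means loop is given for illustration only (initialization, empty-cluster handling and convergence checks are simplified and not taken from the thesis).

import numpy as np

def k_means(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]   # arbitrary initial centers
    for _ in range(n_iter):
        # step a) assign each point to the closest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step b) update the cluster centers as the mean of the assigned points
        new_centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centers, centers):                 # no more changes: converged
            break
        centers = new_centers
    return labels, centers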
Clustering algorithms minimizing a clustering objective function, such as the dispersion D in K-Means, are often called optimization based methods. K-Medoid methods [37, 7] are optimization based methods which are suitable for general metric data. The EM algorithm [38] extends K-Means by using a distribution based clustering objective function. The overall clustering is modeled as a mixture of Gaussian probability distribution functions.
3.2.2 DBSCAN
The DBSCAN algorithm [8] finds clusters of arbitrary shape and number without requiring the user to specify the number of clusters K. DBSCAN relies on a density based clustering notion: Clusters are connected dense regions in the feature space that are separated by regions of lower object density.
This idea is formalized using two parameters: MinPts, specifying a minimum number of objects, and ε, the radius of a hyper-sphere. These two parameters determine a density threshold for clustering. An object is called a core object of an (ε, MinPts)-cluster if there are at least MinPts objects in its ε-neighborhood. If an object P is in the ε-neighborhood of a core object Q, then P is said to be directly density-reachable from
Q. The density-connectivity is the symmetric, transitive closure of the direct density-reachability, and a density based (ε, MinPts)-cluster is defined as a maximal set of density-connected objects. This principle is visualized in Figure 3.2.

Figure 3.2: Definitions of DBSCAN (core object, noise object and density-connected objects for MinPts = 3).
In contrast to many other partitioning clustering methods, such as K-Means and K-Medoid methods, DBSCAN has a determinate result and is robust against noise objects. However, the clustering result strongly depends on an appropriate choice of the density parameters. For some data sets it is even impossible to find a suitable parametrization, as for example (b): DBSCAN cannot detect the three clusters in this case. Depending on the parametrization, the algorithm either detects the two denser clusters (assigning the objects of the sparser cluster to noise) or detects two clusters and no noise points. In principle, DBSCAN yields quite good results on the examples (c) and (d), because the density based clustering paradigm does not assume any particular data distribution or intrinsic dimensionality of the clusters, and the object density within the clusters is rather uniform here. However, no information on the cluster content is provided.
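As a hedged illustration of the density based notion (not taken from the thesis), the sketch below checks the core-object condition directly and, as an alternative, shows how a ready-made DBSCAN implementation from the scikit-learn library could be invoked; the parameter values are arbitrary examples.

import numpy as np

def is_core_object(X, i, eps, min_pts):
    # an object is a core object if its eps-neighborhood contains at least MinPts objects
    dists = np.linalg.norm(X - X[i], axis=1)
    return np.sum(dists <= eps) >= min_pts

# Alternatively, using scikit-learn (illustrative parameter values):
# from sklearn.cluster import DBSCAN
# labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)   # label -1 marks noise points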
3.3 Hierarchical Clustering
A hierarchical clustering of a data set provides more information on the data structure than a flat partitioning clustering. Biomedical data often exhibit hierarchical structures. Considering e.g. metabolic disorders, there are often different stages of a disorder which
afflict the metabolism to a different degree. The metabolic disorder is then represented as a cluster containing different sub-clusters. Hierarchical clustering algorithms compute and visualize the hierarchical cluster structure rather than assigning the objects to distinct clusters. Different flat clusterings corresponding to different levels of the hierarchy can be derived from this representation.

Figure 3.3: Example data set (left) and dendrogram (right).
3.3.1 Single Link
Single link [35] is a very simple hierarchical clustering method requiring no input parameters. The algorithm starts with n singleton clusters, each containing one data object. In each step, the algorithm merges the pair of clusters having the minimal distance until all objects are in one common cluster. The hierarchical cluster structure is visualized in a tree-like structure, the so-called dendrogram. The leaves of the dendrogram represent the single data objects, the root represents the whole data set, and the intermediate levels represent the different steps of the cluster merging procedure. Figure 3.3 depicts the dendrogram of an example data set (figure from http://www.dbs.informatik.uni-muenchen.de/buecher/).
To decide which clusters to merge next, a distance function between two groups of objects is needed. The variant single link uses the minimal distance between the two groups of objects, the variant complete link uses the maximum occurring distance, and the variant average link uses the average of the distances between both groups of objects. Requiring O(n) distance calculations in each step, single link and its variants have an overall runtime complexity of O(n²). A flat clustering can be obtained by horizontally cutting the dendrogram at an arbitrary level.
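For illustration only (the thesis does not use this library), the single, complete and average link variants and the horizontal cut of the dendrogram can be reproduced with SciPy roughly as follows; the cut threshold is an arbitrary example value.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)                                 # toy data set with 20 2-d objects
Z = linkage(X, method='single')                           # also: 'complete', 'average'
flat_labels = fcluster(Z, t=0.3, criterion='distance')    # cut the dendrogram at distance 0.3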
A major drawback of single link and its variants is the so-called single link effect: Whenever two actual clusters are connected by a small chain of equi-distant objects, the two clusters cannot be separated. Thus the clustering result is sensitive to noise objects and outliers. Although the single link effect is not explicitly depicted in the example data sets, for a noisy data set like example (d) the dendrogram will lack clarity.
3.3.2 OPTICS
OPTICS [39] is a hierarchical extension of DBSCAN but is also related to single link. The main idea of OPTICS is to compute a complex hierarchical cluster structure, i.e. all possible clusterings with the parameter ε varying from 0 to a given ε_max, simultaneously during a single traversal of the data set. The output of OPTICS is a linear order of the data objects according to their hierarchical cluster structure, which is visualized in the so-called reachability plot. Figure 3.4 depicts the reachability plot of example (b). The clusters can be recognized as valleys in the reachability plot. It is clearly visible that the cluster denoted by c2 contains a dense sub-cluster. Single link is a special case of OPTICS (where MinPts = 1 and ε_max = ∞). In this case, OPTICS has the same drawbacks as single link, i.e. missing stability with respect to noise objects and the single link effect. OPTICS overcomes this drawback if MinPts is set to higher values.
Figure 3.4: OPTICS Reachability Plot of Example (b).
Compared to DBSCAN, OPTICS turns the definitions of core objects and density connectivity around: Instead of specifying a distance parameter ε and deciding whether or not an object is a core object or two objects are density connected, OPTICS defines the core distance of an object ~x as the minimal distance ε_~x for which DBSCAN would consider ~x a core object. The reachability distance is analogously defined as the minimal distance ε_{~x,~y} between ~x and ~y starting from which DBSCAN would consider these objects as directly density connected (also considering that DBSCAN requires ~x to be a core object for the density reachability).
3.4 The Classification Task
For a given set of classes L, the aim of classification is to learn a function or rule F : O → L that maps as many objects ~x of a set of objects O as possible to their correct class L(~x) ∈ L. To train the classifier, there are some objects for which the class labels are known. This training data set is given in the form of a set of tuples T = {(~x, L(~x)) | ~x ∈ DS, L(~x) ∈ L}. Many classification methods require a metric distance function dist as specified in Section 3.1. Some common classification algorithms are described in the next section.
3.5 Classifiers
Logistic regression analysis (LRA). LRA constructs a linear separating hyperplane between the two classes which have to be distinguished by the classifier. A logistic function $p = \frac{1}{1 + e^{-z}}$ is used to determine the distance from the hyperplane as a probability measure of class membership, where z is the logit of the model. LRA uses a maximum-likelihood method to maximize the probability of obtaining the observed results given the fitted coefficients [40]. It is also possible to build multinomial logistic regression models [41].
Support vector machines (SVM). Standard (or binary) SVMs classify objects into two classes by calculating the maximum margin hyperplane between the training objects of the two given classes. Structural risk minimization is associated with this scheme and yields a good trade-off between low empirical risk and small capacity ([42, 43]). The use of soft margins, specified by a cost factor c, and of kernel functions (e.g. polynomial or Gaussian radial basis kernels), in which the non-linear mapping is implicitly embedded, enables SVMs to classify any kind of data. To apply SVMs for discriminating more than two classes, several approaches have been introduced [44].
k-nearest neighbor classifier (k-NN). The k-NN classifier constructs decision boundaries by simply storing the complete training data. An object ~x is classified by choosing the majority class among the k closest objects of the training data set. Standard or weighted metric distance functions are used to determine the distance of the k nearest neighbors from ~x. In general, k is determined empirically. Larger values of k often smooth over local characteristics, whereas smaller values of k lead to limited neighborhoods [45].
Decision tree (DT). DTs are usually rooted binary trees with simple classifiers at each internal node, which split the feature space recursively based on tests that compare one feature variable against a threshold value. The class label is represented at each leaf. The most frequently used DT algorithms are ID3 and its successors C4.5 and C5.0. The information gain criterion is applied for choosing the best test, i.e. the test with the largest expected reduction of entropy, a measure of the impurity of the training data. Reduced error pruning may help to reduce overfitting ([46, 47]).
Naïve Bayes (NB). The NB classifier is an approximation to an ideal Bayesian classifier which would classify an object based on the probability of each class given the object's feature variables. NB's main assumption is that the different features are independent of each other. Although the assumption that the predictor variables are independent is not always accurate, it simplifies the classification task dramatically, since it allows the class conditional densities to be calculated separately for each variable ([48, 49]).
Artificial neural networks (ANN). An ANN is a processing algorithm inspired by the biological nervous system, consisting of several layers of neurons. In general, an input layer takes the input and distributes it to the hidden layers, which perform the necessary computations and pass the result to the output layer. The standard algorithm used for classification is a multilayered ANN trained using backpropagation and the delta rule. The network should be designed with neither too many nor too few hidden layers, which can lead to over- or underfitting of the training data ([45, 50]).
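Purely as an illustration of the classifiers listed above (the scikit-learn library is not used in the thesis), they could be instantiated roughly as follows; all parameter values are arbitrary defaults, and the training/test variables are placeholders.

from sklearn.linear_model import LogisticRegression      # LRA
from sklearn.svm import SVC                              # SVM with Gaussian RBF kernel
from sklearn.neighbors import KNeighborsClassifier       # k-NN
from sklearn.tree import DecisionTreeClassifier          # DT
from sklearn.naive_bayes import GaussianNB               # NB
from sklearn.neural_network import MLPClassifier         # ANN (multilayer perceptron)

classifiers = {
    'LRA':  LogisticRegression(max_iter=1000),
    'SVM':  SVC(kernel='rbf', C=1.0),
    'k-NN': KNeighborsClassifier(n_neighbors=5),
    'DT':   DecisionTreeClassifier(),
    'NB':   GaussianNB(),
    'ANN':  MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000),
}
# X_train, y_train, X_test: feature matrix and class labels (placeholders)
# for name, clf in classifiers.items():
#     clf.fit(X_train, y_train)
#     predictions = clf.predict(X_test)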
3.6 Validation of the Classification Result
To compare the performance of different classification methods, usually three different measures are used [51]:

• Accuracy: the number of correctly classified objects divided by the total number of objects of all classes, n.
• Precision of a class Li: the number of correctly classified objects of class Li divided by the total number of objects labeled as Li.
• Recall of a class Li: the number of correctly classified objects of class Li divided by |Li|, the total number of objects of class Li.

In a biomedical context, classification often aims at separating diseased instances from a healthy control group. For such two-class problems the notions of sensitivity and specificity are often used to characterize the performance of a classifier.

• Sensitivity: the number of correctly identified diseased individuals divided by the total number of diseased individuals, also called the true positive rate.
• Specificity: the number of correctly classified healthy instances divided by the total number of healthy individuals, also called the true negative rate.

The sensitivity of a classifier, or true positive rate, corresponds to the recall of the class representing the diseased individuals. Analogously, the specificity, or true negative rate, corresponds to the recall of the healthy class.
To estimate the performance of a classifier on new, unlabeled instances, K-fold cross-validation is often used, especially when only few training examples are available. The training data set is separated into K subsets of equal size, called folds. The classifier is trained
on K−1 of the subsets and the remaining subset is used for testing. Quality measures, such as accuracy or sensitivity, are computed using every fold once as test set. The average quality measure over the K folds is returned.
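A minimal sketch of K-fold cross-validation with the quality measures defined above, given for illustration only: it assumes a binary labeling where 1 marks the diseased and 0 the healthy class, uses scikit-learn utilities (not part of the thesis), and picks Naïve Bayes arbitrarily as the classifier.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB

def cross_validated_quality(X, y, K=10):
    X, y = np.asarray(X), np.asarray(y)
    accs, sens, specs = [], [], []
    for train_idx, test_idx in StratifiedKFold(n_splits=K, shuffle=True, random_state=0).split(X, y):
        clf = GaussianNB().fit(X[train_idx], y[train_idx])
        pred, truth = clf.predict(X[test_idx]), y[test_idx]
        tp = np.sum((pred == 1) & (truth == 1))        # correctly identified diseased
        tn = np.sum((pred == 0) & (truth == 0))        # correctly identified healthy
        accs.append(np.mean(pred == truth))            # accuracy on this fold
        sens.append(tp / max(np.sum(truth == 1), 1))   # sensitivity (true positive rate)
        specs.append(tn / max(np.sum(truth == 0), 1))  # specificity (true negative rate)
    return np.mean(accs), np.mean(sens), np.mean(specs)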
3.7 Conclusion
In this chapter, a brief introduction to clustering and classification was given. Some widespread algorithms which are fundamental for the techniques described in the following chapters have been shortly characterized, and their advantages and drawbacks have been illustrated. Besides the biomedical background provided in the previous chapter, this methodical background is essential for the following chapters, which introduce novel techniques for mining biomedical data, starting with clustering. The next chapter presents an approach to partitioning clustering founded on notions of information theory. Applied after an arbitrary initial clustering, e.g. by K-Means, it obtains good results even on very difficult data sets such as example (d).
Part II Techniques for Mining Biomedical Data
Chapter 4 Robust Information-theoretic Clustering
Besides classification, clustering is the most important technique for data mining on biomedical data. Clustering algorithms can be applied, e.g., on metabolic data to discover groups of patients suffering from unknown metabolic disorders. Clustering can also be used as a preprocessing step for classification. Complex data sets often contain classes which are distributed among several clusters. However, the choice of an appropriate clustering algorithm and of suitable parameter settings is often difficult. This chapter proposes a robust framework to find a natural clustering of a data set based on the minimum description length (MDL) principle [52]. The proposed framework, Robust Information-theoretic Clustering (RIC), is orthogonal to any known clustering algorithm: given a preliminary clustering, RIC purifies these clusters from noise and adjusts the clustering such that it simultaneously determines the most natural number and shape (subspace) of the clusters. RIC can be combined with any clustering technique, e.g. K-Means or DBSCAN. After an introduction in Section 4.1, a brief summary of related work is given in Section 4.2. In Section 4.3 the RIC framework is explained in detail. In Section 4.4 an extensive experimental evaluation on synthetic and biomedical data sets is given; in particular, a high dimensional metabolic data set and a phenotypic data set on retina detachment are used. Finally, Section 4.5 gives a summary and concludes this chapter.
4.1 Introduction
The problem of clustering has attracted a huge volume of attention for several decades, with multiple books [53], surveys [10] and papers (X-means [54], G-means [55], CLARANS [7], CURE [56], CLIQUE [57], BIRCH [58], to name a few). Most of these algorithms are extensions of the basic methods described in Chapter 3. Recent interest in clustering has been on finding clusters that have non-Gaussian correlations in subspaces of the attributes, e.g. [59, 60, 61]. These techniques are very promising for use on high dimensional biomedical data. Given a biomedical data set, it is a non-trivial question to choose an appropriate clustering algorithm out of this large variety of possibilities. In addition, for most of the algorithms mentioned above the clustering result strongly depends on suitable parameter settings. Many methods, such as K-Means and G-means, implicitly assume a Gaussian data distribution and are sensitive to noise. Biomedical data sets often exhibit non-Gaussian data distributions and tend to be contaminated by noise points. Thus, clustering biomedical data sets to obtain a biologically meaningful partitioning is often hard.
For example, Figure 4.1 shows a fictitious set of points in 2-d containing noise points and a subspace cluster. Figure 4.1(a) shows a grouping of points that most humans would agree is 'good': a Gaussian-like cluster at the left, a line-like cluster at the right, and a few noise points ('outliers') scattered throughout. However, typical clustering algorithms like K-Means may produce a clustering like the one in Figure 4.1(b): a bad number of clusters (five, in this example), with Gaussian-like shapes, fooled by a few outliers. There are two questions we try to answer in this work:
Q1: goodness How can we quantify the ’goodness’ of a grouping? We would like a function that will give a good score to the grouping of Figure 4.1(a) and a bad score to the one of Figure 4.1(b).
Figure 4.1: A fictitious dataset, (a) with a good clustering of one Gaussian cluster, one sub-space cluster, and noise; and (b) a bad clustering.

Q2: efficiency How can we write an algorithm that will produce good groupings, efficiently and without getting distracted by outliers?

The overview and contributions of this chapter are exactly the answers to the above two questions: For the first, we propose to envision the problem of clustering as a compression problem and use information-theoretic arguments. The grouping of Figure 4.1(a) is 'good' because it can succinctly describe the given dataset, with few exceptions: The points of the left cluster can be described by their (short) distances from the cluster center; the points of the right, line-like cluster can be described by just one coordinate (the location on the line) instead of two; the remaining outliers each need two coordinates, with near-random (and thus incompressible) values. Our proposal is to measure the goodness of a grouping as the Volume after Compression (VAC): that is, record the bytes to describe the number of clusters k; the bytes to record their type (Gaussian, line-like, or something else, from a fixed vocabulary of distributions); the bytes to describe the parameters of each distribution (e.g., mean, variance, covariance, slope, intercept); and then the location of each point, compressed according to the distribution it belongs to.
Notice that the VAC criterion does not specify how to find a good grouping; it can only say which of two groupings is better. This brings us to the next contribution:
We propose to start from a sub-optimal grouping (e.g., using K-Means, with some arbitrary k). Then, we propose to use two novel algorithms:

• Robust fitting (RF), instead of the fragile PCA, to find low-dimensional sub-space clusters, and
• Cluster merging (CM), to stitch promising clusters together.

We continue fitting and merging until our VAC criterion reaches a plateau. The sketch of our algorithm above has a gradient descent flavor. Notice that we could use any and all of the known optimization methods, like simulated annealing, genetic algorithms, and everything else that we want: our goal is to optimize our VAC criterion within a user-acceptable time frame. We propose the gradient-descent version because we believe it strikes a good balance between speed of computation and cluster quality.
4.1.1 Contributions
The proposed method, RIC, answers both questions that we stated earlier: For cluster quality, it uses the information-theoretic VAC criterion; for searching, it uses the two new algorithms (Robust Fit and Cluster Merge). The resulting method has the following advantages:

a) It is fully automatic, i.e. no difficult or sensitive parameters must be selected by the user.

b) It returns a natural partitioning of the data set, thanks to the intuitive information theoretic principle of maximizing the data compression.

c) It can detect clusters beyond Gaussians: clusters in full-dimensional data space as well as clusters in axis-parallel subspaces (so-called subspace clusters) and in arbitrarily oriented subspaces (correlation clusters), and combinations and mixtures of clusters of all different types during one single run of the algorithm.

d) It can assign model distribution functions such as uniform, Gaussian, Laplacian (etc.) distributions to the different subspace coordinates and thus gives a detailed description of the cluster content.

e) It is robust against noise. Our Robust Fitting (RF) method is specifically designed to spot and ignore noise points.

f) It is space and time efficient, and thus scalable to large data sets.

To the best of our knowledge, no other clustering method meets all of the above properties. Table 4.1 gives a list of symbols used in this chapter in addition to the symbols given in Table 1.1.

Table 4.1: Symbols and Acronyms Used in Chapter 4.
VAC: Volume After Compression.
RF: Robust Fit.
CM: Cluster Merge.
RIC: Robust Information-theoretic Clustering.
C_core: The set of core points in C.
C_out: The set of noise points (outliers) in C.
~µ: A cluster center of cluster C.
~µ_R: A robust cluster center of cluster C.
Σ (Σ_i): The covariance matrix of points in cluster C (or C_i).
Σ_C: The conventional version of Σ (from averaging).
Σ_R: The robust version of Σ (from taking medians).
V (or V_i): The candidate direction matrix derived from Σ (or Σ_i).
VAC(C): The VAC value of points in cluster C.
saveCost(C_i, C_j): The improvement on the VAC if C_i and C_j are merged.
4.2 Survey
As mentioned, clustering has attracted a huge volume of interest over the past several decades. Recently, several papers have focused on scalable clustering algorithms, e.g. CLARANS [7], CURE [56], CLIQUE [57], BIRCH [58], DBSCAN [8] and OPTICS [39]. There are also algorithms that try to use no user-defined parameters, like X-means [54] and G-means [55]. However, they all suffer from one or more of the following drawbacks: they focus on spherical or Gaussian clusters, and/or they are sensitive to outliers, and/or they need user-defined thresholds and parameters.
Gaussian clusters: Most algorithms are geared towards Gaussian, or plain spherical clusters: For example, the well known K-means algorithm, BIRCH [58] (which is suitable for spherical clusters), X-means [54] and G-means [55]. These algorithms tend to be sensitive to outliers, because they try to optimize the log-likelihood of a Gaussian, which is equivalent to the Euclidean (or Mahalanobis) distance - either way, an outlier has high impact on the clustering.
Non-Gaussian clusters: Density based clustering methods, such as DBSCAN and OPTICS, can detect clusters of arbitrary shape and data distribution and are robust against noise. However, the user has to select a density threshold for DBSCAN, and for OPTICS a threshold to derive clusters from the reachability plot. K-harmonic means [62] avoids the problem of outliers, but still needs k. Spectral clustering algorithms [63] perform K-means or similar algorithms after decomposing the n × n Gram matrix of the data (typically using PCA). Clusters of arbitrary shape in the original space correspond to Gaussian clusters in the transformed space; here k also needs to be selected by the user. Recent interest in clustering has been on finding clusters that have non-Gaussian correlations in subspaces of the attributes [59, 60, 61]. Finding correlation clusters has
diverse applications in bio-informatics.
Parameter-free methods: A disproportionately small number of papers has focused on the subtle but important problem of choosing k, the number of clusters to shoot for. Such methods include the above mentioned X-means [54] and G-means [55], which try to balance the (Gaussian) likelihood error with the model complexity. Both X-means and G-means are extensions of the K-means algorithm, which can only find Gaussian clusters and cannot handle correlation clusters and outliers. Instead, they will force correlation clusters into un-natural, Gaussian-like clusters.
In our opinion, the most intuitive criterion is based on information theory and compression. There is a family of closely related ideas, such as the Information Bottleneck Method [64], which is used by Slonim and Tishby for clustering terms and documents [65]. Based on information theory they derive a suitable distance function for co-clustering, but the number of clusters still needs to be specified in advance by the user.
There are numerous information theoretic criteria for model selection, such as the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the Minimum Description Length (MDL) [52]. Among them, MDL is the inspiration behind our VAC criterion, because MDL also envisions the size of the total, lossless compression as a measure of goodness. The idea behind AIC, BIC and MDL is to penalize model complexity in addition to deviations from the cluster centers. However, MDL is a general framework, and it specifies neither which distributions to shoot for (Gaussian, uniform, or Laplacian) nor how to search for a good fit. In fact, all four methods (BIC, G-means, X-means and RIC) are near-identical for the specific setting of a noise-free mixture of
Gaussians. The difference is that our RIC can also handle noise, as well as additional data distributions (uniform, etc.).
PCA: Principal Component Analysis (PCA) is a powerful method for dimensionality reduction and is optimal under the Euclidean norm. PCA assumes a Gaussian data distribution and identifies the best hyper-plane to project the data onto, so that the Euclidean projection error is minimized. That is, PCA finds global correlation structures of a data set [13]. Recent works have extended PCA to identify local correlation structures that are linear [59] or nonlinear [60]; however, some method-specific parameters, such as the neighborhood size or the dimensionality of microclusters, are still required. It is desirable to have a method that is efficient and robust to outliers, minimizing the need for pre-specified parameters.
4.3 Proposed Method
The quality of a clustering usually suffers from noise in the data set, wrong algorithm parameters (e.g., the number of clusters), or limitations of the method used (e.g., the inability to detect correlation clusters), resulting in an un-natural partition of the data set. Given an initial clustering of a data set, how do we systematically adjust the clustering, overcome the influence of noise, recognize correlation patterns for cluster formation, and eventually obtain a natural clustering?
In this section, we introduce our proposed framework, RIC, for refining a clustering and discovering a most natural clustering of a data set. In particular, we propose a novel criterion, VAC, for determining the goodness of a cluster, and propose methods for:
• (M1) robust estimation of the correlation structure of a cluster in the presence of noise,
• (M2) identification and separation of noise using VAC, and
• (M3) construction of natural correlation clusters by a merging procedure guided by VAC.

The proposed algorithms and the criterion VAC are described in detail in the following subsections.

4.3.1 Goodness Criterion: VAC
The idea is to invent a compression scheme, and to declare as winner the method that minimizes the compression cost, including everything: the encoding for the number of clusters k, the encoding for the shape of each cluster (e.g., mean and covariance, if it is a Gaussian cluster), the encodings for the cluster-id and the (relative) coordinates of the data points.
We assume that all coordinates are integers, since we have finite precision, anyway. That is, we assume that our data points are on a d-dimensional grid. The resolution of the grid can be chosen arbitrarily.
The description of the method consists of the following parts: (a) how to encode integers (b) how to encode the points, once we determine that they belong in a given cluster.
The idea is best illustrated with an example. Suppose we have the dataset of Figure 4.1, and suppose that the available distributions in our RIC framework are two:
Gaussian, and uniform (within a Minimum Bounding Rectangle). Once we decide to assign a point to a cluster, we can store it more economically, by storing its offset from the center of the cluster, and using Huffman-like coding, since we know the distribution of points around the center of the cluster.
Self-delimiting encoding of integers. The idea is that small integers will require fewer bytes: we use the Elias codes, or self-delimiting codes [66], where integer i is represented using O(log i) bits. As Table 4.2 shows, we can encode the length of the integer in unary (using log i zeros) and then the actual value, using log i more bits. Notice that the first bit of the value part is always '1', which helps us decode a string of integers without ambiguity. The system can be easily extended to handle negative numbers, as well as zero itself.

Table 4.2: Self-delimiting integer coding.
number | length | value
1      | 0      | 1
2      | 00     | 10
3      | 00     | 11
8      | 0000   | 1000
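A small Python sketch of the self-delimiting coding shown in Table 4.2, given for illustration only (the helper names are not from the thesis): an integer i is written as b zeros followed by its b-bit binary value, where b is the number of bits of i, so that a decoder can first count the zeros and then read exactly that many value bits.

def encode(i):
    # b zeros (length part) followed by the b-bit binary value, as in Table 4.2
    value = bin(i)[2:]            # e.g. 8 -> '1000'
    return '0' * len(value) + value

def decode(bits, pos=0):
    b = 0
    while bits[pos + b] == '0':   # count the leading zeros -> number of value bits
        b += 1
    value = bits[pos + b: pos + 2 * b]
    return int(value, 2), pos + 2 * b   # decoded integer and new read position

stream = encode(2) + encode(8)    # '0010' + '00001000'
print(decode(stream))             # (2, 4): the first integer and the position of the next code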
Encoding of points. Associated with each cluster C is the following information: Rotatedness R (either false or an orthonormal rotation matrix to decorrelate the cluster), and for each attribute (regardless of whether it is rotated or not) the type T (Gaussian, Laplacian, uniform) and the parameters of the data distribution. Once we decide that point P belongs to cluster C, we can encode the point coordinates succinctly, exploiting the fact that it belongs to the known distribution. If p is the value of the probability density function for attribute P_i, then we need O(log 1/p) bits to encode it.
Figure 4.2: Example of VAC. The x-coordinate follows pdf_Lapl(3.5,1)(x) and the y-coordinate follows pdf_Gauss(3.5,1)(y); for instance, a coordinate value with probability 19% is coded with about 2.3 bits, one with probability 5% with about 4.3 bits.
For a white Gaussian distribution, this is proportional to the Euclidean distance; for an arbitrary Gaussian distribution, this is proportional to the Mahalanobis distance. For a uniform distribution in, say, the Minimum Bounding Rectangle (MBR) (lb_i, ub_i), with 0 ≤ i < d and lb for lower bound, ub for upper bound, respectively, the encoding will be proportional to the area of the MBR.
The objective of this section is to develop a coding scheme for the points ~x of a cluster C which represents the points in a maximally compact way if the points belong to the cluster subspace and follow the characteristic distribution functions of the cluster. Later, we will conversely define the probability density function which gives the highest compression rate to be the right choice. For this section, we assume that all attributes of the points of the cluster have been decorrelated by PCA, and that a distribution function along with the corresponding parameters has already been selected for each attribute. For the example in Figure 4.2 we have a Laplacian distribution for the x-coordinate and a Gaussian distribution for the y-coordinate. Both distributions are assumed with
µ = 3.5 and σ = 1. We need to assign code patterns to the coordinate values such that coordinate values with a high probability (such as 3 < x < 4) are assigned short patterns, and coordinate values with a low probability (such as y = 12, to give a more extreme example) are assigned longer patterns. Provided that a coordinate is really distributed according to the assumed distribution function, Huffman codes optimize the overall compression of the data set. Huffman codes associate with each coordinate x_i a bit string of length $l = \log_2(1/P(x_i))$, where P(x_i) is the probability of the (discretized) coordinate value. Let us fix this in the following definition:
Definition 1 (VAC of a point ~x) Let ~x ∈ R^d be a point of a cluster C and $\overrightarrow{pdf}(\vec{x})$ be a d-dimensional vector of probability density functions which are associated with C. Each pdf_i(x_i) is selected from a set of predefined probability density functions with the corresponding parameters, i.e. $PDF = \{pdf_{Gauss(\mu_i,\sigma_i)}, pdf_{uniform(lb_i,ub_i)}, pdf_{Lapl(a_i,b_i)}, ...\}$, with $\mu_i, lb_i, ub_i, a_i \in \mathbb{R}$, $\sigma_i, b_i \in \mathbb{R}^+$. Let γ be the grid constant (distance between grid cells). The VAC_i (volume after compression) of coordinate i of point ~x corresponds to

$$VAC_i(\vec{x}) = \log_2 \frac{1}{pdf_i(x_i) \cdot \gamma}$$

The VAC (volume after compression) of point ~x corresponds to

$$VAC(\vec{x}) = \log_2 \frac{n}{|C|} + \sum_{0 \le i < d} VAC_i(\vec{x})$$
The information which of the two cases applies is coded by 1 bit. The matrix V is coded using d × d floating point values of f bits each; the identity matrix needs no coding (0 bits):

$$VAC(dec(C)) = \begin{cases} 1 & \text{if } \sum_{\vec{y} \in Y} VAC(\vec{y}) + d^2 f > \sum_{\vec{x} \in C} VAC(\vec{x}) \\ d^2 \cdot f + 1 & \text{otherwise} \end{cases}$$
The following definition puts these things together.

Definition 4 (Cluster Model) The cluster model of a cluster C is composed of the decorrelation dec(C) and the characteristic $\overrightarrow{pdf}(\vec{y})$, where $\vec{y} = dec(C) \cdot \vec{x}$ for every point ~x ∈ C. The Volume After Compression of the cluster, VAC(C), corresponds to

$$VAC(C) = VAC(dec(C)) + \sum_{\vec{x} \in C} VAC(dec(C) \cdot \vec{x})$$
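To make Definition 1 concrete, the following sketch computes the VAC of a point for given per-coordinate density functions, a grid constant γ, the data set size n and the cluster size |C|. It is an illustration under these assumptions only, not the implementation used in the thesis.

import math

def vac_point(x, pdfs, gamma, n, cluster_size):
    # cluster-id part: log2(n / |C|) bits, plus the Huffman-style code length per coordinate
    bits = math.log2(n / cluster_size)
    for xi, pdf in zip(x, pdfs):
        bits += math.log2(1.0 / (pdf(xi) * gamma))
    return bits

# Example with a Laplacian pdf for x and a Gaussian pdf for y (both with mu = 3.5 and scale 1):
lapl  = lambda v: 0.5 * math.exp(-abs(v - 3.5))
gauss = lambda v: math.exp(-0.5 * (v - 3.5) ** 2) / math.sqrt(2 * math.pi)
print(vac_point([3.6, 3.4], [lapl, gauss], gamma=0.1, n=1000, cluster_size=200))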
4.3.2 Robust Fitting (RF)
We consider the combination of the cluster’s subspace and the characteristic probability distribution as the cluster model. A data point in a (tentative) cluster could be either a core point or an outlier, where core points are defined as points in the cluster’s subspace which follow the characteristic probability distribution of the cluster model, while the
outliers are points that do not follow the distribution specified by the cluster model. We will also call the outliers noise (points).
Having outliers is one reason that prevents conventional clustering methods from finding the right cluster model (using e.g. PCA). If the cluster model is known, filtering outliers is relatively easy – just remove the points which fit worst according to the cluster model. Likewise, determining the model when clusters are already purified from outliers is equally simple. What makes the problem difficult and interesting is that we have to filter outliers without knowledge of the cluster model and vice versa.
Partitioning clustering algorithms such as those based on K-means or K-medoids typically produce clusters in which noise and core points are mixed. The quality of these clusters is hurt by the existence of noise, which leads to a biased estimation of the cluster model.
We propose an algorithm for purifying a cluster such that, after the processing, noise points are separated from their original cluster and form a cluster of their own. We start with a short overview of our purification method before going into the details. The procedure gets as input a set of clusters C = {C_1, ..., C_k} produced by an arbitrary clustering method. Each cluster C_i is purified one by one: First, the algorithm estimates an orthonormal matrix, called the decorrelation matrix (V), to define the subspace of cluster C_i. A decorrelation matrix defines a similarity measure (an ellipsoid) which can be used to determine the boundary that separates the core points and outliers. Our procedure picks the boundary which corresponds to the lowest overall VAC value of all points in cluster C_i. The noise points are then removed from the cluster and stored in a new cluster. Next, we elaborate on the steps for purifying a cluster of points.
Robust Estimation of the Decorrelation Matrix

The decorrelation matrix of a cluster C_i contains the vectors that define (span) the space in which the points in cluster C_i reside. By diagonalizing the covariance matrix Σ of these points using PCA (Σ = V Λ V^T), we obtain an orthonormal Eigenvector matrix V, which we define as the decorrelation matrix. The matrices V and Λ have the following properties: the decorrelation matrix V spans the space of the points in C, and all Eigenvalues in the diagonal matrix Λ are positive. To measure the distance between two points ~x and ~y, taking into account the structure of the cluster, we use the Mahalanobis distance defined by Λ and V:

$$d_{\Sigma_C}(\vec{x}, \vec{y}) = (\vec{x} - \vec{y})^T \cdot V \cdot \Lambda^{-1} \cdot V^T \cdot (\vec{x} - \vec{y})$$
Given a cluster of points C with center ~µ, the conventional way to estimate the covariance matrix Σ is to compute a matrix Σ_C from the points ~x ∈ C by the following averaging:

$$\Sigma_C = \frac{1}{|C|} \sum_{\vec{x} \in C} (\vec{x} - \vec{\mu}) \cdot (\vec{x} - \vec{\mu})^T,$$
where (~x − ~µ) · (~x − ~µ)T is the outer vector product of the centered data. In other words, the (i, j)-entry of the matrix ΣC , (ΣC )i,j , is the covariance between the i-th and j-th attributes, which is the product of the attribute values (xi − µi ) · (xj − µj ), averaged over all data points ~x ∈ C. ΣC is a d×d matrix where d is the dimension of the data space.
The two main problems of this computation when confronted with clusters containing outliers are that (1) the centering step is very sensitive to outliers, i.e. outliers may heavily move the determined center away from the center of the core points, and (2) the covariances are heavily affected by wrongly centered data and by the
outliers as well. Even a small number of outliers may thus completely change the decorrelation matrix. This effect can be seen in Figure 4.3, where the center has been wrongly estimated using the conventional estimation. In addition, the ellipsoid which shows the estimated "data spread" corresponding to the covariance matrix has a completely wrong direction which is not followed by the core points of the cluster.

Figure 4.3: Conventional and robust estimation.
To improve the robustness of the estimation, we apply an averaging technique which is much more outlier-robust than the arithmetic mean: the coordinate-wise median. To center the data, we determine the median of each attribute independently. The result is a data set whose origin is close to the center of the core points of the cluster (~µ_R), rather than the center of all points (~µ).
A similar approach is applied for the covariance matrix: Here, each entry of the robust covariance matrix (Σ_R)_{i,j} is formed by the median of (x_i − µ_{R_i}) · (x_j − µ_{R_j}) over all points ~x of the cluster. The matrix Σ_R reflects the covariances of the core points more faithfully than the covariance matrix obtained by the arithmetic mean.
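A minimal NumPy sketch of the median-based estimation described above, given as an illustration only (a constant may still have to be added to the diagonal, as discussed next, to restore diagonal dominance).

import numpy as np

def robust_center_and_covariance(X):
    mu_r = np.median(X, axis=0)                  # coordinate-wise median as robust center
    centered = X - mu_r
    d = X.shape[1]
    sigma_r = np.empty((d, d))
    for i in range(d):
        for j in range(d):
            # each entry is the median of (x_i - mu_Ri)(x_j - mu_Rj) over all points
            sigma_r[i, j] = np.median(centered[:, i] * centered[:, j])
    return mu_r, sigma_r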
The arithmetic-mean covariance matrix Σ_C has the diagonal dominance property, where each diagonal element Σ_{i,i} is greater than the sum of the other elements of the row Σ_{∗,i}. The direct consequence is that all Eigenvalues in the corresponding diagonal matrix Λ are positive, which is essential for the definition of d_Σ(~x, ~y).
However, the robust covariance matrix Σ_R might not have the diagonal dominance property. If Σ_R is not diagonally dominant, we can safely add a matrix φ · I to it without affecting the decorrelation matrix. The value φ can be chosen as the maximum difference of all column sums and the corresponding diagonal element (plus some small value, say 10%):

$$\phi = 1.1 \cdot \max_{0 \le i < d} \left\{ \left( \sum_{j \ne i} (\Sigma_R)_{j,i} \right) - (\Sigma_R)_{i,i} \right\}$$

while |C| > 1 and savedCost > 0
    Find the best pair of clusters to merge:
        [(C*_1, C*_2), mergedVAC(C*_1, C*_2)] = findMergePair(clusters C);
    Merge C*_1 and C*_2 as C_new = {C*_1 ∪ C*_2}: C = C − {C*_1, C*_2} ∪ {C_new};
    Set VAC(C_new) := mergedVAC(C*_1, C*_2);
end while
while |C| > 1 and counter < t    // Improved search: getting out of local minima
    Find the best pair of clusters to merge:
        [(C*_1, C*_2), mergedVAC(C*_1, C*_2)] = findMergePair(clusters C);
    counter++;
    Merge C*_1 and C*_2 as C_new = {C*_1 ∪ C*_2};
    Set VAC(C_new) := mergedVAC(C*_1, C*_2);
end while
return the clustering C with the minimum overall VAC value found during the t iterations;

subroutine [(C*_1, C*_2), mergedCost(C*_1, C*_2)] = findMergePair(clusters C)
    // Find the cluster pair with the best (maximal) savedCost.
    for all cluster pairs (C_i, C_j) ∈ C × C
        mergedVAC(C_i, C_j) := VAC(C_i ∪ C_j);
        savedCost(C_i, C_j) := (VAC(C_i) + VAC(C_j)) − mergedVAC(C_i, C_j);
    find the cluster pair to merge: (C*_1, C*_2) = argmax_{(C_i, C_j)} savedCost(C_i, C_j);
    return the cluster pair (C*_1, C*_2) and their mergedCost(C*_1, C*_2);
Figure 4.11: RIC algorithm.
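A compact Python rendering of the greedy merging phase of Figure 4.11 is given below as a sketch only; vac_of(...) stands for a user-supplied function implementing the VAC of Definition 4 and is not defined here.

def greedy_merge(clusters, vac_of):
    # clusters: list of point sets; vac_of(points) returns the VAC of that set
    vac = [vac_of(c) for c in clusters]
    while len(clusters) > 1:
        best, best_saving = None, 0.0
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                merged_vac = vac_of(clusters[i] | clusters[j])
                saving = vac[i] + vac[j] - merged_vac        # savedCost(Ci, Cj)
                if saving > best_saving:
                    best, best_saving, best_vac = (i, j), saving, merged_vac
        if best is None:                                      # no merge reduces the total VAC
            break
        i, j = best
        clusters[i] = clusters[i] | clusters[j]               # merge Ci and Cj
        vac[i] = best_vac
        del clusters[j], vac[j]
    return clusters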
Chapter 5 Semi-Supervised Clustering
The previous chapter was dedicated to unsupervised clustering, i.e. to deriving a grouping of the data set into distinct classes without any prior knowledge. In this chapter, a semi-supervised clustering method is presented which, when assigning the objects to clusters, considers class labels given for some of the objects in addition to the data distribution. Again, the focus is on avoiding sensitive input parameters, but in contrast to the previous chapter, a hierarchical clustering notion is used. This enables the proposed algorithm HISSCLU (Hierarchical Density Based Semi-supervised Clustering) to provide a concise visualization of the class and cluster structure of a data set. After an introduction to the semi-supervised clustering task in the next section, a survey of related work is given in Section 5.2. HISSCLU consists of three major parts which are elaborated in Section 5.3: an algorithm for assigning the unlabeled objects to clusters in a way which is most consistent with the class labels of the labeled objects and the data distribution of all objects (cf. Section 5.3.1), the careful use of a local distance weighting function in cases where no natural cluster boundaries exist (cf. Section 5.3.2), and the visualization of the result (cf. Section 5.3.3). Section 5.4 provides an extensive experimental evaluation using metabolic data and data on protein localization sites. The chapter ends with some concluding remarks in Section 5.5.
5.1 Introduction
In many biomedical applications, huge amounts of unlabeled data are available, e.g. metabolite concentrations from blood samples of thousands of patients. Labeling unlabeled data according to classes, e.g. healthy - disease A - disease B, is a complex task which often requires domain knowledge and has to be done by human experts. Therefore class labels are often only given for a part of the objects. Consequently, semi-supervised learning, which considers both labeled and unlabeled data, has attracted much attention in recent years [16, 17, 68, 69]. In this chapter we focus on semi-supervised clustering, i.e. the use of given class labels (maybe only very few) to improve unsupervised clustering. Most approaches, e.g. [69, 68] and also [16, 70] to a certain extent, achieve this goal by enforcing two types of constraints during the clustering process: Cannot-links are applied to prevent objects with different labels from being grouped together; must-links between identically labeled objects force them into a common cluster.
In this chapter, we propose HISSCLU, a hierarchical density based approach to semi-supervised clustering which avoids the use of explicit constraints for the following reasons: To gain insight into the modality of the data set and to be informed about previously unknown sub-classes, it is not helpful to apply must-links forcing objects into the same cluster which are actually dissimilar. For example, consider Figure 5.1, depicting a 2-d data set with four clusters of different object densities. In addition, for seven objects a class label is known. These labeled objects from two different classes are marked by two different symbols. Both classes are multimodal, i.e. they consist of different sub-classes distributed over different clusters. In addition, we have one labeled object of class X which is an outlier. Must-link constraints force this object into a common cluster with the other labeled objects of its class. The information that this object is completely different from all other objects in the data set is lost.
Figure 5.1: Example Data Set.

Cannot-links remove any information about how similar the objects of different classes actually are. In our example, one cluster is shared by three labeled objects of both classes which are very similar. Taking more than two classes into account, the important information that objects of some classes A and B are more similar than objects of classes C and D (indicating a hierarchical relationship of the classes A and B) would be lost by the enforcement of cannot-links. Instead of deriving constraints from the labeled objects, HISSCLU expands the clusters starting at all labeled objects simultaneously. As an additional add-on over comparative methods, HISSCLU assigns class labels to the unlabeled objects during the cluster expansion. The labeling is maximally consistent with both the cluster structure of all objects and the given class labels of the labeled objects. The result of HISSCLU is visualized in the semi-supervised cluster diagram, giving a concise illustration of the hierarchical class and cluster structure. The cluster diagram provides answers to the following questions even for high-dimensional and moderate to large scale data sets:

• Q1: Are there clusters shared by more than one class?
• Q2: Are there multi-modal classes distributed over more than one cluster?
• Q3: Which are the most similar/dissimilar classes?
• Q4: Is there any class hierarchy and how well does it correspond to the cluster hierarchy?

To determine the hierarchical class and cluster structure, we base our approach on a hierarchical density based clustering notion ([8], [39]). More specifically, we give the following problem specification for semi-supervised clustering.

Problem Specification. The objective of our method is to determine a hierarchical clustering of the labeled and unlabeled objects with maximally large, class-pure sub-clusters of high density. We can identify the following two goals:

• G1: High density: Clusters are regions of high density separated by regions of lower density.
• G2: Class-purity: As many clusters as possible are uniformly labeled.

For a hierarchical clustering approach, G1 means that sub-clusters have a higher density than the super-clusters in which they are nested (cf. [39]). G2 means that maximally large sub-clusters are uniformly labeled. In this chapter, we propose an efficient algorithm that exploits the concepts of density based clustering to address G1 and carefully applies a local distance weighting technique to increase the class-purity of clusters (G2).
Solution Overview. We can exploit the cluster hierarchy for cluster expansion and labeling of the unlabeled objects: Starting from each object ~x, we can go upwards in the cluster hierarchy until we have reached an inner node n which contains at least one labeled object in its subtree. In many cases, the contained labeled objects are class-pure, and we can safely assign ~x to this sub-cluster and assign the corresponding class label to ~x. Whenever the subtree rooted by n is class-impure, we additionally consider spatial coherency for clustering and labeling, thus assigning areas of neighboring objects to the
same cluster. The border between two areas of differently labeled objects should be positioned in the area of least data density. Since in areas of a relatively uniform data density the border is rather random, we propose the careful use of a local weighting function overriding the random differences of inter-point distances but not overriding actual cluster-boundaries.
Table 5.1 summarizes some notations which are frequently used in this chapter (see also Table 1.1 for general notations). In addition to the d feature attributes, for some of the objects a categorical class label L_i ∈ L is given. We call them the pre-labeled objects P ∈ P. During the run of the algorithm, the previously unlabeled objects U ∈ U also obtain a class label. Wherever no distinction between pre-labeled and unlabeled objects is required, we denote the objects by ~x. We will also use the notions of the ε-neighborhood N_ε(~x) = {~y ∈ DS | dist(~x, ~y) ≤ ε} and the set of k-nearest neighbors of an object ~x, denoted by NN_k(~x). We use the notation nn-dist_k(~x) for the distance to the k-th nearest neighbor of an object ~x ∈ DS.

Table 5.1: Symbols and Acronyms Used in Chapter 5.
HISSCLU: Hierarchical Density based Semi-supervised Clustering.
P (calligraphic): Set of pre-labeled objects.
U (calligraphic): Set of unlabeled objects.
P: A pre-labeled object.
U: An unlabeled object.
N_ε(~x): ε-neighborhood of point ~x.
NN_k(~x): Set of k nearest neighbors of point ~x.
nn-dist_k(~x): Distance to the k-th nearest neighbor of ~x.
Figure 5.2: CCL algorithm - example.
5.2 Survey
In this section we give a brief survey of related work on semi-supervised clustering and label propagation. Since we base our approach on a hierarchical density based cluster notion, some basic concepts which are closely related to OPTICS [39] are also defined (see also Chapter 3).

5.2.1 Semi-Supervised Clustering
Several constraint-based approaches in the field of semi-supervised clustering have appeared. Most of them extend existing clustering methods, such as complete link, to incorporate constraints, e.g. [69] and [71] for numerical constraints. In the CCL algorithm [69], complete link clustering [35] is applied after replacing the Euclidean distance by a shortest path distance. The distance matrix is modified by setting the distance between all pairs of identically labeled objects to zero and the distance between all pairs of labeled objects of different classes to a value larger than the maximal distance appearing in the data set. Due to this rigid and global transformation of the data space, objects of the same class are forced to be in the same cluster. This principle is illustrated in Figure 5.2.
COP-K-Means [68] is a K-Means [36] based algorithm enforcing constraints. Must-link constraints are established between all pairs of identically labeled objects and cannot-links between all pairs of differently labeled objects. Objects are assigned to clusters without violating any of the constraints. Recently, constraints have been used in a softer way to improve the clustering result, e.g. by using probabilistic models [72] or fuzzy clustering [73]. In [70] the authors propose MPC-K-Means, a K-Means based algorithm that considers both constraints and the data distribution to assign objects to clusters. A cost function for violating must-link and cannot-link constraints is defined. The clustering objective function minimizes both the link violation cost and the deviation of the objects from the cluster centers. In addition, a metric learning step is performed after each iteration to globally adapt a weighted Euclidean distance. In [16] the authors generalize this technique by proposing a probabilistic framework for semi-supervised clustering that additionally supports several non-Euclidean distance measures, e.g. cosine similarity for text data.
5.2.2 Label Propagation

The label propagation algorithm [74] is related to our approach since HISSCLU assigns class labels to the unlabeled objects during the clustering process. Label propagation first constructs an n × n similarity matrix T (n being the number of all – labeled and unlabeled – objects) and a second matrix Y (n × |L|) for fuzzy class assignment, where the entry Y_{i,j} indicates to what degree the object ~x_i belongs to class L_j. Then labels are propagated by iteratively multiplying Y := T · Y until the matrix Y converges. This algorithm requires a storage complexity of O(n²) and is hence not scalable to database environments. Moreover, no clustering method and no visualization technique is proposed, which is the main focus of our approach.
5.2.3 Density Based Clustering
Density based and hierarchical clustering algorithms, such as DBSCAN [8], Single-Link [35] and OPTICS [39], find clusters of arbitrary shape and number. Clusters are connected dense regions in the feature space that are separated by regions of lower density. The hierarchical density based clustering notion is the basis for HISSCLU, with the following two main advantages: In contrast to partitioning methods, no sensitive input parameters are required, and a visualization technique illustrating even complex class and cluster hierarchies is provided. Therefore, the fundamental notions of core distance and reachability distance are now formally defined. For a survey on hierarchical density based clustering see also Chapter 3, Section 3.3.2. Let us note that the original definitions of OPTICS use an additional parameter ε which is left out here for simplicity reasons. This parameter specifies the maximum reachability distance for cluster expansion and should be set to a large value.

Definition 5 (Core Distance) The core distance of object ~x ∈ DS w.r.t. MinPts ∈ N is defined as

$$Core_{MinPts}(\vec{x}) = \text{nn-dist}_{MinPts}(\vec{x})$$

The core distance of an object ~x measures the density around ~x. It is defined as the MinPts-nearest neighbor distance of ~x.

Definition 6 (Reachability Distance) The reachability distance of an object ~y ∈ DS relative to an object ~x ∈ DS w.r.t. MinPts ∈ N is defined as

$$Reach_{MinPts}(\vec{x}, \vec{y}) = \max\{Core_{MinPts}(\vec{x}), dist(\vec{x}, \vec{y})\}$$
The OPTICS algorithm operates on a seed list SL which is initialized with an arbitrary, unprocessed object whenever it is empty. The unprocessed objects are stored in the seed list, ordered by a criterion which is described below. In the main loop, the algorithm selects the top (minimum) element ~t of SL and appends it to the output. The ordering criterion for each object ~x in the seed list is the minimum of all reachability distances from any of the objects in the output to ~x:

$$\vec{x}.order = \min_{\vec{y} \in \text{output}} \{Reach_{MinPts}(\vec{x}, \vec{y})\}$$

Consequently, the seed list is updated after each iteration of the loop: Some new objects which are density-reachable from ~t may be inserted, and ~x.order is updated for all those objects for which Reach_{MinPts}(~t, ~x) is less than the previously stored ~x.order.
5.2.4 Contributions
Our new method HISSCLU has the following main advantages over previous methods:

a) The result is determinate, robust against noise, and the method does not favor clusters of convex shape.

b) In contrast to constraint-based methods the original cluster structure is preserved. Therefore, the result gives valuable information about previously unknown class and cluster hierarchies.

c) Our method assigns class labels to the unlabeled objects in a way which is maximally consistent with the cluster structure and observes spatial coherency of class labeling.
d) We propose a visualization method which allows a clear and concise illustration of the semi-supervised cluster structure even for moderate to high numbers of objects.
5.3 Proposed Method
This section elaborates the three major ideas of HISSCLU in detail. Section 5.3.1 explains the cluster expansion process. Section 5.3.2 introduces a local distance weighting function to smooth the borders between differently pre-labeled objects, and Section 5.3.3 describes the visualization of the result.
5.3.1 Cluster Expansion
In this section we elaborate how our algorithm expands the clusters starting at each of the pre-labeled objects simultaneously and how the class labels are naturally assigned to the unlabeled objects during this process. Two objects share a common density based cluster if they are density connected, i.e. if there exists a path of core objects between them and the distance between path-neighbors is nowhere larger than ε. For an unlabeled object U we consider the paths each starting from one of the pre-labeled objects P ∈ P and ending in U. The object U belongs to the cluster and adopts the class label of that pre-labeled object P for which a path with minimum ε exists. In the following we define the notions of a path and the path reachability distance, which corresponds to the minimum ε in the DBSCAN algorithm. First of all, we define a path to be an arbitrary sequence of objects starting with a pre-labeled object and, apart from that, containing only unlabeled objects, because we are only interested in such sequences here.

Definition 7 (Path) Let $S = \langle O_1, O_2, ..., O_n \rangle$ be a sequence of objects, where $O_1$ is a labeled object and $O_2, ..., O_n$ are distinct, unlabeled objects. Then we call S a path from $O_1$ to $O_n$.
We now have to define an ordering predicate, denoted by $<_{PRD}$, to compare two or more paths, deciding which of them is shorter or better. The main criterion for the comparison is the maximum distance between subsequent objects on the path. If two paths share the same maximum distance we consider the second largest distance on the path, and so on. For simplicity we assume for the following definitions that the distance between every pair of objects is different, i.e. $dist(\vec{x}_1, \vec{y}_1) = dist(\vec{x}_2, \vec{y}_2) \Rightarrow \{\vec{x}_1, \vec{y}_1\} = \{\vec{x}_2, \vec{y}_2\}$. In practice, tie situations can be resolved in a nondeterministic way. Let also $\varepsilon \in \mathbb{R}^+_0$ and $MinPts \in \mathbb{N}$.

Definition 8 (Restricted path reachability distance) Let $S = \langle O_1, ..., O_n \rangle$ be a path. Then $I(\varepsilon) \subseteq \{1, ..., n-1\}$ is the index set such that $i \in I(\varepsilon) :\Leftrightarrow Reach_{MinPts}(O_i, O_{i+1}) < \varepsilon$. The restricted path reachability distance of a path S w.r.t. ε and MinPts, denoted by $rPRD_{\varepsilon, MinPts}(S)$, is the maximum of those reachability distances of two adjacent objects on S that are less than ε:

$$rPRD_{\varepsilon, MinPts}(S) = \begin{cases} 0 & \text{if } I(\varepsilon) = \emptyset \\ \max_{i \in I(\varepsilon)} \{Reach_{MinPts}(O_i, O_{i+1})\} & \text{otherwise.} \end{cases}$$
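The following is a minimal sketch of Definition 8, assuming that a function reach(a, b) returning $Reach_{MinPts}(a, b)$ for two adjacent path objects is already available (for instance the brute-force version shown after Definition 6).

def restricted_path_reachability(path, reach, eps):
    # path: sequence <O_1, ..., O_n> of objects; reach(a, b): Reach_MinPts(a, b)
    steps = [reach(path[i], path[i + 1]) for i in range(len(path) - 1)]
    below = [r for r in steps if r < eps]     # reachability values whose index lies in I(eps)
    return max(below) if below else 0.0       # 0 if I(eps) is empty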
The above definition allows us to determine the actual maximum distance on the path by setting ε = ∞, or to determine the maximum distance below some specified threshold ε₀. We now inductively define the ordering of paths.

Definition 9 (Ordering different paths) Let $S_1$ and $S_2$ be two paths. We say that $S_1$ is of less path reachability distance than $S_2$, denoted $S_1 <_{rPRD(\varepsilon, MinPts)} S_2$, if one of the following conditions applies:

1. $rPRD_{\varepsilon, MinPts}(S_1) < rPRD_{\varepsilon, MinPts}(S_2)$, or

2. $rPRD_{\varepsilon, MinPts}(S_1) = rPRD_{\varepsilon, MinPts}(S_2) =: \varepsilon_0$ and $S_1 <_{rPRD(\varepsilon_0, MinPts)} S_2$.
$$\frac{1}{4} \, d_{AP}(A, U) \cdot d_{AP}(P, U) \cdot \bigl(d_{AP}(A, U) + d_{AP}(P, U)\bigr)^2$$
Figure 5.6 visualizes the overall weight function $w(U) = w_{A,B}(U)$ for a 2-d synthetic example with 7 pre-labeled objects of two different classes (see also Figure 5.12).
Figure 5.6: Weighting example.

The weight function w(U) is applied for weighting the distance between two objects $U_i$ and $U_j$. To obtain a stable and consistent result we have to use the maximum of w(X) over all points X on the line segment $[U_i, U_j]$. Due to monotonicity we can derive the following:

1. If $U_i$ and $U_j$ have different nearest neighbors in P with different class labels, then the line segment crosses one of the perpendicular bisector planes, and therefore the maximum weight corresponds to ρ.

2. Otherwise, the maximum of the weight function must be attained at one of the two end points $U_i$ or $U_j$, because inside the same Voronoi cell the weight function is monotonic.

We define the following weighting function for a pair of objects:
$$w(U_i, U_j) = \begin{cases} \rho & \text{if } L(NN(U_i)) \neq L(NN(U_j)) \\ \max\{w(U_i), w(U_j)\} & \text{otherwise.} \end{cases}$$
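A small sketch of this pairwise rule is given below. It assumes that the single-object weight function w(U) and a helper nn_label(U) returning the class label of the pre-labeled object closest to U are already available; both are taken as given here, since their full definitions belong to the distance weighting of Section 5.3.2.

def pairwise_weight(u_i, u_j, w, nn_label, rho):
    # rho: maximum weight (HISSCLU parameter); w: single-object weight function;
    # nn_label: class label of the nearest pre-labeled object.
    if nn_label(u_i) != nn_label(u_j):
        return rho                        # the segment crosses a bisector plane
    return max(w(u_i), w(u_j))            # the weight is monotonic inside one Voronoi cell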
5.3.3 Visualization
For visualization we merge the |L| different cluster order lists into a common one. Some of them have to be partially reordered because the start object changes from the pre-labeled object to the object which is closest to the previous cluster in the overall order.
Figure 5.7: Cluster Diagram, showing the reachability area, the evaluation area with the pre-labeled objects, and the extracted k-clusters c1 to c5.
This does not affect the overall runtime complexity of O(n²). In the semi-supervised cluster diagram each object is represented by a histogram bin with a length corresponding to the reachability distance of the object in the final (merged) cluster order. Therefore, clusters of the data set can be recognized as valleys in the cluster diagram. Hierarchically nested clusters correspond to sub-valleys in a common valley delimited by higher peaks than those between the sub-valleys. The class labeling is coded in different colors. Therefore, the consistency between cluster and class structure can easily be recognized in the cluster diagram. To mark the pre-labeled objects in the diagram, we stretch the corresponding bins slightly below the x-axis of the diagram. To facilitate the evaluation of our technique, we extract clusters using one certain density threshold ε = k · maxRd (where maxRd denotes the maximum reachability distance), which we call k-clusters. The k-clusters are depicted as horizontal lines underneath the diagram. We also draw the histogram bars in the color coding the true class label of the objects below the cluster diagram. Figure 5.7 above shows the annotated cluster diagram of our 2-d example with 5 extracted clusters. The cluster diagram shows interesting properties of the data set: both classes are multimodal, one class is distributed over the clusters c1 and c5, the other class over c2, c3 and c4. The clusters c1 and c2 are the most similar
clusters among all extracted clusters, i.e. they are separated by the smallest reachability distance, and they contain differently pre-labeled objects. For comparison see also Figure 5.1 depicting the data set.
5.4 Experiments
For evaluation and for comparison with partitioning methods we extract clusters from the cluster diagram using a certain density threshold. To automatically extract clusters from the OPTICS reachability-plot, the algorithms ξ-cluster [39] and cluster-tree [75] have been proposed. These algorithms can also be applied to a cluster diagram, but they do not extract a flat cluster structure. Therefore, we simply cut the reachability-plot and the cluster diagram horizontally. Let $d_{max}$ be the maximum reachability distance. We require that clusters are separated by a distance of at least $d_{sep} = k \cdot d_{max}$, with $k \in [0..1]$. In addition, we require that each cluster has at least MinPts objects.

Definition 12 (k-clustering) Let DS be a set of objects, $d_{max}$ their maximum reachability distance, $k \in [0..1]$, $n, MinPts \in \mathbb{N}$ and $n \geq MinPts$. A sequence $S = \langle S_1, ..., S_n \rangle$ in the reachability-plot or the cluster diagram of DS is called a k-cluster if

1. $Reach(S_1) \geq k \cdot d_{max}$, and

2. $Reach(S_2), ..., Reach(S_n) < k \cdot d_{max}$.

Let $C = \{C_1, ..., C_n\}$ be the set of all k-clusters in DS. All objects $o \in DS$ with $o \notin C_i$ for every $C_i \in C$ are called noise objects. Let N be the set of noise objects. We call $C \cup N$ a k-clustering. For comparison with partitioning methods we compute the Mutual Information of the k-clustering [76].

Definition 13 (Mutual Information of the k-clustering) Let $L = \{L_1, ..., L_{|L|}\}$ be the set of classes, $C = \{C_1, ..., C_{|C|}\}$ be the set of k-clusters and $O_c$
be the set of cluster objects, i.e. $O_c = \{\vec{o} \in DS \mid \exists C_j \in C: \vec{o} \in C_j\}$. Let $h(L_i, C_j)$ be the number of objects of class $L_i$ assigned to cluster $C_j$, $h(C_j)$ the total number of objects assigned to cluster $C_j$ and $h(L_i)$ the total number of objects belonging to class $L_i$.
$$MI = -\sum_{i=1}^{|L|} \frac{h(L_i)}{|O_c|} \cdot \log_2 \frac{h(L_i)}{|O_c|} + \sum_{i=1}^{|L|} \sum_{j=1}^{|C|} \frac{h(L_i, C_j)}{|O_c|} \cdot \log_2 \frac{h(L_i, C_j)}{h(C_j)}$$
The Mutual Information $MI \in \mathbb{R}^+$ reflects to which degree a k-clustering corresponds to the class label distribution. We scale this quality measure to the range [0..1].
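A minimal sketch of the evaluation procedure follows: the horizontal cut of Definition 12 applied to a reachability plot or cluster diagram, and the Mutual Information of Definition 13 computed from the resulting k-clusters. The array-based representation of the cluster order and the handling of the very first object are assumptions made for the example, and the final scaling to [0..1] is omitted.

import numpy as np
from collections import Counter

def extract_k_clusters(reach, k, min_pts):
    # reach: reachability distances in cluster order; a k-cluster starts at every
    # position whose reachability is >= k * max(reach) (Definition 12).
    threshold = k * np.max(reach)
    clusters, current = [], None
    for idx, r in enumerate(reach):
        if r >= threshold:
            if current is not None and len(current) >= min_pts:
                clusters.append(current)
            current = [idx]                      # start a new candidate cluster
        elif current is not None:
            current.append(idx)
    if current is not None and len(current) >= min_pts:
        clusters.append(current)
    return clusters                              # objects in no cluster are noise

def mutual_information(clusters, labels):
    # Mutual Information of the k-clustering (Definition 13), unscaled.
    members = [i for c in clusters for i in c]   # the cluster objects O_c
    n_c = len(members)
    h_class = Counter(labels[i] for i in members)
    mi = -sum(h / n_c * np.log2(h / n_c) for h in h_class.values())
    for c in clusters:
        for h in Counter(labels[i] for i in c).values():
            mi += h / n_c * np.log2(h / len(c))
    return mi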
In the following, we present results on synthetic data, on various data sets obtained from the UCI machine learning repository [77] (Ecoli, Yeast, Glass, Liver) and on a high dimensional metabolic data set. See Chapter 2 for more information on the Ecoli and Yeast data sets on predicting protein localization sites (cf. Section 2.3) and on metabolic data (cf. Section 2.4). Besides biomedical data, the Glass data set has been used because it is a well-known benchmarking data set for classification problems and it contains a distinct class hierarchy. In Section 5.4.1 we demonstrate that the cluster diagram provides valuable information on the hierarchical class and cluster structure. We compare the cluster diagram with the OPTICS reachability plot here, since no other semi-supervised clustering method offers a visualization. In Sections 5.4.2 and 5.4.3 we compare the performance of HISSCLU with COP-K-Means and MPC-K-Means in terms of clustering quality, and in Section 5.4.4 we give hints on parameter selection.
5.4.1 Visualizing Class- and Cluster-Hierarchies
Ecoli Data

In Figure 5.8 the OPTICS reachability-plot and two cluster diagrams generated by HISSCLU for the Ecoli data set are depicted (MinPts = 5, ρ = 2, ξ = 0.5). We extracted clusters for k = 0.2.
Figure 5.8: Visualizing semi-supervised class- and cluster-hierarchies, results on the Ecoli data set: (a) OPTICS reachability-plot; (b) 20 pre-labeled objects (random); (c) 8 pre-labeled objects (one per class).

This 7-dimensional data set on predicting protein localization sites with 336 data objects and 8 classes is highly unbalanced, with 2 to 142 objects per class [77]. The unsupervised reachability-plot in Figure 5.8(a) shows 2 k-clusters, each containing objects of mainly two classes, and a large amount of noise. In Figure 5.8(b) the cluster diagram using 20 randomly sampled objects as pre-labeled objects is shown. We extracted 5 clusters corresponding to the 5 largest classes. It can be seen that some of the classes are multimodal, e.g. the class of periplasm proteins forming cluster (a). The classes of inner membrane proteins without signal sequence (cluster (c1)) and inner membrane proteins with an uncleavable signal sequence (cluster (c2)) are the most similar classes in this data set. The maximum separating reachability distance between these classes is the smallest one among all clusters. In fact these classes share the common cluster (c) at a higher level of the cluster hierarchy. This corresponds well to the biological ground truth since the presence or absence of an uncleavable signal sequence is somewhat arbitrary [31]. Figure 5.8(c) shows the cluster diagram using one object per class as pre-labeled object (7 k-clusters for k = 0.15). The class of outer membrane lipoproteins consisting of 5 instances (cluster (d)) shows the maximal difference to all other classes. The two instances of inner membrane proteins with cleavable signal sequence are quite similar to periplasm proteins, forming together cluster (a). Already with this minimal amount of supervision HISSCLU manages to determine the correct class hierarchy, cf. clusters (c1), (c2) and (c).
Figure 5.9: Visualizing semi-supervised class- and cluster-hierarchies, results on the Yeast data set: (a) OPTICS reachability-plot; (b) 100 pre-labeled objects (random); (c) 500 pre-labeled objects (random).
Yeast Data

The Yeast data set is a 9-dimensional data set consisting of 1448 instances belonging to 10 classes. Similar to Ecoli, this data set on predicting protein localization sites is highly unbalanced (5 to 423 objects per class) but much more challenging for classification [31]. The OPTICS reachability plot of this data set shows no cluster structure at all (cf. Figure 5.9(a)). In the cluster diagram (MinPts = 5, ρ = 5.0, ξ = 5.0), already for 100 pre-labeled objects several distinct clusters (extracted for k = 0.2) can be observed (cf. Figure 5.9(b)). Cluster (a) consists predominantly of objects from three classes: cytosolic, nuclear and mitochondrial proteins. Cluster (b) represents membrane proteins without N-terminal signal. Cluster (c) with two sub-clusters represents the classes of membrane proteins with uncleaved and cleaved signal. For 500 pre-labeled objects the cluster purity increases, as expected (cf. Figure 5.9(c)). The two sub-clusters of cluster (c) become more separated, forming clusters (c) and (d). For cluster (a) two sub-clusters can now be observed: cluster (a2) predominantly consists of objects of the class of cytosolic proteins, whereas cluster (a1) is mainly shared by mitochondrial and nuclear proteins. This reflects a fundamental difficulty in identifying nuclear proteins, which can also be observed for most classification methods [19] and has a biological reason: the nuclear localization signal is not limited to one portion of a protein's primary sequence, and in some cases a protein without a nuclear localization signal may be transported to the nucleus [78]. However, cluster (a1) contains several class-pure sub-clusters of nuclear proteins, extracted for k = 0.05.
Figure 5.10: OPTICS reachability plots: (a) Glass data; (b) Metabolic data; (c) Liver data.
Glass Data

The glass data set comprises 9 numerical attributes representing different physical and chemical properties. 214 instances are labeled according to 7 classes representing various types of glass. Two clusters have been extracted from the OPTICS reachability plot for k = 0.2, cf. Figure 5.10(a). Only the class ”tableware” is well separated, forming a distinct cluster (d); the other cluster (a) consists of objects of all other classes. In the cluster diagram (cf. Figure 5.11(a)) the cluster hierarchy becomes obvious. Instances of the classes ”building windows float processed”, ”building windows non float processed” and ”vehicle windows float processed” are very similar, forming together one cluster of objects of type window glass (a).
This cluster contains four sub-clusters for k = 0.2: Clusters (a1)
and (a2) mainly represent objects of the class ”building window float processed” and (a4) ”building window non float processed”, whereas (a3) contains objects of all three classes. Objects of the class ”container” forming cluster (b) are more similar to the objects of
type window glass (cluster (a)) than to the other two clusters representing the classes ”headlamps” (cluster (c)) and ”tableware” (cluster (d)).

Figure 5.11: HISSCLU Cluster Diagram: (a) Glass data; (b) Metabolic data; (c) Liver data.
Metabolic Data

Figure 5.11(b) shows the cluster diagram of a metabolic data set for MinPts = 3, ρ = 4, ξ = 0.5. This 41-dimensional data set (132 instances) was produced by modern screening methodologies and represents cases of phenylalanine hydroxylase deficiency (PAHD), consisting of two different expressions of this metabolic disorder, the milder form called HPA and the stronger expression called PKU, together with a control group [79]. We used 15 sample points as pre-labeled objects. For both diagrams, we extracted k-clusters for k = 0.2. Most of the instances of class HPA are in cluster (b), which is more similar to the class-pure cluster (c) representing the healthy control group than to cluster (a) comprising predominantly instances of class PKU. Both PKU and, even more so, HPA form different sub-clusters corresponding to different sub-stages of the disease with blurred borders. For class HPA there are two distinct sub-clusters marked by (b1) and (b2). In the OPTICS reachability plot (for comparison depicted in Figure 5.10(b)) there is only one distinct cluster (c) representing the control group.
Figure 5.12: Cluster assignment: (a) COP-k-means, k = 2; (b) MPC-k-means, k = 2; (c) MPC-k-means, k = 5; (d) HISSCLU.
5.4.2 Spatial- and Class-Coherent Cluster Assignment
Due to simultaneous cluster expansion and careful local distance weighting, HISSCLU preserves the original cluster structure much better than the competing methods. Figure 5.12 shows the cluster assignment for COP-K-Means [68] and MPC-K-Means [16] for k = 2 (number of classes) and k = 5 (number of clusters extracted from the cluster diagram) on our 2-d example, with constraints generated between all pairs of pre-labeled objects. Not violating any of the constraints, COP-K-Means obtains an unnatural clustering result where even pre-labeled objects situated in the center of dense clusters are assigned to a different cluster (cf. Figure 5.12(a)). MPC-K-Means performs better (cf. Figure 5.12(b)) by considering both constraints and the data distribution when assigning objects to clusters. Due to this there are pre-labeled objects of both classes assigned to one cluster. For k = 5, COP-K-Means does not perform better (not depicted) and MPC-K-Means (cf. Figure 5.12(c)) splits up the cluster on top although there are must-linked objects inside. This reflects the inherent tendency of K-Means to detect spherically compact clusters. HISSCLU assigns objects to clusters in the way that is most coherent with the class labels and the local cluster structure (cf. Figure 5.12(d)).
5.4.3 Making Use of Supervision
In cases where no natural spatial cluster boundaries exist, the information provided by the pre-labeled objects should be used to improve the clustering. To examine how efficiently and effectively the algorithms make use of the supervision in this case, we compared their performance w.r.t. Mutual Information [76] on the liver data set (351 instances, 7 attributes, 2 classes, [77]). We selected a two-class data set showing no cluster structure at all in the reachability plot (cf. Figure 5.10(c)). We used MinPts = 5 and generated cluster diagrams with ρ = 20 and ξ = 10. This strong distance weighting is applied in order to obtain k-clusterings with two clusters and no noise (always done for k = 0.9), which are directly comparable to the partitionings into two clusters generated by MPC-K-Means and COP-K-Means (cf. Figure 5.11(c)). The task here is to achieve a clustering maximally respecting the class structure with as little supervision as possible. To provide all algorithms with the same amount of supervision, we randomly sampled objects out of the data set, which we directly used as pre-labeled objects for HISSCLU. For the other algorithms we generated all possible constraints between all pairs of these objects. MPC-K-Means has been parameterized as described in [16]; for COP-K-Means no advanced parameter settings are possible. The Mutual Information reflects to which extent a clustering corresponds to the class labels. Figure 5.13 shows the Mutual Information on the whole data set and on the unlabeled data w.r.t. the number of pre-labeled objects. MPC-K-Means does not succeed in making use of the constraints due to its global distance weighting. More constraints can even be misleading for global weighting in the situation of no natural spatial cluster borders. COP-K-Means performs quite well since the algorithm enforces the constraints without caring about the data distribution. HISSCLU performs even better despite not using any explicit constraints. This demonstrates the usefulness of our local weighting technique.
Figure 5.13: Comparison on the liver data set, Mutual Information w.r.t. the number of pre-labeled objects for COP-k-Means, MPC-k-Means and HISSCLU: (a) MI on all data; (b) MI on test data.
5.4.4 Parameter selection
From hierarchical density based clustering we have inherited the parameters ε and MinPts, which can be set as suggested in [39]. In addition, HISSCLU uses the parameters ρ and ξ to establish borders between classes when there are no clear natural cluster boundaries. In order to maximally preserve the original cluster structure it is recommended to start with ρ = 1.0..1.5 and ξ = 0.5 and to increase them if a better separation of the classes is desired.
5.5 Conclusion
In this chapter, we have proposed HISSCLU, a novel method for semi-supervised clustering. Our method is founded on a hierarchical, density-based cluster notion with the advantages of a determinate clustering result, high robustness with respect to noise and no preference for clusters of a particular shape (e.g. convex). HISSCLU consists of a method for cluster-consistent assignment of class labels to previously unlabeled objects and a method for the determination of the overall cluster structure of the data set in a way which is consistent with the original and the obtained class labels. In contrast to most previous methods, HISSCLU avoids the use of constraints in order to preserve the original cluster structure.
In a broad experimental evaluation we demonstrated that HISSCLU has the following advantages over state of the art semi-supervised clustering methods:

• making use of multimodal class information in the clustering process,

• detecting and visualizing hierarchical class and cluster structures,

• spatial and class coherent data partitioning.

The HISSCLU cluster diagram offers the user more insight into the class and cluster structure.
It provides answers to important questions, e.g. which classes are most similar? They may have a common superclass. Or, are there any multimodal classes? Classes may be distributed over several clusters in the cluster diagram. Moreover, our method is robust in terms of parameter settings. With a runtime complexity of O(n²) and a memory usage of O(n), HISSCLU is scalable to be used on top of moderate to large size databases.
An interesting direction for future research would be to investigate how the class hierarchy discovered by HISSCLU can be used to improve the classification of novel, unlabeled instances. In [80] Dong et al. proposed the principle of Nested Dichotomies. For multi-class problems, an ensemble of balanced classification trees has shown superior performance in terms of classification accuracy and efficiency. In the presence of a clear class hierarchy, it has been shown in [81] that classification trees respecting this hierarchy perform better than arbitrarily constructed balanced trees. However, there are several open questions: as HISSCLU is not adapted to a specific classifier, the derived hierarchy may not be beneficial for the classifier that is used. In addition, many data sets contain only an incomplete hierarchy, i.e. there may be some classes having distinct subclasses but not all of them.
Another interesting direction to follow is to enrich HISSCLU by elements of information theory, cf. Chapter 4. A reasonable flat clustering could be obtained by horizontally cutting the cluster diagram at the point requiring the minimum coding cost for the data objects, taking the information of the class labels into account. As a first step, a coding scheme for data sets containing labeled objects needs to be elaborated.
algorithm expansion(DS, P, MinPts, ρ, ξ)
    List seedList;
    forall P ∈ P do                            // start at all pre-labeled objects
        P.reachDist := UNDEFINED;
        neighbors := N_ε(P);
        updateSeedList(neighbors, P);
    end for
    while not seedList.empty() do
        currentObject := seedList.getMin();     // object with minimum reachability
        seedList.deleteMin();
        currentObject.setCoreDist(ε, MinPts);
        neighbors := N_ε(currentObject);
        seedList.update(neighbors, currentObject);
    end while
end expansion

procedure updateSeedList(objects neighbors, object P)
    core := P.CoreDist;
    forall N ∈ neighbors do
        rd := max(core, dist(P, N));            // reachability distance from P
        if not N ∈ seedList then
            N.reachDist := rd;
            N.label := P.label;                 // propagate the class label of P
            seedList.add(N);
        else if rd < N.reachDist then
            N.reachDist := rd;
            N.label := P.label;
        end if
    end for
end updateSeedList
Figure 5.14: Cluster Expansion Algorithm.
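For readers who prefer executable code, the following is a rough Python transcription of the expansion procedure in Figure 5.14. It is a sketch under simplifying assumptions: the seed list is a binary heap with lazy deletion, the distance is plain unweighted Euclidean rather than the locally weighted distance of Section 5.3.2, and the ε-neighborhood is computed by brute force.

import heapq
import numpy as np

def expansion(X, labels, prelabeled, min_pts, eps):
    # X: (n, d) data matrix; labels: list with a class label for the pre-labeled
    # objects and None otherwise; prelabeled: indices of the pre-labeled objects.
    n = X.shape[0]
    reach = np.full(n, np.inf)
    label = list(labels)
    processed = np.zeros(n, dtype=bool)
    order, heap = [], []                             # cluster order and seed list

    def update_seed_list(i):
        d = np.linalg.norm(X - X[i], axis=1)
        core = np.sort(d)[min_pts]                   # core distance of object i
        for j in np.where((d <= eps) & ~processed)[0]:
            rd = max(core, d[j])                     # reachability distance
            if rd < reach[j]:
                reach[j] = rd
                label[j] = label[i]                  # propagate the class label
                heapq.heappush(heap, (rd, j))

    for p in prelabeled:                             # expand from all pre-labeled
        processed[p] = True                          # objects simultaneously
        order.append(p)
        update_seed_list(p)

    while heap:
        rd, i = heapq.heappop(heap)
        if processed[i] or rd > reach[i]:            # skip stale heap entries
            continue
        processed[i] = True
        order.append(i)
        update_seed_list(i)
    return order, reach, label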
Chapter 6
Instance Based Classification with Local Density
The most frequent data mining task on biomedical data is classification. Classification is often used to support diagnosis decisions. A classifier is trained on a training set of instances labeled according to different classes, e.g. healthy and diseased in the most simple case. The classifier subsequently assigns class labels to novel, unclassified instances. Such instances can be, for example, feature vectors extracted from biosamples of patients. When used for diagnosis support, a high sensitivity in recognizing diseased instances is of major importance. It is even acceptable to achieve a very high sensitivity at the expense of a slightly lower specificity, because suspicious cases need to be inspected by experts anyway. In this chapter, we introduce a classification method which performs especially well on unbalanced biomedical data. The sensitivity can be tuned by parameter settings. The chapter is organized as follows: after an introduction in the next section, a brief discussion of related work is provided in Section 6.2. Section 6.3 describes the proposed method in detail, including a discussion on parameter settings and efficiency (cf. Section 6.3.2). In Section 6.4 an experimental evaluation on metabolic data, on data on protein localization sites and on many other benchmarking data sets from the UCI machine learning repository [77] is provided. The chapter ends with conclusions in Section 6.5.
6.1 Introduction
In biomedical applications, often the core problem is efficient and effective classification. Some of the existing classification methods produce explicit rules, e.g. decision trees, discriminant analysis, logistic regression analysis and support vector machines, etc. Other classification methods such as the k-nearest neighbor classifier are called instance based because no explicit model is produced.
Many biomedical data sets consist of high dimensional feature vectors with a complex cluster structure. Even class-pure subsets of the data objects may be composed of different clusters. In this case the classes are not easily separable by planes, polynomial functions or combinations thereof and rule-based classifiers tend to break down in terms of accuracy. Often, the simple instance based k-nearest neighbor classifier performs better, but only if the point density is relatively uniform in all classes.
Unbalanced data sets exhibiting a high variation in the number of data items per class tend to have regions of different object density. Data objects situated in boundary regions between high and low object density are always classified into the class of the region of higher density. This situation is very characteristic for biomedical data sets, e.g. metabolic data. Often there is only a very limited amount of training data available for the classes representing the metabolic disorders, in relation to plenty of training data for the healthy controls. In addition, the human metabolism consists of highly complex processes.
Environmental effects have widely unknown consequences on the normal
metabolism and also on metabolic disorders. Genetically caused metabolic disorders with identical genotype may be expressed by various phenotypes. Translated into the data, this means that the objects of one class can be scattered over different areas of the data space.
For unsupervised data mining tasks such as clustering density based methods have become very successful due to their robustness and efficiency, e.g. DBSCAN [8] and OPTICS [39], cf. Chapter 3. Recently, density based methods for outlier detection have appeared, such as LOF [82] or LOCI [9]. In contrast to distance based methods, local and global outliers can be discovered. In the density based notion outliers are determined by taking the object density of the surrounding region into account.
The general idea of our technique is to consider the cluster structure of the data set and to use the information of different densities for classification. Our algorithm determines the local point density in the neighborhood of an object to be classified and the local point density of all clusters in the corresponding region. The point is assigned to the class where it fits best into the local cluster structure. This idea can be formalized by defining a local classification factor (LCF) which is similar to the density based outlier factors but with an opposite intention. The local classification factor is determined separately for each of the possible classes and indicates how nicely a point fits into the local cluster structure of a given class. Analogously, a test object is always assigned to the class from which it is least considered a local outlier. By adapting the concepts of density based methods to classification, our technique performs well on data sets with arbitrarily shaped clusters. Classes can be distributed over more than one cluster. In addition, we are not limited to two-class problems. Even on very unbalanced data sets we obtain a high accuracy. Our method is also applicable to non-vector metric data as our similarity queries require only a metric distance function. The parameterization is robust and allows optimization for different purposes, e.g. optimizing the sensitivity or the classification accuracy.
Symbol: Definition
LCF: Local Classification Factor.
DD~x(Li): Direct Density of class Li w.r.t. object ~x.
ID~x(Li): Indirect Density of class Li w.r.t. object ~x.
CLOF_Li(~x): Class Local Outlier Factor of object ~x w.r.t. class Li.
NN_k^Li(~x): Set of the k nearest neighbors of class Li of an object ~x.
nn-dist_k^Li(~x): Distance to the k-th nearest neighbor of class Li of ~x.
LRA: Logistic Regression Analysis.
SVM: Support Vector Machine.
k-NN: The k Nearest Neighbor Classifier.
DT: Decision Tree.
NB: Naïve Bayes.
ANN: Artificial Neural Network.
Table 6.1: Symbols and Acronyms Used in Chapter 6.

Table 6.1 gives an overview of the symbols and acronyms frequently used in this chapter in addition to the general notation summarized in Table 1.1. Besides specific acronyms and acronyms for benchmarking classifiers, we particularly need the notion of class separated k nearest neighbors, denoted by $NN_k^{L_i}(\vec{x})$. We further denote by $\text{nn-dist}_k^{L_i}(\vec{x})$ the distance to the k-th nearest neighbor of an object $\vec{x}$ among the objects labeled according to class $L_i$.
6.2 Survey
In this section, we briefly discuss related work. A fundamental introduction to the problem of classification is provided in Chapter 3. See Section 3.5 for a brief description of some common classification paradigms to which we compare the performance of LCF in the experimental evaluation in Section 6.4. The basic idea of LCF relies on a density based clustering notion, which is introduced in Section 3.2.2 and in Section 3.3.2. This section surveys concepts of density based outlier detection and supervised clustering related to
our approach and summarizes our contributions.
6.2.1 Density Based Outlier Detection
In recent years, density based algorithms such as DBSCAN [8], DENCLUE [83] and OPTICS [39] have been successfully applied in the field of clustering. In the field of outlier detection, besides distance based methods ([84, 85]), density based methods such as LOF [82] and LOCI [9] have also been established. Especially if the data set has both sparse and dense regions, distance based methods can lead to problems. In this case, the property of being an outlier cannot be characterized using a global distance based threshold. To overcome this problem, the density based approaches take the local density in the region of the object into account. For the local outlier factor (LOF) the density of the region described by the MinPts nearest neighbors of an object is considered. For each object the local outlier factor is computed, determining to which extent the object is an outlier w.r.t. its neighborhood. Similar to the LOF, LOCI (Local Correlation Integral) is also a density based local outlier factor. LOCI provides additional information visualizing the local data distribution in the form of the so-called LOCI plot. Besides that, the authors propose an efficient algorithm to detect outliers using approximate computations.
6.2.2 Supervised Clustering
Recently, several approaches in the field of supervised clustering have appeared. In contrast to traditional clustering, supervised clustering is applied to the classified instances of the training data set with the objective to find class-pure clusters. This information is used for classification, assigning an object ~x to a class Li if it is situated inside a cluster of objects of this class. Supervised clustering is related to our approach, since it takes into account the multimodality of classes, rather than aiming at finding decision boundaries as
most established classification methods do. In [86] the authors propose an evolutionary clustering algorithm which is initialized with K arbitrary centroids and whose objective is to minimize cluster dispersion and impurity. Other approaches present modified versions of the K-means [15] and the PAM algorithm [17]. It is difficult to determine a purity threshold for the clusters. In addition, these methods suffer from the drawbacks of the underlying clustering algorithms; for example, the number of clusters has to be specified for all the mentioned algorithms. For multimodal classes, this is not a trivial task.
6.2.3 Contributions
To the best of our knowledge, the classification problem has not been addressed before from the viewpoint of density based clustering and outlier detection. In [87] the authors present an approach partitioning the data space into hyperrectangles for use with their classification method called lattice machine. However, this approach is grid based rather than density based.
In contrast, we base our approach on the density based clustering notion by defining a local classification factor consisting of two kinds of information: the first criterion is the average distance to the nearest objects, taken for each class separately. The second criterion determines the average density of the nearby clusters. For classification we minimize the difference between the distance of the point to each cluster and the distances among the points inside each cluster. By doing so, we try to assign a point to the class of the cluster into which the point fits best according to the data density. The extensive experimental evaluation shows that this aspect of local density can be
successfully used for classification.
6.3 Proposed Method
On complex, high dimensional and unbalanced data sets, the simple instance based k-nearest neighbors classifier often outperforms other, more sophisticated methods, as shown in [30] for predicting protein cellular localization sites. The k-nearest neighbors classifier can be characterized as a distance based method, because it assigns the query object to the class of the majority of its neighbors. Like the k-nearest neighbors classifier, our method is instance based, offering the same advantages on such data sets. In addition, we take the information of different densities into account. In the following we formalize these ideas and present the local classification factor LCF. In Section 6.3.2 we discuss parameter settings and efficiency issues.
6.3.1 Classification Method
For a query object ~x we compute a local classification factor LCF w.r.t. each class Li ∈ L separately. We assign the object ~x to the class w.r.t. which it has the lowest LCF. In particular, the LCF consists of two parts:

• Direct Density (DD)

• Class Local Outlier Factor (CLOF).

The LCF is a weighted sum of these two aspects. Roughly speaking, we assign an object ~x to class Li if there is a high density of objects of class Li in the region surrounding ~x. In addition, we require that ~x is not an outlier w.r.t. the objects of class Li in this region. In the following sections we explain these two parts in more detail. We introduce the concept of direct density and define a simple and accurate outlier factor which is especially useful
for classification. For illustration we use a two dimensional synthetic data set visualized in Figure 6.4. This data set shows a complex cluster structure. In particular, class 1 is distributed over three clusters of different size, shape and density. Our goal is not only to achieve a high overall classification accuracy but also to develop a balanced classifier that performs well on all classes in every region of the data space.

Direct Density

Taking a global look at our synthetic example data set, probably the first impression is that class 2 is of much higher density than class 1. But since there may be regions of extremely different density within one class, we cannot globally specify the density of a class. We can, however, locally examine the density of each class in the region of the object to be classified. For each class Li the region surrounding the query object ~x can be described by the set of the k-nearest neighbors of ~x of class Li.

Definition 14 (class k-nearest neighbors of an object ~x) For any positive integer k, the set of class k-nearest neighbors of an object $\vec{x}$ w.r.t. class $L_i \in L$, denoted as $NN_k^{L_i}(\vec{x})$, contains the objects of class $L_i$ for which the following condition holds: if $|L_i| < k$, then $NN_k^{L_i}(\vec{x}) = \{\vec{y} \in DS \mid L(\vec{y}) = L_i\}$. Otherwise, $NN_k^{L_i}(\vec{x})$ is a subset of k objects of class $L_i$ for which

$$\forall \vec{p} \in NN_k^{L_i}(\vec{x}), \; \forall \vec{q} \text{ with } L(\vec{q}) = L_i \text{ and } \vec{q} \notin NN_k^{L_i}(\vec{x}): \; dist(\vec{x}, \vec{p}) < dist(\vec{x}, \vec{q}).$$
If a class contains less than k elements, the set $NN_k^{L_i}(\vec{x})$ contains all objects of this class. If there are more objects, $NN_k^{L_i}(\vec{x})$ contains the class internal nearest neighbors.
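A minimal sketch of Definition 14, with brute-force search and Euclidean distance as assumptions of the example, could look as follows.

import numpy as np

def class_knn(X, y, x, c, k):
    # NN_k^{Li}(x): the k nearest neighbors of the query object x within class c;
    # if the class has fewer than k objects, all of them are returned.
    Xc = X[y == c]
    if len(Xc) < k:
        return Xc
    d = np.linalg.norm(Xc - x, axis=1)
    return Xc[np.argsort(d)[:k]]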
Let us note that the size of the region surrounding the object is not equal for each class but is determined by the density of k objects situated directly next to the query object, i.e. the direct density of each class. To capture the density of class Li ∈ L in the region surrounding the query object ~x we define a simple density measure. We compute the mean value of the distances to the k nearest neighbors of ~x belonging to class Li.

Definition 15 (Direct density of class Li w.r.t. ~x)

$$DD_{\vec{x}}(L_i) := \frac{\sum_{\vec{p} \in NN_k^{L_i}(\vec{x})} dist(\vec{p}, \vec{x})}{|NN_k^{L_i}(\vec{x})|}.$$
This density measure has a small value if the objects of class Li surrounding ~x are densely populated, and larger values for sparser classes in the region of the query object. To compute DD~x(Li), a metric distance function for vector data is required, or a corresponding similarity measure for non-vector data. We can use the local density alone for classification by computing the density measure for each class Li and assigning a query object ~x to the class Li for which DD~x(Li) is minimal. Due to the variable size of the regions and the class separated consideration, the concept of direct density outperforms the ordinary k-nearest neighbors classifier in some situations. In addition, we have no majority voting. Moreover, for the decision to which class a query object should be assigned, we get a continuous value by computing the direct density measure. So it is very unlikely to have a tie situation.

The result on our synthetic data set using direct density only is depicted in Figure 6.4(b). (As described in Section 6.4 in more detail, we used k = 5 and 10-fold cross validation.) Obviously, many objects of the sparser class 1 are still wrongly classified. In the region of these objects the density of the denser class 2 is higher than the density of their own class. Therefore the direct density measure indicates to assign them to class 2. Intuitively the wrongly classified objects fit better into the cluster structure of their own
class, so it should be possible to classify them correctly.

Class Local Outlier Factor

In addition to the direct density we now examine to which degree a query object ~x is an outlier considering the local cluster structure of each class Li separately. Since classes often form clusters, we want to assign the query object ~x to the cluster of the class where it fits in best. For this purpose, we define a density based class local outlier factor (CLOF), similar to LOF [82], but more suitable for classification. The concept of density based local outliers is interesting for classification, particularly the idea that being an outlier is not a binary property. In contrast, the density based outlier factors assign to each object a degree of being an outlier. Nevertheless, we cannot directly apply the LOF for classification, since it is based on the reachability distances of the data objects. The authors decided to use the reachability distances to significantly reduce statistical fluctuations of the distances among objects that are close to each other (for details see [82]). This may be useful to discover meaningful outliers. However, for the classification of a query object ~x placed at the border between two or more classes we want to see even minor differences in the degree to which ~x is an outlier w.r.t. these classes.
Instead of the reachability distance we use the distances to the k-nearest neighbors, again computed separately for each class. In addition to the direct density as defined in Definition 15, for the class local outlier factor we need a measure for the indirect density of the class Li, i.e. for the density of the region surrounding the query object ~x excluding ~x itself.

Definition 16 (Indirect density of class Li w.r.t. ~x)

$$ID_{\vec{x}}(L_i) := \sum_{\vec{p} \in NN_k^{L_i}(\vec{x})} \frac{DD_{\vec{p}}(L_i)}{|NN_k^{L_i}(\vec{p})|}.$$
Like the direct density, the indirect density measure can be 0 if there are at least k duplicates of class Li in DS. For simplicity, we assume that there are no duplicates. To deal with duplicates, we can base Definition 14 on the k distinct class nearest neighbors of the query object in class Li. For the class local outlier factor of a query object ~x w.r.t. class Li we consider the ratio of the class restricted k-nearest neighbor distance $\text{nn-dist}_k^{L_i}(\vec{x})$ and the indirect density of class Li w.r.t. ~x.

Definition 17 (Class local outlier factor of an object ~x)

$$CLOF_{L_i}(\vec{x}) := \frac{\text{nn-dist}_k^{L_i}(\vec{x})}{ID_{\vec{x}}(L_i)}.$$
The class local outlier factor describes the degree to which a query object ~x is an outlier w.r.t. the local cluster structure of class Li. It is easy to see that for a query object ~x located inside a cluster of objects of class Li the CLOF is approximately 1. If ~x is an outlier w.r.t. class Li, it gets a significantly higher CLOF w.r.t. that class. To be more precise, we can formally specify an upper and a lower bound for CLOF. These bounds are determined by the distances from the query object ~x to its direct and indirect class nearest neighbors. The set of the indirect class nearest neighbors of ~x, denoted by $IndNN_k^{L_i}(\vec{x})$, contains all objects used to compute $ID_{\vec{x}}(L_i)$.

Definition 18 (Indirect class nearest neighbors of an object ~x)

$$IndNN_k^{L_i}(\vec{x}) := \{\vec{o} \in DS \mid \exists \vec{p} \in NN_k^{L_i}(\vec{x}): \vec{o} \in NN_k^{L_i}(\vec{p})\}.$$
The indirect class nearest neighbors of an object ~x are all objects that are nearest neighbors of the nearest neighbors of ~x. We call the union of the sets of the direct and the indirect neighbors of a query object ~x the class neighborhood of ~x, denoted by N HLi (~x).
Definition 19 (Class neighborhood of an object ~x) For a query object ~x and each class Li ∈ L the class neighborhood $NH_{L_i}(\vec{x})$ is defined as follows: $NH_{L_i}(\vec{x}) := NN_k^{L_i}(\vec{x}) \cup IndNN_k^{L_i}(\vec{x})$.

With these notions, we can describe the bounds of $CLOF_{L_i}(\vec{x})$ depending on the class neighborhood of ~x.

Lemma 3 Let minDist be the minimum distance from the query object ~x to the objects in its class neighborhood and maxDist the maximum distance analogously, i.e. $minDist = \min\{dist(\vec{x}, \vec{o}) \mid \vec{o} \in NH_{L_i}(\vec{x})\}$ and $maxDist = \max\{dist(\vec{x}, \vec{o}) \mid \vec{o} \in NH_{L_i}(\vec{x})\}$. It holds that

$$\frac{1}{\delta} \leq CLOF_{L_i}(\vec{x}) \leq \delta \quad \text{for} \quad \delta := \frac{maxDist}{minDist}.$$

Proof 3 By the definition of minDist and maxDist it directly follows that both the class restricted k-nearest neighbor distance and the indirect density lie between minDist and maxDist, in particular $\text{nn-dist}_k^{L_i}(\vec{x}) \geq minDist$ and $ID_{\vec{x}}(L_i) \leq maxDist$, and analogously $\text{nn-dist}_k^{L_i}(\vec{x}) \leq maxDist$ and $ID_{\vec{x}}(L_i) \geq minDist$. Thus the upper and lower bounds for CLOF can easily be specified:

$$\frac{minDist}{maxDist} \leq CLOF_{L_i}(\vec{x}) \leq \frac{maxDist}{minDist}.$$
For a query object ~x placed deep inside a dense cluster of objects of class Li, $CLOF_{L_i}(\vec{x})$ is approximately 1. This is due to the fact that the class neighborhood of ~x consists of objects that are members of this cluster. The more homogeneous the data distribution in the class neighborhood is, the closer δ gets to 1, i.e. the tighter are the
bounds for CLOF. Especially in the center of a cluster we often find such a homogeneous data distribution.
To classify a query object ~x using the class local outlier factor, we compute $CLOF_{L_i}(\vec{x})$ for each class Li ∈ L and assign ~x to the class Li w.r.t. which its CLOF is minimal. Using solely the class local outlier factor, the result on our synthetic data set is depicted in Figure 6.4(c). Nearly all objects of the sparser class 1 are classified correctly. Even objects of class 1 situated inside the big dense cluster of class 2 are correctly classified, as they fit better into the density of the objects of their own class than into that of the objects of class 2. But note that especially at the margins of the clusters of the denser class 2 there are many wrongly classified objects. This is due to the fact that the class local outlier factor of the objects in these regions is similar w.r.t. both classes. Using the direct density, these objects can definitely be classified correctly.

Local Classification Factor

The main idea of the local classification factor is to combine the information of the direct density with the class local outlier factor to overcome the drawbacks of both methods when used alone for classification. As described in the previous paragraphs, it is not sufficient to require a high density of objects of class Li in the region of the query point ~x to assign ~x to class Li. The rule assigning ~x to the class w.r.t. which it has a smaller outlier factor leads to different mistakes. Let us note that the sets of wrongly classified objects in Figure 6.4(b) and Figure 6.4(c) are almost disjoint, with some exceptions that can be considered as incorrectly labeled training instances. The local classification factor of an object ~x combines both aspects.
Definition 20 (Local classification factor of an object ~x)
$$LCF_{L_i}(\vec{x}) := DD_{\vec{x}}(L_i) + l \cdot CLOF_{L_i}(\vec{x})$$

The local classification factor of a query object ~x w.r.t. class Li is the sum of its direct density and its l-times weighted class local outlier factor w.r.t. this class. We use a weighting factor l to determine to which degree the class local outlier factor and the direct density are relevant for classification. To classify an object ~x, we compute the LCF w.r.t. each class Li for ~x and assign ~x to the class w.r.t. which its LCF is minimal. In Figure 6.4(d) the final result on our synthetic data set is depicted. Due to the combination of both aspects most classification errors disappear. Let us note that there are still some incorrectly classified objects in the smaller clusters of class 2. We will discuss this aspect in detail in the following.
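Putting Definitions 14 to 17 and 20 together, a brute-force sketch of the complete classification rule could look as follows. Euclidean distance, the absence of duplicates and classes with more than one training object are assumptions of this example; the parameters k and l correspond to those discussed in Section 6.3.2.

import numpy as np

def lcf_classify(X_train, y_train, x, k=5, l=2.0):
    # Assign the query object x to the class with the smallest
    # LCF_Li(x) = DD_x(Li) + l * CLOF_Li(x) (Definition 20).
    best_class, best_lcf = None, np.inf
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]                         # objects of class L_i
        kk = min(k, len(Xc))                               # class k-nearest neighbors
        d = np.linalg.norm(Xc - x, axis=1)
        nn = np.argsort(d)[:kk]
        dd = d[nn].mean()                                  # direct density DD_x(L_i)
        # DD_p(L_i) for every neighbor p of x, excluding p itself (no duplicates)
        dd_p = [np.sort(np.linalg.norm(Xc - Xc[p], axis=1))[1:kk + 1].mean()
                for p in nn]
        id_x = np.mean(dd_p)                               # indirect density ID_x(L_i)
        clof = d[nn][-1] / id_x                            # CLOF_Li(x) = nn-dist_k / ID
        lcf = dd + l * clof
        if lcf < best_lcf:
            best_lcf, best_class = lcf, c
    return best_class

For the synthetic example discussed above this corresponds to the settings k = 5 with l = 2 or l = 6 used in Section 6.4.3.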
6.3.2 Parameters and Efficiency
We now explain the exact meaning of the parameters k and l needed to compute the LCF. We give hints for an appropriate choice of the parameter k and show how the parameter l can be used to optimize either the overall accuracy or the balance of correctly classified instances within each class. A short discussion of the issue of efficiency follows.

Parameter k

The parameter k is used to perform the class separated k-nearest neighbor queries which are the basis of both the direct density and the class local outlier factor. Due to our class separated consideration and the definition of the direct density, our concept works well even on unbalanced data sets for different values of k, i.e. data sets with a high variation in the number of objects per class. Using the k-nearest neighbors classifier, objects of the small classes are often not classified correctly. There are always too few training instances of these classes among the k nearest neighbors of the query object ~x.
Figure 6.1: Influence of Parameter k: (a) example; (b) accuracy for different k.
Due to majority voting, the object tends to be assigned to the bigger classes. Using weighted k-NN can weaken this effect but does not solve the general problem. In general, the value of k influences the size of the region considered for computing the LCF. We use the direct and the indirect density measure to describe the density of the region surrounding the query object ~x. If k is chosen too small, the local density cannot be appropriately characterized. The example depicted in Figure 6.1(a) shows a query object ~x placed inside a cluster. Intuitively the query object should be classified into class 1, since it is located inside a cluster of objects of this class. For the extreme case of k = 1 the direct and the indirect density measure would be equal for both classes since there is an outlier of class 2 close to ~x. The same holds for a k that exceeds the number of objects belonging to the cluster into which the query object fits. Direct and indirect density are in this case also not specific to the region around ~x. Due to this, we get similar values for the LCF as computed w.r.t. both classes.
Thus the value for k corresponds to the minimum cluster size, i.e. to the minimum number of objects of a class which should be regarded as a cluster. Of course, this is application specific. In a more general sense, k depends on the size of the training data set. For our experiments, we used the training data sets to determine an appropriate value for k. In general, we used a k in the range of 5 to 15. For our synthetic data set we get the best accuracy for k = 5. In detail, accuracy values for different k and fixed l = 6 are shown in Figure 6.1(b). For a wide range of k the accuracy is quite stable. For k = 5 there are some incorrectly classified objects of class 2 in our synthetic data set, as depicted in Figure 6.4(d), especially members of the small clusters. Using k = 3 some of these objects are correctly classified. However, the overall accuracy decreases slightly to 79%.
Parameter l

The parameter l determines to which degree the outlier factor of an object ~x w.r.t. the classes Li ∈ L is relevant for its classification. A higher value of l leads to more correctly classified objects in the sparser classes, at the expense of incorrectly classified objects in the denser classes. Margin objects of the denser class often have a higher class local outlier factor w.r.t. their own class than w.r.t. the sparser class. These objects are typically misclassified if the CLOF gets too much weight. Depending on the concrete application domain, l can be determined either to maximize the overall accuracy or to optimize recall and precision of a certain class. Particularly on biomedical data, high precision and recall on sparse classes are essential, since they often represent abnormal observations.
Figure 6.2 displays the results on our synthetic data set for k = 5 and different values of l. The overall accuracy as well as recall and precision for class 1 and class 2 are depicted w.r.t. l in the range 0..25.
Figure 6.2: Influence of Parameter l on Accuracy, Precision and Recall: (a) accuracy and recall for different l; (b) accuracy and precision for different l.
In Figure 6.2(a) the accuracy shows its first local maximum for l = 2, where the line of accuracy crosses the lines of recall of both classes. This means that at this point the classifier performs most balanced w.r.t. recall. In Figure 6.2(b) similar characteristics can be observed w.r.t. precision. Here the point indicating the optimal balance between both classes is located at l = 1.7. However, the accuracy reaches its maximum of 83.56% for both l = 6 and l = 7. In our experiments in Section 6.4 we determine l on the training data sets to optimize the overall accuracy wherever not otherwise specified.
Efficiency

Due to the growing amount of data produced in many application domains, efficiency is an important issue, although it is not the main focus in the area of classification. Since our method is instance based, a training phase is not required. Other classification methods, e.g. ANN, require a time consuming training phase. The classification of an object requires
$|L| \cdot k^2$ k-nearest neighbor queries without any preprocessing. If the test data set is sufficiently large, the value of $ID_{\vec{x}}(L_i)$ can be materialized for each object ~x of the training data set w.r.t. each class Li ∈ L. This preprocessing step has a runtime complexity of O(n²) without index support, where n denotes the number of data objects in the training data set. To be more precise, this preprocessing step should be performed if

$$testSize > \frac{trainSize}{k \cdot |L_i|}.$$
The classification of an object ~x then takes linear time, i.e. O(n), requiring $k \cdot |L_i|$ k-nearest neighbor queries to determine $DD_{\vec{x}}(L_i)$. For large data sets, approximation and compression techniques have been developed for the k-nearest neighbors classifier ([88], [89]), which in principle can be used to additionally speed up our method.
6.4 Experiments
This section gives an extensive experimental evaluation. After some general information on the data sets and on the applied validation, the results on synthetic, metabolic and protein localization data are reported. Furthermore, the results of LCF on other biomedical data sets from the UCI Machine Learning Repository [77] are discussed.
6.4.1 Data Sets
LCF was tested and evaluated on one synthetic (cf. Figure 6.4) and seven real biomedical data sets, as summarized in Table 6.2. Five data sets (Yeast, E. coli, Liver, Iris and Diabetes) come from the UCI Machine Learning Repository. The table shows the dimensionality of the data, the number of classes and objects and the number of objects per class. Detailed biological information and experimental results are described and discussed for each data set separately throughout this section.
Table 6.2: Data Sets - Summary.
name         classes  dim.  objects  objects/class
Synthetic    2        2     152      71:81
Metabolic 1  2        45    57       38:19
Metabolic 2  2        117   83       54:29
Yeast        10       9     1484     463:429:244:163:51:44:35:30:20:5
E.coli       8        7     336      143:77:52:35:20:5:2:2
Liver        2        6     345      145:200
Iris plant   3        4     150      50:50:50
Diabetes     2        8     768      500:268
set separately throughout this section. See also Chapter 2 for biomedical background information on the data sets.
6.4.2 Benchmark Classifiers, Validation and Parameter Settings
We compared LCF with six popular classification methods obtained from the publicly available WEKA data mining software [90]; see Chapter 3 for a description of these classification methods. For validation we used 10-fold cross validation. All classifiers were parameterized to optimize accuracy. For SVM we used both polynomial (of degree 2) and radial kernels; the cost factor c was chosen appropriately using the training data set. We used the C4.5 decision tree algorithm with reduced error pruning. For ANN, we designed a single layer of hidden units with (number of attributes + number of classes)/2 hidden units, 500 training epochs and a learning rate of 0.3. For LRA and NB no advanced settings can be made. We applied both weighted (1/distance) and unweighted k-NN with an appropriate value for k determined from the training data sets. For LCF we also used the Euclidean distance and determined k and l from the training data sets.
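The benchmark classifiers themselves were taken from WEKA; the sketch below only illustrates the stratified 10-fold cross validation protocol in a self-contained way (the fold construction and the plain accuracy measure are assumptions of this illustration, not a description of WEKA internals).

```python
from collections import defaultdict

def stratified_folds(labels, n_folds=10):
    # assign sample indices to folds class by class so that every fold
    # roughly preserves the class proportions of the full data set
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(n_folds)]
    for idxs in by_class.values():
        for i, idx in enumerate(idxs):
            folds[i % n_folds].append(idx)
    return folds

def cross_validate(X, y, train_and_predict, n_folds=10):
    # overall accuracy of train_and_predict(train_X, train_y, test_X) over the folds
    correct, total = 0, 0
    for fold in stratified_folds(y, n_folds):
        test = set(fold)
        train_idx = [i for i in range(len(y)) if i not in test]
        preds = train_and_predict([X[i] for i in train_idx],
                                  [y[i] for i in train_idx],
                                  [X[i] for i in fold])
        correct += sum(p == y[i] for p, i in zip(preds, fold))
        total += len(fold)
    return correct / total
```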
6.4.3 Synthetic Data
For our synthetic data set (visualized in Figure 6.4), Table 6.3 summarizes the classification results of all compared classifiers w.r.t. the number of correctly and incorrectly classified objects, and recall, precision and accuracy in %. For LCF the parameter k was set to 5; the parameter l was set to 2 and 6, respectively, while maintaining the same value of k. LCF outperforms all established classification methods in terms of accuracy (82.9%) and balance of correctly classified instances between both classes (83.1% for class 1 and 82.7% for class 2) using l = 2 (cf. Table 6.3). Moreover, the overall accuracy can be slightly increased for l = 6, but accompanied by a decrease of recall for class 2. However, LCF with l = 2 manages the complex cluster structure of class 2 and does not disadvantage class 1, which is smaller and shows a sparser data distribution compared to class 2, in terms of correctly classified instances. LRA and SVM, which construct a separating hyperplane between both classes, drop off in classification accuracy (59.2%-63.2%), not being able to handle such complex data structures. DT, NB and ANN yielded higher accuracy (65.1%-67.8%), but also lack balance of correctly assigned objects between the two classes. k-NN, however, was able to further increase accuracy, but predominantly assigns instances of the sparser class 1 to the denser class 2. Weighting only slightly attenuates this tendency.

6.4.4 Metabolic Data
Classification in metabolomics has great potential for the development of automated diagnostics. After reviewing a certain population of healthy and diseased patients, abnormal metabolic profiles that are significantly different from a normal profile can be identified from the data and thus can become diagnostic of a given disease [34]. Both metabolic data sets have been provided by our cooperation partners and have been generated by modern tandem mass spectrometry (MS/MS) technology. See also Chapter 2, Section 2.4 for more information on metabolomics.
Table 6.3: Results on Synthetic Data.
classifier  class  corr.  incorr.  recall  prec.  acc.
LRA         1      37     34       52.1    56.9   59.2
            2      53     28       65.4    60.9
SVM(lin.)   1      35     36       49.3    61.4   61.2
            2      59     22       72.8    62.2
SVM(pol.)   1      32     39       45.1    65.3   63.2
            2      64     17       79.0    62.1
1-NN        1      46     25       64.8    70.8   75.0
            2      68     13       84.0    78.2
5-NN        1      37     34       52.2    66.1   75.0
            2      77     4        95.1    80.2
10-NN       1      40     31       56.3    81.6   73.7
            2      72     9        88.9    69.9
DT          1      32     39       45.1    76.2   67.8
            2      71     10       87.7    64.5
NB          1      40     31       56.3    64.5   65.1
            2      59     22       72.8    65.6
ANN         1      37     34       52.1    71.2   67.8
            2      66     15       81.5    66.0
LCF (l=2)   1      59     12       83.1    80.8   82.9
            2      67     14       82.7    84.8
LCF (l=6)   1      66     5        93.0    77.6   83.6
            2      62     19       76.5    92.5
Metabolic Data Set 1
This data set contains concentration values of 45 metabolites (12 amino acids and 33 sugars (saccharides)). The data has been generated in the context of the newborn screening study [33] and is grouped into patients suffering from a multigenic metabolic disorder and healthy controls. Further information on this data is strictly confidential. Table 6.4 summarizes our experiments, setting parameter k again to 5 and parameter l
Table 6.4: Results on Metabolic Data Set 1.
classifier  class  corr.  incorr.  recall  prec.  acc.
LRA         1      22     16       57.9    71.0   56.1
            2      10     9        52.6    38.5
SVM(lin.)   1      28     10       73.7    62.2   52.6
            2      2      17       10.5    16.7
SVM(pol.)   1      28     10       73.7    62.2   57.9
            2      5      14       26.3    33.3
1-NN        1      33     5        86.8    68.8   64.9
            2      4      15       21.1    44.4
5-NN        1      37     1        97.4    68.5   68.4
            2      2      17       10.5    66.7
10-NN       1      38     0        100     67.9   68.4
            2      1      18       5.3     100
DT          1      35     3        92.1    67.3   64.9
            2      2      17       10.5    40
NB          1      31     7        81.6    77.5   68.4
            2      8      11       42.1    47.1
ANN         1      26     12       63.2    64.8   57.9
            2      7      12       31.6    36.8
LCF         1      29     9        76.3    82.9   73.7
            2      13     6        68.4    59.1
to 35 for LCF. Due to the small size of this data set (57 instances) it is favorable to use a small k. It can be expected that metabolic data exhibits regions of various densities, caused by a higher variation of metabolite concentration levels at the state of disease vs. normal [91]. The borders between healthy and pathological instances are blurred in this high dimensional data set containing overlapping clusters of both classes. Best accuracy was obtained for l = 35. Of all investigated classifiers, LCF showed the highest classification accuracy of 73.7% and a superior recall value of 68.4% for class 2, i.e. the abnormal metabolic profiles of diseased people. The LCF results are the most balanced in terms of recall and precision, and are comparable to LRA in yielding correctly classified cases above 50% in both classes. However, LRA reaches an accuracy of only 56.1%.
Figure 6.3: Parameterization of LCF on Metabolic Data Set 1. Classification Accuracy Depending on Different k- and l-Values is Displayed.
SVM and ANN show accuracy values similar to LRA, but assign up to 80% of pathological cases to healthy subjects (false negative cases). The k-NN classifier demonstrates the best accuracy values among all benchmark classifiers, but its recall breaks down dramatically. The use of weighted k-NN does not help here.
For diagnostic issues it is of highest importance to classify instances of smaller and sparser classes correctly, in particular if such a class represents pathological cases. Thus, a balance of correctly classified objects between classes together with high accuracy is essential for classifying diseased vs. normal metabolic profiles, so that LCF is an interesting tool for diagnostics. Figure 6.3 demonstrates the classification accuracy of LCF as a 3-d plot for parameter settings k = 1, 3, 5, 7, 10 and l = 5, 15, 25, 35, 45, 55. Best accuracy was achieved for k = 5.
Table 6.5: Amino Acids (18 Metabolites).
classifier  class  corr.  incorr.  sens.  spec.  acc.
SVM(lin.)   1      27     2        90.7   91.6   91.6
            2      49     5
3-NN        1      26     3        88.9   89.7   89.2
            2      48     6
LCF(5,70)   1      24     5        92.6   82.2   89.2
            2      50     4

Table 6.6: Acyl-Carnitines (51 Metabolites).
classifier  class  corr.  incorr.  sens.  spec.  acc.
SVM(quad.)  1      16     13       79.6   55.2   71.1
            2      43     11
5-NN        1      20     9        83.3   69.0   78.3
            2      45     9
LCF(6,35)   1      16     13       87.0   55.2   75.9
            2      47     7
Metabolic Data Set 2
This metabolic data set derives from a study on the Metabolic Syndrome. According to the definition of the International Diabetes Federation (IDF, http://www.idf.org/home/), the Metabolic Syndrome comprises adiposeness and at least two of the following factors:
• Diabetes,
• alterations of metabolism,
• hypertension.
Metabolite profiling provides superior potential in early diagnosis of the Metabolic Syndrome. An early diagnosis of this condition is of major importance to prevent severe complications, such as heart diseases and type II diabetes. The data consists of 29
instances of a healthy control group (class 1) and 54 instances from patients afflicted with severe Metabolic Syndrome including cardiovascular symptoms (class 2). The metabolite concentrations of 18 amino acids, 51 acyl-carnitines and 48 sugars are given in µmol/L. Tables 6.5, 6.6 and 6.7 summarize the classification accuracy on the three metabolite classes and Table 6.8 on the whole data set with all 117 metabolites. For evaluation, the performance of LCF was here compared to k-NN and an SVM with linear or quadratic kernel; the parameter c was set to 100 and the kernel was selected to optimize the accuracy. The numbers in brackets LCF(k, l) denote the parametrization used for LCF. On this data set, leave-one-out validation has been performed.

Table 6.7: Sugars (48 Metabolites).
classifier  class  corr.  incorr.  sens.  spec.  acc.
SVM(quad.)  1      18     11       83.3   62.1   75.9
            2      45     9
3-NN        1      17     12       68.5   58.8   65.2
            2      37     17
LCF(3,35)   1      25     4        96.3   86.2   92.7
            2      52     2

Table 6.8: Whole Data Set (117 Metabolites).
classifier  class  corr.  incorr.  sens.  spec.  acc.
SVM(quad.)  1      18     11       83.3   62.1   75.9
            2      45     9
5-NN        1      22     7        68.5   58.8   65.2
            2      48     6
LCF(3,35)   1      25     4        96.3   86.2   92.7
            2      52     2
Considering amino acids and acyl-carnitines, all examined classifiers show approximately equal performance in terms of accuracy (cf. Tables 6.5 and 6.6). As mentioned, for classification of biomedical data a high sensitivity is essential. On both data sets the
LCF yields the highest sensitivity on the group of patients suffering from the Metabolic Syndrome, however accompanied by a slight decrease in overall accuracy. Among the examined classifiers, LCF performs best on both the sugars and the whole data set (cf. Tables 6.7 and 6.8). Even without dimensionality reduction, the highest classification accuracy was observed on the full dimensional metabolite data set using LCF. This result also corresponds well to the fact that in the condition of severe Metabolic Syndrome the sugars are known to be the most striking biomarkers.
6.4.5 Yeast Data
The yeast data set contains 1484 protein sequences labeled according to ten classes [92, 30]. Table 6.12 depicts the classification results w.r.t. the three largest classes (1. cytoplasm, 2. nucleus and 3. mitochondria). The classes membrane protein (no N-terminal signal, uncleaved and cleaved signal, classes 4-6), extracellular, vacuole, peroxisome and endoplasmic reticulum (classes 7-10) consist of 5 to 163 instances and are not shown in detail. Parameter settings for LCF were k = 12 and l = 0.1.
Comparing all classifiers, most of the errors are due to confusing cytoplasmic proteins with nuclear proteins and vice versa. This reflects a fundamental difficulty in identifying nuclear proteins. One reason is the fact that, unlike other localization signals, the nuclear localization signal does not appear to be limited to one portion of a protein's primary sequence. In some cases a protein without a nuclear localization signal may be transported to the nucleus as part of a protein complex if another subunit of the complex contains a nuclear localization signal ([93, 78]). This aspect has already been mentioned in Chapter 5, Section 5.4; see also Figure 5.9 for the cluster diagram of this data set. In spite of this, LCF demonstrates the best balanced result for the first three classes w.r.t. recall (62.2%; 59.7%; 60%) and precision (56.4%; 57.4%; 63.8%), and an
Table 6.9: Confusion Matrix for Yeast With LCF.
class  1    2    3    4    5   6   7   8   9   10
1      288  132  33   6    1   0   2   0   1   0
2      131  256  27   11   3   0   1   0   0   0
3      57   24   139  10   6   2   3   0   3   0
4      13   17   7    125  1   0   0   0   0   0
5      5    6    4    3    19  8   6   0   0   0
6      0    0    1    0    3   34  6   0   0   0
7      5    0    3    0    2   5   20  0   0   0
8      10   7    2    6    2   0   3   0   0   0
9      2    4    2    0    0   0   2   0   10  0
10     0    0    0    0    1   0   0   0   0   4
overall accuracy of 60.3%. LCF seems to be the best choice to identify nuclear proteins, however accompanied by a slight decrease of recall in class 1. In Table 6.9 the confusion matrix of LCF is shown in more detail. For the other classes not considered in Table 6.12 the classification accuracy corresponds well to the results reported in [30]. With the exception of the ANN, DT and the weighted 21-NN classifier, all other paradigms yield a recall rate below 50% for the classification of nuclear proteins. For the k-NN classifier we used a k value optimized for this special data set, as suggested in [30]. Here, weighting leads to an increase of the overall accuracy (61.9%) and also of the recall of class 2 (54.8%). However, the recall value of LCF is not reached. With l optimized for correctly identifying nuclear proteins (l = 0.5) we even obtain 66.0% recall in class 2, but the overall accuracy decreases to 56.8%, mainly due to incorrectly classified instances of the biggest class 1.
6.4.6 E. coli Data
Similar to the yeast data set, the E. coli data describes protein localization sites (7 attributes) distributed over 8 classes, i.e. cytoplasm (143), inner membrane without signal sequence (77), periplasm (52), inner membrane, uncleavable signal sequence (35), outer membrane (20), outer
Table 6.10: Results on E.Coli Data.
classifier      class  corr.  incorr.  recall  prec.  acc.
LRA             2      65     12       84.4    83.3   C: 77.7  O: 87.2
                4      22     13       62.9    66.7
SVM(poly.)      2      64     13       83.1    84.2   C: 77.7  O: 87.8
                4      23     12       65.7    69.7
SVM(radial)     2      18     59       23.4    64.3   C: 16.7  O: 47.9
                4      0      35       0       0
7-NN            2      58     19       75.3    81.7   C: 72.3  O: 86.0
                4      23     12       65.7    69.5
7-NN(weighted)  2      63     14       81.8    84.0   C: 75.9  O: 87.2
                4      22     13       62.9    71.0
DT              2      60     17       77.9    75.0   C: 70.5  O: 82.1
                4      19     16       54.3    55.9
NB              2      56     21       72.7    87.5   C: 75.9  O: 85.4
                4      29     6        82.9    61.7
ANN             2      64     13       62.9    66.7   C: 86.1  O: 86.1
                4      22     13       62.9    66.7
LCF             2      63     14       81.8    82.9   C: 78.6  O: 88.1
                4      25     9        71.4    73.5
membrane lipoprotein (5), inner membrane lipoprotein (2), and inner membrane, cleavable signal sequence (2) ([92, 30]). Table 6.10 shows the results for the E. coli data set. Parameters for LCF were set to k = 10 and l = 0.1. The table depicts precision and recall for classes 2 and 4, the accuracy on these classes and the overall accuracy. All examined classifiers show most classification errors due to mixing up inner membrane proteins without a signal sequence (class 2) and inner membrane proteins with an uncleavable signal sequence (class 4). The accuracy on these classes (denoted by C) is approximately 10 percent lower than the overall accuracy (denoted by O). Classes 2 and 4, which are unbalanced (cf. 77 vs. 35 objects), are very similar, both representing inner membrane proteins. See also the cluster diagram of this data set depicted in Figure 5.8. Horton and Nakai explained in [30] the difficulty of separating both classes by the fact
that the labelling of some of the training examples includes some uncertainty; that means some training instances are probably wrongly labeled. However, LCF performs best w.r.t. the balance between these classes and is slightly better in terms of overall accuracy. The performance on the other classes corresponds well to the results described in [30]. This example shows that the local density of the data is useful for instance based classification, especially if there are wrongly labeled instances. Here, the CLOF is less sensitive than the ordinary or weighted k-NN classifier to wrongly labeled instances, since such instances are considered as outliers w.r.t. their own class. Test objects in their neighborhood also get a high CLOF, so that they are less likely to adopt the wrong class label.
6.4.7 Biomedical UCI Data Sets
Table 6.11 summarizes the results on biomedical UCI data sets [77], among them data sets which have not been discussed above. There are only minor differences between most of the compared classifiers. The liver data set (provided by BUPA Medical Research Ltd., UK) and the iris data set are rather balanced. The diabetes data set (provided by Washington University, St. Louis, MO for the AAAI Spring Symposium on Artificial Intelligence in Medicine, 1994) has categorical and discrete valued attributes. It is not likely to contain a complex data structure with areas of various densities. Nevertheless, the performance of LCF is among the best methods on these three data sets, although model-based paradigms perform slightly better. As an efficient instance based method, LCF performs better than k-NN on 5 of 6 data sets.
6.5 Conclusion
In this chapter we focused on the problem of classification of objects using the density based notion of clustering and outlier detection. We showed that these concepts can be
Table 6.11: Results on UCI data sets.
data set   LRA   SVM   knn   DT    NB    ANN   LCF
Yeast      58.6  59.3  61.9  57.8  57.6  59.4  60.3
E.Coli     86.6  86.0  86.9  82.1  85.4  86.0  88.1
Liver      68.1  58.8  59.1  68.1  55.4  71.8  70.4
Iris       94.0  97.3  96.0  94.0  96.0  97.3  97.3
Diabetes   77.5  73.3  73.2  73.8  76.3  75.3  75.1
successfully applied for classification. In particular, we proposed a local density based classification factor combining the aspects of direct density and a class local outlier factor. A broad experimental evaluation demonstrates that our method is applicable to very different biomedical data sets. Our main focus here was on multimodal, unbalanced data sets. We demonstrated that our density based classification method outperforms traditional classifiers especially on data sets containing regions with different object densities, which is of high practical relevance in various biomedical applications.
There are several possible directions for future work. It would be interesting to investigate whether a local adaptation of the parameters k and l would yield further improvements. Since many biological data sets are very high dimensional, a dimensionality reduction before classification is required. In Chapter 9, a method for feature selection adapted to the special characteristics of proteomic data is presented. Usually, these data sets are of much higher dimensionality than metabolic data, so that only very efficient feature selection strategies can be applied. For metabolic data, however, techniques derived from density based clustering and subspace clustering could also be useful for selecting relevant attributes, and especially combinations of attributes, for classification.
Figure 6.4: Local Classification Factor - Illustration. (a) Two-dimensional demonstration data set (attributes A1 and A2, Class 1 and Class 2). (b) Result with direct density only. (c) Result with class local outlier factor only. (d) Result with LCF for l = 6 and k = 5. The figure also reproduces the definitions of the indirect density ID_q(c_i) of a class c_i w.r.t. an object q and of the class local outlier factor CLOF_{c_i}(q), which describes the degree to which q is an outlier w.r.t. the local cluster structure of class c_i: for an object located inside a cluster of objects of class c_i the CLOF is approximately 1, while an outlier w.r.t. class c_i gets a significantly higher CLOF.
Table 6.12: Results on Yeast Data.
classifier  class  corr.  incorr.  recall  prec.  acc.
LRA         1      324    139      70.0    51.3   58.6
            2      198    231      46.2    61.7
            3      139    105      57.0    62.1
SVM(lin.)   1      367    96       79.3    47.8   57.1
            2      147    282      34.3    62.2
            3      137    107      56.1    61.4
SVM(pol.)   1      363    100      78.4    49.5   58.9
            2      162    267      37.8    64.8
            3      139    105      57.0    67.1
1-NN        1      261    202      56.4    51.5   52.3
            2      207    222      48.3    51.8
            3      117    127      48.0    49.4
5-NN        1      296    167      63.9    49.7   56.3
            2      205    224      47.8    53.9
            3      131    113      53.7    59.5
10-NN       1      317    146      68.5    51.9   58.7
            2      208    221      48.5    57.3
            3      140    104      57.4    63.3
21-NN       1      327    136      70.6    52.7   59.2
            2      210    219      49.0    59.0
            3      139    105      57.0    65.6
DT          1      294    169      63.5    52.1   57.8
            2      223    206      52.0    57.8
            3      116    128      47.5    64.1
NB          1      324    139      70.0    51.5   57.6
            2      171    258      39.9    63.3
            3      148    96       60.7    62.2
ANN         1      301    162      65.0    54.1   59.4
            2      230    199      53.6    58.4
            3      135    109      55.3    65.2
LCF         1      288    175      62.2    56.4   60.3
            2      256    173      59.7    57.4
            3      139    105      60.0    63.8
Chapter 7
Discovering Genotype-Phenotype Correlations in Marfan Syndrome
In the previous chapters, novel algorithms for clustering, semi-supervised clustering and classification have been proposed which are building blocks towards improving the knowledge discovery process on biomedical data sets. The proposed algorithms have been predominantly evaluated using metabolic data, but are of course also applicable in a general setting. This chapter deals with a special application, the Marfan Syndrome. Using this example, it illustrates how the integration of genetic and phenotypic data, together with the application of suitable data mining methods, can lead to a knowledge gain with the potential to improve patient management. Discovering novel genotype-phenotype correlations allows identifying patients at high risk of developing severe symptoms, such as ectopia lentis or aortic dissection, at an early stage. The chapter is organized as follows: After a general introduction to the Marfan Syndrome in the next section, Section 7.2 describes the data in detail, presents the applied data mining methods and introduces a novel score for genotype-phenotype similarity. Section 7.3 summarizes the results and Section 7.4 concludes this chapter.
7.1 Introduction
Mutations in the gene encoding fibrillin-1 (FBN1, OMIM #134797) are clinically associated with the Marfan Syndrome (MFS, OMIM #154700). MFS is an autosomal dominant inherited connective tissue disorder with variable clinical manifestations in the cardiovascular, musculoskeletal and ocular system, showing a prevalence of approximately 1/5000. At present more than 600 mutations in the gene encoding fibrillin-1 are known and have been observed in at least 80% of cases [94]. The diagnosis of MFS is based on a catalogue of clinical diagnostic criteria described in the so-called "Gent nosology" [95]. Milder and more severe clinical symptoms are organized as minor and major criteria; for classic MFS, major criteria in at least two organ systems and the involvement of a third organ system (minor criteria) are required.
Weakness of the aortic wall accounts for 80% of known causes of death of patients with MFS [96]. Before life threatening complications like dissection or rupture occur, alterations of aortic elastic properties due to defective FBN1 can be detected by the examination and monitoring of aortic elasticity [97]. The standard treatment for high-risk patients to prevent aortic rupture is surgery for aortic bracing. Close monitoring is essential to identify the best date for surgery in this group of patients.
Molecular genetic analysis may be helpful to identify at-risk individuals at an early stage of disease. Most of the reported mutations are so-called missense mutations, mainly affecting the epidermal growth factor (EGF)-like protein domain structure and the calcium-binding (cb) site [98].
The aim of our study is to investigate the correlation between FBN1 missense mutations and the clinical phenotype of classic or suspected MFS patients by using
hierarchical cluster analysis and logistic regression analysis. It is important to include suspected MFS patients, because there is a large number of unreported cases of MFS. A score value is introduced to quantify the phenotypic similarity or dissimilarity of a patient's clinical symptoms and characteristic phenotypes. Table 7.1 summarizes the symbols and acronyms used in this chapter.

Table 7.1: Symbols and Acronyms Used in Chapter 7.
Symbol    Definition
MFS       Marfan Syndrome.
µ_system  Mean value of diagnosed symptoms in an organ system.
σ_system  Standard deviation of diagnosed symptoms.
k_system  Phenotype purity of an organ system within a cluster.
s_i       Phenotype score w.r.t. phenotype class i.
7.2 Methods

This section provides a survey on molecular genetic analysis and on the Marfan phenotype. Subsequently, the data used for this study is described in detail.

7.2.1 Molecular Genetic Analysis
In Marfan Syndrome the affected FBN1 gene, which is translated into fibrillin-1, lies on the long arm of chromosome 15 (15q15-q21.1). This very large gene (> 230 kb) is highly fragmented into 65 exons, transcribed into a 10 kb mRNA that encodes a 2,871 amino acid protein. Fibrillin-1 is a large glycoprotein (320 kDa) ubiquitously distributed in connective tissues [99]. In molecular genetic analysis, genomic DNA samples were amplified exon by exon by means of a polymerase chain reaction (PCR) using intron-specific primers. Amplicons were analyzed by denaturing high-performance liquid chromatography (DHPLC), followed by direct sequencing of amplicons with abnormal elution profiles. The
mutations found were verified by repeated sequencing on newly amplified PCR products. In the case of splice site mutations and when no mutation was detected by DHPLC, FBN1 transcripts were analyzed by reverse transcription (RT)-PCR of RNA templates isolated from fibroblasts. RT-PCR amplifications and sequencing of putatively mutated transcripts were performed by standard procedures [100].
7.2.2 Marfan Phenotype According to the Gent Criteria
The diagnosis of the MFS phenotype is dependent on a catalogue of international diagnostic criteria as introduced in [95]. Clinical symptoms are organized in major and/or minor criteria of the following organ systems:
• Skeletal System. Major criteria: pectus carinatum or pectus excavatum requiring surgery; reduced upper to lower or increased arm-span to height ratio; positive wrist and thumb signs; scoliosis (> 20°); reduced elbow extension (< 170°); pes planus; protrusio acetabuli. Minor criteria: pectus excavatum of moderate severity; joint hypermobility; highly arched palate with dental crowding; characteristic facial appearance.
• Ocular System. Major criterion: ectopia lentis.
• Cardiovascular System (CVS). Major criteria: dilatation of the ascending aorta with or without aortic regurgitation and involving at least the sinuses of Valsalva; dissection of the ascending aorta. Minor criteria: mitral valve prolapse with or without mitral valve regurgitation; dilatation or dissection of the descending thoracic or abdominal aorta below the age of 50 years.
• Skin and Integument. Minor criteria: striae atrophicae (stretch marks) not associated with marked weight changes, pregnancy or repetitive stress; recurrent
inguinal or incisional herniae.
Data on the pulmonary system (minor criteria), the ocular system (minor criteria), the dura (major criterion) and two minor criteria of the CVS system was partly not available, so this additional clinical information was not considered for data analysis.
7.2.3 Patients' Data
Currently our database consists of 100 patient entries (age 18.7 ± 11.9 years) with 88 different mutations, including anonymous data from three MFS clinical centers worldwide [97, 98, 101]. More specifically, mutation data is represented by the nucleotide change (e.g. 3973G>C for substitution, missense mutation), the position of the affected exon/intron on the gene (e.g. exon no. 32) and the type and consequence of mutation. Our investigated genetic data contains the following mutations:
• Sub/Mis = substitution/missense mutation (n=55),
• Sub/Stop = substitution/nonsense mutation (n=9),
• Sub/Splice = substitution/splice site mutation (n=13),
• Del/Fs = deletion/frameshift (n=18),
• Dup/Fs = duplication/frameshift (n=1),
• Ins/Fs = insertion/frameshift (n=1),
• Del/inF = deletion/in frame (n=2),
• Del/Splice = deletion/splice site mutation (n=1).
Phenotype data is available as the accumulated number of symptoms of each organ system separated into major and/or minor criteria. The following example demonstrates one tuple of our anonymous data set, integrating genetic and phenotypic data: Nucleotide change:= 6794G>A, type of mutation:= Sub/Mis, number of exon:= 55; skeletal (major):= 4, skeletal (minor):= 3, ocular (major):= 1; CVS (major):=1, CVS (minor):= 1, skin (minor):= 1.
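For illustration only, such a tuple could be represented as a small record type; the field names mirror the example above and are otherwise assumptions of this sketch.

```python
from dataclasses import dataclass

@dataclass
class MarfanRecord:
    nucleotide_change: str   # e.g. "6794G>A"
    mutation_type: str       # e.g. "Sub/Mis"
    exon: int                # affected exon number
    skeletal_major: int      # accumulated major skeletal criteria
    skeletal_minor: int
    ocular_major: int
    cvs_major: int
    cvs_minor: int
    skin_minor: int

# the example tuple quoted above
example = MarfanRecord("6794G>A", "Sub/Mis", 55, 4, 3, 1, 1, 1, 1)
```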
7.2.4 Data Mining Methods
Data mining techniques like hierarchical cluster analysis and probabilistic models were applied to mine novel correlations between mutation data and the disease's clinical phenotype.
Hierarchical Cluster Analysis
Hierarchical cluster analysis was performed to subdivide MFS phenotypes into meaningful subgroups, each of them showing a characteristic phenotypic pattern. Some clustering algorithms, such as K-means, require users to specify the number of clusters as an input, but users rarely know the right number beforehand [37, 102], see also Chapter 3. Hierarchical clustering algorithms, which do not need a predetermined number of clusters as input, enable the user to determine the natural grouping with interactive visual feedback (dendrogram and color mosaic). To determine a proper number of clusters, the minimum similarity threshold (between 0 and 1) needs to be adjusted. When the hierarchical clustering algorithm merges two clusters to generate a new, bigger cluster, the distances between the new cluster and the remaining clusters need to be determined. We used the average linkage approach. Let Cn be a new cluster, the merge of Ci and Cj, let Ck be a remaining cluster and let dist be a metric distance function. The following definition is recursively applied to the clusters of all levels of the hierarchy, starting with the singleton clusters.
(See Chapter 3, Section 3.3 for more information on hierarchical clustering.)

dist(C_n, C_k) = (|C_i| / (|C_i| + |C_j|)) · dist(C_i, C_k) + (|C_j| / (|C_i| + |C_j|)) · dist(C_j, C_k)
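A minimal sketch of this update rule, assuming the pairwise cluster distances and cluster sizes are available; it is meant to illustrate the average linkage step, not the clustering tool that was actually used.

```python
def average_linkage(dist_ik, dist_jk, size_i, size_j):
    # distance between the merged cluster Cn = Ci u Cj and a remaining cluster Ck,
    # as a size-weighted average of the two previous distances
    total = size_i + size_j
    return (size_i / total) * dist_ik + (size_j / total) * dist_jk

# example: Ci (4 objects) was 2.0 away from Ck, Cj (2 objects) was 3.5 away
print(average_linkage(2.0, 3.5, 4, 2))  # -> 2.5
```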
A column-by-column normalization by rescaling from 0.0 to 1.0 was performed. The Euclidean distance was chosen as the distance/similarity measure. Cluster analysis was applied to accumulated symptoms of (1) skeletal major, (2) skeletal minor, (3) CVS major, (4) CVS minor criteria, (5) the ocular major criterion and (6) skin minor criteria.

Phenotype Score
Based on the clustered phenotype classes we introduce a quantitative measure which describes the similarity between a patient's phenotype and a characteristic phenotype class by a score value. To characterize the symptomatic similarity within a phenotype class the following definitions are required:
a) µ_system: mean value of diagnosed symptoms in an organ system (major and minor criteria separated),
b) σ_system: standard deviation of symptoms in an organ system (major and minor criteria separated),
c) k_system = 1/σ_system: factor quantifying the phenotypic purity of an organ system within a clustered phenotype class; k is thus defined for the interval (0, 1].

Definition 21 (Phenotype Score) The phenotype score s_i calculating the similarity between a query tuple (a patient's accumulated symptoms for each organ system) and a phenotype class i is given as:

s_i = c · Σ_{system ∈ S} |t_system − µ_system| · k_system
where i ∈ {1..m} and m represents the number of phenotype classes, t is the accumulated number of diagnosed symptoms in an organ system (major and minor criteria separated) of the query tuple, S is the collection of organ systems, system is a single organ system and c is a scaling factor (we set c = 10). The factor k_system weights each attribute according to its phenotypic purity within a phenotype class. A query tuple is thus classified to that phenotype class whose score value s_i (distance) is minimal:

min(s_i) ⇒ classified phenotype class, for i ∈ {1..m}

Probabilistic Model
For genotype-phenotype correlation we propose logistic regression analysis (LRA, see also Chapter 3) as a predictor for the presence of a specific FBN1 gene mutation dependent on the selected clinical symptoms. The conditional probability p for class membership in one of two classes (e.g. Sub/Mis vs. no Sub/Mis mutations, or exon vs. intron mutations) is given by p = 1/(1 + exp(−z)) for P(Y = 1) or p = 1 − 1/(1 + exp(−z)) for P(Y = 0), respectively. The equation z = b_1 x_1 + b_2 x_2 + ... + b_n x_n + c denotes the logit of the model.

Parameter selection: We used the forward selection approach to efficiently search through the space of parameter subsets and to identify the optimal ones with respect to a performance measure [40]. For the model-building process we used stratified 10-fold cross validation on our (training) data sets, which is preferred when data sets are small [4].
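The following sketch illustrates the phenotype score of Definition 21 and the minimum-score assignment described above; the per-class statistics (µ and σ per organ system, major and minor criteria separated) are assumed to be given, e.g. as in Table 7.2, and the handling of σ = 0 as maximum purity (k = 1) follows that table.

```python
def phenotype_score(query, mu, sigma, c=10.0):
    # s_i = c * sum over organ systems of |t_system - mu_system| * k_system,
    # with purity factor k_system = 1 / sigma_system (k = 1 if sigma = 0)
    s = 0.0
    for system, t in query.items():
        k = 1.0 if sigma[system] == 0 else 1.0 / sigma[system]
        s += abs(t - mu[system]) * k
    return c * s

def classify_phenotype(query, classes):
    # `classes` maps a phenotype class name to its (mu, sigma) dictionaries;
    # the query tuple is assigned to the class with minimal score s_i
    return min(classes, key=lambda name: phenotype_score(query, *classes[name]))
```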
7.3 Results
Accumulated clinical symptoms of four affected organ systems (skeletal major and minor, CVS major and minor, ocular major and skin minor system) were clustered using
Table 7.2: Phenotypic Purity within a Class.
                 Skeleton     Skeleton     Ocular      CVS         CVS         Skin
                 Maj. (0-7)   Min. (0-4)   Maj. (0,1)  Maj. (0-2)  Min. (0-2)  Min. (0-2)
Phenotype I    µ 2.74         2.04         1.00        1.00        0.72        0.59
               σ 1.29         1.09         0.00        0.00        0.58        0.62
               k 0.28         0.33         1.00        1.00        0.56        0.54
Phenotype II   µ 2.42         2.06         0.00        1.00        0.68        0.71
               σ 1.39         1.36         0.00        0.00        0.70        0.59
               k 0.25         0.26         1.00        1.00        0.50        0.56
Phenotype III  µ 1.69         1.56         1.00        0.00        0.31        0.69
               σ 1.35         1.21         0.00        0.00        0.48        0.60
               k 0.26         0.30         1.00        1.00        0.62        0.55
Phenotype IV   µ 1.71         2.29         0.00        0.00        0.43        0.29
               σ 1.11         1.11         0.00        0.00        0.53        0.49
               k 0.33         0.33         1.00        1.00        0.59        0.61
hierarchical cluster analysis. Figure 7.1 depicts the dendrogram of the data set consisting of 100 patients with classical or suspected MFS. Four phenotype classes (I, II, III, and IV) were identified at a minimum similarity threshold of 0.5. The type and consequence of a patient's FBN1 mutation is denoted left of the color mosaic. Statistical analysis (µ, σ) of each detected phenotype class is shown in Table 7.2. The phenotypic purity within a clustered phenotype is given by the introduced factor k (maximum purity: k = 1, maximum impurity: k → 0). Maximum dissimilarity between the four clustered phenotype classes, however, is primarily caused by the alternating presence of the ocular major criterion (ectopia lentis) and the CVS major criterion (aortic root dilatation) in the different clusters. Moreover, both dichotomous attributes showed maximum purity (k = 1) within each phenotype class. In detail, types I and III are characterized by the presence of an ectopia lentis, types I and II by an aortic root dilatation, while type IV
manifests neither an ectopia lentis nor an aortic root dilatation. However, the coincidence of both major symptoms, ectopia lentis and aortic root dilatation, in type I corresponds well with a more severe clinical picture, and the absence of both symptoms with a milder clinical picture of the MFS phenotype. In contrast, skeletal major and minor criteria, CVS minor and skin minor criteria yielded a lower purity in all four clustered phenotype classes, expressed by a k factor < 1. Individuals with the mildest clinical manifestations of skeletal and CVS symptoms are represented in phenotype class III, while phenotype class I indicates the most severe manifestations in the same organ systems. Type IV, however, represents the mildest MFS phenotype, without ocular manifestation, with marginal CVS and skin, and moderate skeletal symptoms.
LRA was performed to discriminate FBN1 missense mutations from all other types of mutations by selecting the most relevant organ systems from the entire attribute space. The model was trained and cross-validated on 55 cases with a Sub/Mis mutation (class 0) and 45 cases representing the pool of no Sub/Mis mutations (class 1). The model revealed a sensitivity of 57.8%, a specificity of 78.2% and an accuracy of 69% (see Figure 7.2). The probability for the presence of a Sub/Mis mutation - independent of the position in the gene - is thus given by the following equation: P(Sub/Mis = 1) = 1 − 1/[1 + exp(0.113 + 1.708 · ocular(major) − 0.275 · skeletal(major) − 0.471 · skin(minor))]. Ocular major, skeletal major and skin minor criteria are the attributes selected by forward selection, which can be interpreted as the predominant clinical phenotype of MFS patients with a FBN1 missense mutation.
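As a worked illustration, the reported model can be evaluated directly on a patient's accumulated criteria; the coefficients are the ones quoted above, while the input values in the example are hypothetical.

```python
import math

def p_sub_mis(ocular_major, skeletal_major, skin_minor):
    # reported logit: z = 0.113 + 1.708*ocular(major) - 0.275*skeletal(major) - 0.471*skin(minor)
    z = 0.113 + 1.708 * ocular_major - 0.275 * skeletal_major - 0.471 * skin_minor
    return 1.0 - 1.0 / (1.0 + math.exp(z))

# hypothetical patient: ectopia lentis present, four major skeletal criteria, one skin criterion
print(round(p_sub_mis(1, 4, 1), 2))  # ~0.56
```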
Correlating the presence of Sub/Mis mutations with the clustered phenotype classes I-IV, highest correlation was found with phenotype classes I and III (cf. Table 7.3). In
Table 7.3: Correlation between Sub/Mis Mutations and Phenotype Classes I-IV.
                       Phenotype I  Phenotype II  Phenotype III  Phenotype IV
N                      46           31            16             7
P(Sub/Mis=1)           0.68         0.30          0.72           0.38
Frequency Sub/Mis (%)  71.1         22.6          62.5           71.4
Table 7.4: MFS Phenotype Within Members of a Family.
Age  Detected Mutation  Type of Mutation  P(Y=1), Y = Sub/Mis  Phenotype class  Scores for Phenotypes I-IV
8    507delT;Y170fsX20  Del/Fs            0.33                 II               21 11 31 23
9    507delT;Y170fsX20  Del/Fs            0.24                 II               23 11 34 27
10   507delT;Y170fsX20  Del/Fs            0.24                 II               18 7 31 26
16   507delT;Y170fsX20  Del/Fs            0.24                 II               17 7 28 22
16   507delT;Y170fsX20  Del/Fs            0.29                 II               25 12 35 29
41   507delT;Y170fsX20  Del/Fs            0.29                 II               22 9 26 22
Table 7.3, N is the number of cases representing phenotype classes I-IV, and P(Sub/Mis=1) is the mean probability for the presence of a Sub/Mis mutation in phenotype classes I-IV. The observed frequency of Sub/Mis mutations within each phenotype class is shown in the last line. Both phenotypes yielded a mean probability of approximately 0.7 for the presence of a Sub/Mis mutation, which corresponds well with the observed frequency of Sub/Mis cases (71.7% for type I and 62.5% for type III) within these classes. Only phenotype class IV showed a discrepancy between the frequency of missense mutations and the P(Sub/Mis) value, primarily caused by the small number of clustered cases (n=7). On the other hand, phenotype class II, which constituted more severe manifestations of all investigated systems, but without the presence of an ectopia lentis, contains a fraction of 45% of Del/Fs mutations and represents 78% of all investigated Del/Fs mutations. No correlation between the position and the nature of a FBN1 mutation and the severity of the phenotype was found when comparing e.g. intron with exon mutations, or mutations on exons 1-23 with those on exons 33-68.
Table 7.4 demonstrates an example of an MFS affected family (father with his three children and two nieces) with a detected Del/Fs mutation. The minimum score value, which was calculated for all phenotype classes I-IV, related all family members to phenotype class II. The probability for a Sub/Mis mutation within the family was lower than 34%, which corresponds well to the probability that a Sub/Mis mutation is present in phenotype class II (cf. Table 7.3). All family members belong to phenotype class II, showing a very similar intra-familial phenotypic expression. However, our findings may not be generalizable, because variable phenotypic expressions due to the age-related or pleiotropic nature of some symptoms can be observed within affected families. In order to describe the phenotypic similarity within families and individuals with the same detected mutation, and between one of the clustered phenotype classes, we introduced a novel phenotype score. This score value enables us to quantify a patient's phenotypic similarity to a characteristic phenotype pattern by minimizing the distance to that phenotype class. We could demonstrate that our approach is practical for phenotype classification on the level of accumulated criteria. However, an extension of this approach to the level of each single symptom (approximately 30 single symptoms according to the Gent nosology) may have potential for a more detailed phenotype classification. Nevertheless, more data are essential to generate representative score values, in particular for phenotype class IV.
7.4 Conclusion
In this chapter, genetic and phenotypical data on the Marfan Syndrome has been integrated and analyzed for mining genotype-phenotype correlations. In particular, it has been shown that there is a strong correlation between FBN1 missense mutations and distinct phenotypic groups of symptoms. The knowledge on genotype-phenotype correlations may be used for an improved disease management. Based on an analysis of a
patient's genotype at the time of diagnosis, or even within a screening test, it is possible to predict which symptoms have to be expected. This information may be a factor in the decision about which kind of treatment regime is applied.
Data integration and the systematic application of data mining methods will become more and more important in the future. The emerging research area of systems biology requires that various data sources from genomics, proteomics, metabolomics and phenotypic data are integrated and analyzed. For data mining this implies many novel challenges. Providing guidelines for the knowledge discovery process that consider application specific aspects will become more and more important. Given a goal for the knowledge discovery process, the questions of which steps to combine to achieve this goal in the best possible way and which algorithms to apply are research topics of major importance. In addition, novel algorithms are needed to meet special requirements emerging from the data. The following two chapters deal with feature selection techniques, which are essential for knowledge discovery in high dimensional data. On many biomedical data sets, such as proteomic spectra, data mining algorithms cannot be applied in the full dimensional space due to performance limitations.
Figure 7.1: Hierarchical Clustering.
Figure 7.2: ROC curves of the LRA model built on three selected clinical criteria (ocular major, skeletal major and skin minor criteria). The dependent variable is the presence/absence of a Sub/Mis mutation in the fibrillin-1 gene.
Chapter 8
Subspace Clustering
Like Chapter 4, this chapter deals with unsupervised clustering which is, besides classification, the most important data mining task. Clustering is intended to help a user to discover and to understand the natural structure or grouping in a data set. Chapter 4 gives one possible solution to the problem of how to find such a grouping. This chapter focuses on the question of where to find a distinct grouping in the data set. In many high dimensional data sets, the points tend to be scattered throughout the data space such that no explicit clusters can be found in the full dimensional space. However, there may be clusters in subspaces spanned by subsets of the attributes. This chapter introduces SURFING (Subspaces Relevant for Clustering), an algorithm which identifies the most relevant subspaces without requiring sensitive input parameters. The next section gives an introduction to the subspace clustering problem. Section 8.2 surveys related work. In Section 8.3 the proposed algorithm is illustrated in detail. After an experimental evaluation in Section 8.4 using, among others, gene expression and metabolic data sets, the chapter ends with some concluding remarks in Section 8.5.
Figure 8.1: Subspace Clustering - Motivation. (a) Cluster Structure Only in Subspace. (b) Different Clusterings in Different Subspaces.

8.1 Introduction
A lot of work has been done in the area of clustering (see Chapter 3 for an introduction and some fundamental algorithms, e.g. [5] for an overview, and also Chapter 4 for a novel clustering method). However, many biomedical data sets consist of very high dimensional feature vectors. As mentioned, in such high dimensional feature spaces most of the common algorithms tend to break down in terms of efficiency and accuracy, because usually many features are irrelevant and/or correlated. However, often clusters can be identified in subsets of the features.
This effect can even be illustrated using a 2-d example as depicted in Figure 8.1(a). In this example, there are two distinct clusters if the data is projected onto the subspace consisting only of attribute a1. In the 2-d space and in the other 1-d subspace, representing the attribute a2, no cluster structure is evident. In addition, different subgroups of features may be irrelevant or correlated considering varying subgroups of the data objects. Thus, objects can often be clustered differently in varying subspaces. This is illustrated in Figure 8.1(b). Usually, global dimensionality reduction techniques such as PCA cannot be applied to these data sets because they cannot account for local trends in the data.
Table 8.1: Symbols and Acronyms Used in Chapter 8.
Symbol            Definition
SURFING           Subspaces Relevant for Clustering.
A                 Set of attributes of the data set DS, A = {a_1, ..., a_d}.
a_i               An attribute.
S                 A subspace, S ⊂ A.
π_S(~o)           The projection of object ~o into the subspace S.
NN_k^S(~o)        The k nearest neighbors of ~o in the subspace S.
nn-dist_k^S(~o)   The distance of ~o to its k-th nearest neighbor in S.
To cope with these problems, the procedure of feature selection has to be combined with the clustering process more closely. In recent years, the task of subspace clustering was introduced to address these demands. In general, subspace clustering is the task of automatically detecting all clusters in all subspaces of the original feature space, either by directly computing the subspace clusters (e.g. in [57]) or by selecting the most interesting subspaces for clustering (e.g. in [103]).
We propose an advanced feature selection method preserving the information of objects clustered differently in varying subspaces. Our method, called SURFING (SUbspaces Relevant For clusterING), computes all relevant subspaces and ranks them according to the interestingness of the hierarchical clustering structure they exhibit.
Table 8.1 summarizes some notations which are frequently used in this chapter (see also Table 1.1 for general notations). In particular, let A = {a_1, ..., a_d} be the set of all attributes a_i of DS. Any subset S ⊂ A is called a subspace. T is a superspace of S if S ⊂ T. The projection of an object ~o onto a subspace S ⊆ A is denoted by π_S(~o).
8.2 Survey
This section provides a brief summary on related approaches in the areas of subspace clustering and feature selection for clustering. Both areas are closely related and to be precise, SURFING is a method for feature selection. The section ends by pointing out the major advantages of SURFING.
8.2.1 Subspace Clustering
The pioneering approach to subspace clustering is CLIQUE [57], using an Apriori-like method to navigate through the set of possible subspaces. The data space is partitioned by an axis-parallel grid into equi-sized units of width ξ. Only units whose densities exceed a threshold τ are retained. A cluster is defined as a maximal set of connected dense units. Successive modifications of CLIQUE include ENCLUS [104] and MAFIA [105]. A severe drawback of all these methods is the use of grids. In general, the efficiency and the accuracy of these approaches heavily depend on the positioning of the grids. Objects that naturally belong to a cluster may be missed or objects that are naturally noise may be assigned to a cluster due to an unfavorable grid position.
Another recent approach called DOC [106] proposes a mathematical formulation for the notion of an optimal projected cluster, regarding the density of points in subspaces. DOC is not grid-based, but since the density of subspaces is measured using hypercubes of fixed width w, it has similar problems. If a cluster is bigger than the hypercube, some objects may be missed. Furthermore, the distribution inside the hypercube is not considered, and thus the hypercube need not necessarily contain only objects of one cluster.
In [107] a method called PROCLUS to compute (axis parallel) projected clusters is presented. However, PROCLUS assigns each data object to a unique projected cluster and thus misses the information of objects clustered differently in varying subspaces. The same holds for the modification ORCLUS [61], which finds arbitrarily oriented projected clusters.
8.2.2 Feature Selection for Clustering
In [103] a method called RIS is proposed that ranks the subspaces according to their clustering structure. The ranking is based on a quality criterion using the density based clustering notion of DBSCAN [8]. An Apriori-like navigation through the set of possible subspaces in a bottom-up way is performed to find all interesting subspaces. Aggregated information is accumulated for each subspace to rank its interestingness.
In [108] a quality criterion for subspaces based on the entropy of point-to-point distances is introduced. However, no algorithm is presented to compute the interesting subspaces. The authors propose to use a forward search strategy, which most likely will miss interesting subspaces, or an exhaustive search strategy, which is obviously not efficient in higher dimensional spaces.
8.2.3 Contributions
Recent density-based approaches to subspace clustering (CLIQUE) or comparable subspace selection methods (RIS) use a global density threshold for the definition of clusters due to efficiency reasons. However, the application of one global density threshold to subspaces of different dimensionality as well as to all clusters in one subspace is rather unacceptable. The data space naturally increases exponentially with each dimension added to a subspace, and clusters in the same subspace may exceed different
density parameters or exhibit a nested hierarchical clustering structure. Therefore, for subspace clustering, it would be highly desirable to adapt the density threshold to the dimensionality of the subspaces or even better to rely on a hierarchical clustering notion that is independent from a globally fixed threshold.
In this chapter, we introduce SURFING, a feature selection method for clustering which does not rely on a global density parameter. Our approach explores all subspaces exhibiting an interesting hierarchical clustering structure and ranks them according to a quality criterion. SURFING is more or less parameterless, i.e. it does not require the user to specify parameters that are hard to anticipate, such as the number of clusters, the (average) dimensionality of subspace clusters, or a global density threshold. Thus, our algorithm addresses the unsupervised notion of the data mining task "clustering" in the best possible way.
8.3 Proposed Method
This section elaborates a quality criterion which allows ranking subspaces of arbitrary dimensionality according to their interestingness for clustering. Before formally defining the quality of a subspace S, we give an intuition on the interestingness of S for clustering. Subsequently, an algorithm for efficiently finding the interesting subspaces is presented.
8.3.1 General Idea
The main idea of SURFING is to measure the "interestingness" of a subspace w.r.t. its hierarchical clustering structure, independent of its dimensionality. Like most previous approaches to subspace clustering, we base our measurement on a density-based clustering notion. Since we do not want to rely on a global density parameter, we developed a quality criterion for relevant subspaces built on the k-nearest neighbor distances (k-nn-distances) of the objects in DS.
Figure 8.2: Usefulness of the k-nn-distance to rate the interestingness of subspaces. (a) Hierarchical clustering structure in a 2-d subspace (left); corresponding sorted 3-nn graph (right). (b) Uniform distribution in a 2-d subspace (left); corresponding sorted 3-nn graph (right).

Definition 22 (k-nn-distance in a subspace) Let k ∈ ℕ (k ≤ n) and S ⊆ A. For an object ~o ∈ DS, the set of k-nearest neighbors of ~o in a subspace S, denoted by NN_k^S(~o), is the smallest set that contains (at least) k objects from the data set and for which the following condition holds:

∀~p ∈ NN_k^S(~o), ~q ∈ DS − NN_k^S(~o) : dist(π_S(~o), π_S(~p)) < dist(π_S(~o), π_S(~q)).

The k-nn-distance of an object ~o ∈ DS in a subspace S, denoted by nn-dist_k^S(~o), is the distance between ~o and its k-th nearest neighbor, formally:

nn-dist_k^S(~o) = max{dist(π_S(~o), π_S(~p)) | ~p ∈ NN_k^S(~o)}.

The k-nn-distance of an object ~o indicates how densely the data space is populated around ~o in S. The smaller the value of nn-dist_k^S(~o), the more densely the objects are packed around ~o, and vice versa. If a subspace contains a recognizable hierarchical clustering structure, i.e. clusters with different densities and noise objects, the k-nn-distances of objects should
differ significantly. On the other hand, if all points are uniformly distributed, the k-nn-distances can be assumed to be almost equal. Figure 8.2 illustrates these considerations using a sample 2-d subspace S = {a1, a2} and k = 3. In Figure 8.2(a) the data exhibits a complex hierarchical clustering structure in S. The corresponding 3-nn-distances (sorted in ascending order) differ significantly among the objects. In Figure 8.2(b) the data is uniformly distributed in S. The corresponding 3-nn-distances are equal for all objects. Consequently, we are interested in subspaces where the k-nn-distances of the objects differ significantly from each other, because the hierarchical clustering structure in such subspaces will be considerably clearer than in subspaces where the k-nn-distances are rather similar to each other.
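A minimal sketch of these quantities, assuming objects are plain feature vectors and a subspace is given as a set of attribute indices; the brute-force search only mirrors Definition 22 and makes no claim about the efficient implementation used by SURFING.

```python
import math

def project(x, S):
    # projection pi_S(x) of a feature vector onto the attribute subset S
    return [x[a] for a in sorted(S)]

def knn_dist(o, data, S, k):
    # distance of o to its k-th nearest neighbor in subspace S (brute force)
    po = project(o, S)
    dists = sorted(math.dist(po, project(p, S)) for p in data if p is not o)
    return dists[k - 1]
```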
8.3.2 A Quality Criterion for Subspaces
As mentioned above, we want to measure how much the k-nn-distances in S differ from each other. To achieve comparability between subspaces of different dimensionality, we scale all k-nn-distances in a subspace S into the range [0, 1]. Thus, we assume that nn-dist_k^S(~o) ∈ [0, 1] for all ~o ∈ DS throughout the rest of the chapter. Two well-known statistical measures for our purpose are the mean value µ_S of all k-nn-distances in subspace S, i.e.

µ_S := (1/n) · Σ_{~o ∈ DS} nn-dist_k^S(~o)
and the variance. However, the variance is not appropriate for our purpose because it measures the squared differences of each k-nn-distance to µ_S and thus high differences are weighted more strongly than low differences. For our quality criterion we want to measure the non-weighted differences of each k-nn-distance to µ_S. Since the sum of the differences of all objects above µ_S is equal to the sum of the differences of all objects below µ_S, we only take half of the sum of all differences to the mean value, denoted by diff_{µ_S}, which can be computed by
diff_{µ_S} = (1/2) · Σ_{~o ∈ DS} |µ_S − nn-dist_k^S(~o)|.
In fact, diff_{µ_S} is already a good measure for rating the interestingness of a subspace. We can further scale this value by µ_S times the number of objects having a smaller k-nn-distance in S than µ_S, i.e. the objects contained in the following set:

Below_S := {~o ∈ DS | nn-dist_k^S(~o) ≤ µ_S}.

Obviously, if Below_S is empty, the subspace contains uniformly distributed noise.

Definition 23 (quality of a subspace) Let S ⊆ A. The quality of S, denoted by quality(S), is defined as follows:

quality(S) = 0 if Below_S = ∅, and quality(S) = diff_{µ_S} / (|Below_S| · µ_S) otherwise.
The quality values are in the range between 0 and 1. A subspace where all objects are uniformly distributed (e.g. as depicted in Figure 8.2(b)) has a quality value of approximately 0, indicating a less interesting clustering structure. On the other hand, the clearer the hierarchical clustering structure in a subspace S is, the higher is the value of quality(S). For example, the sample 2-d subspace in which the data is highly structured, as depicted in Figure 8.2(a), will have a significantly higher quality value. Let us note that in the synthetic case where all objects in Below_S have a k-nn-distance of 0 and all other objects have a k-nn-distance of 2 · µ_S, the quality value quality(S) is 1.
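Written out in code, the criterion looks as follows; this sketch reuses the knn_dist helper from the previous sketch and, as an assumption of the illustration, performs the scaling of the k-nn-distances into [0, 1] by dividing by their maximum in S.

```python
def quality(data, S, k):
    d = [knn_dist(o, data, S, k) for o in data]
    m = max(d)
    d = [x / m for x in d] if m > 0 else d        # scale k-nn-distances into [0, 1]
    mu = sum(d) / len(d)                          # mean k-nn-distance in S
    below = [x for x in d if x <= mu]             # Below_S: objects denser than average
    if not below or mu == 0:
        return 0.0                                # uniformly distributed noise
    diff = 0.5 * sum(abs(mu - x) for x in d)      # diff_mu_S
    return diff / (len(below) * mu)
```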
In almost all cases, we can detect the relevant subspaces with this quality criterion, but there are two artificial cases rarely found in natural data sets which nevertheless
cannot be ignored. First, there might be a subspace containing some clusters, each of the same density, and without noise (e.g. data set A in Figure 8.3). If the number of data objects in the clusters exceeds k, such subspaces cannot be distinguished from subspaces containing uniformly distributed data objects spread over the whole attribute range (e.g. data set B in Figure 8.3) because in both cases, the k-nn-distances of the objects will marginally differ from the mean value. Second, subspaces containing data of one Gaussian distribution spread over the whole attribute range are not really interesting. However, the k-nn-distances of the objects will scatter significantly around the mean value. Thus, such subspaces cannot be distinguished from subspaces containing two or more Gaussian clusters without noise.
To overcome these two artificial cases, we can virtually insert some randomly generated points before computing the quality value of a subspace. In cases of a uniform or Gaussian distribution over the whole attribute range, the insertion of a few randomly generated additional objects does not significantly affect the quality value. The k-nn-distances of these objects are similar to the k-nn-distances of all the other data objects. However, if there are dense and empty areas in a subspace, the insertion of some additional points very likely increases the quality value, because these additional objects have large k-nn-distances compared to those of the other objects. The table in Figure 8.3 shows the quality value of the 2-d data set A depicted in Figure 8.3 w.r.t. the percentage of virtually inserted random objects. Data set B in Figure 8.3 has no visible cluster structure and therefore the virtually inserted points do not affect the quality value. For example, 0.2% additionally inserted points means that for n = 5,000, 10 random objects have been virtually inserted before calculating the quality value.
Figure 8.3: Benefit of Inserted Points (data set A with a visible cluster structure, data set B without).

% of additional points inserted | quality of data set A | quality of data set B
0                               | 0.13                  | 0.15
0.1                             | 0.15                  | 0.15
0.2                             | 0.19                  | 0.15
0.5                             | 0.31                  | 0.15
1                               | 0.38                  | 0.15
5                               | 0.57                  | 0.15
10                              | 0.57                  | 0.15
Thus, inserting randomly generated points is a proper strategy to distinguish (good) subspaces containing several uniformly distributed clusters of equal density or several Gaussian clusters without noise from (bad) subspaces containing only one uniform or Gaussian distribution. In fact, it empirically turned out that 1% of additional points is sufficient to achieve the desired results. Let us note that this strategy is only required if the subspaces contain a clear clustering structure without noise. In most real-world data sets the subspaces do not show a clear cluster structure and often have much more than 10% noise. In addition, the number of noise objects usually grows with increasing dimensionality. In such data sets, virtually inserting additional points is not required. Since our quality criterion is very sensitive to areas of different density, it is suitable for detecting relevant subspaces in data sets with high percentages of noise, e.g. in gene expression data sets or in synthetic data sets containing up to 90% noise.
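A minimal sketch of the virtual point insertion, building on the illustrative helpers knn_distances() and subspace_quality() given earlier (the uniform sampling over the attribute ranges of the subspace is an assumption about the thesis' data generator, not a quoted detail):

import numpy as np

def quality_with_virtual_points(data, subspace, k, fraction=0.01, seed=0):
    """Recompute quality(S) after virtually inserting fraction * n random points.

    The random points are drawn uniformly from the attribute ranges of the
    subspace; they are only used for the quality computation and are not added
    to the data set itself.
    """
    rng = np.random.default_rng(seed)
    proj = data[:, subspace]
    n_extra = max(1, int(round(fraction * len(data))))
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    extra = rng.uniform(lo, hi, size=(n_extra, len(subspace)))
    augmented = np.vstack([proj, extra])
    # treat the augmented projection as a data set over all of its columns
    nn = knn_distances(augmented, list(range(len(subspace))), k)
    return subspace_quality(nn)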
8.3.3 Algorithm
The pseudocode of the algorithm SURFING is given in Figure 8.4. Since lower dimensional subspaces are more likely to contain an interesting clustering, SURFING generates all relevant subspaces in a bottom-up way, i.e. it starts with all 1-dimensional subspaces S1 and discards as many irrelevant subspaces as early as possible. Therefore, we need a criterion to decide whether it is interesting to generate and examine a certain subspace or not. Our quality measure described above can only be used to decide about the interestingness of an already given subspace. An important piece of information we have gathered while proceeding to dimension l is the quality of all (l − 1)-dimensional subspaces. We can use this information to compute a quality threshold which enables us to rate all l-dimensional candidate subspaces Sl. We use the lowest quality value of any (l − 1)-dimensional subspace as threshold. If the quality values of the (l − 1)-dimensional subspaces do not differ enough (it empirically turned out that a difference of at least 1/3 is a reasonable reference difference), we take half of the best quality value instead. Using this quality threshold, we can divide all l-dimensional subspaces into three different categories:
• Interesting subspaces: the quality value increases or stays the same w.r.t. its (l − 1)-dimensional subspaces.
• Neutral subspaces: the quality decreases w.r.t. its (l − 1)-dimensional subspaces, but lies above the threshold and thus might indicate a higher dimensional interesting subspace.
• Irrelevant subspaces: the quality decreases w.r.t. its (l − 1)-dimensional subspaces and lies below the threshold.
We use this classification to discard all irrelevant l-dimensional subspaces from further consideration. We know that these subspaces are not interesting themselves and, as
our quality value is comparable over different dimensions, we further know that no superspace of such a subspace will obtain a high quality value compared to interesting subspaces of dimensionality l. Even if the quality value slightly increased through adding a "good" dimension, it would not become better than that of already existing interesting subspaces.
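The threshold computation and the three-way classification can be sketched as follows. This is illustrative code, not the thesis' implementation; in particular, comparing each candidate against the best quality among its (l − 1)-dimensional subspaces is one possible reading of the category definitions:

def quality_threshold(prev_qualities):
    """Threshold tau derived from the qualities of the (l-1)-dimensional subspaces.

    If the lowest quality is within 1/3 of the highest one (i.e. the qualities
    do not differ enough), half of the best quality is used instead.
    """
    lo, hi = min(prev_qualities), max(prev_qualities)
    if lo > (2.0 / 3.0) * hi:
        return hi / 2.0
    return lo

def categorize(candidates, parent_quality, tau):
    """Split l-dimensional candidates into interesting / neutral / irrelevant.

    candidates     : dict mapping subspace -> quality(S)
    parent_quality : dict mapping subspace -> best quality among its
                     (l-1)-dimensional subspaces
    """
    interesting, neutral, irrelevant = [], [], []
    for s, q in candidates.items():
        if q >= parent_quality[s]:
            interesting.append(s)
        elif q > tau:
            neutral.append(s)
        else:
            irrelevant.append(s)
    return interesting, neutral, irrelevant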
However, before we discard an irrelevant subspace S of dimensionality l, we have to test whether its clustering structure exhibits one of the artificial cases mentioned in the previous section. For that purpose, if the quality of S is lower than the quality of a subspace containing an l-dimensional Gaussian distribution, we insert 1% random points and recompute the quality of S. Otherwise, the clustering structure of S cannot get better through the insertion of additional points. In case of a clean cluster structure without noise in S, the quality value improves significantly after the insertion. At least it will be better than the quality of the l-dimensional Gaussian distribution, and, in this case, S is not discarded.
If, due to the threshold, there are only irrelevant l-dimensional subspaces, we do not use the threshold but keep all l-dimensional subspaces. In this case, the information we have so far is not enough to decide about the interestingness.
Finally, the remaining l-dimensional subspaces in Sl are joined if they share (l − 1) dimensions to generate the set of (l + 1)-dimensional candidate subspaces Sl+1. SURFING terminates if the resulting candidate set is empty.
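The join step is an apriori-style candidate generation. A short, self-contained sketch (illustrative names, representing subspaces as frozensets of attribute indices):

from itertools import combinations

def generate_candidates(subspaces):
    """Join l-dimensional subspaces sharing (l-1) attributes into
    (l+1)-dimensional candidates, as in the bottom-up step described above."""
    candidates = set()
    for s1, s2 in combinations(subspaces, 2):
        if len(s1 & s2) == len(s1) - 1:   # share exactly l-1 attributes
            candidates.add(s1 | s2)
    return candidates

# Example: {a1,a2}, {a1,a3}, {a2,a3} are joined into {a1,a2,a3}
print(generate_candidates([frozenset({1, 2}), frozenset({1, 3}), frozenset({2, 3})]))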
SURFING needs only one input parameter k, the choice of which is rather simple. If k is too small, the k-nn-distances are not meaningful, since objects within dense regions might have similar k-nn-distance values as objects in sparse regions. If k is too
high, the same phenomenon may occur. Obviously, k must somehow correspond to the minimum cluster size, i.e. the minimal number of objects regarded as a cluster. In fact, for parameter selection the same guidelines apply as discussed in detail in Chapter 6, Section 6.3.2. The only difference is that no class information is given for the objects now.
8.4 Experiments
We tested SURFING on several synthetic and biomedical data sets and evaluated its accuracy in comparison to CLIQUE, RIS and the subspace selection proposed in [108] (in the following called Entropy). All experiments were run on a PC with a 2.79 GHz CPU and 504 MB RAM. We combined SURFING, RIS and Entropy with the hierarchical clustering algorithm OPTICS ([39], see also Chapter 3, Section 3.3.2) to compute the hierarchical clustering structure in the detected subspaces. A short description of the data sets:
• Synthetic Data. The synthetic data sets were generated by a self-implemented data generator. It allows specifying the number and dimensionality of subspace clusters, the dimensionality of the feature space and density parameters for the whole data set as well as for each cluster. In a subspace that contains a cluster, the average density within that cluster is much larger than the density of noise. In addition, it is ensured that none of the synthetically generated data sets can be clustered in the full dimensional space.
• Gene Expression Data. We tested SURFING on a gene expression data set studying the yeast mitotic cell cycle [109]. We used only the data set of the CDC15 mutant and eliminated those genes from our test data set having missing attribute values. The resulting data set contains around 4000 genes measured at 24 different
time slots. The task is to find functionally related genes using cluster analysis.
• Metabolic Data. In addition we tested SURFING on high-dimensional metabolic data provided by the newborn screening program in Bavaria, Germany. Our experimental data sets were generated by modern tandem mass spectrometry. In particular we focused on a dimensionality of 14 metabolites in order to mine single and promising combinations of key markers in the abnormal metabolism of phenylketonuria (PKU), a severe amino acid disorder. The resulting database contains 319 cases designated as PKU and 1,322 control individuals, each described by 14 amino acids and intermediate metabolic products, i.e. Ala, Arg, ArgSuc, Cit, Glu, Gly, Met, Orn, Phe, Pyrglt, Ser, Tyr, Val and Xle. The task is to extract a subset of metabolites that correspond well to the abnormal metabolism of PKU.
See Chapter 2 for more background information on the gene expression (Section 2.2) and the metabolic data set (Section 2.4).
8.4.1 Efficiency
The runtimes of SURFING applied to the synthetic data sets are summarized in Table 8.2. In all experiments, we set k = 10. For each subspace, SURFING needs O(n²) time to compute the k-nn-distance for each of the n points in DS, since there is no index structure which could support the partial k-nn-queries in arbitrary subspaces in logarithmic time. If SURFING analyzes m different subspaces, the overall runtime complexity is O(m · n²). Of course, in the worst case m can be 2^d, but in practice we only examine a very small percentage of all possible subspaces.
Indeed, our experiments show that the heuristic generation of subspace candidates used by SURFING ensures a small value for m (cf. Table 8.2). For most complex
Table 8.2: Results on Synthetic Data Sets.

data set | d  | cluster dim. | n     | # subspaces m | %          | time (s)
02       | 10 | 4            | 4936  | 107           | 10.45      | 351
03       | 10 | 4            | 18999 | 52            | 5.08       | 2069
04       | 10 | 4            | 27704 | 52            | 5.08       | 4401
05       | 15 | 2            | 4045  | 119           | 0.36       | 194
06       | 15 | 5            | 3802  | 391           | 1.19       | 807
07       | 15 | 3,5,7        | 4325  | 285           | 0.87       | 715
08       | 15 | 5            | 4057  | 197           | 0.60       | 391
09       | 15 | 7            | 3967  | 1046          | 3.19       | 3031
10       | 15 | 12           | 3907  | 4124          | 12.59      | 15321
11       | 10 | 5            | 3700  | 231           | 22.56      | 442
12       | 20 | 5            | 3700  | 572           | 0.05       | 1130
13       | 30 | 5            | 3700  | 1077          | 0.0001     | 2049
14       | 40 | 5            | 3700  | 1682          | 1.5·10^-7  | 3145
15       | 50 | 5            | 3700  | 2387          | 2.1·10^-10 | 4255
16       | 15 | 4,6,7,10     | 2671  | 912           | 2.8        | 4479
data sets, SURFING computes less than 5% of the total number of possible subspaces. In most cases, this ratio is even significantly less than 1%. For data set 10 in Table 8.2, where the cluster is hidden in a 12-dimensional subspace of a 15-dimensional feature space, SURFING only computes 12.5% of the possible subspaces. Finally, for both biomedical data sets, SURFING computes even significantly less than 0.1% of the possible subspaces (not shown in Table 8.2). The worst observed percentage was around 20%. This empirically demonstrates that SURFING is a highly efficient solution for the complex subspace selection problem.
8.4.2 Effectiveness
Results on Synthetic Data

We applied SURFING to several synthetic data sets (cf. Table 8.2). In all but one case, SURFING detected the correct subspaces containing the relevant clusters and ranked
them first. Even for data set 16, SURFING was able to detect 4 out of 5 subspaces containing clusters, although the clustering structure of the subspaces containing clusters was rather weak, e.g. one of the 4-dimensional subspaces contained a cluster with only 20 objects having an average k-nn-distance of 2.5 (the average k-nn-distance for all objects in all dimensions was 15.0). SURFING only missed a 10-dimensional subspace which contained a cluster with 17 objects having an average k-nn-distance of 9.0.
Results on Gene Expression Data

We tested SURFING on the gene expression data set and retrieved a hierarchical clustering by applying OPTICS [39] to the top-ranked subspaces. We found many biologically interesting and significant clusters in several subspaces. The functional relationships of the genes in the resulting clusters were validated using the public Saccharomyces Genome Database (http://www.yeastgenome.org/). Some excerpts from sample clusters in varying subspaces found by SURFING applied to the gene expression data are depicted in Table 8.4. Cluster 1 contains several cell cycle genes. In addition, the two gene products are part of a common protein complex. Cluster 2 contains the gene STE12, an important regulatory factor for the mitotic cell cycle [109], and the genes CDC27 and EMP47 which are most likely co-expressed with STE12. Cluster 3 consists of the genes CDC25 (starting point for mitosis), MYO3 and NUD1 (known for an active role during mitosis) and various other transcription factors required during the cell cycle. Cluster 4 contains several genes related to protein catabolism. Cluster 5 contains several structural parts of the ribosomes and related genes. Let us note that MIP6 is clustered differently in varying subspaces (cf. Cluster 1 and Cluster 5). Cluster 6 contains the genes that code for proteins participating in a common pathway. Cluster 7 consists of ribosomal and mitochondrial genes which are essential for the energy supply of the cell.
Results on Metabolic Data

Applying SURFING to metabolic data, we identified 13 subspaces considering quality values > 0.8. In detail, we extracted 5 one-dimensional subspaces (the metabolites ArgSuc, Phe, Glu, Cit and Arg), 6 two-dimensional subspaces (e.g. Phe-ArgSuc, Phe-Glu) and 3 three-dimensional subspaces (e.g. Phe-Glu-ArgSuc). Alterations of our best ranked single metabolites correspond well to the abnormal metabolism of PKU [79]. We compared the SURFING findings with results obtained using PCA. Only components with an eigenvalue > 1 were extracted. Varimax rotation was applied. PCA showed 4 components (the eigenvalues of components 1-4 are 4.039, 2.612, 1.137 and 1.033) that retain 63% of the total variation. However, SURFING's best ranked single metabolites ArgSuc, Glu, Cit and Arg are not highly loaded (> 0.6) on any of the four extracted components. Moreover, combinations of promising metabolites (higher dimensional subspaces) cannot be considered in PCA. Particularly in abnormal metabolism, not only alterations of single metabolites but rather interactions of several markers are often involved. As our results demonstrate, SURFING is better suited for metabolic data since it takes higher dimensional subspaces into account.
Influence of Parameter k

We re-ran our experiments on the synthetic data sets with k = 3, 5, 10, 15, 20. We observed that if k = 3, SURFING did find the correct subspaces but did not rank the subspaces first (i.e. subspaces with a less clear hierarchical clustering structure got a higher quality value). In the range of 5 ≤ k ≤ 20, SURFING produced similar results for all synthetic data sets. This indicates that SURFING is quite robust against the choice of k within this range.
Table 8.3: Comparative Tests on Synthetic Data (# correct clusters/subspaces found).

data set | # clusters/subspaces | CLIQUE | RIS | Entropy | SURFING
06       | 2                    | 1      | 2   | 0       | 2
07       | 3                    | 1      | 2   | 0       | 2
08       | 3                    | 1      | 3   | 0       | 3
16       | 5                    | 0      | 3   | 0       | 4
Comparison with CLIQUE

The results of CLIQUE applied to the synthetic data sets confirmed that its accuracy heavily depends on the choice of the input parameters, which is a nontrivial task. In some cases, CLIQUE failed to detect the subspace clusters hidden in the data but computed some dubious clusters. In addition, CLIQUE is not able to detect clusters of different density. Applied to our data sets, which exhibit several clusters with varying density (e.g. data set 16), CLIQUE was not able to detect all clusters correctly but could only detect (parts of) one cluster (cf. Table 8.3), even though we used a broad parameter setting. A similar result can be reported when we applied CLIQUE to the gene expression data set. CLIQUE was not able to obtain any useful clusters for a broad range of parameter settings. In summary, SURFING not only outperforms CLIQUE in terms of quality, but also saves the user from finding a suitable parameter setting.
Comparison with RIS

Using RIS causes similar problems as observed when using CLIQUE. The quality of the results computed by RIS also depends, with slightly less impact, on the input parameters. Like CLIQUE, in some cases RIS failed to detect the correct subspaces due to the utilization of a global density parameter (cf. Table 8.3). For example, applied to data set 16, RIS was able to compute the lower dimensional subspaces, but could not detect the higher dimensional one. The application of RIS to the gene expression data set is
described in [103]. SURFING confirmed these results but found several other interesting subspaces with important clusters, e.g. clusters 5 and 6 in the subspace 70, 90, 110, 130 (cf. Table 8.4). Applying RIS to the metabolic data set, the best ranked subspace contains 12 attributes, which represents nearly the full feature space and is biologically not interpretable. The application of RIS to all data sets was limited by the choice of the right parameter setting. Again, SURFING not only outperforms RIS in terms of quality, but also saves the user from finding a suitable parameter setting.
Comparison with Entropy

Using the quality criterion Entropy in conjunction with the forward search algorithm proposed in [108], none of the correct subspaces were found. In all cases, the subspace selection method stops at a dimensionality of 2. Possibly, an exhaustive search examining all possible subspaces could produce better results. However, this approach obviously yields unacceptable run times. Applied to the metabolic data, the biologically relevant 1-dimensional subspaces are ranked low.
8.5 Conclusion
In this chapter, we introduced a new method for subspace clustering called SURFING which is almost parameter-free. One goal of this thesis was to propose unsupervised data mining methods which are as free as possible from sensitive input parameters. See also Chapter 4 for a parameter-free clustering method. In contrast to most recent approaches, the proposed algorithm does not rely on a global density threshold. SURFING selects and ranks subspaces of high dimensional data according to their interestingness for clustering. We empirically showed that the only input parameter of SURFING is stable in a broad range of settings. Thus, SURFING addresses the unsupervised notion of
the subspace clustering task in the best possible way. SURFING does not favor subspaces of a certain dimensionality.
A broad, comparative experimental evaluation using synthetic, gene expression and metabolic data sets shows that SURFING is an efficient and accurate solution to the complex subspace clustering problem. It outperforms recent subspace clustering methods in terms of effectiveness.
A possible starting point for future work is the development of an efficient index structure for partial k-nn-queries in arbitrary subspaces to further improve the runtime of SURFING. It would also be interesting to investigate if the search for relevant subspaces could be guided by application specific, biological information. To achieve this, biological information could be included in the bottom-up subspace merging process, e.g. to only merge subspaces containing clusters of genes with a specific function. The resulting algorithm would be a semi-supervised subspace selection algorithm.
As pointed out in Chapter 2, dimensionality reduction often is an essential preprocessing step enabling data mining on biomedical data. The next chapter is dedicated to a feature selection framework for high throughput proteomic data sets.
algorithm SURFING(Database DS, Integer k)
  // 1-dimensional subspaces
  S1 := {{a1}, ..., {ad}};
  compute quality of all subspaces S ∈ S1;
  Smin := S ∈ S1 with lowest quality;
  Smax := S ∈ S1 with highest quality;
  if quality(Smin) > 2/3 · quality(Smax) then
    τ := quality(Smax) / 2;
  else
    τ := quality(Smin);
    S1 := S1 − {Smin};
  end if
  // l-dimensional subspaces
  l := 2;
  create S2 from S1;
  while not Sl = ∅ do
    compute quality of all subspaces S in Sl;
    Interesting := {S ∈ Sl | quality(S) ↑};
    Neutral := {S ∈ Sl | quality(S) ↓ ∧ quality(S) > τ};
    Irrelevant := {S ∈ Sl | quality(S) ≤ τ};
    Smin := S ∈ Sl with lowest quality;
    Smax := S ∈ Sl − Interesting with highest quality;
    if quality(Smin) > 2/3 · quality(Smax) then
      τ := quality(Smax) / 2;
    else
      τ := quality(Smin);
    end if
    if not all subspaces irrelevant then
      Sl := Sl − Irrelevant;
    end if
    create Sl+1 from Sl;
    l := l + 1;
  end while
end
Figure 8.4: Algorithm SURFING.
Table 8.4: Results on Gene Expression Data.

Cluster 1 (subspace 90, 110, 130, 190):
  RPC40: builds complex with CDC60
  CDC60: tRNA synthetase
  FRS1: tRNA synthetase
  DOM34: protein synthesis, mitotic cell cycle
  CKA1: mitotic cell cycle control
  CPA1: control of translation
  MIP6: RNA binding activity, mitotic cell cycle
Cluster 2 (subspace 90, 110, 130, 190):
  STE12: transcription factor (cell cycle)
  CDC27: possible STE12-site
  EMP47: possible STE12-site
  XBP1: transcription factor
Cluster 3 (subspace 90, 110, 130, 190):
  CDC25: starting control factor for mitosis
  MYO3: control/regulation factor for mitosis
  NUD1: control/regulation factor for mitosis
Cluster 4 (subspace 190, 270, 290):
  RPT6: protein catabolism; complex with RPN10
  RPN10: protein catabolism; complex with RPT6
  UBC1: protein catabolism; part of 26S protease
  UBC4: protein catabolism; part of 26S protease
  MRPL17: part of mit. large ribosomal subunit
  MRPL31: part of mit. large ribosomal subunit
  MRPL32: part of mit. large ribosomal subunit
  MRPL33: part of mit. large ribosomal subunit
  SNF7: interaction with mit. protein VPS2
  VPS4: mit. protein; interaction with SNF7
Cluster 5 (subspace 70, 90, 110, 130):
  SOF1: part of small ribosomal subunit
  NAN1: part of small ribosomal subunit
  RPS1A: structural constituent of ribosome
  MIP6: RNA binding activity, mitotic cell cycle
Cluster 6 (subspace 70, 90, 110, 130):
  RIB1: participates in riboflavin biosynthesis
  RIB4: participates in riboflavin biosynthesis
  RIB5: participates in riboflavin biosynthesis
Cluster 7 (subspace 70, 90, 110, 130):
  MSH1: mitochondrial DNA repair
  MRP2: part of mit. small ribosomal subunit
  MRPL6: part of mit. large ribosomal subunit
  MGM101: mitochondrial genome maintenance
  MIP1: subunit of mitochondrial DNA polymerase
Chapter 9 Feature Selection for Classification
This chapter is dedicated to feature selection for classification on proteomic spectra. As demonstrated for clustering in the previous chapter, feature selection can also be more useful than feature transformation for classification. Compared to feature transformation methods, such as PCA or wavelet transformation, feature selection techniques have the inherent advantage of the interpretability of the results. In the analysis of proteomic spectra, the selected features represent specific peaks or other regions of interest, whereas features derived from e.g. wavelet transformation cannot be attributed to specific regions and thus lack interpretability. This chapter describes a supervised feature selection framework for identifying biomarker candidates for ovarian and prostate cancer from mass spectrometry data. The chapter is organized as follows: After an introduction in the next section, Section 9.2 is dedicated to related work on feature selection methods. The used data sets are introduced and we summarize our contributions. In Section 9.3 we elaborate in detail our framework for a 3-step feature selection. Each step within the framework aims at reducing the dimensionality by removing irrelevant features and, at the same time, improving the classification accuracy. In Section 9.4 we discuss the results on data sets on ovarian and prostate cancer. Section 9.5 provides a comparison of our method to established feature selection methods and Section 9.6 concludes the chapter.
9.1 Introduction
The identification of putative proteomic marker candidates is a big challenge for the biomarker discovery process. Pathologic states within cancer tissue may be expressed by abnormal changes in the protein and peptide abundance. With the availability of modern high throughput techniques such as SELDI-TOF (surface-enhanced laser desorption and ionization time-of-flight) MS, a large amount of high dimensional mass spectrometric data is produced from a single blood or urine sample. For more biomedical background information see also Chapter 2, Section 2.3. Each spectrum is composed of peak amplitude measurements at approximately 15,200 features represented by a corresponding m/z value. For early stage ovarian cancer detection, traditional single biomarkers such as the widely used cancer antigen 125 (CA125) can only detect 50%-60% of patients with stage I ovarian cancer [110]. Analogously, the single use of the PSA value for early stage prostate cancer identification is not specific enough to reduce the number of false positives [1]. The analysis of high dimensional serum proteomic data gives a deeper insight into the abnormal protein signaling and networking, which has a high potential to identify previously undiscovered marker candidates.
More precisely, we can identify two aims for mining proteomic data: One subtask is to identify diseased subjects with the highest possible sensitivity and specificity. For this purpose, classification algorithms, e.g.
support vector machines (SVM), neural
networks and the k-nearest neighbor classifier (k-NN), can be applied. In the training phase, these algorithms are trained on a training data set which contains instances labeled according to classes, e.g. healthy and diseased, and then predict the class label of novel unlabeled instances [45]. Due to the high dimensionality of the data, most classification algorithms cannot be directly applied on proteomic mass spectra. One reason is the so-called curse of dimensionality: With increasing dimensionality the distances among
the instances become more and more similar. Noisy and irrelevant features further contribute to this effect, making it difficult for the classification algorithm to establish decision boundaries. Performance limitations are a further reason why classification algorithms are not applicable on the full dimensional space. Consequently, feature transformation techniques are applied before classification, e.g. in [111].
The second subtask is the identification of unknown marker candidates for cancer. This subtask goes beyond classification towards understanding the underlying biomolecular mechanisms of the disease. Feature selection methods, which try to find the subset of features with the highest discriminatory power, can be applied for this purpose. Nevertheless, as aforementioned, the use of traditional methods is limited due to the high dimensionality of the data. Moreover, feature transformation is not useful either, because the marker candidates cannot be identified from the transformed features. In this chapter, we propose a novel 3-step feature selection framework which combines elements of existing feature selection methods and is adapted to the special characteristics of high-throughput MS data. We present the results on two published SELDI-TOF-MS data sets on prostate and ovarian cancer. Our method identifies feature subsets with a classification accuracy between 97% and 100%. In Figure 9.1 the dimensionality and the classification accuracy for the input and the output of each step are depicted for the prostate data set. In total, we reduced the dimensionality from 15,154 to 164 and at the same time increased the classification accuracy from 90.37% to 97.83%. For the ovarian data set, the dimensionality is reduced to 9 features and the classification accuracy is increased from 99.60% to 100%.
Some notations which are frequently used in this chapter, besides the general notations in Table 1.1, are summarized in Table 9.1.
Figure 9.1: Overview. Full data set (d = 15,154, accuracy 90.37%) → Step 1: removeIrrelevant (d = 9,566, 93.16%) → Step 2: selectBestRanked (d = 1,330, 94.41%) → Step 3: selectBestRepresentatives (d = 164, 97.83%) → subset selection, mining rules, ...
Within our 3-step feature selection method, we denote the resulting data set of step i by res_i with the classification accuracy acc_i and the dimensionality dim_i. As in the previous chapter, a set of features is denoted by A = {a1, ..., ad}. In our framework, we use a classification algorithm denoted by C, which can be an arbitrary classifier, e.g. a support vector machine. Additionally we use a ranker R, which is a feature selection method generating a ranked list of the features and associating a quality value with each feature. We denote the rank of a_i by rank(a_i) and the quality of a_i by quality(a_i). We further denote by index(a_i) the index, i.e. the position of the m/z value of the feature a_i in the original data set DS. The feature with the smallest m/z value has the index 1, the feature with the largest m/z value has the index d.
res_i: Result of step i.
acc_i: Accuracy of step i.
dim_i: Dimensionality of the resulting data set of step i.
A: Set of attributes of the data set DS, A = {a1, ..., ad}.
a_i: An attribute.
C: A classifier.
R: A ranker.
rank(a_i): Rank of feature a_i.
quality(a_i): Quality of feature a_i.
index(a_i): Position of attribute a_i in DS.
Table 9.1: Symbols and Acronyms Used in Chapter 9.
9.2 Survey
In this section we give a survey on attribute selection techniques for classification and on related publications on the used data sets. In addition we summarize our contributions.
9.2.1 Feature Selection for Classification
Numerous feature selection strategies for classification have been proposed; for a comprehensive survey see e.g. [112]. Following a common characterization, we distinguish between filter and wrapper approaches.
Filters

Filter approaches use an evaluation criterion to judge the discriminating power of the features. Among the filter approaches, we can further distinguish between rankers and feature subset evaluation methods. Rankers evaluate each feature independently regarding its usefulness for classification. As a result, a ranked list is returned to the user. Rankers are very efficient, but interactions and correlations between the features are neglected.
Feature subset evaluation methods judge the usefulness of subsets of the features. The information about interactions between the features is in principle preserved, but the search space expands to the size of O(2^d). For high-dimensional proteomic mass spectra, only very simple and efficient search strategies, e.g. forward selection algorithms, can be applied because of the performance limitations. We now give more details on two rankers and two subset evaluation methods which are subsequently used.
Rankers. The information gain [47] of an attribute reflects how much information the attribute provides on the class label of the objects. For numerical attributes, as in our case, the data set is first discretized using the method of Fayyad and Irani [113]. The information gain of a feature a_i w.r.t. class L_j is defined as IG(a_i) = H(L_j) − H(L_j | a_i), where H(L_j) is the entropy of the class L_j, H(a_i) is the entropy of the attribute a_i and H(L_j | a_i) the conditional entropy. Relief is an instance based attribute ranking method that has been introduced by Kira and Rendell [114] and has been extended to the method reliefF for multi-class and noisy data sets by Kononenko [115]. The main idea of relief is that useful attributes have significantly different values for instances of different classes and similar values for instances of the same class. To compute a relevance score for an attribute, for randomly sampled instances the k nearest neighbors of the same class and from each of the other classes are taken into account.
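A minimal sketch of the information gain computation for an already discretized attribute; the supervised discretization of Fayyad and Irani is not reproduced here, and all names are illustrative rather than taken from the thesis or from WEKA:

import numpy as np

def entropy(labels):
    """Shannon entropy H of a discrete label vector (1-d numpy array)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(attribute, labels):
    """IG(a) = H(L) - H(L | a) for an already discretized attribute.

    attribute, labels : 1-d numpy arrays of equal length.
    """
    values, counts = np.unique(attribute, return_counts=True)
    conditional = sum(
        (c / len(labels)) * entropy(labels[attribute == v])
        for v, c in zip(values, counts)
    )
    return entropy(labels) - conditional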
Feature subset evaluation.
Correlation-based feature selection (CFS) [116] is a
feature subset evaluation heuristic which assigns high scores to feature subsets which show a high correlation with the class attribute and a low correlation among each other. Redundant features are rejected because they are highly correlated with other features, but only under the assumption that they get into a common subset during the search
process. Consistency-based subset evaluation [117] uses an evaluation metric that favors attribute subsets which achieve a strong single class majority. This means that the attribute subsets are evaluated w.r.t. their discriminatory power regarding the majority class. The idea is to find different small attribute subsets to separate the different classes from each other.
Wrappers

The wrapper attribute selection method uses a classifier to evaluate attribute subsets. Cross-validation is used to estimate the accuracy of the classifier on novel unclassified objects. For each examined attribute subset, the classification accuracy is determined. Being adapted to the special characteristics of the classifier, wrapper approaches in most cases identify attribute subsets with a higher classification accuracy than filter approaches, cf. [112]. Like the attribute subset evaluation methods, wrapper approaches can be used with an arbitrary search strategy. Among all feature selection methods, wrappers are the most computationally expensive ones, due to the use of a learning algorithm for each examined feature subset. We propose to apply a 3-step framework for feature selection on proteomics data that combines the efficiency of the filter approaches with the effectiveness of wrapper approaches and exploits special characteristics of the data.
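The wrapper idea boils down to scoring a candidate feature subset by the cross-validated accuracy of a classifier restricted to it. The following sketch uses scikit-learn as a stand-in for the WEKA implementations used in the thesis; function and parameter names are illustrative:

from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def wrapper_score(X, y, feature_subset, cv=10):
    """Accuracy of a linear SVM trained only on `feature_subset`,
    estimated by cv-fold cross-validation."""
    clf = SVC(kernel="linear", C=1.0)
    scores = cross_val_score(clf, X[:, feature_subset], y, cv=cv)
    return scores.mean()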
9.2.2 Data Sets
In the following we evaluate our feature selection method for proteomic data on two SELDI-TOF-MS data sets available at the website of the US National Cancer Institute (http://home.ccr.cancer.gov/ncifdaproteomics/ppatterns.asp). Both data sets derive from clinical studies with the objective to identify putative biomarker candidates by comparing well defined groups of cancer versus healthy controls. Each spectrum is composed of d = 15,154 features.
Ovarian Data Set

The ovarian data set 8-7-02 contains 162 instances of ovarian cancer and 91 instances of a healthy control group. The data set is an improved version of the data set published in [2], using a WCX2 protein chip and robotic automation. Trajanowski et al. [111] recently proposed an approach for ovarian cancer identification based on dimensionality reduction. They use a multi-step feature transformation technique based on wavelet transformation and present the results on this data set and on a high resolution data set on ovarian cancer. With 2-fold cross validation, an accuracy of 91.12% is reported for SVM on the wavelet reduced 8-7-02 data set. In the next section, we demonstrate that 100% accuracy can be achieved on 9 key features selected by our framework using a linear SVM and 10-fold cross validation. Even on the full dimensional space, the classification accuracy of 99.60% is much higher than on the wavelet reduced data set. This indicates that the data quality is very high. Alexe et al. [118] analyzed this data set using a combinatorics and optimization-based methodology. With a system of 41 rules they achieved a sensitivity and specificity of 100% on this data set.

Prostate Data Set

This data set consists of four classes:
• L1: healthy, with a PSA value < 1 ng/mL (63 instances),
• L2: healthy, PSA value > 4 ng/mL (190 instances),
• L3: prostate cancer, PSA value 4-10 ng/mL (26 instances),
• L4: prostate cancer, PSA value > 10 ng/mL (43 instances).
In [1] this data set has been analyzed with ProteinQuest, a tool combining genetic algorithms and cluster analysis. The same algorithm has also been applied in [2] and [119]
for feature selection for ovarian cancer data sets. (Conrads et al. used in [119] a data set very similar to the 8-7-02 data set. However, the number of instances differs. Therefore we did not provide a comparison with their results in the previous paragraph.) The genetic algorithm starts with randomly selected feature subsets. A fitness criterion rating the ability of the feature subsets to generate class-pure clusters is used. The best feature subsets are refined with the genetic algorithm. As the optimal discriminating pattern, the m/z values 2,092, 2,367, 2,582, 3,080, 4,819, 5,439 and 18,220 have been identified. This method identifies prostate cancer with 94.74% accuracy (accuracy on the class corresponding to L3 ∪ L4), and 77.73% of the instances were correctly identified to have benign conditions (accuracy on L1 ∪ L2). However, especially for class L2, i.e. healthy men with marginally elevated PSA levels, the reported specificity of 71% is quite low. Due to different methodology and validation methods, the results of ProteinQuest are not directly comparable to our technique. Nevertheless, we obtain a much higher specificity of 78.8% for class L2 and an overall accuracy of 97.83% on this data set. On the features selected by our method, a linear SVM correctly identified instances with prostate cancer with 97.10% accuracy (accuracy on L3 ∪ L4) and even 99.60% of the instances were correctly identified to have benign conditions (accuracy on L1 ∪ L2).
9.2.3 Contributions
The main advantages of our method can be summarized as follows:
• We propose a generic framework for feature selection using a classifier C, a search strategy S and a ranker R.
• Our method is efficient and applicable on very high dimensional proteomic data sets.
• The classification results on the selected features confirm and outperform the results
reported in literature on the ovarian and the prostate data set.
9.3 Proposed Method
In this section we propose a 3-step framework for feature selection. Our framework is a hybrid method combining the good performance of a filter approach with the advantages of the wrapper approach: being more closely adapted to the learning algorithm, the wrapper approach achieves better results in most cases. Our framework uses a classifier C, e.g. SVM, a ranker R, e.g. information gain, and a search strategy S. An optimal feature subset for biomarker identification and diagnosis is a subset consisting of as few features as possible and achieving the highest classification accuracy. We use C, R, S and special properties of proteomic data for an effective and efficient exploration of the huge search space of 2^d feature subsets to find a close to optimal solution. Figure 9.1 gives an overview of the steps in our feature selection framework. The classifier, the evaluation criterion and the search strategy can be chosen arbitrarily, also depending on time and memory resources. Starting with the full data set as input, the feature selection process selects and removes in each step noisy, redundant or irrelevant features, while keeping the accuracy at least constant.
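A simplified, runnable sketch of the overall idea follows. It is not the exact thesis algorithm: the modified binary search of step 2 is replaced by a few fixed cut-offs, and the region handling of step 3 is reduced to a greedy extension of the ranked list. The callables ranker and score (e.g. the wrapper_score sketch above) are assumed inputs:

def three_step_selection(X, y, ranker, score, cutoffs=(50, 100, 200, 500)):
    q = ranker(X, y)
    # Step 1: discard irrelevant features (ranker score 0)
    res1 = [i for i in range(X.shape[1]) if q[i] > 0]
    res1.sort(key=lambda i: q[i], reverse=True)      # ranked feature list
    # Step 2: keep the smallest cut-off on the ranked list that is at least
    # as accurate as res1
    base = score(X, y, res1)
    res2 = res1
    for m in cutoffs:
        if m < len(res1) and score(X, y, res1[:m]) >= base:
            res2 = res1[:m]
            break
    # Step 3: greedily add further ranked features while the accuracy improves
    best, acc = list(res2), score(X, y, res2)
    for i in res1:
        if i in best:
            continue
        a = score(X, y, best + [i])
        if a > acc:
            best, acc = best + [i], a
    return best, acc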
In the following we discuss the single steps in detail.
We give results for SVM,
k-NN as classifiers, information gain and reliefF as rankers, and ranked search, simulated annealing, and a novel heuristic called modified binary search as search strategies. For the classifiers we use the implementations of the WEKA machine learning package [90]. For the nearest neighbor classifier, we use k = 5 nearest neighbors. We use a linear SVM with the standard parametrization in WEKA of c = 1.0 and γ = 0.01. To estimate the classification accuracy we use 10-fold cross validation.
Figure 9.2: Distribution of the Ranked Features. (a) Information Gain; (b) ReliefF. The plots show the quality score of each feature over the ranked features for the prostate and ovarian data sets.
9.3.1 Step 1: Removing Irrelevant Features
High dimensional proteomic data contains noisy features and moreover a good portion of the features is irrelevant for diagnosis. In the first step, we want to identify and discard the irrelevant features using the ranker R and the classifier C. To get a baseline for the classification accuracy, we first determine the accuracy on the full feature space using C.
For the ovarian data, 99.60% is achieved with linear SVM, for the prostate data 90.37%. We then use the evaluation criterion to remove all irrelevant features. For information gain this means we remove all features with an information gain of 0 and determine the accuracy again. For the ovarian data set, dim1 = 6,238 attributes remain and the accuracy stays the same, i.e. acc1 = 99.60%. For the prostate data, the reduction to dim1 = 9,566 attributes improves the classification accuracy to acc1 = 93.16%.
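Step 1 itself is a simple filter. A hedged sketch, reusing the illustrative information_gain() function from Section 9.2.1 to score each feature beforehand:

import numpy as np

def step1_remove_irrelevant(X, ig_per_feature):
    """Keep only the features with a strictly positive information gain.

    ig_per_feature can be computed, for example, with the information_gain()
    sketch given earlier (after discretizing each numeric feature).
    """
    keep = np.flatnonzero(np.asarray(ig_per_feature) > 0)
    # e.g. for the prostate data this corresponds to the reduction
    # from 15,154 to 9,566 features reported above
    return X[:, keep], keep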
Figure 9.2(a) shows for each feature the quality according to information gain and Figure 9.2(b) according to reliefF for both data sets. Only features with information gain and reliefF > 0 are depicted and the features are displayed in the ranked order according their score w.r.t R. For both data sets and both evaluation criteria the feature ranking shows a characteristic property: There are only very few highly ranked features and many features that are irrelevant or almost irrelevant. For the ovarian data set
this aspect is even more evident and the feature ranking shows a perfect exponential distribution. This indicates that many more features can be removed without affecting the accuracy. In the following we focus our discussion on the use of information gain as ranker and SVM as classifier and give more results for reliefF and 5-NN in Section 9.4.

Figure 9.3: Search Space for Step 2. The plot shows the classification accuracy over the number of best ranked features for the ovarian and prostate data sets.
9.3.2 Step 2: Selecting the Best Ranked Features
After having identified and discarded the irrelevant features using R in the last step, we now want to further reduce the dimensionality without affecting the accuracy, i.e. our aim is to identify a feature subset res2 with acc2 ≥ acc1 and dim2 < dim1.
Figure 9.5: Selecting Region Representatives.

Definition 24 (Binning Function) Let s ∈ N. The binning function b is defined as b(a_i) = max(1, index(a_i) · s/100). For a feature a_i, b determines a region size of s% of the index of a_i. We set the minimum region size to 1.

In the following we report the results using res2 obtained with SVM and information gain and modified binary search as search strategy. For each region we now choose the best ranked feature as representative and use C to evaluate the accuracy. For the ovarian data set we obtain 9 features and an accuracy of 100%, and we are done, since the maximum accuracy has been achieved. For the prostate data set, this results in 19 features and an accuracy of 93.48%. Since the accuracy declined from originally 94.41% on 1,331 attributes, we subsequently add in each step for each region the attribute which is best ranked among the remaining attributes and evaluate the accuracy with C again.
For 187 attributes the accuracy of 94.41% is reached again. More precisely, the algorithm for adding the best representatives works as follows. We begin with an empty result data set res3 . The list of ranked features of res2 is first sorted by the index of the features. We then determine the region size of each feature using b and determine if there are any better ranked features within the region of ai in res2 . If not, ai is added to the result set. In each step, the algorithm adds for each region the best feature among the remaining features until no further improvement of accuracy can be achieved.
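The binning function and one pass of the representative selection could look as follows. This is a simplified reading of the description above and of Figure 9.5 (whose pseudocode is not fully reproduced here); in particular, treating the region as symmetric around a feature's index is an assumption, and all names are illustrative:

def region_size(index, s=1.0):
    # Definition 24: b(a_i) = max(1, index(a_i) * s / 100)
    return max(1.0, index * s / 100.0)

def pick_representatives(ranked_by_index, quality, index, chosen=frozenset(), s=1.0):
    """One pass of representative selection.

    ranked_by_index : feature ids sorted by their position in the spectrum
    quality         : dict feature id -> ranker score
    index           : dict feature id -> position of its m/z value
    chosen          : representatives selected in earlier passes (skipped here)
    """
    reps = []
    for a in ranked_by_index:
        if a in chosen:
            continue
        r = region_size(index[a], s)
        # features whose index falls into the region around a
        region = [b for b in ranked_by_index
                  if b not in chosen and abs(index[b] - index[a]) <= r]
        # a becomes a representative only if no better ranked feature lies in its region
        if all(quality[a] >= quality[b] for b in region):
            reps.append(a)
    return reps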
Adding Missing Region Representatives

Different regions of the spectra tend to contain different information. Some of the regions in our intermediate result set may be under-represented or not represented at all, since res2 has already been a drastically reduced attribute set containing only high ranked features. Therefore, we now also use the list of ranked features of res1. As described in the last section, we determine for each not represented region the best representatives using the binning function and add them as long as an improvement of the accuracy can be obtained.
The pseudocode for the whole algorithm is depicted in Figure 9.5.
The method
representatives() selects for each region the best representative which has not been selected before. As a final step (which has been left out in the pseudocode for simplicity), we test whether there are redundant features. More precisely, we take the list of features sorted w.r.t. their index. We then try to leave out, for each pair of neighboring features, the feature which has been ranked lower by R and evaluate the accuracy again. For the prostate data set, we end up with 164 features and a final accuracy of 97.83%. The confusion matrix for linear SVM and 10-fold cross validation shows only seven classification errors, four of which are due to confusing the two different stages of prostate cancer.
Table 9.3: Confusion Matrix for Prostate Data.

     | L1 | L2  | L3 | L4
L1   | 63 | 0   | 0  | 0
L2   | 0  | 189 | 1  | 0
L3   | 0  | 1   | 22 | 3
L4   | 0  | 1   | 1  | 41
In contrast to CFS, this algorithm does not only cover linear dependencies among the features. Similar to the wrapper approach, we use the classification algorithm to evaluate the intermediate results. Only features that are useful for the specific classifier are included in the result set. For a linear SVM, redundant features are almost synonymous with highly linearly correlated features, but this may be different for a neural network. The algorithm is efficient: For the prostate data set, step 3 takes 5.30 minutes, whereas for the ovarian data set we are done in 50 seconds.
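The final redundancy check described above can be sketched as follows; this is illustrative code under the stated reading of the step, with score() standing in for the cross-validated accuracy of the classifier C:

def drop_redundant_neighbors(features, quality, index, score):
    """For each pair of neighboring features (in m/z order), tentatively leave
    out the lower ranked one and keep the reduction only if the accuracy does
    not decrease. score(feature_list) -> accuracy of C on that subset."""
    current = sorted(features, key=lambda a: index[a])
    acc = score(current)
    i = 0
    while i < len(current) - 1:
        a, b = current[i], current[i + 1]
        victim = a if quality[a] < quality[b] else b
        trial = [f for f in current if f != victim]
        new_acc = score(trial)
        if new_acc >= acc:
            current, acc = trial, new_acc   # accept the smaller subset
        else:
            i += 1                          # keep both, move to the next pair
    return current, acc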
9.4 Results
Table 9.4 summarizes the results using linear SVM as classifier and the information gain as ranker. All three single steps of our method reduce the dimensionality, and at the same time improve the classification accuracy for this combination of C and R. For comparison, in Table 9.5 the results for 5-NN as C and reliefF as R are given.
The classification accuracy using 5-NN on the full feature space is lower than the accuracy using SVM on both data sets. Also for reliefF and 5-NN our method achieves a sound reduction of features and improvement in classification accuracy. However, for 5-NN and reliefF on both data sets, the final set of features is larger and the classification accuracy is lower than with information gain and SVM. Also, not all the single steps lead to an improvement w.r.t. both aspects, the classification accuracy and the number
Table 9.4: Linear SVM and Information Gain.

DS       | full space      | step 1          | step 2         | step 3
ovarian  | 15,154 / 99.60% | 6,238 / 99.60%  | 39 / 100.0%    | 9 / 100%
prostate | 15,154 / 90.37% | 9,566 / 93.16%  | 1,331 / 94.41% | 164 / 97.83%

Table 9.5: 5-NN and ReliefF.

DS       | full space      | step 1          | step 2         | step 3
ovarian  | 15,154 / 93.28% | 15,037 / 93.20% | 66 / 99.20%    | 90 / 99.40%
prostate | 15,154 / 87.27% | 14,435 / 87.27% | 35 / 85.09%    | 361 / 92.50%

(each cell: number of features / classification accuracy)
of features. However, at least one of these two aspects is always improved in each single step. In principle, every combination of R, C and S can be applied. A standardized investigation of the huge number of possible combinations w.r.t. their performance on proteomic data is part of our ongoing work.
Figure 9.6 depicts some selected regions that have been identified by our feature selection framework on the ovarian data set.
A randomly selected spectrum of the
control group with highlighted regions is depicted. Below we show the highlighted regions in more detail comparing the healthy instance to a randomly selected instance with ovarian cancer. We can confirm that a majority of the relevant features can be found in the region of low m/z values; this has also been stated in [118]. The 9 features found using SVM and information gain are the m/z values 2.76, 25.49, 222.42, 244.66, 261.58, 417.73, 435.07, 464.36 and 4,002.46. Besides the m/z value 2.78, all these features have also been selected using reliefF and 5-NN. Among the 90 selected features with reliefF and 5-NN, 70% represent m/z values below 3,000.
Figure 9.6: Results on Ovarian Data. A randomly selected spectrum of the control group (intensity over m/z value) with highlighted regions, shown in detail for a control and a cancer instance: m/z 0.55-6.75, m/z 216-268 and m/z 397-475.
For the prostate data set, discriminating regions have also been found in the area of higher m/z values. Figure 9.7 shows some selected regions. Out of the 164 selected features using SVM and information gain, the most evident changes between healthy and diseased instances can be observed in the regions representing the m/z values of approximately 500, 5,000 and 9,000. For clarity reasons, one randomly selected spectrum of class L1 (healthy, PSA < 1 ng/mL) is compared to one randomly selected spectrum of class L4 (prostate cancer, PSA > 10 ng/mL) w.r.t. the three highlighted regions in Figure 9.7. Most of the features selected by reliefF and 5-NN are also in these three areas. Besides this, more features in the region of very low m/z values have been selected using reliefF and 5-NN.
Figure 9.7: Results on Prostate Data. A randomly selected spectrum (intensity over m/z value) with three highlighted regions compared for a control and a cancer instance: m/z 467-501, m/z 5,114-5,385 and m/z 9,159-9,520.

In Figure 9.8 we inspect the interesting regions from m/z 4,500 to 5,500 and m/z 9,000 to 9,500 in more detail w.r.t. all classes by depicting one randomly selected spectrum of each class. In Figure 9.8(a) there are two interesting regions which are
highlighted: One is the peak between approximately m/z 4,550 and 4,850. The amount of the corresponding peptides is considerably lower for the instances with prostate cancer (L3 and L4) than for the instances with benign conditions (L1 and L2). The other interesting region is a peak of smaller intensity at approximately m/z 5,250. Here the amount of the corresponding peptides is increased for the instances of the classes L4 (prostate cancer, highly elevated PSA value) and L2 (healthy, elevated PSA value) w.r.t. class L1. The same region is also displayed in more detail in Figure 9.7 for an instance of class L1 compared to an instance of class L4. It is especially interesting because in most of the discriminating regions, the abundance of peptides is reduced for the instances with cancer (also on the ovarian data set, cf. Figure 9.6).
In Figure 9.8(b) it can be seen that the abundance of the peptides corresponding to the m/z values around 9,200 is reduced for the instance of prostate cancer with a highly elevated PSA value (class L4 ) w.r.t. the class of the healthy control group without
elevation of the PSA value (class L1). For both classes representing instances with marginally elevated PSA value (classes L2 and L3) the abundance of the corresponding peptides is increased w.r.t. the instance of class L1. These interesting findings have to be systematically verified and analyzed for interpretation. But already now we can observe that no single features have the highest discriminating power. Instead, groups of features in different regions of the spectra establish the highest accuracy.

Figure 9.8: Selected Regions on Prostate Data. One randomly selected spectrum of each class: (a) m/z 4,500-5,500; (b) m/z 9,000-9,500.
9.5 Comparison with Existing Methods
One might wonder whether a multi-step framework for feature selection is really necessary on proteomic data sets. In principle, the feature selection techniques evaluating subsets of features can be directly applied on the whole feature space. But due to the high dimensionality of the data sets, there are performance limitations for established methods. CFS as implemented in WEKA is not applicable due to memory limitations. Consistency based subset evaluation is more efficient, because it does not consider correlations among the features. It can be applied with an efficient forward selection search strategy on the full dimensional data sets. This search algorithm implemented in WEKA starts with an empty feature subset and performs greedy hillclimbing with
a backtracking facility which can be controlled by a parameter.
This parameter
specifies the maximum number of consecutively added features if no improvement in accuracy can be achieved. As in the default settings of WEKA, we set this parameter to 5.
Consistency-based subset evaluation.
For the ovarian data set, consistency-based subset evaluation applied on the full dimensional space finds a three dimensional attribute subset containing the m/z values 0.005, 2.79 and 244.66, with an accuracy of 99.60% for both SVM and 5-NN. Although the accuracy is remarkably high, this subset of attributes does not provide much information on the data set. Two of the three m/z values are from the area of very low molecular weight. This region consists of fragments of peptides that can be split up rather arbitrarily during ionization. The attribute 244.66 is the top ranked attribute by information gain. Biomarker identification for diagnosis requires identifying attribute subsets with the highest predictive accuracy. But besides that, interesting regions of the spectra that show evident differences among the objects of the different classes should be identified. The result of consistency-based subset evaluation only represents two regions. On the prostate data set, consistency-based subset evaluation ends up with a subset of 7 attributes and a low classification accuracy of 77.01% using SVM and 86.33% using 5-NN.
Rankers.
For high dimensional data, ranker methods are often the only choice
because of the performance limitations. However, it is critical to identify a suitable number of attributes to use for classification. For both examined methods, information gain and reliefF, it turned out to be uncritical to leave out the attributes evaluated with zero. But still approximately 6,000 (for information gain on ovarian) to 14,000 (for reliefF on prostate) attributes remain, which is still too many for interpretation, and an accuracy far below the optimum (cf. Section 9.3). Information gain and reliefF do not consider
correlations and interactions among the features; thus most of the information at a certain cut-off point is redundant, whereas other regions of the spectrum providing additional information are not represented at all.
9.6 Conclusion
In this chapter, we presented a framework for feature selection on high-throughput mass spectrometry data. We evaluated our method on two SELDI-TOF-MS data sets on cancer identification. On both data sets we found groups of features providing a very high sensitivity and specificity for cancer identification. This result can be used as an input for further data mining steps. Currently we focus on mining association rules on the selected features (after discretizing the numerical features) and clustering to identify unknown sub-classes. As cancer is a complex systemic disease with different stages, different sub-classes can be expected.
In our ongoing research we are also adapting our approach specifically to multi-class problems. For example, it is especially interesting to determine well-discriminating feature sets between samples belonging to the class of early-stage cancer and samples corresponding to different benign conditions. In addition, we are evaluating and extending our method for the use on high-resolution SELDI-TOF and on MALDI-TOF-MS data sets. For these data sets, which can have a dimensionality of up to 700,000 features, extra steps are required. Moreover, we focus on the biological interpretation of the identified peptides, proteins and their variants. This contributes towards a better understanding of the disease at the bio-molecular level and towards the development of new diagnostic and therapeutic strategies.
Another interesting direction we are focussing on is the visualization of proteomic data and visually supported data mining. Since proteomic mass spectra contain different information in different regions, it makes sense to locally run data mining algorithms in selected regions. Feature selection methods can identify potentially interesting candidate regions which can be selected and further explored by the user.
Chapter 10
High-Performance Data Mining on Data Streams
This chapter addresses a problem within the area of efficiency of data mining. For many data mining algorithms, such as the k-nearest neighbor classifier, K-Means, DBSCAN and OPTICS, the k-nearest neighbor query is a fundamental building block. In recent years many index structures have been proposed to efficiently support k-nearest neighbor queries, e.g. the X-Tree [123] or the VA-File [124]. However, most of these structures are restricted to static databases and not applicable in a high-throughput streaming scenario, e.g. in health monitoring. This chapter is organized as follows: After an introduction in the following section, Section 10.2 discusses related work. The proposed method for k-NN monitoring consists of three ideas which are presented in Section 10.3: (1) selecting exactly those objects from the stream which are able to become answers to a k-NN query and storing them in a special data structure called the query skyline (cf. Section 10.3.1), (2) delaying the processing of those objects which do not immediately qualify as nearest neighbors of any of the queries (cf. Section 10.3.2), and (3) indexing the queries rather than the objects (cf. Section 10.3.3). Section 10.4 gives an extensive experimental evaluation and Section 10.5 concludes this chapter.
10.1 Introduction
In biomedicine, more and more data is generated in the form of data streams; consider for example health monitoring: modern smart sensors attached to the patient generate huge amounts of data. Most importantly, these data streams have to be monitored for serious events in real time. For further analysis it is also useful to store only "interesting" subsequences of the streaming objects to facilitate the application of data mining algorithms w.r.t. efficiency and effectiveness. Both cases are subject to k-nearest neighbor monitoring. An expert can specify the most interesting or most dangerous events, and the system then continuously monitors the patient data for the k most similar data objects from the stream. Especially since the prevalence of chronic diseases such as hypertension, diabetes and cardiovascular conditions is increasing in the western world, the use of direct permanent monitoring of patients' vital signs, as well as the direct synchronous transfer of this sensory data, will significantly increase [125, 126].
Besides health monitoring, there are many other application areas where k-NN monitoring on data streams is essential, e.g. network monitoring. The system administrators may specify various network packets which are suspicious for different reasons (intrusion, rule violation, misuse etc.). Then, the stream of network packets is constantly surveyed for packets which are similar to the suspicious objects. As another example, take advertisements of an arbitrary market segment. Each user may specify the properties (price, color, size, weight etc.) of a product he/she is interested in. The system permanently informs the user about those advertisements which best fit his/her requirements.
Consequently, query processing on data streams has become very popular in recent years, e.g. [127, 128, 129, 130], to mention a few. There is a vast number of solutions for various types of data, e.g. relational, semi-structured and time series data,
but most existing approaches are approximate because of the special conditions in a streaming environment. In contrast to conventional query processing, where queries are immediately answered from a database with a large but finite and previously known number of objects, in stream based query processing the queries are subscribed to a data stream. This means that, upon every arrival of a new object from the stream, the set of registered queries has to be checked. If the new object qualifies for one or more queries, it is reported as a result of the queries. There are two main challenges: The streaming objects arrive with a very high frequency and usually not all of them can be stored in the system.
We distinguish between two different types of similarity queries: range queries and nearest neighbor queries. For both types, the user selects an object, the query object which is the starting point of the search. For a range search, the user must additionally specify the query radius, i.e. a threshold for the maximally allowed distance from the query object. Since similarity measures are often not very intuitive, it may be difficult to specify such a query radius. Therefore, in practice, the k-nearest neighbor query (k-NN) is more important, because the user only has to specify k, the number of objects that he wants to retrieve, and the system automatically retrieves the k most similar results. In this chapter, we focus on k-NN queries but our technique can be extended to range queries in a straightforward way.
We define for each object a life span in which the object is valid. In many applications such as advertisements, the objects themselves may be associated with a time of expiry. If this is not the case, the time of validity can either be specified globally for the whole system or may be individually specified for each query. By specifying a global or query-specific lifespan, the user is also enabled to control how frequently he/she wants to
get a query result. If the life span is set to 1 hour, for example, the user will get a new result at least every hour (but possibly also additional results if good hits arrive before the current nearest neighbor expires).
In addition to global, query-specific, and object-specific expiry, we can also distinguish between time-based and number-based expiry. In time-based expiry, the life span of an object is defined in terms of e.g. seconds. In contrast, for number-based expiry, the user defines a maximum number n of objects which are simultaneously valid at any time. After the object number n + 1 has arrived from the stream, object number 1 automatically expires. Number-based expiry is also practically useful because n gives an intuitive quality measure for the results: each result is the best out of n objects. Number-based expiry is useful for global and query-specific expiry, but not for object-specific expiry.
We will also distinguish between monotonic and non-monotonic expiry. Monotonic expiry means that objects expire in the same order as they have arrived from the stream. Non-monotonic expiry is only possible in combination with object-specific expiry. In this context, the queries also have a life span which starts at the time stamp of subscription subscr(q) and ends at the time stamp of unsubscription unsubscr(q).
10.1.1 Problem Specification
We consider a data stream S as a sequence of objects (o1, o2, ...). Besides its coordinates, each object has a time stamp of its appearance on the stream, denoted by app(o), and a time stamp of expiry, denoted by exp(o). We call the time span between an object's appearance and its expiry the life span l(o) of the object o. In this time span the object is valid.
Definition 25 (Valid objects.) For a given time stamp t the set of valid objects V(t) is defined as V(t) = {o ∈ S | app(o) ≤ t ≤ exp(o)}.

Definition 26 (Valid objects w.r.t. q.) For a given time stamp t the set of objects which are valid for query q, denoted by Vq(t), corresponds to Vq(t) = {o ∈ V(t) | subscr(q) ≤ app(o) ≤ unsubscr(q)}.

Definition 27 (k-NN.) Let dist be a metric distance function. The set of k-nearest neighbors of a query q at a given time stamp t, NNt^k(q), is the minimal subset of the valid objects Vq(t) containing at least k elements such that ∀o ∈ NNt^k(q), ∀p ∈ Vq(t) \ NNt^k(q) : dist(o, q) < dist(p, q).

We denote by NNt(q) the nearest neighbor of a query q, i.e. NNt(q) = NNt^1(q). Wherever non-ambiguous we write NN. Conceptually, at every time stamp all subscribed queries are evaluated. If there are changes among the k-NN of a query, they are reported to the user who owns the subscription. The result set of the system at a time stamp contains all objects that need to be reported.

Definition 28 (Result set.) For a given time stamp t and a set of queries Q the result set R(t) consists of the following objects o: R(t) = {o | ∃q ∈ Q : o ∈ NNt^k(q) ∧ o ∉ NNt−1^k(q)}.
Some notations which are frequently used in this chapter are summarized in Table 10.1. A data stream S may consist of a finite or infinite number of objects. In the finite case, we denote the number of streaming objects by N .
Symbol       Definition
S            A data stream, S = (o1, o2, ...).
N            Number of objects in a finite stream.
o            A stream object.
app(o)       Time stamp of appearance of object o.
exp(o)       Time stamp of expiry of object o.
l(o)         Life span of o in time stamps.
q            A query.
NNt^k(q)     The set of k nearest neighbors of the query q at time stamp t.

Table 10.1: Symbols and Acronyms Used in Chapter 10.
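To make the notation concrete, the following minimal Python sketch (an illustration only, not the thesis implementation; all names are chosen for this example) models stream objects and queries and evaluates the predicates of Definitions 25-27 in a brute-force way:

from dataclasses import dataclass
from typing import List, Tuple
import math

@dataclass
class StreamObject:
    coords: Tuple[float, ...]   # feature vector of the object
    app: int                    # time stamp of appearance, app(o)
    exp: int                    # time stamp of expiry, exp(o)

@dataclass
class Query:
    coords: Tuple[float, ...]
    subscr: int                 # time stamp of subscription
    unsubscr: int               # time stamp of unsubscription

def dist(a, b):
    # Euclidean distance as an example of a metric distance function
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def valid(o: StreamObject, t: int) -> bool:
    # Definition 25: o is valid at time stamp t
    return o.app <= t <= o.exp

def valid_for_query(o: StreamObject, q: Query, t: int) -> bool:
    # Definition 26: o is valid at t and appeared during q's subscription
    return valid(o, t) and q.subscr <= o.app <= q.unsubscr

def knn(stream: List[StreamObject], q: Query, t: int, k: int) -> List[StreamObject]:
    # Definition 27 evaluated naively over all valid objects (baseline only)
    candidates = [o for o in stream if valid_for_query(o, q, t)]
    return sorted(candidates, key=lambda o: dist(o.coords, q.coords))[:k]

The contribution of this chapter is precisely to avoid such a scan over all valid objects.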
10.1.2 Contributions
The key contributions of our approach can be summarized as follows:
a) We propose a framework for exact k-nearest neighbor query processing on data streams.
b) We demonstrate how the basic idea of a skyline can be exploited for continuous k-nearest neighbor queries to guarantee exact answers at very low memory consumption.
c) We further reduce memory consumption by object delaying without giving up the exactness property.
d) An efficient index structure for continuous queries is provided. This index allows highly dynamic updates and is small enough to fit into main memory.
10.2 Survey

10.2.1 k-NN Queries on Data Streams
The k-NN query on static databases is a well-studied problem for which many index structures have been proposed, e.g. [124, 131, 123]. Most of them are not suitable in our context because of the high throughput. In the special case of streaming time series, research activities mainly focus on discovering patterns to predict the coordinates of new objects. This information is used to discard objects that are probably irrelevant to the query and to improve the response time of the system, e.g. [128]. Prediction-based methods are not applicable if the objects' attribute values are independent of their time of appearance and of the objects that arrived before. For this case approximate approaches have been proposed, such as [132]. There, the data space is partitioned into cells, and a B+-tree together with a Z-curve is used to index the space. For each grid cell some objects are retained to guarantee an absolute error bound for k-NN queries. However, in some applications exact answers are essential, e.g. in health monitoring. In any case, exact answers are most trustworthy for the user and give a sound basis for data analysis and interpretation. Moreover, [132] and [133] focus on answering single queries and are not designed for continuous k-nearest neighbor monitoring.
10.2.2 Skylines and Query Monitoring
In this chapter, we use a skyline based object buffer associated with each query, which we call the query skyline. This allows us to discard most of the objects from the stream immediately without giving up the exact answer guarantee (cf. Section 10.3.1). In general the skyline of a data set contains the data points which are maximal or minimal in two or more of the attributes. The problem was first proposed by [134]. It attracted much attention both on static data sets, e.g. [135, 136] and streaming data [137]. Papadias
et al. introduced the concept of the k-skyband [136], which is also related to our work. The k-skyband of a data set contains all objects which are dominated by at most k − 1 other objects. In [138] an interesting algorithm for monitoring top-k queries is proposed. The k-skyband has geometric implications on the data space, leading to areas that can be excluded from containing relevant objects. In this approach all objects are indexed in a grid and the order of processing of the cells is determined by the k-skyband. However, this approach relies on the assumption that all valid objects can be kept in memory.
10.2.3 Query Indexing
In a data stream context it is usually impossible to index all the data objects because of memory limitations. Typically the number of queries is much smaller than the number of valid objects, and naturally the system needs to have enough memory to store them. The idea of indexing the queries instead of the data objects has been successfully applied in the context of moving objects [139]. Organizing queries in a grid index outperforms object indexing for a large number of moving objects, because the objects are highly dynamic, causing a lot of index updates. In our context, indexing the queries instead of the objects means that we turn the problem of k-nearest neighbor search (over the objects) into a bichromatic reverse k-nearest neighbor search (over the queries); for more details see Section 10.3.3. The problem of RNN search has been extensively studied for static databases. A lot of sophisticated tree-based index structures have been proposed, e.g. [140, 141, 142]. For indexing the queries these structures are not flexible enough to support frequent updates and lead to an unnecessary memory overhead. We propose a grid-style index, which is described in detail in Section 10.3.3.
10.3 Proposed Method
This section elaborates the proposed method in detail. First, the concept of the query skyline is introduced, which allows exact k-NN processing with very limited memory usage (cf. Section 10.3.1). The idea of object buffering, which is presented subsequently, allows us to discard even more objects from the stream without losing the exactness guarantee (cf. Section 10.3.2). Furthermore, this section gives a detailed description of the index used to store the queries (cf. Section 10.3.3). Section 10.3.4 discusses how the result reporting is implemented. For simplicity, the general idea is introduced for 1-NN queries and extended to k-NN processing in Section 10.3.5.
10.3.1 Skyline Based Object Maintenance
Usually only a very small fraction of the objects arriving from a data stream is relevant to a query. For high-throughput streams, also only a small fraction of the objects can be stored in the system. These general conditions require a decision strategy to discard irrelevant objects. Most existing proposals focus on sophisticated filtering and load shedding strategies, carefully selecting objects which are considered unimportant and can be discarded. Various decision strategies with error bounds have been proposed, but exact answers cannot be guaranteed (cf. Section 10.2). We develop a criterion to decide upon arrival of a new object from the stream whether it may become the nearest neighbor in the future or definitely cannot. Among the stored potentially relevant objects, pruning is performed. The basic idea is that a new object arriving from the stream can often exclude many other objects which cannot become nearest neighbors until the new object expires. In this section, we restrict ourselves to the case where the queries are continuous 1-nearest neighbor queries; continuous k-nearest neighbor queries for k > 1 will be considered in Section 10.3.5.
To decide whether or not an object o may become the nearest neighbor of a query q, we consider the 2-dimensional space of the query distance and the time of expiry. Note that this space is always 2-d, regardless of the dimensionality of the feature space of the data and query objects. Our method is applicable to objects of arbitrary dimensionality and can even be extended to non-vector metric objects. An object o may become the nearest neighbor unless there exists another object p which at the same time (1) is closer to the query than o and (2) expires later than o. If both conditions hold, then o cannot be the nearest neighbor now (because at least one closer object p is known). Moreover, it cannot become the nearest neighbor later, because the object p lives longer. A set of objects which are maximal (minimal) with respect to two (or more) different conditions (such as, in this case, distance and expiry) is called a skyline [137]. In this context, we consider a two-dimensional distance-expiry skyline, which we call the query skyline. Formally:

Definition 29 (Query Skyline.) Given a time stamp t, a query q and two points o, p ∈ Vq(t), we say o dominates p if exp(o) > exp(p) and dist(o, q) < dist(p, q). The query skyline s(q, t) of query q and the set of valid points Vq(t) consists of all points in Vq(t) that are not dominated by other points in Vq(t): s(q, t) = {o ∈ Vq(t) | ∄o′ ∈ Vq(t) : exp(o′) > exp(o) ∧ dist(o′, q) < dist(o, q)}.

In the following, we formally prove that the objects of the skyline in the distance-expiry space (and only these objects) need to be stored as potential nearest neighbors. The current nearest neighbor is also in the skyline at all times.

Lemma 4 (Correctness.) At each time stamp t the query skyline s(q, t) of each query q contains NNt(q), the nearest neighbor of q at time stamp t.
Proof 4 By definition of the nearest neighbor, NNt(q) is not dominated by any other object o ∈ Vq(t), since ∄o ∈ Vq(t) : dist(o, q) < dist(NNt(q), q).

Lemma 5 (Completeness.) A valid object o which is not in the query skyline s(q, t) at time stamp t is the nearest neighbor at no time t′ ≥ t.

Proof 5 Since o is not in the skyline, it is dominated by another object p, i.e. exp(p) > exp(o) and dist(p, q) < dist(o, q). Therefore, o is not the element having minimal distance to q for any time stamp t′ ∈ [t, exp(p)]. Since exp(o) < exp(p), o is not valid after that time interval, and thus, o is additionally not the nearest neighbor for any t′ ∈ [exp(p), ∞).

We can represent the skylines in main memory as doubly-linked lists, ordered by time of expiry (and at the same time, automatically ordered by distance, because the monotonicity of the objects in the skyline can be easily proven). When a new object o arrives from the data stream, we first have to locate the potential position of this object in the skyline. We start our search at the upper end, because in most applications the objects have at least approximately uniform life spans. If the potential position is not at the end, we have to check if the new object is excluded by the object immediately following the potential position. It is also a consequence of the monotonicity that, if the immediate successor does not exclude the new object, then none of the (transitive) successors can exclude the object. If o is not excluded, it needs to be appended and we proceed as follows: We test whether o excludes its direct predecessor. Due to monotonicity, if o does not exclude the direct predecessor, it cannot exclude any of its transitive predecessors, and we are done. Otherwise, the dominated object is removed from the linked list, and we do the same test with the new predecessor (if any). The pseudocode of the algorithm is depicted in Figure 10.1.

algorithm skylineMaintenance(Object o, Query q)
    Skyline s := q.associatedSkyline;
    SkylineElement c := s.last;   // current element
    SkylineElement c';
    if nonMonotonicExpiry   // search for suitable position:
        while c ≠ null and exp(c) > exp(o)
            c' := c;
            c := c.pred;
        if c ≠ s.last and dist(c', q) < dist(o, q)
            return;   // o is dominated, leave s unchanged
    c := skylinePruning(o, q, c);
    if c = null
        s.insertAsFirst(o);
        reportNeighbor(o, user(q));
    else
        s.insertAfter(o, c);

function skylinePruning(Object o, Query q, SkylineElement c): SkylineElement
    SkylineElement c';
    while c ≠ null and dist(c, q) > dist(o, q)
        c' := c;
        c := c.pred;
        s.delete(c');
    return c;

Figure 10.1: Skyline Maintenance.

An example is visualized in Figure 10.2, where we have a set of 6 objects, A-F. Two objects, B and D, are dominated, i.e. the actual skyline is stored in the linked list A-C-E-F. As an example, object C is highlighted, along with the space which is excluded by C (containing the object B). When a new object G arrives, it is simply appended, since the expiry is assumed to be monotonic here. We have to consider the two predecessors, first F and then E, and discard them. Since C is not dominated by G, none of the further predecessors (A) must be checked.
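For monotonic expiry, the maintenance procedure of Figure 10.1 collapses to a few lines. The following Python sketch (an illustration with invented names, using an ordinary list instead of a doubly-linked list) keeps one skyline per query:

def maintain_skyline(skyline, new_obj, query, dist):
    # skyline: list of (coords, exp) ordered by increasing expiry and, by the
    # monotonicity argument above, also by increasing distance to the query.
    # With monotonic expiry the new object always has the latest expiry, so it
    # dominates every stored object that lies farther from the query.
    d_new = dist(new_obj[0], query)
    while skyline and dist(skyline[-1][0], query) > d_new:
        skyline.pop()                    # cf. skylinePruning in Figure 10.1
    became_nearest = not skyline         # everything was dominated -> new NN
    skyline.append(new_obj)
    return became_nearest

Calling maintain_skyline for every arriving object reproduces the behavior of the linked-list variant in the monotonic case; the non-monotonic case additionally needs the position search at the upper end of the list.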
An interesting question is how many objects we have to expect in the skyline. It has been shown in [137] that, under reasonable assumptions, the size of the skyline is in O(log2 n). In our first experiments we were able to confirm a typical size of 10 elements for n = 1,000,000 valid objects. Although this is very moderate, we will
present in Section 10.3.2 an idea to further reduce the typical skyline size to 2-3 objects.

[Figure 10.2: Skyline Example — six objects A-F in the distance vs. time-of-expiry plane, the region excluded by C, and the arrival of a new object G which prunes E and F.]
Mindful readers may have noticed that in case of monotonic time stamps, the newest object must always be appended to the skyline of every query. Therefore, one might wonder if it is really necessary to access all stored skylines for every new object (which would make query indexing, described in Section 10.3.3 obsolete). The answer is yes, but we will propose a simple trick in the next section (10.3.2) which greatly reduces this effort without affecting the correctness and completeness of the result of our algorithm.
10.3.2 Object Delaying
New objects cannot be dominated by any other object in the query skylines because they have the latest expiry, and are therefore always appended. These unnecessarily appended objects are soon discarded by newer objects which are closer to the query, so the size of the query skylines stays logarithmic despite this effect. But for efficient online stream processing it is essential to avoid any overhead in space and time. The skyline size can be further reduced to a few objects by a simple but effective trick: we delay every new object o from the stream for a certain time span before trying to insert it into the actual query skylines. By doing so we avoid inserting o into the skyline of a query q which is far away from o (where o qualifies only because it is new). It is likely that while o is delayed, objects arrive from the stream that are
closer to the query. Naturally, these new objects dominate o (they are closer to the query and also have later expiry), so that o can be safely discarded. Only very few objects need to be inserted immediately because they are really relevant, e.g. because they replace the current nearest neighbor or have the potential to become the nearest neighbor during the delay time. In the following we explain in detail how to determine the objects requiring immediate insertion. We introduce a simple ring buffer to store the new objects temporarily.

Definition 30 (Buffer.) The buffer B with size |B| stores the |B| newest objects from the stream, i.e. ∀o ∈ B, ∀p ∉ B : exp(o) > exp(p). We additionally denote by tb the earliest expiry among all elements in the buffer. The set of objects in the buffer at time stamp t is denoted by delayed(t).

In this buffer, objects are processed in first-in-first-out order if the objects have monotonic expiries (for non-monotonic expiry, we either use a priority queue or a FIFO buffer as well, but in the latter case we only insert objects whose life span is at least two times the delay time, cf. Lemma 6). In addition, we divide the skyline of a query into two parts. We call the first part of the skyline the exact skyline. Objects are inserted into this skyline when they come out of the buffer and still qualify for insertion at that time.

Definition 31 (Exact Skyline.) The exact skyline of a query q at time stamp t contains all objects o ∈ Vq(t) \ delayed(t) which are not dominated by other elements p ∈ Vq(t) \ delayed(t) or by elements of the approximate skyline at time stamps tb..t.

The other part of the skyline is called the approximate skyline. It consists of new objects requiring immediate insertion. When a new object o arrives, it is appended to all approximate skylines of queries for which o could become the nearest neighbor immediately or during the time it is delayed in the ring buffer. Additionally, it is appended to some of the approximate skylines for which o is a "good" object, according to an arbitrary strategy. For maintenance and good performance
of our index structure described in the next section, it is beneficial to have some "good" objects in the approximate skylines. In our system, the queries are stored in a grid-based index (cf. Section 10.3.3). Our strategy is, therefore, to insert the new object o into the approximate skylines of all those queries which are stored in the grid cells which encompass o. Formally, the approximate skyline consists of all non-dominated objects from a subset of the delayed objects:

Definition 32 (Approximate Skyline.) Let S be an arbitrary subset of delayed(t) which contains all objects with a distance below a cut radius rC to q. The approximate skyline of q is the set of elements of S which are not dominated by other elements of S.

Note that the objects in the exact skyline need to be physically stored there. In contrast, for the approximate skyline it is sufficient to store pointers to the objects only, because they are physically stored in the delay buffer. Whenever the delay buffer is full (usually upon every new insertion of an object), we remove the oldest object p from it (and also from all approximate skylines) and check to which of the exact skylines it must be appended. It must be appended to those exact skylines for which no object is known which dominates p. However, in the meantime (during the delay of p), we have seen many new objects with later expiry times. Our hope is that for each query at least one of these new objects is very close, so that p is indeed excluded now. Provided we have appended any good object to the approximate skyline, we can use the top element of the approximate skyline for discarding p: all elements in the approximate skyline have later expiries than p, so p is discarded if the top element of the approximate skyline is closer to the query than p. Therefore, the top element of the approximate skyline exactly defines the cut radius rC which is used for query indexing (cf. Section 10.3.3).

[Figure 10.3: Approximate and Exact Skyline — exact skyline (A, D) and approximate skyline (G, J) of a query, with the cut radius rC defined by the top element G, the delay time tdelay and the remaining validity tremain.]
In the last paragraph, we have seen that an object can be safely discarded if it is dominated by the top element of the approximate skyline. In the extreme case, the top element of the approximate skyline can dominate all elements of the exact skyline and will then become the nearest neighbor of the associated query. In this case, the question is justified whether or not this top element is really the nearest neighbor, and under what conditions this can be guaranteed. We will show in the following that this is guaranteed if (1) every object which is closer to the query than the current top element is inserted into the approximate query skyline, and (2) the remaining time of validity after being inserted into the exact skyline (tremain) is at least as long as the delay time tdelay.

Lemma 6 Assume that the above conditions hold, and that an object o is not appended to the approximate skyline of query q. Then, it is guaranteed that o cannot become the nearest neighbor of query q until o is removed from the delay buffer (and inserted into the exact skyline).
Note that, after the delay time, o is considered to be inserted into the exact query skyline. To perform this test, o is compared to the new top element of the approximate skyline. If o is dominated by the new top element, it is guaranteed that o will become never the nearest neighbor of q.
An example of a pair of exact and approximate query skylines is shown in Figure 10.3. The exact skyline consists of the objects A and D and the approximate skyline of the objects G and J. B and C are dominated by D. E would actually belong to the exact skyline but is dominated by the top element of the approximate skyline G which also defines the cut radius rC . H should actually belong to the approximate skyline. But according to the definition of the approximate skyline, it is possible to discard H as it is above the cut radius. According to Lemma 6, H cannot become the nearest neighbor before G expires which takes at least tremain time. Before the expiry of G the object H is taken out of the buffer and is tested for insertion into the exact skyline. The same holds for J, but it is inserted anyhow because it is in the same grid cell as the query.
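The interplay of delay buffer, cut radius and the two skylines can be sketched in Python as follows (a strong simplification assuming monotonic expiry and a single 1-NN query; skyline pruning and result reporting are omitted, and all names are invented for this illustration):

from collections import deque
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class DelayedNNMonitor:
    # Simplified object delaying for one 1-NN query under monotonic expiry.

    def __init__(self, query, buffer_size):
        self.query = query
        self.buffer = deque()      # FIFO delay buffer of (coords, exp) pairs
        self.buffer_size = buffer_size
        self.approx = []           # approximate skyline: pointers into the buffer
        self.exact = []            # exact skyline: objects released from the buffer

    def cut_radius(self):
        # r_C is the query distance of the closest approximate-skyline element
        if not self.approx:
            return float("inf")
        return min(dist(o[0], self.query) for o in self.approx)

    def arrive(self, coords, exp):
        obj = (coords, exp)
        if dist(coords, self.query) < self.cut_radius():
            # obj could become nearest neighbor while delayed -> immediate insertion
            self.approx.append(obj)
        if len(self.buffer) == self.buffer_size:
            self._release(self.buffer.popleft())
        self.buffer.append(obj)

    def _release(self, obj):
        # obj leaves the delay buffer; drop its pointer from the approximate skyline
        if obj in self.approx:
            self.approx.remove(obj)
        # test against the (new) cut radius before inserting into the exact skyline
        if dist(obj[0], self.query) < self.cut_radius():
            self.exact.append(obj)

In the full method, the exact skyline would additionally be pruned as in Figure 10.1 and the released object would also be checked against the exact skyline itself.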
10.3.3 Query Indexing
Due to object delaying (cf. Section 10.3.2), the number of objects from the stream that need to be stored in the query skylines is very small. In addition, the streaming objects are highly dynamic, so it would not be beneficial to use an index structure for the objects. In this section we propose to index the queries in a grid-style index which is highly flexible w.r.t. updates and can be held in main memory. According to Definition 27, at each time stamp t the answer to a continuous nearest neighbor query q is the object which has the smallest distance to q among all objects oj ∈ Vq(t) which are valid at time t. We are now interested in the question: given a new (and by definition valid) object o at time stamp t, to which queries q ∈ Q is o the nearest neighbor among all valid objects
oj ∈ Vq(t)? We can write this formally in the following way: {q ∈ Q : NNt(q) = o}. One might recognize that this resembles the definition of the reverse nearest neighbor problem. More exactly, the object o searches for its bichromatic reverse nearest neighbors among the queries, as defined, e.g., in [141].

[Figure 10.4: Query Indexing — a 2-d grid with quantile lines gi,j, rows and columns Gi,j, cells such as C2,2, and the extended lower and upper boundaries li,j, ui,j.]
For the efficient evaluation of a bichromatic reverse nearest neighbor query, usually the following consideration is applied: o is the nearest neighbor of those q ∈ Q to which it is closer than the current nearest neighbor of q. This can easily be decided if the radius r(q) of the current nearest neighbor is stored with every query q ∈ Q. For continuous NN queries which are supported by a delay buffer, we need a cut radius rC which is guaranteed to be an upper bound of the nearest neighbor radius for tremain time. As discussed in Section 10.3.2, rC corresponds to the distance between q and the top element of the approximate skyline. Instead of considering the queries as points of a metric or vector space, we consider them as spheres with radius rC(q).
[Figure 10.5: The Relationship Between Query Indexing and the Approximate and Exact Skylines — two queries q1 and q2 with cut radii rC1 and rC2, the objects falling into their grid cells, and the resulting exact and approximate skylines.]
For indexing, we merely need a structure which allows us to efficiently determine all spheres in which the point o is contained. We apply a very simple, grid-based method for vector objects which is described in the following. Note that this structure is used for storing the queries only. Therefore, it might fit into main memory even for a moderately high number of queries, regardless of the overall size of the data stream and regardless of the number of valid objects. Our data structure is visualized in Figure 10.4. Each of the d dimensions is independently partitioned into ρ quantiles according to a sample of the set of queries. If initially no sample of the queries is known, the quantiles can be uniformly distributed in the data space. These quantiles are depicted in Figure 10.4 as dashed lines and marked with gi,j, where i ∈ {0, ..., d−1} is the dimension of the grid line and j ∈ {0, ..., ρ} is a sequential number which is strictly monotonic (i.e. gi,j1 < gi,j2 ⇔ j1 < j2). The set of all quantiles constitutes a d-dimensional ρ × ... × ρ grid. The rows and columns are marked with Gi,j, where i is again the dimension and j ∈ {0, ..., ρ−1} is the sequence number. Initially, the lower limit of Gi,j is gi,j and the upper limit is gi,j+1. The cells of the grid are marked with Ca0,...,ad−1, where ai corresponds to some grid row or column Gi,ai. The queries are associated with the grid cells according
to the coordinates qi, i = 0, ..., d−1, of the center, i.e. q ∈ Ca0,...,ad−1 ⇔ qi ∈ [gi,ai, gi,ai+1]. Note that, until now, this structure resembles the well-known VA-file [124].
We now extend this structure to index spheres rather than points. In addition to the original quantile lines gi,j, we determine for each row and column Gi,j the lower and upper boundaries li,j, ui,j of all query spheres (the spheres centered at q having the current cut radius rC(q)) which are stored in any of the cells:

li,j = min { qi − rC(q) : q ∈ Q, qi ∈ [gi,j, gi,j+1] }
ui,j = max { qi + rC(q) : q ∈ Q, qi ∈ [gi,j, gi,j+1] }
Here qi denotes attribute number i of the query vector q. As depicted in Figure 10.4, these limits usually extend each row and column a little bit to the left and to the right. Therefore, these modified grid cells are now allowed to overlap. But if the radii of the stored spheres are small, it is also possible that some grid cells do not grow but shrink instead, and the neighboring grid cells become disjoint. During query processing, the queries may frequently change their associated cut radius rC(q): As a consequence of the arrival of a new object, the cut radius of a query may decrease. Likewise, the cut radius may increase if the current nearest neighbor of a query expires. We handle the increase of a radius by immediately checking the lower and upper boundaries of the cell to which q is associated, and updating them if necessary. From time to time, the boundaries are also checked to see whether they can be contracted. By this simple method it is guaranteed that each grid cell is always a conservative approximation of the contained query spheres. Thus, it is possible to safely decide from the cell boundaries that a cell cannot contain any query to which a given object could be the nearest neighbor. Figure 10.5 illustrates the relationship between the index structure and the skylines described in the
previous section. Figure 10.6 summarizes the required steps for processing a new object in pseudocode.

algorithm processNewObject(Object o)
    if buffer.isFull()
        Object o' := buffer.getAndRemoveMinimumExpiry();
        process(o', false);
    buffer.insert(o);
    process(o, true);

algorithm process(Object o, boolean isApproximate)
    for all g ∈ gridCells
        if o ∈ g.occupiedSpace
            for all q ∈ g.associatedQueries
                maintenance(o, q, isApproximate);

algorithm maintenance(Object o, Query q, boolean isApproximate)
    // monotonic expiry only due to space limitations
    Skyline e := q.associatedExactSkyline;
    Skyline a := q.associatedApproximateSkyline;
    SkylineElement c, c';
    if isApproximate
        c := skylinePruning(o, q, a.last);   // cf. Figure 10.1
    if not isApproximate or a.isEmpty()
        c := skylinePruning(o, q, e.last);
        if e.isEmpty()
            reportNeighbor(o, user(q));
    if isApproximate
        a.append(o);
    else
        e.append(o);

Figure 10.6: Processing of a New Object o.
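A condensed Python sketch of the query grid is given below. It is a simplification of the structure in Figure 10.4: it scans the occupied cells instead of navigating the grid, periodic boundary contraction is omitted, and all names are invented for this illustration.

import bisect
from collections import defaultdict

class QueryGridIndex:
    # Grid over the queries; cell boundaries are widened by the cut radii so that
    # a lookup conservatively returns every cell whose query spheres could contain
    # a given point.

    def __init__(self, quantiles):
        # quantiles[i] is the sorted list of grid lines g_{i,0..rho} in dimension i
        self.quantiles = quantiles
        self.cells = defaultdict(list)          # cell coordinates -> (query, r_C) list
        self.lower = [dict() for _ in quantiles]  # widened boundaries l_{i,j}
        self.upper = [dict() for _ in quantiles]  # widened boundaries u_{i,j}

    def _slot(self, i, x):
        # index j of the row/column G_{i,j} with x in [g_{i,j}, g_{i,j+1}]
        return max(0, min(bisect.bisect_right(self.quantiles[i], x) - 1,
                          len(self.quantiles[i]) - 2))

    def insert(self, query, r_c):
        cell = tuple(self._slot(i, qi) for i, qi in enumerate(query))
        self.cells[cell].append((query, r_c))
        for i, j in enumerate(cell):
            self.lower[i][j] = min(self.lower[i].get(j, float("inf")), query[i] - r_c)
            self.upper[i][j] = max(self.upper[i].get(j, float("-inf")), query[i] + r_c)

    def candidate_queries(self, point):
        # return queries of all cells whose widened boundaries contain the point
        result = []
        for cell, queries in self.cells.items():
            if all(self.lower[i][j] <= point[i] <= self.upper[i][j]
                   for i, j in enumerate(cell)):
                result.extend(q for q, _ in queries)
        return result

In the full method, each returned query is then handled by the skyline maintenance of Figure 10.6, while cells whose boundaries exclude the point are skipped without any distance computation.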
10.3.4 Result Reporting
Our system has to report every change of the nearest neighbor of any subscribed query q ∈ Q immediately to the user who owns the subscription. There are two possible causes of a nearest neighbor change: (1) the arrival of a new object from the data stream which immediately takes over the top position of the query skyline, and (2) the expiry of the current nearest neighbor of q. Case (1) is immediately detected during the processing of the new object. Case (2) is more complicated because, in principle, the query skyline of every query must be continuously monitored for the expiry of its first element.
To do this in a very efficient way, we store pointers to the queries in a priority queue (heap) which we call the action queue. We need only one global action queue in the system, with a size of |Q|, the number of subscribed queries. The elements of the action queue are ordered by the time stamp of the expiry of the top element of the query skyline. This allows efficient access (in constant time) to the element which will expire next in any query. This access can be done periodically upon every arrival of a new object. As an alternative, the next relevant expiry (the top element of the action queue) can be triggered by a timer.
Whenever an element is dequeued from the action queue, and whenever a newly arrived object takes over the top position of a query q, the action queue must be updated, which requires O(log2 |Q|) time where |Q| is the number of subscribed queries (note that we register every query exactly once in the action queue, with the expiry of the top element of its query skyline). This reorganization upon the arrival of new objects can even be avoided in the case of monotonic object expiry: In this case, we know that the expiry tnew of the new top element of q is later than that of the old one (told). We can leave the element in the action queue as it is, wait for told and then find that a top element has expired which has actually already been dominated. The new expiry tnew can then be registered without notifying the user about any change.
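The action queue can be sketched with Python's heapq module as follows (a lazy-deletion variant that skips stale entries instead of keeping exactly one entry per query; names are invented for this illustration):

import heapq

class ActionQueue:
    # One entry per registration, keyed by the expiry of the current top element of
    # a query's skyline; stale entries are lazily skipped, mimicking the trick of
    # waiting for t_old under monotonic expiry.

    def __init__(self):
        self.heap = []                 # entries: (expiry_of_top_element, query_id)
        self.current_top_expiry = {}   # query_id -> expiry currently registered

    def register(self, query_id, top_expiry):
        # called when a query's skyline gets a new top element
        self.current_top_expiry[query_id] = top_expiry
        heapq.heappush(self.heap, (top_expiry, query_id))

    def pop_expired(self, now):
        # return the queries whose registered top element expires at or before 'now'
        expired = []
        while self.heap and self.heap[0][0] <= now:
            expiry, query_id = heapq.heappop(self.heap)
            if self.current_top_expiry.get(query_id) != expiry:
                continue               # stale entry: the top element changed meanwhile
            expired.append(query_id)
        return expired

pop_expired(now) is called upon the arrival of a new object (or by a timer) and returns the queries whose current nearest neighbor has just expired, so that the next skyline element can be reported.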
10.3.5 k-Nearest Neighbor Queries
In this section, we briefly describe the generalization of our technique to k-nearest neighbor queries with k > 1. The three ideas of our technique, query indexing, skyline based object maintenance and delay buffering, can also be applied in the general case. In particular, the concept of query skylines needs some modifications. We start again with
a simple variant without the distinction between exact and approximate query skylines.

[Figure 10.7: Example of a k-Skyline (k = 2) — each object A-J is annotated with the number of objects dominating it; objects dominated by one object and by two or more objects are marked.]
An object can now be safely discarded from memory if it is guaranteed that it will at no future point in time become one of the k nearest neighbors of any query q ∈ Q. Translating this into the language of skylines, an object can be discarded if it is dominated by at least k other objects, where dominance is defined exactly as in Definition 29. We can then define the k-skyline as the set of objects which are dominated by fewer than k objects, which corresponds to the concept of the k-skyband [136]. In addition, we store for each object a counter with the information by how many other objects it is dominated.
Figure 10.7 shows an example of a k-skyline with k = 2. For each object, the figure additionally displays a counter with the information by how many other objects it is dominated. Highlighted is object C, which is dominated by the objects D and G and is therefore marked with the counter 2. Upon every insertion of a new object o, the counters of the objects dominated by o are increased by 1. Objects whose counters reach k can be deleted from the skyline (in the figure, C and F can be
removed; the remaining objects are members of the 2-skyline).
When we use a delay buffer, we again have to distinguish between approximate and exact skylines. Our notion of dominance can be naturally extended to this case (objects of the approximate skyline can also dominate objects of the exact skyline and thus increase their counters). To associate a query with a cut radius, we need to guarantee that no object beyond the cut radius can become a nearest neighbor during the delay time or during its remaining life time. This requires that we store the objects of the approximate k-skyline in a way which allows efficient access to the element with distance rank k. A linked list can be used for the management of the elements with rank 1 to k. A priority queue (heap) can be used for the remaining elements with rank > k.
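As a minimal illustration of the counter-based k-skyline (assuming monotonic expiry, so a newly arriving object can dominate stored objects but cannot itself be dominated; names are invented for this sketch):

def insert_into_k_skyline(skyline, new_obj, query_dist, k):
    # skyline: list of dicts {'dist': ..., 'exp': ..., 'count': ...} of objects
    # dominated by fewer than k others. new_obj arrives with the latest expiry,
    # so it can dominate stored objects but cannot be dominated by them.
    d_new = query_dist(new_obj)
    survivors = []
    for entry in skyline:
        # new_obj dominates entry if it is closer and (by monotonicity) expires later
        if d_new < entry['dist']:
            entry['count'] += 1
        # keep only objects dominated by fewer than k others (the k-skyband)
        if entry['count'] < k:
            survivors.append(entry)
    survivors.append({'dist': d_new, 'exp': new_obj['exp'], 'count': 0})
    survivors.sort(key=lambda e: e['dist'])   # rank by distance for cut-radius access
    return survivors

The element at distance rank k of this sorted list then plays the role of the cut radius in the k-NN case.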
10.4 Experiments
This section provides an extensive experimental evaluation on synthetic and real-world data sets. Real-world biomedical data streams consisting of complex streaming objects with more than one measurement from different sensors were not publicly available for evaluation; therefore, data sets on internet traffic have been used.
10.4.1 Data Sets and Methodology
For a systematic analysis of the properties of our method we generated synthetic data sets of various dimensionality. In particular, we used uniformly distributed and clustered gaussian data. The clustered gaussian data sets contain ten clusters (standard deviation 0.1) which are randomly distributed on the data space. We also show the performance on two netflow data sets. We used a two dimensional data set containing records extracted
from netflow IP data logs, cf. [132]. We additionally used a three dimensional netflow data set containing records (source port, destination port, packet size) extracted from the LBL-TCP 3 data set, available at the Internet Traffic Archive (http://ita.ee.lbl.gov/). The IDs of the source and destination hosts have been left out here. For all experiments we assume number-based expiry. In this context, a life span of e.g. l(o) = 20,000 for an object o means that o expires after 20,000 new objects from the stream have arrived. We used N = 1,000,000 2-d streaming objects with a life span of l(o) = 20,000, |Q| = 500 queries, a buffer size of |B| = 2000, and a resolution of ρ = 20 when not otherwise specified. We further assume that at each time stamp a new object from the stream arrives. We randomly separated objects from the stream and used them as continuous queries. Experiments were run on a PC with a 2.4 GHz Pentium processor and 512 MB main memory under Java.

[Figure 10.8: Experimental Evaluation — twelve panels (a)-(l) reporting distance calculations, memory usage and runtime in seconds for (a) the resolution ρ, (b) the number of queries |Q|, (c) dynamic queries over N, (d) dynamic queries over |Q|, (e) the index maintenance interval, (f) memory usage over the buffer size |B|, (g) runtime over |B|, (h) indexing and delaying over N, (i) k, (j) the dimensionality d, (k) the life span l(o), and (l) the netflow data over N.]
10.4.2 Query Indexing
Figure 10.8(a) shows the number of distance calculations for varying resolution of the query index on uniform and clustered gaussian data. For uniformly distributed data, ρ = 20 grid lines per dimension show the best trade-off between the number of cells and the number of queries per cell. For clustered gaussian data a higher resolution of ρ = 35 is better, since the queries are densely populated in some regions of the data space. In Figure 10.8(b) the scalability w.r.t. the number of subscribed queries |Q| is depicted for our algorithm and for a variant without indexing. Upon arrival of a new object o, this variant simply checks all queries to see whether o needs to be inserted. In this experiment we used uniformly distributed data. Since our index structure is simple and efficient to maintain, query indexing already pays off for |Q| = 100. For |Q| = 1600 the index leads to a speed-up factor of 80. Figures 10.8(c) and 10.8(d) show the performance in the case of new dynamic queries arriving during the runtime of the system on netflow data. In Figure 10.8(c), in addition to 400 static queries, 100 dynamic queries have been inserted. Despite the inserted queries, the runtime scales linearly with N. Figure 10.8(d) shows the performance w.r.t. the number of dynamic queries. New queries can be inserted very efficiently into our index. This leads to a sublinear increase in runtime. During the runtime the grid cells are expanded as explained in Section 10.3.3. To maintain a high performance of the grid index it is beneficial to adjust the upper and lower bounds in fixed time intervals. For uniform and clustered gaussian data we achieve good performance if we do this upon arrival of every 1,000 new objects, as Figure 10.8(e) shows.
10.4.3 Buffer Size
The size of the buffer can be chosen depending on the available memory. As shown in Section 10.3.2, our method works correctly with an arbitrarily small buffer size. According to Lemma 6, an upper limit for the buffer size |B| is half of the life span of the streaming objects. Figure 10.8(f) shows the overall memory usage for varying buffer sizes |B| = 20...4,000 for two dimensional uniformly distributed data. Here we increased the life span of the objects to l(o) = 500,000. The memory usage for the buffer, for the query skylines and the overall memory usage are depicted. Even for this relatively long life span of the objects, memory usage is very low. For all examined buffer sizes we needed to store in total less than 2% of the valid objects. As a minimum, 0.78% of the valid objects need to be stored for a buffer size of |B| = 500 objects. Figure 10.8(g) gives the runtimes in seconds for the previous experiment. Although the configuration minimizing memory usage allows a throughput of 14,085 objects per second, a much higher throughput of 93,110 objects per second can be achieved with a buffer size of |B| = 3000, which means storing 1.1% of the valid objects. Larger buffer sizes do not considerably further improve runtime.

10.4.4 Scalability
Figure 10.8(h) demonstrates the benefits of query indexing and object delaying. The runtime in seconds on uniformly distributed two dimensional data is reported for our technique and for versions without query indexing and without object delaying, for various numbers of objects N. The version without buffer performs worst; this demonstrates the effectiveness of object delaying. For N = 1,000,000 this version attains a throughput of 1,372 objects per second. The version without query indexing is slightly better and attains a throughput of 1,570 objects per second. The combination of query indexing and object delaying leads to impressive performance gains over both techniques when used alone. One
million stream objects are processed in 11.58 seconds, which corresponds to a throughput of 86,356 objects per second. The throughput is 55 times higher than for the version without query indexing and even 63 times higher than for the version without object delaying. Figure 10.8(i) shows the scalability of our method for k-nearest neighbor processing as described in Section 10.3.5 on two dimensional uniform and clustered gaussian data. Starting with |B| = 2000, we increased the buffer size linearly with k. It can be assumed that for a larger k also more memory is available, at least for storing the query results. With this configuration, for k = 10 we store 5% of the valid objects in the buffer. For both data sets, the runtime scales linearly with k. Figure 10.8(j) demonstrates the scalability with the dimensionality on clustered gaussian data of dimensionality 3, 4, and 5. For this experiment, the resolution was set to ρ = 10. All curves scale linearly with the number of objects N. There is only a small increase in runtime from 4 to 5 dimensional data. Figure 10.8(k) displays the runtime for various life spans l(o) = 5,000...1,000,000 on uniform and clustered gaussian data. The runtime remains almost constant with increasing l(o), even if all objects are valid at all time stamps. This demonstrates that our method is applicable to high-throughput data streams ranging from relatively short object life spans to scenarios without expiry. Figure 10.8(l) shows the runtime in seconds for varying N on netflow data. Here we set ρ = 15 for both the 2-d and the 3-d data set. For both data sets, the runtime scales linearly with the data size.
10.5 Conclusion
In this chapter, we have proposed an efficient technique for processing a large number of continuous k-nearest neighbor queries on data streams of feature vectors which are important e.g. in multimedia and marketing applications. In contrast to previous methods for query processing on streams, the result of our method is not approximative but exact and complete. Our method is based on three major ideas:
a) The query skyline enables us to carefully select those objects from the stream which have the potential of becoming answers to any of the subscribed queries.

b) A delay buffer is used to retard the processing of those objects which are not immediate answers to the queries and thus greatly improves the efficiency of processing.

c) A query index is used to organize the subscribed queries in a simple, grid-based way.

Our extensive experimental evaluation demonstrates that our technique scales well with the size of the stream, the number of queries, the number k of answers per subscribed query, the dimensionality, and the available buffer size, for artificial and real-world stream data. In particular, we demonstrate that the combination of our three major ideas leads to a very low memory consumption: Only between 0.78% and 1.1% of the valid stream objects must be retained in memory to guarantee completeness and correctness of the result of our technique. In addition, we demonstrate that the combination of our three major ideas improves the throughput by a large factor of about 60.
An interesting direction for further research would be to extend the index structure such that it can deal with general metric, non-vector data. Of course, it would also be interesting to adapt this approach to the context of a concrete biomedical application. To accomplish this, tailored similarity measures can be defined, e.g. by means of feature weighting.
Part III Conclusion
Chapter 11 Summary and Outlook
In Part II, novel algorithms for mining biomedical data sets have been proposed. The contributions of this thesis also include feature selection techniques which can be applied before data mining to improve the results. This chapter summarizes the major contributions of this thesis (cf. Section 11.1) and points out possible directions for future research (cf. Section 11.2).
11.1 Summary
The tremendous amounts of heterogeneous data produced every day in modern biology and medicine pose novel challenges for data mining. Part II of this thesis proposes some innovative approaches to data mining on biomedical data sets, focussing mainly on high dimensional vector data. All proposed techniques have been designed to be scalable to large and high dimensional data sets and to avoid sensitive input parameters as much as possible. The main advantages of the proposed methods can be summarized as follows:

• In the field of clustering, RIC (Robust Information-theoretic Clustering) has been introduced in Chapter 4. This method can detect clusters of various data distributions in noisy data sets, and also subspace clusters, without requiring sensitive input parameters. As an additional benefit, RIC provides the user with a detailed description of the cluster content in terms of data distribution and subspace orientation. RIC works on top of an initial data partitioning which can be derived by an arbitrary clustering algorithm, e.g. K-Means. The algorithm purifies the initial clusters from noise and merges clusters which fit well together. During the purifying and merging steps, the algorithm is guided by optimizing a cluster quality measure based on information theory. RIC has been evaluated on metabolic data and on phenotypic data on retina detachment and is especially suitable for detecting non-gaussian clusters.

• HISSCLU (Hierarchical Density Based Clustering), cf. Chapter 5, is a hierarchical density based algorithm for semi-supervised clustering. During the clustering process, this algorithm takes the class labels given for some of the objects and the data distribution of all objects into account. During the run of the algorithm, the hierarchical class and cluster structure is determined and visualized in the cluster diagram. In contrast to other semi-supervised clustering methods,
HISSCLU preserves spatial coherency when assigning objects to clusters; thus the information of multi-modal class structures is preserved. In addition, similar classes and unknown class hierarchies can be derived from the cluster diagram. HISSCLU has been evaluated on various data sets. Among others, metabolic data and data on predicting protein localization sites have been used.

• The local classification factor (LCF) has been introduced in Chapter 6. This technique has been designed to enhance instance-based classification with ideas from a density based clustering notion and from density based outlier detection. Especially on high dimensional biomedical data sets, e.g. metabolic data, classification methods constructing hyperplanes or models tend to decline in accuracy, because these data sets often exhibit sparse multi-modal data distributions. Often, the k-nearest neighbor classifier shows better performance, however accompanied by the tendency to understate sparser classes, which often represent the pathologic instances. LCF assigns an object to the class into whose local data distribution it fits best. For this purpose, a class-local outlier factor is defined, and this information is used for classification in addition to the local object densities of the classes. LCF demonstrates superior performance especially on metabolic data. The contributions of this thesis also include unsupervised and supervised feature selection techniques which belong to the data transformation step, following the outline of the data mining process given in [6].
The contributions of this thesis also include unsupervised and supervised feature selection techniques, which belong to the data transformation step following the outline of the data mining process given in [6].

• SURFING (Subspaces Relevant For Clustering), proposed in Chapter 8, is an unsupervised feature selection technique for clustering. In contrast to competing approaches, the method does not require the user to specify a global density threshold for clustering. As the average data density naturally decreases with increasing dimensionality, methods requiring a global density parameter can only detect interesting subspaces of a certain dimensionality that is fixed by the density parameter; their parametrization is therefore difficult. SURFING defines a quality criterion based on the observation that the k-nearest neighbor distances vary significantly in interesting subspaces. Based on this quality criterion, the interesting subspaces are determined by a bottom-up heuristic algorithm. The experimental evaluation demonstrates that this heuristic is efficient: on gene expression data sets, less than 1% of all 2^d possible subspaces are examined. SURFING outperforms competing methods in finding interesting subspaces containing clusters on synthetic and gene expression data. A small sketch of the quality criterion follows this list.

• A supervised feature selection method for identifying biomarker candidates and classifying high-dimensional proteomic data has been proposed in Chapter 9. Complex diseases like cancer are expressed by abnormal changes in serum protein and peptide abundance, which can be measured by high-throughput techniques such as SELDI-TOF MS. Proteomic spectra inherently contain more information for early-stage cancer detection than traditional single biomarkers; however, only a few among thousands of features, corresponding to a few regions of the spectra, discriminate well between the healthy and the diseased instances. The proposed three-step feature selection framework is a hybrid method combining elements of filter and wrapper approaches and also takes special characteristics of the data into account. In each step, the data dimensionality is reduced and the classification accuracy is increased. As a result, highly discriminating regions have been identified on data sets on ovarian and prostate cancer.
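To illustrate the type of quality criterion described above, the following Python sketch rates a subspace by how strongly the k-nearest-neighbor distances of the objects vary within it. It is a simplified stand-in for the exact SURFING criterion of Chapter 8; the normalization by the mean k-NN distance, the choice k = 3 and the toy data are assumptions made only for this example.

import numpy as np

def knn_distance(X, k=3):
    # k-NN distance of every object in X (rows are objects)
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=2))
    d.sort(axis=1)
    return d[:, k]          # index 0 is the distance of an object to itself

def subspace_quality(X, dims, k=3):
    # mean absolute deviation of the k-NN distances in the projected subspace,
    # normalized by their mean so scores are comparable across dimensionalities
    nn = knn_distance(X[:, dims], k)
    return np.abs(nn - nn.mean()).mean() / nn.mean()

rng = np.random.default_rng(1)
clusters = np.vstack([rng.normal(0.0, 0.1, size=(40, 2)),
                      rng.normal(4.0, 0.1, size=(40, 2))])
noise = rng.uniform(-1.0, 5.0, size=(20, 2))
X = np.hstack([np.vstack([clusters, noise]),
               rng.uniform(-1.0, 5.0, size=(100, 1))])

print(subspace_quality(X, [0, 1]))  # clusters plus noise: k-NN distances vary strongly
print(subspace_quality(X, [2]))     # purely uniform dimension: lower score

In a bottom-up search, such a score can be computed for low-dimensional candidate subspaces first, and only promising candidates are combined into higher-dimensional ones.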
This thesis includes two further approaches related to mining biomedical data which cannot be characterized as data mining algorithms or feature selection approaches. The first is an application of existing techniques to a very concrete biomedical problem, the second a technique for enabling data mining on high-throughput data streams:
• Chapter 7 deals with discovering genotype-phenotype correlations in the Marfan syndrome. The Marfan syndrome severely affects the connective tissue, and the most frequent cause of death is weakness of the aortic wall. The integration of genetic and phenotypic data and the application of suitable data mining methods (logistic regression analysis and hierarchical clustering) make it possible to determine a correlation between different genotypes and corresponding phenotypes. It is thus possible to predict the phenotype of a patient based on the examination of his or her genotype. Patients with a high risk of developing severe cardiovascular symptoms can be identified at an early stage of the disease. The results of this analysis may be used for improved disease management.
• In Chapter 10, an index structure supporting continuous k-nearest neighbor queries on data streams is proposed. The k-nearest neighbor query is a fundamental building block for many data mining algorithms, including clustering and classification. High-throughput data streams, as they occur in health monitoring and other applications, are very challenging because it is often impossible to store all streaming objects in main memory. Because of this, approximate solutions for k-nearest neighbor monitoring have been proposed. However, in particular for patient monitoring, exact answers are often essential. The proposed index relies on two ideas: storing only the required objects of the stream in a skyline-like data structure, and indexing the queries instead of the objects. The combination of these two ideas leads to substantial performance gains. The technique thus enables exact k-nearest neighbor monitoring on high-throughput data streams with very limited memory usage. A minimal sketch of the skyline-style pruning is given below.
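The following Python sketch illustrates the skyline-style pruning idea for a single continuous k-NN query over a count-based sliding window. It is only an illustration of the principle, not the index structure of Chapter 10: the window semantics, the dominance rule (an object can be discarded as soon as k newer objects are at least as close to the query) and all identifiers are assumptions made for this sketch.

from collections import deque

class ContinuousKnnQuery:
    # Exact k-NN monitoring for one query over a count-based sliding window.
    def __init__(self, query_point, k, window_size):
        self.q, self.k, self.window = query_point, k, window_size
        self.buffer = deque()           # (arrival_time, distance, object)

    def _dist(self, obj):
        return sum((a - b) ** 2 for a, b in zip(obj, self.q)) ** 0.5

    def insert(self, time, obj):
        # drop objects that have left the sliding window
        while self.buffer and self.buffer[0][0] <= time - self.window:
            self.buffer.popleft()
        self.buffer.append((time, self._dist(obj), obj))
        # skyline-style pruning, scanning from newest to oldest: an entry is
        # kept only while fewer than k kept newer entries are at least as close,
        # because otherwise it can never re-enter the k-NN result before expiring
        kept, newer_dists = [], []
        for t, d, o in reversed(self.buffer):
            if sum(1 for d2 in newer_dists if d2 <= d) < self.k:
                kept.append((t, d, o))
                newer_dists.append(d)
        self.buffer = deque(reversed(kept))

    def result(self):
        # current exact k nearest objects in the window with their distances
        return sorted(((d, o) for _, d, o in self.buffer))[:self.k]

# toy usage: monitor a 1-NN query while one-dimensional readings stream in
q = ContinuousKnnQuery(query_point=(5.0,), k=1, window_size=4)
for t, value in enumerate([9.0, 2.0, 5.5, 8.0, 4.9, 7.0]):
    q.insert(t, (value,))
    print(t, q.result())

In this picture, indexing the queries instead of the objects would mean maintaining such a buffer for every registered query and using an index over the queries to route each arriving object only to the queries it can affect.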
11.2 Outlook
This last section of the thesis points out some major directions for future research and ends with a more general outlook on important research directions for mining biomedical data. In particular, some of the opportunities for future research include:

• In the area of clustering, extending RIC to a stand-alone clustering algorithm which does not require an initial data partitioning. In addition, RIC can be extended to derive a hierarchy of clusters from a data set.

• Elements from information theory can be transferred into semi-supervised clustering. Besides that, HISSCLU could be extended to support subspace and correlation clusters. For biomedical data sets, supervision is not restricted to class labels or constraints: the features or the objects themselves may be more or less important for the application for biological reasons. In proteomic spectra, for example, high peaks in unexpected regions more likely represent measurement errors than true observations. Simple feature weighting is often no solution, because a feature may have a different meaning in different objects. Semi-supervised data mining can be extended to support such more complex types of supervision.

• Classification can be related more closely to (semi-supervised) clustering. Clustering is not only a useful preprocessing step for classification; an interesting direction to investigate would be a close interrelationship between clustering and classification. Hierarchical ensembles of classifiers could be arranged according to the hierarchical class and cluster structure.

Further, more specific directions for future research are indicated at the end of the corresponding chapters.
To support the development towards systems biology, there will be an increasing need for appropriate data mining techniques in the next years. In particular, the following challenges are emerging:

• High Performance Data Mining: developing scalable methods for massive amounts of data.

• Integration: integrating data and combining methods.

• Modeling: designing the knowledge discovery process.

Systems biology aims at a better understanding of highly complex biomedical processes by integrating information from genomics, proteomics and metabolomics. The goal is thus to study and simulate complex biological processes at different levels of the system (see also Chapter 2). Most of the methods discussed in this thesis deal with one specific application. Already in this context, conventional data mining algorithms soon reach their limits when confronted with modern high-throughput data sets, e.g. derived from mass spectrometry. To extract knowledge from such data, there will be an increasing need for data mining techniques which can deal with massive amounts of very high-dimensional data. Some of the methods proposed in this thesis, in particular the feature selection techniques, contribute to this task.
However, integrating data does not only multiply the amount of data to be analyzed; first of all, it scales up the complexity of the data. In most cases, the data to be integrated are heterogeneous in type, format and quality. Most data mining algorithms cannot deal with heterogeneous data types, e.g. categorical attributes and feature vectors. Often, data mining algorithms have to be applied separately to the individual sources and the results need to be combined. Even when the data formats are compatible, there are serious issues concerning data quality: in combination, noisy and irrelevant attributes from various data sources can lead to undesired side effects and mislead data mining algorithms. Because of this, the integration of biomedical data from various sources and the integration of data mining results, depending on the specific goal of the discovery process, is an important research area.
The modeling of the knowledge discovery process as a whole will increasingly be a subject of future research. For each of the single steps of the knowledge discovery process there is a huge variety of applicable methods, and novel methods are constantly being proposed. As a result, an overwhelming number of possible combinations of methods exists for implementing a single knowledge discovery process. Therefore, there is a need for techniques for modeling the knowledge discovery process and for supporting the decision of which steps to apply and which specific methods to select.
List of Tables

1.1  Table of Symbols and Acronyms . . . 30
4.1  Symbols and Acronyms Used in Chapter 4 . . . 65
4.2  Self-delimiting integer coding . . . 70
4.3  Clusters found by RIC . . . 86
5.1  Symbols and Acronyms Used in Chapter 5 . . . 101
6.1  Symbols and Acronyms Used in Chapter 6 . . . 132
6.2  Data Sets - Summary . . . 147
6.3  Results on Synthetic Data . . . 149
6.4  Results on Metabolic Data Set 1 . . . 150
6.5  Amino Acids (18 Metabolites) . . . 152
6.6  Acyl-Carnitines (51 Metabolites) . . . 152
6.7  Sugars (48 Metabolites) . . . 153
6.8  Whole Data Set (117 Metabolites) . . . 153
6.9  Confusion Matrix for Yeast With LCF . . . 155
6.10 Results on E.Coli Data . . . 156
6.11 Results on UCI data sets . . . 158
6.12 Results on Yeast Data . . . 160
7.1  Symbols and Acronyms Used in Chapter 7 . . . 165
7.2  Phenotypic Purity within a Class . . . 171
7.3  Correlation between Sub/Mis Mutations and phenotype classes I-IV . . . 173
7.4  MFS Phenotype Within Members of a Family . . . 173
8.1  Symbols and Acronyms Used in Chapter 8 . . . 181
8.2  Results on Synthetic Data Sets . . . 194
8.3  Comparative Tests on Synthetic Data . . . 197
8.4  Results on Gene Expression Data . . . 201
9.1  Symbols and Acronyms Used in Chapter 9 . . . 207
9.2  Comparison of Search Strategies . . . 219
9.3  Confusion Matrix for Prostate Data . . . 223
9.4  Linear SVM and Information Gain . . . 224
9.5  5-NN and ReliefF . . . 224
10.1 Symbols and Acronyms Used in Chapter 10 . . . 238
List of Figures

1.1  The KDD Process . . . 23
2.1  From DNA to Cell . . . 34
2.2  cDNA Microarray Technique . . . 35
2.3  Microarray Data . . . 36
2.4  Metabolic Disorders. Arrows indicate abnormally enhanced and diminished metabolite concentrations. Bold metabolites denote the established primary diagnostic markers. American College of Medical Genetics/American Society of Human Genetics Test and Technology Transfer Committee Working Group, 2000, [34] . . . 42
3.1  Example Data Sets . . . 47
3.2  Definitions of DBSCAN . . . 50
3.3  Example data set (left) and dendrogram (right) . . . 51
3.4  OPTICS Reachability Plot of Example (b) . . . 53
4.1  A fictitious dataset, (a) with a good clustering of one Gaussian cluster, one sub-space cluster, and noise; and (b) a bad clustering . . . 63
4.2  Example of VAC . . . 71
4.3  Conventional and robust estimation . . . 78
4.4  The decorrelation matrix . . . 80
4.5  2-d synthetic data . . . 83
4.6  3-d synthetic data . . . 84
4.7  Layers of the Retina . . . 87
4.8  (a): Visualizing the distribution of the 7-dimensional retinal image tiles. Each subfigure shows the distribution of two dimensions. The data set contains non-Gaussian clusters. (b): The 13 clusters found by RIC. Figures look best in color . . . 89
4.9  Example clusters on retinal image tiles found by RIC . . . 90
4.10 The white boxes in the two retinal images indicate example tiles in selected clusters. Left (a): Tiles at position A of cluster of Figure 4.9(a). Right (b): Tiles at position B of cluster of Figure 4.9(d). Best viewed in color . . . 91
4.11 RIC algorithm . . . 94
5.1  Example Data Set . . . 99
5.2  CCL algorithm - example . . . 102
5.3  Ordering different paths . . . 108
5.4  Different values for ξ . . . 112
5.5  Excluded Areas . . . 113
5.6  Weighting - example . . . 114
5.7  Cluster Diagram . . . 115
5.8  Visualizing semi-supervised class- and cluster-hierarchies: Results on Ecoli data set . . . 118
5.9  Visualizing semi-supervised class- and cluster-hierarchies: Results on Yeast data set . . . 119
5.10 OPTICS Reachability Plot . . . 120
5.11 HISSCLU Cluster Diagram . . . 121
5.12 Cluster assignment . . . 122
5.13 Comparison on liver data set . . . 124
5.14 Cluster Expansion Algorithm . . . 127
6.1  Influence of Parameter k . . . 143
6.2  Influence of Parameter l on Accuracy, Precision and Recall . . . 145
6.3  Parameterization of LCF on Metabolic Data Set 1. Classification Accuracy Depending on Different k- and l-Values is Displayed . . . 151
6.4  Local Classification Factor - Illustration . . . 159
7.1  Hierarchical Clustering . . . 176
7.2  ROC curves of the LRA model built on three selected clinical criteria (ocular major, skeletal major and skin minor criteria). The dependent variable is the presence/absence of a Sub/Mis mutation in the fibrillin-1 gene . . . 177
8.1  Subspace Clustering - Motivation . . . 180
8.2  Usefulness of the k-nn distance to rate the interestingness of subspaces . . . 185
8.3  Benefit of Inserted Points . . . 189
8.4  Algorithm SURFING . . . 200
9.1  Overview . . . 206
9.2  Distribution of the Ranked Features . . . 213
9.3  Search Space for Step 2 . . . 214
9.4  Modified Binary Search . . . 218
9.5  Selecting Region Representatives . . . 221
9.6  Results on Ovarian Data . . . 225
9.7  Results on Prostate Data . . . 226
9.8  Selected Regions on Prostate Data . . . 227
10.1 Skyline Maintenance . . . 244
10.2 Skyline Example . . . 245
10.3 Approximate and Exact Skyline . . . 247
10.4 Query Indexing . . . 250
10.5 The Relationship Between Query Indexing and Approximate and Exact Skylines . . . 251
10.6 Processing of a New Object o . . . 253
10.7 Example of k-Skyline . . . 255
10.8 Experimental Evaluation . . . 257
Bibliography
[1] E. Petricoin, D. Ornstein, C. Paweletz, A. Ardekani, P. Hackett, B. Hitt, A. Velassco, C. Trucco, L. Wiegand, K. Wood, C. Simone, P. Levine, W. Linehan, M. EmmertBuck, S. Steinberg, E. Kohn, and L. Liotta, “”Serum proteomic patterns for detection of prostate cancer.”,” J Natl Cancer Inst. 94(20), pp. 1576–1578, 2002. [2] E. Petricoin, A. Ardekani, B. Hitt, P. Levine, V. Fusaro, S. Steinberg, G. Mills, C. Simone, D. Fishman, E. Kohn, and L. Liotta, “”Use of proteomic patterns in serum to identify ovarian cancer”,” Lancet 359(9306), pp. 572–577, 2002. [3] M. Ester and J. Sander, Knowledge Discovery in Databases - Techniken und Anwendungen, Springer, 2000. [4] I. H. Witten and W. Frank, Data Mining - Practical machine learning tools and techniques with java implementations, Morgan Kaufmann, 2000. [5] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Academic Press, 2001. [6] U. M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, “Knowledge discovery and data mining: Towards a unifying framework.,” in Proc. of KDD Conference, pp. 82–88, 1996.
[7] R. T. Ng and J. Han, “Efficient and effective clustering methods for spatial data mining,” in Proc. of VLDB Conference., pp. 144–155, 1994. [8] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise.,” in Proc. of KDD Conference, pp. 226–231, 1996. [9] S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos, “”LOCI: Fast Outlier Detection Using the Local Correlation Integral”,” in Proc. of ICDE Conference, pp. 315–, 2003. [10] F. Murtagh, “A survey of recent advances in hierarchical clustering algorithms,” The Computer Journal 26(4), pp. 354–359, 1983. [11] P. Kr¨oger, Coping with new challenges for density-based clustering. PhD thesis, LMU Munich, 2004. [12] J.-Y. Pan, Advanced Tools for Multimedia Data Mining. PhD thesis, Carnegie Mellon Univerity, Pittsburgh, PA, 2006. [13] I. Jolliffe, Principal Component Analysis, Springer Verlag, 1986. [14] A. Hyv¨arinen and E. Oja, “Independent component analysis: algorithms and applications.,” Neural Networks 13(4-5), pp. 411–430, 2000. [15] S. Basu, A. Banerjee, and R. J. Mooney, “”Semi-Supervised Clustering by Seeding”,” in Proc. of ICML Conference, pp. 19–26, 2002. [16] S. Basu, M. Bilenko, and R. J. Mooney, “A probabilistic framework for semisupervised clustering.,” in Proc. of KDD Conference, pp. 59–68, 2004.
[17] C. F. Eick, N. M. Zeidat, and Z. Zhao, “Supervised clustering - algorithms and benefits.,” in Proc. of ICTAI Conference, pp. 774–776, 2004. [18] C. B¨ohm, C. Faloutsos, J.-Y. Pan, and C. Plant, “Robust information-theoretic clustering.,” in Proc. of KDD Conference, pp. 65–75, 2006. [19] C. Plant, C. B¨ohm, C. Baumgartner, and B. Tilg, “”Enhancing instance-based classification with local density.”,” in Bioinformatics, pp. 22:981–988, 2006. [20] C. Plant, C. B¨ohm, B. Tilg, and C. Baumgartner, “Enhancing instance-based classification on high throughput ms/ms data: Metabolic syndrome as an example.,” in ¨ Proc. of Gemeinsame Jahrestagung der Deutschen, Osterreichischen und Schweizerischen Gesellschaft f¨ ur Biomedizinische Technik (BMT 2006)., 2006. [21] C. Baumgartner, D. Baumgartner, M. Eberle, C. Plant, G. M´aty´as, and B. Steinmann, “Genotype-phenotype correlation in patients with fibrillin-1 gene mutations,” in Proc. 3rd Int. Conf. on Biomedical Engineering (BioMED 2005), pp. 561–566, 2005. [22] C. Baumgartner, C. Plant, K. Kailing, H.-P. Kriegel, and P. Kr¨oger, “Subspace selection for clustering high-dimensional data.,” in Proc. of ICDM Conference, pp. 11– 18, 2004. [23] C. Plant, M. Osl, B. Tilg, and C. Baumgartner, “Feature selection on high throughput seldi-tof mass-spectronometry data for identifying biomarker candidates in ovarian and prostate cancer.,” in to appear in IEEE ICDM 2006 Workshop on Data Mining in Bioinformatics (DMB 2006), 2006. [24] C. B¨ohm, B. Ooi, C. Plant, and Y. Yan, “Efficiently processing continuous k-nn queries on data streams.,” in to appear in Proc. of ICDE Conference, 2007.
[25] M. W. Kirschner, “The meaning of systems biology.,” Cell 121, pp. 503–504, May 2005. [26] H. Lodish, A. Berk, P. Matsudaira, C. Kaiser, M. Krieger, M. Scott, S. Zipursky, and J. Darnell, Molecular Cell Biology, WH Freeman, 2004. [27] M. Schena, D. Shalon, R. W. Davis, and P. O. Brown, “Quantitative monitoring of gene expression patterns with a complementary dna microarray.,” Science 270, pp. 467–470, Oct 1995. [28] V. C. Wasinger, S. J. Cordwell, A. Cerpa-Poljak, J. X. Yan, A. A. Gooley, M. R. Wilkins, M. W. Duncan, R. Harris, K. L. Williams, and I. Humphery-Smith, “Progress with gene-product mapping of the mollicutes: Mycoplasma genitalium.,” Electrophoresis 16, pp. 1090–1094, Jul 1995. [29] T. N. Reiner Westermeier, Proteomics in Practice. A Laboratory Manual of Proteome Analysis, Wiley-VCH, 2002. [30] P. Horton and K. Nakai, “”Better Prediction of Protein Cellular Localization Sites with the k Nearest Neighbors Classifier”,” in Proc. 5th International Conference on Intelligent Systems for Molecular Biology, Halkidiki, Greece, pp. 147–152, AAAI Press, 1997. [31] K. Nakai and M. Kanehisa, “”A Knowledge Base for Predicting Protein Localization Sites in Eukaryotic Cells”,” Genomics 14(897), pp. 897–911, 1991. [32] G. Harrigan and R. Goodacre, Metabolic Profiling: Its Role in Biomarker Discovery and Gene Function Analysis, Kluwer Academic Publishers, 2003. [33] B. Liebl, U. Nennstiel-Ratzel, R. von Kries, R. Fingerhut, B. Olgem¨oller, A. Zapf, and A. A. Roscher, “Very high compliance in an expanded ms-ms-based newborn
screening program despite written parental consent.,” Prev Med 34, pp. 127–131, Feb 2002. [34] C. Baumgartner and D. Baumgartner, “Biomarker discovery, disease classification, and similarity query processing on high-throughput ms/ms data of inborn errors of metabolism,” J Biomol Screen (11), pp. 90–99, 2006. [35] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988. [36] J. MacQueen, “”Some Methods for Classification and Analysis of Multivariate Observations”,” in 5th Berkeley Symp. Math. Statist. Prob., 1967. [37] L. Kaufmann and P. J. Rousseeuw, Finding groups in data: an introduction to cluster analysis, John Wiley & Sons, 1990. [38] A. P. Dempster, N. M. Laird, and D. B. Rubin, “”Maximum Likelihood from Incomplete Data via the EM Algorithm”,” in J Roy Stat Soc, (39), pp. 1–31, 1977. [39] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander, “OPTICS: Ordering points to identify the clustering structure,” in Proc. of SIGMOD Conference, pp. 49– 60, 1999. [40] D. W. Hosmer and S. Lemeshow, Applied Logistic Regression, Wiley, 2000. [41] S. le Cessie and J. C. van Houwelingen, “”Ridge Estimators in Logistic Regression”,” in Applied Statistics, pp. 191–201, 1992. [42] C. Cortes and V. Vapnic, “”Support vector networks”,” in Mach learn, pp. 273–297, 1995. [43] V. Vapnic, Statistical Learnig Theory, Wiley, 1998.
[44] J. Platt, N. Cristianini, and J. Shawe-Tayler, “”Large Margin DAGs for Multiclass Classification”,” in Advances in Neural Information Processing Systems 12, pp. 547– 553, MIT Press, 2000. [45] T. M. Mitchell, Machine Learning, McGraw-Hill Boston, MA., 1997. [46] R. J. Quinlan, “Induction of decision trees,” in Machine learning, pp. 81–106, 1986. [47] R. J. Quinlan, C4.5: Program for Machine Learning, Morgan Kaufmann, San Mateo, CA, 1993. [48] P. Langley, W. Iba, and K. Thompson, “”An analysis of Bayesian classifiers”,” in Proc. of the tenth National Conference on Artificial Intelligence, pp. 223–228, 1992. [49] G. H. John and P. Langley, “”Estimating Continuous Distributions in Bayesian Classifiers”,” in Proc. of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 338–345, 1995. [50] C. M. Bishop, Neural networks for pattern recognition, Oxford University Press, 1995. [51] S. Salzberg, “On comparing classifiers: a critique of current research and methods,” in Data mining and knowledge discovery, pp. 1–12, 1999. [52] P. Gr¨ unwald, “A tutorial introduction to the minimum description length principle,” Advances in Minimum Description Length: Theory and Applications , 2005. [53] J. A. Hartigan, Clustering Algorithms, John Wiley & Sons, 1975. [54] D. Pelleg and A. Moore, “X-means: Extending K-means with efficient estimation of the number of clusters,” in Proc. of ICML Conference, pp. 727–734, 2000.
[55] G. Hamerly and C. Elkan, “Learning the k in k-means,” in Proc. of NIPS Conference, 2003. [56] S. Guha, R. Rastogi, and K. Shim, “CURE: An efficient clustering algorithm for large databases,” in Proc. of SIGMOD Conference, pp. 73–84, 1998. [57] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, “Automatic subspace clustering of high dimensional data for data mining applications,” in Proc. of SIGMOD Conference, pp. 94–105, 1998. [58] T. Zhang, R. Ramakrishnan, and M. Livny, “BIRCH: An efficient data clustering method for very large databases,” in Proc. of SIGMOD Conference, pp. 103–114, 1996. [59] C. B¨ohm, K. Kailing, P. Kr¨oger, and A. Zimek, “Computing clusters of correlation connected objects,” in Proc. of SIGMOD Conference, pp. 455–466, 2004. [60] A. K. Tung, X. Xu, and B. C. Ooi, “CURLER: Finding and visualizing nonlinear correlation clusters,” in Proc. of SIGMOD Conference, pp. 467–478, 2005. [61] C. C. Aggarwal and P. S. Yu, “Finding generalized projected clusters in high dimensional spaces,” in Proc. of SIGMOD Conference, pp. 70–81, 2000. [62] B. Zhang, M. Hsu, and U. Dayal, “K-harmonic means - A spatial clustering algorithm with boosting,” pp. 31–45, 2000. [63] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysis and an algorithm.,” in Proc. of NIPS Conference, pp. 849–856, 2001. [64] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” CoRR physics/0004057, 2000.
[65] N. Slonim and N. Tishby, “Document clustering using word clusters via the information bottleneck method.,” in Proc. of SIGIR Conference, pp. 208–215, 2000. [66] D. Chakrabarti, S. Papadimitriou, D. S. Modha, and C. Faloutsos, “Fully automatic cross-associations,” in Proc. of KDD Conference, pp. 79–88, 2004. [67] A. Bhattacharya, V. Ljosa, J.-Y. Pan, M. R. Verardo, H. Yang, C. Faloutsos, and A. K. Singh, “Vivo: Visual vocabulary construction for mining biomedical images.,” in Proc. of ICDM Conference, pp. 50–57, 2005. [68] K. Wagstaff, C. Cardie, S. Rogers, and S. Schr¨odl, “Constrained k-means clustering with background knowledge.,” in ICML Conference, pp. 577–584, 2001. [69] D. Klein, S. D. Kamvar, and C. D. Manning, “From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering.,” in Proc. of ICML Conference, pp. 307–314, 2002. [70] M. Bilenko, S. Basu, and R. J. Mooney, “Integrating constraints and metric learning in semi-supervised clustering.,” in Proc. of ICML Conference, 2004. [71] B.-R. Dai, C.-R. Lin, and M.-S. Chen, “On the techniques for data clustering with numerical constraints.,” in Proc. of SDM Conference, 2003. [72] Z. Lu and T. Leen, “”Semi-supervised Learning with Penalized Probabilistic Clustering”,” in Proc. of NIPS Conference, pp. 849–856, 2005. [73] H. Liu and S. teng Huang, “Evolutionary semi-supervised fuzzy clustering.,” Pattern Recognition Letters 24(16), pp. 3105–3113, 2003. [74] X. Zhu and Z. Ghahramani, “Learning from labeled and unlabeled data with label propagation. technical report.,” 2002.
[75] J. Sander, X. Qin, Z. Lu, N. Niu, and A. Kovarsky, “Automatic extraction of clusters from hierarchical clustering representations.,” in Proc. of PAKDD Conference, pp. 75–87, 2003. [76] B. E. Dom, “” An Information-Theoretic External Cluster-Validity Measure”,” in Research Report RJ 10219, IBM, 2001. [77] C. L. Blake and C. J. Merz, ”UCI Repository of machine learning databases, http://www.ics.uci.edu/∼mlearn/MLRepository.html”, University of California, Irvine, Dept. of Information and Computer Sciences, 1998. [78] J. Gracia-Bustos, J. Heitman, and M. Hall, “”Nuclear protein localization”,” in Biochimica et Biophysica Acta, pp. 1071:83–101, 1991. [79] C. Baumgartner, C. B¨ohm, D. Baumgartner, G. Marini, K. Weinberger, B. Olgem¨oller, B. Liebl, and A. Roscher, “”Supervised machine learning techniques for the classification of metabolic disorders in newborns”,” in Bioinformatics, pp. 20(17):2985–2996, 2004. [80] L. Dong, E. Frank, and S. Kramer, “Ensembles of balanced nested dichotomies for multi-class problems.,” in Proc. of PKDD Conference, pp. 84–95, 2005. [81] A. Zimek, “Hierarchical classification using ensembles of nested dichotomies.,” Master’s thesis, TU/LMU Munich, 2005. [82] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “Lof: Identifying densitybased local outliers.,” in SIGMOD Conference, pp. 93–104, 2000. [83] A. Hinneburg and D. A. Keim, “”An Efficient Approach to Clustering in Large Multimedia Databases with Noise”,” in Proc. of KDD Conference, pp. 224–228, 1998.
[84] E. Knorr and R. Ng, “”Algorithms for Mining Distance Based Outliers in Large Datasets ”,” in Proc. of VLDB Conference, pp. 392–403, 1998. [85] E. Knorr and R. Ng, “”Finding Intentional Knowledge of Distance-Based Outliers”,” in Proc. of VLDB Conference, pp. 211–222, 1999. [86] A. Demiriz, K. P. Bennett, and M. J. Embrechts, “”Semi-supervised Clustering using Genetic Algorithms”,” in Proc. of the Artificial Neural Networks in Engineering ANNIE´99, 1999. [87] H. Wang, D. Bell, and I. D¨ untsch, “”A density based approach to classification”,” in Proc. of the of the 2003 ACM symposium on Applied computing, pp. 470–474, 2003. [88] D. R. Wilson and T. R. Martinez, “”Reduction Techniques for Exemplar-Based Learning Algorithms”,” in Mach Learn 38-3, pp. 257–286, 2000. [89] G. Guo, H. Wang, D. Bell, Y. Bi, and K. Greer, “”kNN Model-based Approach in Classification”,” in Proc. the 2nd International Conference on Ontologies, Database and Applications of Semantics, ODBASE’03, pp. 986–996, 2003. [90] ”WEKA machine learning package, http://www.cs.waikato.ac.nz/ ml/weka”, Universitity of Waikato. [91] C. Baumgartner, C. B¨ohm, and D. Baumgartner, “Modelling of classification rules on metabolic patterns including machine learning and expert knowledge.,” J Biomed Inform 38(2), pp. 89–98, 2005. [92] P. Horton and K. Nakai, “A probabilistic classification system for predicting the cellular localization sites of proteins.,” in Proc. of ISMB Conference, pp. 109–115, 1996.
[93] L. Zhao and R. Padmanabhan, “”Nuclear transport of adenovirus dna polymerase is facilitated by interaction with preterminal protein”,” in Cell, pp. 55:1005–1015, 1988. [94] R. Pyeritz and H. C. Dietz, “Marfan syndrome and other micofibrillar disorders,” Connective Tissue and its Heritable Disorders: Molecular, Genetic and Medical Aspects 2, pp. 585–626, 2002. [95] A. D. Paepe, R. B. Devereux, H. C. Dietz, R. C. Hennekam, and R. E. Pyeritz, “Revised diagnostic criteria for the marfan syndrome.,” Am J Med Genet 62, pp. 417– 426, Apr 1996. [96] J. L. Murdoch, B. A. Walker, B. L. Halpern, J. W. Kuzma, and V. A. McKusick, “Life expectancy and causes of death in the marfan syndrome.,” N Engl J Med 286, pp. 804–808, Apr 1972. [97] D. Baumgartner, C. Baumgartner, G. M´aty´as, B. Steinmann, J. L¨offler-Ragg, E. Schermer, U. Schweigmann, I. Baldissera, B. Frischhut, J. Hess, and I. Hammerer, “Diagnostic power of aortic elastic properties in young patients with marfan syndrome.,” J Thorac Cardiovasc Surg 129, pp. 730–739, Apr 2005. [98] A. Biggin, K. Holman, M. Brett, B. Bennetts, and L. Ad`es, “Detection of thirty novel fbn1 mutations in patients with marfan syndrome or a related fibrillinopathy.,” Hum Mutat 23, p. 99, Jan 2004. [99] L. Pereira, M. D’Alessio, F. Ramirez, J. R. Lynch, B. Sykes, T. Pangilinan, and J. Bonadio, “Genomic organization of the sequence coding for fibrillin, the defective gene product in marfan syndrome.,” Hum Mol Genet 2, p. 1762, Oct 1993. [100] G. M´aty´as, A. D. Paepe, D. Halliday, C. Boileau, G. Pals, and B. Steinmann, “Evaluation and application of denaturing hplc for mutation detection in marfan
syndrome: Identification of 20 novel mutations and two novel polymorphisms in the fbn1 gene.,” Hum Mutat 19, pp. 443–456, Apr 2002. [101] S. Katzke, P. Booms, F. Tiecke, M. Palz, A. Pletschacher, S. T¨ urkmen, L. M. Neumann, R. Pregla, C. Leitner, C. Schramm, P. Lorenz, C. Hagemeier, J. Fuchs, F. Skovby, T. Rosenberg, and P. N. Robinson, “Tgge screening of the entire fbn1 coding sequence in 126 individuals with marfan syndrome and related fibrillinopathies.,” Hum Mutat 20, pp. 197–208, Sep 2002. [102] B. S. Everitt, Cluster analysis, Edward Arnold, 1993. [103] K. Kailing, H.-P. Kriegel, P. Kr¨oger, and S. Wanka, “Ranking interesting subspaces for clustering high dimensional data.,” in Proc. of PKDD Conference, pp. 241–252, 2003. [104] C. H. Cheng, A. W.-C. Fu, and Y. Zhang, “Entropy-based subspace clustering for mining numerical data.,” in Proc. of KDD Conference, pp. 84–93, 1999. [105] S. Goil, H. Nagesh, and A. Choudhary, “”MAFIA: Efficiant and Scalable Subspace Clustering for Very Large Data Sets”,” Tech. Report No. CPDC-TR-9906-010, Center for Parallel and Distributed Computing, Dept. of Electrical and Computer Engineering, Northwestern University, 1999. [106] C. M. Procopiuc, M. Jones, P. K. Agarwal, and T. M. Murali, “A monte carlo algorithm for fast projective clustering.,” in Proc. of SIGMOD Conference, pp. 418– 427, 2002. [107] C. C. Aggarwal, C. M. Procopiuc, J. L. Wolf, P. S. Yu, and J. S. Park, “Fast algorithms for projected clustering.,” in Proc. of SIGMOD Conference, pp. 61–72, 1999.
[108] M. Dash, K. Choi, P. Scheuermann, and H. Liu, “Feature selection for clustering a filter solution.,” in Proc. of ICDM Conference, pp. 115–122, 2002. [109] P. Spellman, G. Sherlock, M. Zhang, V. Iyer, K. Anders, M. Eisen, P. Brown, D. Botstein, and B. Futcher, “”Comprehensive Identification of Cell Cycle-Regulated Genes of the Yeast Saccharomyces Cerevisiae by Microarray Hybridization.”,” Mol Biol Cell 9, pp. 3273–3297, 1998. [110] L. A. Liotta, M. Ferrari, and E. Petricoin, “Clinical proteomics: written in blood.,” Nature 425, p. 905, Oct 2003. [111] J. S. Yu, S. Ongarello, R. Fiedler, X. W. Chen, G. Toffolo, C. Cobelli, and Z. Trajanoski, “Ovarian cancer identification based on dimensionality reduction for highthroughput mass spectrometry data.,” Bioinformatics 21(10), pp. 2200–2209, 2005. [112] M. A. Hall and G. Holmes, “Benchmarking attribute selection techniques for discrete class data mining,” IEEE T Knowl and Data En 15(6), pp. 1437–1447, 2003. [113] U. M. Fayyad and K. B. Irani, “Multi-interval discretization of continuous-valued attributes for classification learning.,” in Proc. of IJCAI Conference, pp. 1022–1029, 1993. [114] K. Kira and L. A. Rendell, “A practical approach to feature selection.,” in Proc. of the Ninth International Workshop on Machine Learning (ML 1992), pp. 249–256, 1992. [115] I. Kononenko, “Estimating attributes: Analysis and extensions of relief.,” in Proc. of ECML Conference, pp. 171–182, 1994. [116] M. A. Hall, “Correlation-based feature selection for discrete and numeric class machine learning.,” in Proc. of ICML Conference, pp. 359–366, 2000.
[117] H. Liu and R. Setiono, “A probabilistic approach to feature selection - a filter solution.,” in Proc. of ICML Conference, pp. 319–327, 1996. [118] G. Alexe, S. Alexe, L. A. Liotta, E. Petricoin, M. Reiss, and P. L. Hammer, “Ovarian cancer detection by logical analysis of proteomic data.,” Proteomics 4(3), pp. 766– 783, 2004. [119] T. Conrads, V. Fusaro, S. Ross, D. Johann, V. Rajapakse, B. Hitt, S. Steinberg, E. Kohn, D. Fishman, G. Whitely, J. Barrett, L. Liotta, E. r. Petricoin, and T. Veenstra, “”High-resolution serum proteomic features for ovarian cancer detection”,” Endocr-Relat Cancer 11(2), pp. 163–178, 2004. [120] O. C. Martin and S. W. Otto, “Combining simulated annealing with local search heuristics,” Tech. Rep. CSE-94-016, 1993. [121] M. Betke and N. Makris, “Fast object recognition in noisy images using simulated annealing,” Tech. Rep. AIM-1510, 1994. [122] M. D. Natale and J. A. Stankovic, “Applicability of simulated annealing methods to real-time scheduling and jitter control,” in IEEE Real-Time Systems Symposium, pp. 190–199, 1995. [123] S. Berchtold, D. A. Keim, and H.-P. Kriegel, “The x-tree : An index structure for high-dimensional data,” in Proc. of VLDB Conference, pp. 28–39, 1996. [124] R. Weber, H.-J. Schek, and S. Blott, “A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces.,” in Proc. of VLDB Conference, pp. 194–205, 1998.
[125] R. Haux, E. Ammenwerth, W. Herzog, and P. Knaup, “Health care in the information society. a prognosis for the year 2013,” Int J of Med Inform (66), pp. 3–21, 2002. [126] H. Schuldt and G. Brettlecker, “Sensor data stream processing in health monitoring.,” in Mobilit¨at und Informationssysteme, 2003. [127] A. Das, J. Gehrke, and M. Riedewald, “Approximate join processing over data streams.,” in Proc. of SIGMOD Conference, pp. 40–51, 2003. [128] L. Gao and X. S. Wang, “Continually evaluating similarity-based pattern queries on a streaming time series.,” in Proc. of SIGMOD Conference, pp. 370–381, 2002. ¨ [129] L. Golab and M. T. Ozsu, “Processing sliding window multi-joins in continuous queries over data streams.,” in Proc. of VLDB Conference, pp. 500–511, 2003. [130] H. Su, E. A. Rundensteiner, and M. Mani, “Semantic query optimization for xquery over xml streams.,” in Proc. of VLDB Conference, pp. 277–288, 2005. [131] C. Yu, B. C. Ooi, K.-L. Tan, and H. V. Jagadish, “Indexing the distance: An efficient method to knn processing.,” in Proc. of VLDB Conference, pp. 421–430, 2001. [132] N. Koudas, B. C. Ooi, K.-L. Tan, and R. Z. 0003, “Approximate nn queries on streams with guaranteed error/performance bounds.,” in Proc. of VLDB Conference, pp. 804–815, 2004. [133] R. Cheng, B. Kao, S. Prabhakar, A. Kwan, and Y.-C. Tu, “Adaptive stream filters for entity-based queries with non-value tolerance.,” in Proc. of VLDB Conference, pp. 37–48, 2005.
[134] S. Borzsonyi, D. Kossmann, and K. Stocker, “The skyline operator,” in Proc. of ICDE Conference, pp. 421–430, 2001. [135] K.-L. Tan, P.-K. Eng, and B. C. Ooi, “Efficient progressive skyline computation,” in Proc. of VLDB Conference, pp. 301–310. [136] D. Papadias, Y. Tao, G. Fu, and B. Seeger, “Progressive skyline computation in database systems,” ACM Trans. Database Syst. 30(1), pp. 41–82, 2005. [137] X. Lin, Y. Yuan, W. Wang, and H. Lu, “Stabbing the sky: Efficient skyline computation over sliding windows.,” in Proc. of ICDE Conference, pp. 502–513, 2005. [138] K. Mouratidis, S. Bakiras, and D. Papadias, “Continuous monitoring of top-k queries over sliding windows,” in Proc. of SIGMOD Conference, pp. 635–646, 2006. [139] X. Yu, K. Q. Pu, and N. Koudas, “Monitoring k-nearest neighbor queries over moving objects.,” in Proc. of ICDE Conference, pp. 631–642, 2005. [140] F. Korn and S. Muthukrishnan, “Influence sets based on reverse nearest neighbor queries,” in Proc. of SIGMOD Conference, pp. 201–212, ACM Press, 2000. [141] I. Stanoi, M. Riedewald, D. Agrawal, and A. E. Abbadi, “Discovery of influence sets in frequently updated databases.,” in Proc. of VLDB Conference, pp. 99–108, 2001. [142] Y. Tao, D. Papadias, and X. Lian, “Reverse knn search in arbitrary dimensionality.,” in Proc. of VLDB Conference, pp. 744–755, 2004.