Weighted Cluster Editing for Clustering Biological Data

DIPLOMARBEIT (diploma thesis) submitted for the academic degree of Diplom-Bioinformatiker

FRIEDRICH-SCHILLER-UNIVERSITÄT JENA, Fakultät für Mathematik und Informatik (Faculty of Mathematics and Computer Science). Submitted by Sebastian Briesemeister, born 09.05.1983 in Ilmenau. Supervisor: Prof. Dr. Sebastian Böcker. Jena, 17.12.2007

Zusammenfassung (Summary)

Classifying objects into disjoint classes is a central task in the sciences, and in biology in particular. Over the last years and decades, many so-called clustering and classification algorithms have been developed for this purpose. Graph-based clustering algorithms such as CLICK [22] have been particularly successful. In this thesis we study the Weighted Cluster Editing problem and its applicability to the clustering of biological data. As an introduction, we give an insight into the biological background of this research area, an overview of existing classification methods, and an introduction to fixed-parameter tractable algorithms.

The Weighted Cluster Editing problem is defined as follows: Given an undirected graph G = (V, E) and a cost function s : \binom{V}{2} → R that defines, for every pair of vertices, the modification cost of the edge between them, we seek a set of modification operations (insertions and deletions are allowed) such that the total cost of the operations is minimal and the resulting graph is a disjoint union of cliques. A clique is a completely connected subgraph. A graph consisting only of disjoint cliques contains no conflict triple: a conflict triple consists of three vertices such that two of the three possible edges between them exist but the third does not.

Weighted Cluster Editing, like its unweighted counterpart Cluster Editing, is NP-hard and cannot be solved in polynomial time unless P = NP. One approach to nevertheless obtain a practically usable algorithm is fixed-parameter tractability. In contrast to classical complexity theory, the running time is measured as a function of both the input size and a problem parameter k: we seek an algorithm that solves the problem in time f(k) · n^{O(1)}, where n is the input size and f a function depending only on k. If such an algorithm exists, the problem is fixed-parameter tractable with respect to k, and the exponential part of the running time is confined to the parameter. The parameter is chosen problem-specifically and, if it is sufficiently small, assumed to be constant. Weighted Cluster Editing is fixed-parameter tractable with respect to the modification cost k, and the fastest algorithm for the problem so far has running time O(2.42^k + |V|^3 log |V|).

In the first part of this thesis we develop new reduction rules for the Weighted Cluster Editing problem, concentrating mainly on parameter-independent rules. Two simple rules identify edges that are very expensive to modify compared to their neighboring edges and set them to "permanent" or "forbidden", respectively; permanent and forbidden edges may no longer be modified. Furthermore, we introduce a reduction rule that identifies almost complete cliques, which are only sparsely connected to the rest of the graph, as cliques. Another important reduction rule generalizes the critical clique rule for unweighted Cluster Editing; although it has a comparatively high running time, it reduces the graph extremely effectively. Based on three reduction rules, we show that reducing an integer-weighted graph with these rules leads to a problem kernel of size k² + 3k + 2, where a problem kernel is a reduced instance that is solvable if and only if the original instance is solvable. This upper bound on the kernel size is half the best previously known bound.

In the second part we introduce two new branching strategies for Weighted Cluster Editing. Previous branching strategies branch into at least three cases for every conflict triple. The new strategies are both simple and efficient: for one edge of a conflict triple they branch into only two cases, namely deleting the edge or setting it to permanent. We show that, by choosing this edge appropriately, the search tree has size at most O(1.83^k). We thereby present the fastest algorithm known so far for the Integer Weighted Cluster Editing problem and the Cluster Editing problem.

We implemented the reduction rules and branching strategies of this thesis in two C++-based programs: a graph reduction tool that efficiently reduces an input graph, and a second program that solves a Weighted Cluster Editing instance using the presented reduction rules and the new branching strategy. We call this program PEACE (Parameterized and Exact Algorithms for Cluster Editing) and use it to classify objects into disjoint classes. In the subsequent evaluation we show that PEACE is substantially faster than a previous version of the program; in detail, we achieve running time improvements around a factor of 100000. We show that a large part of this speedup is due to the effective shrinking of the graph by the parameter-independent reduction rules.

Finally, we apply our clustering tool to biological data, in particular to microarray data of different cancer types. These data are transformed into distance matrices using several distance measures; in addition, an expectation-maximization method is used to preprocess the data. We process the resulting weighted graphs with our clustering tool PEACE, compare the classification quality of our method with other commonly used clustering methods, and find that our tool is often better than other classification methods. In two cases, however, we observe the limits of our program.

This thesis is structured as follows: In the first chapter we give an introduction to our work and to the biological background, describe several clustering methods, introduce parameterized complexity, and define the Weighted Cluster Editing problem. In Chapter 2 we survey previous research on the Cluster Editing problem and the Weighted Cluster Editing problem. In the third chapter we present reduction rules for Weighted Cluster Editing and show that they lead to a problem kernel. In Chapter 4 we study two new branching strategies for search tree algorithms for Weighted Cluster Editing. In the fifth chapter we present our implementation of these results in two programs. We evaluate the implementations in the sixth chapter on artificially generated graphs and further investigate the efficiency of the reduction rules on protein similarity graphs. In Chapter 7 we apply our clustering tool to microarray datasets and show that it often yields better classification results than other methods. Finally, we summarize this thesis in Chapter 8 and discuss open problems.


Abstract

In recent years, more and more microarray data have become available, opening up new opportunities in biochemistry and molecular biology. A key step in analysing microarray data is to cluster genes or tissues using their expression profiles. Many available clustering methods are based on transforming expression data into a similarity matrix between objects, or equivalently a weighted undirected graph labeled with similarity values. The graph-theoretical problem Weighted Cluster Editing takes an undirected graph plus weights for all vertex pairs as input, and aims to find the set of edge changes with least total weight such that the resulting graph is a union of disjoint cliques. The Weighted Cluster Editing problem is NP-complete. However, several efficient algorithms have recently been proposed in the literature. In this thesis we present a new and surprisingly simple branching strategy for Integer Weighted Cluster Editing which leads to an overall search tree size of O(1.83^k), and we thereby present the fastest algorithm known for Cluster Editing and Integer Weighted Cluster Editing, with worst-case running time O(1.83^k + |V|^3). This algorithm also works for the general Weighted Cluster Editing problem, but the provable worst-case running times are slightly worse. We present a set of parameter-independent as well as parameter-dependent reduction rules for Weighted Cluster Editing that significantly cut down the size of an instance. We implemented our new techniques in a clustering tool named PEACE (Parameterized and Exact Algorithms for Cluster Editing), which was engineered and optimized so that it is capable of clustering instances with up to 200 objects, demonstrating its practical relevance. We tested PEACE on gene expression data for tissue classification and compared our method against other well-known clustering algorithms. PEACE often outperforms other clustering methods with respect to cluster quality while having reasonable running time.


Contents

1 Introduction
  1.1 Biological Background
  1.2 Clustering Methods
  1.3 Parameterized Complexity
  1.4 Branching Vectors
  1.5 Preliminaries and Notations

2 Previous Results
  2.1 The Unweighted Cluster Editing Problem
  2.2 The Weighted Cluster Editing Problem

3 Reduction Rules and Problem Kernel
  3.1 Parameter-Independent Data Reduction
  3.2 Parameter-Dependent Data Reduction Rules
  3.3 Problem Kernel

4 Algorithms for Weighted Cluster Editing
  4.1 The basic algorithm
  4.2 Two Simple Branching Strategies
  4.3 An O(1.83^k) Branching Strategy
    4.3.1 Solving Paths and Cycles
    4.3.2 Refined Analysis of Edge Branching
  4.4 Real-valued Edge Weights
  4.5 Branch and Bound

5 Implementation
  5.1 User Manual
    5.1.1 Graph File Format
    5.1.2 Result File Format
    5.1.3 Cluster Reduction Tool
    5.1.4 PEACE
  5.2 Class Hierarchy
  5.3 Implementation of Reduction Rules
  5.4 Finding an Edge With Minimum Branching Number

6 Evaluation
  6.1 Comparison Against a Previous Version
  6.2 The Power of Reduction Rules

7 Clustering of Biological Data
  7.1 Used Datasets
  7.2 Preprocessing and Distance Measures
  7.3 Results

8 Conclusion
  8.1 Summary
  8.2 Open problems
  8.3 Acknowledgment

Bibliography

Appendix
A PEACE Options
B Class Hierarchy
C Lung Cancer Similarity Graph

Chapter 1
Introduction

Clustering objects according to their pairwise similarities is a central topic in computational biology. In particular, clustering genes and tissues based on their expression profiles has gained considerable relevance in recent years. To provide biologists with meaningful clustering results, many promising methods have been developed and investigated over the last decades. Approaches range from simple ideas, such as clustering an object with its nearest neighbor, to more involved ones based on self-organizing maps. However, so far there is no universally convincing method for clustering biological data: a clustering method may show good results on one kind of data but reveal weaknesses on another. Hartuv and Shamir [22] presented a promising graph-based clustering approach, CLICK. Given a similarity graph, the authors compute highly connected subgraphs, which correspond to the actual clusters. Similar to Hartuv and Shamir, we examine a graph-based clustering approach.

In this work we focus on the Weighted Cluster Editing problem and its applicability to clustering biological data. The input of this problem is again a similarity graph, where vertices represent objects and weighted vertex pairs reflect the extent of similarity between objects. Corresponding to the similarity between objects, a cost function is defined which returns the modification cost (insertion or deletion) of an edge. The Weighted Cluster Editing problem is to find a set of edge modifications with minimum total cost such that the resulting graph is a union of disjoint cliques. We assume that a correct clustering corresponds to a cluster graph, that is, a graph consisting of disjoint cliques only, and that the input similarity graph deviates from such a cluster graph only because of "noise". Since Weighted Cluster Editing follows a maximum parsimony approach, we reconstruct the cluster graph with minimum modification costs.

The Weighted Cluster Editing problem is NP-hard [42]. Consequently, unless P = NP, there exists no polynomial-time algorithm to solve this problem. Rahmann et al. [39] presented a fixed-parameter algorithm which solves the Weighted Cluster Editing problem in reasonable running time to some

extent. In an earlier publication, the author and others improved on this result by presenting an algorithm based on a refined branching strategy with running time O(2.42^k + |V|^3 log |V|) [5]. In this thesis we aim to improve this fixed-parameter algorithm. We investigate a new branching strategy to improve the theoretical as well as the practical running time. Furthermore, we explore new ideas for data reduction, since we believe that more powerful reduction rules may result in a significantly faster algorithm. In addition, we search for different lower bounds to improve the existing branch-and-bound strategy. We implement all new approaches in a novel clustering tool, PEACE (Parameterized and Exact Algorithms for Cluster Editing), which solves Weighted Cluster Editing instances more efficiently. In an evaluation part we examine our implementation, compare it to a previous version of our algorithm [5], and take a closer look at the power of the reduction rules.

In the second part of this work we study the applicability of Weighted Cluster Editing to clustering biological data. To do so, we apply PEACE to different microarray datasets of Monti et al. [33] to classify tissue data. Since we know the true solution for these datasets, we can use the adjusted Rand index [24] to measure the quality of our solutions. We aim to show that clustering using the concept of Weighted Cluster Editing, based on an exact graph-based algorithm, can outperform current gold-standard clustering approaches such as CAST [3] and CLICK [44].

This thesis is structured as follows: In the remainder of this chapter we provide background information about molecular biology and genetics. Furthermore, we give an overview of clustering methods, a quick introduction to parameterized complexity, and explain the use of branching vectors. In Chapter 2 we summarize previous results for Cluster Editing and Weighted Cluster Editing. We recapitulate existing reduction rules and give a set of new reduction rules for the Weighted Cluster Editing problem in Chapter 3; furthermore, we prove a smaller problem kernel, which is a reduced problem instance that is solvable if and only if the original instance is solvable. In Chapter 4 we introduce new algorithms for Weighted Cluster Editing by providing two new branching strategies and some lower bounds. We present two software tools in Chapter 5: the clustering tool PEACE, which solves the Weighted Cluster Editing problem using the techniques from this thesis, and a cluster reduction tool, which reduces a Weighted Cluster Editing problem instance using our parameter-independent reduction rules. In addition, we give a quick insight into the actual implementation and provide a short user manual. We evaluate our clustering tool PEACE in Chapter 6: we compare running times with a previous version of our tool published in [5], and we analyse the power of the parameter-independent reduction rules.

In Chapter 7 we give an example of how Weighted Cluster Editing can be used to cluster biological data. We show how we preprocess our data and present promising results. In Chapter 8 we conclude this work by discussing our results and summarizing open problems.

1.1 Biological Background

DNA (deoxyribonucleic acid) is the hereditary material of most species. It is found in nearly every cell of an organism. DNA is a long polymer of simple repeating units called nucleotides, and this sequence of nucleotides encodes information. Usually DNA does not exist as a single molecule; instead, two DNA molecules form a double helical structure. The DNA of a cell is usually partitioned into one or more very long molecules that are organized as chromosomes. In organelles and prokaryotes, DNA can also exist in a circular structure called a plasmid. An organism's basic complement of DNA is called its genome. Genes are functional regions of DNA that contain genetic information and usually encode functional molecules, for example proteins. Note that there exist other functional regions, such as repetitive elements. However, many scientists believe that genes are the most important factor for the development and functionality of an organism. During the process of transcription, a gene is transcribed into its corresponding precursor RNA sequence. Protein-encoding DNA is transcribed into a precursor RNA sequence that finally results in the mRNA (messenger RNA) of the gene. mRNA can be translated into a protein; this process is called translation. During the developmental process of an organism, the point in time when a gene is expressed is very important. In an adult organism, expressed genes control parts of the body's metabolism. Therefore gene expression is a highly regulated process, both time-dependent and metabolite-dependent. Regulation happens during the transcription process: the affinity of RNA polymerase (an essential protein of the transcriptional protein complex) to a specific gene is determined by a combination of different factors, such as other molecules present, surrounding enhancer and silencer sequences, and the structure of the chromosomal DNA. A higher affinity usually leads to increased RNA production and may also affect the translation rate. Post-transcriptional processes are responsible for converting the primary transcript RNA into the final mature mRNA. RNA splicing, 5'-capping, and 3'-polyadenylation are the three main modifications that convert a precursor messenger RNA into a mature mRNA. These modifications and other processes contribute to the actual translation rate by determining, say, the half-life period of the transcript. The translation process is regulated similarly to the transcription process, for example by repressor molecules. After translation, post-translational modifications are often

responsible for determining the functionality of a protein. These modifications include phosphorylation and other simple chemical modifications, the removal of amino acids, attachment to other polypeptide chains, or cutting of the created amino acid chain. Post-translational modifications are often required for the functionality of the protein. The expression levels of one or many genes of a cell, a certain tissue type, or an organism are summarized in an expression profile. Expression profiles are very important in biochemistry, molecular biology, and of course in medicine, where certain expression profiles may correspond to certain diseases. One can also monitor the expression level of a gene over different conditions, such as time or other environmental factors that might affect the expression level in some way. Especially in the last decade, more and more expression data has become available through the use of microarray technology. With microarrays it is possible to measure the concentration of thousands of proteins or RNA sequences at the same time. On each spot of an array, a cDNA molecule or a synthetic oligonucleotide with a certain sequence is fixed. Before RNA molecule concentrations are measured, the sample RNA is converted into cDNA molecules by reverse transcription. These molecules are marked, for example with fluorophores. If sample molecules with the sequence complementary to some spotted sequence are present, these molecules bind. Since every sample cDNA molecule is marked, it is possible to measure the concentration of a specific RNA molecule. Using protein microarrays, one can measure the concentration of proteins based on different protein-protein interactions. Similar expression profiles of two genes can be explained by a similar function of the corresponding RNA or protein, or by dependencies in a signal transduction pathway. Similar expression profiles of organisms can be explained by a similar metabolic state, which may be caused by similar environmental conditions or diseases. This explains why clustering of similar genes, tissues, or organisms based on expression profiles is vital for modern biological data analysis. For a more detailed overview of biochemistry and molecular genetics we refer to [27] and [35].

1.2 Clustering Methods

Clustering is the problem of finding a correct classification of objects into groups such that members of the same group are similar in some way while objects in different groups are not. Distance-based clustering uses a distance measure to assess the similarity between objects. In contrast, conceptual clustering groups objects according to their fit to a descriptive concept. Distance-based methods are very popular, whereas conceptual methods are rarely used, since the underlying concept is often unknown. Clustering can be considered

one of the most important techniques for automated data analysis. In the following, we concentrate on distance-based clustering methods.

Clustering algorithms can be classified into hierarchical and partitional algorithms. Hierarchical clustering starts with an initial clustering and successively refines it. It can be further divided into agglomerative and divisive clustering: agglomerative clustering starts with each element as a single cluster and successively merges them into larger clusters, whereas divisive clustering starts with one large cluster containing all elements and successively divides it into smaller clusters. In contrast, partitional algorithms determine all final clusters at once. We now give an overview of the most important and commonly used clustering techniques; see [26] for a more detailed overview:

• Agglomerative clustering methods are very common and have been frequently studied in the literature. Given a distance matrix, agglomerative clustering starts with each element as a single cluster and iteratively merges the two clusters with the smallest distance. The algorithm may stop at a certain level of dissimilarity or once a certain number of clusters is obtained. There exists a plenitude of distance measures for two clusters A and B, for example:
  – Nearest neighbor clustering or single linkage clustering: D(A, B) := min{d(x, y) : x ∈ A, y ∈ B}
  – Complete linkage clustering or farthest neighbor clustering: D(A, B) := max{d(x, y) : x ∈ A, y ∈ B}
  – Average linkage clustering or UPGMA (Unweighted Pair-Group Method with Arithmetic Mean): D(A, B) := (1 / (|A||B|)) Σ_{x∈A} Σ_{y∈B} d(x, y)
  – Ward's hierarchical clustering [48] merges the two clusters such that the variance of the clusters is increased by a minimum value.

• K-means [31] is a simple partitional clustering method. Given a distance matrix and the number of clusters k, the algorithm randomly assigns k centroids, one for each cluster. A good initial location is important for the clustering result; usually, the k centroids are placed as far away from each other as possible. In each phase, every data point is assigned to the closest centroid; then all centroids are re-calculated such that the new centroid is the average of all points in the cluster. This is repeated until convergence occurs. Fuzzy clustering [41] and Expectation-Maximization (EM) [20] are slightly different approaches based on k-means.

• Self-organizing maps (SOM) [55, 28] are artificial neural networks that produce a low-dimensional representation of the training sample. As for k-means, one chooses the number of clusters and the shape of the network, often a rectangular grid. The initial network size corresponds to the expected number of clusters in the data. Initially, all nodes of the network are randomly populated; nodes are then refined iteratively in a similar fashion as for k-means.

• Spectral clustering [19] uses the eigenvectors of the given similarity matrix to cluster the objects into k clusters. The number of clusters has to be determined manually. Roughly speaking, spectral clustering performs a dimensionality reduction and then uses a simple clustering method, for example k-means, for the actual clustering. Spectral clustering is often used for image segmentation, since it is able to detect clusters independently of their shape.

• Clique-based clustering [22] is based on a graph-theoretical concept. Every vertex represents an object; every edge corresponds to a certain similarity between two objects. The basic idea is that an ideal cluster graph is a disjoint union of cliques, and the algorithm tries to recover the original cluster graph from the corrupted input graph. In this thesis we will examine the strength of clique-based clustering.

A key step for any clustering method is selecting the distance measure. This selection is vital, since it influences the shape of the objective space. Common distance measures for two vectors p = (p1, p2, ..., pn) and q = (q1, q2, ..., qn) with averages p̄, q̄ are:

• Euclidean distance (2-norm distance): sqrt( Σ_{i=1}^{n} (p_i − q_i)² )

• Manhattan distance (1-norm distance): Σ_{i=1}^{n} |p_i − q_i|

• Pearson's product-moment correlation coefficient: cov(p, q) / (σ_p σ_q) = Σ_i (p_i − p̄)(q_i − q̄) / ( sqrt(Σ_i (p_i − p̄)²) sqrt(Σ_i (q_i − q̄)²) )

• Dot product: Σ_{i=1}^{n} p_i q_i

In Chapter 7 we evaluate these four distances for the task of tissue classification.
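As a concrete reference, the four measures can be implemented in a few lines. The following C++ sketch is our own illustration; the function names are hypothetical and not part of any cited tool:

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Euclidean (2-norm) distance between two expression vectors.
double euclidean(const std::vector<double>& p, const std::vector<double>& q) {
    double sum = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i)
        sum += (p[i] - q[i]) * (p[i] - q[i]);
    return std::sqrt(sum);
}

// Manhattan (1-norm) distance.
double manhattan(const std::vector<double>& p, const std::vector<double>& q) {
    double sum = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i)
        sum += std::fabs(p[i] - q[i]);
    return sum;
}

// Dot product.
double dot(const std::vector<double>& p, const std::vector<double>& q) {
    return std::inner_product(p.begin(), p.end(), q.begin(), 0.0);
}

// Pearson's product-moment correlation coefficient.
double pearson(const std::vector<double>& p, const std::vector<double>& q) {
    const double n  = static_cast<double>(p.size());
    const double pm = std::accumulate(p.begin(), p.end(), 0.0) / n;
    const double qm = std::accumulate(q.begin(), q.end(), 0.0) / n;
    double cov = 0.0, vp = 0.0, vq = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        cov += (p[i] - pm) * (q[i] - qm);
        vp  += (p[i] - pm) * (p[i] - pm);
        vq  += (q[i] - qm) * (q[i] - qm);
    }
    return cov / (std::sqrt(vp) * std::sqrt(vq));
}
```

Note that the first two are distances (small means similar), while correlation and dot product are similarities (large means similar); a preprocessing step has to map them onto a common scale before building a similarity graph.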

1.3 Parameterized Complexity

The Weighted Cluster Editing problem is NP-hard. Unless P = NP, there exists no algorithm which solves the problem in polynomial running time. Several approaches have been investigated to attack NP-hard problems, such as approximation algorithms, randomized algorithms, and heuristics. However, none of them guarantees to find an optimal solution. In recent years, a newer area of complexity theory, namely parameterized complexity [11], has achieved remarkable results in solving NP-hard problems exactly and efficiently. It is based on the observation that for many NP-hard problems, the combinatorial explosion can be confined to a small part of the input, the parameter. Besides the normal input, we therefore define the parameter as an additional input of our problem.

Definition 1. A parameterized problem is a language L ⊆ Σ* × Σ* for some finite alphabet Σ. For a problem instance (x, k) ∈ L, k denotes the parameter.

Assuming that the parameter is small in practice, k is fixed to a reasonable value.

Definition 2. A parameterized problem L is fixed-parameter tractable with respect to k if it can be determined in f(k) · n^c time whether (x, k) ∈ L, where f is an arbitrary function depending only on k, c is a constant, and n denotes the overall input size.

The associated complexity class that contains all fixed-parameter tractable parameterized problems is called FPT. Hence, a fixed-parameter algorithm has polynomial running time if we assume k to be constant. Fixed-parameter algorithms are often recursive search tree algorithms. Additionally, data reduction methods are applied to shrink the initial instance to a reduced instance of smaller size, so that the search space is decreased. If the size of the reduced instance depends only on the parameter, the reduced instance is called a problem kernel:

Definition 3. A reduced instance (I′, k′) of an instance (I, k) for a problem L is called a problem kernel if
1. k′ ≤ k and |I′| ≤ g(k) for some function g depending only on k,
2. (I, k) ∈ L if and only if (I′, k′) ∈ L, and
3. the reduction from (I, k) to (I′, k′) is computable in polynomial time.

As mentioned before, it is highly desirable to obtain a valid problem kernel after the data reduction. Often data reduction is also applied while traversing the search tree to reduce the practical running time; this technique is called interleaving. For more details on parameterized complexity we refer to [36]. Weighted Cluster Editing has a problem kernel of size O(k²) and can be solved using a recursive search tree algorithm [5, 39]. In this thesis we further decrease the size of the problem kernel and introduce a new recursive search tree algorithm.

1.4 Branching Vectors

Search tree algorithms work recursively: the number of recursive calls corresponds to the number of nodes in the resulting tree. To bound the worst-case running time of such an algorithm, one needs to bound the number of nodes in the search tree. This number can be calculated by solving homogeneous linear recurrences with constant coefficients, or by determining the positive real root of the corresponding characteristic polynomial [30]. Suppose a search tree algorithm for a given problem of size n calls itself recursively for problem instances of sizes n − d_1, ..., n − d_i; then (d_1, ..., d_i) is the branching vector of this search tree algorithm. It corresponds to the recurrence

t_n = t_{n−d_1} + · · · + t_{n−d_i}

where t_n denotes the number of leaves in the search tree, with t_j = 1 for 0 ≤ j < d and d = max{d_1, ..., d_i}. The characteristic polynomial of the branching vector (d_1, ..., d_i) is

x^{−d_1} + · · · + x^{−d_i} − 1.

If α is the maximum positive real root of this characteristic polynomial, then t_n is O(α^n). We call α the branching number that corresponds to the branching vector (d_1, ..., d_i). For every search tree algorithm the corresponding branching vector can be computed. Each element of the vector corresponds to exactly one branch and denotes the amount by which the parameter is decreased in this branch. If α denotes the maximum positive real root of the characteristic polynomial of the given branching vector, then the size of the search tree is O(α^k). For example, a branching vector of (1, 1) corresponds to a search tree of size O(2^k), whereas a branching vector of (1, 3/2) corresponds to a search tree of size O(1.76^k).
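Because x^{−d_1} + · · · + x^{−d_i} is strictly decreasing for x > 1, the branching number can be found by simple bisection. The following C++ sketch is our own illustration of this computation, not code from PEACE:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Branching number: the unique root alpha > 1 of  sum_i alpha^(-d_i) = 1
// for a branching vector (d_1, ..., d_i).
double branchingNumber(const std::vector<double>& d) {
    // With all d_i >= 1 the root is at most the number of branches.
    double lo = 1.0, hi = static_cast<double>(d.size()) + 1.0;
    for (int iter = 0; iter < 100; ++iter) {
        double mid = 0.5 * (lo + hi);
        double f = 0.0;
        for (double di : d) f += std::pow(mid, -di);
        if (f > 1.0) lo = mid; else hi = mid;  // f decreases in mid
    }
    return 0.5 * (lo + hi);
}

int main() {
    std::printf("%.4f\n", branchingNumber({1.0, 1.0}));      // 2.0000 -> O(2^k)
    std::printf("%.4f\n", branchingNumber({1.0, 1.5}));      // 1.7549 -> O(1.76^k)
    std::printf("%.4f\n", branchingNumber({1.0, 1.0, 2.0})); // 2.4142 -> O(2.42^k)
    return 0;
}
```

The last line reproduces the branching number of the O(2.42^k) algorithm mentioned in the Zusammenfassung, whose refined branching corresponds to the vector (1, 1, 2).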

1.5 Preliminaries and Notations

In this section we give some basic notations and preliminaries which are used throughout this work. In this thesis we concentrate on undirected graphs without loops and multiple edges. Our focus lies on the Weighted Cluster Editing problem, but we first give a definition of the unweighted Cluster Editing problem:

Definition 4. Let G = (V, E) be an undirected, unweighted graph. The task is to find a set of edge modifications (insertions and deletions) of minimum cardinality such that the modified graph consists of disjoint cliques.

The vertices V of the graph correspond to the objects to be clustered. An edge between two vertices indicates a certain similarity between the two corresponding objects. For brevity, we use uv as shorthand for an unordered pair {u, v} ∈ \binom{V}{2}. We also call cliques clusters, and a graph which is a disjoint union of cliques is a cluster graph. A graph is a cluster graph if and only if there exists no conflict triple in the graph. A conflict triple consists of three distinct vertices u, v, w such that uv and uw are edges in G but vw is not (see Figure 1.1). In this work we use uvw as shorthand for such a conflict triple.
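This characterization suggests a direct cubic-time check for whether a graph is a cluster graph. A minimal C++ sketch over a plain boolean adjacency matrix (our own illustration, not code from this thesis's tools):

```cpp
#include <vector>

// adj[u][v] == true iff the edge uv is present.
using Adjacency = std::vector<std::vector<bool>>;

// Searches for a conflict triple uvw: uv and uw present, vw absent.
// A graph without any conflict triple is a disjoint union of cliques,
// i.e. a cluster graph.
bool findConflictTriple(const Adjacency& adj, int& u, int& v, int& w) {
    const int n = (int)adj.size();
    for (u = 0; u < n; ++u)
        for (v = 0; v < n; ++v)
            for (w = v + 1; w < n; ++w)
                if (v != u && w != u &&
                    adj[u][v] && adj[u][w] && !adj[v][w])
                    return true;
    return false;
}
```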

Figure 1.1: Conflict triple uvw: solid lines are present edges, whereas the dotted line indicates a missing edge.

The weighted graph modification problem Weighted Cluster Editing is defined as follows:

Definition 5. Given an undirected weighted graph G = (V, E) and a cost function c : \binom{V}{2} → R which returns the modification cost for every pair {u, v} ∈ \binom{V}{2}, the task is to find a set of edge modifications (insertions and deletions) with minimum total cost such that the modified graph is a cluster graph.

In the following, we use a slightly modified version of the cost function c: if uv is present in the graph, set s(uv) := c(uv), so that s(uv) > 0 holds; if uv is absent from the graph, set s(uv) := −c(uv), so that s(uv) < 0 holds. Note that throughout this work we assume the modification costs/weights of the input graph to be at least one. If this is not the case, we multiply all weights by a constant so that this assumption holds. However, we stress that during the computation pairs uv with weight |s(uv)| < 1 can occur. We further distinguish between two types of conflict triples uvw: if the absent edge has weight 0 ≥ s(vw) > −1, the conflict triple is called "weak", whereas for s(vw) ≤ −1 it is called "strong". Such conflict triples are visualized in Figure 1.2. Further, we say that an edge is "forbidden" if and only if s(uv) = −∞, and that it is "permanent" if and only if s(uv) = ∞. Note that "permanent" and "forbidden" edges may not be modified to obtain a solution. In the following, we define the parameterized Weighted Cluster Editing problem with the modification costs as parameter k:

Figure 1.2: A strong (left) and a weak (right) conflict triple. Solid lines indicate a present edge with deletion cost greater than 1, dotted lines are absent edges with insertion cost greater than or equal to 1, and dashed lines are absent edges with insertion cost less than 1.

Definition 6. Given an undirected weighted graph G = (V, E), a cost function s : \binom{V}{2} → R, and a parameter k, the task is to find a set of edge modifications (insertions and deletions) with total cost at most k such that the modified graph is a cluster graph.

The neighborhood of a vertex u is N(u) := {w ∈ V | s(uw) > 0}. N(u) ∩ N(v) denotes the set of all common neighbors of u, v, whereas N(u) △ N(v) denotes the set of all non-common neighbors, that is, the symmetric set difference of the neighborhoods. Furthermore, we use the notation G[C] = (C, E′) for the induced subgraph of G = (V, E) over the vertex set C ⊆ V, so E′ = {uv | u, v ∈ C and uv ∈ E}. We define two simple lower bounds from Böcker et al. [5] which are frequently used in this work:

Definition 7. The induced costs icp(uv) for setting an edge uv to "permanent" are defined by

icp(uv) = Σ_{w ∈ N(u)△N(v)} min{|s(uw)|, |s(vw)|} + |min{0, s(uv)}|.

The induced costs icf(uv) for setting an edge uv to "forbidden" are defined by

icf(uv) = Σ_{w ∈ N(u)∩N(v)} min{s(uw), s(vw)} + max{0, s(uv)}.

If uv is set to "permanent", then for every non-common neighbor w of u, v we have to decide whether we insert the missing edge or delete the present edge to w; otherwise, uvw forms a conflict triple. In addition, we might have to insert uv itself to set uv to "permanent". If uv is set to "forbidden", then for every common neighbor w of u, v at least one of the edges to w needs to be deleted; otherwise, uvw forms a conflict triple. Again, we might have to add the cost for deleting uv.

Definition 8. A graph cut is a partition of the vertices of a weighted graph G = (V, E) into two vertex sets S and T such that S ∪ T = V and S ∩ T = ∅.

Definition 9. The minimum graph cut of G = (V, E) is a graph cut into S and T, where E′ = {uv | u ∈ S, v ∈ T, uv ∈ E} denotes the cutting edges, such that Σ_{uv∈E′} s(uv) is minimal among all graph cuts.

Definition 10. The minimum s-t cut of G = (V, E) is a minimum graph cut into S and T such that s ∈ S and t ∈ T.

In the following, we refer to the minimum graph cut as the min-cut or the minimum separation costs of the graph, and to the minimum s-t cut as the min-s-t-cut. Note that the min-cut of a graph equals the maximum flow in this graph, since the max-flow min-cut theorem [12] holds. A simple algorithm that finds a min-cut in running time O(|V||E| + |V|² log |V|) was introduced by Stoer and Wagner [45].
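To make Definition 7 concrete, the following C++ sketch computes both lower bounds directly from a hypothetical dense weight matrix s, where s[u][v] > 0 encodes a present edge. It is a minimal illustration of the definitions, not the PEACE implementation:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // s[u][v]: >0 present, <=0 absent

// Induced costs for setting uv to "permanent": every non-common neighbor w
// forms a conflict triple uvw that must be resolved.
double icp(const Matrix& s, int u, int v) {
    double cost = std::max(0.0, -s[u][v]);         // |min{0, s(uv)}|: insert uv itself
    for (int w = 0; w < (int)s.size(); ++w) {
        if (w == u || w == v) continue;
        bool nu = s[u][w] > 0, nv = s[v][w] > 0;
        if (nu != nv)                               // w in N(u) symmetric diff. N(v)
            cost += std::min(std::fabs(s[u][w]), std::fabs(s[v][w]));
    }
    return cost;
}

// Induced costs for setting uv to "forbidden": for every common neighbor w,
// one of uw, vw has to be deleted.
double icf(const Matrix& s, int u, int v) {
    double cost = std::max(0.0, s[u][v]);           // max{0, s(uv)}: delete uv itself
    for (int w = 0; w < (int)s.size(); ++w) {
        if (w == u || w == v) continue;
        if (s[u][w] > 0 && s[v][w] > 0)             // w in N(u) ∩ N(v)
            cost += std::min(s[u][w], s[v][w]);
    }
    return cost;
}
```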


Chapter 2
Previous Results

In this chapter we give a brief overview of previous results for Cluster Editing and Weighted Cluster Editing.

2.1 The Unweighted Cluster Editing Problem

The Cluster Editing problem is also known as the Transitive Graph Projection problem and was first solved for specially structured graphs by Zahn [54]. In 1986, Křivánek and Morávek [29] proved the NP-completeness of Hierarchical-Tree Clustering, of which Cluster Editing is a special case. Several years later, motivated by questions from computational biology, the problem was rediscovered by Ben-Dor et al. [3]. Shamir et al. [42] showed again the NP-completeness of Cluster Editing. In addition, Sharan and Shamir [44] explored related problems, where they modified the Cluster Editing problem and utilized graph-theoretic and statistical techniques to identify clusters. Bansal et al. [2] introduced Cluster Editing as a special case of Correlation Clustering, which is motivated by applications in machine learning; they again showed the NP-completeness of Cluster Editing. Independently, Chen et al. [9] examined a very similar problem in the context of phylogenetic trees, essentially showing that Cluster Editing is NP-hard. A first approximation result for Cluster Editing was provided by Charikar et al. [8]: there exists some constant ε > 0 such that it is NP-hard to approximate Cluster Editing within a factor of 1 + ε. Furthermore, they provided a polynomial-time factor-4 approximation algorithm for this problem. Ailon et al. [1] presented a randomized expected factor-3 approximation algorithm. There exists a very similar problem named Cluster Deletion, where one is allowed to delete edges only in order to transform a graph into a disjoint union of cliques. Cluster Deletion was shown to be NP-hard by Natanzon [34]. Later, the APX-hardness of Cluster Deletion was proven by Shamir et al. [42].

Cai [7] considered a more general graph modification problem that also allows vertex deletions besides edge insertions and deletions. Moreover, he showed that if the graph property is hereditary and can be characterized by a finite set of forbidden induced subgraphs, the problem is fixed-parameter tractable with the number of modifications as parameter. Cluster Editing fulfills these requirements, where an induced path of length two is the forbidden subgraph. Thus, it can be inferred that Cluster Editing is fixed-parameter tractable and can be solved with the O(3^k |G|^4) time algorithm presented in [7]. However, this algorithm relies on complicated implementations and does not seem to be very fast for practical applications. The first non-trivial fixed-parameter algorithm was given by Gramm et al. [17]: with a refined branching strategy which is still suitable for practical implementation, they achieved a running time of O(2.27^k + |V|^3). Later, Gramm et al. [16] presented the currently fastest algorithm, with running time O(1.92^k + |V|^3), which can be seen as a mostly theoretical result, since the branching results in more than 1300 cases. Damaschke [10] studied the enumeration of all solutions with a minimal set of edge changes and gave a fixed-parameter algorithm with running time O*(2.4^k). Gramm et al. [17] presented a data reduction for the problem which runs in O(|V|^3) time and results in a problem kernel of size O(k²); in detail, the problem kernel has 2k² + 2k vertices. The running time of the data reduction was further improved by Protti et al. [38], whereas Fellows [13, 14] showed that a linear-size kernel with 24k vertices can be achieved by a polynomial-time reduction. Using the concept of critical cliques, Guo [18] presented an O(|V||E|²) time algorithm which results in a problem kernel with at most 4k vertices. Unfortunately, it seems that this last important result cannot be transferred to Weighted Cluster Editing, since the concept of critical cliques does not hold for weighted graphs.

2.2 The Weighted Cluster Editing Problem

The Weighted Cluster Editing problem is a generalization of the unweighted Cluster Editing problem; hence, we infer that Weighted Cluster Editing is NP-hard. In contrast to the unweighted Cluster Editing problem, there has not been much research on Weighted Cluster Editing. Several heuristics for the problem, such as HCS [21], CLICK [43], CAST [3], and FORCE [51], are known. CAST tries to find the optimal solution with high probability, both HCS and CLICK greedily use minimal cuts to find an approximate solution, while FORCE uses a graph layout algorithm. However, only a few algorithms are known so far which return an exact solution for Weighted Cluster Editing. Rahmann et al. [39] adapted the fixed-parameter algorithm for Cluster Editing of Gramm et al. [16] to the weighted case, using the modification costs as parameter, and achieved a

running time of O(3^k + |V|^3 log |V|). This simple recursive algorithm branches into three cases for every conflict triple, namely deleting one of the two present edges or inserting the missing edge. In the following, we refer to this branching strategy as the O(3^k) branching strategy. In [5] the authors give a refined branching strategy with running time O(2.42^k + |V|^3 log |V|). Furthermore, the presented data reduction results in a problem kernel for Weighted Cluster Editing with 2k² + 2k vertices. Another very important result of this paper was Reduction Rule 3, which merges two vertices when the edge between them is set to "permanent". As shown in the evaluation part of [5], the new reduction rule significantly improves the running time in practice. Unexpectedly, the simple O(3^k) branching strategy outperforms the refined branching strategy if Reduction Rule 3 is used. The authors conjecture that merging vertices might be used for an even more efficient branching. In this thesis we take a closer look at this reduction rule and introduce a new branching strategy which leads to the fastest algorithm known for the Weighted Cluster Editing problem.
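A minimal C++ sketch of this O(3^k) branching, assuming the dense weight matrix and a conflict-triple finder as in Section 1.5 (our own illustration; the actual algorithms of [39, 5] additionally apply data reduction and lower bounds):

```cpp
#include <limits>
#include <vector>

using Matrix = std::vector<std::vector<double>>;
const double INF = std::numeric_limits<double>::infinity();

// Assumed helper: find vertices u, v, w with uv, uw present but vw absent.
bool findConflictTriple(const Matrix& s, int& u, int& v, int& w);

// Returns true if s can be turned into a cluster graph with cost <= k.
// Branches into three cases per conflict triple; copies s for brevity.
bool solve(Matrix s, double k) {
    if (k < 0) return false;
    int u, v, w;
    if (!findConflictTriple(s, u, v, w))
        return true;                       // no conflict triple left: cluster graph
    {   // case 1: delete uv and set it to "forbidden"
        Matrix t = s; double c = t[u][v]; t[u][v] = t[v][u] = -INF;
        if (solve(t, k - c)) return true;
    }
    {   // case 2: delete uw and set it to "forbidden"
        Matrix t = s; double c = t[u][w]; t[u][w] = t[w][u] = -INF;
        if (solve(t, k - c)) return true;
    }
    {   // case 3: insert vw and set it to "permanent"
        Matrix t = s; double c = -t[v][w]; t[v][w] = t[w][v] = INF;
        if (solve(t, k - c)) return true;
    }
    return false;
}
```

Since every branch decreases k by at least 1 (all input weights are at least 1), the recursion depth is bounded by k, which gives the O(3^k) search tree size.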


Chapter 3
Reduction Rules and Problem Kernel

In order to downsize the input graph as much as possible, it is useful to define a set of reduction rules. By applying these rules, a given problem instance (G, k) can be reduced: the number of vertices, the number of edges, or the parameter k decreases. Reduction rules are usually required to have polynomial running time, and they can often improve running times in practice. In the following, we distinguish between parameter-independent and parameter-dependent reduction rules.

3.1 Parameter-Independent Data Reduction

Parameter-independent data reduction rules reduce a given instance independently of the value of the parameter. They are mostly straightforward and can, of course, be applied in non-fixed-parameter algorithms as well. The following reduction rule is obvious and needs no proof:

Reduction Rule 3.1.1. Remove all connected components which are cliques from the input graph.

Next, we present a reduction rule which was recently published by Böcker et al. [5]. The rule can also be applied in a parameter-dependent context; in this case, we show how to adjust k accordingly. In Section 3.3 we show that the application of this reduction rule leads to a smaller problem kernel. In addition, this reduction rule is vital for a new branching strategy in Section 4.2.

Reduction Rule 3.1.2. (Merging Rule) If an edge uv is set to "permanent", we infer that u and v must be in the same clique in every solution. In this case we merge u and v into a new vertex u′.

For an illustration of the following, see Figure 3.1. As a consequence of merging u and v, for all w ∈ V \ {u, v} we join the edges uw and vw.


Figure 3.1: Merging two vertices u, v into a new vertex u′, where c1 = |s(uw)| and c2 = |s(vw)| are the edit costs and dotted edges are non-existent. This figure is taken from [5].

If w is a common neighbor of u and v, we create a new edge u′w such that its deletion cost is the sum of the deletion costs of both uw and vw. If w is neither a neighbor of u nor of v, we set the insertion cost of the absent edge u′w analogously. In case w is a non-common neighbor, uvw is a conflict triple: we must decide whether to delete the existing edge or insert the missing one. We simply carry out the cheaper operation, by summing the weights (one is positive, one is negative) of both edges to calculate s(u′w), and decrease k by the minimum of the two absolute values. Thus, we maintain the possibility to edit u′w later. In summary, this is how we merge u, v into a new vertex u′: for each vertex w ∈ V \ {u, v}, set s(u′w) ← s(uw) + s(vw); let k ← k − icp(uv), and delete u and v from the graph. The proof of the correctness of this reduction rule can be found in [5].

To complete this section, we now introduce a set of new reduction rules:

Reduction Rule 3.1.3. (Light Edge Rule) Given vertices u, v with s(uv) < 0 such that

|s(uv)| ≥ Σ_{w∈N(v)} s(vw),

then uv is set to "forbidden".

The following lemma shows that Reduction Rule 3.1.3 is correct.

Lemma 1. Let uv be a non-existent edge with |s(uv)| ≥ Σ_{w∈N(v)} s(vw). Then there exists an optimal solution Gopt such that Gopt does not contain uv.

Proof. Let Gopt be an optimal solution where uv is present. Clearly, the cost for connecting u and v in G is |s(uv)|. In contrast, deleting all edges incident to v in G results in costs of at most |s(uv)|. Thus, any optimal solution Gopt where uv is present can be transformed into a solution where v forms a clique by itself, at lower or equal cost.

Reduction Rule 3.1.4. (Almost Clique Rule) Let C be a vertex set, such that G[C] forms a connected component in G and let kC denote the minimum separation costs of G[C]. If X X kC ≥ |s(uv)| + s(uv) u∈C,v∈V \C,s(uv)>0

u,v∈C,s(uv)≤0

then all edges uv ∈ C × C will be set to “permanent”. The following Lemma shows the correctness of Reduction Rule 3.1.4. Lemma 2. Let C be a vertex set, such that G[C] forms a connected component in G and let kC denote the minimum separation costs of G[C]. If X X kC ≥ |s(uv)| + s(uv) u∈C,v∈V \C,s(uv)>0

u,v∈C,s(uv)≤0

holds, then there exists an optimal solution such that all vertices from C are in the same cluster. Proof. Suppose C is a vertex set for which the above presumption holds. In addition, let Gopt be an optimal solution where not all vertices from C are in the same cluster. Therefore, some edges in G[C] have been deleted, with costs of at least kC . In the following, we show it is cheaper to delete all incident edges to G[C] than to separate G[C]. The costs for deleting all edges to neighbors of the connected component G[C] P so that G[C] is not connected to any other vertex outside G[C] are at most u∈C,v∈V \C,s(uv)>0 s(uv). Connecting all vertices P in G[C] costs at most u,v∈C,s(uv)≤0 |s(uv)|. Thus, it is equally expensive or cheaper to leave vertices from C in the same cluster than to separate G[C]. Concerning the computationally exhaustive search for any subgraph G[C] in G it is useful to apply a greedy strategy or a heuristic to find highly connected subgraphs. We present such a strategy in Chapter 5. Reduction Rule 3.1.5. (Heavy Clique Rule) Let C be a vertex set, such that G[C] = (C, E) forms a clique in G. Let E ′ ⊆ E denote a set of |C| − 1 edges such that no edge in E \ E ′ has smaller weight than an edge in E ′ . In addition, let y ∈ C be the vertex which has the smallest sum of edge weights to all vertices v ∈ V \ C, more precise: X X for allw ∈ C \ {y}we have |s(yz)| ≤ |s(wz)|. z∈V \C

z∈V \C

Set C ′ := C \ {y}. If X

s(e) ≥

w∈C ′

e∈E ′

holds, then all edges uv ∈

C 2

X X



|s(wz)|

z∈V \C

will be set to “permanent”. 19

So whenever the weight of the |C| − 1 edges with smallest weight in G[C] is greater than or equal to the connection of the |C| − 1 vertices with vertices not in C, the clique G[C] will be set to “permanent”. Lemma 3. Reduction Rule 3.1.5 is correct. Proof. Let C denote a vertex set such that the presumption of Reduction Rule 3.1.5 holds for C. Thus, E ′ denotes the set of the |C| − 1 edges in E with smallest weight, y ∈ C denotes the vertex which has the smallest sum of edge weights to all vertices v ∈ V \ C and C ′ := C \ {y}. Now let Gopt be an optimal solution where Gopt [C] is not an induced clique in Gopt . In the following, we show it is more expensive to separate G[C] than to delete all vertices in C ′ from their cliques in Gopt and inserting them into the clique of y. To separate G[C], some edges in E have to be deleted from G. To split G[C],P one has to delete at least |C| − 1 edges. This G[C] implies costs of at least e∈E ′ s(e). On the other hand, P deletingPthe vertices in C ′ from their cliques would induce costs of at most w∈C ′ z∈N (w)\C s(wx) in the optimal solution Gopt since every vertex needs to be disconnected from all its original neighbors in G\C. In addition, all vertices in C ′ to the clique in Gopt of y induces Pconnecting P costs of at most w∈C ′ z∈V \N (w) |s(wz)| since every vertex in C ′ needs to be connected to at most all its original non-neighbors in G. It follows that the costs for deleting the vertices C ′ from their cliques in Gopt and P Pinserting them into the clique in Gopt which contains y are at most w∈C ′ z∈V \C |s(wz)|. By our presumption, we know these costs are less than the separation costs for G[C]. Thus, any optimal solution can be transformed into a solution where G[C] is an induced clique in Gopt [C]. Due to the complexity of the exhaustive search for any induced clique in G it is useful to apply Reduction Rule 3.1.5 just on edges. Actually this reduction rule can be transformed into a new reduction rule which is only applicable to edges: Reduction Rule 3.1.6. (Heavy Edge Rule) Given vertices u, v with s(uv) > 0 such that X s(uv) ≥ |s(vw)| w∈V \{u,v}

then uv will be set to “permanent”. The following Lemma shows that Reduction Rule 3.1.6 is correct: P Lemma 4. Let uv be an edge with s(uv) ≥ w∈V \{u,v} |s(vw)| in G. Then there exists an optimal solution Gopt such that Gopt contains uv. 20

Proof. Let Gopt be an optimal solution where uv is absent. In the following, we show that we can transform this optimal solution into a cheaper one where u and v are connected by moving v into the clique of u in Gopt . Clearly the cost for putting u and v into different cliques is s(uv). In contrast, deleting v P from all its original neighbors except of u in G results in costs of at most w∈N (v)\{u} s(vw) and inserting v to the connected component of u reP sults in costs of at most w∈V \N (v) |s(vw)|. We conclude the overall costs P of w∈V \{u,v} |s(vw)| by adding the two above sums. Thus, this solution is cheaper than the optimal solution since we know by our presumption that the transformation costs are less than s(uv). In the following, we present a more involved data reduction rule from B¨ocker et al. [6]. This rule generalizes the “critical clique” concept of Guo [18]: Given an unweighted graph G, a critical clique C is a maximal clique in G such that any two vertices u, v ∈ C share the same neighborhood, N(u) ∪ {u} = N(v) ∪ {v}. For unweighted Cluster Editing, Guo [18] observed that all vertices of a critical clique of the input graph G must be in the same cluster of an optimal clustering. Unfortunately, this observation does not hold for weighted graphs, since edges differ in their weights. Even if we assume that two vertices have similarly weighted neighborhoods they can end up in different clusters in an optimal solution. We give an example from B¨ocker et al. [6] for such a situation: Example. Consider the graph displayed in Figure 3.2. We assume all nonexisting edges to have weight s(xy) = −99. In the displayed graph, u, v form a critical clique. In addition, the weights of their neighborhood are “monotone”: That is, s(ua) < s(ub) < s(uc) < s(ud) and, similarly, s(va) < s(vb) < s(vc) < s(vd). However, u and v do not end up in the same cluster. The optimal clustering for this Weighted Cluster Editing instance is {u, b, c} and {v, a, d} with cost 5 + 1 + 2 + 3 + 14 = 25. In contrast, any solution that does remove edge uv has cost at least 30.

Figure 3.2: Example showing that critical cliques cannot be merged in weighted graphs. Non-existing edges are not displayed. Figure is taken from [6] On the positive side, in weighted graphs it is possible to merge vertices 21

that have “almost identical” neighborhoods if the edge between these vertices has a certain weight. We introduce some short definitions for the following reduction rule. Let P s(v, U) := u∈U s(vu) define a more compact form of our similarity function, returning the sum of edge weights from vertex v to any vertex u ∈ U. Further, let Nu := N(u) \ (N(v) ∪ v), Nv := N(v) \ (N(u) ∪ u) denote the exclusive neighbors of u and v. In addition, let W := V − (Nu ∪ Nv ∪ {u, v}) contain all other vertices. Now let ∆u := s(u, Nu ) − s(u, Nv ) denote the costs for setting the edges to all exclusive neighbors of v plus the costs for deleting the edges to all exclusive neighbors of u. Analogously we define ∆v := s(v, Nv ) − s(v, Nu ). The reduction rule runs over all possible partitions of the vertices W into Cu and Cv where Cu denotes the final clique containing u and Cv denotes the final clique containing v: Reduction Rule 3.1.7. (Critical Clique Rule) If  s(uv) ≥ max min s(v, Cv ) − s(v, Cu ) + ∆v , s(u, Cu ) − s(u, Cv ) + ∆v (3.1) Cu ,Cv

holds then uv can be set to “permanent”, where the maximum runs over all subsets Cu , Cv ⊆ W with Cu ∩ Cv = ∅. An efficient computation of the maximum in equation (3.1) is not trivial. An naive approach, which simply considers every possibility, results in running time O(3|W |) what is obviously unsatisfactory. Instead we make use of dynamic programming to solve this problem in polynomial running time and space. In addition, some care has to be taken for real-valued instances, since the dynamic programming approach cannot be applied directly to real-valued edge weights. Details concerning the implementational aspect of this reduction rule can be found in Chapter 5. We assume that a combination of parameter-independent reduction rules can decrease the size of the problem instance significantly. We present a more detailed analysis of the power of parameter-independent reduction rules in Section 6.2.

3.2

Parameter-Dependent Data Reduction Rules

In contrast to parameter-independent reduction rules parameter-dependent data reduction rules refer to the value k of the current parameter. These reduction routines can lead to a provable problem kernel. In the next section we show that a reduction with some of the reduction rules of this chapter leads to a problem kernel. First we present a reduction rule recently published in [5]. In the following, we extend this reduction rule. Furthermore, we need this rule for the proof of the problem kernel in Section 3.3. 22

Reduction Rule 3.2.1. For every pair of vertices u, v from V we make use of the lower bounds icf(uv) and icp(uv), as defined in Section 1.5, for the costs induced when uv is set to "forbidden" or "permanent", respectively:

• For all u, v ∈ V where icf(uv) > k: Insert uv if necessary, and set uv to "permanent" by assigning s(uv) ← +∞.

• For all u, v ∈ V where icp(uv) > k: Delete uv if necessary, and set uv to "forbidden" by assigning s(uv) ← −∞.

If there are two vertices u, v such that both conditions hold simultaneously, the problem instance (G, k) is not solvable. The proof of the reduction rule above can be found in [5]. Note that Reduction Rule 3.2.1 can be further improved by combining the local lower bounds icp and icf with a global lower bound. We give more details in Section 4.5.

In the following, we improve Reduction Rule 3.2.1. To do so, we have a closer look at the lower bounds icp and icf. Both lower bounds consider only common or non-common neighbors, respectively. However, we can also consider other vertices which might induce costs if an edge is set to "forbidden" or "permanent". For example, suppose uv is set to "forbidden": Clearly, for every common neighbor z of u and v either uz or vz needs to be deleted. In addition, it might be the case that further deletions are necessary to separate u and v. For example, suppose that uz is deleted: Then for every common neighbor x of u and z we also have to decide whether to delete ux or zx. We therefore introduce a new reduction rule: We define icf_set(uv, C) and icp_set(uv, NC) as the induced costs for setting uv to "forbidden" or "permanent" regarding only vertices from the vertex set C or NC, respectively:

icf_set(uv, C) = ∑_{w ∈ N(u) ∩ N(v) ∩ C} min{s(uw), s(vw)}

icp_set(uv, NC) = ∑_{w ∈ (N(u) △ N(v)) ∩ NC} min{|s(uw)|, |s(vw)|}
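The two set-restricted costs translate directly into code. A minimal sketch, assuming a symmetric weight matrix s in which a positive entry marks a present edge; the function names are illustrative and not taken from the PEACE sources:

#include <algorithm>
#include <cmath>
#include <vector>

// icf_set(uv, C): cost of separating u and v from their common neighbors,
// restricted to the candidate set C; for each such w the cheaper of the two
// present edges uw, vw must be deleted.
double icfSet(const std::vector<std::vector<double>>& s, int u, int v,
              const std::vector<int>& C) {
    double sum = 0.0;
    for (int w : C)                                   // w in N(u) ∩ N(v) ∩ C
        if (s[u][w] > 0 && s[v][w] > 0)
            sum += std::min(s[u][w], s[v][w]);
    return sum;
}

// icp_set(uv, NC): cost of making the neighborhoods of u and v agree,
// restricted to NC; for each non-common neighbor w the cheaper of deleting
// the present edge or inserting the absent one is charged.
double icpSet(const std::vector<std::vector<double>>& s, int u, int v,
              const std::vector<int>& NC) {
    double sum = 0.0;
    for (int w : NC)                                  // w in (N(u) △ N(v)) ∩ NC
        if ((s[u][w] > 0) != (s[v][w] > 0))
            sum += std::min(std::fabs(s[u][w]), std::fabs(s[v][w]));
    return sum;
}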

For simplicity we need some more short definitions: N_C(u, v) denotes the set of common neighbors of u and v, meaning N(u) ∩ N(v). N_E(u, v) denotes the set of exclusive neighbors of u with respect to v, meaning {w | s(uw) > 0, s(vw) ≤ 0}. N_N(u, v) is the set of non-neighbors of u and v, meaning {w | s(uw) ≤ 0, s(vw) ≤ 0}. We further define icf^0(uv) = max{0, s(uv)} and icp^0(uv) = |min{0, s(uv)}| as the induced costs of depth zero for setting uv to "forbidden" or "permanent", which correspond to the weight of uv itself. We can now define the "induced costs forbidden of depth two" icf^2(uv) and the "induced costs permanent of depth two" icp^2(uv). With depth two we mean that we consider not only common neighbors and non-common neighbors but also common neighbors of common neighbors and non-common neighbors of non-common neighbors:

icf^2(uv) = icf^0(uv) + ∑_{w ∈ N_C(u,v)} min{ s(uw) + icf_set(uw, N_E(u, v) ∩ N(w)) , s(vw) + icf_set(vw, N_E(v, u) ∩ N(w)) }

icp^2(uv) = icp^0(uv) + ∑_{w ∈ N(u) △ N(v)} min{ s(uw) + icf_set(uw, N_C(u, w) ∩ N(v)) , |s(vw)| + icp_set(vw, N(w) ∩ N_N(u, v)) + icp_set(vw, N_C(u, v) \ N(w)) }

Reduction Rule 3.2.2. We make use of icf^2 and icp^2 as follows:

• For all u, v ∈ V where icf^2(uv) > k: Insert uv if necessary, and set uv to "permanent" by assigning s(uv) ← +∞.

• For all u, v ∈ V where icp^2(uv) > k: Delete uv if necessary, and set uv to "forbidden" by assigning s(uv) ← −∞.

In words, icf^2(uv) is the necessary minimum cost for deleting uv and disconnecting u and v from all their common neighbors. Additionally, when an edge to a common neighbor w is deleted, it is possible that w and u (or v, respectively) also have common neighbors, and one edge to such a neighbor z has to be deleted as well. To make sure that no edge is deleted twice, only vertices z which are not common neighbors of u and v are involved in the calculation. A similar argumentation is valid for the calculation of icp^2(uv).

Lemma 5. Rule 3.2.2 is correct.

Proof. We show the correctness of Reduction Rule 3.2.2 using the fact that a solution of an instance (G, k) must not contain conflict triples. Consider two vertices u, v. Setting uv to "forbidden" immediately implies a conflict triple wuv for each common neighbor w ∈ N_C(u, v) of u and v. As uv cannot be added to G, we have to delete either uw or vw from G, which implies costs of min{s(uw), s(vw)}. Furthermore, if we delete uw, we have to decide for every common neighbor z of u and w which is a non-common neighbor of u and v whether we delete uz or wz, since the deletion of uw implies a conflict triple u, w, z (see Figure 3.3). Clearly this implies additional costs of icf_set(uw, N_E(u, v) ∩ N(w)).


Figure 3.3: Example for icf^2: Edge uv is set to "forbidden". For the case that uw is deleted, one has to decide whether uz or wz gets deleted. Due to the fact that z ∈ N_E(u, v) ∩ N(w), we know that w ≠ z and that no edge weight is counted twice.

The same argumentation holds for the case that vw is deleted. Setting uv to "permanent" immediately implies a conflict triple uvw for each non-common neighbor w ∈ N(u) △ N(v) of u and v. For the proof we assume without loss of generality that w is a neighbor of u. As uv cannot be deleted from G, we have to either delete uw or insert vw, which implies costs of min{s(uw), |s(vw)|}. Furthermore, if we delete uw, we have to decide for every common neighbor z of u and w which is also a common neighbor of u and v whether we delete uz or wz, since the deletion of uw implies a conflict triple zuw (see Figure 3.4). This implies the additional costs of icf_set(uw, N_C(u, w) ∩ N(v)).


Figure 3.4: Example for icp^2: Edge uv is set to "permanent". For the case that uw is deleted, one has to decide whether uz or wz gets deleted. We again choose z to be a common neighbor of u, v to avoid counting edge uz or wz twice in our calculation.

For the other case, where vw is inserted, we have to decide for every common neighbor z of u and v which is not connected to w whether we delete vz or insert wz, since vwz forms a conflict triple (see Figure 3.5). This implies costs of icp_set(vw, N_C(u, v) \ N(w)). Clearly w ≠ z, so no edge weight is counted twice. Second, if vw is inserted, we have to decide for all vertices z which are neighbors of w but neighbors of neither u nor v, and not u itself, whether we delete wz or insert vz (see Figure 3.5), which implies costs of icp_set(vw, N(w) ∩ N_N(u, v)).


Figure 3.5: Example for icp^2: Edge uv is set to "permanent". Left: z is a non-common neighbor of w and v. Right: z is a non-neighbor of w and v. For the case that vw is inserted, one has to decide whether vz or wz gets modified. Again, w ≠ z holds.

Finally, we have to add the costs of setting uv itself to "forbidden" or "permanent", which results in costs of icf^0(uv) or icp^0(uv), respectively.

It is possible to extend this rule to depths greater than two. However, the resulting rules would be much more complicated, would require additional computation time, and are therefore not relevant in practice. The observation above can be generalized for the induced costs for setting an edge to "forbidden" without any additional effort, using a well-known algorithm from computer science, namely minimum s-t cut. This results in the following reduction rule:

Reduction Rule 3.2.3. Setting edge uv to "forbidden" is no longer affordable if k is smaller than the minimum s-t cut value for s = u and t = v; therefore, uv can be set to "permanent".

Lemma 6. Reduction Rule 3.2.3 is correct.

Proof. By definition, the minimum s-t cut yields the cheapest costs to separate the graph into two components such that u and v lie in different components. Clearly, setting uv to "forbidden" would result in costs of at least the minimum s-t cut value, which would be greater than the parameter k. Therefore, uv is present in every solution of (G, k) and can be set to "permanent".

In an actual implementation one has to keep in mind that the minimum s-t cut algorithm can slow down the implementation for large graphs. Therefore, Reduction Rule 3.2.3 should only be applied in selected cases.
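The following sketch shows how Reduction Rule 3.2.3 could be checked, assuming the capacities are the weights of the present edges (absent edges get capacity zero) and using a plain Edmonds-Karp maximum flow; a tuned max-flow implementation would be preferable for large graphs, as noted above:

#include <algorithm>
#include <limits>
#include <queue>
#include <vector>

// Edmonds-Karp: repeatedly find a shortest augmenting path by BFS and push
// flow along it; the resulting maximum flow equals the minimum u-v cut.
double minCut(std::vector<std::vector<double>> cap, int source, int sink) {
    const int n = (int)cap.size();
    double flow = 0.0;
    for (;;) {
        std::vector<int> parent(n, -1);
        parent[source] = source;
        std::queue<int> q;
        q.push(source);
        while (!q.empty() && parent[sink] == -1) {
            int x = q.front(); q.pop();
            for (int y = 0; y < n; ++y)
                if (parent[y] == -1 && cap[x][y] > 1e-9) { parent[y] = x; q.push(y); }
        }
        if (parent[sink] == -1) return flow;          // no augmenting path left
        double bottleneck = std::numeric_limits<double>::infinity();
        for (int y = sink; y != source; y = parent[y])
            bottleneck = std::min(bottleneck, cap[parent[y]][y]);
        for (int y = sink; y != source; y = parent[y]) {
            cap[parent[y]][y] -= bottleneck;          // consume capacity
            cap[y][parent[y]] += bottleneck;          // residual edge back
        }
        flow += bottleneck;
    }
}

// Separating u and v costs at least the min cut; if that exceeds k, the edge
// uv must be kept, i.e. it can be set to "permanent".
bool forcePermanent(const std::vector<std::vector<double>>& cap,
                    int u, int v, double k) {
    return k < minCut(cap, u, v);
}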

3.3

Problem Kernel

In the following we show that Integer Weighted Cluster Editing has a problem kernel consisting of at most k^2 + 3k + 2 vertices and (1/2)k^3 + (5/2)k^2 + 5k + 2 edges, and thereby improve the best previously known problem kernel of size 2k^2 + k [5].

To show the size of the problem kernel, Reduction Rules 3.1.1, 3.2.1 and 3.1.2 are taken into account. Reduction Rule 3.1.1 deletes every connected component which forms a clique from the graph. Reduction Rule 3.2.1 sets edges to "forbidden" or "permanent" if their modification is unaffordable. Finally, Reduction Rule 3.1.2 merges two vertices if the edge between them is set to "permanent". We show that the resulting problem kernel has at most k^2 + 3k + 2 vertices and (1/2)k^3 + (5/2)k^2 + 5k + 2 edges, and thereby improve the previous result from Böcker et al. [5] by a factor of 2.

Lemma 7. Integer Weighted Cluster Editing has a problem kernel which contains at most k^2 + 3k + 2 vertices and at most (1/2)k^3 + (5/2)k^2 + 5k + 2 edges. It can be found in O(|V|^3 log |V|) time.

Proof. Let G = (V, E) be a graph which is reduced with respect to Reduction Rules 3.1.1, 3.2.1 and 3.1.2. Without loss of generality we assume that G is connected; for a non-connected graph G, we process every connected component separately. Since Reduction Rule 3.1.1 deletes all isolated cliques from the given graph, G is not a clique, and we need at least one edge modification to transform G = (V, E) into a graph G′ = (V, E′) of disjoint cliques. Let the parameter k denote the minimum costs for edge modifications to obtain the cluster graph G′ from the input graph G. The minimum modification cost k is the sum of the costs for edge additions k_a and the costs for edge deletions k_d. Since we assume a minimum cost of 1 for every edge, the induced costs for setting an edge to "forbidden" or "permanent" are always greater than or equal to the number of common or non-common neighbors, respectively. Thus, it is sufficient to apply Reduction Rule 3.2.1 whenever the number of common or non-common neighbors, respectively, is greater than the parameter k.

Let V_C ⊂ V denote the vertex set of a largest clique C in G′. We know that |V_C| ≥ |V|/(k_d + 1) (∗), since every edge deletion increases the number of connected components by at most one. Suppose u, v are two vertices in C. We infer that u, v have at least |V_C| − k_a − 2 common neighbors in G, since otherwise u, v cannot be in the same clique in the final solution, as at most k_a edges can be inserted. Furthermore, u and v cannot have more than k common neighbors, since otherwise Reduction Rules 3.2.1 and 3.1.2 could be applied and u and v would be merged into one vertex. Thus, we infer |V_C| − k_a − 2 ≤ k and therewith |V_C| ≤ k + k_a + 2 (∗∗). We conclude:

|V|/(k_d + 1) ≤(∗) |V_C| ≤(∗∗) k + k_a + 2 = k_d + 2k_a + 2

With this equation we can derive an upper bound for |V|:

|V| ≤ k_d^2 + 2k_a k_d + 2k_a + 3k_d + 2 ≤ k^2 + 3k + 2

The same reasoning can be used to derive an upper bound for the number of edges |E|: We know that the largest clique C contains at least |E′|/(k_d + 1) edges,

since equally distributing the edges over the cliques is the worst case. Since we know that C has at most k_d + 2k_a + 2 vertices, we can conclude that |E_C| ≤ (k_d + 2k_a + 2)^2 / 2. Thus, we conclude:

|E′|/(k_d + 1) ≤ |E_C| ≤ (k_d + 2k_a + 2)^2 / 2 = (1/2)k_d^2 + 2k_a k_d + 2k_a^2 + 2k_d + 4k_a + 2

This result leads to the upper bound for |E′|:

|E′| ≤ (1/2)k_d^3 + 2k_d^2 k_a + 2k_d k_a^2 + (5/2)k_d^2 + 6k_d k_a + 2k_a^2 + 4k_d + 4k_a + 2

At last we have to add k edges, since we deleted at most k edges. Our final upper bound is:

|E| ≤ (1/2)k^3 + (5/2)k^2 + 5k + 2


Chapter 4

Algorithms for Weighted Cluster Editing

In this chapter we introduce two new branching strategies which lead to two new recursive algorithms with running times O(2.62^k + |V|^3) and O(2^k + |V|^3) for integer-weighted Cluster Editing. We further show by a refined analysis of the second branching strategy that our algorithm even achieves a running time of O(1.83^k + |V|^3). In Section 4.4 we generalize these results to real-valued Weighted Cluster Editing. Our algorithms are search tree algorithms, like the algorithm for unweighted Cluster Editing introduced by Gramm et al. [16] and the algorithm for the Weighted Cluster Editing problem from Böcker et al. [5]. Both of those algorithms are based on the observation that all conflict triples need to be resolved; therefore, they branch into three or five cases for a conflict triple. In contrast, we use a new and surprisingly simple branching strategy based on Reduction Rule 3.1.2, which branches at an edge that does not even have to be part of a conflict triple. First we present the basic recursive fixed-parameter algorithm from Böcker et al. [5] for Weighted Cluster Editing. After this, we introduce our new branching strategies and show in a refined analysis that our second branching strategy leads to a search tree of size O(1.83^k).

4.1

The Basic Algorithm

We now present the basic recursive fixed-parameter algorithm for Weighted Cluster Editing introduced by Böcker et al. [5]. We assume that an input instance is connected, since it is never useful to connect two disjoint components. In addition, we can assume that we can solve a problem instance with an arbitrary graph of size less than or equal to an integer c in constant time. Furthermore, we have to make an additional assumption which is required solely to achieve provable running times: Every given graph is integer-weighted

with s(uv) ∈ Z such that no edge has weight zero. During the course of the computation such edges can appear, and they require additional attention when analyzing the algorithm. An unweighted Cluster Editing instance can be encoded by assigning edge weights s(uv) ∈ {+1, −1}. In Section 4.4 we discuss the modifications necessary to adapt our algorithms to real-valued weighted graphs.

First of all, our algorithm reduces the input instance (G, k) with Reduction Rules 3.1.1, 3.2.1 and 3.1.2, which takes running time O(|V|^3) for an integer-weighted instance. This running time can be argued as in [5]: The initial computation of icf(uv) and icp(uv) (used by Reduction Rule 3.2.1) for all u, v ∈ V takes O(|V|^3) time. During the reduction, every edge can be set to "forbidden" or "permanent" at most once. After an edge has been set to one of these values, icf and icp need to be updated, which can be carried out in O(|V|) time. Note that after a merging operation, updating icf, icp and the graph G takes only linear time as well. To find an edge uv where icf(uv) or icp(uv) is greater than the parameter k, we can make use of an array data structure instead of priority heaps, since we use integer weights. Thus, finding such an edge takes only linear time. Since we have at most |V|^2 edges to reduce and a reduction cycle requires only linear time, the whole initial reduction takes O(|V|^3) time for an integer-weighted instance and O(|V|^3 log |V|) for a real-valued weighted instance. Note that further reduction can be achieved by applying more reduction rules from Chapter 3. In Section 5.3 more details about the implementation of these reduction rules are provided.

Beyond the initial reduction, our algorithm uses interleaving, applying reduction rules in every node of the search tree to reduce the size of the instance. After the instance is reduced, the algorithm calls itself recursively. The number of calls depends on the branching strategy; two different branching strategies are discussed in the following section. A branch of the search tree can be rejected if either k < 0 or k is less than a calculated lower bound from Section 4.5. If the whole search tree for the current parameter fails, we increase k and start our algorithm again. An instance (G′, k′) is solved if G′ does not contain any conflict triple and k′ ≥ 0. In the following section we introduce the two new branching strategies, which determine the size of the search tree and therewith the running time of the algorithms.
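The surrounding parameter handling can be pictured as a small driver loop; the callback name and interface below are hypothetical stand-ins for the recursive search, not the actual PEACE interface:

#include <functional>

// Iterative deepening over the parameter: try k, and if the whole search
// tree for this budget fails, increase k and restart. For integer-weighted
// instances the step is 1; solveWithBudget wraps reduce-branch-recurse.
double iterativeDeepening(double initialLowerBound,
                          const std::function<bool(double)>& solveWithBudget,
                          double step = 1.0) {
    double k = initialLowerBound;       // start from a computed lower bound
    while (!solveWithBudget(k))
        k += step;
    return k;                           // minimum modification cost found
}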

4.2

Two Simple Branching Strategies

First we introduce a very simple branching strategy which leads to a search tree of size O(2.62^k). The branching strategy is as follows: Let uv be an edge of a conflict triple. Then we either set uv to "forbidden", or we set uv to "permanent" and thereby merge the vertices u and v into a vertex x_uv with Reduction Rule 3.1.2.

This simple branching strategy leads to a branching number of 2.62: Clearly, deleting the present edge uv reduces the parameter by at least one, since we assume minimum weights for the edges. In case we set uv to "permanent", we can decrease our parameter by min{s(uw), −s(vw)}, since we merge the vertices u, v. By joining edges uw and vw into a single edge with weight s(uw) + s(vw), it is possible that edges with weight zero occur. However, in case s(vw) = 0, joining uv with any other edge does not decrease our parameter. To avoid this case we use a bookkeeping trick: We assume that joining edges uw and vw only reduces the parameter by min{s(uw), −s(vw)} − 1/2 (see Figure 4.1). In case we join edges of weight zero, we may assume that this decreases our parameter by the remaining 1/2. Using this bookkeeping trick, our edge branching strategy has a branching vector of (1, 1/2) with corresponding branching number 2.62. An algorithm from Section 4.1 which uses this branching strategy has running time O(2.62^k + |V|^3).


Figure 4.1: An example branching at conflict triple uvw: Vertices u and v are merged into vertex x_uv. Edges uw and vw are joined to the edge x_uv w. The parameter is reduced by 1/2, and cost 1/2 is booked for the zero weight edge x_uv w.

In the following, we assume that resolving a conflict triple or joining a zero weight edge always implies a cost of at least 1/2. As a second branching strategy, we improve the above branching strategy by always choosing an edge such that the branching number is minimized. We observed a certain structure for graphs which contain only edges that are part of at most one conflict triple:

Lemma 8. Given an integer-weighted and connected graph G. If every edge of G is part of at most one conflict triple, then G is either a clique or a clique minus a single edge.

Proof. If G = (V, E) contains no conflict triple, G is a clique, so we assume that this is not the case: Let uvw be a conflict triple with uv, uw ∈ E and vw ∉ E. Let us assume that another vertex x of G is adjacent to v. In this case ux ∈ E must hold, since otherwise uv would be part of two conflict triples, in contradiction to our assumption. Similarly, ux ∈ E implies vx ∈ E. The same argumentation holds if we replace v by w. We conclude that if a vertex x is adjacent to u, v or w, then it has to be adjacent to all of them.

So let us assume x, y are two vertices adjacent to u, v and w. In case xy ∉ E, xvw and vxy are two conflict triples containing the edge xv, a contradiction to our assumption. Finally, assume two vertices x, y such that x is adjacent to u, v, w while y is not. Since G is connected, we may assume xy ∈ E. Now the edge vx is part of the two conflict triples xvw and xvy, which is again a contradiction to our assumption. Thus, every vertex except v and w must be adjacent to all other vertices in G.

We also observe that a graph whose absent edges all have weight zero is trivial, since setting all edges in this graph to "permanent" does not influence the parameter and solves the problem instance immediately.

Lemma 9. For an integer-weighted graph, the edge branching strategy that chooses an edge with minimal branching number has at least branching vector (1, 1).

Proof. Recall that if we resolve a conflict triple, we reduce k by at least 1/2, and if we join an edge of weight zero, we reduce k by 1/2 as well. As long as the algorithm can choose a present edge uv which is a member of two conflict triples, we can conclude branching vector (1, 1) using our bookkeeping trick: Deleting uv still results in cost 1, whereas merging u and v reduces the parameter by at least 1, since two conflict triples are resolved. In case uv is a member of only one conflict triple uvw, we know by Lemma 8 that G is a clique minus an edge, namely vw. We suppose that s(vw) < 0, since otherwise vw could be inserted without any costs and G would already form a clique. We distinguish between the subcases that one of the joined edges uw and vw has an absolute weight greater than one, or that both edges have an absolute weight equal to one. If the first subcase holds, then joining uw and vw results in cost ≥ 1, and we do not make use of our bookkeeping trick. In the other case we can omit our bookkeeping trick and reduce the parameter by exactly 1; the resulting instance is trivial, since the created absent edge has weight zero. Thus, the resulting branching vector is always (1, 1), as claimed.

In total, our edge branching algorithm has a running time of O(2^k + |V|^3) for integer-weighted graphs. For the Cluster Deletion problem, where only edge deletions but no insertions are allowed, the above strategy results in a running time of O(1.62^k), because we can give all non-existing edges uv weight s(uv) = −∞. In Section 5.4 we show how the branching number of an edge is efficiently computed.
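For reference, the branching number of a branching vector (b_1, ..., b_n) is the unique root x > 1 of ∑_i x^(−b_i) = 1, which is easy to approximate numerically. A minimal sketch; the bisection bounds are assumptions of this sketch:

#include <cmath>
#include <vector>

// Bisection for the root x > 1 of sum_i x^(-b_i) = 1. The left-hand side
// strictly decreases in x, so we shrink [lo, hi] until it converges.
double branchingNumber(const std::vector<double>& b) {
    double lo = 1.0, hi = 64.0;              // assumed upper bound on the root
    for (int it = 0; it < 200; ++it) {
        double mid = 0.5 * (lo + hi), sum = 0.0;
        for (double bi : b) sum += std::pow(mid, -bi);
        (sum > 1.0 ? lo : hi) = mid;         // sum too large: root lies right
    }
    return hi;
}

For example, this returns about 2.618 for the vector (1, 1/2) and exactly 2 for (1, 1), matching the numbers quoted above.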

4.3

An O(1.83^k) Branching Strategy

In the following, we show that the second branching strategy from the previous section leads to a search tree of size less than O(2^k).

We choose the edge uv over all (V choose 2) vertex pairs such that uv has the smallest branching number. The branching number is computed from the branching vector (a, b), where a is the cost for deleting edge uv, while b is the cost for merging the edge. The edge has an infinite branching number if and only if one of these costs is zero. An edge uv with finite branching number is not necessarily part of any conflict triple: Using our bookkeeping trick from the previous section, we can generate cost 1/2 by joining an edge of weight zero with an arbitrary edge. Thus, the edge with minimal branching number might not be part of any conflict triple. We conclude that resolving a conflict triple or joining an edge of weight zero results in cost 1/2, respectively. In the worst case, all edges in the graph have weight −1, 0 or +1. If we ignore the bookkeeping peculiarities, then finding an edge with minimal branching number is equivalent to finding an edge uv that is part of a maximum number of conflict triples.

Theorem 1. For an integer-weighted graph, the edge branching strategy that chooses an edge with minimal branching number results in a search tree whose size is bounded by O(1.83^k).

We leave the proof of the above theorem to the following sections.

4.3.1

Solving Paths and Cycles

The branching strategy introduced in the previous section is sufficient for solving instances which are cycles or paths, with a branching number of at most 1.76. Keep in mind that the edge with the smallest branching number is always chosen for branching.

Lemma 10. Given a graph G = (V, E) which is either a cycle or a path, there always exists an edge for which branching by edge merging results in a branching vector of (1, 3/2) with a branching number of 1.76.

Proof. Suppose (u, v, w, x) induces a path of length three. Clearly, v, w, u and v, w, x form conflict triples. We distinguish between two cases:

Case 1: If s(wu) < 0 and s(vx) < 0, then we can branch at vw with the desired branching number: Deleting vw results in costs of at least 1, whereas setting vw to "permanent" joins two pairs of edges. If the absolute weight of one of the edges vx, vu, wx and wu is at least two, then joining this edge, and thereby solving one conflict triple, results in cost of at least 1, and solving the remaining one by joining two edges results in cost 1/2, since we use our bookkeeping trick to avoid "blank" zero weight edges. If the absolute weights of vx, vu, wx and wu are all equal to one, then the resulting vertex z_vw forms a connected component on its own, so that we can reduce the parameter in this branch by exactly 2. Thus, branching at vw results in a branching vector of at least (1, 3/2).

Case 2: We suppose that s(wu) = 0 and s(vx) ≤ 0. Furthermore, we suppose that there exists an additional vertex y such that s(xy) > 0. Since w, x, v and w, x, y form conflict triples and uw is a zero weight edge, branching at wx results in cost 1 for deleting wx and cost 3 · 1/2 = 3/2 for setting wx to "permanent", as two conflict triples are resolved and one zero weight edge is joined with another edge. If such a vertex y does not exist, we can assume that u is connected to a vertex t which has one additional neighbor; clearly, the above argumentation then holds for ut. Thus, the resulting branching vector is (1, 3/2). The remaining instances can be solved in constant time.

Note that instances which are cycles or paths can generally be solved in polynomial time using dynamic programming. However, cycles or paths barely occur in real-world instances, e.g. from biological data.

4.3.2

Refined Analysis of Edge Branching

In the following, we prove Theorem 1. Keep in mind that isolated cliques are deleted from the graph and all connected components are solved individually. Clearly, if s(xy) ≥ 0 holds for all pairs xy ∈ V × V, the instance is trivial, since the whole graph can be turned into a single clique at no cost. Furthermore, a graph which is a path or a cycle can be solved using this branching strategy, since Lemma 10 holds. In the following, we prove that this branching strategy leads to a search tree of size O(1.83^k) for all non-trivial instances which are not cycles or paths. The proof is structured as follows: Lemmas 11 and 12 show how, given a certain structure, two consecutive branchings lead to branching vector (2, 2, 5/2, 3) with branching number 1.83. In Lemma 13 we show that if only one absent edge xy with s(xy) < 0 exists, the instance can be solved in polynomial time. After proving this special case, we show with Lemma 14 that either a structure as required by Lemma 12 or another specific structure always exists in the graph. Finally, using Lemma 15, we show in Lemma 16 that if this additional structure exists, the graph is either simple with respect to Lemma 13 or Lemma 12 holds somewhere else in the graph. With this last proof we establish that the structure required by Lemma 12 always exists, and therewith Theorem 1 holds. Since our branching strategy always chooses the edge with minimal branching number, we introduce a new lemma to analyze the consecutive use of this branching strategy:

Lemma 11. Given integers a, b, c, d such that the branching number of (a, b) is less than 1.325 and the branching number of (c, d) is less than 1.273, the branching number of (a + 2, b + 2, c + 2, d + 2) is less than 1.351.

Proof. It is sufficient to check the claim for branching vectors (a, b) ∈ {(1, 5), (2, 3)} and (c, d) ∈ {(1, 7), (2, 4), (3, 3)}.

All other branching vectors (a′, b′) and (c′, d′) that satisfy the conditions of the lemma are dominated by these branching vectors: there exist (a, b) and (c, d) such that a ≤ a′, b ≤ b′, c ≤ c′, and d ≤ d′. Among the six possible combinations, the largest branching number results from branching vector (3, 7, 3, 9); the corresponding branching number is 1.3509.

For the following corollary it is worth noting that if b is the branching number of branching vector (b_1, b_2, ..., b_n), then b^2 is the branching number of branching vector (b_1/2, b_2/2, ..., b_n/2).

Corollary 2. Given numbers a, b, c, d ∈ (1/2)ℕ = {0, 0.5, 1.0, 1.5, ...} such that the branching number of (a, b) is ≤ 1.755 and the branching number of (c, d) is ≤ 1.6191, the branching number of (a + 1, b + 1, c + 1, d + 1) is ≤ 1.83. In particular, branching vectors (1, 3/2) and (1, 2) satisfy the conditions of this corollary.

We now introduce the central reasoning for the proof of Theorem 1 in the following lemma:

Lemma 12. Let G = (V, E) be an integer-weighted and connected graph, where we assume all edges of weight zero to be non-existing. Assume there is a triangle x, y, z with xy, xz, yz ∈ E and two additional vertices v_1, v_2 such that for each vertex v_i one of the following conditions holds, where x and y can be swapped:

(i) s(v_i x) > 0 and s(v_i y) < 0

(ii) s(v_i x) = 0 and s(v_i y) = 0

(iii) s(v_i x) = 0 and s(v_i y) < 0 and s(v_i z) ≥ 0

(iv) s(v_i x) > 0 and s(v_i y) = 0 and s(v_i z) ≤ 0

Then branching by edge merging has a branching vector of (2, 2, 5/2, 3) with branching number 1.83. See Figure 4.2 for an illustration.

Proof. Initially we branch for edge xy with branching vector (1, 1), as deleting xy induces costs of 1, and merging xy results in costs 2 · 1/2 = 1, since for each v_i a conflict triple or a zero weight edge is resolved. In the following, we refine this branching vector from two into four branching cases. First we split the branching case where xy is set to "forbidden" into two subcases: Deleting xz or yz results in costs of 1, whereas setting xz or yz to "permanent" resolves the conflict triple xzy, resulting in costs of 1 since s(xy) = −∞. Additionally, if condition (ii), (iii) or (iv) holds, then a zero weight edge is resolved (see Figure 4.3). Otherwise, if condition (i) holds, then xz or yz forms a conflict triple with v_i which is resolved. Hence, setting

either xz or yz to "permanent" results in costs of 3/2. Thus, in this step of the first branch a branching vector of (1, 3/2) holds.

Second, we split the branching case where xy is merged into two subcases. Let w_xy be the vertex resulting from merging xy. Deleting w_xy z induces costs of 2, as s(w_xy z) ≥ 2. Merging w_xy z induces costs of 1: If condition (i) holds for v_i, then s(w_xy v_i) = 0; otherwise the first branching at xy already results in a branching vector of (1, 3/2) with branching number 1.76, since either s(xv_i) or |s(yv_i)| was greater than one in this case. Similarly, s(w_xy v_i) = 0 if condition (ii) holds for v_i, so that in both cases joining w_xy v_i with zv_i results in costs of 1/2 (see Figure 4.4). As a consequence of conditions (iii) and (iv), merging w_xy z results in costs of 1/2 for each v_i, since either a zero weight edge is joined with v_i z, or w_xy, z, v_i forms a conflict triple. Thus, in the second branch a branching vector of (1, 2) holds.

Since we always branch at the edge with the smallest branching number, we can use Corollary 2 to merge the branching vectors (1, 3/2) and (1, 2) into the branching vector (2, 2, 5/2, 3) with the corresponding branching number 1.83. See Figure 4.5 for an example branching.

Note that a slight adaptation of our branching strategy leads to a branching number of 1.82: By Corollary 2, it is possible to branch at zero weight edges, since we saved costs of 1/2 using our bookkeeping trick. However, in a real-world algorithm we branch only at existing edges, since branching at absent edges corresponds to branching number ∞. Thus, we can guarantee that deleting the edge we branch at costs at least 1. Using this adaptation we can conclude branching number 1.82 for branching vector (2, 2, 5/2, 3).

In the following, we prove that instances to which we refer as simple can be solved in polynomial time:

Lemma 13. Given an integer-weighted and connected graph G = (V, E), if only one absent edge xy with s(xy) < 0 exists, then the instance can be solved in polynomial time.


Figure 4.2: Cases (i)–(iv) from Lemma 12. Solid lines are present edges, dashed lines are absent edges with weight zero, dotted lines are other absent edges.


Figure 4.3: Cases (i)–(iv) from Lemma 12 after xy is set to "forbidden".


Figure 4.4: Cases (i)–(iv) from Lemma 12 after xy is set to "permanent" and x, y are merged.

Proof. Given a graph G = (V, E) with only one absent edge xy with s(xy) < 0, we have to distinguish between two possible solutions. First, x, y could be in the same cluster in an optimal solution. This would require inserting xy and no further operation, since any other absent edges in G have weight zero and can be inserted without any additional costs. Second, x, y could be in different clusters in an optimal solution. By applying a minimum s-t cut algorithm, we can calculate the minimum costs for separating G into two components such that x, y are not members of the same component. The resulting graph is already a cluster graph, since zero weight edges between connected components stay absent, whereas zero weight edges within a connected component can be inserted without any costs. By comparing |s(xy)| with the minimum s-t cut value, one can easily obtain the optimal solution. Since a minimum s-t cut algorithm has polynomial running time, the lemma is correct.

We now show that either an edge is a member of three conflict triples, in which case a branching vector of (1, 3/2) holds, or a structure as required in Lemma 12 always exists. In the following lemma we show that either Lemma 12 holds or a specific structure exists in the graph:

Lemma 14. For every undirected weighted graph G one of the two propositions holds:

(i) Branching by edge merging has a branching vector with corresponding branching number of at most 1.83.

(ii) An edge vw exists in G which is a member of a strong conflict triple with u, a second conflict triple with x, and a triangle vwy with y.


Figure 4.5: An example for branching with branching vector (2, 2, 5/2, 3). Note that in this example condition (i) holds for vertex v_1 and condition (ii) holds for vertex v_2.

Proof. If G is a path or a cycle, Lemma 10 holds. If only one absent edge with weight less than zero exists, the instance is simple with respect to Lemma 13. Since G is connected and not a trivial clique, an edge vw exists which is a member of at least one strong conflict triple vwu and a second conflict triple vwx. Let y denote an additional vertex. If y is a common neighbor of v, w, proposition (ii) holds. If y is a non-common neighbor of v, w, proposition (i) holds, since vw is then a member of three conflict triples, and branching vector (1, 3/2) with branching number 1.76 holds at vw. If vy or wy is a zero weight edge, proposition (i) also holds. So let s(vy), s(wy) < 0 for another vertex y which is connected to either u or x. Further, we assume without loss of generality that u is a neighbor of v. In the following, we distinguish between two cases:

Case 1: Let w be a neighbor of x. We assume without loss of generality that s(uy) > 0, since y is connected to u or x (Figure 4.6a). If s(ux) > 0, then edge uv would be a member of three conflict triples or proposition (ii) holds, see Figure 4.6b. Otherwise, if s(ux) ≤ 0, this local structure forms a cycle. If only this case occurred for all edges vw which are members of a strong conflict triple and a second conflict triple, then G would be a cycle or a path. Thus, we can conclude that there must exist an edge vw such that one of the other cases holds.


Figure 4.6: Illustration of Lemma 14. Present edges are solid lines, absent edges are displayed as dotted lines. (a) Case 1 of Lemma 14. (b) Case 1 if s(ux) > 0. (c) Case 2 if s(xy) > 0. (d) Case 2 if s(uy) > 0.

Case 2: Let v be a neighbor of x. If s(xy) > 0, then edge vx would be a member of three conflict triples, or, if s(ux) > 0, proposition (ii) holds, see Figure 4.6c. From this it follows that s(uy) > 0, since y is connected to u or x. Now edge vu would be a member of three conflict triples or proposition (ii) holds, see Figure 4.6d.

The following lemma is a helper for the final lemma. We now prove that if a certain subgraph exists at which we can branch at nearly all edges with branching vector (1, 1), then we can extend this subgraph only in such a way that it either stays simple with respect to Lemma 13 or a branching vector of (1, 3/2) holds.

Lemma 15. Given a connected subgraph G_sub = (V, E) of G with the following properties:

(i) at most one absent edge e with s(e) < 0 exists in G_sub,

(ii) at least two absent edges are incident to one vertex x ∈ V,

(iii) ∀uv ∈ E : ∃w ∈ V : branching at at least two of the edges uv, uw, vw has branching vector (1, 1).

Then the instance is simple with respect to Lemma 13, or at one edge in G the branching vector (1, 3/2) with branching number 1.76 holds. See Figure 4.7a for an example.

Proof. Suppose G_sub with the above properties exists in G. Let z_1 be another vertex of G which shares an edge with at least one vertex of G_sub. Clearly, if z_1 does not exist, the instance is simple with respect to Lemma 13, since G has property (i). Graph property (iii) has a special consequence: if z_1 is not connected to all vertices of G_sub, a branching vector of (1, 3/2) holds at one edge in G_sub. Since every edge uv forms a vertex triple u, v, w with an additional vertex w, and a branching vector of (1, 1) already holds at at least two of the edges uv, uw, vw, at at least one of these two edges the costs for merging would increase by 1/2, since z_1 is not connected to all three vertices u, v, w.


Figure 4.7: Illustration of Lemma 15. Present edges are solid lines, absent edges are displayed as dotted lines, and zero weight edges are displayed as dashed lines. (a) An example of a graph for which the condition of the lemma holds. (b) The case that two additional vertices z_i are not connected. (c) The case that a vertex v_i exists which has no connection to u, v, w, x or y but to z_i.

Thus, we now suppose all vertices z_i ∈ G which are connected to at least one vertex of G_sub to be connected to all vertices in G_sub. If not all vertices z_i are connected with each other, then for the edge from x to one of the non-connected vertices z_i the branching vector (1, 3/2) holds, since xz_i is a member of three conflict triples, see Figure 4.7b. If a vertex v_i ∈ G exists which is not connected to G_sub but connected to at least one vertex z_i, then edge v_i z_i is a member of |V| conflict triples with all vertices of G_sub, see Figure 4.7c. In this case a branching vector of (1, 3/2) holds at v_i z_i. Finally, we observe that if all vertices z_i ∈ G are connected to all vertices in G_sub and connected to each other, the instance is simple with respect to Lemma 13, since property (i) holds.

With the following lemma we finally prove Theorem 1. In Lemma 14 we showed that either a branching vector with branching number 1.83 holds, or an edge exists in G which is a member of a strong conflict triple, a second conflict triple, and a triangle. We now show that if this structure exists, then either Lemma 12 or Lemma 15 holds somewhere in the graph.

Lemma 16. Given an integer-weighted and connected graph G = (V, E), branching by edge merging has a branching vector with branching number at most 1.83.

Proof. From Lemma 14 we know that our claimed branching is possible, or an edge vw ∈ E exists which is a member of two conflict triples with vertices u, x ∈ V and a triangle with vertex y ∈ V. If both conflict triples are strong, then condition (i) of Lemma 12 holds and our lemma is correct. Otherwise, the second conflict triple is weak. We assume without loss of generality that vwu is the weak conflict triple, and therewith s(uw) = 0 and s(vu) > 0.


Figure 4.8: Illustration of Case 1 of Lemma 16. Present edges are solid lines, absent edges are displayed as dotted lines. (a) Subcase that s(uy) > 0 and s(yx) ≤ 0. (b) Subcase that s(uy), s(yx) > 0 and s(ux) ≤ 0. (c) Subcase that an additional vertex z_1 is connected to y but to no other vertex. (d) Subcase that z_1 is connected to all vertices except y.

In the following, we prove the lemma by exhaustive case distinction. First we distinguish between two cases: either x is a neighbor of w, or x is a neighbor of v. In each case we analyze the subcases for every possible weight of the edges ux, uy and xy.

Case 1: Let x be a neighbor of w and s(vx) < 0. If s(uy) ≤ 0, condition (iv) of Lemma 12 holds for u and the claimed branching vector holds for vw. Otherwise, if s(uy) > 0 and s(yx) ≤ 0, a branching vector of (1, 3/2) holds at wx, see Figure 4.8a. In the case that s(uy) > 0, s(yx) > 0 and s(ux) ≤ 0, condition (iii) of Lemma 12 holds, see Figure 4.8b. Otherwise, if uy, yx and ux are present, we can use a similar argumentation as in Lemma 15: Assume z_1 is an additional vertex connected to u, v, w, x or y. If z_1 does not exist, G is simple with respect to Lemma 13. If z_1 is connected to u, v, w or x but not to all of them, then a branching vector of (1, 3/2) holds at an edge, since the vertices u, v, w, x form a subgraph G_sub with property (iii) from Lemma 15. If u, v, w, x are not connected to z_1 but y is, Lemma 15 holds for {u, v, w, x, y, z_1}, see Figure 4.8c. So let us assume all additional vertices z_i to be connected to u, v, w, x. Clearly, the instance stays simple with respect to Lemma 13 if s(z_i y) ≥ 0 and s(z_i z_j) ≥ 0 hold for all additional vertices z_i, z_j. Suppose s(y z_i) < 0 or s(z_i z_j) < 0, respectively: Condition (i) of Lemma 12 holds, since v z_i is a member of a triangle with u and a member of two conflict triples with x and y or z_j, respectively; see for instance Figure 4.8d. In the remaining case, there exists a vertex a which is not connected to u, v, w, x, y but to a vertex z_i. It is easy to see that edge a z_i is a member of four conflict triples with u, v, w, x, and therewith a branching vector of (1, 2) holds at a z_i.

Case 2: Let x be a neighbor of v and s(wx) < 0. If s(uy) ≤ 0, condition (iv) of Lemma 12 holds and the claimed branching vector holds for vw. Otherwise s(uy) > 0, see Figure 4.9a. If one of the weights s(ux), s(yx) is less than zero and the other greater than zero, Case 1 holds for vx.


Figure 4.9: Illustration of Case 2 of Lemma 16. Present edges are solid lines, absent edges are displayed as dotted lines. (a) The subcase that s(uy) > 0. (b) An example subcase for which Case 1 or Lemma 15 holds, respectively. (c) The subcase that s(uy), s(yx), s(ux) > 0 and z_1 is connected to v, w, y but not to u and x. (d) The subcase that z_1 is completely connected to the other vertices.

If one of these edge weights is equal to zero and the other one greater than zero, Lemma 15 holds for G_sub[{u, v, w, x, y}], see Figure 4.9b for an example. If s(ux), s(yx) ≤ 0, branching vector (1, 3/2) holds at vx, since vx is a member of three conflict triples with w, u, y. If s(ux), s(yx) > 0, we can use a similar argumentation as in Lemma 15: Let us assume z_1 is an additional vertex connected to u, v, w, x or y. If z_1 does not exist, the instance is simple with respect to Lemma 13. If z_1 is connected to v, w or y but not to all of them, a branching vector of (1, 3/2) holds at vw, vy or wy, since property (iii) from Lemma 15 holds for {v, w, y}, see Figure 4.9c for an example. First, let us assume all edges from z_1 to v, w, y to be absent. If s(uz_1) > 0, branching vector (1, 3/2) holds at uz_1, since it is a member of two conflict triples and wu is a zero weight edge. If s(uz_1) ≤ 0 and s(xz_1) > 0, edge xz_1 is a member of three conflict triples, and again branching vector (1, 3/2) holds. Second, let us assume all edges from z_1 to v, w, y to be present. If s(uz_1) ≥ 0 and s(xz_1) ≥ 0, this subgraph is trivial with respect to Lemma 13. Otherwise, if either s(uz_1) < 0 or s(xz_1) < 0, Lemma 12 holds at either uy or yx, respectively. So let us assume all additional vertices z_i to be connected to v, w, y with s(uz_i) ≥ 0 and s(xz_i) ≥ 0 (see Figure 4.9d). Clearly, the instance stays simple with respect to Lemma 13 if s(z_i z_j) ≥ 0 holds for all additional vertices z_i, z_j. If s(z_i z_j) < 0, then wz_1 is a member of three conflict triples and branching vector (1, 3/2) holds at wz_1. In the remaining case, there exists a vertex a which is not connected to u, v, w, x, y but to a vertex z_i. It is easy to see that edge az_i is a member of three conflict triples with u, v, w, and therewith a branching vector of (1, 3/2) holds at az_i.

We finally observe that through Lemma 16 we can conclude the correctness of Theorem 1.

4.4

Real-valued Edge Weights

In this section we discuss how our approach can be modified to handle real-valued edge weights. Both branching strategies can be adapted such that they work for real-valued edge weights. Let us redo the analysis of our second simple branching strategy from Section 4.2, but this time for real-valued graphs: We assume all edges to have an initial weight greater than or equal to one. Let uv be an edge in a conflict triple uvw. In case we set uv to "permanent" and merge u, v, we use our bookkeeping trick to put aside 1/2 for the newly created edge and infer deletion costs of at least 1/2. In case we join an edge with weight less than one, we use the remaining 1/2 to decrease k. In the other case, we delete an edge: As mentioned above, we can only assume deletion costs of 1/2, since every edge is allowed to have weight less than 1. Finally, we can conclude branching vector (1, 1/2) with corresponding branching number 2.62, which is slightly worse than for integer-weighted edges but should be seen as a worst-case scenario.

In the same manner we can adapt our refined analysis from Section 4.3.2 to real-valued edges. The analysis is basically the same, except for the fact that we can only assume deletion costs of 1/2 instead of 1, and that we only refine the second branch, where the edge is set to "permanent". From this we can conclude branching vector (1/2, 2, 2) with corresponding branching number 2.39. Note that our branching strategy is not sufficient for paths, cycles and cliques minus one edge in the case of real-valued edge weights; however, these special graphs can be solved in polynomial time. We conclude that our algorithm with the edge merging branching strategy has worst-case running time O(2.39^k + |V|^3 log |V|), which is currently the fastest known strategy for the Weighted Cluster Editing problem.

4.5

Branch and Bound

In this section we introduce fast methods to compute a lower bound b(G) on the costs of an instance G. The use of lower bounds has three big advantages: First, branches in the search tree can be cut off, since b(G) > k implies that there is no solution for this graph and parameter k. Second, in case the graph separates into two or more connected components, we can bound every search tree using the sum of the bounds for all connected components. And third, we can use global lower bounds to improve the lower bounds that are used by reduction rules.

Let m denote the number of conflict triples in our instance (G, k). All these m conflict triples need to be resolved to make G a cluster graph. Let m(uv) denote the number of conflict triples in G that contain the vertex pair uv. Suppose F is a set of pairs that need to be modified to make G a cluster graph. We infer that ∑_{uv∈F} m(uv) ≥ m must hold. Let further r(uv) := |s(uv)|/m(uv) denote the costs to resolve one conflict triple at uv. To resolve m conflicts in our

graph, we have to pay at least b_1(G) := m · min_{uv} {r(uv)}. A more involved lower bound is obtained by sorting all pairs uv according to the ratio r(uv), then going through this sorted list from smallest to largest ratio and adding pairs uv to a set F′ as long as ∑_{uv∈F′} m(uv) ≤ m holds. The sum of edge weights b_2(G) := ∑_{uv∈F′} |s(uv)| is a tighter lower bound but obviously requires more computation time.

The minimum cost for resolving a conflict triple vuw is clearly min{s(uv), s(uw), −s(vw)}. Let CT be a set of conflict triples such that no two conflict triples in CT share an edge. Since resolving one conflict triple of CT does not resolve another conflict triple, we can infer that b_3(G) = ∑_{vuw∈CT} min{s(uv), s(uw), −s(vw)} is a lower bound for G. Since finding the set CT with maximum lower bound value is computationally expensive, we use a greedy approach to construct the set of non-overlapping conflict triples CT.

We can make use of the lower bound above to improve the lower bounds icf(uv) and icp(uv) of Reduction Rule 3.2.1. Given a set of non-overlapping conflict triples CT_{¬u¬v} such that no conflict triple wxy ∈ CT_{¬u¬v} contains vertex u or v, clearly

b_{¬u¬v}(G) = ∑_{wxy∈CT_{¬u¬v}} min{|s(wx)|, |s(wy)|, |s(xy)|}

is a lower bound for solving all other conflict triples involving neither u nor v. Thus, icp(uv) + b_{¬u¬v}(G) and icf(uv) + b_{¬u¬v}(G) are lower bounds for the costs induced when uv is set to "permanent" or "forbidden", respectively. A greedy strategy to construct CT_{¬u¬v} has running time O(n^3). Since this reduction rule is important for the initial reduction and the interleaving process, this improvement leads to better running times in practice.

A further lower bound uses the approach from above. We introduce

b_4(G) = max_{u,v∈V} { icp(uv) + b_{¬u¬v}(G) }

as a new lower bound for G. Since the conflict triples in CT_{¬u¬v} and all conflict triples containing uv are non-overlapping, this lower bound is correct. Again, we use a greedy approach to choose uv and to construct CT_{¬u¬v}. It is worth mentioning that the lower bounds b_3(G) and b_4(G) can be computed simultaneously and combined into a new lower bound b_5(G) = max{b_3(G), b_4(G)}.

Another approach for a lower bound is the relaxation of an integer linear program (ILP) for cluster editing to a linear program (LP). When recursing, we can update the LP problem, comparable to branch-and-bound methods for ILP. Finally, we can use upper bounds from approximation algorithms to compute lower bounds, dividing the upper bound by the approximation factor.
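As an illustration of the greedy construction behind b_3(G), the following sketch collects conflict triples that pairwise share no vertex pair and sums their resolution costs; the matrix representation and names are assumptions of this sketch, not the exact PEACE code:

#include <algorithm>
#include <vector>

// Greedy lower bound b3: scan all triples vuw with uv, uw present and vw
// absent; accept a conflict triple only if none of its three vertex pairs is
// already used, then charge its cheapest resolution.
double lowerBoundB3(const std::vector<std::vector<double>>& s) {
    const int n = (int)s.size();
    std::vector<std::vector<char>> used(n, std::vector<char>(n, 0));
    double bound = 0.0;
    for (int u = 0; u < n; ++u)                       // u is the triple's apex
        for (int v = 0; v < n; ++v)
            for (int w = v + 1; w < n; ++w) {
                if (u == v || u == w) continue;
                bool conflict = s[u][v] > 0 && s[u][w] > 0 && s[v][w] < 0;
                if (!conflict || used[u][v] || used[u][w] || used[v][w]) continue;
                used[u][v] = used[v][u] = 1;          // reserve all three pairs
                used[u][w] = used[w][u] = 1;
                used[v][w] = used[w][v] = 1;
                bound += std::min({s[u][v], s[u][w], -s[v][w]});
            }
    return bound;
}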


Chapter 5

Implementation

We implemented the results from the previous chapters in two software tools. First, we implemented a cluster reduction tool, which takes a graph instance as input and returns a reduced graph as result. This tool uses most of the reduction rules presented in Chapter 3. In detail, the following reduction rules are implemented:

• Rule 3.1.1: Delete all disjoint cliques
• Rule 3.1.2: Merging Rule
• Rule 3.1.3: Light Edge Rule
• Rule 3.1.6: Heavy Edge Rule
• Rule 3.1.4: Almost Clique Rule
• Rule 3.1.7: Critical Clique Rule

As a second tool we implemented a clustering tool named PEACE (Parameterized and Exact Algorithms for Cluster Editing), which implements the branching strategy introduced in Section 4.3, the reduction rules listed above, and the first two lower bounds presented in Section 4.5. PEACE takes a graph instance as input and returns an optimal clustering of the graph as well as the minimum modification costs. We implemented both tools in C++. We mention here that PEACE also includes an ILP approach for Weighted Cluster Editing; however, since this is not part of this thesis, we omit any details here.

In this chapter we give a brief overview of the implementation of the cluster reduction tool and PEACE. First we give a short description of our software and provide a short user manual. We also introduce the class hierarchy of our novel clustering tool PEACE and give some implementation details about important reduction rules.

5.1

User Manual

5.1.1

Graph File Format

A graph is represented in a straightforward file format. The first line contains the number of vertices |V| in the graph. The following |V| lines contain the names of the |V| vertices. After these setup lines, the triangle-shaped adjacency matrix with modification costs follows. The first line of the adjacency matrix contains the modification costs of the edges from the first vertex, whose name is indicated in line two, to all other vertices in the order of their appearance in the file. The second line of the matrix contains the modification costs of the edges from the second vertex to all other vertices except the first vertex, and so on. Note that negative values indicate that an edge is not present in the graph; the absolute value equals the modification costs. Example:

3
Lung
Liver
Heart
13.23 -4.58
3.16

This example graph consists of three vertices named Lung, Liver and Heart. Lung and Liver are connected by an edge with weight 13.23, Liver and Heart by an edge with weight 3.16. In contrast, Lung and Heart are not connected by an edge, and inserting this missing edge would imply costs of 4.58.
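A minimal reader for this format might look as follows; the names and the reduced error handling are illustrative only, not the code of our tools:

#include <fstream>
#include <string>
#include <vector>

// Reads: |V|, then |V| vertex names, then the upper-triangular cost matrix
// row by row (row i holds costs to vertices i+1, ..., |V|-1). A negative
// value marks an absent edge whose insertion costs its absolute value.
bool readGraphFile(const std::string& path, std::vector<std::string>& names,
                   std::vector<std::vector<double>>& s) {
    std::ifstream in(path);
    std::size_t n = 0;
    if (!(in >> n)) return false;
    names.resize(n);
    for (auto& name : names) in >> name;
    s.assign(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j) {
            in >> s[i][j];
            s[j][i] = s[i][j];                        // keep matrix symmetric
        }
    return static_cast<bool>(in);
}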

5.1.2

Result File Format

First note that result files are only written by PEACE. To make a result reproducible, the first line contains the program call. For each connected component of the input, the file contains the number of the input component, the names of all vertices, the set of modifications, the actual clusters, and the overall modification costs to obtain this result. Each vertex name is assigned a number to provide a more compact form of the modification set, in which edge modifications between vertices are displayed using these assigned numbers. The cluster result itself again contains the actual names of the vertices. Example:

./weightedclusterediting tissue.cm tissue.cm_output -X 2
Component 0
Vertices: 0=Lung, 1=Liver, 2=Heart

Modifications: del(1,2)
Component 1: Lung Liver
Component 2: Heart
Costs: 3.16

This example output is produced from the input file of the previous section. If more than one input component exists, a similar block is written for every input component. Our example consists of just one input component, numbered 0. In this case the optimal solution is to delete the edge between the vertices Liver and Heart. Hence, input component 0 is divided into two cliques: Lung and Liver form one clique, whereas Heart is a singleton.

5.1.3

Cluster Reduction Tool

As already mentioned, the cluster reduction tool takes a graph as input and returns a reduced graph as output. The executable file of the cluster reduction tool is "clusterreduction". The user has to call the program as follows:

./clusterreduction input-file output-file

The indicated input file must be in the graph file format introduced in Section 5.1.1. For every connected component, an output file is written in the graph file format. The components are numbered, and the file names end with ".cm". Note that a vertex produced by merging is named by the concatenated names of the merged vertices, separated by a vertical slash. The user gets further information on the standard output: the elapsed time, the number of connected components in the resulting file, and the number of vertices in the reduced graph. The running time of this tool is quite moderate, about a minute for a weighted graph with 100 vertices.

5.1.4

PEACE

The clustering tool PEACE takes a graph as input and returns a result file (see Section 5.1.2). It implements the fixed-parameter algorithm for Weighted Cluster Editing presented in Chapter 4. It starts with an initial parameter k, which is obtained automatically from the input graph. The parameter is then increased iteratively until a solution is obtained. The executable file is "weightedclusterediting". The user has to call the program as follows:

./weightedclusterediting [options] input(file) output-file

The input can be in different formats depending on the input mode, which can be changed by option "--mode" or "-X". Here we just mention that using option "-X 2", PEACE reads a graph file in the format presented in Section 5.1.1. PEACE supports more input formats, for example a file of edges. For more details concerning the input and other options see Table A.1 in the appendix.

The output file is written in the result file format that was presented in Section 5.1.2. The user gets further information about the current state of the program on the standard output.

5.2

Class Hierarchy

We implemented both tools in the object-oriented programming language C++. In this section we introduce the class hierarchy of both tools. Note that both tools share most of the classes, since the actual program core consists of the graph and the reduction rules. We first introduce the core classes of both tools: The main class of our software is WeightedProblemInstance. Its task is to contain and to reduce a Weighted Cluster Editing problem instance; therefore, it contains a CostsGraph object, the value of the current parameter, and objects which implement reduction rules. The CostsGraph class manages a graph and is able to merge vertices and to search for connected components in the graph. The class MergingReduction implements Reduction Rule 3.1.2 and is responsible for calculating the new modification costs in the graph after a merge operation. All following classes implement reduction rules and return a list of edges which should be set to either "permanent" or "forbidden" according to the reduction rule. An EdgeReduction object saves the current icf and icp costs introduced in Reduction Rule 3.2.1 and updates them after edge modifications. The Almost Clique Rule (3.1.4) is implemented in the class AlmostClique, and the Critical Clique Rule (3.1.7) is implemented in the class CriticalClique. Besides the reduction rule classes, the WeightedProblemInstance class itself implements the Light Edge Rule (3.1.3), the Heavy Edge Rule (3.1.6) and Reduction Rule 3.1.1. In a reduction cycle, the WeightedProblemInstance object calls some or all of its reduction rules and consequently sets some edges to "forbidden" and "permanent". If an edge is set to a certain value, the icp and icf costs in the EdgeReduction object are updated and, where necessary, vertices are merged in the CostsGraph object.

We now give more details about the WeightedProblemInstance class. It provides the following features:

• It is able to split itself and all contained objects depending on the connected components of the underlying graph.

• It provides three different functions to start the instance reduction, namely maxReduce, strongReduce and reduce.

• It provides a function which returns a global lower bound for this instance.

• It provides a function which returns the edge with minimum branching number.

The three reduction functions differ in their reduction power. Function maxReduce reduces an instance until no reduction with any reduction rule is possible anymore and is used in the initial reduction; in addition, it makes use of a lower bound to improve the icp and icf costs of Reduction Rule 3.2.1, as introduced in Section 4.5 and implemented in class EdgeReduction. In contrast, function reduce uses only parameter-dependent reduction rules and is used while traversing the search tree. Function strongReduce invokes a medium reduction and is used every now and then during the traversal of the search tree. All classes above except class EdgeReduction are used in the cluster reduction tool; PEACE uses all classes, since it also uses the parameter-dependent Reduction Rule 3.2.1.

As the second part of PEACE, we now introduce the actual search tree algorithm that computes the optimal solution: The SearchTreeWeighted class contains the search tree algorithm. A main function calls itself recursively for every node in the search tree, and every instance of this function contains its corresponding WeightedProblemInstance object. If the graph splits into two or more connected components, the SearchTreeWeighted object may create new SearchTreeWeighted objects for the components and solve them independently. Vital to branch-and-bound is the ability of a WeightedProblemInstance object to return a lower bound for the current instance; the edge at which the algorithm branches is likewise returned by the WeightedProblemInstance object in every node of the search tree. For more insight into the class hierarchy, see Figure B.1 or have a closer look at the commented source code.
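The following sketch, using the interface sketched above, shows the basic branch-and-bound recursion. It ignores search tree splitting and solution recovery, and is not the actual SearchTreeWeighted code.

    // Simplified recursion of the search tree algorithm (a sketch).
    bool solve(WeightedProblemInstance inst) {
        inst.reduce();                         // parameter-dependent reduction
        if (inst.lowerBound() > inst.remainingCosts())
            return false;                      // prune: no solution below here
        if (inst.isSolved())
            return true;                       // disjoint union of cliques
        Edge uv = inst.minBranchingEdge();     // edge with min branching number

        WeightedProblemInstance branch = inst; // branch 1: uv becomes permanent,
        branch.setPermanent(uv);               // its endpoints are merged
        if (solve(branch))
            return true;

        inst.setForbidden(uv);                 // branch 2: uv is deleted
        return solve(inst);
    }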

5.3

Implementation of Reduction Rules

We now give a brief insight into the implementation details of some reduction rules. In the implementation of Reduction Rule 3.2.1 in class EdgeReduction it is necessary to efficiently return edges with a high icp or icf value. One could assume that keeping the edges ordered by their icp or icf value in a priority queue achieves the best running time, since updating takes O(log n) and picking the maximum element takes constant time. In practice, however, it is not useful to keep the edges in order: instead, we search for the maximum in an unordered array in linear time and update in O(1) time. Note that in our case the icf and icp values are saved in a matrix, not in an array. In the same manner we search for the edge with minimum branching number in class BNManager.
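As an illustration, this linear scan could look as follows; the matrix layout and names are assumptions, not the actual EdgeReduction code.

    #include <utility>
    #include <vector>

    // Return the vertex pair with maximum icf (or icp) value by scanning
    // the unordered matrix: linear in the number of entries, O(1) updates.
    std::pair<int, int> maxEntry(const std::vector<std::vector<double> >& icf) {
        int n = static_cast<int>(icf.size());
        int bestU = 0, bestV = 1;
        for (int u = 0; u < n; ++u)
            for (int v = u + 1; v < n; ++v)
                if (icf[u][v] > icf[bestU][bestV]) {
                    bestU = u;
                    bestV = v;
                }
        return std::make_pair(bestU, bestV);
    }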

Furthermore, it is necessary to mention that the Almost Clique Reduction Rule 3.1.4, implemented in class AlmostClique, is not implemented exactly as introduced in Chapter 3: since an exhaustive search for suitable subgraphs G[C] in G is computationally expensive, we use a greedy strategy to find such “almost cliques”, which turned out to be quite effective. The greedy search starts at the vertex with the greatest connectivity value, where the connectivity value of a vertex u is defined as Σ_{v ∈ V\{u}} s(uv). Then the vertex with the greatest sum of weights to the vertices in the already created vertex set C is iteratively added to C; the iteration is aborted when this sum of weights drops below a certain threshold. For every created vertex set C we try to apply the Almost Clique Rule. On the set of remaining vertices V \ C we restart the greedy search, until every vertex has been part of at least one highly connected vertex set C.

In the following, we give some more details about the implementation of the Critical Clique Reduction Rule 3.1.7 from Chapter 3. In detail, we present a dynamic programming approach to efficiently find the maximum of (3.1) over all subsets Cu, Cv. We remind the reader of equation (3.1):

    s(uv) ≥ max_{Cu,Cv} min{ s(v, Cv) − s(v, Cu) + ∆v , s(u, Cu) − s(u, Cv) + ∆u }    (3.1)

If s(uv) is at least this maximum of (3.1), we can set uv to “permanent” using Reduction Rule 3.1.7. Recall that we separated V into W, Nu, Nv, where Nu and Nv denote the exclusive neighbors of u and v, respectively, and W := V − (Nu ∪ Nv ∪ {u, v}). Furthermore, ∆u and ∆v are defined as ∆u := s(u, Nu) − s(u, Nv) and ∆v := s(v, Nv) − s(v, Nu), and can be calculated in linear time. To solve this equation we have to find the maximum of (3.1) over every possible partition of vertices Cu, Cv ⊆ W with Cu ∩ Cv = ∅. To avoid the enumeration of all possible partitions, Böcker et al. [6] formalized this problem as follows: We are given a set B of tuples (x, y) of integers; for our reduction rule, we set B := {(s(u, w), s(v, w)) : w ∈ W}. We define Σx B := Σ_{(x,y)∈B} x as the sum of all x-values of the set B, and Σy B := Σ_{(x,y)∈B} y as the sum of all y-values. Since every tuple in B corresponds to a vertex w ∈ W, we have to assign every tuple in B to one of the three buckets B0, B1, B2 such that

    min{ Σx B1 − Σx B2 , Σy B2 − Σy B1 }    (5.1)

is maximized. By assigning a tuple (x, y) that corresponds to a vertex w to one of the three buckets, we decide whether w is a member of Cu, of Cv, or of neither of them. It is easy to check that maximizing equation (5.1) and adding ∆u and ∆v results in the same maximum as equation (3.1) from Chapter 3. The tuples in B0 are ignored, so clearly zero is a lower bound for the maximum. As already mentioned in Chapter 3, a trivial approach requires 3^|B| running time, which is obviously undesirable. In the following, we use this formulation of the Critical Clique Reduction Rule and a dynamic programming approach to find the maximum of (5.1). To define upper bounds for the size of our Boolean dynamic programming matrices, we set X := Σ_{(x,y)∈B} |x| and Y := Σ_{(x,y)∈B} |y|; hence, our dynamic programming matrices are D_j[−X . . . X, −Y . . . Y].

Let B = {(x1, y1), . . . , (xk, yk)} be the set of tuples. We set D_j[x, y] to ‘true’ if there exists a partition B0, B1, B2 of {(x1, y1), . . . , (xj, yj)} such that Σx B1 − Σx B2 = x and Σy B2 − Σy B1 = y. Clearly, D_0[x, y] is initially ‘true’ for (x, y) = (0, 0) and ‘false’ for any other tuple. Now, we can assign an element (xj, yj) to one of the three buckets:

    D_j[x, y] = D_{j−1}[x, y] or D_{j−1}[x + x_j, y − y_j] or D_{j−1}[x − x_j, y + y_j].    (5.2)

Clearly, D_{|W|} can be computed in time O(|W| X Y) and space O(X Y). Thus, the maximum of equation (5.1) equals

    max { min{x, y} : D_{|W|}[x, y] = ‘true’ }.    (5.3)

To find a solution for equation (3.1) we have to take the exclusive neighbors of u and v, and therewith ∆u and ∆v, into account. We compute D_{|W|} as above. If uv satisfies

    s(uv) ≥ max { min{x + ∆u, y + ∆v} : D_{|W|}[x, y] = ‘true’ },    (5.4)

then uv must be part of the optimal solution, so we can set uv to “permanent” and merge it using Reduction Rule 3.1.2. Clearly, if s(uv) ≤ min{∆u, ∆v} then uv cannot satisfy the above inequality; consequently, we can skip this edge and proceed with the next one. On the other hand, in case uv satisfies

    s(uv) ≥ ½ ( Σ_{w∈W} |s(uw) − s(vw)| + ∆u + ∆v ),

we can set uv to “permanent” without computing the dynamic programming table at all: since min{a, b} ≤ (a + b)/2, this bound dominates the maximum in (5.4). Unfortunately, the quadratic running time of the above dynamic programming is too slow for real-valued applications, in particular if we take real-valued edge weights into account and multiply every edge weight by a large constant to achieve higher precision. However, a closer analysis reveals that we can reduce both running time and space to linear, as follows: We define M_j[x] to be the maximal index y such that D_j[x, y] is ‘true’. We initialize M_0[0] = 0 and M_0[x] = −∞ for all x ≠ 0, and use the recurrence

    M_j[x] = max{ M_{j−1}[x], M_{j−1}[x − x_j] − y_j, M_{j−1}[x + x_j] + y_j },

and compute the maximum as max_x min{x, M_{|W|}[x]}. This maximum equals the maximum in (5.3), since an entry D_j[x, y] = ‘true’ dominates an entry D_j[x, y′] = ‘true’ for y > y′: no value other than the maximum entry of a column is used in the computation of (5.3). Hence, it suffices to always save just the maximum index y of each column x of D_j for which D_j[x, y] is ‘true’. We conclude running time O(|W| X) and space O(X) for the linear approach.
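For illustration, the linear-space recurrence translates into code roughly as follows. The sketch assumes the tuples have already been scaled and rounded to integers, and shifts the column range [−X, X] to array indices [0, 2X]; it is not the actual CriticalClique implementation.

    #include <algorithm>
    #include <climits>
    #include <cstdlib>
    #include <utility>
    #include <vector>

    // Compute max over reachable columns x of min{x + du, M[x] + dv},
    // where M[x] is the maximal reachable y of the tables M_j above.
    int criticalCliqueMax(const std::vector<std::pair<int, int> >& B,
                          int du, int dv) {
        int X = 0;
        for (std::size_t j = 0; j < B.size(); ++j)
            X += std::abs(B[j].first);
        const int NONE = INT_MIN / 2;              // marks unreachable columns
        std::vector<int> M(2 * X + 1, NONE);
        M[X] = 0;                                  // column x = 0, y = 0
        for (std::size_t j = 0; j < B.size(); ++j) {
            const int xj = B[j].first, yj = B[j].second;
            std::vector<int> next(M);              // bucket B0: tuple ignored
            for (int x = 0; x <= 2 * X; ++x) {
                if (x - xj >= 0 && x - xj <= 2 * X && M[x - xj] != NONE)
                    next[x] = std::max(next[x], M[x - xj] - yj);  // bucket B1
                if (x + xj >= 0 && x + xj <= 2 * X && M[x + xj] != NONE)
                    next[x] = std::max(next[x], M[x + xj] + yj);  // bucket B2
            }
            M.swap(next);
        }
        int best = NONE;
        for (int x = 0; x <= 2 * X; ++x)
            if (M[x] != NONE)
                best = std::max(best, std::min((x - X) + du, M[x] + dv));
        return best;  // uv can be merged if s(uv) >= best, cf. (5.4)
    }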

Concerning parameter-independent data reduction for real-valued graphs, we have to adapt our Critical Clique Reduction Rule 3.1.7 to real-valued weights,

since the dynamic programming approach cannot be applied directly to real-valued edge weights. We can multiply all weights of a real-valued weighted graph by an arbitrary constant c ∈ R without changing the optimal solution of the problem; clearly, in practice we use a moderate c to prevent long running times. The constant c thus allows us to trade off running time and space of our dynamic programming solution against the precision of our method. Let B = {(x1, y1), . . . , (xk, yk)} be the set of real-valued tuples. For x ∈ R we define rounding away from zero, round↑(x), by ⌈x⌉ for x ≥ 0 and ⌊x⌋ for x < 0; analogously, we define rounding towards zero, round↓(x), by ⌊x⌋ for x ≥ 0 and ⌈x⌉ for x < 0. Rounding the scaled tuples with these operators lets us run the integer dynamic program such that it still yields a valid, if slightly weaker, bound for the real-valued instance.

Finally, we describe how branching numbers are computed in our implementation. Since the logarithm is strictly increasing, x > y implies log x > log y for x, y > 0. So, a branching vector has minimum branching number if and only if the logarithm of its branching number is minimum.

We proceed as follows: we pre-compute branching numbers for branching vectors of the form (1, b) with b ≥ 1 and fit a rational function of cubic polynomials to the pre-computed values. In the following, we assume that we can rapidly access these values through a function lbn₁(x) : [1, ∞) → R, where lbn₁(x) is the logarithm of the branching number of the branching vector (1, x), or simply the log branching number. We can then compute the log branching number of an arbitrary branching vector (a, b) with a ≤ b as (1/a) · lbn₁(b/a). In detail, we used gnuplot [50] to fit two functions on different intervals to a set of log branching numbers with two decimal places. We obtained the following functions:

    f1(x) = (0.0157118x³ + 2.69316x² + 15.9154x + 9.12524) / (x³ + 12.911x² + 21.9904x + 4.13278),  x ∈ [1, 10]

    f2(x) = (0.00435418x³ + 3.67704x² + 74.3963x + 5.25471) / (x³ + 41.0281x² + 120.572x − 87.1522),  x ∈ [10, 40]

A validation set of log branching numbers with three decimal places was used to calculate the root mean square error, which was 8.21387 · 10⁻⁷ for function f1 and 1.22469 · 10⁻⁷ for f2. We further assume that f2 is sufficient to approximate log branching numbers greater than 40. Consequently, we define:

    lbn₁(x) = f1(x) for x ≤ 10, and lbn₁(x) = f2(x) otherwise.

Both functions are hard-coded in the source code of the algorithm, so that the branching number of a binary branching vector can be derived in constant time. In our implementation, the class BNManager returns the edge with minimum branching number; it uses the icp costs calculated in class EdgeReduction as the costs resulting from merging two vertices. The edge with minimum branching number is found by computing the branching numbers of all edges and taking the minimum.
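Hard-coded, these fits could look as follows; the coefficients are those of f1 and f2 above, while the function names are illustrative and not the actual source.

    #include <algorithm>

    // Log branching number of the branching vector (1, x), from the fits.
    double lbn1(double x) {
        if (x <= 10.0)
            return (0.0157118*x*x*x + 2.69316*x*x + 15.9154*x + 9.12524)
                 / (x*x*x + 12.911*x*x + 21.9904*x + 4.13278);
        return (0.00435418*x*x*x + 3.67704*x*x + 74.3963*x + 5.25471)
             / (x*x*x + 41.0281*x*x + 120.572*x - 87.1522);
    }

    // Log branching number of an arbitrary binary branching vector (a, b).
    double logBranchingNumber(double a, double b) {
        if (a > b) std::swap(a, b);   // ensure a <= b
        return lbn1(b / a) / a;
    }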



Chapter 6

Evaluation

In this chapter we present evaluation results. In detail, we show that our new branching strategy results in a dramatic improvement of practical running times compared to the trivial O(3^k) branching strategy. In addition, we show that the set of reduction rules presented in Chapter 3 is very powerful on biological datasets. All running times in the following sections were measured on an AMD Opteron-275 at 2.2 GHz with 3 GB of memory, running the operating system SunOS 5.1. Note that all statistical analyses and charts were produced with R [25], whereas graphs were visualized with the yEd Graph Editor [52, 49].

6.1

Comparison Against a Previous Version

We compared our method against the parameterized algorithm presented in [39], which implements the simple O(3^k) branching strategy. This previous version of our software does not implement any parameter-independent reduction rule except the Merging Reduction Rule 3.1.2. We omitted a comparison against the refined branching strategy introduced in Böcker et al. [5], since that approach is slower in practice than the O(3^k) branching strategy. To compare both algorithms we used two datasets introduced by Rahmann et al. [39]: the first consists of artificially created graphs, whereas the second consists of graphs derived from the COG database [47]. In the following, we introduce both datasets. The artificial graphs of size n are obtained by iteratively grouping k objects into a cluster until no elements are left, where k is selected randomly. The similarity values of objects within a cluster are drawn from the Gaussian distribution N(µ = 21, σ = 20), whereas similarity values of objects between clusters are drawn from N(µ = −21, σ = 20). This leads to a “noisy” graph such that the probability of seeing an undesired or missing edge is about 0.147 per vertex pair.
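A sketch of such a generator is given below; the range from which the cluster sizes are drawn is an assumption, since the description above only fixes the similarity distributions.

    #include <random>
    #include <vector>

    // Artificial instance generator: vertices are grouped into clusters of
    // random size; similarities are drawn from N(21, 20) within clusters
    // and from N(-21, 20) between clusters.
    std::vector<std::vector<double> > artificialGraph(int n, std::mt19937& rng) {
        std::uniform_int_distribution<int> clusterSize(1, 10);  // assumed range
        std::normal_distribution<double> mates(21.0, 20.0);
        std::normal_distribution<double> nonMates(-21.0, 20.0);

        std::vector<int> cluster(n);
        for (int i = 0, c = 0; i < n; ++c)           // assign cluster labels
            for (int k = clusterSize(rng); k > 0 && i < n; --k, ++i)
                cluster[i] = c;

        std::vector<std::vector<double> > s(n, std::vector<double>(n, 0.0));
        for (int u = 0; u < n; ++u)
            for (int v = u + 1; v < n; ++v)
                s[u][v] = s[v][u] = (cluster[u] == cluster[v])
                                   ? mates(rng) : nonMates(rng);
        return s;
    }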

We used a sample set with graphs of sizes 10 to 100 and tested both algorithms on it. The results are displayed in Table 6.1 and illustrated in Figure 6.1. We see that the simple O(3^k) approach easily solves instances up to size 50; for instances of sizes 60 and 70 it needs up to a day, and it is not applicable at all to instances of greater size. In contrast, the new clustering tool PEACE, which implements the O(1.83^k) strategy, easily solves instances up to size 100 in at most a few minutes.

    |V|   |E|         #inst  #edit  avg.costs  avg.edge  time 3^k   time 1.83^k
    10    11-30       10     8.3    95.8       24.0      10 ms      3 ms
    20    65-165      10     28.1   301.9      24.4      54 ms      14 ms
    30    138-296     10     66.7   671.2      23.8      1 s        163 ms
    40    251-533     10     115.5  1238.3     24.3      29.2 s     1.2 s
    50    402-821     10     183.2  1860.0     23.9      7.6 min    1.6 s
    60    515-1252    5      263    2742.3     24.0      27 h       32 s
    70    694-1911    5      351.6  3608.5     24.1      58 h*      43 s
    80    1141-2094   5      459    4719.4     24.2      19 days*   23 s
    90    1248-2969   5      594    6106.6     23.8      *          166 s
    100   1711-3157   5      728.6  7494.4     24.0      *          36 s

Table 6.1: Comparison of average running times of the O(1.83^k) edge branching strategy and the O(3^k) branching strategy from [5] on artificial data; 10 instances per bucket for sizes 10–50, 5 instances for sizes 60–100. ‘#inst’ is the number of instances in the size range, ‘#edit’ the average number of edit operations, ‘avg. costs’ the average total costs, and ‘avg. edge’ the average cost per edge. For size 70 (80, 90, 100), one (four, all five, all five) instance(s) did not stop after 20 days of computation using the O(3^k) strategy (*). For the average running times, we ignored these unfinished instances.

In a second evaluation we used the protein sequences of the 66 organisms from the COG database [47] (http://www.ncbi.nlm.nih.gov/COG). Rahmann et al. [39] obtained this dataset using bi-directional all-against-all BLAST hits. The similarity score s(u → v) is the sum of the negative logarithms to base 10 of the highest-scoring BLAST hits of u against v, and the symmetric similarity score is obtained as s(uv) := min{s(u → v), s(v → u)} − T, where T denotes a threshold. For this dataset, Rahmann et al. used a threshold of T = 10, corresponding to an E-value of 10⁻¹⁰. The resulting similarity matrix corresponds to a graph with 42563 (trivially transitive) connected components of sizes 1 and 2, and 8037 larger components, of which 3964 are not transitive; this is the input for our algorithms. The results are displayed in Table 6.2 and illustrated in Figure 6.2. We see that the simple O(3^k) approach easily solves instances up to size 50. For instances of size greater than 50 the O(3^k) algorithm reveals its first weaknesses, and some instances of size greater than 80 could not be solved at all; thus, the O(3^k) algorithm is not applicable to relatively large instances. In contrast, the new clustering tool PEACE, which implements the O(1.83^k) strategy, easily solves instances of size up to 100 in at most a few seconds.

[Plot: running time in sec (log scaled) vs. graph size, with curves for the 3^k and 1.83^k strategies.]

Figure 6.1: Comparison of average running times of the O(1.83^k) edge branching strategy and the O(3^k) branching strategy from [5]. Running times are obtained from artificial data.

Concluding from the above results, it is obvious that the O(1.83^k) strategy is also much faster than the simple O(3^k) strategy in practice: in detail, we achieved a speed-up of up to 100000. In the next section we give a reasoning for this tremendous speed-up and show that the speed of PEACE is due not only to the branching strategy but can also be attributed to the effects of our reduction rules.

6.2

The Power of Reduction Rules

Reduction rules are central in parameterized complexity: if no reduction rule were known for a problem, the practical running time would nearly equal that of an ordinary non-parameterized algorithm. In this section we show that the set of reduction rules introduced in Chapter 3, in particular the parameter-independent reduction rules, leads to a tremendous reduction of the input instances. We choose the “Leukemia” dataset from Monti et al. [33] as a good example to show the capability of the reduction rules.

    |V|      |E|          #inst  #edit   avg.costs  time 3^k  time 1.83^k
    3-9      2-35         1963   2.1     6.0        2 ms      1.0 ms
    10-19    9-170        797    10.4    31.9       12 ms     2.0 ms
    20-29    33-405       332    28.4    89.0       103 ms    8.3 ms
    30-39    62-715       179    58.6    193.0      680 ms    27 ms
    40-49    143-1172     181    96.0    312.0      91 s      66 ms
    50-59    204-1603     128    133.1   412.7      24 min    471 ms
    60-69    191-2214     91     184.1   649.6      47 min    186 ms
    70-79    353-3078     65     208.6   736.0      6.8 h     1.9 s
    80-89    266-3388     31     323.4   1340.0     3.7 h*    2.1 s
    90-99    494-4504     26     370.2   1386.6     49 s*     1.5 s
    101-139  452-7693     65     585.5   2167.8     -         11.7 s
    140-179  1308-12745   22     1002.7  4324.8     -         1.3 h
    180-217  2183-14969   20     1535.2  6636.4     -         6 min
    221-254  2160-25594   10     2641.1  8845.7     -         25 h*
    260-298  3489-31165   16     2455.2  12495.8    -         14 min*

Table 6.2: Comparison of average running times of the O(1.83^k) edge branching strategy and the O(3^k) branching strategy from [5] for protein similarity data. The columns are analogous to Table 6.1. Three instances of sizes 84, 91 and 98 did not stop after 13 days of computation using the O(3^k) strategy (*); two instances of sizes 221 and 275 did not stop after 13 days using the O(1.83^k) strategy (*). We ignored these unfinished instances in the average running times. For computational reasons, instances of size greater than 100 were not run with the O(3^k) strategy.

In Section 7.1 we introduce this dataset in more detail. Figure 6.3 displays the “Leukemia” dataset as a similarity graph, together with its reduction by our cluster reduction tool. As introduced in Chapter 5, the cluster reduction tool implements the following reduction rules: Rule 3.1.1 (delete all disjoint cliques), the Merging Rule 3.1.2, the Light Edge Rule 3.1.3, the Heavy Edge Rule 3.1.6, the Almost Clique Rule 3.1.4 and the Critical Clique Rule 3.1.7. The combination of all these reduction rules leads to the reduced instance displayed in Figure 6.3: the input graph is reduced from its original size 38 to size 8, where the greatest remaining instance has size 7. All reduction rules are necessary to reduce an instance this effectively; indeed, it is their combination which leads to this result. Figure 6.4 shows reductions with different combinations of parameter-independent reduction rules. It is worth mentioning that the Critical Clique Rule has the greatest reduction power: this rule alone is already capable of reducing the “Leukemia” dataset down to size 8. However, using only this rule is not efficient, due to its high running time. By using a combination of cheap but less effective and expensive but more effective reduction rules, one can achieve the best results, as displayed in Table 6.3.

[Plot: running time in sec (log scaled) vs. graph size (log scaled), with curves for the 3^k and 1.83^k strategies.]

Figure 6.2: Comparison of average running times of the O(1.83^k) edge branching strategy and the O(3^k) branching strategy from [5]. Running times are obtained from protein instances of the COG database.

To get an overall estimate of the power of the reduction rules, we applied our cluster reduction tool to the COG protein similarity graphs from the dataset of Rahmann et al. [39] introduced in the previous section. The two plots in Figure 6.5 display the result of the reduction. The average reduction of the protein similarity instances is about 82%. This result is mainly due to most instances of this dataset having size less than 80; for larger instances, our reduction rules show some weaknesses. Usually we decide whether a reduction rule applies by summing over certain edit costs. Clearly this sum becomes larger if the graph contains more vertices, so it is less likely that a reduction rule applies in a large graph. The red line in the second plot of Figure 6.5 shows that the reduction ratio becomes significantly smaller for large instances. Note that the reduction with our cluster reduction tool requires only seconds or minutes.



Figure 6.3: Reduction of the “Leukemia” dataset from Monti et al. [33]. Left: original instance. Right: reduced instance using all implemented parameter-independent reduction rules.

    Reduction Rules used                      red. size  time
    Simple                                    33         3 ms
    Simple + Almost Clique                    24         13 ms
    Simple + Almost Clique + Critical Clique  8          96 ms
    Critical Clique only                      8          431 ms

Table 6.3: Reductions of the “Leukemia” dataset from Monti et al. [33] with different combinations of our reduction rules. The left column indicates which reduction rules are used, where “Simple” denotes the combination of the two simple reduction rules, namely the Heavy Edge Rule and the Light Edge Rule. ‘red. size’ is the size of the instance after the reduction and ‘time’ the running time of our cluster reduction tool. It is easy to see that the Critical Clique Rule is very powerful, since it achieves the same reduction as the combination of all rules together. However, it is more efficient to use the combination of all of them, since the running time is significantly smaller.



Figure 6.4: Various reductions of the “Leukemia” dataset from Monti et al. [33]. The upper left graph shows the original input instance (38 vertices). The upper right graph was reduced using only the simple parameter-independent rules, namely the Heavy Edge Rule and the Light Edge Rule (33 vertices). For the reduced graph in the lower left corner we used the Almost Clique Rule in addition (24 vertices). Finally, we added the Critical Clique Rule to achieve the greatest reduction, as displayed in the lower right graph (8 vertices). Running times for each reduction are listed in Table 6.3.


[Two plots: running time in sec vs. graph size (log scaled), and reduction ratio in % vs. graph size (log scaled), with the average marked.]

Figure 6.5: Reduction of the COG protein similarity graphs from [39]. The upper plot shows that the reduction requires only seconds or minutes. The lower plot shows that our cluster reduction tool reduced the protein similarity instances by 82% on average; the red line shows that the reduction ratio becomes significantly smaller for larger instances.


Chapter 7

Clustering of Biological Data

In this chapter we take a closer look at the applicability of Weighted Cluster Editing for clustering biological data. We focus on microarray data, since this kind of data is vital to molecular biology and genetics. We evaluated our clustering tool PEACE on six microarray datasets, which we preprocessed to apply our technique. First, we introduce the six datasets. We then give details about the preprocessing steps necessary to transform microarray data into a suitable similarity graph. Finally, we show in a results section that PEACE is capable of clustering microarray data and often outperforms other well-known clustering methods.

7.1

Used Datasets

We present six datasets from Monti et al. [33]. For all datasets we know the “true” clustering solution.

• “Leukemia” [15] contains 38 samples monitored over 99 genes. This dataset contains bone marrow samples from acute leukemia patients and can be divided into three classes. Note that we already used this dataset in the previous chapter.

• “Novartis-tissue” [46] contains 103 samples monitored over 1000 genes. This dataset contains tissue samples from four distinct cancer types.

• “St. Jude” [53] contains 248 samples monitored over 985 genes. This dataset consists of bone marrow samples from pediatric acute leukemia patients and can be divided into six classes.

• “Lung cancer” [4] contains 197 samples monitored over 1000 genes. This dataset contains lung cancer samples from four known classes, where one class is highly heterogeneous and not well understood.

• “CNS tumor” [37] contains 48 samples monitored over 1000 genes. This dataset comprises samples from embryonal tumors of the central nervous system and can be classified into five classes.

• “Normal tissue” [40] contains 99 samples monitored over 1277 genes. This dataset consists of 13 distinct tissue types, including breast, lung and pancreas.

We used these datasets to evaluate PEACE.

7.2

Preprocessing and Distance Measures

All datasets have been normalized first, an essential step in clustering: it prevents, for instance, that a clustering method groups samples by a single highly expressed gene which is not significant for these samples. For our application, all datasets have been row- and column-normalized such that every row and column sums to 0 (i.e., has mean 0) and has standard deviation 1. This can be achieved by alternately normalizing rows and columns until convergence, using the normalization

    x_i ← (x_i − µ_x̄) / σ_x̄,

where x̄ = (x1, x2, . . . , xn)^T denotes the data vector, µ_x̄ its mean and σ_x̄ its standard deviation.
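A sketch of this alternating normalization is given below; the use of the population standard deviation and of a fixed number of passes in place of a convergence test are both assumptions.

    #include <cmath>
    #include <vector>

    typedef std::vector<std::vector<double> > Matrix;

    // Alternately normalize rows and columns to mean 0 and standard
    // deviation 1; a fixed iteration count stands in for a convergence test.
    void rowColumnNormalize(Matrix& m, int iterations = 100) {
        const size_t rows = m.size(), cols = m[0].size();
        for (int it = 0; it < iterations; ++it) {
            for (size_t i = 0; i < rows; ++i) {            // normalize row i
                double mu = 0, sd = 0;
                for (size_t j = 0; j < cols; ++j) mu += m[i][j];
                mu /= cols;
                for (size_t j = 0; j < cols; ++j) sd += (m[i][j]-mu)*(m[i][j]-mu);
                sd = std::sqrt(sd / cols);
                for (size_t j = 0; j < cols; ++j) m[i][j] = (m[i][j] - mu) / sd;
            }
            for (size_t j = 0; j < cols; ++j) {            // normalize column j
                double mu = 0, sd = 0;
                for (size_t i = 0; i < rows; ++i) mu += m[i][j];
                mu /= rows;
                for (size_t i = 0; i < rows; ++i) sd += (m[i][j]-mu)*(m[i][j]-mu);
                sd = std::sqrt(sd / rows);
                for (size_t i = 0; i < rows; ++i) m[i][j] = (m[i][j] - mu) / sd;
            }
        }
    }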

To apply distance-based clustering methods such as ours, a similarity matrix has to be constructed from the raw feature data. As introduced in Section 1.2, there exists a plenitude of distance measures; the literature usually recommends the dot product or Pearson’s correlation coefficient. In this thesis we consider both, but also the Euclidean and Manhattan distances. Many clustering methods use a similarity matrix as input but additionally need the number of clusters as input. Our clustering approach uses the Weighted Cluster Editing problem and therefore needs no further information about the actual solution, and is in this sense unbiased. However, the similarity matrix first needs to be transformed into a weighted similarity graph. To this end, a threshold is defined, and the edge weight is obtained by subtracting the threshold from the similarity value: positive values result in an edge, whereas negative values indicate that the edge is not present in the graph. Clearly, the choice of the threshold is essential for the clustering, since it determines the connectivity of the graph and thereby, to some extent, the number of clusters. Given meta knowledge, the threshold can be estimated manually; however, such knowledge is often not available.

Following the suggestions of Sharan et al. [44], we use a probabilistic assumption to estimate the threshold. Sharan et al. [44] observed for biological data that pairwise similarity values between elements are normally distributed, and that similarity values between non-mates (elements of different clusters) are, on average, smaller than those between mates (elements of the same cluster). Let µT denote the mean and σT² the variance of the similarity values between mates; analogously, µF denotes the mean and σF² the variance of the similarity values between non-mates. Furthermore, p_mates denotes the probability of two vertices being mates. These five values are the parameters of the two-component Gaussian mixture model and can be estimated using an Expectation Maximization (EM) algorithm. We use this probabilistic model to adapt the similarity values: the final adapted similarity value w_ij is obtained by calculating the log-odds of being a mate edge versus a non-mate edge,

    w_ij = log [ p_mates · f(S_ij | i, j are mates) / ((1 − p_mates) · f(S_ij | i, j are non-mates)) ],

where f denotes the corresponding probability density function of being a mate or non-mate, and S_ij denotes the original similarity value. The modification cost of an edge equals the adapted weight; thereby, uncertain edges get modification costs close to zero. A sample estimation of the probability function parameters for the “Leukemia” dataset is displayed in Figure 7.1. Using log-odds as similarity values has another advantage for our method: since we use a quotient of probabilities, the obtained modification costs for some vertex tuples are very high. For instance, suppose a tuple in Figure 7.1 has weight 200; since the probability of being a non-mate is nearly zero for this tuple, the resulting edge weight is going to be very large. Tuples with very high modification costs are central to some of our reduction rules and are often reduced by them. Thus, using log-odds as similarity values also yields a speed-up for PEACE.
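As a sketch, given the five estimated mixture parameters, the adapted weight of an edge could be computed as follows:

    #include <cmath>

    // Gaussian probability density.
    static double gauss(double x, double mu, double sigma) {
        const double PI = 3.14159265358979323846;
        const double z = (x - mu) / sigma;
        return std::exp(-0.5 * z * z) / (sigma * std::sqrt(2.0 * PI));
    }

    // Log-odds edge weight from the two-component Gaussian mixture; the
    // five parameters are assumed to come from the EM estimation above.
    double logOddsWeight(double s, double pMates,
                         double muT, double sigmaT,    // mates
                         double muF, double sigmaF) {  // non-mates
        return std::log((pMates * gauss(s, muT, sigmaT)) /
                        ((1.0 - pMates) * gauss(s, muF, sigmaF)));
    }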

7.3

Results

Monti et al. [33] evaluated three unsupervised clustering methods on these datasets: hierarchical clustering (HC), consensus clustering with hierarchical clustering (CC-HC), and consensus clustering with self-organizing maps (CC-SOM). As a comparison method they used a naive-Bayes (NB) classifier, a supervised learning technique. Note that the naive-Bayes classifier was trained, by leave-one-out cross-validation, on the dataset on which it was tested, and can be seen as the “gold standard” method. In addition, we applied two other state-of-the-art clustering methods to all six datasets, namely CAST [3] (implemented in the TIGR MultiExperiment Viewer 4.0, http://www.tm4.org/mev.html) and CLICK [43] (implemented in EXPANDER 3.2, http://acgt.cs.tau.ac.il/expander/). We used CAST with two distance metrics, Euclidean distance and Pearson correlation.

[Plot: density / probability versus edge values, showing density and probability functions for mates and non-mates.]

Figure 7.1: Estimated probability functions from the “Leukemia” dataset are displayed with solid lines. The dashed lines indicate the densities of mate and non-mate edges, which we aimed to estimate.

The affinity threshold of CAST was optimized in steps of 0.05. CLICK required resulting clusters to contain at least 15 elements; consequently, it returned no clustering in three cases. As comparison measure we used the adjusted Rand index [24]. It measures the agreement between partitions with potentially different numbers of clusters by measuring the ratio of correctly clustered elements against all elements; to this end, a confusion matrix is calculated (for more details see the appendix of [33]). An adjusted Rand index of 1 corresponds to perfect agreement, whereas 0 corresponds to a random partition. The adjusted Rand index has frequently been recommended in the literature for cluster comparison, see for instance [32].
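For reference, with n_ij denoting the entries of the confusion matrix, a_i and b_j its row and column sums, n the number of elements, and C(m, 2) = m(m − 1)/2, the adjusted Rand index of Hubert and Arabie [24] can be written as:

    ARI = [ Σ_ij C(n_ij, 2) − Σ_i C(a_i, 2) · Σ_j C(b_j, 2) / C(n, 2) ]
          / [ ½ ( Σ_i C(a_i, 2) + Σ_j C(b_j, 2) ) − Σ_i C(a_i, 2) · Σ_j C(b_j, 2) / C(n, 2) ]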

Concerning the different distance measures, we observed that dot products and Pearson correlation always result in the same clustering. The Manhattan distance was usually the worst distance measure. However, in contrast to what is reported in the literature, the Euclidean distance achieves good results. In the following, we concentrate on the results obtained with the dot product and the Euclidean distance, since they had the best quality.

We observed that PEACE shows very good performance on the four small datasets regarding clustering quality. In three of the four cases it outperforms all other clustering methods; in the fourth case it returned the second-best clustering, see Figure 7.2. In two cases our clustering is even of better quality than the “gold standard” naive-Bayes supervised learning. For more details concerning the clustering quality see Table 7.1.


    Dataset          size  classes  NB     HC     CC-HC  CC-SOM  CAST   CLICK  PEACE Eucl.  PEACE Dot    PEACE Manh.
    Leukemia         38    3        1.0    0.648  0.648  0.721   0.783  n/a    0.722 (4.)   0.919 (1.)   0.742 (3.)
    Novartis-tissue  103   4        0.946  0.830  0.921  0.897   0.800  0.909  0.958 (1.)   0.934 (2.)   0.896 (6.)
    CNS tumors       42    5        0.632  0.628  0.549  0.429   0.501  n/a    0.604 (3.)   0.611 (2.)   0.452 (6.)
    Normal tissues   99    13       0.655  0.457  0.457  0.214   0.545  n/a    0.680 (1.)   0.622 (2.)   0.524 (4.)
    St. Jude         248   6        0.971  0.949  0.948  0.825   0.848  0.916  0.853 (4.)   n/a          0.665 (7.)
    Lung cancer      197   4+       0.904  0.307  0.310  0.233   0.263  0.200  0.097 (7.)   0.140 (6.)   0.046 (8.)

Table 7.1: Adjusted Rand indices for the gene expression data from Monti et al. [33]. Recall that an adjusted Rand index of 1 is optimal, whereas 0 corresponds to a random clustering. ‘(rank)’ denotes the rank of PEACE among all analyzed clustering methods. We observe that using dot products as distance measure leads to consistently good results. The Euclidean distance even outperforms the supervised learning method in two cases, which is a remarkable result.

[Barplot: adjusted Rand index (0.0–1.0) for Monti, CAST, CLICK and WCE on each of the datasets Leukemia, Novartis, CNS, Normal, St. Jude and Lung Cancer.]

Figure 7.2: Barplot of adjusted Rand indices of different clustering methods on the used datasets. “Monti” represents the best adjusted Rand index achieved by HC, CC-HC and CC-SOM; “WCE” represents the best adjusted Rand index achieved by our clustering tool PEACE.

In the following, we take a closer look at the clustering results. For the “Leukemia” dataset we observe that, using dot products as distance measure, acute myelogenous leukemia (AML, samples 27 to 37) and acute lymphoblastic leukemia (ALL, samples 0 to 26) have been classified correctly. The ALL class can be further divided into T- and B-cell subtypes. Except for sample number 5, which is incorrectly assigned to the ALL-T cluster, all samples have been classified correctly as ALL-T or ALL-B; a closer look at the similarity matrix reveals that sample number 5 has high similarity values to both of these clusters. The corresponding similarity graph is displayed in Figure 6.3. The performance on the “St. Jude” dataset was only of average quality. Here PEACE merges the T-ALL lineage gene cluster and the MLL chimeric gene cluster, which are separated in the correct solution even though they show high similarity to each other; ignoring this error and merging both clusters in the true solution, PEACE would even reach an adjusted Rand index of 0.912. The “Lung cancer” dataset was clustered very poorly. It consists of three small clusters plus a large (139 samples), highly heterogeneous cluster; Figure C.1 in the appendix shows that this cluster is also highly heterogeneous in the similarity graph. PEACE reconstructs the small clusters correctly, but splits the large cluster into smaller clusters. It remains to mention that this large cluster is not well understood. The heat plot in Figure 7.3 reveals the

complexity of this dataset.

Figure 7.3: Heat plot of the “Lung cancer” dataset from Monti et al. [33]. It consists of three well-defined clusters on the lower left and one highly heterogeneous cluster on the upper right.

In addition, we observed that the running time for the four small datasets is very low (see Table 7.2), which can be explained by the strong initial data reduction: to obtain a solution for the small datasets, PEACE computed for only seconds. In contrast to these promising results, the computation for the two large datasets takes more than a day; these datasets mark the limits of our clustering tool PEACE concerning both running time and clustering quality. Finally, we conclude that Weighted Cluster Editing can be used to cluster biological data. Especially for small datasets with no heterogeneous clusters, PEACE shows very good performance concerning cluster quality while having reasonable running times.


    Dataset           size  #edit  time    PIDR extent  new size  clustering 3^k  clustering 1.83^k
    Leukemia          38    100    96 ms   81 %         7         26 ms           5 ms
    Novartis-tissues  103   282    4.8 s   25 %         77        7.1 h           17.8 s
    St. Jude (Eucl.)  248   2649   88.5 s  0 %          248       *               5.0 h
    Lung cancer       197   1103   14.8 s  11 %         176       *
    CNS tumors        42    132    312 ms  19 %         34        63.6 s          1.5 s
    Normal tissues    90    228    595 ms  51 %         44        15.3 h          11.0 s

Table 7.2: Running times for the gene expression data from Monti et al. [33], using Euclidean distance (“St. Jude”) or the dot product (all others) as similarity measure. ‘#edit’ is the number of edit operations, ‘PIDR’ denotes parameter-independent data reduction, ‘new size’ is the number of vertices left after parameter-independent data reduction, and the ‘clustering’ columns give the actual time needed to obtain a solution. Instances marked (*) could not be computed within 7 days.



Chapter 8

Conclusion

Finally, we conclude this work by summarizing our results and discussing open problems.

8.1

Summary

In this thesis we concentrated on the Weighted Cluster Editing problem and its applicability to biological clustering. Besides theoretical results, we presented the novel clustering tool PEACE, based on new reduction rules and a new branching strategy. We showed in Chapter 3 that merging vertices with Reduction Rule 3.1.2 leads to a smaller problem kernel for Integer Weighted Cluster Editing, thereby improving our previous result from [5]. Beyond this, we presented new reduction rules which reduce problem instances efficiently, as shown in Section 6.2. In Chapter 4 we introduced two new branching strategies for the Weighted Cluster Editing problem. We provided a simple proof that branching at an edge with minimal branching number leads to a search tree of size O(2^k) for integer-weighted graphs; moreover, we refined the proof and showed that the same simple branching strategy even leads to a search tree of size O(1.83^k). With this, we provided the fastest known algorithm for Weighted Cluster Editing and unweighted Cluster Editing. In addition, we introduced three new lower bounds to achieve a greater speed-up in practice. We implemented our new branching strategies, lower bounds and reduction rules in the novel clustering tool PEACE, which solves the Weighted Cluster Editing problem efficiently. In Chapter 5 we provided a user manual and an introduction to implementation details. In Chapter 6 we showed that PEACE, which implements the O(1.83^k) branching strategy, achieves a speed-up of up to 100000 compared to a previous version of our tool presented in [5]; moreover, we provided a further reasoning for this tremendous speed-up by analyzing the power of our reduction rules. Finally, we applied PEACE to six expression datasets from Monti et al. [33]

in Chapter 7. We observed that PEACE often outperforms other clustering methods with respect to clustering quality while achieving reasonable running times. We thereby demonstrated the relevance of our clustering tool PEACE, especially for small datasets.

8.2

Open problems

Besides algorithm engineering, several things remain to be done to improve PEACE:

A more involved parameter-independent reduction rule. All presented reduction rules depend strongly on the size of the graph: since every rule iterates over a certain vertex set and sums up costs, this sum gets larger for larger graphs. A promising approach would be an adaptation of the critical clique concept [18], which would enable our software to merge highly connected subgraphs. Another idea is the counterpart of our Critical Clique Rule 3.1.7: the Critical Clique Rule merges vertices with almost identical neighborhoods and thereby merges vertices within clusters; a new rule could detect vertices with very different neighborhoods and thereby delete edges between clusters.

More refined analysis of branching by edge merging. We believe that the branching strategy introduced here, branching by edge merging, leads to an even smaller search tree size than O(1.83^k); this is suggested by the enormous speed-up in practice. An even more refined proof is necessary to establish this conjecture.

Using LP as lower bound. As mentioned in Section 4.5, we would like to implement a linear programming approach to Weighted Cluster Editing to calculate a lower bound while traversing the search tree. We believe that this results in a tighter lower bound and could improve the recursion and the efficiency of some reduction rules.

Web interface. To make PEACE available to a broader range of users, a web interface would be highly desirable. Such an interface could take an expression data matrix as input and automatically create a job for a cluster architecture; the clustering result would then be returned to the user by email, and computationally expensive jobs would be deleted after a certain time of computation.

Application to unweighted Cluster Editing. It would be interesting to observe the performance of PEACE on unweighted Cluster Editing instances, concerning both running time and reduction power. To do so, one can apply the 4k kernelization presented by Guo [18] in advance. Since we merge vertices, the resulting graph is weighted and can be solved by our tool. We believe that we can then solve problem instances even with large parameter k, say greater than 500. We will investigate further in this direction.

More robust preprocessing. First, one has to think about how to deal with missing data in the microarray matrix. Second, and much more important, the edge value estimation using an EM algorithm introduced in Chapter 7 has to be modified. We assume that vertex tuples within clusters are more similar than between clusters, which is correct to some extent. However, some clusters reveal a smaller similarity than other clusters, and the similarity scores between different pairs of clusters differ from each other. Clearly, the similarity scores between two specific clusters and within a specific cluster are normally distributed, which often results in a slightly sloppy overall distribution curve. Thus, one has to consider distributions other than a single normal distribution to estimate the similarity values between mates and non-mates.

8.3

Acknowledgment

I would like to thank my supervisor Professor Dr. Sebastian Böcker, as well as Anke Truß and Quang Bao Anh Bui, for their support and helpful comments. Thanks also to Svenja Simon and Thilo Muth for their supportive work. Moreover, I would like to thank my friends for staying patient while I was annoying them with ideas and progress reports.



Bibliography

[1] N. Ailon, M. Charikar, and A. Newman. Aggregating inconsistent information: ranking and clustering. Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, pages 684–693, 2005.

[2] N. Bansal, A. Blum, and S. Chawla. Correlation Clustering. Machine Learning, 56(1):89–113, 2004.

[3] A. Ben-Dor, R. Shamir, Z. Yakhini, et al. Clustering Gene Expression Patterns. Journal of Computational Biology, 6(3/4):281–297, 1999.

[4] A. Bhattacharjee, W. Richards, J. Staunton, C. Li, S. Monti, P. Vasa, C. Ladd, J. Beheshti, R. Bueno, M. Gillette, et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proceedings of the National Academy of Sciences, 98(24):13790, 2001.

[5] S. Böcker, S. Briesemeister, Q. B. A. Bui, and A. Truß. A fixed-parameter approach for Weighted Cluster Editing. To appear in Proc. of Asia-Pacific Bioinformatics Conference (APBC 2008), Kyoto, Japan.

[6] S. Böcker, S. Briesemeister, Q. B. A. Bui, and A. Truß. PEACE: Parameterized algorithms for Cluster Editing. Submitted to International Workshop of Exact and Parameterized Computation (IWPEC 2008).

[7] L. Cai. Fixed-parameter tractability of graph modification problems for hereditary properties. Inf. Process. Lett., 58(4):171–176, 1996.

[8] M. Charikar, V. Guruswami, and A. Wirth. Clustering with qualitative information. Journal of Computer and System Sciences, 71(3):360–383, 2005.

[9] Z.-Z. Chen, T. Jiang, and G. Lin. Computing phylogenetic roots with bounded degrees and errors. SIAM Journal on Computing, 32(4):864–879, 2003.

[10] P. Damaschke. On the fixed-parameter enumerability of Cluster Editing. Proc. of 31st WG, 3787:283–294, 2005.

[11] R. Downey and M. Fellows. Parameterized Complexity. Springer New York, 1999.

[12] P. Elias, A. Feinstein, and C. Shannon. A note on the maximum flow through a network. Information Theory, IEEE Transactions on, 2(4):117–119, 1956.

[13] M. R. Fellows. The lost continent of polynomial time: Preprocessing and kernelization. In IWPEC, pages 276–277, 2006.

[14] M. R. Fellows, M. A. Langston, F. Rosamond, and P. Shaw. Polynomial-time linear kernelization for Cluster Editing. Manuscript, 2006.

[15] T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M. Loh, J. Downing, M. Caligiuri, et al. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, 286(5439):531, 1999.

[16] J. Gramm, J. Guo, F. Hüffner, and R. Niedermeier. Automated Generation of Search Tree Algorithms for Hard Graph Modification Problems. Algorithmica, 39(4):321–347, 2004.

[17] J. Gramm, J. Guo, F. Hüffner, and R. Niedermeier. Graph-modeled data clustering: Fixed-parameter algorithms for clique generation. Proc. 5th CIAC, 2653:108–119, 2005.

[18] J. Guo. A more effective linear kernelization for Cluster Editing. Lecture Notes in Computer Science (LNCS), 4614:36–47, 2007.

[19] L. Hagen and A. Kahng. New spectral methods for ratio cut partitioning and clustering. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 11(9):1074–1085, 1992.

[20] J. Hartigan. Clustering Algorithms. John Wiley & Sons, Inc. New York, NY, USA, 1975.

[21] E. Hartuv, A. Schmitt, J. Lange, S. Meier-Ewert, H. Lehrach, and R. Shamir. An algorithm for clustering cDNA fingerprints. Genomics, 66(3):249–256, 2000.

[22] E. Hartuv and R. Shamir. A clustering algorithm based on graph connectivity. Information Processing Letters, 76(4-6):175–181, 2000.

[23] P. Hensgen. Umbrello UML Modeller.

[24] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, 1985.

[25] R. Ihaka and R. Gentleman. R: A Language for Data Analysis and Graphics. Journal of Computational and Graphical Statistics, 5(3):299–314, 1996.

[26] A. Jain, M. Murty, and P. Flynn. Data clustering: a review. ACM Computing Surveys (CSUR), 31(3):264–323, 1999.

[27] R. Knippers. Molekulare Genetik. Thieme, 2001.

[28] T. Kohonen. Self-Organizing Maps. Springer, 2001.

[29] M. Křivánek and J. Morávek. NP-hard problems in hierarchical-tree clustering. Acta Informatica, 23(3):311–323, 1986.

[30] O. Kullmann and H. Luckhardt. Deciding propositional tautologies: Algorithms and their complexity. Preprint, 1997.

[31] J. MacQueen. On convergence of k-means and partitions with minimum average variance. Ann. Math. Statist, 36:1084, 1965.

[32] G. Milligan and M. Cooper. A study of the comparability of external criteria for hierarchical cluster analysis. Multivariate Behavioral Research, 21(4):441–458, 1986.

[33] S. Monti, P. Tamayo, J. Mesirov, and T. Golub. Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data. Machine Learning, 52(1):91–118, 2003.

[34] A. Natanzon, R. Shamir, and R. Sharan. Complexity classification of some edge modification problems. Discrete Applied Mathematics, 113(1):109–128, 2001.

[35] D. L. Nelson and M. M. Cox. Lehninger Principles of Biochemistry, Fourth Edition. W. H. Freeman, April 2004.

[36] R. Niedermeier. Invitation to Fixed Parameter Algorithms. Oxford University Press, 2006.

[37] S. Pomeroy, P. Tamayo, M. Gaasenbeek, L. Sturla, M. Angelo, M. McLaughlin, J. Kim, L. Goumnerova, P. Black, C. Lau, et al. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415(6870):436–442, 2002.

[38] F. Protti, M. D. da Silva, and J. L. Szwarcfiter. Applying modular decomposition to parameterized bicluster editing. In IWPEC, pages 1–12, 2006.

[39] S. Rahmann, T. Wittkop, J. Baumbach, M. Martin, A. Truß, and S. Böcker. Exact and heuristic algorithms for weighted cluster editing. In Proc. of Computational Systems Bioinformatics (CSB 2007), volume 6, pages 391–401, 2007.

[40] S. Ramaswamy, P. Tamayo, R. Rifkin, S. Mukherjee, C. Yeang, M. Angelo, C. Ladd, M. Reich, E. Latulippe, J. Mesirov, et al. Multiclass cancer diagnosis using tumor gene expression signatures. Proceedings of the National Academy of Sciences, 98(26):15149–15154, 2001.

[41] E. Ruspini. Numerical Methods for Fuzzy Clustering. American Elsevier Publishing Co, Inc, 1970.

[42] R. Shamir, R. Sharan, and D. Tsur. Cluster graph modification problems. Discrete Applied Mathematics, 144(1-2):173–182, 2004.

[43] R. Sharan et al. CLICK and EXPANDER: a system for clustering and visualizing gene expression data. Bioinformatics, 19(14):1787–1799, 2003.

[44] R. Sharan and R. Shamir. CLICK: A clustering algorithm with applications to gene expression analysis. Proc. Int. Conf. Intell. Syst. Mol. Biol, 8:307–16, 2000.

[45] M. Stoer and F. Wagner. A simple min-cut algorithm. Journal of the ACM (JACM), 44(4):585–591, 1997.

[46] A. Su, M. Cooke, K. Ching, Y. Hakak, J. Walker, T. Wiltshire, A. Orth, R. Vega, L. Sapinoso, A. Moqrich, et al. Large-scale analysis of the human and mouse transcriptomes. Proceedings of the National Academy of Sciences, 99(7):4465, 2002.

[47] R. Tatusov, M. Galperin, D. Natale, and E. Koonin. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Research, 28(1):33–36, 2000.

[48] J. Ward. Hierarchical Grouping for Evaluating Clustering Methods. J. Am. Stat. Assoc., 58:236–244, 1963.

[49] R. Wiese, M. Eiglsperger, and M. Kaufmann. yFiles: Visualization and automatic layout of graphs. Proceedings of the 9th International Symposium on Graph Drawing (GD01), pages 453–454, 2001.

[50] T. Williams and C. Kelley. GNUPLOT: An Interactive Plotting Program. Manual, version 3, 1998.

[51] T. Wittkop, J. Baumbach, F. Lobo, and S. Rahmann. Large scale clustering of protein sequences with FORCE – A layout based heuristic for weighted cluster editing. BMC Bioinformatics, 8(1):396, 2007.

[52] yEd Java Graph Editor, v2.3.1.02.

[53] E. Yeoh, M. Ross, S. Shurtleff, W. Williams, D. Patel, R. Mahfouz, F. Behm, S. Raimondi, M. Relling, A. Patel, et al. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell, 1(2):133–143, 2002.

[54] C. Zahn Jr. Approximating Symmetric Relations by Equivalence Relations. SIAM Journal on Applied Mathematics, 12:840, 1964.

[55] X. Zhang and Y. Li. Self-organizing map as a new method for clustering and data analysis. Neural Networks, 1993. IJCNN’93-Nagoya, Proceedings of 1993 International Joint Conference on, 3, 1993.



Appendix A

PEACE Options

–mode or -X
Option 1: Read an edge file in which every row describes the similarity of two vertices. Note that the format depends on option “–graphparser” or “-G”. To set how the similarity score is transformed into costs, use the options “–costparser” or “-C” and “–threshold” or “-T”.
Option 2: Read a graph in form of a cost matrix file as presented in Section 5.1.1.
Option 3: Read a directory which contains graph files in the format presented in Section 5.1.1. To avoid exhaustive computation, the user can use option “–max component” or “-M” to restrict the maximum size of the input graphs.
Option 4: Read a similarity matrix file. To set how the similarity score is transformed into costs, use the options “–costparser” or “-C” and “–threshold” or “-T”.
Default: 1

–threshold or -T
Option: a float number indicating the threshold a similarity score must exceed to form an edge.
Default: 1e-40

–costparser or -C
Option BlastParser.exponentToCosts: The exponent of the input edge value is used to create the final similarity value, from which the threshold is subtracted.
Option BlastParser.simpleCutOff: Every edge above the threshold is −1, any other is +1.
Option DoubleParser.valueToCosts: The input edge value simply equals the similarity value.
Option DoubleParser.simpleCutOff: Every edge above the threshold is +1, any other is −1.
Default: DoubleParser.valueToCosts

–graphparser or -G
Option EdgeFileParser.initFrom5ColumnFile: The first two entries of a line indicate the names of the vertices, and the fifth value indicates the similarity value of the vertex pair.
Option EdgeFileParser.initFrom3ColumnFile: The first two entries of a line indicate the names of the vertices, and the third value indicates the similarity value of the vertex pair.
Option EdgeFileParser.initFrom12ColumnFile: The first two entries of a line indicate the names of the vertices, and the twelfth value indicates the similarity value of the vertex pair.
Default: EdgeFileParser.initFrom5ColumnFile

–max component or -M
Option: an integer number indicating the maximum size of components considered in the calculation.
Default: 32767

–parameter or -P
Option: an integer value ≥ 0 indicating the parameter of the instance, i.e. the maximum edit costs allowed to obtain a solution; 0 means iterative parameter increment.
Default: 0


–parameter steps or -S
Option: an integer number ≥ 0 indicating the step size for the parameter increment; 0 means automatic calculation of the step size.
Default: 0

–no split or -s
If this option is used, the search tree split is turned off. Note that using this option will slow down the program significantly.

–all solutions or -A
If this option is used, every solution is returned. Note that this option cannot be used together with option “–no split” or “-s”, and that it leads to a great slow-down.

Table A.1: List of options for the executable file “weightedclusterediting” of our clustering tool PEACE.


Appendix B


Class Hierarchy

Figure B.1: UML diagram of the class hierarchy of our clustering tool PEACE. Red boxes correspond to exception classes, green boxes to reduction rule classes, blue boxes to the main classes, and grey boxes to utilities and other helper classes. Note that for clarity we left out some associations between classes. The diagram was created using the Umbrello UML Modeller [23].


Appendix C Lung Cancer Similarity Graph


Figure C.1: The lung cancer dataset consists of three small clusters, which are clustered correctly, and a huge, highly heterogeneous cluster.



Declaration of Authorship

I declare that I have produced this thesis independently, using only the cited sources and aids.

Jena, 17.12.2007,

