On Parameterized and Kernelization Algorithms for the Hierarchical Clustering Problem*

Yixin Cao (1) and Jianer Chen (2,3)

(1) Computer & Automation Research Institute, Hungarian Academy of Sciences, Hungary
(2) School of Information Science & Engineering, Central South University, P.R. China
(3) Department of Computer Science and Engineering, Texas A&M University, USA

* Supported in part by the US NSF under the Grants CCF-0830455 and CCF-0917288.
Abstract. Hierarchical clustering is an important problem with wide applications. In this paper, we approach the problem with a formulation based on weighted graphs and introduce new algorithmic techniques. Our new formulation and techniques lead to new kernelization and parameterized algorithms that significantly improve the previous algorithms for the problem.
1 Introduction
Many human activities can be described by a two-stage model of data collection and data analysis. The second stage is discovering knowledge in the data, in which one of the most common tasks is to classify a large set of objects based on the collected information. This is called the clustering problem, and it has incarnations in many disciplines, including biology, archaeology, geology, geography, business management, and social sciences [8, 12, 14, 15]. In this paper, we focus on the hierarchical clustering problem, which is to recursively classify a given data set into a tree structure in which leaves represent the objects and inner nodes represent clusters of various granularity degrees.

We start with some definitions. For an integer n ≥ 1, let [n] = {1, 2, . . . , n}. An n × n symmetric matrix D is a distance M-matrix if D_ii = 0 for all i and 1 ≤ D_ij ≤ M + 1 for i ≠ j. A distance M-matrix D is an ultrametric M-matrix if it satisfies the ultrametric property: for any three i, j, k in [n], D_ij ≤ max{D_ik, D_jk}. An M-hierarchical clustering C of a set X of n objects, which can simply be given as X = [n], is a rooted tree with the objects of X as leaves at level 0, the root at level M + 1, and a path of length exactly M + 1 from the root to every leaf. If we define a distance function d_C on the objects in X based on C such that, for any two x and y in X, d_C(x, y) is the height of the subtree rooted at the lowest common ancestor of x and y, then the distances on the objects of X form an ultrametric M-matrix D_C [2]. It is also known that every ultrametric M-matrix induces an M-hierarchical clustering [2].

If the distance function d_C is precise, then it is easy to construct the hierarchical clustering C [11]. Unfortunately, there are seldom, if any, data collection methods that can exclude the possibility of errors. As a consequence, the distance
matrix D formed by the distance function is in general not ultrametric, i.e., it contains inconsistent information. An important task in hierarchical clustering is to "correct" the errors and achieve data consistency. Formally, the hierarchical clustering problem we are concerned with is defined as follows:

M-hierarchical clustering: Given (D, k), where D is a distance M-matrix and k is an integer (i.e., the parameter), is there an ultrametric M-matrix D′ such that the difference d(D, D′) is bounded by k?

Here d(D, D′) is defined as d(D, D′) = Σ_{1≤i<j≤n} |D_ij − D′_ij|.

[…] > t for u, v ∈ X′, then set D_uv = t. In the following, we will focus on the M-perspective graph.

Now we are ready to generalize the cutting lemmas in [6] to hierarchical clustering. For an n × n matrix F and a pair of index subsets I, J ⊆ [n], denote by F|I,J the submatrix of F determined by the row index set I and the column index set J. We write F|I as a shorthand for F|I,I. By definition, for an ultrametric matrix D′, the submatrix D′|I for any index subset I is also ultrametric. For a distance matrix D, the submatrix D|I for any index subset I can be regarded as an instance of hierarchical clustering (where the cost c*(D|I) is defined naturally). Moreover, a solution S to the distance matrix D restricted to the index subset I is a solution to D|I, though the optimality may not transfer.

Lemma 1. Let D be a distance M-matrix for the object set X = [n], let P = {X_1, X_2, . . . , X_p} be a partition of X, and let E_P be the set of edges in G^M_D whose two ends belong to two different parts of P. Then

  Σ_{i=1}^p c*(D|X_i) ≤ c*(D) ≤ π^M_D(E_P) + Σ_{i=1}^p c*(D|X_i).

Proof. Let S be an optimal solution to D. As noted above, for 1 ≤ i ≤ p, S|X_i is a solution to the submatrix D|X_i, which implies that c*(D|X_i) ≤ c(S|X_i). Thus, Σ_{i=1}^p c*(D|X_i) ≤ Σ_{i=1}^p c(S|X_i) ≤ c(S) = c*(D).

For the second inequality, suppose that we increase all inter-part distances to M + 1, that is, we M-split all parts in P by removing all edges in E_P from the graph G^M_D, and then apply an optimal solution S′_i to each submatrix D|X_i. Then we obviously end up with a solution S′ to the matrix D, whose cost is

  π^M_D(E_P) + Σ_{i=1}^p c(S′_i) = π^M_D(E_P) + Σ_{i=1}^p c*(D|X_i),

which is no less than c*(D). This concludes the lemma. □
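To make the problem statement concrete, the following Python sketch checks the ultrametric property and computes the editing cost d(D, D′). It is only an illustration: the representation of D as a list of lists and the helper names are our own choices, not notation from the paper.

```python
from itertools import combinations

def is_ultrametric(D):
    """Check the ultrametric property: for every triple {i, j, k},
    each distance is at most the maximum of the other two, i.e.,
    the two largest of the three distances are equal."""
    n = len(D)
    for i, j, k in combinations(range(n), 3):
        a, b, c = sorted((D[i][j], D[i][k], D[j][k]))
        if b != c:
            return False
    return True

def editing_cost(D, D_prime):
    """d(D, D') = sum over unordered pairs {i, j} of |D[i][j] - D'[i][j]|."""
    n = len(D)
    return sum(abs(D[i][j] - D_prime[i][j]) for i, j in combinations(range(n), 2))
```

With these helpers, an instance (D, k) is a yes-instance exactly when some ultrametric M-matrix D′ satisfies editing_cost(D, D′) ≤ k.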
If there is a partition such that all inter-part pairs have distance M + 1, then π^M_D(E_P) = 0 and Lemma 1 gives

Corollary 1. Let D be a distance M-matrix for the object set X = [n], and let P = {X_1, X_2, . . . , X_p} be a partition of X. If D_uv = M + 1 for each pair u and v that belong to different parts of P, then c*(D) = Σ_{i=1}^p c*(D|X_i).

When p = 2, i.e., the partition is P = {Y, Ȳ}, where Y is a subset of X and Ȳ = X \ Y, the edge set E_P becomes the cut ⟨Y, Ȳ⟩ (i.e., the set of edges with exactly one end in Y), whose weight will be denoted by γ^M_D(Y). Lemma 1 gives
Corollary 2. For any subset Y of X, we have c*(D|Y) + c*(D|Ȳ) ≤ c*(D) ≤ c*(D|Y) + c*(D|Ȳ) + γ^M_D(Y).

This suggests the following lower bound for γ^M_D(Y).
Lemma 2. Let S be an optimal solution to a distance M-matrix D for the object set X = [n]. For any subset Y of X, c(S|Y,Ȳ) ≤ γ^M_D(Y).

Proof. The solution S can be divided into three disjoint parts: S|Y, S|Ȳ, and S|Y,Ȳ. By Corollary 2 (note that c*(D) = c(S)),

  c(S) = c(S|Y) + c(S|Ȳ) + c(S|Y,Ȳ) ≤ c*(D|Y) + c*(D|Ȳ) + γ^M_D(Y).        (1)
Since S|Y is a solution to the submatrix D|Y and S|Ȳ is a solution to the submatrix D|Ȳ, we have c(S|Y) ≥ c*(D|Y) and c(S|Ȳ) ≥ c*(D|Ȳ), which combined with (1) immediately gives c(S|Y,Ȳ) ≤ γ^M_D(Y). □

For a distance M-matrix D, it is intuitive that the objective ultrametric M-matrix should have its largest element bounded by M + 1. This intuition is formally proved in the following lemma, which also verifies the validity of the definition of the M-hierarchical clustering problem.

Lemma 3. Let D be a distance M-matrix, and let S′ be an optimal solution to D. Then the matrix D′ = D + S′ is an ultrametric M-matrix (i.e., the largest element in the matrix D′ has a value bounded by M + 1).

Proof. We prove the lemma by contradiction. Suppose that d_0 = max_{1≤i<j≤n} D′_ij > M + 1. Construct a matrix D″ from D′ by replacing every entry of value d_0 by d_0 − 1, and let S″ = D″ − D. Since d_0 − 1 ≥ M + 1 ≥ max_{1≤i<j≤n} D_ij, every modified entry moves closer to the corresponding entry of D, and the number of modified entries is larger than 0. Thus, c(S″) < c(S′). Applying the solutions S′ and S″ to D, we get two different matrices D′ and D″ = D + S″. By the above construction, for all t < d_0 − 1, the t-perspective graphs for D′ and D″ are the same, which are unions of disjoint cliques. For t = d_0 − 1, the t-perspective graph G^t_{D″} for D″ is a single clique. Thus, by Proposition 1, D″ is ultrametric, so S″ is a solution to D. However, this contradicts the facts that c(S″) < c(S′) and that S′ is an optimal solution to D. □

Without loss of generality, we will always assume in the rest of this paper that a distance M-matrix has at least one element of value M + 1.
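The proof of Lemma 3 reasons about t-perspective graphs and their clique structure. Assuming the natural reading that G^t_D joins u and v exactly when D_uv ≤ t (which is consistent with N_v = {u : D_uv < M + 1} being the neighborhood of v in G^M_D used in the next section), a small Python sketch of the two checks used there is:

```python
def perspective_graph(D, t):
    """Adjacency sets of the t-perspective graph: u ~ v iff D[u][v] <= t (assumed reading)."""
    n = len(D)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if D[u][v] <= t:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def is_disjoint_cliques(adj):
    """A graph is a disjoint union of cliques iff every pair of adjacent
    vertices has identical closed neighborhoods."""
    for v, nbrs in adj.items():
        for u in nbrs:
            if adj[u] | {u} != nbrs | {v}:
                return False
    return True
```

The proof of Lemma 3 invokes, via Proposition 1, the fact that a matrix is ultrametric exactly when every t-perspective graph is such a union of disjoint cliques.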
3 A kernel of size 2k
To better understand our kernelization algorithm, we start with one that produces a kernel of size 4k, and discuss the difficulty of improving it. The second part is devoted to overcoming the difficulty and achieving the kernel of size 2k.
Warming up: a kernel of size 4k

Fix a distance M-matrix D for X = [n]. For an object v in X, denote by N_v = {u : D_uv < M + 1} the closed neighborhood of v in the graph G^M_D. A simple but important fact about a solution S of cost bounded by k to the distance M-matrix D is that at most 2k different objects in X have some of their distances to other objects changed. As a consequence, if we are also able to bound the number of objects that are not affected by S, we get a kernel. For such an unaffected object v, the v-th row of S consists of only 0's. Thus, in the ultrametric matrix D′ = D + S, for any two objects u and w in X, where u ∈ N_v, the distance D′_uw must satisfy (note that D′_vu = D_vu and D′_vw = D_vw):

  D′_uw ≤ max(D_vu, D_vw) ≤ M       if u, w ∈ N_v;
  D′_uw = max(D_vu, D_vw) = M + 1   if u ∈ N_v, w ∉ N_v.        (2)

This is a necessary (but not sufficient) condition for a solution S to avoid v. If (2) is not satisfied by D, then D|N_v must be modified by S. To measure the cost of the modification, we introduce a number of functions, as follows:

  δ(v) = |{(u, w) : u, w ∈ N_v, u < w, and D_uw = M + 1}|,
  γ(v) = Σ_{u∈N_v, w∉N_v} (M + 1 − D_uw)   (i.e., γ(v) = γ^M_D(N_v)),
  ρ(v) = 2δ(v) + γ(v).

We say that the neighborhood N_v is reducible if ρ(v) < |N_v|. We describe two reduction rules on a reducible neighborhood N_v. The first, given in the following lemma, claims that N_v can be put into a single M-clique.

Lemma 4. For an object v with N_v reducible, there is an optimal solution S* to D such that the maximum distance in (D + S*)|N_v is bounded by M.

Lemma 4 immediately gives our first reduction rule:

Rule 1. For an object v in X such that N_v is reducible, replace every element M + 1 in the submatrix D|N_v by M, and decrease the parameter k by δ(v).

After Rule 1, we have δ(v) = 0 and ρ(v) = γ(v). Now consider D|N_v,N̄_v.

Rule 2. On a reducible N_v on which Rule 1 has been applied, for each object x ∉ N_v with Σ_{u∈N_v}(M + 1 − D_xu) ≤ |N_v|/2, M-split x from N_v.

Lemma 5. Rule 2 is safe.

After Rules 1-2, the neighborhood N_v has a very simple structure: there is at most one "pendent" object in N̄_v that is still attached to N_v, as shown by the following lemma.

Lemma 6. For a reducible N_v on which Rules 1-2 have been applied, there is at most one object x ∉ N_v such that D|N_v,x has values not equal to M + 1.
Proof. By the condition of Rule 2, any object x in N̄_v that still has distance smaller than M + 1 to some objects in N_v after the application of Rule 2 must satisfy Σ_{u∈N_v}(M + 1 − D_xu) > |N_v|/2. To prove the lemma, suppose on the contrary that there are two such objects x and y. Then we have

  γ(v) = γ^M_D(N_v) ≥ Σ_{u∈N_v}(M + 1 − D_xu) + Σ_{u∈N_v}(M + 1 − D_yu) > |N_v|.

This contradicts that N_v is reducible and ρ(v) = 2δ(v) + γ(v) < |N_v|. □
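To make the reduction rules concrete, here is a minimal Python sketch of the reducibility test and of Rules 1 and 2, operating in place on a symmetric matrix. The function name and this imperative style are ours, not the paper's, and "decrease k accordingly" is read as charging one unit per unit of distance changed.

```python
from itertools import combinations

def reduce_neighborhood(D, v, k, M):
    """Apply Rules 1 and 2 to N_v if it is reducible; D is a symmetric list of
    lists, modified in place.  Returns the updated budget k (unchanged if N_v
    is not reducible)."""
    n = len(D)
    Nv = {u for u in range(n) if D[u][v] < M + 1}          # closed neighborhood of v in G^M_D
    delta = sum(1 for u, w in combinations(Nv, 2) if D[u][w] == M + 1)
    gamma = sum(M + 1 - D[u][w] for u in Nv for w in range(n)
                if w not in Nv and D[u][w] < M + 1)
    if 2 * delta + gamma >= len(Nv):                       # rho(v) >= |N_v|: not reducible
        return k
    for u, w in combinations(Nv, 2):                       # Rule 1: put N_v into one M-clique
        if D[u][w] == M + 1:
            D[u][w] = D[w][u] = M
            k -= 1
    for x in range(n):                                     # Rule 2: M-split weakly attached objects
        if x in Nv:
            continue
        attachment = sum(M + 1 - D[x][u] for u in Nv if D[x][u] < M + 1)
        if 0 < attachment <= len(Nv) / 2:
            for u in Nv:
                if D[x][u] < M + 1:
                    k -= M + 1 - D[x][u]
                    D[x][u] = D[u][x] = M + 1
    return k
```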
Now we are ready to describe our kernelization algorithm.

The Kernelization Algorithm. For each object v for which the set N_v is reducible:
1. decrease every value M + 1 in D|N_v to M, and decrease k accordingly;
2. for each object x ∉ N_v such that Σ_{u∈N_v}(M + 1 − D_xu) ≤ |N_v|/2, set all values in D|N_v,x to M + 1 and decrease k accordingly.

Note that there is only one condition tested by the algorithm, which is checked only once and is independent of the parameter k. This kernelization algorithm is applied iteratively, starting from the highest level M. In each run, we take each object set obtained in the splitting in the previous run and apply the kernelization algorithm, until there is no object set on which the kernelization algorithm is applicable. Therefore, the kernel consists of a set of object sets, each of which forms an independent instance of hierarchical clustering. To analyze the size of the final kernel, we count the relation between the object sets and the minimum number of modifications required to make the distance matrix ultrametric. Because our counting does not depend on the value of M, this ratio holds for all subsets that form independent instances of hierarchical clustering, and therefore for the entire object set X.

Lemma 7. Let (D, k) be an instance of M-hierarchical clustering on which the kernelization algorithm is not applicable. If the size of D is larger than 4k, then there is no solution to D of cost bounded by k.

Proof. Let the matrix S be an optimal solution to the distance M-matrix D. For each pair v, w ∈ X, we divide the cost |S_vw| into two halves and distribute them evenly to v and w. By this procedure, each object v gets a "cost" cost(v) = (1/2) Σ_{u∈X\{v}} |S_uv|. The total cost of S is equal to Σ_{v∈X} cost(v). We count the cost on each object, and pay special attention to objects with cost 0.

For two objects u, v with D_uv = M + 1, if there exists another object x such that D_ux ≤ M and D_vx ≤ M, then at most one of the objects u and v can have cost 0: to make u, v, x satisfy the ultrametric property, at least one of the distances D_uv, D_ux, D_vx must be changed. Let Z_S = {v_1, v_2, . . . , v_r} be the set of objects with cost 0. For two objects v_i, v_j ∈ Z_S, either D_{v_i v_j} = M + 1 and every other object has distance M + 1 to at least one of v_i, v_j; or D_{v_i v_j} ≤ M and any other object has distance M + 1 to v_i if and only if it has distance M + 1 to v_j. As a result, the two neighborhoods N_{v_i} and N_{v_j} in G^M_D are either the same (when D_{v_i v_j} ≤ M) or disjoint (when D_{v_i v_j} = M + 1). Thus, without loss of generality, we can assume that all neighborhoods in {N_{v_1}, N_{v_2}, . . . , N_{v_r}} are pairwise disjoint. Let N_S = N_{v_1} ∪ N_{v_2} ∪ · · · ∪ N_{v_r}.

Since the column D|X,v_i for v_i ∈ Z_S is unchanged by the solution S, S must decrease the distance M + 1 between any pair of objects in N_{v_i} to M, and increase the distance between objects in N_{v_i} and N̄_{v_i} to M + 1. These operations have cost δ(v_i) + γ(v_i). Accordingly, the total cost on the objects in N_{v_i} is δ(v_i) + γ(v_i)/2 = ρ(v_i)/2. If N_{v_i} is not reducible, then ρ(v_i)/2 ≥ |N_{v_i}|/2. On the other hand, if N_{v_i} is reducible, then by Lemma 6, there can be at most one object x ∈ N̄_{v_i} that has distance bounded by M to some objects in N_{v_i}. According to Rule 2, in this case ρ(v_i) ≥ γ(v_i) > |N_{v_i}|/2. Thus, the cost ρ(v_i)/2 is always strictly larger than |N_{v_i}|/4. From this analysis, we get

  Σ_{v∈N_S} cost(v) = Σ_{i=1}^r Σ_{v∈N_{v_i}} cost(v) ≥ Σ_{i=1}^r |N_{v_i}|/4 = |N_S|/4.        (3)
On the other hand, each object w ∉ N_S bears a cost of at least 1/2. Thus

  Σ_{w∈X\N_S} cost(w) ≥ |X \ N_S|/2.        (4)
Combining (3) and (4) shows that the cost of the optimal solution S to D is

  Σ_{v∈X} cost(v) = Σ_{v∈N_S} cost(v) + Σ_{v∈X\N_S} cost(v) ≥ |N_S|/4 + |X \ N_S|/2 ≥ |X|/4.
Thus, if |X| > 4k, then the distance M-matrix D has no solution of cost ≤ k. □

Destination: a kernel of size 2k

The main obstacle to further improving the kernel size 4k given above is that for a reducible N_v, conflicts in N_v are only settled at level M, while conflicts may still occur at lower levels. Our idea for tackling this obstacle is: if the cost to fix N_v at lower levels is large enough, we use it in the counting to complement the deficiency; otherwise we find another rule to reduce N_v.

We first consider the case where no pendent object x ∉ N_v (as described in Lemma 6) exists for N_v. In this case, N_v has been completely resolved in the perspective graph for level M, and we can treat D|N_v as an independent instance of the (M − 1)-hierarchical clustering problem, and continue to apply the Kernelization Algorithm (note that the Kernelization Algorithm does not depend on the value of the parameter k).

The case where the pendent object x ∉ N_v exists for N_v is more involved. After the previous steps, the M-clique in the final solution is contained in N_v ∪ {x}. Thus, the (M − 1)-clique containing v is a subset of N_v ∪ {x}. Since D_vx = M + 1, we have N^{M−1}_v ⊆ N_v, where N^{M−1}_v is the neighborhood of the object v in the (M − 1)-perspective graph G^{M−1}_D. If N^{M−1}_v is reducible at level M − 1, we can again apply the Kernelization Algorithm. We can continue this procedure until
• we meet the first t such that N^t_v is not reducible;
• we meet the first t such that N^t_v gets isolated; or
• we hit the ground when t = 1.

In the first situation, we stop. In the second situation, we apply the Kernelization Algorithm to N^t_v as an independent instance. Thus, we only need to deal with the last situation, for which there is a pendent object x_t for N^t_v at each level t. At level t = 1, let N_1 ⊂ N^1_v be the objects with distance 1 to x_1, and let N_2 ⊂ N^1_v be the objects with distance 2 to x_1. Obviously, N^1_v = N_1 ∪ N_2.

Rule 3. Let v be an object such that N^t_v is reducible and Rules 1-2 have been applied for all levels t. Pick any subset N_{12} ⊆ N_1 with |N_{12}| = |N_2|, and remove N′ = N_{12} ∪ N_2. For levels t ≥ 2, increase the total distance from x_t to N^t_v − N′ by 2|N_2| − 2|{u ∈ N_{12} : D_{u x_t} ≤ t}|, by arbitrarily choosing objects from N^t_v − N′ and increasing their distances to x_t to no more than t + 1.

We first verify the validity of Rule 3. Since x_1 survives Rule 2, more objects in N^1_v have distance 1 to x_1 than have distance 2, i.e., |N_1| ≥ |N_2|, which shows the existence of the subset N_{12}. For an upper level 2 ≤ t ≤ M, the required increment in distance between x_t and N^t_v − N′ is (|N_{12}| + |N_2| − |{u ∈ N_{12} : D_{u x_t} ≤ t}|) − |{u ∈ N_{12} : D_{u x_t} ≤ t}|, where the first parenthesis constitutes a set of objects with distance ≥ t + 1 to x_t. This condition is always satisfied, since x_t survived Rule 2.

Lemma 8. Rule 3 is safe.

Now we are ready to show that the Kernelization Algorithm has a kernel of size 2k. In the second situation, we treat N^t_v as an independent instance and apply the Kernelization Algorithm. Since M is finite, we will eventually reach an instance at a lower level on which the second situation no longer holds, where the instance either is already internally ultrametric, or contains no reducible objects anymore. In the latter case, the internal cost is at least twice the number of objects in it, and we can use it to make up the deficiency in the upper-level counting. In the former case, we use the following reduction rule, which is almost the same as Rule 3 and whose safeness follows from a similar argument.

Rule 4. Let v be an object such that N^t_v is reducible and Rules 1-2 have been applied for all levels t ≥ T. If N^T_v gets separated and D|N^T_v is ultrametric, then remove N^T_v, and from level t = T to level M, increase the total distance from x_t to N^t_v − N^T_v by 2|N^T_v|, by arbitrarily choosing an object from N^t_v − N^T_v and increasing its distance to x_t to no more than t + 1.

Summarizing the above discussions, we conclude with a kernel bound for the Kernelization Algorithm, which was claimed in the second part of Theorem 1.

Theorem 2. Let (D, k) be an instance of the M-hierarchical clustering problem on which the Kernelization Algorithm has been applied. If the size of the distance M-matrix D is larger than 2k, then no solution to the distance M-matrix D has its cost bounded by k.
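The decomposition into independent instances used repeatedly above (via Corollary 1, and when N^t_v gets isolated) amounts to splitting the object set along the connected components of the perspective graph. A minimal sketch, again under the assumption that G^M_D joins u and v exactly when D_uv ≤ M:

```python
def independent_instances(D, M):
    """Connected components of G^M_D.  All cross-component distances equal
    M + 1, so by Corollary 1 each component can be solved as a separate
    instance of the problem."""
    n = len(D)
    seen, parts = set(), []
    for s in range(n):
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(u for u in range(n) if u != v and D[u][v] <= M and u not in comp)
        seen |= comp
        parts.append(sorted(comp))
    return parts
```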
4 An improved parameterized algorithm
Inspired by the formulation and the use of perspective graphs in the last section, one might want to solve the M-hierarchical clustering problem in a level-by-level manner: pick an algorithm for the cluster editing problem, apply it to the M-perspective graph G^M_D, then apply it to the resulting instances at level M − 1, and so on. However, this greedy approach does not always work: it can be shown that the set of operations in an optimal solution to the perspective graph at a higher level for cluster editing may not be a subset of the set of operations in any optimal solution to the original instance of M-hierarchical clustering. On the other hand, this negative result does offer some useful information: it indicates that if we want to use an algorithm for cluster editing, we cannot use it as a black box; we must know its internal mechanism.

The hierarchical clustering problem, of which the cluster editing problem is a special case, can be solved by the following breaking-conflict-triangle process: to convert a distance matrix D into an ultrametric matrix, for any three indices i, j, k, if D_ij, D_ik, and D_jk do not satisfy the ultrametric property, then at least one of them must change its value. This naturally suggests a 3-way branching search for an optimal solution to D, which leads to an O*(3^k)-time algorithm for the problem. For cluster editing, there have been several improved results following the basic outline of breaking conflict-triangles. With the help of more careful branching steps and more complicated analysis techniques, the current best algorithm for cluster editing takes time O*(1.62^k) [5]. On the other hand, there has been no non-trivial parameterized algorithm for the general M-hierarchical clustering problem.

Instead of adapting a single particular algorithm for the cluster editing problem to solve the M-hierarchical clustering problem, we go one step further. We show that any parameterized algorithm for the cluster editing problem, provided it is based on branching on breaking conflict-triangles, can be adapted to solve the M-hierarchical clustering problem with the same time complexity as far as the exponential part is concerned. Indeed, what we present is a meta algorithm, which takes as input, in addition to an instance I_H = (D, k) of the M-hierarchical clustering problem, an algorithm for the cluster editing problem, and returns an optimal solution to I_H. The algorithm Meta-HC given in Figure 1 shows how such a meta algorithm adapts an algorithm for cluster editing to solve hierarchical clustering.

We give some explanations of the algorithm. In step 6 of the algorithm, we decrease each element of value M + 1 to M and mark the value forbidden. This step simplifies the presentation of the algorithm. In particular, after each iterative run, we do not break the instance into smaller subinstances and solve them separately; instead, we still treat it as a single instance. Note that if there is no conflict-triangle at level M, then the edges with distance M + 1 partition the objects with no conflict-triangles. Thus, decreasing them uniformly by 1 at level M − 1 will not create new conflict-triangles. On the other hand, in step 3 of the algorithm, when we call the algorithm A for cluster editing, the "forbidden" tags are discarded. Thus, the distance of an edge that gets decreased in one turn can be further decreased later. On the other hand, by the procedure, no repeated increment on a single edge can happen.
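For concreteness, the plain 3-way branching on conflict-triangles mentioned above can be sketched as follows. This is the simple O*(3^k) procedure, not the refined algorithm A that Meta-HC takes as input; the function names and the particular branch directions (decrease the unique maximum, or increase one of the two smaller distances) are our choices for illustration.

```python
from itertools import combinations

def find_conflict(D):
    """Return the three (distance, pair) entries of a triple violating the
    ultrametric property, sorted by distance, or None if D is ultrametric."""
    for i, j, k in combinations(range(len(D)), 3):
        entries = sorted([(D[i][j], (i, j)), (D[i][k], (i, k)), (D[j][k], (j, k))])
        if entries[1][0] != entries[2][0]:   # unique maximum: the triple is in conflict
            return entries
    return None

def branch(D, k):
    """Return an ultrametric matrix within editing cost k of D, or None."""
    if k < 0:
        return None
    conflict = find_conflict(D)
    if conflict is None:
        return D
    (_, low1), (_, low2), (_, high) = conflict
    # to resolve the conflict, either the unique maximum decreases,
    # or one of the two smaller distances increases
    for (u, v), step in ((high, -1), (low1, +1), (low2, +1)):
        D2 = [row[:] for row in D]
        D2[u][v] += step
        D2[v][u] += step
        result = branch(D2, k - 1)
        if result is not None:
            return result
    return None
```

Each branch spends one unit of the budget and moves one distance one step toward its value in a hypothetical solution, which gives the O*(3^k) bound mentioned above.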
Algorithm Meta-HC(D, k, A)
input: a distance matrix D, an integer k, an algorithm A for cluster editing
output: an ultrametric matrix D′ such that d(D, D′) ≤ k if such a D′ exists
1  M = max_{1≤i