ISSN 1990-4789, Journal of Applied and Industrial Mathematics, 2015, Vol. 9, No. 4, pp. 497–502. © Pleiades Publishing, Ltd., 2015. Original Russian Text © A.V. Kel'manov, V.I. Khandeev, 2015, published in Diskretnyi Analiz i Issledovanie Operatsii, 2015, Vol. 22, No. 4, pp. 50–59.
An Exact Pseudopolynomial Algorithm for a Problem of the Two-Cluster Partitioning of a Set of Vectors

A. V. Kel'manov^{1,2*} and V. I. Khandeev^{1**}

^1 Sobolev Institute of Mathematics, pr. Akad. Koptyuga 4, Novosibirsk, 630090 Russia
^2 Novosibirsk State University, ul. Pirogova 2, Novosibirsk, 630090 Russia

Received September 16, 2014; in final form, February 22, 2015
Abstract—We consider the strongly NP-hard problem of partitioning a set of Euclidean vectors into two clusters of given sizes so as to minimize the sum of the squared distances from the elements of the clusters to their centers. It is assumed that the center of one of the clusters is unknown and is determined as the average value over all vectors in the cluster, while the center of the other cluster is the origin. We prove that, for a fixed dimension of the space, the problem is solvable in polynomial time. We also present and justify an exact pseudopolynomial algorithm for the case of integer components of the vectors.

DOI: 10.1134/S1990478915040067

Keywords: partitioning, vector subset, squared Euclidean distance, NP-hardness, exact pseudopolynomial algorithm
INTRODUCTION

The topic of the present article is a hard problem of discrete optimization. The article aims at justifying an exact pseudopolynomial algorithm for a special case of this problem.

One of the well-known discrete optimization problems is the MSSC (Minimum Sum-of-Squares Clustering) problem [21, 26, 27]. In this problem, it is required to partition a finite set of vectors in Euclidean space into a given number of clusters so as to minimize the sum over all clusters of the intracluster sums of the squared distances from the elements of the clusters to their centers. The center of every cluster is defined as its geometric center, i.e., the average of the vectors contained in this cluster. This problem appears, in particular, in solving problems of approximation, data analysis, image recognition, computational geometry, and mathematical statistics. Questions of constructing algorithmic solutions to Problem MSSC are considered in hundreds of publications (for example, see [3, 7, 21, 23–27] and the references therein). The problem had long been considered NP-hard; however, a correct proof of its hardness was obtained only recently [20], and it initiated the study of similar problems (for example, see [1, 3, 8–18] and the references therein). The problem considered in the present article is one of them.

The meaningful situation that leads to the problem under study is as follows (for example, see [8, 9, 15]): There is a table with the results of measurements of a set of characteristics of some object. The object can be in one of two states: the passive state (all characteristics are zero) or the active state (at least one characteristic is nonzero). Moreover, each measurement result contains an error, and the correspondence between the measurement results and the object states is unknown. It is required to find the subset of the tuples corresponding to the active state of the object and to estimate the set of characteristics of the object in this state.
The strong NP-hardness of the above problem was established in [16]. Some approximate polynomial algorithms with guaranteed accuracy were proposed in [8, 9, 19]; characteristics of these algorithms are given in Section 1. Here we only observe that the strong NP-hardness of the problem implies [22] that there are no exact polynomial or pseudopolynomial algorithms for this problem unless
* E-mail: [email protected]
** E-mail: [email protected]
P = NP. Therefore, it is of interest to find subclasses of the problem for which exact polynomial-time algorithms can be constructed. In the present paper, we show that the problem is solvable in polynomial time when the dimension of the space is fixed. We also construct an exact pseudopolynomial algorithm for the case of a fixed dimension of the space and integer components of the vectors. Note that an exact pseudopolynomial algorithm was earlier proposed in [4] for a special case of the problem of searching a finite set of vectors for a subset with the maximal norm of the sum of its elements; the same case, a fixed dimension of the space and integrality of the inputs, is addressed in this article. Since the minimization problem considered in this article is equivalent to the maximization problem considered in [4] (see Section 2), the algorithm proposed in [4] yields an exact solution for the corresponding particular case of the problem under consideration. In the present paper, we apply an approach to constructing an exact algorithmic solution that differs from the one suggested in [4]. This approach turned out to be less productive as regards the speed of the constructed algorithm (see Section 2). Nevertheless, we believe that it is important as another efficient instrument for solving similar problems.

1. STATEMENT OF THE PROBLEM AND MAIN RESULTS

The problem we consider has the following statement [8, 9, 16]:

Problem 1-MSSC-F. Given a set $\mathcal{Y} = \{y_1, \ldots, y_N\}$ of vectors in $\mathbb{R}^q$ and a positive integer $M$, find a partition of $\mathcal{Y}$ into two clusters $\mathcal{C}$ and $\mathcal{Y} \setminus \mathcal{C}$ such that
$$ S(\mathcal{C}) = \sum_{y \in \mathcal{C}} \|y - \overline{y}(\mathcal{C})\|^2 + \sum_{y \in \mathcal{Y} \setminus \mathcal{C}} \|y\|^2 \to \min, \qquad (1) $$
where
$$ \overline{y}(\mathcal{C}) = \frac{1}{|\mathcal{C}|} \sum_{y \in \mathcal{C}} y $$
is the center of $\mathcal{C}$, under the constraint $|\mathcal{C}| = M$.

In this problem, the input set Y corresponds to a table (see the informal treatment in the Introduction) containing N results of measurements of a tuple of q numerical characteristics of an object. The center of the subset C is unknown (it is an optimized variable), while the center of Y \ C is fixed at the origin. The number M corresponds to the number of measurements taken in the active state. Problem 1-MSSC-F is a particular NP-hard case of Problem J-MSSC-F (for J = 1) [15]. In Problem J-MSSC-F, it is required to partition the input set into J + 1 clusters; the centers of J clusters are unknown, while the center of one cluster is fixed at the origin [15]. The symbols MSSC in the abbreviated title of the formulated problem stress its similarity with the classical NP-hard MSSC problem, in which the centers of all sought disjoint clusters are unknown. The last symbol (from "Fixed") indicates that the sizes of the clusters are fixed.

In [8], a 2-approximation polynomial algorithm solving Problem 1-MSSC-F in O(qN^2) time was constructed. A polynomial-time approximation scheme (PTAS) of time complexity O(qN^{2/ε+1}(9/ε)^{3/ε}), with guaranteed bound ε on the relative error, was proposed in [9]. In [19], a randomized algorithm was justified which, for a given relative error ε > 0 and a fixed probability γ ∈ (0, 1) of failure of the algorithm for an established value of k, finds a (1 + ε)-approximate solution to the problem in O(2^k q(k + N)) time; conditions were also found under which the algorithm is asymptotically exact and has complexity O(qN^2). Below, for the case when the components of the vectors are integer and the dimension of the space is fixed, we propose an exact pseudopolynomial algorithm that solves the problem in O(N(MD)^q) time.
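As a concrete illustration, objective (1) of Problem 1-MSSC-F can be evaluated directly. The following Python sketch is our own rendering (the function name and the use of NumPy are our assumptions, not part of the paper): it computes S(C) for a candidate cluster given by its index set.

```python
import numpy as np

def objective_S(Y, C_idx):
    """Objective (1): within-cluster scatter of C about its mean,
    plus squared norms of the remaining vectors (whose center is the origin).
    Y: (N, q) array of input vectors; C_idx: indices of the cluster C, |C| = M."""
    mask = np.zeros(len(Y), dtype=bool)
    mask[list(C_idx)] = True
    C = Y[mask]
    center = C.mean(axis=0)              # the center \bar{y}(C) of the cluster
    s_in = ((C - center) ** 2).sum()     # sum over C of ||y - \bar{y}(C)||^2
    s_out = (Y[~mask] ** 2).sum()        # sum over Y \ C of ||y||^2
    return s_in + s_out
```

For instance, with Y = {(2, 0), (4, 0), (0, 1)} and C consisting of the first two vectors, the center is (3, 0), and S(C) = 1 + 1 + 1 = 3.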
2. AN EXACT PSEUDOPOLYNOMIAL ALGORITHM

Let us first show that, for a fixed dimension of the space, Problem 1-MSSC-F is solvable in polynomial time. To this end, we will need the following strongly NP-hard [2, 5] problem.

Problem LVS (Longest Vector Sum). Given a set $\mathcal{Y} = \{y_1, \ldots, y_N\}$ of vectors in $\mathbb{R}^q$ and a positive integer $M$, find $\mathcal{C} \subseteq \mathcal{Y}$ of size $M$ such that
$$ F(\mathcal{C}) = \Big\| \sum_{y \in \mathcal{C}} y \Big\| \to \max. $$

It was proved in [6] that Problem LVS is solvable in O(q^2 N^{2q}) time, i.e., in polynomial time, when the dimension q of the space is fixed.

Proposition 1. Problem 1-MSSC-F is solvable in O(q^2 N^{2q}) time.

Proof. The chain of equalities
$$ S(\mathcal{C}) = \sum_{y \in \mathcal{C}} \|y - \overline{y}(\mathcal{C})\|^2 + \sum_{y \in \mathcal{Y} \setminus \mathcal{C}} \|y\|^2 = \sum_{y \in \mathcal{Y}} \|y\|^2 - \frac{1}{|\mathcal{C}|} \Big\| \sum_{y \in \mathcal{C}} y \Big\|^2 = \sum_{y \in \mathcal{Y}} \|y\|^2 - \frac{1}{|\mathcal{C}|} \big( F(\mathcal{C}) \big)^2 $$
establishes a connection between the objective functions of Problems 1-MSSC-F and LVS. Since the first term on the right-hand side of the above expression is a constant and the size of C is given, the minimization of S(C) is equivalent to the maximization of F(C). Therefore, the exact algorithm for Problem LVS can be applied to find an optimal solution to Problem 1-MSSC-F. Proposition 1 is proved.

The essence of the algorithmic approach below is as follows: In the domain of the space defined by the maximal absolute value of the coordinates of the input vectors, we construct a multidimensional mesh (lattice) with a rational mesh step uniform in every coordinate. The mesh step is chosen so that one of the mesh nodes coincides with the geometric center of one of the clusters to be optimized. For each node of the mesh, we solve the problem of maximizing an auxiliary objective function. As a result, we find a tuple of vectors at which the maximum of this function is attained. The tuple so found is declared a candidate for the solution. As the final solution, we choose the candidate for which the value of the objective function of the initial problem is minimal.

For constructing the algorithm, we need an auxiliary assertion. Put
$$ G(\mathcal{B}, b) = \sum_{y \in \mathcal{B}} \|y - b\|^2 + \sum_{y \in \mathcal{Y} \setminus \mathcal{B}} \|y\|^2, \qquad (2) $$
where $\mathcal{B} \subseteq \mathcal{Y}$, $|\mathcal{B}| = M$, and $b \in \mathbb{R}^q$.

Lemma 1. For every fixed subset $\mathcal{B} \subseteq \mathcal{Y}$, the minimum of (2) over b is attained at
$$ \overline{y}(\mathcal{B}) = \frac{1}{M} \sum_{y \in \mathcal{B}} y $$
and equals $S(\mathcal{B})$. For a given $b \in \mathbb{R}^q$, the minimum of (2) over $\mathcal{B}$ is attained at the set consisting of the M vectors of Y having the greatest projections on b.

Proof. The first assertion is easy to check analytically (by differentiation), while the second follows from the chain of equalities
$$ G(\mathcal{B}, b) = \sum_{y \in \mathcal{B}} \|y - b\|^2 + \sum_{y \in \mathcal{Y} \setminus \mathcal{B}} \|y\|^2 = \sum_{y \in \mathcal{B}} \big( \|y - b\|^2 - \|y\|^2 \big) + \sum_{y \in \mathcal{Y}} \|y\|^2 = \sum_{y \in \mathcal{Y}} \|y\|^2 + M \cdot \|b\|^2 - 2 \Big\langle \sum_{y \in \mathcal{B}} y, \, b \Big\rangle. $$
It remains to observe that the first two summands on the right-hand side are constants. Lemma 1 is proved.
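The second assertion of Lemma 1 can be checked numerically: by the chain of equalities above, minimizing G over subsets of size M for a fixed b reduces to picking the M vectors with the greatest inner products with b. The sketch below is ours (function names and the use of NumPy are our assumptions); a brute-force scan over all subsets confirms the selection on random data.

```python
import numpy as np
from itertools import combinations

def G(Y, B_idx, b):
    """Function (2): sum_{y in B} ||y - b||^2 + sum_{y in Y\\B} ||y||^2."""
    mask = np.zeros(len(Y), dtype=bool)
    mask[list(B_idx)] = True
    return ((Y[mask] - b) ** 2).sum() + (Y[~mask] ** 2).sum()

def best_B(Y, M, b):
    """Second assertion of Lemma 1: for a fixed b, G is minimized by the
    M vectors of Y with the greatest inner products <y, b> (equivalently,
    the greatest projections on b)."""
    return np.argsort(-(Y @ b))[:M]
```

A direct check: for random Y and b, the value G(best_B(Y, M, b), b) coincides with the minimum of G over all subsets of size M.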
Suppose now that the vectors of Y have integer components. Put
$$ D = \max_{y \in \mathcal{Y}} \max_{j \in \{1, \ldots, q\}} |(y)^j|, \qquad (3) $$
where $(y)^j$ is the jth component of y. Consider the set (a multidimensional lattice with a uniform rational mesh step)
$$ \mathcal{D} = \Big\{ d \;\Big|\; d \in \mathbb{R}^q, \; (d)^j = \frac{1}{M}(v)^j, \; (v)^j \in \mathbb{Z}, \; |(v)^j| \le MD, \; j = 1, \ldots, q \Big\}. $$
Note that $|\mathcal{D}| = (2MD + 1)^q$.

Let us formulate the following algorithm for solving Problem 1-MSSC-F.

Algorithm A

Input of the algorithm: a set Y and a positive integer M.

Step 1. For each $b \in \mathcal{D}$, construct the set $\mathcal{B}(b)$ consisting of the M vectors of Y having the greatest projections on b, and calculate $G(\mathcal{B}(b), b)$.

Step 2. Find the vector $b_A = \arg\min_{b \in \mathcal{D}} G(\mathcal{B}(b), b)$ and the corresponding subset $\mathcal{B}(b_A)$. As a solution to the problem, take $\mathcal{C}_A = \mathcal{B}(b_A)$. If there are several solutions, then take any of them.

Output of the algorithm: $\mathcal{C}_A$.

Lemma 2. Suppose that, under the conditions of Problem 1-MSSC-F, the vectors of Y have integer components lying in [−D, D], C* is an optimal solution to Problem 1-MSSC-F, and C_A is the set produced by Algorithm A. Then S(C*) = S(C_A).

Proof. From (3) it follows that the center
$$ \overline{y}(\mathcal{C}) = \frac{1}{M} \sum_{y \in \mathcal{C}} y $$
of each subset $\mathcal{C} \subseteq \mathcal{Y}$ of size M lies in $\mathcal{D}$. Hence, the center $y^* = \overline{y}(\mathcal{C}^*)$ of the optimal subset C* lies in the same set. By Step 2, we have
$$ G(\mathcal{C}_A, b_A) \le G(\mathcal{B}(y^*), y^*). \qquad (4) $$
The first assertion of Lemma 1 implies the estimate
$$ S(\mathcal{C}_A) \le G(\mathcal{C}_A, b_A), \qquad (5) $$
and the second yields
$$ G(\mathcal{B}(y^*), y^*) = S(\mathcal{C}^*). \qquad (6) $$
Combining (4)–(6), we see that S(C_A) ≤ S(C*). On the other hand, since C_A is an admissible solution to Problem 1-MSSC-F, we have S(C*) ≤ S(C_A), which establishes the equality of S(C*) and S(C_A). Lemma 2 is proved.

Theorem. Under the conditions of Lemma 2, Algorithm A finds an optimal solution to Problem 1-MSSC-F in O(qN(2MD + 1)^q) time.

Proof. The optimality of the solution follows from Lemma 2. Let us estimate the time complexity of the algorithm. Step 1 is performed $|\mathcal{D}|$ times. For each $b \in \mathcal{D}$, calculating the projections on this vector requires O(qN) operations, and choosing the M vectors with the greatest projections requires O(N) operations. Computing $G(\mathcal{B}(b), b)$ takes O(qN) operations. Since $|\mathcal{D}| = (2MD + 1)^q$, the complexity of Step 1 is O(qN(2MD + 1)^q). Step 2, the search for the least element, requires O((2MD + 1)^q) operations. Thus, the time complexity of the algorithm is O(qN(2MD + 1)^q).
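Algorithm A admits a direct implementation. The following Python sketch is our own rendering (NumPy and the function name are our assumptions, not part of the paper): it enumerates the lattice D of candidate centers with step 1/M, performs Steps 1 and 2, and returns the index set C_A. By Lemma 2, on integer inputs the result is optimal, which a brute-force scan over all subsets of size M confirms on a small instance.

```python
import numpy as np
from itertools import product

def algorithm_A(Y, M):
    """Algorithm A (sketch): enumerate the lattice D of candidate centers,
    pick for each node b the M vectors with the greatest projections on b
    (Step 1), and keep the candidate minimizing G(B(b), b) (Step 2).
    Y must have integer components; runs in O(qN(2MD+1)^q) time."""
    Y = np.asarray(Y, dtype=float)
    N, q = Y.shape
    D = int(np.abs(Y).max())                       # D from formula (3)
    grid = [v / M for v in range(-M * D, M * D + 1)]
    best_val, best_set = np.inf, None
    for b in product(grid, repeat=q):              # |D| = (2MD+1)^q nodes
        b = np.array(b)
        idx = np.argsort(-(Y @ b))[:M]             # Step 1: greatest projections
        mask = np.zeros(N, dtype=bool)
        mask[idx] = True
        g = ((Y[mask] - b) ** 2).sum() + (Y[~mask] ** 2).sum()   # G(B(b), b)
        if g < best_val:                           # Step 2: minimize over D
            best_val, best_set = g, idx
    return best_set
```

The enumeration is exponential in q, so the sketch is practical only for small fixed dimensions, which is exactly the regime in which the algorithm is pseudopolynomial.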
Let us show that, for a fixed dimension of the space, the algorithm is pseudopolynomial. Indeed, since MD ≥ 1/2, we have
$$ (2MD + 1)^q = 2^q \Big( MD + \frac{1}{2} \Big)^q \le 4^q (MD)^q. $$
This implies that, under the above conditions, the running time of the algorithm is O(N(MD)^q). The running time of the algorithm of [6], which yields an optimal solution in the general case of the problem, is O(N^{2q}). Therefore, for MD < N^{2−1/q}, the pseudopolynomial Algorithm A, proposed for the particular case, is more efficient than the exact algorithm oriented to the general case. However, for the same particular case of Problem LVS, an algorithm of complexity O(NM(MD)^{q−1}) was justified in [4]. Since Problems LVS and 1-MSSC-F are polynomially equivalent, that algorithm guarantees finding an exact solution to the problem considered in this article D times faster than Algorithm A. Nevertheless, the above approach to constructing the algorithm can be useful as one more efficient instrument for solving problems with similar statements. In particular, this (in essence, mesh) approach is more amenable to parallelization. Moreover, it can be used to construct a fully polynomial-time approximation scheme (FPTAS) for the particular case of the problem in which the dimension of the space is fixed.

CONCLUSION

We address one of the topical strongly NP-hard problems of partitioning a finite set of vectors in Euclidean space. It is proved that, for a fixed dimension of the space, the problem is solvable in polynomial time. We justify an exact pseudopolynomial algorithm that makes it possible to find an optimal solution to the problem in the case when the dimension of the space is fixed and the components of the vectors are integers.
The mesh approach of this article to constructing an algorithmic solution in fact opens the following way of constructing an FPTAS for a fixed dimension of the space: approximate the center of the unknown cluster by a node of a specially constructed mesh. The justification of such a scheme is an important topic for further research. One more important direction consists in constructing algorithms with quality estimates for the generalization of the problem under consideration to the case when the number of clusters with unknown centers is greater than 1.

ACKNOWLEDGMENTS

The authors were supported by the Russian Foundation for Basic Research (projects nos. 15-01-00462 and 13-07-00070).

REFERENCES
1. A. A. Ageev, A. V. Kel'manov, and A. V. Pyatkin, "NP-Hardness of the Euclidean Max-Cut Problem," Dokl. Akad. Nauk 456 (5), 511–513 (2014) [Dokl. Math. 89 (3), 343–345 (2014)].
2. A. E. Baburin, E. Kh. Gimadi, N. I. Glebov, and A. V. Pyatkin, "The Problem of Finding a Subset of Vectors with the Maximum Total Weight," Diskretn. Anal. Issled. Oper. Ser. 2, 14 (1), 32–42 (2007) [J. Appl. Indust. Math. 2 (1), 32–38 (2008)].
3. A. E. Galashov and A. V. Kel'manov, "A 2-Approximate Algorithm to Solve One Problem of the Family of Disjoint Vector Subsets," Avtomat. i Telemekh. No. 4, 5–19 (2014) [Autom. Remote Control 75 (4), 595–606 (2014)].
4. E. Kh. Gimadi, Yu. V. Glazkov, and I. A. Rykov, "On Two Problems of Choosing Some Subset of Vectors with Integer Coordinates That Has Maximum Norm of the Sum of Elements in Euclidean Space," Diskretn. Anal. Issled. Oper. 15 (4), 30–43 (2008) [J. Appl. Indust. Math. 3 (3), 343–352 (2009)].
5. E. Kh. Gimadi, A. V. Kel'manov, M. A. Kel'manova, and S. A. Khamidullin, "A Posteriori Detection of a Quasiperiodic Fragment with a Given Number of Repetitions in a Numerical Sequence," Sibirsk. Zh. Industr. Mat. 9 (1), 55–74 (2006).
6. E. Kh. Gimadi, A. V. Pyatkin, and I. A. Rykov, "On Polynomial Solvability of Some Problems of a Vector Subset Choice in a Euclidean Space of Fixed Dimension," Diskretn. Anal. Issled. Oper. 15 (6), 11–19 (2008) [J. Appl. Indust. Math. 4 (1), 48–53 (2010)].
7. A. V. Dolgushev and A. V. Kel'manov, "On the Algorithmic Complexity of a Problem in Cluster Analysis," Diskretn. Anal. Issled. Oper. 17 (2), 39–45 (2010) [J. Appl. Indust. Math. 5 (2), 191–194 (2011)].
8. A. V. Dolgushev and A. V. Kel'manov, "An Approximation Algorithm for Solving a Problem of Cluster Analysis," Diskretn. Anal. Issled. Oper. 18 (2), 29–40 (2011) [J. Appl. Indust. Math. 5 (4), 551–558 (2011)].
9. A. V. Dolgushev, A. V. Kel'manov, and V. V. Shenmaier, "A Polynomial Approximation Scheme for a Problem of Cluster Analysis," in Proceedings of the 9th International Conference "Intellectualization of Information Processing," Budva, Montenegro, September 16–22, 2012 (Torus Press, Moscow, 2012), pp. 242–244.
10. I. I. Eremin, E. Kh. Gimadi, A. V. Kel'manov, A. V. Pyatkin, and M. Yu. Khachai, "2-Approximation Algorithm for Finding a Clique with Minimum Weight of Vertices and Edges," Trudy Inst. Mat. Mekh. Ural. Otdel. Ross. Akad. Nauk 19 (2), 134–143 (2013) [Proc. Steklov Inst. Math. 284 (Suppl. 1), S87–S95 (2014)].
11. A. V. Kel'manov, "Off-Line Detection of a Quasi-Periodically Recurring Fragment in a Numerical Sequence," Trudy Inst. Mat. Mekh. Ural. Otdel. Ross. Akad. Nauk 14 (2), 81–88 (2008) [Proc. Steklov Inst. Math. 263 (Suppl. 2), S84–S92 (2008)].
12. A. V. Kel'manov, "On the Complexity of Some Data Analysis Problems," Zh. Vychisl. Mat. Mat. Fiz. 50 (11), 2045–2051 (2010) [Comput. Math. Math. Phys. 50 (11), 1941–1947 (2010)].
13. A. V. Kel'manov, "On the Complexity of Some Cluster Analysis Problems," Zh. Vychisl. Mat. Mat. Fiz. 51 (11), 2106–2112 (2011) [Comput. Math. Math. Phys. 51 (11), 1983–1988 (2011)].
14. A. V. Kel'manov and A. V. Pyatkin, "On the Complexity of a Search for a Subset of 'Similar' Vectors," Dokl. Akad. Nauk 421 (5), 590–592 (2008) [Dokl. Math. 78 (1), 574–575 (2008)].
15. A. V. Kel'manov and A. V. Pyatkin, "Complexity of Certain Problems of Searching for Subsets of Vectors and Cluster Analysis," Zh. Vychisl. Mat. Mat. Fiz. 49 (11), 2059–2067 (2009) [Comput. Math. Math. Phys. 49 (11), 1966–1971 (2009)].
16. A. V. Kel'manov and A. V. Pyatkin, "On Complexity of Some Problems of Cluster Analysis of Vector Sequences," Diskretn. Anal. Issled. Oper. 20 (2), 47–57 (2013) [J. Appl. Indust. Math. 7 (3), 363–369 (2013)].
17. A. V. Kel'manov, S. M. Romanchenko, and S. A. Khamidullin, "Exact Pseudopolynomial-Time Algorithms for Some NP-Hard Problems of Searching for a Vector Subsequence," Zh. Vychisl. Mat. Mat. Fiz. 53 (1), 143–153 (2013).
18. A. V. Kel'manov and V. I. Khandeev, "A 2-Approximation Polynomial Algorithm for a Clustering Problem," Diskretn. Anal. Issled. Oper. 20 (4), 36–45 (2013) [J. Appl. Indust. Math. 7 (4), 515–521 (2013)].
19. A. V. Kel'manov and V. I. Khandeev, "A Randomized Algorithm for Two-Cluster Partition of a Set of Vectors," Zh. Vychisl. Mat. Mat. Fiz. 55 (2), 335–344 (2015) [Comput. Math. Math. Phys. 55 (2), 330–339 (2015)].
20. D. Aloise, A. Deshpande, P. Hansen, and P. Popat, "NP-Hardness of Euclidean Sum-of-Squares Clustering," Mach. Learn. 75 (2), 245–248 (2009).
21. A. K. Jain, "Data Clustering: 50 Years Beyond K-means," Pattern Recognit. Lett. 31 (8), 651–666 (2010).
22. M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness (Freeman, San Francisco, CA, 1979).
23. P. Hansen and B. Jaumard, "Cluster Analysis and Mathematical Programming," Math. Program. Ser. A 79 (1–3), 191–215 (1997).
24. P. Hansen, B. Jaumard, and N. Mladenovic, "Minimum Sum of Squares Clustering in a Low Dimensional Space," J. Classif. 15 (1), 37–55 (1998).
25. M. Inaba, N. Katoh, and H. Imai, "Applications of Weighted Voronoi Diagrams and Randomization to Variance-Based k-Clustering," in Proceedings of the 10th Symposium on Computational Geometry, Stony Brook, NY, USA, June 6–8, 1994 (ACM, New York, 1994), pp. 332–339.
26. J. B. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations," in Proceedings of the 5th Berkeley Symposium on Mathematics, Statistics, and Probability, Berkeley, USA, June 21 – July 18, 1965 and December 27, 1965 – January 7, 1966, Ed. by L. M. Le Cam and J. Neyman, Vol. 1 (Univ. of California Press, Berkeley, 1967), pp. 281–297.
27. M. R. Rao, "Cluster Analysis and Mathematical Programming," J. Amer. Statist. Assoc. 66, 622–626 (1971).