Constrained Clustering Using SAT
Jean-Philippe Métivier, Patrice Boizumault, Bruno Crémilleux, Mehdi Khiari, and Samir Loudni
University of Caen Basse-Normandie – GREYC (CNRS UMR 6072), Campus II, Côte de Nacre, 14000 Caen - France
{firstname.lastname}@unicaen.fr
Abstract. Constrained clustering (finding clusters that satisfy user-specified constraints) aims at providing more relevant clusters by adding constraints that enforce required properties. Leveraging recent progress in declarative and constraint-based pattern mining, we propose an effective constrained clustering approach that handles a large set of constraints described in a generic constraint-based language. Starting from an initial solution, queries can easily be refined in order to focus on more interesting clustering solutions. We show how each constraint (and query) is encoded in SAT and solved by taking advantage of several features of SAT solvers. Experiments performed using MiniSat on several datasets from the UCI repository show the feasibility and the advantages of our approach.
1
Introduction
Clustering is one of the core problems in data mining. Clustering aims at partitioning data into groups (clusters) so that transactions occurring in the same cluster are similar to each other but different from those appearing in other clusters [12]. The usual clustering problem is designed to find clusterings satisfying a nearest representative property, while constrained clustering [3,19] aims at obtaining more relevant clusters by adding constraints that enforce properties expressing background information on the problem at hand. Constraints are of various types: (1) relationships between data objects (e.g., a set of objects must, or must not, be in the same cluster [20]), (2) the description of the clusters (e.g., a cluster must have a minimal or a maximal size [2]), (3) both objects and clusters (e.g., a given object must be in a given cluster), (4) the characteristics of the clustering (e.g., the number of clusters), etc. Traditional clustering algorithms do not provide effective mechanisms to make use of this information. The goal of this paper is to propose a generic approach to fill this gap.
Recently, several works have investigated relationships between data mining and constraint programming (CP) to revisit data mining tasks in a declarative and generic way [6,14,15]. The user models a problem and expresses queries by specifying which constraints need to be satisfied. This process greatly facilitates the search for knowledge and for models such as clusterings. The approach is supported by a constraint-based language [17]: it is sufficient to change the specification in terms of constraints to address different pattern mining problems. In the spirit of this promising avenue, we propose an effective constrained clustering approach handling a large set of constraints.
The paper brings the following contributions. First, we use the declarative modeling principle of CP to define a constrained clustering approach that takes into account a large set of constraints on objects, on the description of the clusters, and on the clustering process itself. By nature, clustering proceeds by iteratively refining queries until a satisfactory solution is found. Our method naturally integrates this stepwise refinement of queries in order to focus on more interesting clustering solutions. Unlike the many clustering methods that rely on heuristics or greedy algorithms, our method is complete. Second, we define an efficient SAT encoding which exploits features of SAT solvers (e.g., binary clauses, unit propagation, sorting networks) to solve the queries. Finally, an experimental study using MiniSat shows the feasibility and the effectiveness of our method on several datasets from the UCI repository.
Section 2 provides the background on the constraint-based language. Section 3 describes our method for constrained clustering, with examples of constraints coming from the background information of the problem at hand. Section 4 explains how queries and constraints of the language are encoded and solved with SAT. Section 5 shows the effectiveness of our approach through several experiments. Section 6 presents related work.
J. Hollmén, F. Klawonn, and A. Tucker (Eds.): IDA 2012, LNCS 7619, pp. 207–218, 2012. © Springer-Verlag Berlin Heidelberg 2012
2
Background: Constraint-Based Language
The constraint programming methodology is by nature declarative. This explains why the relationships between CP and data mining have received considerable attention as a route towards generic and declarative data mining methods [6,14,15]. This section sketches our constraint-based language, which enables us to specify different pattern mining problems in terms of constraints [17]. This language forms the first step of our constrained clustering method proposed in Section 3. In the remainder of this section, we focus only on the primitives of the language that are used in this paper.
Let I be a set of n distinct literals called items; an itemset (or pattern) is a non-empty subset of I. The language of itemsets corresponds to L_I = 2^I \ {∅}. A transactional dataset T is a multi-set of m itemsets of L_I. Each itemset, usually called a transaction or object, is a database entry. For instance, Table 1 gives a transactional dataset T with m=11 transactions t1, ..., t11 described by n=10 items. This toy dataset is inspired by the Zoo dataset from the UCI repository.
Terms are built from constants, variables, operators, and function symbols. Constants are either numerical values, patterns, or transactions. Variables, noted Xj for 1 ≤ j ≤ k, represent the unknown patterns (or clusters). Operators can be set operators (such as ∩, ∪, \) or numerical operators (such as +, −, ×, /). Built-in function symbols involve one or several terms:
– cover(Xj) = {t | t ∈ T, Xj ⊆ t} is the set of transactions covered by Xj.
– freq(Xj) = |{t | t ∈ T, Xj ⊆ t}| is the frequency of pattern Xj.
– size(Xj) = |{i | i ∈ I, i ∈ Xj}| is the size of pattern Xj.
– overlapItems(Xi, Xj) = |Xi ∩ Xj| is the number of items shared by both Xi and Xj.
– overlapTransactions(Xi, Xj) = |cover(Xi) ∩ cover(Xj)| is the number of transactions covered by both Xi and Xj.
Constraints are relations over terms that can be satisfied or not. There are three kinds of built-in constraints: numerical constraints (like <, ≤, =, ≠, ≥, >), set constraints (like =, ≠, ∈, ∉, ⊂, ⊆), and dedicated constraints like:
– isNotEmpty(Xj) is satisfied iff Xj ≠ ∅
– coverTransactions([X1, ..., Xk]) is satisfied iff each transaction is covered by at least one pattern (⋃_{1≤i≤k} cover(Xi) = T)
– noOverlapTransactions([X1, ..., Xk]) is satisfied iff all i, j s.t. 1≤i