Feature Term Subsumption using Constraint Programming with Basic Variable Symmetry

Santiago Ontañón¹ and Pedro Meseguer²

¹ Computer Science Department, Drexel University, Philadelphia, PA, USA 19104. [email protected]
² IIIA-CSIC, Artificial Intelligence Research Institute, Spanish Scientific Research Council, 08193 Bellaterra, Spain. [email protected]
Abstract. Feature terms are a generalization of first-order terms which have recently received increased attention for their usefulness in structured machine learning applications. One of the main obstacles to their wide usage is that their basic operation, subsumption, has a very high computational cost. Constraint Programming is a very suitable technique to implement that operation, in some cases providing orders of magnitude speed-ups with respect to the standard subsumption approach. In addition, exploiting a basic variable symmetry that often appears in feature term databases yields substantial additional savings. We provide experimental results on the benefits of this approach.
1 Introduction
Structured machine learning (SML) [8] focuses on developing machine learning techniques for rich representations such as feature terms [2, 7, 16], Horn clauses [12], or description logics [6]. SML has received an increasing amount of interest in recent years for several reasons, such as its ability to handle complex data in a natural way and to support sophisticated forms of inference. SML techniques are of special interest in biomedical applications, where they can reason directly over the molecular structure of complex chemical and biochemical compounds. One of the major difficulties in SML is that the basic operations required to design machine learning algorithms for structured representations have a high computational complexity. Consequently, techniques for efficiently implementing such operations are key to applying SML in real-life settings with large, complex data. This paper focuses on feature terms, a generalization of first-order terms introduced in theoretical computer science to formalize object-oriented capabilities of declarative languages, which has recently received increased attention for its usefulness in SML applications [3, 5, 14, 15, 16]. The most basic operation among feature terms is subsumption, which
determines whether a given feature term is more general than another, and is the essential component for defining machine learning algorithms. Inductive machine learning methods work by generating hypotheses (often in the form of rules) that are generalizations of the training instances provided. The generality relation (subsumption) states whether a hypothesis covers a training instance or not, and is thus one of the most fundamental operations in inductive machine learning. It is well known that subsumption between feature terms has a high computational cost if we allow set-valued features [7], which are necessary to represent most structured machine learning datasets. Constraint Programming (CP) is a very suitable technique to implement subsumption. We present a CP model of this operation for set-valued feature terms. In some cases, our CP implementation of feature term subsumption provides speed-ups of orders of magnitude with respect to the standard subsumption approach. In addition, when this CP implementation is enhanced with symmetry breaking constraints that exploit basic variable symmetries in feature terms, we obtain substantial extra gains in performance. Our CP implementation uses JaCoP, an open-source constraint library for Java [10]. We are aware of a previous use of CP to compute θ-subsumption in ML [13]. However, feature term subsumption is significantly different from θ-subsumption, which is defined as the existence of a variable substitution between logical clauses, without considering essential elements of feature terms such as sets or loops [15].
2 Background
Feature Terms. Feature terms [2, 7] are a generalization of first-order terms, introduced in theoretical computer science to formalize object-oriented declarative languages. Feature terms correspond to a different subset of first-order logic than description logics, although with the same expressive power [1]. Feature terms are defined by their signature Σ = ⟨S, F, ≤, V⟩: S is a set of sort symbols, including the most general sort ("any"); ≤ is an order relation inducing a single inheritance hierarchy in S, where s ≤ s′ means that s is more general than or equal to s′, for any s, s′ ∈ S ("any" is more general than any s which, in turn, is more general than "none"); F is a set of feature symbols; and V is a set of variable names. We write a feature term ψ as

ψ ::= X : s [f1 ≐ Ψ1, ..., fn ≐ Ψn]

where ψ points to the root variable X (which we denote root(ψ)) of sort s; X ∈ V, s ∈ S, fi ∈ F, and each Ψi may be either another variable Y ∈ V or a set of variables {X1, ..., Xm}. When Ψi is a set {X1, ..., Xm}, each element in the set must be different. An example of a feature term appears in Figure 1: a train (variable X1) composed of two cars (variables X2 and X3). This term has 8 variables and one set-valued feature (indicated by a dotted line): the cars of X1.
Fig. 1. A simple train represented as a feature term.
To make the description uniform, constants (such as integers) are treated as variables of a particular sort: for each variable X in a term with a constant value k of sort s, we consider that X is a regular variable of a special sort sk, and for each different constant k we create a new sort sk, subsort of s. For example, if a variable X had the integer value 5, we would create a new sort s5 (subsort of integer) and consider that X is a regular variable of sort s5. Thanks to this representation change, we can forget about constants and consider all variables in the same way. The set of variables of a term ψ is vars(ψ), the set of features of a variable X is features(X), and sort(X) is its sort. Feature terms can be represented as directed labelled graphs. Given a variable X, its parents are the nodes connected to X by incoming links, and its children are the nodes connected to X by outgoing links.

Operations on Feature Terms. The basic operation between feature terms is subsumption, which determines whether a term is more general than (or equal to) another one.

Definition 1. (Subsumption) A feature term ψ1 subsumes another one ψ2 (ψ1 ⊑ ψ2)³ when there is a total mapping m : vars(ψ1) → vars(ψ2) such that:
– root(ψ2) = m(root(ψ1)),
– for each X ∈ vars(ψ1):
  • sort(X) ≤ sort(m(X)),
  • for each f ∈ features(X), where X.f = Ψ1 and m(X).f = Ψ2:
    ∗ ∀Y ∈ Ψ1, ∃Z ∈ Ψ2 such that m(Y) = Z,
    ∗ ∀Y, Z ∈ Ψ1, Y ≠ Z ⇒ m(Y) ≠ m(Z),
i.e., each variable in Ψ1 is mapped into Ψ2, and different variables in Ψ1 have different mappings.

Subsumption induces a partial order among feature terms: the pair ⟨L, ⊑⟩ is a poset for a given set of terms L containing the infimum ⊥ and the supremum ⊤ with respect to the subsumption order. It is important to note that while subsumption in feature terms is related to θ-subsumption (the mapping m above plays the role of the variable substitution in θ-subsumption), there are two key differences: sorted variables, and the semantics of sets (two variables in a set cannot have the same mapping, whereas in θ-subsumption there is no such restriction on the variable substitutions). Since feature terms can be represented as labelled graphs, it is natural to relate the problem of feature term subsumption to subgraph isomorphism.
³ In description logics notation, subsumption is written in the reverse order, since it is seen as "set inclusion" of their interpretations. In machine learning, A ⊑ B means that A is more general than B, while in description logics it has the opposite meaning.
Fig. 2. A bigger feature term subsumes a smaller feature term: ψ2 ⊑ ψ1.
However, subsumption cannot be modeled as subgraph isomorphism, because larger feature terms can subsume smaller feature terms while the corresponding graphs are not isomorphic. See for example the two terms shown in Figure 2, where a term ψ2 with three variables subsumes a term ψ1 with two variables (mapping: m(y1) = x1, m(y2) = x2, m(y3) = x1).

Constraint Satisfaction. A Constraint Satisfaction Problem (CSP) involves a finite set of variables, each taking a value in a finite discrete domain. Subsets of variables are related by constraints that specify the permitted value tuples. Formally,

Definition 2. A CSP is a tuple (X, D, C), where X = {x1, ..., xn} is a set of n variables; D = {D(x1), ..., D(xn)} is a collection of finite discrete domains, where D(xi) is the set of possible values of xi; and C is a set of constraints. Each constraint c ∈ C is defined on an ordered set of variables var(c) (its scope); the value tuples permitted by c form rel(c) ⊆ ∏xj∈var(c) D(xj). A solution is an assignment of values to variables such that all constraints are satisfied. CSP solving is NP-complete.
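To make these definitions concrete, the following is a minimal sketch of one possible Java representation of sorted variables and set-valued features. The class and field names (Sort, FTVar, features) are illustrative only and do not correspond to any actual implementation; constants would be handled by generating the subsorts sk described above.

```java
import java.util.*;

// A sort with single inheritance; isMoreGeneralThan implements the order <=.
class Sort {
    final String name;
    final Sort parent;                        // null only for the root sort "any"
    Sort(String name, Sort parent) { this.name = name; this.parent = parent; }
    boolean isMoreGeneralThan(Sort other) {   // s <= s' iff s is s' or an ancestor of it
        for (Sort s = other; s != null; s = s.parent)
            if (s == this) return true;
        return false;
    }
}

// A feature term variable: a sort plus features mapping to lists of variables.
// A singleton list models an ordinary feature; a longer list models a
// set-valued feature, whose elements are all distinct by definition.
class FTVar {
    final String name;
    final Sort sort;
    final Map<String, List<FTVar>> features = new LinkedHashMap<>();
    FTVar(String name, Sort sort) { this.name = name; this.sort = sort; }
}

class FeatureTermSketch {
    public static void main(String[] args) {
        // Part of the term of Figure 1: a train X1 whose set-valued feature
        // "cars" contains two distinct cars X2 and X3.
        Sort any = new Sort("any", null);
        Sort train = new Sort("train", any), car = new Sort("car", any);
        FTVar x1 = new FTVar("X1", train);
        FTVar x2 = new FTVar("X2", car), x3 = new FTVar("X3", car);
        x1.features.put("cars", List.of(x2, x3));
        System.out.println(any.isMoreGeneralThan(car));   // true: any <= car
    }
}
```

Representing a set-valued feature as a list of distinct variables keeps the sketch close to the definition of ψ given above.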
3 Variable Symmetry in Feature Terms
A variable symmetry in a feature term ψ is a bijective mapping σ : vars(ψ) → vars(ψ) such that applying σ to ψ does not modify ψ in any significant way. Often, a basic form of variable symmetry, called interchangeable variables, appears in feature terms.⁴ Formally,

Definition 3. Two variables X and Y of vars(ψ) are interchangeable in ψ if, exchanging X and Y in ψ, the resulting term does not suffer any syntactic change with respect to the original ψ.

Clearly, if X and Y are interchangeable, neither of them can be the root of ψ. In addition, they have to share the same sort, sort(X) = sort(Y). In fact, two variables are interchangeable if and only if they have the same sort, the same parents and the same children, as proved next.

Proposition 1. Two variables X and Y of vars(ψ) are interchangeable in ψ if and only if they are of the same sort, with the same parents and the same children in ψ through the same features.
⁴ In CSP terms, this type of symmetry is similar to that between pairs of CSP variables in a graph coloring clique, all variables sharing the same domain.
Fig. 3. The chemical structure of dibromomethane and its representation as a feature term. Variables X2, X3, X4 and X5 are in the same set. X2 is interchangeable with X3, and X4 is interchangeable with X5.
Proof. Obviously, if X and Y are of the same sort, with the same parents and children through the same features, exchanging X and Y does not cause any syntactic change in ψ, so they are interchangeable. Conversely, if X and Y are not of the same sort, exchanging them causes syntactic changes in ψ. Assuming they share the same sort, if they do not have the same parents or the same children, exchanging X and Y causes syntactic changes in ψ. The same happens when, although having the same parents and children, they are connected to them by different features.

Figure 3 shows an example of interchangeable variables in a feature term describing the chemical structure of dibromomethane. The Br atoms are all equivalent, so they can be permuted freely without any change in the term; the same happens with the H atoms. Observe that a Br atom is not interchangeable with an H atom, because they are of different sorts. As a result, variables X2 and X3 are interchangeable, and so are X4 and X5.⁵

⁵ Exchanging a Br atom with an H atom would generate another, more elaborate, symmetry than the one we consider. Here we restrict ourselves to the most basic notion of symmetry. Nevertheless, exploiting this kind of symmetry results in very good savings.
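Proposition 1 translates directly into a test. Below is a minimal sketch reusing the illustrative FTVar and Sort classes from Section 2 (method names are our own, not part of any actual implementation); parent links are recovered by scanning all variables of the term.

```java
import java.util.*;

class Interchangeability {
    // Proposition 1 as code: X and Y are interchangeable iff they have the
    // same sort, the same parents and the same children, through the same
    // features.
    static boolean interchangeable(FTVar x, FTVar y, List<FTVar> vars) {
        return x.sort == y.sort            // same sort (one Sort object per sort)
            && sameChildren(x, y)          // same children through same features
            && parentLinks(x, vars).equals(parentLinks(y, vars)); // same parents
    }

    // Same children through the same features; order inside a set is ignored.
    private static boolean sameChildren(FTVar x, FTVar y) {
        if (!x.features.keySet().equals(y.features.keySet())) return false;
        for (String f : x.features.keySet())
            if (!new HashSet<>(x.features.get(f))
                     .equals(new HashSet<>(y.features.get(f)))) return false;
        return true;
    }

    // The set of (parent, feature) links pointing to v, encoded as strings.
    private static Set<String> parentLinks(FTVar v, List<FTVar> vars) {
        Set<String> links = new HashSet<>();
        for (FTVar p : vars)
            for (Map.Entry<String, List<FTVar>> e : p.features.entrySet())
                if (e.getValue().contains(v)) links.add(p.name + "." + e.getKey());
        return links;
    }
}
```

For the term of Figure 3 this predicate accepts the pairs (X2, X3) and (X4, X5), and rejects any Br/H pair, since those variables differ in sort.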
4 Subsumption as Constraint Satisfaction
Testing subsumption between feature terms ψ1 and ψ2 can be seen as a CSP:

– CSP variables: for each feature term variable X ∈ vars(ψ1) there is a CSP variable x that contains its mapping m(X) in ψ2. To avoid confusion between the two kinds of variables, feature term variables are written uppercase and CSP variables lowercase; the same letter denotes corresponding variables (x is the CSP variable that represents the feature term variable X; for X we say "feature term variable" or simply "variable", for x we say "CSP variable").
– CSP domains: the domain of each CSP variable is the set vars(ψ2), except for the CSP variable of root(ψ1), whose domain is the singleton {root(ψ2)}.
– CSP constraints: three types of constraints are posted:
  • Constraints on sorts: for each X ∈ vars(ψ1), sort(X) ≤ sort(x).
  • Constraints on features: for each variable X ∈ vars(ψ1) and feature f ∈ features(X), for each variable Y ∈ X.f there must exist a variable Z ∈ x.f such that y = Z.
  • Constraints on difference: if X.f = {Y1, ..., Yk}, where all the Yi are different by definition, the constraint all-different(y1, ..., yk) must be satisfied.

Since ψ1 and ψ2 have a finite number of variables, it is easy to see that there is a finite number of CSP variables (exactly |vars(ψ1)|) and all their domains are finite (assuming that the CSP variable x1 corresponds to root(ψ1), the domain of x1 is {root(ψ2)}, and the common domain of the other CSP variables is the set vars(ψ2)). If n is the maximum number of variables and m is the maximum number of features, the maximum number of constraints is:

– n unary constraints on sorts (one per CSP variable),
– O(n²m) binary constraints on features (the number of possible pairs of variables times the maximum number of features),
– O(nm) n-ary constraints on difference (the number of variables, each having one all-different constraint, times the maximum number of features).

Constraints on sorts can be easily tested using the ≤ relation among sorts; constraints on features and on difference are directly implemented, since they only involve basic tests of equality and difference. Moreover, it is easy to verify that if the previous constraints are satisfied then the definition of subsumption is satisfied, and vice versa; therefore, the previous CSP is equivalent to subsumption in feature terms. In practice, n ranges from a few variables in simple machine learning problems to hundreds or thousands for complex biomedical datasets. Most machine learning datasets do not have more than a few different feature labels, so m usually stays low. Moreover, in practice the actual number of constraints is far below the maximum computed above. Consequently, a CP implementation of feature term subsumption is feasible. We have built one, and the results are detailed in Section 5.
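To make the model concrete, here is a minimal, self-contained sketch of its semantics in plain Java; our actual implementation builds this model in JaCoP [10] instead and is not reproduced here. The sketch reuses the illustrative FTVar and Sort classes from Section 2, enumerates complete mappings, and for clarity checks the three constraint classes only on complete assignments, whereas a CP solver propagates them during search. All names are illustrative.

```java
import java.util.*;

class SubsumptionCSP {
    // Returns true iff psi1 subsumes psi2 (Definition 1), by searching for a
    // total mapping m : vars(psi1) -> vars(psi2) that satisfies the three
    // constraint classes of the CSP model (sorts, features, difference).
    static boolean subsumes(FTVar root1, List<FTVar> vars1,
                            FTVar root2, List<FTVar> vars2) {
        Map<FTVar, FTVar> m = new HashMap<>();
        m.put(root1, root2);                  // the root's domain is a singleton
        List<FTVar> rest = new ArrayList<>(vars1);
        rest.remove(root1);
        return search(rest, 0, m, vars2);
    }

    private static boolean search(List<FTVar> order, int i,
                                  Map<FTVar, FTVar> m, List<FTVar> dom) {
        if (i == order.size()) return consistent(m);
        for (FTVar candidate : dom) {         // domain of every other CSP variable
            m.put(order.get(i), candidate);
            if (search(order, i + 1, m, dom)) return true;
        }
        m.remove(order.get(i));
        return false;
    }

    // Checks the three constraint classes on a complete assignment.
    private static boolean consistent(Map<FTVar, FTVar> m) {
        for (Map.Entry<FTVar, FTVar> e : m.entrySet()) {
            FTVar X = e.getKey(), mX = e.getValue();
            if (!X.sort.isMoreGeneralThan(mX.sort)) return false;  // sorts
            for (Map.Entry<String, List<FTVar>> f : X.features.entrySet()) {
                List<FTVar> img = mX.features.get(f.getKey());
                if (img == null) return false;
                Set<FTVar> used = new HashSet<>();
                for (FTVar Y : f.getValue()) {
                    FTVar mY = m.get(Y);
                    if (!img.contains(mY)) return false;           // features
                    if (!used.add(mY)) return false;               // all-different
                }
            }
        }
        return true;
    }
}
```

The singleton domain of root(ψ1) and the common domain vars(ψ2) of the remaining CSP variables appear directly in subsumes.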
4.1 Interchangeable Variables
It is well known that symmetry exploitation can dramatically speed up CP implementations because it causes substantial search reductions [9]. In this section we exploit the simplest variable symmetry in feature terms, variable interchangeability (Definition 3), inside the CP model of feature term subsumption. Imagine that we want to test ψ1 ⊑ ψ2 and there are two interchangeable variables X and Y in ψ1. Interchangeability implies that they have the same parents through the same labels, so they are in the same set; consequently, m(X) ≠ m(Y). Since X and Y are interchangeable, any mapping m satisfying the subsumption conditions for X will also be valid for Y. Therefore, assuming that m(X) < m(Y), there is another mapping m′ (symmetric to m) that is equal to m except that it permutes the images of X and Y: m′(X) = m(Y) and m′(Y) = m(X). Obviously, m′(X) > m′(Y). Since m and m′ are symmetric mappings,
we choose one of them by adding the symmetry breaking constraint m(X) < m(Y). Consequently, for any pair X, Y of interchangeable variables in ψ1 we add a symmetry breaking constraint m(X) < m(Y). Often we find subsets {X1, ..., Xk} of mutually interchangeable variables in ψ1, which are also subject to the constraint all-different(m(X1), ..., m(Xk)). This would add a quadratic number of symmetry breaking constraints:

m(Xi) < m(Xj),  1 ≤ i < j ≤ k.

However, as Puget pointed out in [17], many of these constraints are redundant, and a linear number of symmetry breaking constraints:

m(Xi) < m(Xi+1),  i = 1, ..., k−1

is enough to break all symmetries among interchangeable variables.
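As an illustrative sketch, the following predicate enforces the linear chain m(X1) < ... < m(Xk) for each interchangeability class on top of the plain-Java search of Section 4, assuming the classes are computed as in Section 3 and that vars(ψ2) carries a fixed arbitrary index. In a CP model these would instead be posted as k−1 binary less-than constraints per class and propagated by the solver during search; all names here are illustrative.

```java
import java.util.*;

class SymmetryBreaking {
    // Puget-style symmetry breaking [17]: for each class {X1,...,Xk} of
    // mutually interchangeable variables, keep only the mappings where
    // m(X1) < m(X2) < ... < m(Xk) under a fixed index of vars(psi2).
    static boolean respectsChains(Map<FTVar, FTVar> m,
                                  List<List<FTVar>> classes,
                                  Map<FTVar, Integer> index2) {
        for (List<FTVar> cls : classes)
            for (int i = 0; i + 1 < cls.size(); i++) {
                Integer a = index2.get(m.get(cls.get(i)));
                Integer b = index2.get(m.get(cls.get(i + 1)));
                // Only check pairs whose images are already assigned.
                if (a != null && b != null && a >= b) return false; // m(Xi) < m(Xi+1)
            }
        return true;
    }
}
```

In the naive search sketch this predicate simply discards symmetric assignments; a real CP solver additionally prunes the domains of the chained variables as soon as any of them is assigned.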
5 Experimental Results
In order to evaluate our model, we compared the time required to compute subsumption by a standard implementation of subsumption in feature terms [4] with (i) our CP implementation and (ii) the same CP implementation enhanced with symmetry breaking constraints. We generated 1500 pairs of feature terms using the examples in two relational machine learning data sets as the source of terms: trains and predictive toxicology. The trains data set was originally introduced by Michalski as a structured machine learning challenge [11]. Each instance represents a train (different instances have different numbers of cars, cargoes and other properties). Since the size of each instance is different, this dataset cannot be represented using a standard propositional representation, and a relational representation is required. In the toxicology dataset [5], each instance represents the chemical structure (atoms and their links) of a chemical compound. This is a very complex data set, with some instances representing chemical compounds with a large number of atoms. The terms used in our experiments contain between 5 and 138 variables each, and some of them have up to 76 variables belonging to some set.

Fig. 4. Time required to compute subsumption in real-world instances. Horizontal axis: time required to compute subsumption by the standard approach; vertical axis: time required by the CP approach; dots represent subsumption instances. Above the grey line lie instances for which the standard approach is faster than the CP implementation; below the grey line the opposite occurs. Square dots correspond to the CP implementation without any symmetry breaking, while triangular dots correspond to the CP implementation with symmetry breaking constraints.

Figure 4 shows the results of our experiments, where each dot represents one of the 1500 pairs of terms used for our evaluation. The horizontal axis (in logarithmic scale) shows the time in seconds required by the traditional method, and the vertical axis (also logarithmic) shows the time required using CP. Square dots correspond to the CP implementation without symmetry breaking, while triangular dots are for CP with symmetry breaking constraints. Dots that lie below the grey line correspond to problems where CP is faster. We observe that the CP implementation without symmetry breaking is in general faster than the traditional approach, since most square dots are below the grey line (in only 56 cases out of 1500 is the CP implementation slower than the traditional method). When adding symmetry breaking constraints we observe a dramatic efficiency improvement: almost all triangles are below the grey line (only 34 triangles are above it) and all instances are solved in less than 0.1 seconds. Some instances show improvements of up to 8 orders of magnitude (observe the triangle on the horizontal axis located at 100000). Adding up the time required to perform all 1500 tests, the traditional method required 669581 seconds, the CP implementation without symmetry breaking required 345.5 seconds, and CP with symmetry breaking required 8.3 seconds (roughly three and almost five orders of magnitude of improvement with respect to the traditional method, respectively). Although benefits may vary on other datasets (different from trains [11] and toxicology [5]), these results clearly show the benefits of the CP approach with respect to the traditional method, and the extra benefits obtained by adding symmetry breaking constraints to the CP implementation. The type of symmetry exploited in our approach is quite simple, but it is very frequent in biomedical data, where feature terms typically represent molecules.
6 Conclusions
A key obstacle when applying relational machine learning and ILP techniques to complex domains is that basic operations like subsumption have a high computational cost. We presented modeling contributions, including the exploitation of a basic variable symmetry, that allowed us to implement subsumption using CP. As a result, this operation is performed more efficiently than with traditional methods. As future work, we would like to improve our CP models to further increase performance; the study of more elaborate symmetries seems a promising avenue of research. We would also like to assess the gain in performance of inductive learning algorithms using our CP-based solutions.
References

[1] Aït-Kaci, H.: Description logic vs. order-sorted feature logic. In: Description Logics (2007)
[2] Aït-Kaci, H., Podelski, A.: Towards a meaning of LIFE. Tech. Rep. 11, Digital Research Laboratory (1992)
[3] Aït-Kaci, H., Sasaki, Y.: An axiomatic approach to feature term generalization. In: Proc. 12th ECML, pp. 1–12 (2001)
[4] Arcos, J.L.: The NOOS representation language. Ph.D. thesis, Universitat Politècnica de Catalunya (1997)
[5] Armengol, E., Plaza, E.: Lazy learning for predictive toxicology based on a chemical ontology. In: Artificial Intelligence Methods and Tools for Systems Biology, vol. 5, pp. 1–18 (2005)
[6] Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F. (eds.): The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press (2003)
[7] Carpenter, B.: The Logic of Typed Feature Structures. Cambridge Tracts in Theoretical Computer Science, vol. 32. Cambridge University Press (1992)
[8] Dietterich, T., Domingos, P., Getoor, L., Muggleton, S., Tadepalli, P.: Structured machine learning: the next ten years. Machine Learning, pp. 3–23 (2008)
[9] Gent, I., Petrie, K.E., Puget, J.F.: Symmetry in constraint programming. In: Rossi, F., van Beek, P., Walsh, T. (eds.) Handbook of Constraint Programming (2006)
[10] Kuchcinski, K.: Constraint-driven scheduling and resource assignment. ACM Transactions on Design Automation of Electronic Systems 8, 355–383 (2003)
[11] Larson, J., Michalski, R.S.: Inductive inference of VL decision rules. SIGART Bull. (63), 38–44 (1977)
[12] Lavrač, N., Džeroski, S.: Inductive Logic Programming: Techniques and Applications. Ellis Horwood (1994)
[13] Maloberti, J., Sebag, M.: Theta-subsumption in a constraint satisfaction perspective. In: Proc. 11th ILP, pp. 164–178 (2001)
[14] Ontañón, S., Plaza, E.: On similarity measures based on a refinement lattice. In: Proc. 8th ICCBR (2009)
[15] Ontañón, S., Plaza, E.: Similarity measures over refinement graphs. Machine Learning Journal 87, 57–92 (2012)
[16] Plaza, E.: Cases as terms: A feature term approach to the structured representation of cases. In: Proc. 1st ICCBR, pp. 265–276. No. 1010 in LNAI (1995)
[17] Puget, J.F.: Breaking symmetries in all-different problems. In: IJCAI, pp. 272–277 (2005)