Using Computational Learning Strategies as a Tool for Combinatorial Optimization

Andreas Birkendorf and Hans Ulrich Simon
Lehrstuhl Informatik II, Universität Dortmund, 44221 Dortmund, Germany
E-mail: birkendo,[email protected]

Abstract

In this paper, we describe how a basic strategy from computational learning theory can be used to attack a class of NP-hard combinatorial optimization problems. It turns out that the learning strategy can be used as an iterative booster: given a solution to the combinatorial problem, we start an efficient simulation of a learning algorithm which has a "good chance" to output an improved solution. This boosting technique is a new and surprisingly simple application of an existing learning strategy. It yields a novel heuristic approach to attack NP-hard optimization problems. It does not apply to every combinatorial problem, but we are able to formalize exactly some sufficient conditions. The new technique applies, for instance, to the problem of minimizing a deterministic finite automaton relative to a given domain, the analogous problem for ordered binary decision diagrams, and to graph coloring.

1 Introduction

In this paper, we describe how a basic strategy from computational learning theory can be used to attack a class of NP-hard combinatorial optimization problems. It turns out that the learning strategy can be used as an iterative booster: given a solution to the combinatorial problem, we start an efficient simulation of a learning algorithm which has a "good chance" to output an improved solution. This boosting technique is a new and surprisingly simple application of an existing learning strategy. It yields a novel heuristic approach to attack NP-hard optimization problems.

A basic building block in this paper is the relation between two concrete problems. The first one deals with the minimization of deterministic finite automata (DFA's) relative to a given domain D. The second one is the problem of learning DFA's with a minimum adequate teacher. These two problems are formally


described in Section 2. The second problem was solved by Angluin (see [3]), who designed an efficient learning algorithm which was later improved by Rivest and Schapire in [20]. We will show that this learning algorithm can be used as an iterative booster for Problem 1. The booster is, however, not as specialized to Problem 1 as it seems at first glance. In fact, it is possible to address many other combinatorial problems with basically the same technique. There are two main sources for possible generalizations:

1. A learning algorithm can be used as a booster for a combinatorial problem whenever some sufficient conditions are satisfied. This immediately leads to a generalization of the problems concerning DFA's to problems which are structurally similar but deal with another representation class R instead of DFA's. We use the notation MinRep(R) to refer to the general minimization problem for a given class R of representations. R can represent, for instance, a class of automata, grammars, or boolean circuits. (Of course, learning algorithms which satisfy the sufficient conditions for the boosting technique are not available for every class.) A second class R for which boosting works as well as for DFA's is the class of ordered binary decision diagrams (formally described in Section 2).

2. Approximation preserving reductions allow a good approximation algorithm for one optimization problem to be converted into a good approximation algorithm for a second optimization problem. Thus, with our boosting technique, we can address all optimization problems which have an approximation preserving reduction to MinRep(R) for some R which already has a booster. For instance, the famous graph coloring problem has an approximation preserving reduction to MinRep(DFA) (see [22]). Many practical scheduling problems can be encoded as graph coloring (conflict-free allocation of resources to jobs).

For these reasons, we believe that we have found an interesting link between computational learning theory and combinatorial optimization which opens a wide area of applications and leads to challenging problems in learning theory (e.g., the design of more learning algorithms which can be used as boosters for combinatorial problems).

This paper is structured as follows. Section 2 formally introduces the problems MinRep(DFA) and MinRep(OBDD), respectively, and the model of exact learning with a minimum adequate teacher. Section 3 explains the boosting technique in terms of a general representation class R. Section 4 discusses the efficiency of a central procedure used by the boosting technique. Section 5 presents some experimental results. The paper closes by mentioning some possible directions of future work.


2 Notations and Definitions

In this section, we first define the problems MinRep(DFA) and MinRep(OBDD). Then we briefly describe the model of exact learning with a minimum adequate teacher. Finally, we recall some important facts about the problems MinRep(DFA) and MinRep(OBDD).

Let M be a DFA given by its state diagram as shown in Figure 1. Let L(M) denote the language of words accepted by M, where, as usual, we say that M accepts a word w if the path which starts at the initial state and follows the edges labeled with the letters of w ends at a final accepting state. Two DFA's are called equivalent if they accept the same language. It is well known (see [12, 1]) that the minimization of M, i.e., the problem of finding an equivalent DFA M' whose state diagram has a minimal number of states, can be solved in time $O(n \log n)$. The following slight modification of this problem is, however, NP-hard:

MinRep(DFA) Given two DFA's $M, M_d$, find the smallest DFA M' which satisfies $L(M') \cap L(M_d) = L(M) \cap L(M_d)$.

Here, DFA $M_d$ represents a partial domain $D = L(M_d)$. M' has to coincide with M inside D, but may have an arbitrary behavior on words outside D.
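To make the word problem for DFA's concrete (it reappears as the procedure MEMBER in Section 3), here is a minimal sketch of DFA acceptance; the dictionary encoding of the automaton is a hypothetical choice for illustration only:

```python
def accepts(dfa, word):
    """Decide whether the DFA accepts the given word by following
    the transition function letter by letter."""
    state = dfa['start']
    for letter in word:
        state = dfa['delta'][(state, letter)]
    return state in dfa['final']

# The DFA of Figure 1; the figure omits a rejecting trap state,
# so the transition (s1, 'a') is left out here as well.
M = {
    'start': 's0',
    'final': {'s2'},
    'delta': {
        ('s0', 'a'): 's1', ('s0', 'b'): 's2',
        ('s1', 'b'): 's2',
        ('s2', 'a'): 's0', ('s2', 'b'): 's0',
    },
}
assert accepts(M, 'b') and accepts(M, 'ab') and accepts(M, 'bab')
assert not accepts(M, 'a')
```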

Figure 1: A DFA which accepts the language $\{b, ab\}(\{a, b\}\{b, ab\})^*$. The initial state is $s_0$. The set of final accepting states is $\{s_2\}$.

An ordered binary decision diagram (OBDD) is given by a directed acyclic graph B as shown in Figure 2. B must satisfy the following conditions:

1. B has exactly one source.

2. B has two sinks, denoted by zerosink and onesink, respectively.

3. Each nonsink v is labeled with a boolean variable $x_i$ and has two children, denoted by zero(v) and one(v), respectively.

4. If P is a path from the source to a sink and label $x_i$ precedes label $x_j$ on P, then $i < j$, i.e., the node labels appear in the natural order on each path in B (possibly after a clever renumbering of the variables; this paper is not concerned with methods to find the best ordering of the variables).

Figure 2: An OBDD which computes the boolean function $x_1 x_2 \vee (x_1 \oplus x_3)$, where $\oplus$ denotes addition modulo 2.

An OBDD B represents a boolean function $f_B$ which can be evaluated at a boolean vector $a = (a_1, \ldots, a_m)$ by routing a through B as follows. Vector a enters B at the source. At each nonsink v labeled $x_i$, vector a is routed to the child zero(v) if $a_i = 0$, and to one(v) otherwise. If a ends at the zerosink, then $f_B(a) = 0$; otherwise $f_B(a) = 1$.
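A minimal sketch of this routing, under an assumed encoding in which each nonsink is a triple (i, zero-child, one-child) and the sinks are the booleans False (zerosink) and True (onesink); the node layout mirrors one consistent reading of Figure 2:

```python
def evaluate(node, a):
    """Route the boolean vector a = (a_1, ..., a_m) through an OBDD.

    A nonsink is assumed to be a triple (i, zero_child, one_child),
    where i is the index of the tested variable x_i; the two sinks are
    represented by the booleans False (zerosink) and True (onesink).
    """
    while not isinstance(node, bool):
        i, zero_child, one_child = node
        node = one_child if a[i - 1] else zero_child  # a_i decides the route
    return node

# One consistent encoding of the OBDD of Figure 2 (f = x1*x2 v (x1 XOR x3)):
x3_low  = (3, False, True)    # reached when x1 = 0: f = x3
x3_high = (3, True, False)    # reached when x1 = 1, x2 = 0: f = NOT x3
x2_node = (2, x3_high, True)  # reached when x1 = 1
root    = (1, x3_low, x2_node)

for a1 in (0, 1):
    for a2 in (0, 1):
        for a3 in (0, 1):
            assert evaluate(root, (a1, a2, a3)) == bool((a1 and a2) or (a1 ^ a3))
```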


The problem MinRep(OBDD) is defined in analogy to MinRep(DFA).

In the model of exact learning with a minimum adequate teacher, the learner has to infer a hidden target concept $L \subseteq \Sigma^*$, where L is an arbitrary but fixed member of a class C of target concepts (e.g., the class of languages accepted by DFA's). In order to gather information about L, the learner has access to two oracles, EQU and MEMB (represented by the teacher), which answer equivalence and membership queries, respectively. A query of the form EQU(h), where h represents a language L(h) which is the current hypothesis of the learner, is answered by "YES" if L(h) = L. If not, EQU presents a witness of inequivalence, i.e., a word w within the symmetric difference of L(h) and L. A query of the form MEMB(w) is answered by "YES" if $w \in L$, and by "NO" otherwise. The goal of the learner is to infer L exactly (up to equivalence of representations).

As mentioned in the introduction, DFA's are efficiently learnable with a minimum adequate teacher, for instance by the algorithm of Rivest and Schapire from [20]. Recently, this algorithm was efficiently adapted to OBDD's (see [9]), which are therefore also efficiently learnable with a minimum adequate teacher. Since we plan to use the learning algorithms for DFA's and OBDD's as a tool for the problems MinRep(DFA) and MinRep(OBDD), it is worthwhile to make some comments about the importance and the hardness of the two latter problems. OBDD's were introduced by Bryant in [5], based on earlier investigations by Lee ([16]) and Akers ([2]). They play an important role as a data structure for boolean functions. They are used, e.g., in the logical synthesis process, for verification and test pattern generation, and as part of CAD tools (see [6]). MinRep(DFA) and MinRep(OBDD) are important problems for the following reasons:

• It is certainly efficient to keep the DFA's and OBDD's which are supposed to be manipulated many times as small as possible. This will not only decrease the amount of space, but also the amount of time consumed by the manipulation procedures. It also often happens in practice that the inputs fed into DFA's or OBDD's have a specific syntactic structure, i.e., they belong to a restricted domain. A clever procedure for minimization will have to take this into account.

• Both problems have applications in computational learning theory, even when they are restricted to the special case that D is given by a finite set of words or boolean vectors, respectively. (This is a subproblem because a finite set can always be succinctly encoded as a DFA or OBDD.) We denote these subproblems by MinRep'(DFA) and MinRep'(OBDD), respectively. Then DFA's (or OBDD's, respectively) are learnable within the PAC learning framework by an Occam algorithm if MinRep'(DFA) (or MinRep'(OBDD), respectively) can be solved by a "reasonably good" approximation algorithm. See [19] for a more thorough discussion of this. We will come back to this application in Section 6.

• Using the technique of approximation preserving reductions, it can be shown that good approximation algorithms for MinRep'(DFA) or MinRep'(OBDD) can efficiently be converted into good approximation algorithms for a number of other problems, including the famous graph coloring problem (see [22, 8]).

Since the problems MinRep'(DFA) and MinRep'(OBDD) are NP-hard (see [10] and [24]), it is reasonable to look for an approximation algorithm which runs in polynomial time and always presents a solution which is close to optimal. However, even the construction of approximate solutions with a very modest definition of "close to optimal" is NP-hard. (This follows from results in [19, 22, 8, 17].)


Of course, these hardness results for the subproblems are a fortiori valid for the unrestricted problems MinRep(DFA) and MinRep(OBDD). It is therefore unlikely that we will find an approximation algorithm with a good mathematical guarantee for the performance ratio. This does, however, not rule out the possibility of designing clever heuristics which work well in practice. This is the line of attack pursued in the following sections.

3 A Boosting Technique for Finding Short Representations

The boosting technique described in this section is not only applicable to MinRep(DFA) and MinRep(OBDD). In order to clarify the generality of this technique, it is developed for an arbitrary class of representations. Let $\Gamma$ and $\Sigma$ be two alphabets. A representation over $\Gamma$ of a class of languages over $\Sigma$ is given by

• a class $R \subseteq \Gamma^*$,
• a mapping L which associates a language $L(r) \subseteq \Sigma^*$ with each string $r \in R$,
• a mapping $|\cdot|$ which associates with each $r \in R$ its size $|r|$.

For the sake of concreteness, the reader may think of R as a set of strings which represent natural encodings of state diagrams for DFA's or OBDD's. Mapping L associates a language with these strings in the obvious way (words accepted by the DFA, boolean vectors routed to the onesink within the OBDD). The size of a DFA (or OBDD, respectively) is the number of its states (or nodes, respectively). These two concrete cases are discussed in more detail in Sections 4 and 5. The problem of finding shortest representations in R is formally given as follows:

MinRep(R) Given $r, r_d \in R$, find the shortest string r' which satisfies

$L(r') \cap L(r_d) = L(r) \cap L(r_d)$.

String $r_d$ represents a domain $D = L(r_d) \subseteq \Sigma^*$. The language L(r') must coincide with L(r) within D, but may be arbitrary outside D. The significance and hardness of this problem for DFA's and OBDD's was already discussed in Section 2. Since the MinRep(R) problem is very hard to solve for $R \in \{\mathrm{DFA}, \mathrm{OBDD}\}$, we will already be satisfied if we are able to design a heuristic BOOST with the following properties:


1. Let r' denote the output of BOOST if the input is $r, r_d$. Then r' is a legal solution, i.e., $L(r') \cap L(r_d) = L(r) \cap L(r_d)$.

2. Experiments indicate that r' is often considerably shorter than r.

3. It never happens that r' is longer than r.

4. BOOST can be applied iteratively. Let $r_{i+1}$ denote the output of BOOST if the input is $r_i, r_d$. Then iterative runs of BOOST lead to a sequence $r, r_0, r_1, r_2, \ldots$ of legal solutions which satisfies $|r| \geq |r_0| \geq |r_1| \geq |r_2| \geq \cdots$.

5. Experiments indicate that often several improvements take place until the process becomes stationary.

6. Experiments indicate that the improvements are significant even if the initial representation r is already the result of another carefully designed heuristic for MinRep(R).

The purpose of this section is to demonstrate that a heuristic BOOST with these properties can be designed whenever the following tools are available:

MEMBER A procedure which solves the word problem: given $r \in R$ and $w \in \Sigma^*$,

decide whether $w \in L(r)$. For DFA's or OBDD's, the standard procedures solve this problem in linear time.

INTERSECT A procedure which solves the following synthesis problem: given $r_1, r_2 \in R$, compute a string $r \in R$ which satisfies $L(r) = L(r_1) \cap L(r_2)$. For DFA's and OBDD's, the classical "product construction" solves this problem in time $O(|r_1| \cdot |r_2|)$.

WITNESS A procedure which solves the equivalence problem: given $r_1, r_2 \in R$, output "YES" if $L(r_1) = L(r_2)$, and a witness

$w \in (L(r_1) \setminus L(r_2)) \cup (L(r_2) \setminus L(r_1))$ of inequivalence otherwise. It is important to insert some amount of randomness into this procedure (using the fact that, in general, there are different witnesses of inequivalence). Randomness avoids an early stationary behavior when BOOST is applied iteratively. WITNESS procedures for DFA's and OBDD's are discussed in Section 4.

LEARN A procedure which learns a shortest representation of a hidden target concept L(r), where $r \in R$, with a minimum adequate teacher, i.e., LEARN has access to oracles MEMB and EQU as described in Section 2.


Furthermore, we demand the "monotonicity property" $|r_{h_i}| \geq |r_{h_{i-1}}|$ regarding the size of the i-th hypothesis $r_{h_i}$. Hence, the complexity of the hypotheses rises monotonically. A learning procedure of this kind was designed for DFA's by Angluin in [3], later improved by Rivest and Schapire in [20], and adapted to OBDD's by Gavalda and Guijarro in [9].

Combining these tools, we obtain the following heuristic BOOST for the MinRep problem with input $r, r_d$ (a sketch in code appears at the end of this section): first, compute the representation $\hat{r}$ for $L(r) \cap L(r_d)$ by applying the procedure INTERSECT to input $r, r_d$. Second, run procedure LEARN with oracles MEMB and EQU which are simulated as follows:

MEMB A call MEMB(w) is answered by applying the procedure MEMBER to input w, r. This is a legal answer for hidden target concept L(r).

EQU A call EQU($r_h$) is answered by first computing the representation $\hat{r}_h$ for $L(r_h) \cap L(r_d)$ (for this, apply procedure INTERSECT to input $r_h, r_d$), and then applying procedure WITNESS to input $\hat{r}_h, \hat{r}$. This is a legal answer for hidden target concept L(r), except that the answer "YES" is already produced as soon as $r_h$ and r coincide on the restricted domain D given by $r_d$.

In this way, we keep the simulation of LEARN running until an equivalence query is answered "YES". Let r' denote the current hypothesis at that point. Note that r' is a legal solution for input $r, r_d$, i.e., $L(r') \cap D = L(r) \cap D$. Although the size of the shortest string $r^*$ satisfying $L(r^*) \cap D = L(r) \cap D$ is likely to be smaller than the size of r', we are able to bound |r'| from above as follows:

Theorem 3.1 Let $D_{\mathrm{MEMB}}$ be the set of words which were presented to oracle MEMB during the simulation of procedure LEARN, $D^+ = D \cup D_{\mathrm{MEMB}}$, and $r^+$ a shortest string satisfying $L(r^+) \cap D^+ = L(r) \cap D^+$. Then

$$|r'| \leq |r^+|.$$

Proof We correctly simulate LEARN for hidden target concept L(r) until the current hypothesis r' satisfies $L(r') \cap D = L(r) \cap D$. Let $D_{\mathrm{EQU}}$ denote the set of witnesses of inequivalence that were returned by oracle EQU during the simulation. Note that $D_{\mathrm{EQU}} \subseteq D$. The crucial observation is that LEARN cannot distinguish hidden target concepts that coincide on

$$D_{\mathrm{EQU}} \cup D_{\mathrm{MEMB}} \subseteq D \cup D_{\mathrm{MEMB}} = D^+.$$


Thus, if $r^+$ is a (shortest) string satisfying $L(r^+) \cap D^+ = L(r) \cap D^+$, we may think of $r^+$ instead of r as the hidden target concept. Since we stop the simulation when r' coincides with the target concept on D, L(r') may be different from $L(r^+)$. We could, however, conceptually continue the simulation until LEARN has exactly identified $L(r^+)$, say with hypothesis r''. Since LEARN always finds a shortest representation of the target concept, it follows that $|r''| \leq |r^+|$. By the monotonicity property, we obtain $|r'| \leq |r''|$, and thus $|r'| \leq |r^+|$. □

Because the definition of $r^+$ implies that $|r^+| \leq |r|$, we obtain the following

Corollary 3.2

$$|r'| \leq |r|.$$

Corollary 3.2 shows that an iterative application of BOOST leads to iterative improvements $|r| \geq |r_0| \geq |r_1| \geq |r_2| \geq \cdots$, but does not rule out that the very first step in the iteration is stationary, i.e., it can happen that $|r_0| = |r|$ (for instance, if r is already optimal). In practice, however, BOOST leads to considerable improvements. In Section 5, we will give empirical evidence for this. There is also some intuitive support from Theorem 3.1: if $D^+ = D \cup D_{\mathrm{MEMB}}$ is only a "slight" extension of D, the shortest string $r^+$ satisfying $L(r^+) \cap D^+ = L(r) \cap D^+$ is (hopefully) not much longer than the shortest string $r^*$ satisfying $L(r^*) \cap D = L(r) \cap D$. Since $|r^*| \leq |r'| \leq |r^+|$, the solution r' is then "close" to optimal. This reasoning becomes the more convincing the fewer membership queries are issued by LEARN. We expect the difference between $|r^*|$ and $|r^+|$ to be large in the worst case, but small in many practical situations.
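To summarize the interplay of the four tools, here is a minimal sketch of one run of BOOST and of its iteration; MEMBER, INTERSECT, WITNESS, and LEARN are assumed to be supplied as callables for the representation class at hand, with the (hypothetical) convention that WITNESS returns None in place of the answer "YES":

```python
def boost(r, r_d, member, intersect, witness, learn):
    """One run of BOOST on input r, r_d (Section 3).

    member(w, r)      -- decides whether w is in L(r)
    intersect(r1, r2) -- returns a representation of L(r1) & L(r2)
    witness(r1, r2)   -- None if L(r1) = L(r2), else a witness word
    learn(memb, equ)  -- the learning procedure; it queries the two
                         oracles and returns its final hypothesis
    """
    r_hat = intersect(r, r_d)            # represents L(r) restricted to D

    def memb(w):
        return member(w, r)              # legal answer for target L(r)

    def equ(r_h):
        r_h_hat = intersect(r_h, r_d)
        return witness(r_h_hat, r_hat)   # None as soon as r_h and r
                                         # coincide on the domain D
    return learn(memb, equ)

def boost_iteratively(r, r_d, size, **tools):
    """Iterate BOOST until the size becomes stationary; `size` is a
    hypothetical size function |.|.  This simple stopping rule quits at
    the first non-improvement; Section 5 uses a patience threshold
    instead, to give the randomized WITNESS further chances."""
    while True:
        r_next = boost(r, r_d, **tools)
        if size(r_next) >= size(r):
            return r
        r = r_next
```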

4 Efficient Construction of a Witness of Inequivalence

This section is concerned with implementation details to keep the running time of BOOST reasonably small. The reader not interested in these details may pass directly to Section 5. The heart of the heuristic BOOST is the simulation of the procedure LEARN. We want to keep the amount of time needed to answer all equivalence and membership queries asked by LEARN within reasonable bounds. The following parameters are relevant for this purpose:

• the size n of representation r
• the size $n_d$ of representation $r_d$
• the length m of the longest witness of inequivalence ever returned by the EQU oracle


• the number $q_{EQ}$ of equivalence queries asked by LEARN
• the number $q_{MB}$ of membership queries asked by LEARN
• the time $T_{EQ}$ needed to answer an equivalence query
• the time $T_{MB}$ needed to answer a membership query

The parameters $q_{EQ}, q_{MB}, T_{EQ}, T_{MB}$ are, in general, functions of $n, n_d, m$. Denoting the total time needed to simulate all queries for representation class R by $Q_R(n, n_d, m)$, we obtain the following general equation:

$$Q_R(n, n_d, m) = q_{EQ}(n, n_d, m) \cdot T_{EQ}(n, n_d, m) + q_{MB}(n, n_d, m) \cdot T_{MB}(n, n_d, m). \qquad (1)$$

In the special case of OBDD's, the parameter m is simply the total number of boolean variables. The procedure LEARN used by Gavalda and Guijarro in [9] for learning OBDD's asks at most n equivalence queries and at most $O(n^2 m + n m \log m)$ membership queries. Using the obvious simulation of the oracle MEMB, a membership query is answered in O(m) steps. It will follow from Corollary 4.2 that $O(n n_d + m)$ steps are sufficient to answer an equivalence query. Thus, Equation (1) specialized to OBDD's reads as follows:

$$Q_{\mathrm{OBDD}}(n, n_d, m) = O(n^2 n_d + n^2 m^2 + n m^2 \log m). \qquad (2)$$
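As a quick sanity check, Equation (2) follows from Equation (1) by substituting the bounds just stated:

$$Q_{\mathrm{OBDD}}(n, n_d, m) \leq \underbrace{n}_{q_{EQ}} \cdot \underbrace{O(n n_d + m)}_{T_{EQ}} + \underbrace{O(n^2 m + n m \log m)}_{q_{MB}} \cdot \underbrace{O(m)}_{T_{MB}} = O(n^2 n_d + n m + n^2 m^2 + n m^2 \log m),$$

where the term $n m$ is absorbed by $n^2 m^2$.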

In the special case of DFA's, it will follow from Corollary 4.4 that the EQU oracle can be simulated such that it returns witnesses of inequivalence of length at most $m = 2 n n_d - 2$. Furthermore, each such witness can be retrieved in $O(n n_d \log(n n_d))$ steps. Using the obvious simulation of the oracle MEMB, a membership query is answered in O(n) steps. The procedure LEARN used by Rivest and Schapire in [20] for learning DFA's asks at most n equivalence queries and at most $O(n^2 + n \log m)$ membership queries. Thus, Equation (1) specialized to DFA's reads as follows:

$$Q_{\mathrm{DFA}}(n, n_d) = O(n^2 n_d \log(n n_d) + n^3). \qquad (3)$$

The simulation of oracle MEMB is straightforward for OBDD's and DFA's, respectively. The simulation of the oracle EQU is more interesting. According to the description in Section 3, one first applies procedure INTERSECT to $r_h, r_d$, which returns in $O(n n_d)$ steps the "product automaton" (or "product OBDD", respectively), denoted by $\hat{r}_h$. In the same way, $\hat{r}$ was obtained from $r, r_d$. Then the procedure WITNESS is applied to $\hat{r}_h, \hat{r}$. The sizes of $\hat{r}$ and $\hat{r}_h$, respectively, are bounded by $n n_d$. An inefficient implementation of WITNESS (e.g., quadratic time) would seriously limit the applicability of BOOST. There are standard procedures which perform the equivalence test for two DFA's (or two OBDD's, respectively).


But to the best of our knowledge, these procedures do not retrieve witnesses of inequivalence. Remember also that we wanted to randomize procedure WITNESS in order to apply BOOST iteratively. We therefore decided to design this central procedure ourselves; the result is presented in the remainder of this section.

We first describe the procedure WITNESS for OBDD's. We will make use of the following facts from [5, 21].

• For each OBDD B there exists a reduced OBDD $B_{\mathrm{red}}$ which computes the same function as B with a minimum number of nodes. $B_{\mathrm{red}}$ is uniquely determined by B up to isomorphism (renaming of nodes) and can be computed in linear time. In $B_{\mathrm{red}}$, each node v labeled $x_i$ represents a function $f_v(x_1, \ldots, x_i, \ldots, x_m)$ which essentially depends on $x_i$ but not on $x_1, \ldots, x_{i-1}$. This implies that the two children of v must be different. Furthermore, two different nodes $v_1, v_2$ always represent different functions, i.e., there must exist a string $w = a_1 \cdots a_m$ such that $f_{v_1}(w) \neq f_{v_2}(w)$.

• The linear-time reduction procedure can also be applied to two (or more) OBDD's $B_1, B_2$ and delivers a common reduction $B_{\mathrm{red}}$. The properties of $B_{\mathrm{red}}$ concerning the functions $f_v$ are still valid. In particular, let $s_1, s_2$ denote the nodes in $B_{\mathrm{red}}$ which represent the functions $f_1$ and $f_2$ computed by $B_1$ and $B_2$, respectively. It follows that $B_1$ and $B_2$ are equivalent, i.e., they compute the same function, iff $s_1 = s_2$. The reduction procedure can be designed such as to output not only $B_{\mathrm{red}}$, but also pointers to $s_1$ and $s_2$.

It is easy to design a procedure WITNESS which runs in time O(m) when $B_{\mathrm{red}}$ and $s_1, s_2$ are given as input. Of course, if $s_1 = s_2$, WITNESS outputs "YES". We assume that $s_1 \neq s_2$. WITNESS will maintain tuples of the form $[v_1, v_2, a_1 \cdots a_{k-1}, k]$, where $v_1 \neq v_2$, with the following interpretation: the computation in $B_{\mathrm{red}}$ which starts in $s_1$ (or $s_2$, respectively) and proceeds along the path corresponding to $a_1 \cdots a_{k-1}$ ends in node $v_1$ (or $v_2$, respectively). WITNESS starts with the tuple $[s_1, s_2, \varepsilon, 1]$, where $\varepsilon$ is the empty string. At the end, WITNESS will have constructed a tuple of the form $[z_1, z_2, a_1 \cdots a_m, m+1]$, where $z_1 \neq z_2$. The interpretation of tuples implies that $\{z_1, z_2\} = \{\text{zerosink}, \text{onesink}\}$ and $a_1 \cdots a_m$ is a witness of inequivalence of $B_1$ and $B_2$. We explain inductively how the next tuple is constructed when the current tuple is $[v_1, v_2, a_1 \cdots a_{k-1}, k]$, where $v_1 \neq v_2$:


We say that a node v in $B_{\mathrm{red}}$ has level $l(v) = i$ if v is labeled $x_i$. If v is one of the sinks, we set $l(v) = m + 1$. Let $l_1 = l(v_1)$, $l_2 = l(v_2)$, $l = \min\{l_1, l_2\}$, and $w = a_1 \cdots a_{k-1}$. We proceed according to the following case analysis:

Case 1: $k < l$. Generate a string $w' = a_k \cdots a_{l-1}$ consisting of $l - k$ random bits, and proceed with tuple $[v_1, v_2, ww', l]$.

Case 2: $k = l_1 < l_2$.

Subcase 2a: $\mathrm{zero}(v_1) = v_2$. It follows that $\mathrm{one}(v_1) \neq v_2$. Proceed with tuple $[\mathrm{one}(v_1), v_2, w1, k+1]$.

Subcase 2b: $\mathrm{one}(v_1) = v_2$. It follows that $\mathrm{zero}(v_1) \neq v_2$. Proceed with tuple $[\mathrm{zero}(v_1), v_2, w0, k+1]$.

Subcase 2c: $\mathrm{zero}(v_1) \neq v_2$ and $\mathrm{one}(v_1) \neq v_2$. Generate a random bit b, and set $v_1'$ to the appropriate child of $v_1$, i.e., to $\mathrm{zero}(v_1)$ if $b = 0$ and to $\mathrm{one}(v_1)$ if $b = 1$. Proceed with tuple $[v_1', v_2, wb, k+1]$.

Case 3: $k = l_2 < l_1$. This case is symmetric to Case 2. We omit the description.

Case 4: $k = l_1 = l_2$.

Subcase 4a: $\mathrm{zero}(v_1) = \mathrm{zero}(v_2)$. It follows that $\mathrm{one}(v_1) \neq \mathrm{one}(v_2)$. Proceed with tuple $[\mathrm{one}(v_1), \mathrm{one}(v_2), w1, k+1]$.

Subcase 4b: $\mathrm{one}(v_1) = \mathrm{one}(v_2)$. It follows that $\mathrm{zero}(v_1) \neq \mathrm{zero}(v_2)$. Proceed with tuple $[\mathrm{zero}(v_1), \mathrm{zero}(v_2), w0, k+1]$.

Subcase 4c: $\mathrm{zero}(v_1) \neq \mathrm{zero}(v_2)$ and $\mathrm{one}(v_1) \neq \mathrm{one}(v_2)$. Generate a random bit b and set $v_1', v_2'$ to the appropriate children of $v_1, v_2$, respectively, i.e., to the respective zero-children if $b = 0$, and to the respective one-children if $b = 1$. Proceed with tuple $[v_1', v_2', wb, k+1]$.

Figure 3 shows a sample run of this procedure. It is easy to verify that, in any case, the next tuple has the desired interpretation again. Thus, after at most m iterations, this procedure comes up with a witness of inequivalence. Each iteration takes only constant time when implemented properly (Case 1 takes $l - k$ steps, but these steps telescope to at most m over the whole run). We summarize the result.

Theorem 4.1 Let $B_{\mathrm{red}}$ be a reduced OBDD and m the number of boolean variables. Let $s_1, s_2$ be two different nodes in $B_{\mathrm{red}}$ and $f_1, f_2$ the functions represented by $s_1, s_2$, respectively. One can find in O(m) steps a word w such that $f_1(w) \neq f_2(w)$.


Figure 3: A sample run of procedure WITNESS for OBDD's with output 1001110. The current procedure tuples and the selected bits b chosen in the corresponding subcases are recorded: the run passes through $[s_1, s_2, \varepsilon, 1]$, $[s_1, s_2, 1, 2]$, $[v_1, s_2, 10, 3]$, $[v_1, v_2, 100, 4]$, $[v_5, v_3, 1001, 5]$, $[v_5, v_6, 10011, 6]$, $[v_5, v_6, 100111, 7]$, and $[\text{zerosink}, \text{onesink}, 1001110, 8]$.
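A direct transcription of the case analysis above, as a sketch; the accessors level, zero, and one are assumed to be provided by the reduction procedure, and the random bits realize the randomization discussed in Section 3:

```python
import random

def obdd_witness(s1, s2, m, level, zero, one):
    """Retrieve a witness of inequivalence of two distinct nodes s1, s2
    of a reduced (common) OBDD with m boolean variables.

    level(v)         -- index i of the variable x_i tested at v (m+1 for sinks)
    zero(v), one(v)  -- the two children of a nonsink v
    Returns the bit vector a_1 ... a_m as a list of 0/1.
    """
    v1, v2, w, k = s1, s2, [], 1
    while k <= m:
        l1, l2 = level(v1), level(v2)
        l = min(l1, l2)
        if k < l:                                   # Case 1: pad with random bits
            w += [random.randint(0, 1) for _ in range(l - k)]
            k = l
            continue
        if l1 < l2:                                 # Case 2: only v1 tests x_k
            if zero(v1) == v2:
                v1, b = one(v1), 1                  # Subcase 2a
            elif one(v1) == v2:
                v1, b = zero(v1), 0                 # Subcase 2b
            else:                                   # Subcase 2c
                b = random.randint(0, 1)
                v1 = one(v1) if b else zero(v1)
        elif l2 < l1:                               # Case 3: symmetric to Case 2
            if zero(v2) == v1:
                v2, b = one(v2), 1
            elif one(v2) == v1:
                v2, b = zero(v2), 0
            else:
                b = random.randint(0, 1)
                v2 = one(v2) if b else zero(v2)
        else:                                       # Case 4: both test x_k
            if zero(v1) == zero(v2):
                v1, v2, b = one(v1), one(v2), 1     # Subcase 4a
            elif one(v1) == one(v2):
                v1, v2, b = zero(v1), zero(v2), 0   # Subcase 4b
            else:                                   # Subcase 4c
                b = random.randint(0, 1)
                v1, v2 = (one(v1), one(v2)) if b else (zero(v1), zero(v2))
        w.append(b)
        k += 1
    return w   # a witness: f_{s1} and f_{s2} differ on this vector
```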


Corollary 4.2 There exists a procedure WITNESS for OBDD's with time bound $O(n_1 + n_2 + m)$.

Time $O(n_1 + n_2)$ is spent on computing the common reduction and time O(m) on the retrieval of the witness when $B_{\mathrm{red}}$ is given. In our application, $n_1$ and $n_2$ are bounded above by $n n_d$.

The design of a procedure WITNESS for DFA's is more subtle. Of course, the mere equivalence test can be done with a classical algorithm due to Hopcroft and Karp (see [12, 1]) which uses the UNION-FIND data structure and runs in almost linear time. We were, however, not able to retrieve a witness when running this procedure. Our method for retrieving a witness makes use of another classical algorithm due to Hopcroft (see [11]) which minimizes a given DFA in time $O(n \log n)$. In the remainder of this section, we proceed as follows. We first reduce the equivalence test for two DFA's to an equivalence test for two states. Then we briefly recall Hopcroft's algorithm. Finally, we augment it with a new data structure, called DFA-SPLITTING-TREE, which allows a convenient retrieval of a witness of inequivalence of two given inequivalent states. This leads to an overall running time of $O(n \log n)$.

Let M be a DFA (possibly without a distinguished initial state). For all states s and all words $w \in \Sigma^*$, we define $s \cdot w$ as the state that is reached after M has scanned w, given that M was started in state s. Let S denote the set of states and $F \subseteq S$ the set of final accepting states. We say that states $s_1, s_2 \in S$ are M-equivalent if $s_1 \cdot w \in F \Leftrightarrow s_2 \cdot w \in F$ for all $w \in \Sigma^*$. We recall the following easy facts:

• M is easy to minimize if the relation of M-equivalence is known.

• If M results from the combination of the disjoint state diagrams of two DFA's $M_1, M_2$ with initial states $s_1, s_2$, then $M_1$ and $M_2$ are equivalent, i.e., they accept the same language, iff $s_1, s_2$ are M-equivalent.

It suffices therefore to design a procedure WITNESS for the M-inequivalence of two given states of a single DFA M. It will turn out that Hopcroft's procedure, which minimizes M, is very useful for this purpose.

The algorithm A of Hopcroft uses a clever partitioning technique. At any time, the partition is coarser than the M-partition. It starts with the partition of S into F and $S \setminus F$. F is stored in a bucket $B_1$, and $S \setminus F$ in a bucket $B_2$. Let $k_0 \in \{1, 2\}$ be the index of the bucket which contains fewer states. A iteratively refines the current partition until it coincides with the M-partition. The iterative refinement is organized by putting "splitting items" into a QUEUE. Splitting items are pairs


(k, a), where k is an index which addresses a bucket and $a \in \Sigma$. Initially, QUEUE only contains the pairs $(k_0, a)$ for all $a \in \Sigma$. For each splitting item (k, b) which is deleted from QUEUE, the content of each bucket $B_j$ is split into the following two sets:

$$B_j' = \{q \in B_j \mid q \cdot b \in B_k\}, \qquad B_j'' = \{q \in B_j \mid q \cdot b \notin B_k\}.$$

If this splitting is not proper, it can be ignored. If it is proper and r is the current maximal index of a bucket, then $B_j'$ is stored in bucket $B_j$ and $B_j''$ in a new bucket $B_{r+1}$. The crucial observation of Hopcroft was that, at this time, the following holds for all $a \in \Sigma$: it is not necessary to insert both pairs (j, a), (r+1, a) into QUEUE in order to ensure the correctness of the algorithm; one pair is always sufficient. Whenever there is a freedom of choice, one can use the pair which corresponds to the bucket containing fewer states. This observation is used within the analysis of the $O(n \log n)$ running time. As soon as QUEUE becomes empty (which happens when no more proper splittings occur), the final partition coincides with the M-partition. The reader interested in a more detailed description is referred to [11].

We will augment Hopcroft's algorithm A with a new data structure, called DFA-SPLITTING-TREE, which allows an efficient retrieval of a witness of inequivalence of two states $s_1, s_2$. This data structure consists of a binary tree T which, loosely speaking, stores the splitting history of the run of A. Initially, T contains only a root labeled $\varepsilon$ with two children labeled with the index 1 of bucket $B_1$ (which stores F) and the index 2 of bucket $B_2$ (which stores $S \setminus F$), respectively. In general, the leaves of T are labeled with the indices of all current buckets. They represent the current partition of S. The inner nodes represent the splitting history. When A uses splitting item (k, b) to properly split the content of bucket $B_j$ into sets $B_j'$ and $B_j''$, as described above, and then stores these sets in buckets $B_j$ and $B_{r+1}$, the leaf labeled j in T becomes an inner node labeled b (the letter of the splitting item) by creating two children of it labeled j and r+1 (the indices of the two subbuckets after the proper split).

For our purpose, we can stop A as soon as $s_1$ and $s_2$ belong to different buckets (if ever). If A stops with $s_1$ and $s_2$ in the same bucket, then $s_1$ and $s_2$ are M-equivalent, and WITNESS may output "YES". Otherwise, $s_1$ and $s_2$ belong to different buckets. If this happens at the very beginning, with $s_1 \in F$ and $s_2 \in S \setminus F$ or vice versa, the empty string $\varepsilon$ is a witness of inequivalence. Otherwise, a proper split operation was responsible for the separation of the two states into different buckets. This split is represented by an inner node x in T with some label b. Procedure WITNESS uses a UNION-FIND data structure to represent the current partition and proceeds as follows. Calls FIND($s_1$) and FIND($s_2$) will locate the leaves $l_1, l_2$ in T whose buckets contain $s_1$ and $s_2$, respectively. It is easy to determine the


splitting node x, because x is the youngest common ancestor of $l_1, l_2$.¹ Let T(x) denote the subtree of T with root x. We use a bottom-up pass through T(x) to reverse each earlier split operation by performing the corresponding UNION operation. When we reach node x, this node is a leaf again whose bucket contains all states which were stored in T(x) before (in particular $s_1, s_2$). Notation T now refers to the DFA-SPLITTING-TREE with subtree T(x) shrunk to a single node x. Since x is a splitting node for $s_1, s_2$, we know that the states

$s_1' = s_1 \cdot b$ and $s_2' = s_2 \cdot b$ are stored in different leaves of T. The crucial observation is that the witness of inequivalence satisfies the following equality:

$$\mathrm{witness}(s_1, s_2) = b \cdot \mathrm{witness}(s_1', s_2'). \qquad (4)$$

This follows easily from the "associative law"

$$s_1 \cdot (bw) = (s_1 \cdot b) \cdot w = s_1' \cdot w, \qquad s_2 \cdot (bw) = (s_2 \cdot b) \cdot w = s_2' \cdot w,$$

which holds in particular for $w = \mathrm{witness}(s_1', s_2')$. Thus, we can iterate the procedure with $s_1', s_2'$ in the roles of $s_1, s_2$, respectively. The iteration stops when the current splitting node is the root of T. Then, as explained above, the empty word $\varepsilon$ is a witness of inequivalence of the current pair of states. Since Hopcroft's procedure already takes time $O(n \log n)$, we can use any simple implementation of the UNION-FIND data structure with the same time bound. Procedure WITNESS for DFA's can easily be randomized by selecting splitting items from QUEUE in a random fashion.

Finally, note that the length of the output witness of inequivalence is bounded by $|M| - 2 = |M_1| + |M_2| - 2$. This is simply because each iteration increments the length of the witness by one and reverses at least one split operation by performing the corresponding UNION operation. In the beginning, the DFA-SPLITTING-TREE has two leaves (storing F and $S \setminus F$, respectively). At the end, it has at most $|M| = |M_1| + |M_2|$ leaves (each one possibly storing a single state). Thus, at most $|M_1| + |M_2| - 2$ split operations can be reversed. We summarize the result.

Theorem 4.3 Let $s_1, s_2$ be states of a DFA M and $n = |M|$. There exists a procedure with time bound $O(n \log n)$ which either detects that $s_1$ and $s_2$ are M-equivalent or outputs a word w of length at most $n - 2$ such that $s_1 \cdot w \in F$ and $s_2 \cdot w \notin F$, or vice versa.
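The retrieval loop behind Theorem 4.3, as a minimal sketch; the splitting-tree operations (find, youngest common ancestor, shrinking a subtree via UNION operations) are assumed to be provided by the augmented minimization algorithm, so all names here are hypothetical:

```python
def retrieve_witness(s1, s2, tree, delta):
    """Retrieve a witness of M-inequivalence of states s1, s2 from the
    DFA-SPLITTING-TREE, following Recursion (4).

    tree.find(s)            -- leaf of the splitting tree whose bucket holds s
    tree.split_node(l1, l2) -- youngest common ancestor of two leaves
    tree.shrink(x)          -- undo all splits below x (UNION operations),
                               turning x into a leaf again
    delta[(s, b)]           -- the transition function of M (s . b)
    """
    w = []
    while True:
        x = tree.split_node(tree.find(s1), tree.find(s2))
        if x is tree.root:
            return ''.join(w)   # epsilon separates the current pair of states
        b = x.label             # the letter of the splitting item at x
        tree.shrink(x)          # reverse the splits below x
        w.append(b)
        # Recursion (4): witness(s1, s2) = b . witness(s1.b, s2.b)
        s1, s2 = delta[(s1, b)], delta[(s2, b)]
```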

¹ At the moment, x is even the parent of $l_1, l_2$, because we stopped the algorithm immediately after the separation of $s_1, s_2$. The same part of the procedure will, however, later be used for two states $s_1', s_2'$ whose corresponding leaves are not necessarily siblings.


Corollary 4.4 Let $n_1$ and $n_2$ denote the sizes of two given DFA's $M_1$ and $M_2$, respectively. There exists a procedure WITNESS for DFA's with time bound $O((n_1 + n_2) \log(n_1 + n_2))$. The length of the witness of inequivalence constructed by WITNESS is bounded above by $n_1 + n_2 - 2$.

In our application, $n_1$ and $n_2$ are bounded above by $n n_d$. The following example shows that the shortest witness w of the inequivalence of DFA's $M_1, M_2$ may have length $|M_1| + |M_2| - 2$. The upper bound on the length of w guaranteed by procedure WITNESS can therefore not be improved in general.

Example 4.5 The m-potential of a string $w \in \{a, b\}^*$ is inductively defined as follows:

$$\mathrm{pot}_m(\varepsilon) = 0, \qquad \mathrm{pot}_m(wa) = \min\{m, \mathrm{pot}_m(w) + 1\}, \qquad \mathrm{pot}_m(wb) = \max\{0, \mathrm{pot}_m(w) - 1\}.$$

In other words: the m-potential p is initialized to 0. Then w is scanned from left to right; each letter a increments p by 1 and each letter b decrements p by 1, except that no modification leads to values greater than m or smaller than 0. Let M(m) be the minimum DFA for the set of words with m-potential 0. It is easy to see that $|M(m)| = m + 1$, and that $a^{m+1} b^m$ is the shortest witness of inequivalence of M(m) and M(m+1). Figure 4 shows the transition diagrams of M(3) and M(4), respectively.

Figure 4: The transition diagrams of M(3) and M(4) from Example 4.5. Initial states and accepting states are $p_0, q_0$.
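The m-potential and the claimed witness can be checked directly; a minimal executable sketch:

```python
def pot(m, w):
    """The m-potential of a word over {a, b} from Example 4.5."""
    p = 0
    for c in w:
        p = min(m, p + 1) if c == 'a' else max(0, p - 1)
    return p

# a^(m+1) b^m has m-potential 0 (accepted by M(m)) but (m+1)-potential 1
# (rejected by M(m+1)), so it witnesses the inequivalence of the two DFA's.
for m in range(1, 6):
    w = 'a' * (m + 1) + 'b' * m
    assert pot(m, w) == 0 and pot(m + 1, w) == 1
```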


Figure 5 illustrates how the witness aaaabbb of inequivalence is retrieved by procedure WITNESS. First, the DFA-SPLITTING-TREE T is grown in two phases. Phase 1 uses the splitting items (1, b), (2, b), (3, b) to grow and split the rightmost branch of T. Loosely speaking, this phase cuts the diagrams of $M_1$ and $M_2$ into "vertical slices", i.e., the current partition of states is (compare with the figure)

$$\{p_0, q_0\}, \{p_1, q_1\}, \{p_2, q_2\}, \{p_3, q_3\}, \{q_4\}.$$

In Phase 2, a kind of "horizontal slice" is performed which separates the states $p_i$ of $M_1$ from the states $q_i$ of $M_2$. This phase uses the splitting items (5, a), (4, a), (3, a), (2, a) in that order. The last split separates $p_0$ from $q_0$ and shows that M(3) and M(4) are inequivalent. Now the retrieval of the witness of inequivalence is started. Recursion (4) evolves as follows:

$$\begin{aligned}
\mathrm{witness}(p_0, q_0) &= a \cdot \mathrm{witness}(p_1, q_1)\\
\mathrm{witness}(p_1, q_1) &= a \cdot \mathrm{witness}(p_2, q_2)\\
\mathrm{witness}(p_2, q_2) &= a \cdot \mathrm{witness}(p_3, q_3)\\
\mathrm{witness}(p_3, q_3) &= a \cdot \mathrm{witness}(p_3, q_4)\\
\mathrm{witness}(p_3, q_4) &= b \cdot \mathrm{witness}(p_2, q_3)\\
\mathrm{witness}(p_2, q_3) &= b \cdot \mathrm{witness}(p_1, q_2)\\
\mathrm{witness}(p_1, q_2) &= b \cdot \mathrm{witness}(p_0, q_1)\\
\mathrm{witness}(p_0, q_1) &= \varepsilon
\end{aligned}$$

Thus, witness$(p_0, q_0) = aaaabbb$, as expected.

5 Experimental Results

We found the experimental results obtained so far very encouraging, although more experiments will be necessary to get a complete picture. We applied our heuristic BOOST, tailored to work on OBDD's, to "real world" data.² In cooperation with a group of students, we developed a number of strategies that modify the algorithms proposed in [20] and [9] and obtain quite good performance with respect to running times and OBDD reduction rates in our problem setting. At the moment, we achieve running times between 1 second and 150 seconds for one iteration of the booster on a SUN SPARCstation 5 (70 MHz processor, 48 MB RAM), depending on the sizes of the input OBDD's $B, B_d$. The obtained results are partially illustrated in the two figures below. For a complete survey of our experimental results and for more details concerning the modified learning algorithms, we refer the reader to [18].

² Thanks to Siemens for making these data available to us.

Figure 5: The DFA-SPLITTING-TREE of the sample run from Example 4.5. Apart from its states, each bucket contains its index and its splitting letter (if ever split). The current configuration of the QUEUE is given on the right side. Splitting items leading to a proper bucket split are underlined.


Figure 6: Three typical developments of the reduction rates with respect to the number of iterations of the booster. The curves are labeled "H, 4549, 3286, 72.2% (71)", "B, 273, 157, 57.5% (47)", and "F, 15905, 7040, 44.3% (150)".

Figure 6 illustrates the effect of using BOOST iteratively. The three curves show the development of the reduction rates³ of three input OBDD's named B, F, and H after 1, 2, 3, 4, 5, 10, 20, and 50 iterations of BOOST, where the OBDD obtained after iteration i serves as oracle for iteration i+1. Each curve is titled with the corresponding OBDD, its original size, its size after finishing the boosting process, the obtained reduction rate, and (in brackets) the needed number of iterations. We stopped the run of BOOST when the number of iterations without proper improvement exceeded a certain threshold t (t typically ranged between 5 and 20). The longest run with proper improvements at each iteration had length 208, but in most cases runs of length 10 to 20 achieve sufficiently good results.

Table 1 contains the main results of our experiments on nine partial OBDD's; the meaning of its columns is explained below the table.

³ By "reduction rate" we mean the ratio between the size of the learnt OBDD and the size of the OBDD used as oracle in the first iteration.

OBDD    m     |B_d|   |B|      |B_boost|       |B_restr|      |B_boost^restr|
A       30    81      64       30 (46.9%)      --             --
B       33    162     273      157 (57.5%)     --             --
C       137   267     1081     702 (64.9%)     659 (61.0%)    655 (99.4%)
D       60    36      11732    3194 (27.2%)    4528 (38.6%)   3037 (67.1%)
E       128   131     15905    8402 (52.8%)    7312 (46.0%)   7298 (99.8%)
F       128   101     15905    7040 (44.3%)    8695 (54.7%)   6872 (79.0%)
G       76    101     12310    5241 (42.6%)    8297 (67.4%)   5253 (63.3%)
H       62    105     4549     3286 (72.2%)    3356 (73.8%)   3007 (89.6%)
I       62    111     4549     2264 (49.8%)    2323 (51.1%)   2240 (96.4%)

Table 1: Reduction rates for partial OBDD's

Column 1: Names associated with the different OBDD's B.
Column 2: Number m of boolean variables.
Column 3: Sizes of the different OBDD's B_d representing the respective domain.
Column 4: Sizes of the different B.
Column 5: Sizes of the corresponding improvements B_boost found by BOOST when B was used as the oracle in the first iteration (the reduction rate compared to B is also shown).
Column 6: Sizes of the corresponding improvements B_restr found by the popular heuristic RESTRICT from [7], with the achieved reduction rates.
Column 7: Sizes of the corresponding improvements B_boost^restr found by BOOST when the output B_restr of RESTRICT was used as the oracle in the first iteration (the reduction rate now refers to the preoptimized OBDD B_restr).

It turns out that, in our experiments, the general approach of BOOST, tailored to OBDD's, is able to outperform a specialized MinRep(OBDD) heuristic like RESTRICT. Table 1 also illustrates the effect of using RESTRICT as a clever algorithm for preoptimization: if we use the output of RESTRICT as the first oracle, then BOOST improves on RESTRICT again (as demonstrated by the table). It would be interesting to study this boosting effect with other intelligent heuristics for preoptimization.


6 Conclusion

We plan to carry out more systematic experiments in the near future. We want to figure out which other concept classes allow learning algorithms that boost solutions for combinatorial optimization problems. We want to develop a "tool box" which allows this novel heuristic BOOST to be applied to various combinatorial problems. The problems are addressed either directly (via an appropriate booster) or indirectly (via an approximation preserving reduction).

At the moment, we are examining a totally different but also very challenging application of the boosting technique in an area of algorithmic learning theory called "learning from examples". Here, the learner has to infer a hidden target concept T from a finite sample $S_T$ consisting of positive and negative examples for T. A famous and well-elaborated theoretical framework in this context is given by the PAC learning model⁴ introduced by Valiant in [25]. In this model, each example of $S_T$ is drawn according to a fixed but unknown distribution D. The learner has to deliver a hypothesis H which is likely to be a good approximation of T. The quality of hypothesis H as an approximation of T is measured by the prediction error $D(H(x) \neq T(x))$. We refer the reader to [14] for a more detailed introduction to the PAC learning model.

In our application, we are interested in learning a hidden target DFA M from examples. Sample $S_M$ consists of m examples $(w_i, l_i)$, where the $w_i$ are words over a finite alphabet $\Sigma$, drawn at random according to an unknown distribution D, and the $l_i$ are the appropriate labels, i.e., $l_i = $ "+" if M accepts $w_i$, and $l_i = $ "-" otherwise. Unfortunately, it turned out that DFA's cannot be learned from examples in the PAC learning model, given one of several standard cryptographic assumptions. For instance, a PAC learning algorithm for DFA's could be used to break the famous RSA cryptosystem or to efficiently decide the quadratic residue property. See [13] for more details and pointers to the literature about cryptography. However, this negative result is based on choosing the target DFA's and the probability distributions on $\Sigma^*$ in a worst-case fashion. We believe that in many practical situations the task of learning an unknown DFA from examples will not exhibit this kind of cryptographic hardness.

Our optimism is based on a paradigm called "Occam's Razor" in combination with the optimization abilities of our boosting strategy. "Occam's Razor" stands for the principle of preferring hypotheses which are simple and consistent with reality. In the PAC learning framework, H should have a succinct representation (simplicity) and predict the labels within sample $S_M$ correctly (consistency). The claim is that hypotheses of this kind are likely to have a small prediction error. A precise version of this claim has been rigorously proven within the PAC learning model. See [4] for more details.

⁴ PAC is the acronym for Probably Approximately Correct.


BOOST can be used to apply Occam's Razor to the DFA learning problem as follows:

• Produce a sample $S_M$ according to the unknown distribution D.

• Create a first DFA $M_0$ which is consistent with $S_M$. This is done either in an ad-hoc manner or by using the procedure proposed by Traktenbrot and Barzdin in [23].

• Given $M_i$, use a slightly modified boosting technique to obtain $M_{i+1}$. As before, membership queries are answered according to $M_i$, but equivalence queries are answered "YES" as soon as the hypothesis DFA is consistent with $S_M$. Otherwise a witness of inequivalence $w_i$, with $(w_i, l_i) \in S_M$, is presented (see the sketch after this list).

First experimental results dealing with random target DFA's and some restricted classes of probability distributions are encouraging. Surprisingly, even very simple-minded preoptimizations, used to create the first oracle $M_0$, are enough to get BOOST started. It should be mentioned that the procedure of Traktenbrot and Barzdin by itself (without BOOST) is known to learn random DFA's extremely well under distributions which are uniform on words of restricted length (see [15], for instance). A comparison of BOOST with other algorithms for learning DFA's is postponed to a forthcoming paper.
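A sketch of the modified equivalence oracle from the last item above; the encodings of the sample and of the membership test are illustrative assumptions:

```python
import random

def equ_from_sample(sample, member):
    """Build the modified equivalence oracle for learning from examples.

    sample -- pairs (w_i, l_i); here l_i is True for "+" and False for "-"
    member -- member(w, r_h): does the hypothesis DFA r_h accept w?
    """
    def equ(r_h):
        inconsistent = [w for (w, label) in sample if member(w, r_h) != label]
        if not inconsistent:
            return None                     # "YES": r_h is consistent with S_M
        return random.choice(inconsistent)  # a randomized witness, as in WITNESS
    return equ
```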

Acknowledgements

The authors would like to thank all members of the student group PG 285 for many valuable ideas and discussions, and for doing a lot of software implementation and even more experiments. The authors gratefully acknowledge the support of Bundesministerium für Forschung und Technologie grant 01IN102C/2, the support of Deutsche Forschungsgemeinschaft grants Si 498/3-1 and Si 498/3-2, the support of Deutscher Akademischer Austauschdienst grant 322-vigoni-dr, and the support of the ESPRIT Working Group in Neural and Computational Learning (NeuroCOLT No. 8556).


References

[1] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, 1974.

[2] S. B. Akers. Binary Decision Diagrams. IEEE Transactions on Computers, 27:509–516, 1978.

[3] D. Angluin. Learning Regular Sets from Queries and Counterexamples. Information and Computation, 75:87–106, 1987.

[4] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Occam's Razor. Information Processing Letters, 24:377–380, 1987.

[5] R. E. Bryant. Graph-based Algorithms for Boolean Function Manipulation. IEEE Transactions on Computers, 35:677–691, 1986.

[6] R. E. Bryant. Symbolic Boolean Manipulation with Ordered Binary Decision Diagrams. ACM Computing Surveys, 24(3):293–318, 1992.

[7] O. Coudert, C. Berthet, and J. C. Madre. Verification of Synchronous Sequential Machines Based on Symbolic Execution. In Proceedings of the Workshop on Automatic Verification Methods for Finite State Systems, pages 111–128, 1989.

[8] P. Fischer and H. U. Simon. An Approximation Preserving Reduction from Graph Coloring to the Consistency Problem of OBDDs, 1994. Unpublished manuscript.

[9] R. Gavalda and D. Guijarro. Learning Ordered Binary Decision Diagrams. In Proceedings of the 6th International Workshop on Algorithmic Learning Theory, pages 228–239. Springer Verlag, 1995.

[10] E. M. Gold. Complexity of Automaton Identification from Given Data. Information and Control, 37:302–320, 1978.

[11] J. E. Hopcroft. An n log n Algorithm for Minimizing States in a Finite Automaton. In Z. Kohavi and A. Paz, editors, Theory of Machines and Computations, pages 189–196. Academic Press, New York, 1971.

[12] J. E. Hopcroft and R. M. Karp. An Algorithm for Testing the Equivalence of Finite Automata. Research Report TR-71-114, Cornell University, Ithaca, N.Y., 1971.

[13] M. J. Kearns and L. Valiant. Cryptographic Limitations on Learning Boolean Formulae and Finite Automata. Journal of the ACM, 41:67–95, 1994.


[14] M. J. Kearns and U. V. Vazirani. An Introduction to Computational Learning Theory. The MIT Press, 1994.

[15] K. Lang. Random DFA's can be Approximately Learned from Sparse Uniform Examples. In Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 45–52, 1992.

[16] C. Y. Lee. Representation of Switching Circuits by Binary Decision Programs. Bell Systems Technical Journal, 38:985–999, 1959.

[17] C. Lund and M. Yannakakis. On the Hardness of Approximating Minimization Problems. In Proceedings of the 25th Annual Symposium on Theory of Computing, pages 286–293, 1993.

[18] L. Baumer, A. Birkendorf, P. Bollweg, J. Forster, K. Freitag, O. Giel, J. Harms, U. Kleine-Vogelpoth, K. Kneupner, I. Kubbilun, N. Masri, M. Rips, H. U. Simon, and M. Wehle. Endbericht der Projektgruppe "Algorithmische Lernstrategien als Werkzeug für kombinatorische Optimierung" (final report of the project group "Algorithmic Learning Strategies as a Tool for Combinatorial Optimization"). Technical Report, Universität Dortmund, 1997.

[19] L. Pitt and M. K. Warmuth. The Minimum Consistent DFA Problem cannot be Approximated within any Polynomial. In Proceedings of the 21st Annual Symposium on Theory of Computing, pages 421–432, 1989.

[20] R. L. Rivest and R. E. Schapire. Inference of Finite Automata using Homing Sequences. Information and Computation, 103:299–347, 1993.

[21] D. Sieling and I. Wegener. Reduction of OBDDs in Linear Time. Information Processing Letters, 48:139–144, 1993.

[22] H. U. Simon. On Approximate Solutions for Combinatorial Optimization Problems. SIAM Journal on Discrete Mathematics, 3(2):294–310, 1990. Also presented at the SIAM Conference on Discrete Mathematics, 1988.

[23] B. A. Traktenbrot and Y. M. Barzdin. Finite Automata (Behavior and Synthesis). North-Holland, 1973.

[24] Y. Takenaga and S. Yajima. NP-Completeness of Minimum Binary Decision Diagram Identification. Technical Report Comp92-99, SS92-46, Department of Information Science, Faculty of Engineering, Kyoto University, Kyoto 606-01, Japan, March 1993.

[25] L. Valiant. A Theory of the Learnable. Communications of the ACM, 27:1134–1142, 1984.
