Learning Algorithms for Automata with Observations

Dorna Kashef Haghighi
Supervised by Doina Precup, Joelle Pineau, and Prakash Panangaden

September 8, 2007

Abstract

We consider the problem of learning the behavior of a POMDP (Partially Observable Markov Decision Process) with deterministic actions and observations. This is a challenging problem because the observations only partially identify the states. Recent work by Holmes and Isbell offers an approach for inferring the hidden states from experience in deterministic POMDP environments. We propose an alternative algorithm that ensures more accurate predictions, and we show that it in fact produces the minimal predicting machine.
1 Introduction
Learning automata with partially observable internal states is one of the interesting and crucial topics in modern AI research. One of the models studied in this field is the Partially Observable Markov Decision Process (POMDP) [KLC98]. In this paper we focus on POMDPs with deterministic actions and observations. In order to learn the internal representation of a partially observable system, it is necessary to have a sufficient history, gained by interacting with the system. The present paper arose from an attempt to learn, from such a sufficient history, a model whose predictions of future action effects are accurate and whose size is close to minimal. Recently, a study was carried out with the goal of learning Hidden Markov Models (HMMs) and Probabilistic Deterministic Finite Automata (PDFA) from data using a new PAC framework [GKPP06]. That algorithm does not handle actions. We have used its main idea of splitting and merging states to develop a new algorithm, called Merge-Split, which works for deterministic partially observable systems with actions. We show the correctness of this new algorithm, and we prove that when the start state is unknown, Merge-Split builds the minimal predicting machine. Finally, we give a few examples to illustrate the behavior of Merge-Split and to compare and contrast its properties with the Looping Prediction Suffix Tree (LPST) algorithm [HI05], another algorithm for learning POMDPs with deterministic actions and observations.
2 Background
We start this section by reviewing the definition of a POMDP, as it is the model we work with throughout this paper. Afterwards, we give a brief overview of the LPST and GKPP algorithms.
2.1 Partially Observable Markov Decision Process
A Partially Observable Markov Decision Process (POMDP) is an extension of a Markov Decision Process in which the observations only partially identify the internal states. It is a tuple (S, A, O, P : S × A × S → [0, 1], γ : A × S × O → [0, 1]) where:

• S is a set of states,
• A is a set of actions,
• O is a set of observations,
• P is the transition probability function: P(s, a, s′) is the probability that taking action a from state s ends up in state s′,
• γ is the observation probability function: γ(a, s, ω) is the probability that ω is observed after taking action a and ending up in state s.

In this paper we address the case of POMDPs in which the actions and observations are deterministic. We define this kind of environment as a tuple D = (S, A, O, δ, γ), where S, A, and O are, as before, sets of states, actions and observations, δ : S × A → S is a deterministic transition function, and γ : S × A → O is a deterministic observation function.
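For concreteness, such a deterministic environment and the histories it generates can be represented as follows. This is a minimal Python sketch; the class and method names are ours, not part of the formal model.

class DeterministicPOMDP:
    def __init__(self, delta, gamma, start):
        self.delta = delta   # dict: (state, action) -> next state
        self.gamma = gamma   # dict: (state, action) -> observation
        self.state = start   # hidden current state

    def step(self, action):
        """Apply an action; return the deterministic observation, which
        depends on the action taken and the state it leads to."""
        self.state = self.delta[(self.state, action)]
        return self.gamma[(self.state, action)]

    def history(self, actions):
        """The action-observation trajectory produced by an action sequence."""
        return [(a, self.step(a)) for a in actions]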
2.2 Looping Prediction Suffix Trees (LPST)
LPST is an algorithm designed by Holmes and Isbell [HI05] for inferring hidden states from experience in deterministic POMDP environments. Their paper proves that every environment has a finite history suffix tree-based representation that allows for perfect prediction. Given such a sufficient history as the input to LPST, the algorithm works as follows:
Algorithm 1 Algorithm for learning the LPST.

LLT(h)
  create root node r representing the empty suffix
  add r to an expansion queue
  while nodes remain in queue do
    remove the first node x
    find all transitions following instances of x in h
    if x's transition set is deterministic then
      continue
    else if isLoopable(x) then
      create loop from x to the highest possible ancestor
    else
      for each one-step extension of x in h do
        create child node and add to queue
      end for
    end if
  end while
  return r
Algorithm 2 Subroutine for determining loopability.

isLoopable(x)
  for each ancestor q of x, starting at the root do
    for all prefixes p such that px or pq occurs in h do
      if pq or px does not appear in h OR trans(px) ≠ trans(pq) then
        return False
      end if
    end for
  end for
  return True
The LPST algorithm starts by creating a node representing the empty suffix. Afterwards, a child node is created for each possible action-observation pair appearing in the history. Three actions are possible for each node: Split, Loop, and Terminate. The termination decision is based on the node's single-step action-observation transition set. If the set is deterministic, no further splitting or looping is done on the node; otherwise, it must be examined whether looping is possible or not. The single-step transition set of a node x is called deterministic if there exists no prefix q such that trans(qx) ≠ trans(x). In this algorithm, nodes can only loop back to one of their ancestors. Node x can loop back to its ancestor q if for all prefixes p, either neither px nor pq occurs in the transition history, or their single-step transition sets, denoted trans(px) and trans(pq), are identical. If a node can neither terminate nor loop, it is split into new nodes, each representing one of its possible single-step action-observation transitions. Figure 1 shows an example of a flip automaton, taken from the same paper. This automaton has deterministic actions and observations. As shown in Figure 2, the learned suffix tree has 6 states. Bars under leaf nodes indicate expansion termination, with each node's single-step action-observation transitions listed beneath the bar.
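To make the loopability test concrete, the following sketch (our names; the quantification over prefixes p is approximated by one-step prefixes) computes trans(·) over a history of (action, observation) pairs and checks whether a node can loop to an ancestor.

def trans(x, h):
    """One-step transitions following occurrences of the suffix x in the
    history h (a list of (action, observation) pairs; x is a tuple of pairs)."""
    k = len(x)
    return {h[i + k] for i in range(len(h) - k) if tuple(h[i:i + k]) == x}

def occurs(x, h):
    k = len(x)
    return any(tuple(h[i:i + k]) == x for i in range(len(h) - k + 1))

def is_loopable(x, q, h):
    """Can node x loop back to its ancestor q? For every one-step prefix p,
    px and pq must either both be absent from h or have identical one-step
    transition sets."""
    for p in {(pair,) for pair in h}:
        px, pq = p + x, p + q
        if occurs(px, h) or occurs(pq, h):
            if not (occurs(px, h) and occurs(pq, h)):
                return False
            if trans(px, h) != trans(pq, h):
                return False
    return True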
Figure 1: The flip automaton.
Figure 2: The learned looping suffix tree for the flip automaton
2.3 PAC-Learning of Markov Models with Hidden State
This algorithm by Gavaldà, Keller, Pineau and Precup [GKPP06] learns a Probabilistic Deterministic Finite Automaton (PDFA) which approximates a Hidden Markov Model (HMM) up to a given degree of accuracy. The inputs to the algorithm are Σ, D, δ, n, and µ, where Σ is the set of possible observations, D is the set of training trajectories (Ds is the set storing the suffixes of all training trajectories that pass through state s), δ is the desired confidence, n is an upper bound on the number of states desired in the model, and µ is a lower bound on the distinguishability between any two states. The main idea of the algorithm is based on splitting, merging and promoting states. It keeps a list of safe and candidate states. The safe states are the final model states. A candidate is a state which may represent a unique state in the model, or may be a duplicate of an existing safe state. The decision to either merge or promote a candidate state sσ is made once enough trajectories starting at that state have been seen, according to the largeness condition (where δ′ is a per-test confidence derived from δ):

  |Dsσ| ≥ (3(1 + µ/4) / (µ/4)²) · ln(20/δ′)    (1)
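For a feel of the sample sizes involved, the threshold of equation (1) is easily computed; a sketch, with the choice of δ′ assumed given:

import math

def largeness_threshold(mu, delta_prime):
    """Minimum |Dsσ| required by equation (1) before a candidate state
    is merged or promoted (sketch)."""
    return 3 * (1 + mu / 4) / (mu / 4) ** 2 * math.log(20 / delta_prime)

print(largeness_threshold(0.1, 0.05))   # roughly 2.9e4 suffixes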
It gets merged with a safe state if the difference between the probability distributions over the trajectories observed from both states is within the given bound. In other words, we merge candidate state sσ and safe state s′ if for every trajectory d we have

  | |Dsσ(d)| / |Dsσ| − |Ds′(d)| / |Ds′| | ≤ µ/2    (2)
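Equation (2) compares empirical suffix distributions coordinate-wise; a small sketch of the test (names are ours):

from collections import Counter

def can_merge(D_candidate, D_safe, mu):
    """Equation (2): merge if the empirical frequencies of every observed
    trajectory differ by at most mu/2 between the two suffix multisets.
    D_candidate and D_safe are lists of trajectories (strings or tuples)."""
    c1, c2 = Counter(D_candidate), Counter(D_safe)
    n1, n2 = len(D_candidate), len(D_safe)
    for d in set(c1) | set(c2):
        if abs(c1[d] / n1 - c2[d] / n2) > mu / 2:
            return False
    return True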
The following table, which has been taken from their paper, summarizes the algorithm.
Table 1. Learning Algorithm

M = PDFA-Learn(Σ, D, δ, n, µ)
  (INITIALIZING)
  Initialize safe states S = {s0}; Ds0 = D
  Initialize candidates S̄ = {s0σ | ∀σ ∈ Σ};
    Ds0σ = {σ2...σk | ∃d ∈ Ds0, d = σσ2...σk}
  While ∃sσ ∈ S̄ which is large, as given by (1):
    Remove sσ from S̄
    (MERGING)
    If ∃s′ ∈ S such that eq. (2) is satisfied for all d:
      Add transition from s to s′ labelled by σ
      Ds′ = Ds′ ∪ Dsσ
    (PROMOTING)
    Else:
      s′ = sσ
      S = S ∪ {s′}; Ds′ = Dsσ
      S̄ = S̄ ∪ {s′σ′ | ∀σ′ ∈ Σ};
      Ds′σ′ = {σ2...σk | ∃d ∈ Ds′, d = σ′σ2...σk}
    End if
  End while
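A compact sketch of the safe/candidate bookkeeping of Table 1, reusing largeness_threshold and can_merge from the sketches above (observations only, no actions; trajectories as strings or tuples of symbols; all names are ours):

def pdfa_learn(Sigma, D, delta_prime, mu):
    """Sketch of PDFA-Learn: states are identified by the string of symbols
    that reaches them; safe[s] is the multiset of suffixes seen from s."""
    threshold = largeness_threshold(mu, delta_prime)     # eq. (1)
    safe = {(): list(D)}
    candidates = {}
    transitions = {}                                     # (state, sigma) -> state
    for sigma in Sigma:
        candidates[((), sigma)] = [d[1:] for d in D if d and d[0] == sigma]
    while True:
        large = [c for c, Dc in candidates.items() if len(Dc) >= threshold]
        if not large:
            break
        s, sigma = large[0]
        Dc = candidates.pop((s, sigma))
        target = next((t for t, Dt in safe.items()
                       if can_merge(Dc, Dt, mu)), None)  # eq. (2)
        if target is not None:                           # MERGING
            transitions[(s, sigma)] = target
            safe[target].extend(Dc)
        else:                                            # PROMOTING
            new = s + (sigma,)
            safe[new] = Dc
            transitions[(s, sigma)] = new
            for sp in Sigma:
                candidates[(new, sp)] = [d[1:] for d in Dc if d and d[0] == sp]
    return safe, transitions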
3 Merge-Split Algorithm
In this section we present our Merge-Split algorithm for learning a POMDP with deterministic actions and observations from a given set of trajectories. It shares the GKPP idea of merging states according to the similarity of their future transitions, but it extends GKPP to allow actions as well as observations. Currently we assume that all future transitions of the states, of every length, are available in the initial training trajectories. The algorithm works as follows. We begin by creating a node representing the empty state; afterwards, for each action-observation pair occurring in the trajectories a corresponding node is created. Two actions are possible for each node: Merge and Split. Two states can be merged if and only if they have the same set of future transitions. If a node cannot be merged with any existing node, then we split it into new nodes representing its single-step action-observation transitions. The algorithm terminates when no further split can be done on any node.

Table 2. Merge-Split Algorithm

M = Merge-Split-Learn(D)
  S = ∅
  Create the empty node s∅
  S = S ∪ {s∅}
  Ds∅ = D
  Split(s∅)

Split(cState)
  S′ = {aσ | ∃d ∈ DcState, d = aσa1σ1...akσk}
  For all aσ in S′:
    Create the node saσ
    Dsaσ = {a1σ1...akσk | ∃d ∈ DcState, d = aσa1σ1...akσk}
    If ∃sa′σ′ ∈ S such that Merge(saσ, sa′σ′) is true:
      Remove node saσ
      Add transition from cState to sa′σ′ labelled by aσ
    Else:
      S = S ∪ {saσ}
      Add transition from cState to saσ labelled by aσ
      Split(saσ)

Merge(s, s′)
  If Ds = Ds′ return true
  Else return false
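The following is a runnable sketch of Table 2 in Python; all names are ours. The paper's Merge test compares complete future-trajectory sets; since any finite training set truncates long futures, the sketch compares futures only up to a horizon h, which is our own approximation. It also includes the acceptance test used later for prediction.

def merge_split_learn(D, h):
    """Sketch of Merge-Split-Learn. D: set of trajectories, each a tuple of
    (action, observation) pairs; h: horizon up to which future sets are
    compared when testing the Merge condition."""
    states = {0: frozenset(D)}     # state id -> set of future trajectories
    transitions = {}               # (state id, (a, o)) -> state id
    _split(0, states, transitions, h)
    return states, transitions

def _key(futures, h):
    return frozenset(d[:h] for d in futures)   # truncated Merge signature

def _split(state, states, transitions, h):
    futures = states[state]
    for ao in {d[0] for d in futures}:                    # one-step extensions
        child = frozenset(d[1:] for d in futures
                          if d[0] == ao and len(d) > 1)
        # Merge: reuse any state whose (truncated) future set is identical.
        match = next((s for s, f in states.items()
                      if _key(f, h) == _key(child, h)), None)
        if match is not None:
            transitions[(state, ao)] = match
        else:                                             # Split
            new = len(states)
            states[new] = child
            transitions[(state, ao)] = new
            _split(new, states, transitions, h)

def predicts(transitions, trajectory, root=0):
    """Acceptance test: follow labelled transitions from the null state;
    reject as soon as a pair has no outgoing transition."""
    state = root
    for ao in trajectory:
        if (state, ao) not in transitions:
            return False
        state = transitions[(state, ao)]
    return True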
We illustrate the algorithm by running it on the flip automaton given in Figure 1. We begin by initializing a single null state s∅. The possible action-observation pairs are L1, L0, R0, R1, and U0. For each of these pairs we create a corresponding node and add a transition from s∅ to it. Now the merge condition is checked for these newly added states. State sL1 corresponds to applying action L and observing 1. In the original automaton this transition leads to s1, so the future-transition set of sL1 is the set of possible transitions of every length from s1. Similarly, sL0 corresponds to state s1, sR0 and sR1 correspond to state s2, and sU0 corresponds to the union of s1 and s2, because the transition U0 provides no additional information about which state we are in. Therefore sL1 can merge with sL0, sR1 with sR0, and sU0 with s∅, because they have the same future transitions. The splitting of the nodes continues in the same way. Figure 3.b shows the learned machine, which has 3 states. The crossed-out states have been merged with the states indicated below them. The machine has fewer states than the LPST-learned automaton, and predicts the behavior of the original automaton correctly.
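To reproduce this walkthrough with the sketch above, the flip automaton can be encoded from its description; this encoding is ours, inferred from the walkthrough rather than copied from Figure 1: action L always leads to s1, R always leads to s2, U stays put, and the observation is 1 exactly when the hidden state changes.

from itertools import product

delta = {('s1', 'L'): 's1', ('s2', 'L'): 's1',
         ('s1', 'R'): 's2', ('s2', 'R'): 's2',
         ('s1', 'U'): 's1', ('s2', 'U'): 's2'}

def run(state, actions):
    traj = []
    for a in actions:
        nxt = delta[(state, a)]
        traj.append((a, 1 if nxt != state else 0))
        state = nxt
    return tuple(traj)

# Trajectories of every length up to 8, from both possible start states
# (the start state is unknown, so histories from both are included).
D = {run(s, acts) for s in ('s1', 's2')
     for n in range(1, 9) for acts in product('LRU', repeat=n)}

states, transitions = merge_split_learn(D, h=2)
print(len(states))                                  # expected: 3, as in Figure 3.b
print(predicts(transitions, run('s2', 'RLL')))      # R0 L1 L0, legal: True
print(predicts(transitions, (('R', 0), ('L', 1), ('L', 1))))   # R0 L1 L1: False

With these inputs, the sketch should recover the same merges as above (sL1 with sL0, sR1 with sR0, sU0 with s∅) and a 3-state machine.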
Figure 3.a: A demonstration of running the Merge-Split algorithm on the flip automaton.
Figure 3.b: The learned Merge-Split machine for the flip automaton.
To predict whether a trajectory a1σ1a2σ2...akσk is accepted by the original automaton, we start from the null state s∅. If the current state does not have a transition labelled with the next aσ pair, we infer that the trajectory is not accepted. Otherwise, we move to the next state and check the next action-observation pair of the given trajectory. If all the transitions are possible, the trajectory is accepted. For example, the learned machine in Figure 3.b accepts R0L1L0 but rejects R0L1L1, which agrees with the original machine. We can in fact prove that the learned machine predicts correctly.

Theorem 1. A trajectory a1σ1a2σ2...akσk is accepted by the original automaton A if and only if the learned machine M accepts it.

Proof.
Part 1: If A accepts the trajectory, then it is also accepted by the machine M.
Assume that M does not accept the trajectory. This can only happen when a1σ1a2σ2...aiσi (i < k) takes the machine M to a state s, and the transition ai+1 from state s has an observation σ′i+1 ≠ σi+1. But this can never happen: the Merge-Split algorithm assigns the set {ai+1σi+1...akσk | ∃d ∈ Ds∅, d = a1σ1a2σ2...akσk} to the trajectories set of state s, and our assumption is that all the possible trajectories of automaton A exist in Ds∅. Therefore the transition ai+1σi+1 is possible from state s, which contradicts our first assumption and implies that if A accepts the trajectory d, so does M.

Part 2: If M accepts the trajectory, then it is also accepted by the automaton A.
Assume that A does not accept the trajectory. If M accepts it, then there must be a sequence of states s∅s1s2...sk in M such that, starting from s∅, a1σ1 takes the machine to s1, a2σ2 takes it to s2, ..., and akσk takes it to sk. As sk−1 has the transition akσk to sk, we can deduce from the algorithm that the trajectories set of sk−1 contains at least one element of the form akσk...amσm (m ≥ k). Similarly, Dsk−2 has at least one trajectory of the form ak−1σk−1akσk...amσm (m ≥ k). By the same reasoning, Ds∅ contains the trajectory a1σ1...ak−1σk−1akσk...amσm (m ≥ k). By the assumption of the algorithm, Ds∅ consists of the future transitions of every length of the states of the original automaton, which implies that each element of the set is acceptable in A. Therefore a1σ1...ak−1σk−1akσk...amσm (m ≥ k), and consequently a1σ1...ak−1σk−1akσk, is accepted by automaton A, which contradicts our assumption and completes the proof.

Although the Merge-Split learned machine is not the minimal machine producing the same set of transitions as the original automaton, we will show that it is minimal in a certain sense.

Theorem 2. The Merge-Split machine is the minimal automaton which can correctly predict all trajectories that can follow any given history, when no assumption is made regarding the start state.

Proof. We know that the original automaton is deterministic; therefore each trajectory a1σ1...aiσi has at most one state as its destination. In order to predict correctly, if the future transitions of two trajectories a1σ1...aiσi and a′1σ′1...a′kσ′k are different, then they must go to two different states; otherwise the machine cannot differentiate their futures, which results in a wrong prediction. To prove the theorem we need to show that the Merge-Split automaton has the minimum number of such states. Assume that there exists a machine P with accurate predictions which has fewer states than the Merge-Split learned machine MS. Then there exist at least two trajectories a1σ1...aiσi and a′1σ′1...a′kσ′k that go to the same state in P but to different states in MS. But if they go to the same state in P, they must have the same possible futures, because otherwise P's predictions would be wrong. Therefore the two states of MS that are the destinations of these two trajectories have the same trajectories sets, and by the merge condition of the algorithm these two states get merged. This implies that if two trajectories go to the same state in P, they also go to the same state in MS. Therefore P and MS must have the same number of states, contradicting the assumption that P has fewer.
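The minimality property can also be checked mechanically on any machine produced by the earlier sketch: by Theorem 2, no two learned states may share the same (truncated) future-trajectory set, since the Merge step would have collapsed them. A sanity-check sketch in the notation of the earlier code:

def assert_minimal(states, h):
    # No two states may have identical truncated future signatures.
    keys = [frozenset(d[:h] for d in f) for f in states.values()]
    assert len(keys) == len(set(keys)), "two states share a future set"

assert_minimal(states, h=2)   # passes for the flip-automaton example above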
4 Discussion
In this section a few examples are provided to show some sample outputs of the Merge-Split algorithm, and to compare the properties of Merge-Split and LPST machines. We looked for examples in which the Merge-Split learned machine has more states than the LPST one. Consider the 7-state automaton with a reset action in Figure 4.a. The LPST representation (Figure 4.b) has 25 states, while the Merge-Split machine (Figure 4.c) has 28 states. This example shows that Merge-Split does not produce a minimal machine, as it has more states than the original automaton; however, it is still the minimal predicting automaton. The LPST machine has learned some behaviors of the original automaton; for example, it has learned that the future transitions after observing B0 depend on the state in which action B is applied, while the future transitions of other action-observation pairs are always the same, regardless of their start state. But the LPST machine cannot always correctly predict whether a given trajectory is accepted. For instance, for the trajectory A0B0A0, it can infer from the learned automaton that observing B0 is possible after A0, but to deduce whether observing A0 is possible after A0B0 it would need to know which of the six B0 states of the machine corresponds to the current state after A0B0, and finding this state is not possible in the learned LPST.
Figure 4.a: 7-state Flip-Flop with reset action A
Figure 4.b: The LPST learned machine.
Figure 4.c: The Merge-Split learned machine.
Figures 5.a and 5.b demonstrate the Merge-Split algorithm on an automaton which does not have a reset action. The learned representation, in Figure 5.b, contains a structure isomorphic to the original automaton on the right. However, the remaining states are necessary to ensure that future trajectories are predicted correctly from any other starting state.
Figure 5.a: 7-state Flip-Flop without a reset action.
Figure 5.b: The learned Merge-Split machine for the automaton of Figure 5.a.
5 Future Work
In this work we have shown how to learn a POMDP with deterministic actions and observations using the Merge-Split algorithm, and we have proved that this algorithm produces the minimal predicting machine. Our next goal is to find an algorithm which can predict correctly in POMDP environments with deterministic actions but stochastic observations. The first step is to modify the current Merge-Split algorithm to work in probabilistic environments. To do so, the trajectories set of each state should record the probability of observing each trajectory from that state, and the merge condition should also take these probabilities into account, so that two states can be merged if the difference in the probabilities of observing future transitions from them is within a defined bound. The next step is to determine whether the modified algorithm still produces the minimal predicting machine.
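One possible form of such a probabilistic merge condition, sketched under our own assumptions (empirical trajectory frequencies compared within a tolerance ε, in the spirit of equation (2)):

from collections import Counter

def can_merge_stochastic(D_s, D_sprime, eps):
    # Hypothetical merge test for stochastic observations: merge two states
    # when the empirical probability of every observed future trajectory
    # differs by at most eps. D_s and D_sprime are multisets of trajectories.
    p, q = Counter(D_s), Counter(D_sprime)
    n, m = len(D_s), len(D_sprime)
    return all(abs(p[d] / n - q[d] / m) <= eps for d in set(p) | set(q))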
6 References
[GKPP06] Ricard Gavaldà, Philipp W. Keller, Joelle Pineau, and Doina Precup. "PAC-Learning of Markov Models with Hidden State". In Proceedings of ECML, 2006.

[HI05] Michael P. Holmes and Charles Lee Isbell, Jr. "Looping Suffix Tree-Based Inference of Partially Observable Hidden State". In Proceedings of the 23rd International Conference on Machine Learning, pages 409-416, Pittsburgh, Pennsylvania, 2006. ACM Press.

[KLC98] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. "Planning and acting in partially observable stochastic domains". Artificial Intelligence, 101:99-134, 1998.