Technical Report

Department of Computer Science and Engineering
University of Minnesota
4-192 EECS Building
200 Union Street SE
Minneapolis, MN 55455-0159 USA

TR 04-015 Temporal Sequence Prediction Using an Actively Pruned Hypothesis Space Steven Jensen, Daniel Boley, Maria Gini, and Paul R. Schrater

April 14, 2004

Temporal Sequence Prediction Using an Actively Pruned Hypothesis Space

Steven Jensen, Daniel Boley, Maria Gini and Paul Schrater
Department of Computer Science and Engineering, University of Minnesota

Abstract

We propose a novel time/space efficient method for learning temporal sequences that operates on-line, is rapid (requiring few exemplars), and adapts easily to changes in the underlying stochastic world model. This work is motivated by humans’ remarkable ability to learn spatiotemporal patterns and make short-term predictions much faster than most existing machine learning methods. Using a short-term memory of recent observations, our method maintains a dynamic space of candidate hypotheses in which the growth of the space is systematically and dynamically pruned using an entropy measure over the observed predictive quality of each candidate hypothesis. We further demonstrate the application of this method in the domain of “matching pennies” and “rock-paper-scissors” games.

1 Introduction

Humans continually form and revise predictive models of the world around them, and employ these models in virtually every aspect of behavior. Robust sequence prediction is essential for any intelligent agent functioning in a dynamic (non-stationary) world. An agent that learns “on the fly” while operating in a dynamic environment must be able to make predictions of near-future events based on limited experience of environmental event history. The challenge for a learning agent in a rapidly changing environment is to exploit the relevant experience while preserving flexibility.

Recent findings in neuroscience [Huettel et al., 2002] suggest that we might use human behavior as a model for a machine-learning solution to short-term temporal sequence detection and prediction. Specifically, short-term memory appears to be used to rapidly learn regularly occurring patterns in sequences of observations, and the detection of apparent non-randomness in sensory input underlies the behavioral mechanisms responsible for learning and predicting sequences of events. Together, these results suggest a machine-learning approach that exploits basic, low-level predictability as a basis for learning and predicting temporal sequences.

Low-level predictability can be modeled via a variable-length order-n Markov chain. Recall that the joint probability distribution of a sequence of observations o_{1:n} = o_1, o_2, \ldots, o_n can always be factored as

    P(o_1, o_2, \ldots, o_n) = P(o_1) \prod_{t=2}^{n} P(o_t \mid o_{1:t-1})    (1)

The full joint distribution becomes intractable as the sequence length increases; however, a fixed, finite-length order-n Markov model P(o_{T-n}) \prod_{t=T-n+1}^{T} P(o_t \mid o_{T-n:t-1}) cannot capture regularities with variable lags and multiple time scales. In this paper we propose an algorithm that overcomes some of the problems of fixed-length Markov chains.

2 A Motivating Example

Consider a situation in which a learning agent encounters the following sequence of observations (for simplicity we denote each observation with a single capital letter):

    ABACABACA?

If an order-1 Markov assumption is made, then the model, when presented with the given sequence up to the final event ’A’, should predict the subsequent occurrence of the event ’B’ or ’C’ with equal probability. If we arbitrate with the flip of a coin, we can expect no better than 50% success at predicting the succeeding event. However, this is clearly not what a human observer would immediately induce from the same sequence. A human observer recognizes that events ’B’ and ’C’ alternate with regularity and are completely predictable. If we simply augment our definition of ’state’ to include the 2 preceding events (order-2 Markov chain), we have increased the number of possible states from 3 (A, B, or C) to 9, but now we have a model that can accurately sort out the context and learn that (C, A) predicts ’B’ with probability 1 and (B, A) predicts ’C’ with probability 1.

Similarly, the following sequence is problematic for a ’fixed’ order-2 Markov model:

    AABCABBAABCAAB?

In this case, the length-2 sequence (A, B) predicts ’C’ with probability 0.67 and predicts ’B’ with probability 0.33. However, if we extend our definition of state to three events (order-3 Markov chain), the sequence (A, A, B) will predict ’C’ with probability 1. Simply extending the Markov order, however, will increase the difficulty of learning the simple relationship of (A, A) predicting ’B’. Increasing the order of the Markov chain is a common method for augmenting the definition of state in order to uncover predictable relationships in temporal sequences; the drawback to this approach lies in the combinatorial explosion of the state space.
Consider the case in which some event is entirely predicted by a single observation occurring, say, 3 time-steps in the past. An order-3 Markov chain will require a large number of training samples to detect this specific relationship, because the 2 intervening events are drawn from a largely random pool. For example, consider the following sequence:

    ABCABBABCACBACC?

This sequence is more subtle; note, however, that the event ’A’ occurs in every fourth time-slot. Following the final two events in this sequence, the order-2 Markov chain will be unable to predict any event because the sequence (C, C) is novel. Given adequate experience, an agent employing the order-2 Markov chain will ultimately “learn” that ’A’ occurs following every possible combination of two intervening events, but an agent operating in a highly dynamic environment must learn much more quickly, with limited experience. We propose an alternate method, using the notion of an actively pruned “hypothesis” space, that is able to sort out highly predictive patterns regardless of the Markov order, and to do so with relatively few examples.
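The gap between fixed Markov orders can be made concrete in a few lines of code. The following sketch (our illustration, not part of the original report) counts n-gram successors for the second example sequence, showing that the order-2 context (A, B) is ambiguous while the order-3 context (A, A, B) is deterministic:

```python
from collections import Counter, defaultdict

def ngram_predictions(seq, order):
    """For each length-`order` context in seq, count which symbols follow it."""
    table = defaultdict(Counter)
    for i in range(order, len(seq)):
        table[seq[i - order:i]][seq[i]] += 1
    return table

seq = "AABCABBAABCAAB"
order2 = ngram_predictions(seq, 2)
order3 = ngram_predictions(seq, 3)
print(order2["AB"])   # ambiguous: 'C' twice, 'B' once (0.67 / 0.33)
print(order3["AAB"])  # deterministic: 'C' every time
```

The same counting scheme applied at order 3 would, however, need far more data to rediscover the simple (A, A) regularity, which is the combinatorial cost the text describes.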

3 Actively Pruned Hypothesis Space Method

We propose a method for making short-term predictions using an actively pruned hypothesis space (HSpace). The method avoids some of the pitfalls of current methods, such as the need for long training sequences and uncontrollable combinatorial explosion, and provides rapid, on-line learning. This, in turn, enables adaptation to rapidly changing situations. Furthermore, it is capable

of rapidly detecting predictable events in apparently random sequences and adapting to pattern changes. Unlike augmented order-n Markov chain methods, in which learning occurs over a space of uniform n-grams, this algorithm learns over a space of hypotheses. Given a short-term memory consisting of the n most recent temporally-ordered observations, an individual hypothesis consists of the ordered contents of the short-term memory of recent observations (or a subset thereof) and an associated prediction-set of events that have, in the past, immediately followed the pattern contained in short-term memory.

Given an ordered temporal stream of discrete observations (or percepts) (o_{t−n}, o_{t−n+1}, \ldots, o_{t−1}), we seek to determine whether any pattern of observations from that stream accurately predicts the immediate future event or events. The event o_t might be consistently predicted by the single observation o_{t−1}, or the single observation o_{t−2}, or by the two events (o_{t−4}, o_{t−1}), or by some other combination. Each of these combinations, along with its associated event predictions, is referred to as a hypothesis. In general, if the system takes the form of a Markov chain of order-1, then the single observation at time t − 1 will predict the probability of the event at time t. However, given a series of observations, it is not necessarily true that the sequence results from a Markov process of order-1. For example, it may be that the single observation at time t − 4 accurately predicts the observed event while the observation at time t − 1 is irrelevant, or the two specific observations occurring at times t − 6 and t − 4 may predict the observed event with high probability, and so on.

Assuming, without loss of generality, that we fix the length of the histories stored in the short-term memory to n = 7, there are 2^n = 128 possible subsets of the event history that can be used to form hypotheses, equivalent to the power-set formed from the 7-gram short-term memory.
If we disregard the trivial hypothesis (consisting only of the empty-set {}), then for any specific short-term memory configuration we are able to form 127 hypotheses, each of which “may” have predicted the observed event at time t.

The system consists of a short-term memory (STM) of the n last observations o_{t−n}, \ldots, o_{t−1}, where for simplicity we set n = 7, plus a Hypothesis Space (HSpace) which stores the various possible hypotheses. Each hypothesis consists of a partial history (a subset of the last n time slots) and all the possible predictions, with a count associated with each prediction. There are 2^n possible subsets of the history. At each time step, the system attempts to learn which of the subsets is consistently good at predicting the current observation o_t. It does this by adding a potential hypothesis for each possible subset corresponding to the current observation:

    o_{t−1} ⇒ e_t
    o_{t−2} ⇒ e_t
    ...
    o_{t−6}, o_{t−4} ⇒ e_t
    ...
    o_{t−7}, o_{t−6}, \ldots, o_{t−1} ⇒ e_t

where e_t is the observation at time t that each rule is trying to predict. The exact process by which the HSpace is filled with these hypotheses is the learning process outlined in the next section. By forming these hypotheses in real-time, we can learn those that, over time, predict specific events with high probability and utilize them to make predictions of future events. The set of all hypotheses formed over time represents the “hypothesis space” or HSpace.
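The enumeration of candidate hypotheses from the STM can be sketched as follows (a hypothetical Python rendering; the slot-indexed representation is our own choice, not the paper's):

```python
from itertools import combinations

def form_hypotheses(stm, event):
    """Pair every non-empty subset of STM slots with the observed event.
    Each hypothesis is ((slot, observation), ...) => event."""
    slots = list(enumerate(stm))  # slot 0 holds the oldest observation, o_{t-7}
    return [(combo, event)
            for r in range(1, len(slots) + 1)
            for combo in combinations(slots, r)]

stm = list("CABACAB")             # (o_{t-7}, ..., o_{t-1})
hyps = form_hypotheses(stm, "A")  # event observed at time t
print(len(hyps))                  # 2^7 - 1 = 127 hypotheses
```

Recording the slot index alongside each observation is what lets a subset like {o_{t−6}, o_{t−4}} remain distinct from {o_{t−2}, o_{t−1}} even when the symbols happen to match.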


3.1 Learning

The HSpace is used to store the hypotheses encountered and generated from previous time steps. Associated with each hypothesis is its set of predictions, together with counts indicating how many times each prediction has been encountered in the past. As each observation (or percept), o, is sensed, it is entered into an n-element FIFO queue (the STM) containing the recently observed history. The STM is organized in a fixed temporal sequence: (o_{t−7}, o_{t−6}, \ldots, o_{t−1}). At each discrete time step, t, a new set of 127 hypotheses is formed from the stored observations in STM, along with the currently observed event at time t. Each of the 127 individual hypotheses is then inserted into the hypothesis space subject to the following rules:

1. If the hypothesis is not in the HSpace, it is added with an associated prediction-set containing only the current event (prediction) and an event count set to 1.

2. If the hypothesis already resides in the HSpace, the observed event is matched against the stored predictions in the associated prediction-set. If found, the proposed hypothesis is consistent with past observations and the event count corresponding to e_t is simply incremented.

3. If the hypothesis already resides in the HSpace but the observed event e_t is not found in the associated prediction-set, the novel prediction is added to the prediction-set with an event count of 1.
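The three insertion rules collapse naturally into dictionary operations. A minimal sketch, under the assumption (ours) that hypotheses are keyed by their (slot, observation) subsets:

```python
from itertools import combinations

def update_hspace(hspace, stm, event):
    """Insert all 2^n - 1 hypotheses for the current event.
    Rule 1 (new hypothesis), rule 2 (consistent: increment count), and
    rule 3 (novel prediction: count 1) all reduce to a counter update."""
    slots = tuple(enumerate(stm))
    for r in range(1, len(slots) + 1):
        for combo in combinations(slots, r):
            pred_set = hspace.setdefault(combo, {})       # rule 1
            pred_set[event] = pred_set.get(event, 0) + 1  # rules 2 and 3

hspace = {}
update_hspace(hspace, tuple("CABACAB"), "A")  # STM at time t, observe 'A'
update_hspace(hspace, tuple("ABACABA"), "C")  # shifted STM one step later
print(len(hspace))
```

Each call contributes 127 entries; entries recur (and their counts grow) only when the same partial history is seen again.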

3.2 Pruning the Hypothesis Space

The combinatorial explosion in the growth of the HSpace is controlled through a process of active pruning. Since we are only interested in those hypotheses that provide high-quality prediction, inconsistent hypotheses, or those lacking predictive quality, can be removed. For any given hypothesis, the computed entropy of the prediction-set can be used as a measure of the “quality” of prediction. The prediction-set PS for each hypothesis in the HSpace consists of a set of n tuples (e_i, c_i), one for each of the n events predicted by the hypothesis,

    PS = {(e_1, c_1), (e_2, c_2), \ldots, (e_n, c_n)}

where c_i is the count of the number of times that e_i followed this hypothesis’s list of observations in the data sequence. Using the individual event counts, the entropy of the prediction-set can be computed as

    H = -\sum_{i=1}^{n} \frac{c_i}{c_{tot}} \log_2 \frac{c_i}{c_{tot}}

where c_{tot} is simply the sum of all the individual event counts, c_{tot} = \sum_{i=1}^{n} c_i.

If a specific hypothesis is associated with a single, consistent prediction, the entropy measure for that prediction-set will be zero. If a specific hypothesis is associated with a number of conflicting predictions, then the associated entropy will be high. In this sense, the “quality” of the prediction represented by the specific hypothesis is inversely related to the entropy measure. High entropy indicates poor predictive quality, and low entropy indicates consistently accurate prediction.
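The prediction-set entropy can be computed directly from the counts; a brief sketch of the formula above:

```python
import math

def entropy(pred_set):
    """Shannon entropy (in bits) of a prediction-set {event: count}."""
    c_tot = sum(pred_set.values())
    return -sum((c / c_tot) * math.log2(c / c_tot)
                for c in pred_set.values())

print(entropy({"B": 5}))          # a single consistent prediction: zero entropy
print(entropy({"B": 1, "C": 1}))  # two equally likely conflicting predictions: 1 bit
```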

As hypotheses are added to the HSpace, inconsistent hypotheses are removed. An inconsistent hypothesis is one in which the entropy measure over the prediction-set exceeds a predetermined threshold. In other words, when the entropy measure of the predictions associated with a specific hypothesis is high, it fails the “predict with high probability” test and is no longer considered to be a reliable predictor of future events, so it is removed from the HSpace. It is this pruning behavior which bounds the growth in the hypothesis space. Over time, only those hypotheses deemed accurate predictors with high probability are retained in the hypothesis space. All others are eventually removed.
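Pruning then amounts to filtering the HSpace on this entropy. The threshold value below is our assumption; the paper leaves it as a predetermined free parameter:

```python
import math

def entropy(pred_set):
    c_tot = sum(pred_set.values())
    return -sum((c / c_tot) * math.log2(c / c_tot)
                for c in pred_set.values())

def prune(hspace, threshold=1.0):
    """Keep only hypotheses whose prediction-set entropy is within the threshold."""
    return {h: ps for h, ps in hspace.items() if entropy(ps) <= threshold}

hspace = {"h_consistent": {"A": 4},                  # entropy 0: kept
          "h_conflicted": {"A": 1, "B": 1, "C": 1}}  # entropy log2(3): pruned
hspace = prune(hspace)
print(sorted(hspace))
```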

3.3 Making Predictions

As each new observation is experienced and recorded in the STM, new hypotheses are added to the HSpace and re-evaluated against the maximum-entropy threshold. Those that exceed the threshold are actively pruned. The hypotheses which remain in the space are generally high-quality predictors of future events and can be used to perform serial prediction tasks. These predictions are made by considering all hypotheses consistent with the current contents of the STM and choosing the “most likely” hypothesis. Again, an entropy measure over the prediction-set can be used as a qualitative prediction measure: the lower the entropy, the “better” the prediction. To make a prediction, we simply locate the hypotheses in the HSpace which are represented by the current contents of the STM, and rank them according to the entropy measure. The maximum number of matching hypotheses is bounded by the length of the STM; with our choice of keeping the last seven observations in the STM, there are at most 127 such matching hypotheses. The most frequently occurring prediction from the hypothesis with the lowest entropy is the best prediction that can be made, given the current experience.

For making predictions, a simple entropy computation is not sufficient because it is biased toward selecting those hypotheses with a small number of occurrences. For example, a hypothesis that has only occurred once will have a single prediction-set element, producing a computed entropy value of zero. A more robust entropy measure must be used that takes into account the number of occurrences and gives greater weight to those with higher frequency. A more reliable entropy measure is obtained by re-computing the prediction-set entropy with a single false positive added to the set. We add a single, hypothetical false-positive element which represents an implicit prediction of “something else”. This yields a reliable entropy measure,

    H_{rel} = -\left[ \sum_{i=1}^{n} \frac{c_i}{c_{tot}+1} \log_2 \frac{c_i}{c_{tot}+1} + \frac{1}{c_{tot}+1} \log_2 \frac{1}{c_{tot}+1} \right]

If a specific hypothesis in the HSpace has only occurred once, its associated prediction-set will contain a single element with an event count of 1. This yields a computed prediction-set entropy of log_2(1) = 0.0. However, using the reliable entropy measure H_{rel} yields an adjusted entropy of -[(1/2) \log_2(1/2) + (1/2) \log_2(1/2)] = 1. Note that a prediction-set with a single element but a high event count will yield a reliable entropy considerably less than 1. In this case, the reliable entropy measure is consistent with the intuitive notion of “predictability” implied by frequent occurrence.
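The reliable entropy is the same computation with one phantom count added; a sketch reproducing the two cases discussed above:

```python
import math

def reliable_entropy(pred_set):
    """Prediction-set entropy with one hypothetical false positive added,
    so that rarely seen hypotheses are not rated as perfect predictors."""
    c_tot = sum(pred_set.values()) + 1            # +1 for the false positive
    counts = list(pred_set.values()) + [1]
    return -sum((c / c_tot) * math.log2(c / c_tot) for c in counts)

print(reliable_entropy({"B": 1}))   # seen once: 1.0 rather than 0.0
print(reliable_entropy({"B": 99}))  # seen often: considerably less than 1
```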


Input   Pairwise hypotheses                 Order-1 (t−1) hypotheses            Order-1 (t−2) hypotheses
  A     ...                                 ...                                 ...
  B     ...                                 ...                                 ...
  A     At−2 Bt−1 → {(A, 1)}                Bt−1 → {(A, 1)}                     At−2 → {(A, 1)}
  C     Bt−2 At−1 → {(C, 1)}                At−1 → {(C, 1)}                     Bt−2 → {(C, 1)}
  A     At−2 Ct−1 → {(A, 1)}                Ct−1 → {(A, 1)}                     At−2 → {(A, 2)}
  B     Ct−2 At−1 → {(B, 1)}                At−1 → {(B, 1), (C, 1)}             Ct−2 → {(B, 1)}
  A     At−2 Bt−1 → {(A, 2)}                Bt−1 → {(A, 2)}                     At−2 → {(A, 3)}
  D     Bt−2 At−1 → {(D, 1), (C, 1)}        At−1 → {(D, 1), (B, 1), (C, 1)}     Bt−2 → {(D, 1), (C, 1)}

Table 1: Operation of the Pruned HSpace algorithm on a short sample stream.

3.4 An Example

We show step by step how the HSpace is constructed in Table 1. For simplicity, we use a STM of length 2 and a short stream. Given the input ... A B A C A B A D ..., at the second occurrence of observation ’A’ three hypotheses will be inserted in the HSpace. In Table 1, the hypotheses are shown in the row of the corresponding observation. Following the succeeding observations ’C’ and ’A’, six additional hypotheses will be inserted. At the second occurrence of observation ’B’, At−1 predicts not only ’C’, as before, but also ’B’. We now have an ambiguous prediction, with ’B’ being predicted with probability 1/2 and ’C’ also being predicted with probability 1/2. The entropy of the At−1 prediction goes to 1. In the next time step, we observe ’A’, and since At−2 has now predicted ’A’ consistently three times, its reliable entropy decreases (its predictive quality improves). In the next time step, we observe ’D’, and the ambiguity of At−1 now includes ’D’, ’B’, and ’C’. The entropy of the prediction-set of At−1 has increased to approximately 1.58, making it a good candidate for pruning.

3.5 Brief Analysis

For an alphabet of size m, an augmented order-n Markov chain approach requires a transition matrix of size m^n. The proposed Pruned HSpace method spans a potential space of order (m + 1)^n, which is significantly larger; however, two attributes of the problem domain restrict the effective size of the HSpace:

1. Limited experience yields a sparse space: only those hypotheses that both have been experienced and have high predictive quality are kept in the HSpace.

2. Statistical structure in the observation space leads to efficient pruning: if the temporal stream of observations is truly random, leaving no ability to predict future events, then the HSpace method will indeed explode, or, in the presence of pruning, will continually prune and add new hypotheses (i.e., thrash). However, most interesting “real-world” behavior has regularities our algorithm should efficiently exploit.
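For the values used earlier (an alphabet of m = 3 symbols, as in rock-paper-scissors, and STM length n = 7), the two bounds work out as follows:

```python
m, n = 3, 7          # alphabet size, STM length
print(m ** n)        # order-7 Markov contexts: 3^7 = 2187
print((m + 1) ** n)  # potential HSpace patterns, where each of the n slots
                     # holds one of the m symbols or a "don't care": 16384
```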

4 Relation to Other Work

The problem of determining predictive sequences in time-ordered databases has been addressed by a significant body of data mining literature, starting with the seminal work of Agrawal and

Srikant [Agrawal and Srikant, 1995]. However, these approaches (including [Agrawal and Srikant, 1995]) generally utilize a number of passes (forward and/or backward) through the data and do not meet the “on-line” or “real-time” criteria essential for dynamic agent performance. In addition, the prevalent data mining techniques generally do not handle changes in the underlying stochastic model.

The Pruned HSpace algorithm can be viewed as a method to learn a sparse representation of an nth-order Markov process via pruning and parameter tying. Because sub-patterns occur more frequently than the whole, our reliability measure preferentially prunes larger patterns. Because prediction is then performed via the best sub-pattern, we are effectively tying the probability estimates of all the pruned patterns to their dominant sub-pattern. Previous approaches to learning sparse representations of Markov processes include variable memory length Markov models [Guyon and Pereira, 1995, Ron et al., 1996, Singer, 1997, Bengio et al., 1998] and mixture models that approximate n-gram probabilities with sums of lower-order probabilities [Saul and Jordan, 1998]. Variable length Markov models (VLMMs) are most similar to our approach in that they use a variable-length segment of the previous input stream to make predictions. However, VLMMs differ in that they use a tree structure on the inputs, predictions are made via mixtures of trees, and learning is based on agglomeration rather than pruning. In the mixture approach, n-gram probabilities p(o_t | o_{t−1}, \ldots, o_{t−n}) are formed via additive combinations of 2-gram components. Learning in mixture models is complicated by using EM to solve a credit assignment problem between the 2-gram probabilities and the mixture parameters. We believe the relative merits of our algorithm to be its extreme simplicity and flexibility.

5 Empirical Results

We investigated the properties of our Pruned HSpace algorithm using two simple games, “matching pennies” and “rock-paper-scissors”. Matching pennies is a simple two-player game in which one player attempts to match the hidden selection (heads or tails) made by the opposing player. Rock-paper-scissors is a similar but slightly more complex two-player game in which each player simultaneously “plays” one of three choices {rock, paper, scissors}. The winner is decided as follows: “rock” wins over “scissors”, “scissors” wins over “paper”, and “paper” wins over “rock”. Ties are not counted.

Both games are examples of games with no winning deterministic strategy; the game-theoretically optimal strategy is to play randomly, leading, in expectation, to a tie. However, if an opponent exhibits a bias in play selection, that bias can be exploited to provide a winning advantage over time. These games provide an excellent domain in which to test a temporal sequence prediction mechanism. In each case, the playing strategy is to ascertain predictability (bias) in the opponent’s play, predict what the opponent is likely to do next, and choose a play that is superior to that predicted for the opponent. If the opponent exhibits predictable behavior, the learning algorithm can exploit that bias and achieve a statistical edge. In addition, if the opponent changes strategy mid-game, the algorithm must quickly adapt to the new behavior.

The matching-pennies game can be modeled simply using a Bernoulli process s in which the probability of producing s = 1 is systematically biased. We initially tested the Pruned HSpace algorithm by presenting length-1000 binary strings generated by a Bernoulli process in which p(s = 1) was uniformly varied from .5 to 1.0. The number of correct predictions was measured as a function of p(s = 1) and is shown in Figure 1.
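As an illustration of this experimental setup (not the authors' code), the sketch below generates a biased Bernoulli string and scores a simple predictor against it; the sliding-window majority vote is our stand-in for the full Pruned HSpace learner:

```python
import random

def bernoulli_string(p_one, length, seed=0):
    """Binary string from a Bernoulli process with p(s = 1) = p_one."""
    rng = random.Random(seed)
    return [1 if rng.random() < p_one else 0 for _ in range(length)]

def prediction_accuracy(stream, window=7):
    """Fraction of correct next-symbol guesses by a majority vote over
    the last `window` observations (a stand-in predictor)."""
    correct = 0
    for t in range(window, len(stream)):
        guess = 1 if sum(stream[t - window:t]) * 2 >= window else 0
        correct += (guess == stream[t])
    return correct / (len(stream) - window)

acc = prediction_accuracy(bernoulli_string(0.9, 1000))
print(acc)  # approaches the generating bias p(s = 1) = 0.9
```

Even this crude predictor tracks the bias; the plots that follow show the corresponding behavior of the HSpace algorithm itself.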



Figure 1: Performance of algorithm on Bernoulli strings of length 1000. Solid black line shows the observed fraction of 1s in the sample string, dotted line shows the fraction of correct predictions as a function of p(s = 1).


Figure 2: Predictions of algorithm on Bernoulli strings of length 1000. Solid black line shows the observed fraction of 1s in the sample string, dotted line shows the fraction of predictions in which the algorithm guessed 1.

Figure 2 shows the prediction frequency for 1s in the same series of tests. For Bernoulli biases near 0.5 these results are similar to probability-matching results from human experiments (see [Estes, 1964, Jones, 1971]) in which the response frequency matches the bias. The growth in the HSpace was measured as a function of p(s = 1) and is presented in Figure 3. The trial was extended to measure the ability of the Pruned HSpace algorithm to adapt rapidly to changing stochastic process statistics. 100 binary strings of length 1000 were generated



Figure 3: The size of the HSpace as a function of p(s = 1).

using the Bernoulli process described earlier. A simple, predictable binary pattern consisting of all 0s was provided immediately following the length-1000 preamble, and the prediction performance was measured immediately following the “switch”.


Figure 4: Adaptation rate as a function of input samples following switch in generative process statistics. Correct predictions averaged over 100 runs.

Figure 4 summarizes the adaptation results for 3 sample runs of all 0s preceded by a preamble of length 1000 with p(s = 1) set respectively to .50, .75, and 1.00. Adaptation to the switch from a preamble biased to all 1s to a pattern of all 0s occurs within 8 samples and represents the worst-case performance. As a further test, the Pruned HSpace algorithm was used as the basis for a rock-paper-scissors playing program and pitted against human opponents. In addition to a slightly larger alphabet, the rock-paper-scissors game presents a more complex

environment in which two separate temporal observation streams are processed in parallel. The first stream consists of the consecutive plays of the opponent and is used to predict the opponent’s subsequent play. The second stream is used to predict the opponent’s next play based on the sequence of the machine’s plays. In this way, if the opponent is falling into biased patterns related to his/her own play, the first stream will provide predictors, whereas if the opponent attempts to exploit perceived patterns related to the machine’s play, that bias will be detected and exploited. The approach is simple: observe, make two predictions of the opponent’s next play based on the separate input streams, and select the play corresponding to whichever prediction has the lowest reliable entropy measure. A number of matches have been played against human opponents with surprising success. The results of one such game are shown in Figure 5.
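The two-stream play selection can be sketched as follows (hypothetical names; each stream's predictor is assumed to return its predicted opponent play together with its reliable entropy):

```python
BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

def choose_play(their_stream_pred, our_stream_pred):
    """Each argument is (predicted opponent play, reliable entropy).
    Trust whichever stream predicts with lower entropy, then answer
    with the move that beats the predicted play."""
    prediction, _ = min(their_stream_pred, our_stream_pred,
                        key=lambda pe: pe[1])
    return BEATS[prediction]

# Opponent-history stream predicts "rock" confidently; the stream based on
# our own plays predicts "scissors" only weakly:
print(choose_play(("rock", 0.2), ("scissors", 0.9)))  # plays "paper"
```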


Figure 5: Rock-Paper-Scissors wins vs. losses over time against a human opponent.

As demonstrated, an advantage is gained following approximately 35-40 plays. The program exhibits the key characteristics of a dynamic, adaptive, on-line learner. It adapts to the changing play of the opponent and quickly exploits predictive patterns of play. Early results suggest that human players exhibit predictable patterns (even when explicitly trying not to), and demonstrate that the Pruned HSpace method is an effective, efficient tool for learning and predicting these temporal patterns.

6 Conclusions and Future Work

We presented an algorithm capable of learning complex temporal sequences in real-time using limited memory resources while adapting rapidly to changes in the underlying stochastic properties. We demonstrated its potential for use in domains where rapid adaptability is of paramount importance. Straightforward extensions of the algorithm include variable length windows instead of the fixed length windows we presented, and multiple input streams. The algorithm can be extended to longer time scales by treating embedded sequences with high predictability as higher order temporal streams on which a modified pruning HSpace algorithm can operate.


References

[Agrawal and Srikant, 1995] Rakesh Agrawal and Ramakrishnan Srikant. Mining sequential patterns. In Philip S. Yu and Arbee S. P. Chen, editors, Eleventh International Conference on Data Engineering, pages 3–14, Taipei, Taiwan, 1995. IEEE Computer Society Press.

[Bengio et al., 1998] Yoshua Bengio, Samy Bengio, Jean-Francois Isabelle, and Yoram Singer. Shared context probabilistic transducers. NIPS’97, 10, 1998.

[Estes, 1964] W. K. Estes. Probability learning. In A. W. Melton, editor, Categories of Human Learning. Academic Press, 1964.

[Guyon and Pereira, 1995] Isabelle Guyon and Fernando Pereira. Design of a linguistic postprocessor using variable memory length Markov models. International Conference on Document Analysis and Recognition, pages 454–457, 1995.

[Huettel et al., 2002] Scott A. Huettel, Peter B. Mack, and Gregory McCarthy. Perceiving patterns in random series: dynamic processing of sequence in prefrontal cortex. Nature Neuroscience, 5(5):485–490, May 2002.

[Jones, 1971] M. R. Jones. From probability learning to sequential processing: A critical review. Psychol. Bull., 76:153–185, 1971.

[Ron et al., 1996] Dana Ron, Yoram Singer, and Naftali Tishby. The power of amnesia: Learning probabilistic automata with variable memory length. Machine Learning, 25, 1996.

[Saul and Jordan, 1998] Lawrence K. Saul and Michael I. Jordan. Mixed memory Markov models: decomposing complex stochastic processes as mixtures of simpler ones. Machine Learning, pages 1–11, 1998.

[Singer, 1997] Yoram Singer. Adaptive mixture of probabilistic transducers. Neural Computation, 9, 1997.

11

Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building 200 Union Street SE Minneapolis, MN 55455-0159 USA

TR 04-015 Temporal Sequence Prediction Using an Actively Pruned Hypothesis Space Steven Jensen, Daniel Boley, Maria Gini, and Paul R. Schrater

April 14, 2004

Temporal Sequence Prediction Using an Actively Pruned Hypothesis Space Steven Jensen, Daniel Boley, Maria Gini and Paul Schrater Department of Computer Science and Engineering, University of Minnesota Abstract We propose a novel time/space efficient method for learning temporal sequences that operates on-line, is rapid (requiring few exemplars), and adapts easily to changes in the underlying stochastic world model. This work is motivated by humans’ remarkable ability to learn spatiotemporal patterns and make short-term predictions much faster than most existing machine learning methods. Using a short-term memory of recent observations, our method maintains a dynamic space of candidate hypotheses in which the growth of the space is systematically and dynamically pruned using an entropy measure over the observed predictive quality of each candidate hypothesis. We further demonstrate the application of this method in the domain of “matching pennies” and “rock-paper-scissors” games.

1

Introduction

Humans continually form and revise predictive models of the world around them, and employ these models in virtually every aspect of behavior. The need for robust sequence prediction is essential for any intelligent agent to function in a dynamic (non-stationary) world. An agent that learns “on the fly”, while operating in a dynamic environment must be able to make predictions of near-future events based on limited experience of environmental event history. The challenge for a learning agent in a rapidly changing environment is to exploit the relevant experience while preserving flexibility. Recent findings in the field of Neuroscience [Huettel et al., 2002] suggest that we might use human behavior as a model for a machine-learning solution to short-term temporal sequence detection/prediction, specifically in that short-term memory appears to be utilized to rapidly learn regularly occurring patterns in sequences of observations, and that the detection of apparent nonrandomness in sensory input underlies the behavioral mechanisms responsible for learning and predicting sequences of events. Together, these results suggest a machine learning approach that exploits basic, low-level predictability as a basis for learning and predicting temporal sequences. Low-level predictability can be modeled via a variable-length order-n Markov chain. Recall that the joint probability distribution of a sequence of observations o1:n = o1 , o2 , . . . , on can always be factored as n Y P (ot |o1:t−1 ) (1) P (o1 , o2 , . . . , on ) = P (o1 ) t=2

The full joint distribution becomes intractable as the sequence length increases; however, a fixed finite-length order-n Markov model P(o_{T−n}) \prod_{t=T−n+1}^{T} P(o_t \mid o_{T−n:t−1}) cannot capture regularities with variable lags and multiple time scales. In this paper we propose an algorithm that overcomes some of the problems of fixed-length Markov chains.

2 A Motivating Example

Consider a situation in which a learning agent encounters the following sequence of observations (for simplicity we denote each observation with a single uppercase letter): ABACABACA? Under an order-1 Markov assumption, the model, when presented with the given sequence up to the final event 'A', should predict the subsequent occurrence of event 'B' or 'C' with equal probability. If we arbitrate with the flip of a coin, we can expect no better than 50% success at predicting the succeeding event. However, this is clearly not what a human observer would immediately induce from the same sequence. A human observer recognizes that events 'B' and 'C' alternate with regularity and are completely predictable. If we simply augment our definition of "state" to include the 2 preceding events (an order-2 Markov chain), we increase the number of possible states from 3 (A, B, or C) to 9, but we now have a model that can accurately sort out the context and learn that (C, A) predicts 'B' with probability 1 and (B, A) predicts 'C' with probability 1. Similarly, the following sequence is problematic for a fixed order-2 Markov model: AABCABBAABCAAB? In this case, the length-2 sequence (A, B) predicts 'C' with probability .67 and predicts 'B' with probability .33. However, if we extend our definition of state to three events (an order-3 Markov chain), the sequence (A, A, B) will predict 'C' with probability 1. Simply extending the Markov order, however, increases the difficulty of learning the simple relationship of (A, A) predicting 'B'. Increasing the order of the Markov chain is a common method for augmenting the definition of state in order to uncover predictable relationships in temporal sequences; the drawback of this approach is the combinatorial explosion of the state space.
If we consider the case in which some event is entirely predicted by a single observation occurring, say, 3 time-steps in the past, an order-3 Markov chain will require a large number of training samples to detect this specific relationship (owing to the fact that the 2 intervening events are drawn from a largely random pool). For example, consider the following sequence: ABCABBABCACBACC? This sequence is more subtle; note, however, that event 'A' occurs in every fourth time-slot. Following the final two events in this sequence, an order-2 Markov chain will be unable to predict any event because the sequence (C, C) is novel. Given adequate experience, an agent employing the order-2 Markov chain will ultimately "learn" that 'A' occurs following every possible combination of two intervening events, but an agent operating in a highly dynamic environment must learn much more quickly, with limited experience. We propose an alternate method, using the notion of an actively pruned "hypothesis space", that is able to sort out highly predictive patterns regardless of the Markov order, and to do so with relatively few examples.
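The counting argument above is easy to verify directly. The following sketch (illustrative, not from the original paper) tallies order-1 and order-2 successor counts for the first example sequence and shows how the extra context removes the ambiguity after 'A':

```python
from collections import Counter, defaultdict

seq = "ABACABACA"

# Order-1 Markov counts: what follows each single observation?
order1 = defaultdict(Counter)
for prev, nxt in zip(seq, seq[1:]):
    order1[prev][nxt] += 1

# After 'A', an order-1 model is ambiguous: both B and C have occurred.
print(dict(order1["A"]))  # {'B': 2, 'C': 2}

# Order-2 counts resolve the ambiguity: (C, A) is always followed by B,
# and (B, A) is always followed by C.
order2 = defaultdict(Counter)
for a, b, c in zip(seq, seq[1:], seq[2:]):
    order2[(a, b)][c] += 1
print(set(order2[("C", "A")]))  # {'B'}
print(set(order2[("B", "A")]))  # {'C'}
```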

3 Actively Pruned Hypothesis Space Method

We propose a method for making short-term predictions using an actively pruned hypothesis space (HSpace). The method avoids some of the pitfalls of current methods, such as the need for long training sequences and uncontrollable combinatorial explosion, and provides rapid, on-line learning. This, in turn, enables adaptation to rapidly changing situations. Furthermore, it is capable of rapidly detecting predictable events in apparently random sequences and adapting to pattern changes. Unlike augmented order-n Markov chain methods, in which learning occurs over a space of uniform n-grams, this algorithm learns over a space of hypotheses. Given a short-term memory consisting of the n most recent temporally-ordered observations, an individual hypothesis consists of the ordered contents of the short-term memory of recent observations (or a subset thereof) and an associated prediction-set of events that have, in the past, immediately followed the pattern contained in short-term memory.

Given an ordered temporal stream of discrete observations (or percepts) (o_{t−n}, o_{t−n+1}, ..., o_{t−1}), we seek to determine whether any pattern of observations from that stream accurately predicts the immediate future event or events. The event o_t might be consistently predicted by the single observation o_{t−1}, or the single observation o_{t−2}, or by the two events (o_{t−4}, o_{t−1}), or by some other combination. Each of these combinations, along with its associated event predictions, is referred to as a hypothesis. In general, if the system takes the form of an order-1 Markov chain, then the single observation at time t−1 will predict the probability of the event at time t. However, given a series of observations, it is not necessarily true that the sequence results from an order-1 Markov process. For example, it may be that the single observation at time t−4 accurately predicts the observed event while the observation at time t−1 is irrelevant, or the two specific observations occurring at times t−6 and t−4 may predict the observed event with high probability, and so on. Assuming, without loss of generality, that we fix the length of the histories stored in the short-term memory to n = 7, there are 2^n = 128 possible subsets of the event history that are used to form hypotheses, equivalent to the power-set formed from the 7-gram short-term memory.
If we disregard the trivial hypothesis (consisting only of the empty set {}), then for any specific short-term memory configuration we are able to form 127 hypotheses, each of which "may" have predicted the observed event at time t. The system consists of a short-term memory (STM) of the n last observations o_{t−n}, ..., o_{t−1}, where for simplicity we set n = 7, plus a hypothesis space (HSpace) which stores the various possible hypotheses. Each hypothesis consists of a partial history (a subset of the last n time slots) and all the possible predictions, with a count associated with each prediction. There are 2^n possible subsets of the history. At each time step, the system attempts to learn which of the subsets is consistently good at predicting the current observation o_t. It does this by adding a potential hypothesis for each possible subset corresponding to the current observation:

o_{t−1} ⇒ e_t
o_{t−2} ⇒ e_t
...
o_{t−6}, o_{t−4} ⇒ e_t
...
o_{t−7}, o_{t−6}, ..., o_{t−1} ⇒ e_t

where e_t is the observation at time t that each rule is trying to predict. The exact process by which the HSpace is filled with these hypotheses is the learning process outlined in the next section. By forming these hypotheses in real-time, we can learn those that, over time, predict specific events with high probability and utilize them to make predictions of future events. The set of all hypotheses formed over time represents the "hypothesis space" or HSpace.
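The enumeration of the 127 candidate hypotheses can be sketched as the non-empty power-set of the STM slots. The (lag, observation) key encoding below is our own choice for illustration, not notation from the paper:

```python
from itertools import combinations

def candidate_hypotheses(stm):
    """Enumerate all non-empty subsets of the STM slots.

    Each hypothesis key is a tuple of (lag, observation) pairs, where
    lag 1 means o_{t-1} (the most recent observation) and lag len(stm)
    means the oldest slot.
    """
    n = len(stm)
    slots = [(n - i, stm[i]) for i in range(n)]  # stm is oldest-first
    hyps = []
    for k in range(1, n + 1):
        for combo in combinations(slots, k):
            hyps.append(combo)
    return hyps

stm = list("ABACABA")            # the 7 most recent observations, oldest first
hyps = candidate_hypotheses(stm)
print(len(hyps))                 # 2^7 - 1 = 127 non-empty subsets
```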


3.1 Learning

The HSpace is used to store the hypotheses encountered and generated from previous time steps. Associated with each hypothesis is its set of predictions, together with counts indicating how many times each prediction has been encountered in the past. As each observation (or percept) o is sensed, it is entered into an n-element FIFO queue (the STM) containing the recently observed history. The STM is organized in a fixed temporal sequence: (o_{t−7}, o_{t−6}, ..., o_{t−1}). At each discrete time step t, a new set of 127 hypotheses is formed from the stored observations in STM, along with the currently observed event at time t. Each of the 127 individual hypotheses is then inserted into the hypothesis space subject to the following rules:

1. If the hypothesis is not in the HSpace, it is added with an associated prediction-set containing only the current event (prediction) and an event count set to 1.

2. If the hypothesis already resides in the HSpace, the observed event is matched against the stored predictions in the associated prediction-set. If found, the proposed hypothesis is consistent with past observations, and the event count corresponding to e_t is simply incremented.

3. If the hypothesis already resides in the HSpace but the observed event e_t is not found in the associated prediction-set, the novel prediction is added to the prediction-set with an event count of 1.
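With a counter per hypothesis, the three insertion rules collapse to a single increment. A minimal sketch (the string key format is hypothetical, used only for readability):

```python
from collections import defaultdict, Counter

class HSpace:
    """Hypothesis space: hypothesis key -> prediction-set (event -> count)."""

    def __init__(self):
        self.table = defaultdict(Counter)

    def learn(self, hypotheses, event):
        """Apply insertion rules 1-3 for one time step.

        A missing hypothesis or a missing prediction implicitly starts at
        count 0 and becomes 1 (rules 1 and 3); an existing pair is simply
        incremented (rule 2).
        """
        for h in hypotheses:
            self.table[h][event] += 1

hs = HSpace()
hs.learn([("A@t-1",), ("B@t-2",)], "C")   # rule 1: two new hypotheses
hs.learn([("A@t-1",)], "C")               # rule 2: consistent, count -> 2
hs.learn([("A@t-1",)], "D")               # rule 3: novel prediction added
print(hs.table[("A@t-1",)])               # Counter({'C': 2, 'D': 1})
```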

3.2 Pruning the hypothesis space

The combinatorial explosion in the growth of the HSpace is controlled through a process of active pruning. Since we are only interested in those hypotheses that provide high-quality prediction, inconsistent hypotheses, or those lacking predictive quality, can be removed. For any given hypothesis, the computed entropy of the prediction-set can be used as a measure of the "quality" of prediction. The prediction-set PS for each hypothesis in the HSpace consists of a set of n tuples (e_i, c_i), one for each of the n events predicted by the hypothesis,

PS = {(e_1, c_1), (e_2, c_2), ..., (e_n, c_n)}

where c_i is the count of the number of times that e_i followed this hypothesis's list of observations in the data sequence. Using the individual event counts, the entropy of the prediction-set can be computed as

H = −\sum_{i=1}^{n} (c_i / c_{tot}) \log_2 (c_i / c_{tot})

where c_{tot} is simply the sum of all the individual event counts, c_{tot} = \sum_{i=1}^{n} c_i.

If a specific hypothesis is associated with a single, consistent prediction, the entropy measure for that prediction-set will be zero. If a specific hypothesis is associated with a number of conflicting predictions, then the associated entropy will be high. In this sense, the “quality” of the prediction represented by the specific hypothesis is inversely related to the entropy measure. High entropy indicates poor predictive quality, and low entropy indicates consistently accurate prediction.

As hypotheses are added to the HSpace, inconsistent hypotheses are removed. An inconsistent hypothesis is one in which the entropy measure over the prediction-set exceeds a predetermined threshold. In other words, when the entropy measure of the predictions associated with a specific hypothesis is high, it fails the “predict with high probability” test and is no longer considered to be a reliable predictor of future events, so it is removed from the HSpace. It is this pruning behavior which bounds the growth in the hypothesis space. Over time, only those hypotheses deemed accurate predictors with high probability are retained in the hypothesis space. All others are eventually removed.
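The entropy test and the pruning pass can be sketched as follows. The threshold value is an assumption on our part; the paper treats it as a predetermined parameter:

```python
import math
from collections import Counter

def entropy(prediction_set):
    """Shannon entropy of a prediction-set (event -> count)."""
    total = sum(prediction_set.values())
    return -sum((c / total) * math.log2(c / total)
                for c in prediction_set.values())

def prune(hspace, h_max=1.0):
    """Remove hypotheses whose prediction-set entropy exceeds h_max."""
    for key in [k for k, ps in hspace.items() if entropy(ps) > h_max]:
        del hspace[key]

hspace = {
    "consistent": Counter({"A": 5}),                          # H = 0
    "conflicted": Counter({"A": 1, "B": 1, "C": 1, "D": 1}),  # H = 2
}
prune(hspace)
print(sorted(hspace))   # ['consistent']
```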

3.3 Making Predictions

As each new observation is experienced and recorded in the STM, new hypotheses are added to the HSpace and re-evaluated against the maximum-entropy threshold. Those that exceed the threshold are actively pruned. The hypotheses which remain in the space are generally high-quality predictors of future events and can be used to perform serial prediction tasks. These predictions are made by considering all hypotheses consistent with the current contents of the STM and choosing the "most likely" hypothesis. Again, an entropy measure over the prediction-set can be used as a qualitative prediction measure: the lower the entropy, the "better" the prediction. To make a prediction, we simply locate the hypotheses in the HSpace which match the current contents of the STM, and rank them according to the entropy measure. The maximum number of matching hypotheses is bounded by the length of the STM; with our choice of keeping the last seven observations in the STM, there are at most 127 such matching hypotheses. The most frequently occurring prediction from the hypothesis with the lowest entropy is the best prediction that can be made given the current experience.

For making predictions, a simple entropy computation is not sufficient because it is biased toward selecting those hypotheses with a small number of occurrences. For example, a hypothesis that has only occurred once will have a single prediction-set element, producing a computed entropy value of zero. A more robust entropy measure must be used that takes into account the number of occurrences and gives greater weight to those with higher frequency. A more reliable entropy measure is obtained by re-computing the prediction-set entropy with a single false positive added to the set. We add a single, hypothetical false-positive element which represents an implicit prediction of something else. This yields a reliable entropy measure,

H_{rel} = −[ \sum_{i=1}^{n} (c_i / (c_{tot}+1)) \log_2 (c_i / (c_{tot}+1)) + (1 / (c_{tot}+1)) \log_2 (1 / (c_{tot}+1)) ]

If a specific hypothesis in the HSpace has only occurred once, its associated prediction-set will contain a single element with an event count of 1. This yields a computed prediction-set entropy of −\log_2(1) = 0. However, the reliable entropy measure H_{rel} yields an adjusted entropy of −((1/2) \log_2 (1/2) + (1/2) \log_2 (1/2)) = 1. Note that a prediction-set with a single element but a high event count will yield a reliable entropy considerably less than 1. In this case, the reliable entropy measure is consistent with the intuitive notion of "predictability" implied by frequent occurrence.
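The adjusted measure is easy to check numerically. This sketch (ours) implements H_rel from the equation above and reproduces the two cases just discussed:

```python
import math

def reliable_entropy(counts):
    """H_rel: prediction-set entropy with one hypothetical false positive."""
    total = sum(counts) + 1
    terms = [(c / total) * math.log2(c / total) for c in counts]
    terms.append((1 / total) * math.log2(1 / total))
    return -sum(terms)

print(reliable_entropy([1]))    # a single occurrence scores 1.0, not 0.0
print(reliable_entropy([99]))   # frequent and consistent: close to 0
```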


Input: ... A B A C A B A D ...

Obs | Order-2 hypotheses                   | o_{t−1} hypotheses                  | o_{t−2} hypotheses
... | ...                                  | ...                                 | ...
A   | A_{t−2} B_{t−1} → {(A, 1)}           | B_{t−1} → {(A, 1)}                  | A_{t−2} → {(A, 1)}
C   | B_{t−2} A_{t−1} → {(C, 1)}           | A_{t−1} → {(C, 1)}                  | B_{t−2} → {(C, 1)}
A   | A_{t−2} C_{t−1} → {(A, 1)}           | C_{t−1} → {(A, 1)}                  | A_{t−2} → {(A, 2)}
B   | C_{t−2} A_{t−1} → {(B, 1)}           | A_{t−1} → {(B, 1), (C, 1)}          | C_{t−2} → {(B, 1)}
A   | A_{t−2} B_{t−1} → {(A, 2)}           | B_{t−1} → {(A, 2)}                  | A_{t−2} → {(A, 3)}
D   | B_{t−2} A_{t−1} → {(D, 1), (C, 1)}   | A_{t−1} → {(D, 1), (B, 1), (C, 1)}  | B_{t−2} → {(D, 1), (C, 1)}

Table 1: Operation of the Pruned HSpace algorithm on a short sample stream.

3.4 An example

We show step by step how the HSpace is constructed in Table 1. For simplicity, we use an STM of length 2 and a short stream. Given the input ... A B A C A B A D ..., at the second occurrence of observation 'A' three hypotheses will be inserted into the HSpace. In Table 1, the hypotheses are shown in the row of the corresponding observation. Following the succeeding observations 'C' and 'A', six additional hypotheses will be inserted. At the second occurrence of observation 'B', A_{t−1} predicts not only 'C', as before, but also 'B'. We now have an ambiguous prediction, with 'B' being predicted with probability 1/2 and 'C' also being predicted with probability 1/2; the entropy of the A_{t−1} prediction-set goes to 1. In the next time step, we observe 'A', and since A_{t−2} has predicted 'A' consistently three times, its reliable entropy decreases. In the next time step, we observe 'D', and the ambiguity of A_{t−1} now includes 'D', 'B', and 'C'. The entropy of the prediction-set of A_{t−1} has increased to approximately 1.58, making it a good candidate for pruning.
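Table 1's bookkeeping can be reproduced in a few lines. This sketch (ours, with hypothetical key names) runs the length-2 STM over the sample stream and recovers the final prediction-sets:

```python
from collections import Counter, defaultdict

stream = "ABACABAD"
hspace = defaultdict(Counter)

for t in range(2, len(stream)):
    o2, o1, event = stream[t - 2], stream[t - 1], stream[t]
    # the three non-empty subsets of a length-2 STM
    for key in ((("lag2", o2),),
                (("lag1", o1),),
                (("lag2", o2), ("lag1", o1))):
        hspace[key][event] += 1

# A at lag 1 ends up ambiguous, exactly as in the last row of Table 1:
print(dict(hspace[(("lag1", "A"),)]))   # {'C': 1, 'B': 1, 'D': 1}
# A at lag 2 has consistently predicted A three times:
print(dict(hspace[(("lag2", "A"),)]))   # {'A': 3}
```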

3.5 Brief Analysis

For an alphabet of size m, an augmented order-n Markov chain approach requires a transition matrix of size m^n. The proposed Pruned HSpace method spans a potential space of order (m + 1)^n, which is significantly larger; however, two attributes of the problem domain restrict the effective size of the HSpace:

1. Limited experience yields a sparse space: only those hypotheses that both have been experienced and have high predictive quality are kept in the HSpace.

2. Statistical structure in the observation space leads to efficient pruning: if the temporal stream of observations is truly random, leading to no ability to predict future events, then the HSpace method will indeed explode, or, in the presence of pruning, will continually prune and add new hypotheses (i.e., thrash). However, most interesting "real-world" behavior has regularities our algorithm should efficiently exploit.

4 Relation to other work

The problem of determining predictive sequences in time-ordered databases has been addressed by a significant body of data mining literature, starting with the seminal work of Agrawal and Srikant [Agrawal and Srikant, 1995]. However, these approaches (including [Agrawal and Srikant, 1995]) generally utilize a number of passes (forward and/or backward) through the data and do not meet the "on-line" or "real-time" criteria essential for dynamic agent performance. In addition, the prevalent data mining techniques generally do not handle changes in the underlying stochastic model.

The Pruned HSpace algorithm can be viewed as a method to learn a sparse representation of an nth-order Markov process via pruning and parameter tying. Because sub-patterns occur more frequently than the whole, our reliability measure preferentially prunes larger patterns. Because prediction is then performed via the best sub-pattern, we are effectively tying the probability estimates of all the pruned patterns to their dominant sub-pattern. Previous approaches to learning sparse representations of Markov processes include variable memory length Markov models [Guyon and Pereira, 1995, Ron et al., 1996, Singer, 1997, Bengio et al., 1998] and mixture models that approximate n-gram probabilities with sums of lower-order probabilities [Saul and Jordan, 1998]. Variable length Markov models (VLMMs) are most similar to our approach in that they use a variable-length segment of the previous input stream to make predictions. However, VLMMs differ in that they use a tree structure on the inputs, predictions are made via mixtures of trees, and learning is based on agglomeration rather than pruning. In the mixture approach, n-gram probabilities p(o_t | o_{t−1}, ..., o_{t−n}) are formed via additive combinations of 2-gram components. Learning in mixture models is complicated by using EM to solve a credit assignment problem between the 2-gram probabilities and the mixture parameters. We believe the relative merits of our algorithm to be its extreme simplicity and flexibility.

5 Empirical Results

We investigated the properties of our Pruned HSpace algorithm using two simple games, "matching-pennies" and "rock-paper-scissors". Matching-pennies is a simple two-player game in which one player attempts to match the hidden selection (heads or tails) made by the opposing player. Rock-paper-scissors is a similar but slightly more complex two-player game in which each player simultaneously "plays" from a set of three choices {rock, paper, scissors}. The winner is decided as follows: rock wins over scissors, scissors wins over paper, and paper wins over rock. Ties are not counted. Both are games with no optimal strategy: theoretically, the best strategy is to play randomly, leading to a tie. However, if an opponent exhibits a bias in play selection, that bias can be exploited to provide a winning advantage over time. These games provide an excellent domain in which to test a temporal sequence prediction mechanism. In each case, the playing strategy is to ascertain predictability (bias) in the opponent's play, predict what the opponent is likely to do next, and choose a play that is superior to the predicted one. If the opponent exhibits predictable behavior, the learning algorithm can exploit that bias and achieve a statistical edge. In addition, if the opponent changes strategy mid-game, the algorithm must quickly adapt to the new behavior.

The matching-pennies game can be modeled simply using a Bernoulli process s in which the probability of producing s = 1 is systematically biased. We initially tested the Pruned HSpace algorithm by presenting length-1000 binary strings generated by a Bernoulli process in which p(s = 1) was uniformly varied from .5 to 1.0. The number of correct predictions was measured as a function of p(s = 1) and is detailed in Figure 1.
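A compact reimplementation of this experiment is sketched below. It is not the authors' code, and two parameters are assumptions on our part: an STM length of 3 (the paper uses 7) and a plain-entropy pruning threshold of 0.9 (the paper leaves the threshold open). Reliable entropy ranks hypotheses at prediction time; plain entropy drives pruning:

```python
import math
import random
from collections import Counter, defaultdict
from itertools import combinations

STM_LEN = 3          # the paper uses 7; 3 keeps this sketch fast
H_PRUNE = 0.9        # assumed pruning threshold

def plain_entropy(ps):
    total = sum(ps.values())
    return -sum((c / total) * math.log2(c / total) for c in ps.values())

def reliable_entropy(ps):
    total = sum(ps.values()) + 1
    terms = [(c / total) * math.log2(c / total) for c in ps.values()]
    terms.append((1 / total) * math.log2(1 / total))
    return -sum(terms)

def subsets(stm):
    """All non-empty subsets of the STM slots, as (slot, observation) keys."""
    slots = tuple(enumerate(stm))
    return [c for k in range(1, len(stm) + 1)
            for c in combinations(slots, k)]

def run(bits):
    """Predict each bit from the previous STM_LEN bits; return #correct."""
    hspace = defaultdict(Counter)
    correct = 0
    for t in range(STM_LEN, len(bits)):
        keys = subsets(bits[t - STM_LEN:t])
        # predict with the lowest reliable-entropy matching hypothesis
        matches = [hspace[k] for k in keys if k in hspace]
        if matches:
            best = min(matches, key=reliable_entropy)
            if best.most_common(1)[0][0] == bits[t]:
                correct += 1
        # learn the observed bit, then prune inconsistent hypotheses
        for k in keys:
            hspace[k][bits[t]] += 1
        for k in [k for k, ps in hspace.items() if plain_entropy(ps) > H_PRUNE]:
            del hspace[k]
    return correct

random.seed(0)
for p in (0.5, 0.75, 1.0):
    bits = "".join("1" if random.random() < p else "0" for _ in range(1000))
    print(f"p(s=1)={p:.2f}: fraction correct "
          f"{run(bits) / (1000 - STM_LEN):.2f}")
```

On a fully biased string (p = 1.0) the predictor is correct on every step after the first, while at p = 0.5 it can do no better than chance, mirroring the trend in Figure 1.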


Figure 1: Performance of algorithm on Bernoulli strings of length 1000. Solid black line shows the observed fraction of 1s in the sample string, dotted line shows the fraction of correct predictions as a function of p(s = 1).

Figure 2: Predictions of algorithm on Bernoulli strings of length 1000. Solid black line shows the observed fraction of 1s in the sample string, dotted line shows the fraction of predictions in which the algorithm guessed 1.

Figure 2 shows the prediction frequency for 1s in the same series of tests. For Bernoulli biases near 0.5, these results are similar to probability-matching results from human experiments (see [Estes, 1964, Jones, 1971]), in which the response frequency matches the bias. The growth in the HSpace was measured as a function of p(s = 1) and is presented in Figure 3. The trial was extended to measure the ability of the Pruned HSpace algorithm to adapt rapidly to changing stochastic process statistics. 100 binary strings of length 1000 were generated


Figure 3: The size of the HSpace as a function of p(s = 1).

using the Bernoulli process described earlier. A simple, predictable binary pattern consisting of all 0s was provided immediately following the length-1000 preamble, and the prediction performance was measured immediately following the "switch".

Figure 4: Adaptation rate as a function of input samples following the switch in generative process statistics, for preamble 1-biases of .5, .75, and 1.0. Correct predictions averaged over 100 runs.

Figure 4 summarizes the adaptation results for 3 sample runs of all 0s preceded by a preamble of length 1000 with p(s = 1) set respectively to .50, .75, and 1.00. The switch from a preamble biased to all 1s to a pattern of all 0s occurs within 8 samples and represents the worst-case performance. As a further test, the Pruned HSpace algorithm was used as the basis for a rock-paper-scissors playing program and pitted against human opponents. In addition to a slightly larger alphabet, the rock-paper-scissors game presents a more complex

environment in which two separate temporal observation streams are processed in parallel. The first stream consists of the consecutive plays of the opponent and is used to predict the opponent's subsequent play. The second stream is used to predict the opponent's next play based on the sequence of the machine's plays. In this way, if the opponent is falling into biased patterns related to his/her own play, the first stream will provide predictors, whereas if the opponent attempts to exploit perceived patterns related to the machine's play, that bias will be detected and exploited. The approach is simple: observe, make two predictions of the opponent's next play based on the separate input streams, and act on the prediction that has the lower reliable entropy measure. A number of matches have been played against human opponents with surprising success. The results of one such game are shown in Figure 5.


Figure 5: Rock-Paper-Scissors wins vs. losses over time against a human opponent.

As demonstrated, an advantage is gained following approximately 35-40 plays. The program exhibits the key characteristics of a dynamic, adaptive, on-line learner. It adapts to the changing play of the opponent and quickly exploits predictive patterns of play. Early results suggest that human players exhibit predictable patterns (even when explicitly trying not to), and demonstrate that the Pruned HSpace method is an effective, efficient tool for learning and predicting these temporal patterns.
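The two-stream play selection just described can be sketched as follows. The prediction-sets would come from the two HSpaces; the function names and the fallback play are our own illustrative choices:

```python
import math

# value beats key: the counter-move to each predicted opponent play
BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

def reliable_entropy(counts):
    """H_rel of a prediction-set given as a dict of play -> count."""
    total = sum(counts.values()) + 1
    terms = [(c / total) * math.log2(c / total) for c in counts.values()]
    terms.append((1 / total) * math.log2(1 / total))
    return -sum(terms)

def choose_play(pred_from_opp_stream, pred_from_own_stream):
    """Counter whichever stream's prediction has the lower reliable entropy.

    Each argument is a prediction-set for the opponent's next play, or
    None if that stream has no matching hypothesis.
    """
    candidates = [p for p in (pred_from_opp_stream, pred_from_own_stream) if p]
    if not candidates:
        return "rock"  # no information: any fixed or random play will do
    best = min(candidates, key=reliable_entropy)
    predicted = max(best, key=best.get)
    return BEATS[predicted]

# The opponent-play stream strongly suggests "scissors" next, while the
# machine-play stream is ambiguous, so the program plays "rock".
print(choose_play({"scissors": 5}, {"rock": 1, "paper": 1}))
```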

6 Conclusions and Future Work

We presented an algorithm capable of learning complex temporal sequences in real-time using limited memory resources while adapting rapidly to changes in the underlying stochastic properties. We demonstrated its potential for use in domains where rapid adaptability is of paramount importance. Straightforward extensions of the algorithm include variable-length windows instead of the fixed-length windows we presented, and multiple input streams. The algorithm can be extended to longer time scales by treating embedded sequences with high predictability as higher-order temporal streams on which a modified pruning HSpace algorithm can operate.


References

[Agrawal and Srikant, 1995] Rakesh Agrawal and Ramakrishnan Srikant. Mining sequential patterns. In Philip S. Yu and Arbee S. P. Chen, editors, Eleventh International Conference on Data Engineering, pages 3-14, Taipei, Taiwan, 1995. IEEE Computer Society Press.

[Bengio et al., 1998] Yoshua Bengio, Samy Bengio, Jean-Francois Isabelle, and Yoram Singer. Shared context probabilistic transducers. NIPS'97, 10, 1998.

[Estes, 1964] W. K. Estes. Probability learning. In A. W. Melton, editor, Categories of Human Learning. Academic Press, 1964.

[Guyon and Pereira, 1995] Isabelle Guyon and Fernando Pereira. Design of a linguistic postprocessor using variable memory length Markov models. International Conference on Document Analysis and Recognition, pages 454-457, 1995.

[Huettel et al., 2002] Scott A. Huettel, Peter B. Mack, and Gregory McCarthy. Perceiving patterns in random series: dynamic processing of sequence in prefrontal cortex. Nature Neuroscience, 5(5):485-490, May 2002.

[Jones, 1971] M. R. Jones. From probability learning to sequential processing: A critical review. Psychological Bulletin, 76:153-185, 1971.

[Ron et al., 1996] Dana Ron, Yoram Singer, and Naftali Tishby. The power of amnesia: Learning probabilistic automata with variable memory length. Machine Learning, 25, 1996.

[Saul and Jordan, 1998] Lawrence K. Saul and Michael I. Jordan. Mixed memory Markov models: decomposing complex stochastic processes as mixtures of simpler ones. Machine Learning, pages 1-11, 1998.

[Singer, 1997] Yoram Singer. Adaptive mixture of probabilistic transducers. Neural Computation, 9, 1997.
