RETROGRADE APPROXIMATION ALGORITHMS FOR JEOPARDY STOCHASTIC GAMES¹

Haw-ren Fang²   James Glenn³   Clyde P. Kruskal⁴

Minnesota, USA / Maryland, USA / Maryland, USA

ABSTRACT

We study algorithms for solving one- and two-player stochastic games with perfect information, by modeling a game as a finite, possibly cyclic graph. We analyze a family of jeopardy stochastic games, prove the existence and uniqueness of optimal solutions, and give approximation algorithms to solve them by incorporating Newton's method into retrograde analysis. Examples of jeopardy stochastic games include Can't Stop, Pig, and some variants. Results of experiments on small versions of the game Can't Stop are presented.

1. INTRODUCTION

Retrograde analysis was initially introduced by Ströhlein (1970). It is a bottom-up process for deterministic, finite, two-player zero-sum games with perfect information to compute the game-theoretic values of positions, based on which an optimal playing strategy can be obtained. It has been successfully applied to building endgame databases of checkers (Lake, Schaeffer, and Lu, 1994; Schaeffer et al., 2004), chess (Thompson, 1986; Thompson, 1996), and Chinese chess (Fang, 2005a; Fang, 2005b; Wu and Beal, 2001). It also played a crucial role in solving Nine Men's Morris (Gasser, 1996) and Kalah (Irving, Donkers, and Uiterwijk, 2000). A prominent success of parallel retrograde analysis was solving Awari (Romein and Bal, 2003).

This work considers stochastic games with perfect information, which have received less attention than deterministic games, although the game-theoretic values of positions are also clearly defined. Reinforcement learning techniques (Sutton and Barto, 1998) are often used to tackle stochastic games, such as the study of two-player value iteration by Littman (1994). In some games, like Yahtzee, the dependencies between positions are acyclic, which guarantees the existence and uniqueness of the optimal solution. Glenn (2006) and Woodward (2003) solved one-player Yahtzee by backward induction. In addition, a recent study considered computer strategies for solitaire Yahtzee (Glenn, 2007a). For games with cyclic graph representations, such as Backgammon, Can't Stop, and Pig, value iteration has been applied to approximate the optimal solutions (Bellman, 1957; Sutton and Barto, 1998). Sconyers (2003) built 15×15 bear-off databases of Backgammon that include all positions in Hypergammon. Neller and Presser (2004) solved two-player Pig. In both cases, the existence and uniqueness of the optimal solution were not formally proved, although the convergence of value iteration in practice implies the existence of a solution. In addition to this work on optimal play, Keller (1998) investigated a heuristic for Can't Stop, called the "Rule of 28".

In this paper we analyze a family of stochastic games, called jeopardy games. In a jeopardy game, dice or another randomizing device are used to obtain a value. Depending on the value and the current position, a player makes a choice of moves toward the goal, and then decides either to (1) stop and end the turn with the progress banked, or (2) continue rolling the dice to make further progress. Some dice values, possibly depending on the position, end the player's turn automatically, and the player loses the whole progress of the current turn. Therefore, the second choice gets riskier as the turn progress grows.

Examples of jeopardy stochastic games include Can't Stop (Sackson, 2007)⁵, Pig, Piglet (Neller and Presser, 2004), Pig Dice, Frey's Pig, Big Pig, Piggy, Piggy Sevens, Pig-mania, and Hog (Neller and Presser, 2006)⁶. A jeopardy stochastic game typically has one-, two-, and multi-player versions. We prove that all one- and two-player games in the family of jeopardy games have unique solutions, and give approximation algorithms to solve them by incorporating Newton's method into retrograde analysis. Our methods are more efficient than value iteration, not only in theory but also in practice. We also present the results of experiments on small versions of Can't Stop.

The organization of this paper is as follows. In Section 2 we model one-player stochastic games, and prove that all one-player jeopardy stochastic games have unique solutions. In Section 3 we model two-player stochastic games, and prove that all two-player jeopardy stochastic games have unique solutions. Section 4 gives efficient retrograde approximation algorithms to solve them, with a global convergence analysis. Experiments were performed on small versions of Can't Stop. Section 5 presents the indexing scheme for Can't Stop. Section 6 summarizes the results of the experiments and analyzes a heuristic for a simplified version of one-player Can't Stop. A conclusion is given in Section 7. The Appendix provides a short description of the Can't Stop game rules.

¹ This article is extended from two proceedings papers, one presented at the CGW2007 workshop and the other at the CG2006 conference.
² Dept. of Comp. Sci. and Eng., University of Minnesota, 200 Union St. S.E., Minneapolis, MN, 55455, USA. Email: [email protected]
³ Dept. of Computer Science, Loyola College in Maryland, 4501 N Charles St., Baltimore, MD 21210, USA. Email: [email protected]
⁴ Dept. of Computer Science, University of Maryland, A.V. Williams Bldg., College Park, MD 20742, USA. Email: [email protected]
⁵ See also the Appendix for a short description.
⁶ Note that in some dice games, such as 2-Dice Pig (Neller and Presser, 2006), a player may lose not only the progress of the current turn but also the progress of previous turns. These games are excluded from the family of jeopardy stochastic games in this paper.
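
To make the jeopardy mechanic concrete, the following minimal Python sketch simulates one turn of Pig, the simplest game of the family. The hold-at-20 threshold is an illustrative stopping policy of ours, not a strategy analyzed in this paper.

```python
import random

def pig_turn(hold_at=20):
    """One turn of Pig: roll a die repeatedly; a roll of 1 busts the turn
    and forfeits all progress made this turn; after any other roll the
    player may stop and bank the progress (here: once it reaches hold_at)."""
    progress = 0
    while True:
        roll = random.randint(1, 6)
        if roll == 1:
            return 0              # bust: the whole turn's progress is lost
        progress += roll          # make further progress ...
        if progress >= hold_at:
            return progress       # ... or stop and bank the progress
```
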

2. ONE-PLAYER STOCHASTIC GAMES

2.1 Problem Formulation

A one-player stochastic game with perfect information and a finite number of positions can be represented as a directed, bipartite game graph G = (U, V, E), where U and V are two disjoint sets of vertices and E ⊆ (U × V) ∪ (V × U) is the set of edges. The graph may be cyclic. Each vertex represents a position. A roll position is a vertex u ∈ U. A move position is a vertex v ∈ V. A turn consists of one or more partial turns in sequence; a partial turn is a dice roll followed by a move.

For each non-terminal roll position u ∈ U, a dice roll (u, v) ∈ E is a random event. The weight 0 < p((u, v)) ≤ 1 indicates the probability that the game in roll position u will change into move position v. So, for all roll positions u ∈ U,

Σ_{v:(u,v)∈E} p((u, v)) = 1.

A move (v, u) ∈ E (from a move position to a roll position) is a deterministic choice made by the player. A partial turn (u, u′) (from roll position u to roll position u′) consists of a random event followed by a move. It is represented by a pair of edges (u, v), (v, u′) ∈ E. In general, a turn consists of a sequence of partial turns (u0, u1), (u1, u2), …, (uk−1, uk). In Yahtzee a turn consists of one to three partial turns. In a jeopardy game, a turn may consist of many more partial turns.

Each position w ∈ U ∪ V is associated with a game-theoretic value, called the position value, which measures the expected penalty with optimal play. We denote it by a real function f(w). Our model can also represent games whose goal is to maximize some reward, by denoting the expected reward by −f(w). For a one-player jeopardy game, f(w) indicates the expected number of turns required to finish the game from position w, and for one-player Yahtzee, −f(w) is the expected reward, all with optimal play. For all roll positions u ∈ U,

f(u) = g(u) + Σ_{v:(u,v)∈E} p((u, v)) f(v),    (1)

where g(u) is the step penalty at u; one may regard −g(u) as the step reward. By induction, f(u) is the expected sum of step penalties. In one-player Yahtzee, g(u) is the score obtained at the end of a turn. In a one-player jeopardy game, the objective is to minimize the number of turns to reach the goal. Therefore, g(u) adds one for each turn. More precisely,

g(u) = 1 if u is the starting position for a turn, and g(u) = 0 otherwise.

For the optimal playing strategy, we minimize the expected penalty. Therefore, for all non-terminal move positions v ∈ V,

f(v) = min{f(u) : (v, u) ∈ E}.    (2)

A terminal vertex indicates the end of a game. We assume all terminal vertices, denoted by z, are roll positions (in U) with f(z) = g(z). For example, in one-player Can't Stop, a terminal vertex z is reached when the player completes three columns, and therefore no additional rolling of dice is required, so f(z) = g(z) = 0.

A position value function f satisfying both conditions (1) and (2) is called a solution. A game is solved if a solution is obtained.

This model of one-player stochastic games is equivalent to a Markov decision process without cumulative reward discount, explained as follows. A Markov decision process (MDP) consists of a state space V̂, an action space Â, a probability function p̂ : V̂ × Â × V̂ → [0, 1], and an immediate reward function ĝ : V̂ × Â → R, where p̂(v1, a, v2) is the probability that in state v1 ∈ V̂ action a ∈ Â will lead to the next state v2 ∈ V̂, and ĝ(v1, a) is the immediate reward obtained by action a in state v1. A policy π : V̂ → Â is a function that maps the current state observed by an agent to an action. Starting from a state v0 ∈ V̂ and following a policy π, an agent may travel across states v0, v1, v2, …, and obtain the cumulative reward defined by

ĝ(v0, π(v0)) + γ ĝ(v1, π(v1)) + γ² ĝ(v2, π(v2)) + ···,

where 0 ≤ γ < 1 is the discount rate. The process terminates when the agent enters a state v satisfying p̂(v, π(v), v′) = 0 for all v′ ∈ V̂. Due to random factors we consider the expected cumulative reward f̂(v, π) an agent following policy π will receive in state v. It can be expressed by the recurrence relation

f̂(v, π) = ĝ(v, π(v)) + γ Σ_{v′∈V̂} p̂(v, π(v), v′) f̂(v′, π),    (3)

which is called the Bellman equation (Bellman, 1957). An optimal policy π* maximizes the expected cumulative reward and therefore satisfies the recurrence relation

π* = argmax_π { ĝ(v, π(v)) + γ Σ_{v′∈V̂} p̂(v, π(v), v′) f̂(v′, π*) } for all v ∈ V̂.    (4)

Now we show the equivalence of our one-player model and the MDP without cumulative reward discount. The state space V̂ in an MDP is our set of move positions V. We define Û = {(v, a) : v ∈ V̂, a ∈ Â} and Ê = {(v, (v, a)) : v ∈ V̂, (v, a) ∈ Û} ∪ {((v, a), v′) : p̂(v, a, v′) > 0} in an MDP; these correspond to our set of roll positions U and edge set E. Each probability p̂(v, a, v′) > 0 with u′ = (v, a) ∈ Û in the MDP corresponds to our p((u′, v′)). The immediate reward ĝ(v, a) with u′ = (v, a) ∈ Û is our negated step penalty −g(u′). Finally, our (minimized) expected penalty f(v), v ∈ V, defined by (1) and (2), is the negated optimal expected reward −f̂(v, π*) defined by (3) and (4) in an MDP, but with γ = 1. Therefore, solving a one-player stochastic game is equivalent to the optimal control of an MDP without future reward discount. Also note that the MDP terms 'agent' and 'policy' correspond to our 'player' and 'playing strategy', respectively.

Figure 1 gives an example that simulates the last stage of a one-player jeopardy stochastic game. Here, g(u1) = 1, g(u2) = g(u3) = 0, p((u1, v1)) = 1, and p((u2, v2)) = p((u2, v3)) = 0.5. A turn begins at position u1. At position u2, the player has a 50% probability of reaching the goal, and a 50% chance of falling back to u1. By (1) and (2),

f(v1) = f(u2) = ½ f(v2) + ½ f(v3),
f(v2) = f(u1) = f(v1) + 1,
f(v3) = f(u3) = 0.

The unique solution is f(u1) = f(v2) = 2 and f(u2) = f(v1) = 1.

Suppose we are given a one-player game graph G and its associated function g. First, we investigate the existence and uniqueness of the solution (i.e., a position value function f that satisfies both conditions (1) and (2)). Second, we design an efficient algorithm to solve it, assuming a solution exists.
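
As a sanity check of this example, the unique solution can also be reached by naive fixed-point iteration on the system above; a small Python sketch of ours:

```python
# Iterate the Figure 1 system: f(v1) = 0.5*f(v2) + 0.5*f(v3) with
# f(v2) = f(u1), f(v3) = 0, and f(u1) = f(v1) + g(u1), g(u1) = 1.
f_u1 = 0.0                            # initial estimate for f(u1)
for _ in range(50):
    f_v1 = 0.5 * f_u1 + 0.5 * 0.0     # f(v1) = f(u2)
    f_u1 = f_v1 + 1.0                 # f(u1) = f(v1) + 1
print(round(f_u1, 6))                 # converges (linearly) to 2.0
```
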

Figure 1: An example of a game graph G = (U, V, E) with a unique solution.

2.2 One-Player Games without Cycles

For games with acyclic game graphs, such as Yahtzee, the bottom-up retrograde process to compute the game-theoretic values of positions is clear. We first associate the terminal positions with their position values, and then propagate the information iteratively back to the predecessors until the game-theoretic values of all positions are obtained. The pseudocode is given in Algorithm 1. It can also be applied to constructing the perfect Backgammon bear-off databases with no piece on the bar.

Algorithm 1 A retrograde algorithm to solve a one-player stochastic game G = (U, V, E) without cycles.
Require: G = (U, V, E) is acyclic.
Ensure: Program terminates with (1) and (2) satisfied.
  ∀u ∈ U, f(u) ← g(u), the step penalty.          ⊲ Initialization Phase
  ∀v ∈ V, f(v) ← ∞.
  S1 ← {terminal positions in U}
  S2 ← {terminal positions in V}
  ∀u ∈ S1 ∪ S2, set f(u) to be its value.
  repeat                                           ⊲ Propagation Phase
    for all u ∈ S1 do
      for all (v, u) ∈ E do
        f(v) ← min{f(v), f(u)}
        if all children of v are determined then   ⊲ (*)
          S2 ← S2 ∪ {v}
        end if
      end for
    end for
    S1 ← ∅
    for all v ∈ S2 do
      for all (u, v) ∈ E do
        f(u) ← f(u) + p((u, v)) f(v)
        if all children of u are determined then   ⊲ (**)
          S1 ← S1 ∪ {u}
        end if
      end for
    end for
    S2 ← ∅
  until S1 ∪ S2 = ∅

Consider Algorithm 1. We say a vertex is determined if its position value is known. By (1) and (2), a non-terminal vertex cannot be determined until all its children are determined. The sets S1 and S2 store all determined but not yet propagated vertices. A vertex is removed from them after it is propagated. The backward induction is guaranteed to terminate with all position values determined when it is applied to a game graph without cycles. See Glenn, Fang, and Kruskal (2007b) for details. The optimal playing strategy is clear: in position v we make the move (v, u) with the minimum f(u).

Note that in Algorithm 1, an edge (u, v) can be visited as many times as the out-degree of u because of (*) and (**). The efficiency can be improved as follows. We associate each vertex with the number of its undetermined children, and decrease that number by one whenever a child is determined. A vertex is determined once the number reaches zero. As a result, each edge is visited only once and the algorithm is linear. This is called the children counting strategy.

For games like Yahtzee, the level of each vertex (the longest distance to the terminal vertices) is known a priori. Therefore, we can compute the position values level by level. Each edge is visited only once without counting the children.
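
For concreteness, the children counting variant might be rendered in Python as follows; the graph encoding, names, and vertex tagging are our own illustrative choices, not the authors' implementation (which was written in Java).

```python
import math

def solve_acyclic(succ, p, g, terminals):
    """Retrograde analysis with children counting on an acyclic graph.
    succ[w]   -- list of children of vertex w; vertices are ('roll', id)
                 or ('move', id) tuples
    p[(u, v)] -- probability of dice roll (u, v) for roll positions u
    g[u]      -- step penalty at roll position u
    terminals -- dict mapping terminal vertex -> its position value"""
    f = {w: (g.get(w, 0.0) if w[0] == 'roll' else math.inf) for w in succ}
    f.update(terminals)
    remaining = {w: len(succ[w]) for w in succ}   # undetermined children
    pred = {w: [] for w in succ}
    for w, children in succ.items():
        for c in children:
            pred[c].append(w)
    frontier = list(terminals)                    # determined, not yet propagated
    while frontier:
        w = frontier.pop()
        for q in pred[w]:
            if q[0] == 'move':
                f[q] = min(f[q], f[w])            # condition (2)
            else:
                f[q] += p[(q, w)] * f[w]          # condition (1)
            remaining[q] -= 1
            if remaining[q] == 0:                 # all children determined
                frontier.append(q)
    return f
```

Each edge is visited exactly once, so the running time is linear in the size of the graph.
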
2.3 One-Player Games with Cyclic Dependencies

We showed in Subsection 2.1 that our one-player game model is equivalent to a Markov decision process (MDP) without future reward discount. Value iteration can be used to approximate the optimal policy of an MDP. The convergence, which implies a solution, is guaranteed with a reward discount rate 0 ≤ γ < 1. The proof relies on γ being the contraction rate of the global error bound (see, e.g., Mitchell (1997), section 13). In our model there is no future reward discount (i.e., γ = 1). Therefore, if a game graph is cyclic, a solution may not exist.

A well-designed game is expected to have a unique solution. Here we show that the existence and uniqueness of a solution is guaranteed for all one-player jeopardy stochastic games, including one-player Can't Stop and Pig. In the graph representation of such a game, each strongly connected component has a critical position, called an anchor. When a dice roll results in no legal move, the game goes back to the anchor, causing a cycle.

We begin with a condition under which a solution exists and is unique in Lemma 1. The proof uses the standard fixed point theorem (see, e.g., Rosenlicht (1968), chapter 8). The utilization of fixed point theorems in game theory is not new. The most well-known success is the proof of the Nash equilibrium (Nash, 1951), the seminal work leading to the Nobel prize in economic sciences in 1994.

Theorem 1 (Fixed Point Theorem) If a continuous function f : R → R satisfies f(x) ∈ [a, b] for all x ∈ [a, b], then f has a fixed point in [a, b] (i.e., f(c) = c for some c ∈ [a, b]).

Lemma 1 A cyclic game graph G = (U, V, E) has a solution if:
1. For all u ∈ U, g(u) ≥ 0.
2. For each non-terminal vertex u, there is a path from u to a terminal vertex.
3. There exists some w ∈ U such that the graph is acyclic after removing the outgoing edges of w.
In addition, if the vertex w in condition 3 satisfies g(w) > 0, then the solution is unique with all position values non-negative.

Proof Let Ĝ = (U, V, Ê) be the graph obtained by removing all of the outgoing edges of w in G (i.e., Ê = {(u, v) : (u ≠ w) ∧ ((u, v) ∈ E)}). By condition 3, Ĝ is acyclic. All the terminal vertices other than w in G are also terminal in Ĝ. Let x be the estimated position value of w. We can solve Ĝ by Algorithm 1. However, we propagate in terms of x (i.e., treat x as a variable during propagation), though we know the value of x. For example, assuming x = 6, we write min{½x, ⅓x + 2} = ½x instead of 3. We use f̂(x, y) to denote the position value of y ∈ U ∪ V in Ĝ in terms of x. At the end of Algorithm 1, we compute f̂(x, w) with the edges in E − Ê by (1). The values of f̂(x, y) for all y ∈ U ∪ V constitute a solution to G if and only if f̂(x, w) equals x in value. The main theme of this proof is to discuss the existence and uniqueness of x* satisfying f̂(x*, w) = x*.

Iteratively applying (1) and (2), all f̂(x, y) for y ∈ U ∪ V are of the form ax + b, where 0 ≤ a ≤ 1. We are particularly concerned with f̂(x, w). Let f̂(x, w) = a(x)x + b(x), where a(x) and b(x) are real functions of x. By (1) and (2), it is not hard to see that a(x) is non-increasing, b(x) is non-decreasing, and both a(x) and b(x) are piecewise constant. Hence f̂(x, w) is piecewise linear, continuous and non-decreasing in terms of x. By condition 1, f̂(0, w) = b(0) ≥ g(w) ≥ 0. By condition 2, a(x) < 1 for x large enough. Since a(x) is non-increasing and a(x) < 1 for x large enough, f̂(x, w) < x for x large enough. By Theorem 1, there exists x* ≥ 0 such that f̂(x*, w) = x*.

Now we discuss the uniqueness of the solution. By condition 1, f̂(0, w) ≥ g(w). Assuming g(w) > 0, we have f̂(0, w) > 0. Moreover, f̂(x, w) = a(x)x + b(x) is piecewise linear and continuous with 0 ≤ a(x) ≤ 1 for x ∈ R. Therefore, x ≤ 0 implies f̂(x, w) > x. Hence the smallest solution x* to f̂(x, w) = x is greater than zero. Since f̂(0, w) > 0 and a(x) is non-increasing, a(x*) < 1 and therefore f̂(x, w) < x for x > x*. We conclude that the additional condition g(w) > 0 guarantees that the solution x* > 0 to f̂(x, w) = x is unique. With the position value x* of the anchor w, we repeatedly apply (1) and (2) and obtain all position values, which are non-negative. The solution to the game graph G is unique since x* is unique. □

We illustrate an example with no solution. Consider the game graph in Figure 2 and assume g(u2) = 0. It satisfies all the conditions in Lemma 1 except that g(u1) = −1. Let f(u1) = x. By propagation, f(v1) = f(u2) = f(v2) = min{0, x}. Since f(u1) = f(v1) − 1, we obtain x = min{0, x} − 1, which has no solution. The interpretation is that the player would stay in the loop forever to gain unlimited reward (negative-valued penalty).

Figure 2: The game graph (a) has no solution if g(u2) = 0; (b) has infinitely many solutions if g(u2) = 1.

It is easy to construct a game graph G = (U, V, E) with infinitely many solutions; for example, a ring with Σ_{u∈U} g(u) = 0. Another example is the game graph in Figure 2 with g(u2) = 1. Let f(u1) = x. By propagation, we are to solve x = min{0, x}, which has infinitely many solutions x ≤ 0. In such games the game-theoretic values of positions cannot be uniquely fixed due to cyclic dependencies.

Consider the strongly connected components of the game graph of one-player Can't Stop. Each strongly connected component consists of all the positions with a certain placement of the squares and various placements of the at most three markers. The roll position with no marker is the anchor of the component. When left without a legal move, the game goes back to the anchor, which results in a cycle. The outgoing edges of each non-terminal component lead to the anchors in the supporting components. The terminal components are those in which the player has won three columns. Each terminal component has only one vertex, with position value 0. Note that all one-player jeopardy stochastic games have the same structure in their graph representations. All strongly connected components satisfy the precondition of Lemma 1.

Theorem 2 All one-player games in the family of jeopardy stochastic games have unique solutions.

Proof The proof is by finite induction. We split the game graph of a one-player jeopardy stochastic game into strongly connected components, and consider the components in bottom-up order. Given a non-terminal component whose supporting components' anchors have position values non-negative and uniquely determined, we consider the subgraph induced by the component and the anchors in its supporting components. This subgraph satisfies the precondition in Lemma 1, where the terminal positions are the anchors in the supporting components. Therefore, it has a unique solution with all position values non-negative. By induction, a unique solution is guaranteed for a one-player jeopardy stochastic game. □

3. TWO-PLAYER STOCHASTIC GAMES

3.1 Problem Formulation

A two-player stochastic game with perfect information and a finite number of positions can be represented as a directed, four-partite game graph G = (U, V, Ū, V̄, E), where E ⊆ (U × V) ∪ (Ū × V̄) ∪ ((V ∪ V̄) × (U ∪ Ū)). The graph may be cyclic.


A position is a vertex w ∈ U ∪ Ū ∪ V ∪ V̄. Positions in U ∪ V represent positions where it is player one's turn; positions in Ū ∪ V̄ indicate player two's turn. A roll position is a vertex u ∈ U ∪ Ū. A move position is a vertex v ∈ V ∪ V̄.

For each non-terminal roll position u, a dice roll is a random event. The weight 0 < p((u, v)) ≤ 1 indicates the probability that the game in roll position u will change into move position v. So,

Σ_{v:(u,v)∈E} p((u, v)) = 1.

A move (v, u) (from a move position to a roll position) is a deterministic choice made by a player. As in the one-player model, a turn consists of a sequence of partial turns, each of which is a dice roll followed by a move. A partial turn (u1, u2) (from roll position u1 to roll position u2) consists of a random event followed by a move. It is represented by a pair of edges ((u1, v), (v, u2)) in G. In general, a turn consists of a sequence of partial turns (u0, u1), (u1, u2), …, (uk−1, uk).

We associate each position with a real number representing the expected game score that the first player achieves with optimal play, denoted by a function f : U ∪ Ū ∪ V ∪ V̄ → R. The value f(u) indicates the probability that the first player will win the game. For zero-sum games such as two-player Can't Stop and Pig, the probability that the second player wins is 1 − f(u). For any terminal position z ∈ U ∪ Ū, f(z) = 1 if the first player wins, and f(z) = 0 if the first player loses. For two-player Yahtzee, a game may end in a draw; in this case we can set f(z) = 0.5.

The function f satisfies, for all non-terminal roll positions u ∈ U ∪ Ū,

f(u) = Σ_{v:(u,v)∈E} p((u, v)) f(v).    (5)

In optimal play the first player maximizes f and the second player minimizes f. Therefore, for all non-terminal move positions v ∈ V ∪ V̄,

f(v) = max{f(u) : (v, u) ∈ E} if v ∈ V, and f(v) = min{f(u) : (v, u) ∈ E} if v ∈ V̄.    (6)

For all positions w ∈ U ∪ Ū ∪ V ∪ V̄, f(w) is also called the position value of w. A position value function f satisfying both conditions (5) and (6) is called a solution. A game is solved if a solution is obtained.


Figure 3: An example of a two-player game graph G = (U, V, Ū, V̄, E).

We illustrate an example with Figure 3. We have U = {u1, u2}, V = {v1, v2}, Ū = {ū1, ū2}, and V̄ = {v̄1, v̄2}. The two terminal vertices are u2 and ū2, with f(u2) = 1 and f(ū2) = 0, respectively. This example simulates the last stage of a two-player jeopardy stochastic game. At position u1, the first player has a 50% chance of winning the game immediately, and a 50% chance of being unable to advance and therefore making no progress this turn. The second player is in the same situation at position ū1. By (5) and (6),

f(u1) = ½ f(v1) + ½ f(v2),    f(v1) = f(ū1),    f(v2) = f(u2) = 1,
f(ū1) = ½ f(v̄1) + ½ f(v̄2),    f(v̄1) = f(u1),    f(v̄2) = f(ū2) = 0.

The unique solution is f(u1) = f(v̄1) = 2/3 and f(ū1) = f(v1) = 1/3.

3.2 Two-Player Games without Cycles

Algorithm 1 can be modified easily to work for two-player game graphs. There are a few substantive changes: in the initialization phase we set f(u) = 0 for roll positions u, because there is no longer a step penalty function g; and in the propagation phase we maximize if v ∈ V and minimize if v ∈ V̄ when examining an edge (v, u). This necessitates another change in the initialization phase: we set f(v) to −∞ if v ∈ V and to ∞ if v ∈ V̄. The result is Algorithm 2.

Algorithm 2 A retrograde algorithm to solve a two-player stochastic game G = (U, V, Ū, V̄, E) without cycles.
Require: G = (U, V, Ū, V̄, E) is acyclic.
Ensure: Program terminates with (5) and (6) satisfied.
  ∀u ∈ U ∪ Ū, f(u) ← 0.                            ⊲ Initialization Phase
  ∀v ∈ V, f(v) ← −∞; ∀v ∈ V̄, f(v) ← ∞.
  S1 ← {terminal positions in U ∪ Ū}
  S2 ← {terminal positions in V ∪ V̄}
  ∀u ∈ S1 ∪ S2, set f(u) to be its value.
  repeat                                            ⊲ Propagation Phase
    for all u ∈ S1 and (v, u) ∈ E do
      if v ∈ V then
        f(v) ← max{f(v), f(u)}
      else                                          ⊲ v ∈ V̄
        f(v) ← min{f(v), f(u)}
      end if
      if all children of v are determined then      ⊲ (*)
        S2 ← S2 ∪ {v}
      end if
    end for
    S1 ← ∅
    for all v ∈ S2 and (u, v) ∈ E do
      f(u) ← f(u) + p((u, v)) f(v)
      if all children of u are determined then      ⊲ (**)
        S1 ← S1 ∪ {u}
      end if
    end for
    S2 ← ∅
  until S1 ∪ S2 = ∅

Algorithm 2 is indeed a retrograde process in bottom-up order. It solves a given two-player game graph without cycles within a finite number of iterations. See Glenn, Fang, and Kruskal (2007c) for further discussion. Note that an edge (u, v) can be visited as many times as the out-degree of u because of (*) and (**). With the children counting strategy described in Subsection 2.2, we need to visit each edge only once. As a result, the algorithm is linear.
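
A Python sketch of Algorithm 2 with children counting follows; as before, the graph encoding and the owner tags are our own illustrative assumptions.

```python
import math

def solve_two_player_acyclic(succ, p, owner, terminals):
    """Retrograde pass for an acyclic two-player game graph.
    owner[w]: 'U'/'Ubar' for roll positions, 'V'/'Vbar' for move positions.
    p[(u, v)]: dice-roll probabilities; terminals: vertex -> value in [0, 1]."""
    start = {'U': 0.0, 'Ubar': 0.0, 'V': -math.inf, 'Vbar': math.inf}
    f = {w: start[owner[w]] for w in succ}
    f.update(terminals)
    remaining = {w: len(succ[w]) for w in succ}
    pred = {w: [] for w in succ}
    for w, children in succ.items():
        for c in children:
            pred[c].append(w)
    frontier = list(terminals)
    while frontier:
        w = frontier.pop()
        for q in pred[w]:
            if owner[q] == 'V':                 # player one maximizes
                f[q] = max(f[q], f[w])
            elif owner[q] == 'Vbar':            # player two minimizes
                f[q] = min(f[q], f[w])
            else:                               # roll position: expectation (5)
                f[q] += p[(q, w)] * f[w]
            remaining[q] -= 1
            if remaining[q] == 0:
                frontier.append(q)
    return f
```
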

3.3 Two-Player Games with Cyclic Dependencies

Value iteration can be applied to solving Markov games (Littman, 1994), where the reward discount factor guarantees the convergence, which implies a solution. However, in our two-player model there is no future reward discount. Therefore, a solution is not guaranteed if a cycle is present. In the case of the dice game Pig, value iteration happens to converge (Neller and Presser, 2004, page 29), so a solution exists.

A well-designed game is expected to have a unique solution. Here we prove the existence and uniqueness of a solution for all two-player jeopardy stochastic games. In the graph representation of such a game, each strongly connected component has two critical positions, called anchors. When a dice roll results in no legal move, the game goes back to one of the two anchors, leading to a cycle. We begin with a condition under which a solution for a two-player cyclic game exists and is unique in Lemma 2. The proof uses the Fixed Point Theorem again.

Lemma 2 Given a cyclic two-player game graph G = (U, V, Ū, V̄, E), we use G1 and G2 to denote the two subgraphs of G induced by U ∪ V and Ū ∪ V̄, respectively. G has a solution with all position values in [0, 1] if:
1. The graphs G1 and G2 are acyclic.
2. There exist w1 ∈ U and w2 ∈ Ū such that all edges from G1 to G2 end at w2, and all edges from G2 to G1 end at w1. In other words, E ∩ (V × Ū) ⊆ V × {w2} and E ∩ (V̄ × U) ⊆ V̄ × {w1}.
3. All the terminal position values are in [0, 1].
Denote by Ĝ1 the subgraph of G induced by U ∪ {w2} and V. The solution with all position values in [0, 1] is unique if the following additional conditions hold:
4. For any u ∈ U, there is a path in Ĝ1 from u to a terminal vertex other than w2.
5. For any (v, w2) ∈ E, the out-degree of v is 1. That is, v is a terminal vertex in G1.

Proof Let Ĝ1 = (U ∪ {w2}, V, E1) and Ĝ2 = (Ū ∪ {w1}, V̄, E2) be the induced bipartite subgraphs of G. By condition 1, Ĝ1 and Ĝ2 are acyclic. Consider Ĝ1. All the terminal vertices in Ĝ1 other than w2 are also terminal in G. If we know the position value of w2, then the solution to Ĝ1 can be uniquely determined, since Ĝ1 is acyclic. Given the estimated position value y of w2, we can determine the position values of Ĝ1 by Algorithm 1. Denote by f̂1(y, w) the position value of w ∈ U ∪ V, which depends on y. Likewise, given x as the estimated position value of w1, we denote by f̂2(x, w̄) the position value of w̄ ∈ Ū ∪ V̄. The values of f̂1(y, w) for w ∈ U ∪ V and f̂2(x, w̄) for w̄ ∈ Ū ∪ V̄ constitute a solution to G if and only if f̂1(y, w1) = x and f̂2(x, w2) = y. The main theme of this proof is to discuss the existence and uniqueness of x*, y* ∈ [0, 1] satisfying f̂1(y*, w1) = x* and f̂2(x*, w2) = y*, or equivalently f̂2(f̂1(y*, w1), w2) = y*.

Condition 3 states that all terminal position values are in [0, 1]. Iteratively applying (5) and (6), f̂2(f̂1(0, w1), w2) ≥ 0 and f̂2(f̂1(1, w1), w2) ≤ 1. Note that f̂2(f̂1(y, w1), w2) is a continuous function of y. By Theorem 1, there exists y* ∈ [0, 1] such that f̂2(f̂1(y*, w1), w2) = y*. Iteratively applying (5) and (6) again, the position values of w ∈ U ∪ V, f̂1(y*, w), are all in [0, 1]. Likewise, the position values of w̄ ∈ Ū ∪ V̄, f̂2(x*, w̄), are also all in [0, 1], where x* = f̂1(y*, w1).

Now we investigate the uniqueness of the solution. Consider Ĝ1, whose solution can be obtained by propagation that depends on y, the position value of w2. For convenience of discussion, we propagate in terms of y (i.e., treat y as a variable during the propagation), even though we know the value of y. For example, assuming y = ½, we write max{⅔y, ¼y + ⅙} = ⅔y instead of ⅓. Iteratively applying (5) and (6), all propagated values of u ∈ U ∪ V are of the form ay + b, which represents the local function values of f̂1(y, u). Condition 3 gives us a bound on a and b.
By finite induction, a and b are non-negative and a + b ≤ 1 for y ∈ [0, 1]. (It is still true that 0 ≤ a ≤ 1 for y ∈ R.) We are particularly concerned with f̂1(y, w1). The analysis above shows that f̂1(y, w1) in terms of y is piecewise linear, continuous and non-decreasing, with the slope of each line segment in [0, 1], and so is f̂2(x, w2) in terms of x by a similar discussion. These properties are inherited by the composite function f̂2(f̂1(y, w1), w2).

Now we claim that the additional conditions 4 and 5 guarantee that the slope of each line segment of f̂1(y, u) is less than 1 for u ∈ U. We prove our claim by induction. Note that Ĝ1 is bipartite and acyclic. We consider the positions level by level in bottom-up order as in Algorithm 2, but here we propagate in terms of y. Our claim is clearly true for terminal positions, which are independent of y, so the slope is zero. Given v ∈ V, if (v, w2) ∈ E, we get f̂1(y, v) = y by condition 5. Otherwise, f̂1(y, v) has slope less than 1 by (6) and the assumption that f̂1(y, u) has slope less than 1 for u ∈ U. Given u ∈ U, conditions 4 and 5 imply that it has a child v such that (v, w2) ∉ E. By (5), f̂1(y, u) has slope less than 1. The proof of our claim is completed by finite induction.


Since w1 ∈ U, our claim implies that each line segment of f̂1(y, w1) has slope less than 1 if the additional conditions 4 and 5 hold. Therefore, the slope of each line segment of the continuous function f̂2(f̂1(y, w1), w2) is also less than 1. This guarantees the uniqueness of the solution in [0, 1] to f̂2(f̂1(y, w1), w2) = y. □

We illustrate an example with infinitely many solutions in Figure 4, where U = {u1, u2}, V = {v1}, Ū = {ū1, ū2}, and V̄ = {v̄1}. This graph satisfies conditions 1 to 4 in Lemma 2. However, condition 5 does not hold. Denote f(u2) = a, f(ū2) = b and f(ū1) = y. Iteratively applying (5) and (6), we are to solve min{max{a, y}, b} = y for y. If a < b, it has infinitely many solutions f(u1) = f(v1) = f(ū1) = f(v̄1) = y for a < y < b. On the other hand, if a ≥ b, the solution, f(u1) = f(v1) = a and f(ū1) = f(v̄1) = b, is unique.


Figure 4: An example of a game graph G = (U, V, Ū, V̄, E) with infinitely many solutions for a < b.

Consider the strongly connected components of the game graph of two-player Can't Stop. Each strongly connected component consists of all the positions with a certain placement of the squares and various placements of the at most three markers for the player on the move. The two roll positions with no marker are the anchors of the component. In one of them it is the turn of the first player, whereas in the other it is the second player to move. When left without a legal move, the game goes back to one of the two anchors, which results in a cycle. The outgoing edges of each non-terminal component lead to the anchors in the supporting components. The terminal components are those in which some player has won the game (by completing three columns). Each terminal component has only one vertex, with position value 1 (if the first player wins) or 0 (if the second player wins). Note that all two-player jeopardy stochastic games have the same structure in their graph representations. All strongly connected components satisfy the precondition of Lemma 2.

Theorem 3 All two-player games in the family of jeopardy stochastic games have unique solutions, with all position values in [0, 1].

Proof The proof is by finite induction. We split the game graph of a two-player jeopardy stochastic game into strongly connected components, and consider the components in bottom-up order. Given a non-terminal component whose supporting components' anchors have position values uniquely determined in [0, 1], we consider the subgraph induced by the component and the anchors in its supporting components. This subgraph satisfies the precondition in Lemma 2, where the terminal positions are the anchors in the supporting components. Therefore, it has a unique solution with all position values in [0, 1]. By induction, a unique solution is guaranteed for a two-player jeopardy stochastic game. □

4. RETROGRADE APPROXIMATION ALGORITHMS

4.1 Value Iteration

If we apply Algorithms 1 and 2 to one- and two-player stochastic games with cyclic dependencies, then the positions in the cycles cannot be determined. Value iteration (Bellman, 1957; Sutton and Barto, 1998) can be used to approximate the solutions of cyclic games. For example, it has been successfully applied to solving two-player Pig (Neller and Presser, 2004). A high-level description of value iteration for stochastic games is given in Algorithm 3.

Algorithm 3 Value iteration for one- and two-player stochastic games with cyclic dependencies.
  {Given a stochastic game graph with the set of positions denoted by V, and tolerance ε > 0.}
  For all terminal positions z ∈ V, initialize f(z) to its value.
  repeat
    δ ← 0
    for all v ∈ V do                                                          ⊲ (*)
      t ← f(v)
      Update f(v) with the current estimated position values of its children by (1), (2), (5), or (6).
      δ ← max{δ, |t − f(v)|}
    end for
  until δ < ε
Two schemes to improve efficiency are described as follows. First, instead of applying Algorithm 3 to the whole game graph, we divide it into strongly connected components and solve the components in bottom-up order. This is a general scheme not limited to value iteration. Second, we can visit the vertices in the for loop (*) in an order that respects their dependencies. This scheme is incorporated in the naive algorithms in Glenn et al. (2007b) and Glenn et al. (2007c) for Can't Stop.

Value iteration is known for its slow linear convergence (see, e.g., Mitchell (1997), section 13). It is possible to improve the convergence by taking advantage of the particular structure of a game graph, if any, in algorithm design. We call a subset of vertices in a cyclic graph anchors if removing them results in an acyclic subgraph. In the following subsection we present our retrograde approximation algorithms, obtained by incorporating Newton's method into retrograde analysis. They are particularly useful for stochastic games with only a few anchors in each strongly connected component.

4.2 Efficient Retrograde Approximation Algorithms

In some games, such as those in the family of jeopardy stochastic games, the number of anchors in each strongly connected component of the game graph is small. An observation makes it possible to design efficient algorithms to compute the solutions. The proofs of Lemma 1 and Lemma 2 reveal that we can propagate in terms of the position value x of the anchor (i.e., treat x as a variable), though we know the value of x. For example, assuming x = 6, we write min{½x, ⅓x + 2} = ½x instead of 3. Then not only the estimated position values but also the function slopes (derivatives) are passed during propagation. We equate the linear function propagated back to the anchor with the variable x and solve for the estimated position value of the anchor in the next iteration.

The method just described is equivalent to solving f̂(x, w) = x (one-player game) or f̂1(y, w1) = x and f̂2(x, w2) = y (two-player game) by Newton's method, which is well known for its fast convergence (see, e.g., Dennis and Schnabel (1996), chapter 5). In our case, the functions f̂(x, w) (one-player game), and f̂1(y, w1) and f̂2(x, w2) (two-player game), are piecewise linear. Hence Newton's method can reach the solution in a finite number of iterations. In practice, however, rounding errors may create minor inaccuracy. The pseudocode is given in Algorithm 4 for one-player games and in Algorithm 5 for two-player games.

Algorithm 4 An efficient algorithm to solve a one-player stochastic game graph with one anchor.
Require: G = (U, V, E) satisfies the precondition in Lemma 1 with an anchor w ∈ U.
Ensure: f̂ converges to the solution to G at the rate of Newton's method.       ⊲ Subsection 4.3
  Let x denote the estimate for the position value of anchor w.                ⊲ Estimation Phase
  Obtain the acyclic graph Ĝ = (U, V, Ê) by removing the outgoing edges of w.
  repeat                                                                       ⊲ Approximation Phase
    Solve Ĝ (in terms of x) with the current estimate x for w by Algorithm 1.
    Compute f̂(x, w) with E − Ê in terms of x by (1). Denote the result by ax + b.
    x ← b/(1 − a).                       ⊲ (The solution to ax + b = x is x = b/(1 − a).)
  until f̂(x, w) = x in value.

An example of applying Algorithm 4 is illustrated with the game graph in Figure 1 as follows. We treat u1 as w in Lemma 1, and let x be the initial estimate of the position value of u1. Then f̂(x, v2) = x and f̂(x, v1) = f̂(x, u2) = ½x, and hence f̂(x, u1) = ½x + 1. Solving ½x + 1 = x, we obtain x = 2, which is the exact position value of u1.
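
The propagation "in terms of x" amounts to carrying coefficient pairs (a, b), representing a·x + b, through the acyclic part and then solving a·x + b = x once per iteration. A sketch of ours on the Figure 1 graph:

```python
# A linear form a*x + b is the pair (a, b); x is the anchor estimate.
def add(p, q):   return (p[0] + q[0], p[1] + q[1])
def scale(c, p): return (c * p[0], c * p[1])

# Figure 1 with anchor w = u1: cutting u1's outgoing edge leaves an
# acyclic graph, through which we propagate linear forms bottom-up.
f_v3 = (0.0, 0.0)                                  # f(v3) = f(u3) = 0
f_v2 = (1.0, 0.0)                                  # f(v2) = f(u1) = x
f_u2 = add(scale(0.5, f_v2), scale(0.5, f_v3))     # (1): expectation, g(u2) = 0
f_v1 = f_u2                                        # (2): the only move from v1
a, b = f_v1[0], f_v1[1] + 1.0                      # re-attach edge (u1, v1); g(u1) = 1
print(b / (1.0 - a))                               # Newton step: 2.0, exact here in one step
```

Where a move position has several children, the linear forms are compared (min or max) at the current numerical value of x, as in the proofs of Lemmas 1 and 2.
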

Algorithm 5 An efficient algorithm to solve a two-player stochastic game graph with two anchors.
Require: G = (U, V, Ū, V̄, E) satisfies the precondition in Lemma 2 with two anchors w1 ∈ U and w2 ∈ Ū.
Ensure: f̂1 and f̂2 converge to the solution to G at the rate of Newton's method.   ⊲ Subsection 4.3
  Denote the induced acyclic subgraphs Ĝ1 = (U ∪ {w2}, V, E1) and Ĝ2 = (Ū ∪ {w1}, V̄, E2) as in Lemma 2.
  Estimate the position values of anchors w1 ∈ U and w2 ∈ Ū, denoted by x and y.   ⊲ Estimation Phase
  repeat                                                                           ⊲ Approximation Phase
    Solve Ĝ1 in terms of the current estimate y for w2 by Algorithm 1.             ⊲ (*)
    Denote the linear segment of f̂1(y, w1) by a1 y + b1.
    Solve Ĝ2 in terms of the current estimate x for w1 by Algorithm 1.             ⊲ (**)
    Denote the linear segment of f̂2(x, w2) by a2 x + b2.
    Solve x = a1 y + b1 and y = a2 x + b2 for the next estimates x and y.
  until the values of x and y no longer change.

An example of applying Algorithm 5 is illustrated with the game graph in Figure 3 as follows. We treat u1 as w1 and ū1 as w2 in Lemma 2, and let x and y be the initial estimates of the position values of u1 and ū1, respectively. Then f̂1(y, u1) = ½y + ½ and f̂2(x, ū1) = ½x. Solving ½y + ½ = x and ½x = y, we obtain x = 2/3 and y = 1/3, which are the exact position values of u1 and ū1, respectively.

In both small examples shown above, we obtain the solution in one iteration. In practice, multiple iterations are expected to reach the solution. In the estimation phase of both Algorithms 4 and 5, the better the initial estimated position value(s) of the anchor(s), the fewer steps are needed to reach the solution. Careful initialization may improve the convergence.

For a two-player game graph satisfying the precondition in Lemma 2, the graphs Ĝ1 and Ĝ2 are disjoint except for w1 and w2, and the propagations in (*) and (**) in each iteration are independent of each other. Therefore, Algorithm 5 is natively parallel on two processors, by separating the computations (*) and (**).

What differentiates Algorithms 4 and 5 is the number of anchors, rather than the number of players. A more general model is that a game graph G, either one-player or two-player, has two anchors w1, w2 (i.e., removing the outgoing edges of w1 and w2 results in an acyclic subgraph) but the precondition in Lemma 2 is not guaranteed. In such games the incorporation of two-dimensional Newton's method is still possible. Let x and y be the current estimated position values of w1 and w2 at each iteration, respectively. The propagated values of w1 and w2 in terms of x and y (e.g., if x = ½ and y = ¾, we write min{½x + ½y, ⅔x + ⅓y} = ⅔x + ⅓y instead of 7/12) are denoted by f̂(x, y, w1) and f̂(x, y, w2). We solve the linear system f̂(x, y, w1) = x and f̂(x, y, w2) = y for x and y as the position values of w1 and w2 in the next iteration. Note, however, that the resulting algorithm is no longer natively parallel on two processors. The existence and uniqueness of the solution also need further investigation.

Our methods are not limited to game graphs with one or two anchors. Given a game graph with m anchors w1, …, wm (i.e., removing the outgoing edges of w1, …, wm results in an acyclic subgraph), we use variables x1, …, xm to denote the estimated position values of w1, …, wm, transform the problem into a system of m piecewise linear equations f̂i(x1, …, xm, wi) = xi for i = 1, …, m, and solve the system by incorporating m-dimensional Newton's method into retrograde analysis.
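
Concretely, in the two-anchor case each Newton step is a 2×2 linear solve; for the Figure 3 example above (our numbers from the propagated segments):

```python
# Segments from the two propagations: f1(y, u1) = 0.5*y + 0.5, f2(x, u1bar) = 0.5*x.
a1, b1 = 0.5, 0.5
a2, b2 = 0.5, 0.0
x = (a1 * b2 + b1) / (1.0 - a1 * a2)   # solve x = a1*y + b1, y = a2*x + b2
y = a2 * x + b2
print(x, y)                            # 0.666..., 0.333...: the exact anchor values
```
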
An n-player, finite, stochastic game with perfect information can be represented as a 2n-partite graph. In this direction we have further developed a retrograde approximation algorithm for multi-player stochastic games, and experimented on small versions of three- and four-player Can't Stop. See Glenn, Fang, and Kruskal (2008) for details.

4.3 Global Convergence Analysis

Our retrograde approximation algorithms presented in Subsection 4.2 perform essentially Newton iterations to find a solution to a given game graph. Newton's method is known for its rapid local convergence. When it is applied to solving a system of continuously differentiable nonlinear equations, it converges quadratically when the starting point is close enough to a solution (Dennis and Schnabel, 1996, chapter 5). We now turn to the global convergence properties of our algorithms. We show that, under modest assumptions, Algorithms 4 and 5 achieve global convergence for one- and two-player jeopardy stochastic games, respectively. Our analysis is based on the Newton-Baluev Theorem (Ortega, 1990, section 8).


We begin with the definition of convexity. A set Ω ⊆ Rⁿ is convex if x, y ∈ Ω implies tx + (1 − t)y ∈ Ω for t ∈ [0, 1]. A function f : Rⁿ → Rᵐ is convex on a convex set Ω if

f(tx + (1 − t)y) ≤ t f(x) + (1 − t) f(y)    (7)

for x, y ∈ Ω and t ∈ [0, 1], where the inequality is defined element-wise. A function f is concave if −f is convex. Note that we have used bold-face lower-case characters to denote vectors and vector functions, and subscript i to denote the i-th component of a vector. We will also use bold-face upper-case characters to denote matrices.

Lemma 3 Given a differentiable function f : Rⁿ → Rᵐ which is convex on a convex set Ω ⊆ Rⁿ. Then

f(y) − f(x) ≥ J(x)(y − x)    (8)

for all x, y ∈ Ω, where J(x) is the Jacobian matrix of f at x.

Proof By (7), f(y) − f(x) ≥ (1/(1 − t)) (f(x + (1 − t)(y − x)) − f(x)). The limit of the right-hand side is J(x)(y − x) as t → 1, which completes the proof. □

Theorem 4 (Newton-Baluev Theorem) Suppose we are given a continuously differentiable function f : Rⁿ → Rⁿ which is convex on Rⁿ, there exists a non-singular matrix C ∈ Rⁿˣⁿ such that CJ(x)⁻¹ ≥ 0 (i.e., all elements are non-negative) for all x ∈ Rⁿ, and f(x) = 0 has a solution x* ∈ Rⁿ. Then the Newton iterations

x_{k+1} = x_k − J(x_k)⁻¹ f(x_k) for k = 0, 1, …    (9)

converge to the solution x* of f(x) = 0, and the solution is unique. In addition, Cx₁ ≥ Cx₂ ≥ ··· ≥ Cx*.

Proof By Lemma 3 and (9), f(x_{k+1}) − f(x_k) ≥ J(x_k)(x_{k+1} − x_k) = −f(x_k), and therefore f(x_{k+1}) ≥ 0 for k = 0, 1, …. By Lemma 3 again, −f(x_k) = f(x*) − f(x_k) ≥ J(x_k)(x* − x_k). Since CJ(x_k)⁻¹ ≥ 0, we can multiply both sides of the inequality by CJ(x_k)⁻¹ and obtain 0 ≥ −CJ(x_k)⁻¹ f(x_k) ≥ C(x* − x_k), which implies Cx_k ≥ Cx* for k = 1, 2, …. By (9) again, Cx_{k+1} = Cx_k − CJ(x_k)⁻¹ f(x_k) ≤ Cx_k for k = 1, 2, …. We conclude that Cx₁ ≥ Cx₂ ≥ ··· ≥ Cx*.

Since the sequence Cx₁, Cx₂, … is decreasing with lower bound Cx*, it has a limit, denoted by z. Then C⁻¹z is the converged value of x₁, x₂, …, which by (9) is a solution to f(x) = 0. Suppose we have two solutions x* and y* to f(x) = 0. By Lemma 3, 0 = f(y*) − f(x*) ≥ J(x*)(y* − x*). Multiplying both sides by CJ(x*)⁻¹ gives Cx* ≥ Cy*. Swapping the roles of x* and y* and applying Lemma 3 again, we obtain Cy* ≥ Cx*. As a result, Cx* = Cy*, and since C is non-singular, x* = y*, so the solution is unique. □

Note that Theorem 4 as presented above is slightly different from the Newton-Baluev theorem in Ortega (1990), section 8, where there is no constant matrix C ∈ Rⁿˣⁿ and it is a precondition that J(x)⁻¹ ≥ 0 entrywise. We slightly generalized the theorem by adding C ∈ Rⁿˣⁿ. This generalization is required to show the global convergence of Algorithm 5 for two-player jeopardy stochastic games.

Consider Algorithm 4 and Lemma 1 for one-player jeopardy stochastic games. Algorithm 4 solves the piecewise linear equation f̂(x, w) = x for x by Newton's method. From the proof of Lemma 1, f̂(x, w) is continuous and piecewise linear of the form a(x)x + b(x), where a(x) and b(x) are piecewise constant and 0 ≤ a(x) ≤ 1. Since a(x) is non-increasing, f̂(x, w) is concave. To apply Theorem 4, we set f(x) to x − f̂(x, w), which is convex; then J(x) is 1 − a(x) ≥ 0, so we do not need C (i.e., we set C = 1). This problem meets the precondition of Theorem 4 except that a(x) could equal 1, which makes the Newton step undefined, and that x − f̂(x, w) is piecewise linear rather than continuously differentiable. The latter issue can be addressed by replacing the sharp points with smooth approximations. For the former issue, the proof of Lemma 1 shows that a(x) < 1 if x is large enough, a(x) is non-increasing, and a(x*) < 1. The proof of Theorem 4 shows that all the Newton iterates except the initial estimate lie on the right-hand side of the solution x*. Therefore, if we start with a large enough estimate, then Algorithm 4 converges to the solution.

Consider Algorithm 5 and Lemma 2 for two-player jeopardy stochastic games. Algorithm 5 solves the system f̂1(y, w1) = x and f̂2(x, w2) = y for x and y by Newton's method. From the proof of Lemma 2, both f̂1(y, w1) and f̂2(x, w2) are continuous and piecewise linear, as a1(y)y + b1(y) and a2(x)x + b2(x) respectively, where a1(y), b1(y), a2(x), b2(x) are piecewise constant and 0 ≤ a1(y) < 1, 0 ≤ a2(x) ≤ 1. In addition, since


a1(y) is non-decreasing and a2(x) is non-increasing, f̂1(y, w1) is convex and f̂2(x, w2) is concave. To apply Theorem 4, we set x to (x, y) and f(x) to (f̂1(y, w1) − x, y − f̂2(x, w2)), which is convex. Then the Jacobian matrix is

J(x) = \begin{pmatrix} -1 & a_1(y) \\ -a_2(x) & 1 \end{pmatrix}, and its inverse is J(x)^{-1} = \frac{1}{1 - a_1(y) a_2(x)} \begin{pmatrix} -1 & a_1(y) \\ -a_2(x) & 1 \end{pmatrix}.

Indeed, a1(y) has an upper bound smaller than 1, determined by the game graph. We denote this bound by δ, and choose

C = \begin{pmatrix} -1 & 1 - \epsilon \\ -1 & 1 \end{pmatrix},

where 0 < ε ≤ 1 − δ is a small perturbation. As a result, C is non-singular and CJ(x)⁻¹ ≥ 0. Therefore, this problem satisfies the precondition of Theorem 4, except that f̂1(y, w1) − x and y − f̂2(x, w2) are piecewise linear rather than continuously differentiable. This issue can be addressed by replacing the sharp edges with smooth approximations. We conclude that Algorithm 5 converges for two-player jeopardy stochastic games. By Theorem 4, the converged solution is unique, which can be regarded as another proof of the uniqueness of the solution.
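
The choice of C can be sanity-checked numerically (a quick check of ours, not from the paper): for slopes a1 ≤ δ < 1 and a2 ≤ 1, the product CJ(x)⁻¹ is entrywise non-negative.

```python
import numpy as np

delta, eps = 0.9, 0.05                 # assume a1 <= delta < 1 and 0 < eps <= 1 - delta
C = np.array([[-1.0, 1.0 - eps],
              [-1.0, 1.0]])
for a1 in np.linspace(0.0, delta, 10):
    for a2 in np.linspace(0.0, 1.0, 10):
        J = np.array([[-1.0, a1], [-a2, 1.0]])
        assert (C @ np.linalg.inv(J) >= -1e-12).all()   # entrywise non-negative
print("C J(x)^{-1} >= 0 on the sampled slope range")
```
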

5. INDEXING SCHEME

The experiments were on small versions of the game Can't Stop. The indexing schemes for one- and two-player Can't Stop are given in Subsections 5.1 and 5.2, respectively.

5.1 Indexing Scheme for One-Player Can't Stop

Consider one-player Can't Stop. Let xi denote the number of steps from the colored marker in column i to the end of the column. Each strongly connected component of the game graph consists of all the positions with some particular (x2, x3, …, x12), where 0 ≤ xi ≤ 2i − 1 for i = 2, 3, …, 7, 0 ≤ xi ≤ 27 − 2i for i = 7, 8, …, 12, and at most three of the xi are zero. (x′2, x′3, …, x′12) is a supporting component of (x2, x3, …, x12) if and only if x′i ≤ xi for i = 2, 3, …, 12 and (x′2, x′3, …, x′12) ≠ (x2, x3, …, x12). A terminal component has three zero columns, and contains only one position in the game graph G (three columns have been won; the game is over). Each position in a non-terminal component (x2, …, x12) is (x′2, …, x′12), where each x′i ≤ xi and x′i < xi for at most three i (these represent the positions of the neutral markers, showing the progress during a turn).

With a position (x′2, …, x′12) we associate an index using a mixed radix system. The digits of the index are the x′i, with x′2 chosen as the least significant digit (this is an arbitrary choice; in fact the digits may be given in whatever order is most convenient and efficient for implementation). The place values are determined by the lengths of the columns: v2 = 1 and vi = vi−1 · (li−1 + 1) for 2 < i ≤ 12, where li is the length of column i. The index is then

Σ_{c=2}^{12} x′_c · v_c.

Note that, given an index and all of the vi, it is simple to recover the x′i. Also, if position u within a component is reachable from position v without going through an anchor, then the index of u is strictly greater than the index of v. Because of this, when we execute Algorithm 1 within a component, we can simply iterate through the positions in order of decreasing index.

With this scheme, two positions within different components may have the same index. However, since the position values of the non-anchor positions in a component are used only when computing the position value for the anchor in that component, the databases for the anchor positions and the non-anchor positions are maintained separately, and so the duplicated indices are not a problem. In fact, we discard the database for the non-anchor positions and reconstruct them using the database for the anchor positions as necessary (as when simulating perfect play).

If storage is abundant and speed is important, it would also be possible to use a file named 'x2x3…x12.ijk' to store all the position values, where i < j < k are the columns in which the markers have advanced. The offset of a position with three markers x′i, x′j, x′k would be x′i + x′j·xi + x′k·xi·xj in the file, where x′i < xi, x′j < xj, and x′k < xk. The naming and indexing convention is the same for positions with two markers or fewer. The database for component (x2, x3, …, x12) consists of all files 'x2x3…x12.∗'. The largest component is (3, 5, 7, 9, 11, 13, 11, 9, 7, 5, 3), which contains the position at the beginning of the game.
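
The mixed radix encoding and its inverse can be sketched in a few lines of Python (dictionary-based names are ours, for illustration):

```python
# l[i] is the length of column i and xp[i] the digit x_i', for i = 2..12.
def place_values(l):
    v = {2: 1}
    for i in range(3, 13):
        v[i] = v[i - 1] * (l[i - 1] + 1)   # v_i = v_{i-1} * (l_{i-1} + 1)
    return v

def index_of(xp, v):
    return sum(xp[i] * v[i] for i in range(2, 13))

def position_of(idx, v):                   # recover the digits x_i' from an index
    xp = {}
    for i in range(12, 1, -1):             # most significant digit first
        xp[i], idx = divmod(idx, v[i])
    return xp
```
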


5.2 Indexing Scheme for Two-Player Can't Stop

We use two different indexing schemes for positions in two-player Can't Stop: one for anchors and another for non-anchors. Because we can discard the position values of non-anchors once we have computed the position value of their anchor, speed is more important than space when computing the values of non-anchors. Therefore, we use a mixed radix scheme like the one described in Subsection 5.1 for non-anchor positions. In this case, anchors can be described by (x2, …, x12, y2, …, y12, t), where the xi represent the positions of player one's markers, the yi represent the positions of player two's markers, and t is whose turn it is. Components are described in the same way, except that since a component includes two anchors that differ only in whose turn it is, t may be omitted when describing a component. A position within a component is (x′2, …, x′12, y′2, …, y′12, t), where, for all i, x′i ≤ xi and y′i = yi if t = 1, and x′i = xi and y′i ≤ yi if t = 2. The place values are vᵗ = 1, vˣ2 = 2, vˣi = vˣ_{i−1} · (l_{i−1} + 1), vʸ2 = vˣ12 · (l12 + 1), and vʸi = vʸ_{i−1} · (l_{i−1} + 1), and the index is

vᵗ · t + Σ_{c=2}^{12} (x′_c · vˣ_c + y′_c · vʸ_c).

Because the database for the anchors is kept, space is an important consideration when indexing anchors. In the variant used in our experiments, an anchor $(x_2, \ldots, x_{12}, y_2, \ldots, y_{12}, t)$ is illegal if $x_i = y_i \neq 0$ for some $i$ (the players' markers cannot occupy the same location within a column). With this restriction many indices map to illegal anchors; for example, index 14,863,564,802 corresponds to an anchor where both players have a marker in square 1 of column 2. Furthermore, once a column is closed, the locations of the markers in that column are irrelevant; only which player won matters. For example, a position $u$ with $y_2 = 3$ and $x_2 = 0$ also represents the positions with $x_2 \in \{1, 2\}$ and all other markers in the same places as $u$. If the database is stored in an array indexed using the mixed radix system as for non-anchors, then the array would be sparse: for the official game about 98% of the entries would be wasted on illegal and equivalent indices. In order to avoid wasting space in the array and to avoid the structural overhead needed for more advanced data structures, a different indexing scheme is used that results in fewer indices mapping to illegal, unreachable, or equivalent positions. We write each position as $((x_2, y_2), \ldots, (x_{12}, y_{12}), t)$. Associate with each pair $(x_i, y_i)$ an index $z_i$ corresponding to its position on a list of the legal pairs of locations in column $i$ (i.e., on a list of ordered pairs $(x, y)$ such that $x \neq y$ unless $x = y = 0$, and if $x = l_i$ then $y = 0$ and vice versa; for a column with $l_i = 3$ this list would be $(0,0), (0,1), (1,0), (2,0), (0,2), (2,1), (1,2), (3,0), (0,3)$). The $z_i$ and $t$ are then used as digits in a mixed radix system to obtain the index

$$t + \sum_{c=2}^{12} z_c \cdot 2 \prod_{d=2}^{c-1} \left( 3 + l_d (l_d - 1) \right),$$

where the term in the product is the number of legal, distinct pairs of locations in column d. The list of ordered pairs used to define the zi ’s can be constructed so that if component u is a supporting component of v then the indices of u’s anchors are greater than the indices of v’s and therefore we may iterate through the components in order of decreasing index to avoid counting children while computing the solution. There is still redundancy in this scheme: when multiple columns are closed, what is important is which columns have been closed and the total number of columns won by each player, but not which player has won which columns. Before executing Algorithm 1 on a component, we check whether an equivalent component has already been solved. We deal with symmetric positions in the same way.
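The count $3 + l_d(l_d - 1)$ can be checked by enumerating the legal pairs directly, as in the sketch below. The enumeration order shown is illustrative only; as noted above, the paper orders the list so that supporting components receive larger anchor indices.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the legal-pair enumeration for one column of length len.
// A pair (x, y) of marker locations is legal when x != y unless both are 0,
// and a marker at the top of the column (location len) forces the other
// player's marker in that column to 0.
final class LegalPairs {
    static List<int[]> enumerate(int len) {
        List<int[]> pairs = new ArrayList<>();
        for (int x = 0; x <= len; x++)
            for (int y = 0; y <= len; y++) {
                if (x == y && x != 0) continue;   // same square (both off the column is fine)
                if (x == len && y != 0) continue; // column closed by player one
                if (y == len && x != 0) continue; // column closed by player two
                pairs.add(new int[] {x, y});
            }
        return pairs;                             // size is 3 + len * (len - 1); 9 when len = 3
    }
}
```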

6. EXPERIMENTS

As proof of concept, we have solved simple versions of one-player and two-player Can't Stop. The one-player results are reported in Subsection 6.1 and the two-player results in Subsection 6.2. With the optimal solutions in hand, we can analyze playing strategies and heuristics for Can't Stop; a proposed heuristic analogous to the "Rule of 28" for one-player Can't Stop is examined in Subsection 6.3.


6.1 Experimental Results for One-Player Can't Stop

The simpler versions of Can't Stop use 3-, 4-, and 5-sided dice instead of 6-sided dice and may have shorter columns than the official version. Let (n, k, p) Can't Stop denote the p-player game played with four n-sided dice in which the shortest column is k spaces long. Columns 2 and 2n are the shortest columns and column n + 1 is the longest; adjacent columns always differ in length by 2 spaces. The official two-player version is then (6, 3, 2) Can't Stop.

For n = 2, 3, 4, 5 and k = 1, 2, 3 (except n = 5, k = 3) we have implemented Algorithm 4 in Java and solved (n, k, 1) Can't Stop. We used an initial estimate of 1.0 for the position value of each vertex. Table 1 shows, for each version of the game, the size of the game graph, the time it took to solve the game, and the average number of turns needed to win the game when using the optimal strategy. The size of the game graph is given as the number of anchor vertices (i.e., vertices representing the beginning of a turn with no neutral markers placed) and the total number of vertices in all of the anchor vertices' strongly connected components (which includes vertices representing the middle of a turn, when the neutral markers have been placed on the board). Symmetry allows our implementation to ignore about half of the anchor vertices, since the position represented by (x2, x3, ..., x12) is equivalent to (x12, x11, ..., x2).

Table 1: Results of solving simple versions of one-player Can't Stop.
(Iteration counts are per-component averages; value iteration was not run for (5, 2, 1).)

  (n, k, p)     Anchors    Total vertices      Time    Optimal turns   Newton   Val. Iter.
  (2, 1, 1)          15                89    0.166s        1.298         3.00       25.1
  (2, 2, 1)          44               539    0.405s        1.347         3.07       29.0
  (2, 3, 1)          95             2,099    0.601s        1.400         3.07       33.1
  (3, 1, 1)         308            14,624     1.70s        1.480         3.38       14.6
  (3, 2, 1)       1,432           143,307     5.05s        1.722         3.79       19.4
  (3, 3, 1)       4,378           808,835     23.3s        1.890         4.45       22.4
  (4, 1, 1)      12,913         3,953,861     4m50s        2.187         3.75       14.0
  (4, 2, 1)      83,456        47,185,664    58m50s        2.454         4.44       18.0
  (4, 3, 1)     333,069       318,511,854      6h7m        2.700         5.18       21.1
  (5, 1, 1)     921,174     1,243,394,781     2d20h        2.791         4.09       14.7
  (5, 2, 1)   7,676,416    17,175,823,808    59d15h        3.396         5.16         --
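For reference, the column lengths of (n, k, p) Can't Stop follow directly from the description above: the shortest columns, of k spaces, are at 2 and 2n, and adjacent columns differ by 2 spaces. The closed form below is our inference from those constraints, which determine the lengths uniquely.

```java
// Sketch deriving the column lengths of (n, k, p) Can't Stop from the
// description in the text. The formula is an inference, not quoted code.
final class Columns {
    static int length(int n, int k, int col) {          // col ranges over 2 .. 2n
        return k + 2 * Math.min(col - 2, 2 * n - col);
    }
    public static void main(String[] args) {
        for (int col = 2; col <= 12; col++)             // official (6, 3, p) game
            System.out.print(length(6, 3, col) + " ");  // prints 3 5 7 9 11 13 11 9 7 5 3
    }
}
```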

Note that for fixed n, the time to solve the game is roughly proportional to the number of vertices. When n increases there is an additional cost due to the increased number of outgoing edges from each vertex in U. For n = 3 each vertex has 15 neighbors (representing the 15 distinct outcomes of rolling four 3-sided dice); for n = 4 it has 35. The average position value also affects the running time: for larger values of k or n the average position value is higher, and higher position values generally require more iterations of Algorithm 4 to converge. The last two columns of Table 1 show, for both Newton's method and value iteration, the general increase in the number of iterations per component as k and n increase (the numbers reported are unweighted averages). The increase is not monotonic because the important factor is the position value of each component, not the position value of the start state alone. The effect of position value can be isolated by fixing k and n: Table 2 shows how the number of iterations to converge varies with position value for (5, 1, 1) Can't Stop. Value iteration takes far more iterations than Newton's method, and as a result each experiment took two to three times as long to complete with value iteration.
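The neighbor counts quoted above (15 rolls for n = 3, 35 for n = 4) are the numbers of size-four multisets drawn from n faces, $\binom{n+3}{4}$. A quick check:

```java
// Distinct outcomes of rolling four n-sided dice = C(n + 3, 4).
final class RollCount {
    static long distinctRolls(int n) {
        return (long) n * (n + 1) * (n + 2) * (n + 3) / 24;
    }
    public static void main(String[] args) {
        System.out.println(distinctRolls(3));   // 15
        System.out.println(distinctRolls(4));   // 35
    }
}
```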


Table 2: For (5, 1, 1) Can't Stop, the number of iterations required for ranges of position values.

  Position value      States    Mean iter. (Newton)   Mean iter. (Val. Iter.)
  1.00 - 1.25        353,354           3.35                    11.7
  1.25 - 1.50         49,353           4.69                    27.4
  1.50 - 1.75         18,629           6.95                    31.0
  1.75 - 2.00         19,285           8.21                    19.8
  2.00 - 2.25         16,708           8.30                    15.5
  2.25 - 2.50          4,190           8.92                    17.4
  2.50 - 2.75            383           9.36                    20.6
  2.75 - 3.00              2           9.00                    24.0
  Total              461,904           4.09                    14.7

6.2 Experimental Results for Two-Player Can't Stop

We have implemented Algorithm 5 in Java and solved (3, k, 2) Can't Stop for k = 1, 2, 3 and also (4, 1, 2) Can't Stop. We used an initial estimate of (1/2, 1/2) for the position values of the anchors within a component. Table 3 shows, for those four versions of the game, the size of the game graph, the time it took the algorithm to run, and the probability that the first player wins assuming that each player plays optimally. The listed totals exclude the components and positions not examined because of equivalence.

Table 3: Results of solving simple versions of two-player Can't Stop.
(Iteration counts are per-component averages; value iteration was not run for the two largest versions.)

  (n, k, p)    Components    Total positions     Time    P(P1 wins)   Newton   Val. Iter.
  (3, 1, 2)         6,324            634,756    4m33s      0.760        2.51       19.8
  (3, 2, 2)        83,964         20,834,282    3h45m      0.711        3.19       22.9
  (3, 3, 2)       930,756        453,310,692    3d13h      0.689        3.58         --
  (4, 1, 2)     5,835,432      4,022,674,944     100d      0.631        2.55         --

Note that the time to solve the game grows faster than the number of positions. This is because the running time also depends on the number of iterations per component, which is related to the quality of the initial estimate and the complexity of the game. Table 4 gives the average number of iterations for Newton's method versus the position value of the component, given as (x, y) where x (respectively y) is the probability that the first player wins given that the game has entered the component and it is the first (respectively second) player's turn. (The last two columns of Table 3 show that many more iterations are required by value iteration; the resulting time difference is a factor of three or four.) Note that Table 4 is upper triangular because there is never an advantage in losing one's turn, and symmetric because of symmetric positions within the game. Perhaps surprisingly, the components that require the most iterations are not those where the solution is farthest from the initial estimate of (1/2, 1/2). We conjecture that this is because positions where there is a large penalty for giving up one's turn require less strategy (the decision will usually be to keep rolling), and therefore $\hat{f}$ is less complicated (has fewer pieces), so Newton's method converges faster. There are no positions with 0 ≤ x, y < 0.2 because in a game with short columns there is always a large advantage in being the current player.

Table 4: Iterations required vs. position values for (4, 1, 2) Can't Stop.

                            x
    y          0.2-0.4   0.4-0.6   0.6-0.8   0.8-1.0
    0.0-0.2      2.63      2.69      2.34      2.04
    0.2-0.4                3.91      3.61      2.34
    0.4-0.6                          3.91      2.69
    0.6-0.8                                    2.63

6.3 A Heuristic for One-Player Can't Stop

Even for the simplified games, the databases are too large for a human to use without electronic access to them. We have attempted to create simpler heuristics that a human could follow for good but not optimal play. One possible heuristic is the "Rule of 28" (Keller, 1998). With this heuristic, each column is given a point value. Each time a neutral marker is advanced, the point total for the turn increases by the value of the column the marker is in (doubled when first placing a marker in a column). Points are also added (or subtracted) for various undesirable (or desirable) combinations of columns with markers; for example, 2 points are added if the three neutral markers are all in odd columns. When the point total for the turn reaches or exceeds 28, the turn ends.

Unfortunately, the Rule of 28 only addresses when to end a turn and not how to pair the dice. Furthermore, for comparison to the optimal strategy, we need to adapt it to (5, 2, 1) Can't Stop (the most complex version that we have solved). Our proposed heuristic borrows from the Rule of 28 the idea of assigning each column a fixed point value per space: for any position, the score for a column is the product of the column's value and the spaces left to advance in that column. This is intended to reflect the relative ease of completing the various columns: columns that are close to completion will have low scores. The total score for a position is the sum over all columns of the column's score times its weight, where the weights are chosen to further favor columns that are closer to completion. When we have a choice of positions to move to, we choose the one with the lowest total score. Weights are assigned at the start of each turn so that the column with the lowest score has the highest weight.

For example, the values of the columns might be (11, 6, 3, 2, 1, 2, 3, 6, 11), reflecting the fact that advancing one square in the 2's column makes much more progress toward completing that column than advancing one square in the 6's column. Without the weights, the initial roll {1, 1, 5, 5} would be taken as 1 + 1 = 2 and 5 + 5 = 10, because advancing one square in each of the 2's and 10's columns makes 22 points of progress while advancing two squares in the 6's column makes only 2 points of progress. However, this is not the optimal move. By setting the initial weights on the columns to (1, 1, 3, 5, 12, 5, 3, 1, 1) we change the improvement of advancing two squares in the 6's column to 24, so that becomes the chosen move. If we are ever forced to move into the 2's or 10's columns (by rolling {1, 1, 1, 1}, for example), then we re-weight those columns so they are seen as more important.

We have implemented this heuristic with the column values and weights given above. The columns are re-weighted at the beginning of each turn so that the column with the lowest score is assigned weight 12, the two columns with the next lowest scores are assigned weight 5, and so forth. These values were chosen by trial and error to attempt to match the optimal move for each possible roll from the initial position of the game. The heuristic chooses the optimal first move for 54 out of 70 rolls (77.1%). Instead of ending a turn when the total value of the position has decreased by at least some fixed amount (as in the Rule of 28), we end turns only when a neutral marker is advanced to the end of a column. This rule makes the correct decision in the first turn for 90.5% of the possible positions reached in the first turn. All of the errors are from the heuristic ending a turn too soon: there are positions with markers at the end of a column where the optimal decision is to keep rolling. The average number of turns to complete the game is approximately 4.58 using this strategy, versus 3.40 for the optimal strategy.
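For concreteness, here is a minimal sketch of the weighted scoring just described, using the column values and initial weights given above. The class and method names, and the candidate-selection interface, are our assumptions, not the authors' code.

```java
// Sketch of the proposed heuristic's scoring for (5, 2, 1) Can't Stop.
// spacesLeft[c] is the number of spaces remaining in column c + 2, so
// index 0 corresponds to the 2's column and index 8 to the 10's column.
final class HeuristicScore {
    static final int[] VALUE  = {11, 6, 3, 2,  1, 2, 3, 6, 11};  // per-space column values
    static final int[] WEIGHT = { 1, 1, 3, 5, 12, 5, 3, 1,  1};  // initial weights

    static int total(int[] spacesLeft) {
        int sum = 0;
        for (int c = 0; c < VALUE.length; c++)
            sum += WEIGHT[c] * VALUE[c] * spacesLeft[c];          // weighted column score
        return sum;
    }

    // Among the positions reachable with the current roll, pick the one
    // with the lowest weighted total score.
    static int[] choose(java.util.List<int[]> candidates) {
        int[] best = null;
        for (int[] p : candidates)
            if (best == null || total(p) < total(best)) best = p;
        return best;
    }
}
```

With these numbers the worked example above goes through: pairing {1, 1, 5, 5} as 2 and 10 lowers the total by 11 + 11 = 22, while pairing it as 6 and 6 lowers it by 12 * (1 + 1) = 24, so the 6's pairing is chosen.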

7. CONCLUSION

We used a bipartite graph to abstract a one-player stochastic game, and a four-partite graph to abstract a two-player stochastic game. For the one-player game, the goal is to maximize some expected game value or to minimize the expected penalty. For the two-player game, the position values give the probability of the first player winning the game under optimal play. We presented the family of jeopardy stochastic games, which includes Can't Stop, Pig, and some variants. We proved that their optimal solutions exist and are unique. We transformed the problem of finding optimal play into a fixed-point problem, and solved it by incorporating Newton's method into retrograde analysis. We call our new methods retrograde approximation algorithms; they are faster than value iteration not only in theory but also in practice. We also gave a global convergence analysis of our methods.

We experimented on small versions of one-player and two-player Can't Stop, and successfully solved several simplified games with 3-sided, 4-sided, and 5-sided dice. We also compared a heuristic for (5, 2, 1) one-player Can't Stop to the optimal solution.

One-player Can't Stop with the official equipment has more than $6 \cdot 10^{12}$ positions. The official two-player game has over $10^{36}$ components. In both cases the game is so large that it would be very difficult to compute a solution. It may be possible to find patterns in the solutions to the simplified games and use those patterns to approximate optimal solutions to the official games. It may also be possible to use the optimal solution to the one-player game as an approximate solution to the two-player game.

ACKNOWLEDGEMENT

We would like to thank the editorial board for their guidance in improving the presentation, and a referee for very informative comments.


8. REFERENCES

Bellman, R. E. (1957). Dynamic Programming. Princeton University Press, Princeton, NJ, USA.

Dennis, J. E. and Schnabel, R. B. (1996). Numerical Methods for Unconstrained Optimization and Nonlinear Equations. SIAM, Philadelphia, PA, USA.

Fang, H. (2005a). The Nature of Retrograde Analysis for Chinese Chess, Part I. ICGA Journal, Vol. 28, No. 2, pp. 91–105.

Fang, H. (2005b). The Nature of Retrograde Analysis for Chinese Chess, Part II. ICGA Journal, Vol. 28, No. 3, pp. 140–152.

Gasser, R. (1996). Solving Nine Men's Morris. Computational Intelligence, Vol. 12, pp. 24–41.

Glenn, J. (2006). An Optimal Strategy for Yahtzee. Technical Report CS-TR-0002, Loyola College in Maryland, 4501 N. Charles St, Baltimore, MD 21210, USA.

Glenn, J. (2007a). Computer Strategies for Solitaire Yahtzee. IEEE Symposium on Computational Intelligence and Games (CIG 2007), pp. 132–139.

Glenn, J., Fang, H., and Kruskal, C. P. (2007b). A Retrograde Approximate Algorithm for One-Player Can't Stop. Lecture Notes in Computer Science, Vol. 4630 (eds. H. J. van den Herik, P. Ciancarini, and H. H. L. M. Donkers), pp. 148–159, Springer-Verlag, New York, NY. CG'06 Proceedings.

Glenn, J., Fang, H., and Kruskal, C. P. (2007c). A Retrograde Approximate Algorithm for Two-Player Can't Stop. CGW2007 Workshop, Amsterdam, The Netherlands.

Glenn, J., Fang, H., and Kruskal, C. P. (2008). A Retrograde Approximate Algorithm for Multi-Player Can't Stop. CG2008 Conference, Beijing, China.

Irving, G., Donkers, J., and Uiterwijk, J. (2000). Solving Kalah. ICGA Journal, Vol. 23, No. 3, pp. 139–147.

Keller, M. (1998). Can't Stop? Try the Rule of 28. World Game Review, Vol. 6. ISSN 1041-0546.

Lake, R., Schaeffer, J., and Lu, P. (1994). Solving Large Retrograde Analysis Problems Using a Network of Workstations. Advances in Computer Games VII (eds. H. van den Herik, I. S. Herschberg, and J. Uiterwijk), pp. 135–162. University of Limburg, Maastricht, The Netherlands.

Littman, M. L. (1994). Markov Games as a Framework for Multi-Agent Reinforcement Learning. Proceedings of the Eleventh International Conference on Machine Learning (eds. W. W. Cohen and H. Hirsh), pp. 157–163.

Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, New York, NY, USA.

Nash, J. (1951). Non-Cooperative Games. The Annals of Mathematics, Second Series, Vol. 54, No. 2, pp. 286–295.

Neller, T. and Presser, C. (2004). Optimal Play of the Dice Game Pig. The UMAP Journal, Vol. 25, No. 1, pp. 25–47.

Neller, T. and Presser, C. (2006). Pigtail: A Pig Addendum. The UMAP Journal, Vol. 26, No. 4, pp. 443–458.

Ortega, J. M. (1990). Numerical Analysis: A Second Course. SIAM, Philadelphia, PA, USA.

Romein, J. and Bal, H. (2003). Solving the Game of Awari using Parallel Retrograde Analysis. IEEE Computer, Vol. 36, No. 10, pp. 26–33.

Rosenlicht, M. (1968). Introduction to Analysis. Dover, New York, NY, USA.

Sackson, S. (2007). Can't Stop. Face 2 Face Games, Providence, RI, USA. ISBN 0-9761156-7-0. Boxed game set.

Schaeffer, J., Björnsson, Y., Burch, N., Lake, R., Lu, P., and Sutphen, S. (2004). Building the Checkers 10-Piece Endgame Databases. Advances in Computer Games 10: Many Games, Many Challenges (eds. H. van den Herik, H. Iida, and E. Heinz), pp. 193–210. Kluwer Academic Publishers, Boston, USA.


Sconyers, H. (2003). Hugh Sconyers' Bearoff 15x15 Databases. http://home.earthlink.net/~sconyers/.

Ströhlein, T. (1970). Untersuchungen über kombinatorische Spiele. Ph.D. thesis, Fakultät für Allgemeine Wissenschaften der Technischen Hochschule München, Munich.

Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA.

Thompson, K. (1986). Retrograde Analysis of Certain Endgames. ICCA Journal, Vol. 9, No. 3, pp. 131–139.

Thompson, K. (1996). 6-Piece Endgames. ICCA Journal, Vol. 19, No. 4, pp. 215–226.

Woodward, P. (2003). Yahtzee: The Solution. Chance, Vol. 16, No. 1, pp. 18–22.

Wu, R. and Beal, D. (2001). Fast, Memory-Efficient Retrograde Algorithms. ICGA Journal, Vol. 24, No. 3, pp. 147–159.

9. APPENDICES

APPENDIX: CAN'T STOP RULES

We summarize the rules of Can't Stop, largely taken from Wikipedia⁷.

The game equipment consists of four dice, a board, a set of eleven markers (squares) for each player, and three neutral markers. The board consists of eleven columns of spaces, one column for each of the numbers 2 through 12. The columns (respectively) have 3, 5, 7, 9, 11, 13, 11, 9, 7, 5, and 3 spaces each. The object of the game is to move your markers up the columns and be the first player to complete three columns.

On a player's turn he⁸ rolls all four dice. He then divides the four dice into two pairs, each of which has an associated total. (For example, if he rolled 1-3-3-4 he could make a 4 and a 7, or a 5 and a 6.) If the neutral markers are off the board, then they are brought onto the board in the columns that correspond to these totals. If the neutral markers are already on the board in one or both of these columns, then they are advanced one space upward. If the neutral markers are on the board, but only in columns that cannot be made with any pair of the current four dice, then the turn is over and the player gains nothing.

After moving the markers the player chooses whether or not to roll again. If he stops, then he puts markers of his color in the locations of the current neutral markers; if on a later turn he restarts such a column, he builds from the place he previously claimed. If he does not stop, then he must be able to advance at least one of the neutral markers on his next roll, or all progress made this turn is lost.

When a player reaches the top space of a column, that column is won, and no further play in that column is allowed. The first player to complete three columns wins the game.
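As an illustration of the pairing rule above, a roll of four dice admits exactly three pairings. A minimal sketch (ours, for illustration only):

```java
// Enumerate the three ways to split four dice a, b, c, d into two pairs,
// returning the pair totals used to select columns.
final class Pairings {
    static int[][] pairings(int a, int b, int c, int d) {
        return new int[][] {
            {a + b, c + d},   // pair (a,b) with (c,d)
            {a + c, b + d},   // pair (a,c) with (b,d)
            {a + d, b + c}    // pair (a,d) with (b,c)
        };
    }
    public static void main(String[] args) {
        for (int[] p : pairings(1, 3, 3, 4))              // the example roll 1-3-3-4
            System.out.println(p[0] + " and " + p[1]);    // 4 and 7, 4 and 7, 5 and 6
    }
}
```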

⁷ http://en.wikipedia.org/wiki/Can't_Stop
⁸ We use 'he' when both 'she' and 'he' are possible.