Bootstrapping from Game Tree Search

Joel Veness†∗   David Silver‡   Will Uther∗†   Alan Blair†∗

University of New South Wales†   NICTA∗   University of Alberta‡

December 9, 2009
Presentation Overview

A new algorithm will be presented for learning heuristic evaluation functions for game tree search via self-play. Topics covered include:

- Why self-play is important
- The search bootstrapping approach
- Relationship to TD(λ) and TD-Leaf(λ)
- Empirical evaluation on Chess
Game Tree Search
Heuristic Evaluation Function

A heuristic evaluation function is a mapping $H(s) : \text{State} \to \mathbb{R}$. For this work we use:

- Parameterised representation: $H(s; \vec{w})$, $\vec{w} \in \mathbb{R}^n$
- State dependent feature vector: $\vec{\phi}(s) : \text{State} \to \mathbb{R}^n$
- Linear combination of features: $H(s; \vec{w}) := \vec{\phi}(s) \cdot \vec{w}$

Problem: how to find a good $\vec{w}$?
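As an illustration (not the paper's code), here is a minimal Python sketch of this linear form; the 4-dimensional feature vector is a toy stand-in for a real chess evaluator's features:

```python
import numpy as np

def H(phi_s: np.ndarray, w: np.ndarray) -> float:
    """Linear heuristic evaluation: H(s; w) = phi(s) . w,
    where phi_s is the feature vector of state s."""
    return float(phi_s @ w)

# Toy example with 4 features (the evaluator in this work uses a
# weighted linear combination of 1800 chess features):
w = np.zeros(4)                           # weights to be learned
phi_s = np.array([1.0, 0.0, 2.0, -1.0])   # hypothetical phi(s)
print(H(phi_s, w))                        # 0.0 before any training
```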
Constructing Evaluation Functions

Some alternative methods to find weights:

- Hand-tune (guess and test)
- Supervised learning / learn from expert play
- Self-play

Self-play has a number of potential benefits:

- No need for scored training examples
- Reduced knowledge engineering effort

but can be hard to achieve in practice.
Updating Evaluation Functions

We will frequently talk about updating $H(s; \vec{w})$ towards some target value $T \in \mathbb{R}$. The methods we consider are all:

- Online
- Based on stochastic gradient descent, on either the squared error $\frac{1}{2}(T - H(s; \vec{w}))^2$ or the sum of squared errors $\frac{1}{2}\sum_s (T_s - H(s; \vec{w}))^2$
- Invoked after a real action (move) is taken

For this talk, we are more interested in exactly how we choose the training target(s).
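As a concrete sketch of the update itself (the learning rate is illustrative; the target $T$ is whatever the chosen backup scheme supplies):

```python
import numpy as np

def sgd_step(w: np.ndarray, phi_s: np.ndarray, target: float,
             lr: float = 1e-3) -> np.ndarray:
    """One stochastic gradient descent step on 1/2 (T - H(s; w))^2.
    For linear H(s; w) = phi(s) . w, the gradient w.r.t. w is
    -(T - H(s; w)) * phi(s), giving the update below."""
    error = target - phi_s @ w   # T - H(s; w)
    return w + lr * error * phi_s
```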
Self play with TD Learning

- Famously applied to Backgammon (TD-Gammon) by Tesauro
- Simple greedy action selection sufficed during training
- ... unfortunately, it runs into difficulties in highly tactical domains (e.g. Chess)

[Figure: sequence of game positions s1, s2, s3, s4 at times t, t+1, t+2, t+3, with TD backups between successive positions]
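A minimal sketch of the TD(0) special case of this backup (a simplification of the TD(λ) family the slide refers to), assuming the linear evaluation and learning rate from before:

```python
import numpy as np

def td0_update(w, phi_prev, phi_curr, lr=1e-3):
    """TD(0) backup: move H(s_t; w) towards H(s_{t+1}; w).
    The target is the agent's own next-state estimate; no search
    is involved, which is one reason sharp tactical lines (as in
    chess) are hard for this scheme to capture."""
    error = phi_curr @ w - phi_prev @ w   # TD error
    return w + lr * error * phi_prev
```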
TD-Leaf Learning

Introduced by Baxter et al. Combines game tree search and TD learning. Some well-known applications: Chess (KnightCap) and Checkers (Chinook).

[Figure: sequence of searched positions s1, s2, s3, s4 at times t, t+1, t+2, t+3, with backups applied at the leaf of each search's principal variation]
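A hedged sketch of the TD-Leaf idea, under the assumption that a separate minimax/alpha-beta search routine (not shown) returns the principal-variation leaf of each search:

```python
import numpy as np

def tdleaf_update(w, phi, leaf_prev, leaf_curr, lr=1e-3):
    """TD-Leaf backup: leaf_prev and leaf_curr are the
    principal-variation leaves of the searches performed at
    times t and t+1. H at the earlier leaf is moved towards
    H at the later leaf, so the search's evaluation, rather
    than the raw root position, drives the TD error."""
    error = phi(leaf_curr) @ w - phi(leaf_prev) @ w
    return w + lr * error * phi(leaf_prev)
```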
Limitations of TD-Leaf

Although undoubtedly an improvement over TD for certain types of games, a number of issues remain:

- Difficult to achieve strong results from just randomly assigned weights.
- Expert play in chess emerged only after material weights were initialised to expert values and likely opponent blunders were excluded.
- KnightCap required a carefully controlled learning regime to learn. Is the deterministic case harder than the stochastic case?
- Higher computational overhead compared to TD(λ)
Our work in context...

Program      Game        Weights   Self-play   Performance
TD-Gammon    Backgammon  Random    Yes         World Class
Chinook      Checkers    Mixed     Yes         World Class
KnightCap    Chess       Mixed     No          Expert/Master
Meep (Us)    Chess       Random    Yes         Expert/Master

Notes:

- Results for KnightCap, starting from random weights and trained via self-play, were disappointing.
- The values of a checker and a king were fixed in Chinook.
An obvious, but important point...

The distribution over positions seen in search ≠ the distribution over positions seen over the board.

E.g. contrast:

[Figure: two contrasting example chess positions]
TreeStrap: an alternative backup scheme

Consider the following modified backup policy:

[Figure: search trees at times t and t+1; each node is updated towards the value backed up from the deeper search beneath it, within the same time-step]
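A hedged sketch of this backup for plain minimax search; the `phi` feature function, the learning rate, and the `(state, value)` node list are assumptions about an implementation, not the paper's code:

```python
import numpy as np

def treestrap_minimax_update(w, nodes, phi, lr=1e-3):
    """TreeStrap(minimax) backup: every node s of the search tree
    built at the current time-step is moved towards V(s), the
    minimax value backed up from deeper in the same tree.
    `nodes` is assumed to be a list of (state, minimax_value)
    pairs collected while the search runs."""
    delta = np.zeros_like(w)
    for state, minimax_value in nodes:
        error = minimax_value - phi(state) @ w
        delta += lr * error * phi(state)   # one update per node
    return w + delta
```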
TreeStrap Properties

Three main points:

- Backups come from deeper search at the same time-step, not from subsequent searches.
- A single search provides many updates; potential to learn faster?
- Training examples come from more "representative" positions; potentially more robust?

Implementation:

- Extended to alpha-beta search, using a one-sided loss function (see the sketch below)
- High performance programs all use transposition tables, so the bound information is already available
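A hedged sketch of the alpha-beta variant: because pruning yields only an upper and/or lower bound on each node's minimax value, the update pushes $H(s; \vec{w})$ back inside the bound interval only when it falls outside it (the one-sided loss). The `(state, lower, upper)` triples are assumed to come from the transposition table after the search:

```python
import numpy as np

def treestrap_alphabeta_update(w, bounds, phi, lr=1e-3):
    """TreeStrap(alpha-beta) backup with a one-sided loss.
    `bounds` is assumed to yield (state, lower, upper) triples
    from the transposition table: alpha-beta proves
    lower <= V(s) <= upper, where either bound may be infinite.
    H(s; w) is corrected only when it violates a proven bound."""
    delta = np.zeros_like(w)
    for state, lower, upper in bounds:
        h = phi(state) @ w
        if h < lower:                   # provably too pessimistic
            delta += lr * (lower - h) * phi(state)
        elif h > upper:                 # provably too optimistic
            delta += lr * (upper - h) * phi(state)
    return w + delta
```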
Experimental Setup

- Heuristic evaluation consists of a weighted linear combination of 1800 features
- 1m 1s Fischer time controls used for training and evaluation (∼5 mins)
- Time taken for updates reduced the overall thinking time
- Over 25000 training games played to learn weights
- Over 16000 games played (time consuming!) in the local evaluation tournament
- 2000 games used for online evaluation
Comparison to existing methods on Chess

[Figure: Learning from self-play — rating (Elo) versus number of training games, on a log scale from 10^1 to 10^4 games, for TreeStrap(alpha-beta), RootStrap(alpha-beta), TreeStrap(minimax), TD-Leaf, and an untrained baseline; ratings span roughly 0 to 2500 Elo]
Performance at the Internet Chess Club

Blitz performance at the Internet Chess Club:

Algorithm       Training Partner   Rating
TreeStrap(αβ)   Self Play          1950-2197
TreeStrap(αβ)   Shredder           2154-2338

- Self-play weights correspond to expert/weak-master level play
- Strong-opponent weights correspond to master level play
- Scored 13.5/15 against International Master opposition online
- Learning by playing a strong opponent helps, but the effect is not as pronounced as with TD-Leaf
Highlights

- The TreeStrap(·) method was introduced, an alternative to TD-based approaches for self-play training in games.

With respect to Chess:

- Order of magnitude reduction in training time vs TD-Leaf
- Simple greedy move selection is sufficient for training
- First successful self-play result starting from entirely random weights!
Questions / Marketing

Thank you for listening. Please visit us at W38 this evening, especially if you are interested in talking about:

- algorithmic details, e.g. TreeStrap(αβ)
- details of the chess-specific features
- how playing strength is measured
- the relationship of TreeStrap to other Reinforcement Learning techniques
- ways in which this work can be extended