Bootstrapping from Game Tree Search

Joel Veness†∗   David Silver‡   Will Uther∗†   Alan Blair†∗

University of New South Wales†   NICTA∗   University of Alberta‡

December 9, 2009
Presentation Overview

A new algorithm will be presented for learning heuristic evaluation functions for game tree search via self-play. Topics covered include:

- Why self-play is important
- The search bootstrapping approach
- Relationship to TD(λ) and TD-Leaf(λ)
- Empirical evaluation on Chess
Game Tree Search
Heuristic Evaluation Function

A heuristic evaluation function is a mapping $H(s) : \text{State} \to \mathbb{R}$. For this work we use:

- Parameterised representation: $H(s; \vec{w})$, $\vec{w} \in \mathbb{R}^n$
- State dependent feature vector: $\vec{\phi}(s) : \text{State} \to \mathbb{R}^n$
- Linear combination of features: $H(s; \vec{w}) := \vec{\phi}(s) \cdot \vec{w}$

Problem: how to find a good $\vec{w}$?
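As an illustration (not the paper's code), here is a minimal Python sketch of this linear form; the 4-dimensional feature vector is a toy stand-in for a real chess evaluator's features:

```python
import numpy as np

def H(phi_s: np.ndarray, w: np.ndarray) -> float:
    """Linear heuristic evaluation: H(s; w) = phi(s) . w,
    where phi_s is the feature vector of state s."""
    return float(phi_s @ w)

# Toy example with 4 features (the evaluator in this work uses a
# weighted linear combination of 1800 chess features):
w = np.zeros(4)                           # weights to be learned
phi_s = np.array([1.0, 0.0, 2.0, -1.0])   # hypothetical phi(s)
print(H(phi_s, w))                        # 0.0 before any training
```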
Constructing Evaluation Functions

Some alternative methods to find weights:

- Hand-tune (guess and test)
- Supervised learning / learn from expert play
- Self-play

Self-play has a number of potential benefits:

- No need for scored training examples
- Reduced knowledge engineering effort

but can be hard to achieve in practice.
Updating Evaluation Functions

We will frequently talk about updating $H(s; \vec{w})$ towards some target value $T \in \mathbb{R}$. The methods we consider are all:

- Online
- Based on stochastic gradient descent, on either the squared error $\frac{1}{2}(T - H(s; \vec{w}))^2$ or the sum of squared errors $\frac{1}{2}\sum_s (T_s - H(s; \vec{w}))^2$
- Invoked after a real action (move) is taken

For this talk, we are more interested in exactly how we choose the training target(s).
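As a concrete sketch of the update itself (the learning rate is illustrative; the target $T$ is whatever the chosen backup scheme supplies):

```python
import numpy as np

def sgd_step(w: np.ndarray, phi_s: np.ndarray, target: float,
             lr: float = 1e-3) -> np.ndarray:
    """One stochastic gradient descent step on 1/2 (T - H(s; w))^2.
    For linear H(s; w) = phi(s) . w, the gradient w.r.t. w is
    -(T - H(s; w)) * phi(s), giving the update below."""
    error = target - phi_s @ w   # T - H(s; w)
    return w + lr * error * phi_s
```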
Self play with TD Learning

- Famously applied to Backgammon (TD-Gammon) by Tesauro
- Simple greedy action selection sufficed during training
- ... unfortunately, it runs into difficulties in highly tactical domains (e.g. Chess)

[Figure: sequence of game positions s1, s2, s3, s4 at times t, t+1, t+2, t+3, with TD backups between successive positions]
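A minimal sketch of the TD(0) special case of this backup (a simplification of the TD(λ) family the slide refers to), assuming the linear evaluation and learning rate from before:

```python
import numpy as np

def td0_update(w, phi_prev, phi_curr, lr=1e-3):
    """TD(0) backup: move H(s_t; w) towards H(s_{t+1}; w).
    The target is the agent's own next-state estimate; no search
    is involved, which is one reason sharp tactical lines (as in
    chess) are hard for this scheme to capture."""
    error = phi_curr @ w - phi_prev @ w   # TD error
    return w + lr * error * phi_prev
```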
TD-Leaf Learning

Introduced by Baxter et al. Combines game tree search and TD learning. Some well-known applications: Chess (KnightCap) and Checkers (Chinook).

[Figure: sequence of searched positions s1, s2, s3, s4 at times t, t+1, t+2, t+3, with backups applied at the leaf of each search's principal variation]
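A hedged sketch of the TD-Leaf idea, under the assumption that a separate minimax/alpha-beta search routine (not shown) returns the principal-variation leaf of each search:

```python
import numpy as np

def tdleaf_update(w, phi, leaf_prev, leaf_curr, lr=1e-3):
    """TD-Leaf backup: leaf_prev and leaf_curr are the
    principal-variation leaves of the searches performed at
    times t and t+1. H at the earlier leaf is moved towards
    H at the later leaf, so the search's evaluation, rather
    than the raw root position, drives the TD error."""
    error = phi(leaf_curr) @ w - phi(leaf_prev) @ w
    return w + lr * error * phi(leaf_prev)
```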
Limitations of TD-Leaf

Although undoubtedly an improvement over TD for certain types of games, a number of issues remain:

- Difficult to achieve strong results from just randomly assigned weights.
- Expert play in chess emerged only after material weights were initialised to expert values and likely opponent blunders were excluded.
- KnightCap required a carefully controlled learning regime to learn. Is the deterministic case harder than the stochastic case?
- Higher computational overhead compared to TD(λ)
Our work in context...

Program      Game        Weights   Self-play   Performance
TD-Gammon    Backgammon  Random    Yes         World Class
Chinook      Checkers    Mixed     Yes         World Class
KnightCap    Chess       Mixed     No          Expert/Master
Meep (Us)    Chess       Random    Yes         Expert/Master

Notes:

- Results for KnightCap, starting from random weights and trained via self-play, were disappointing.
- The values of a checker and a king were fixed in Chinook.
An obvious, but important point...

The distribution over positions seen in search ≠ the distribution over positions seen over the board.

E.g. contrast:

[Figure: two contrasting example chess positions]
TreeStrap: an alternative backup scheme

Consider the following modified backup policy:

[Figure: search trees at times t and t+1; each node is updated towards the value backed up from the deeper search beneath it, within the same time-step]
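A hedged sketch of this backup for plain minimax search; the `phi` feature function, the learning rate, and the `(state, value)` node list are assumptions about an implementation, not the paper's code:

```python
import numpy as np

def treestrap_minimax_update(w, nodes, phi, lr=1e-3):
    """TreeStrap(minimax) backup: every node s of the search tree
    built at the current time-step is moved towards V(s), the
    minimax value backed up from deeper in the same tree.
    `nodes` is assumed to be a list of (state, minimax_value)
    pairs collected while the search runs."""
    delta = np.zeros_like(w)
    for state, minimax_value in nodes:
        error = minimax_value - phi(state) @ w
        delta += lr * error * phi(state)   # one update per node
    return w + delta
```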
TreeStrap Properties

Three main points:

- Backups come from deeper search at the same time-step, not from subsequent searches.
- A single search provides many updates; potential to learn faster?
- Training examples come from more "representative" positions; potentially more robust?

Implementation:

- Extended to alpha-beta search, using a one-sided loss function (see the sketch below)
- High performance programs all use transposition tables, so the bound information is already available
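A hedged sketch of the alpha-beta variant: because pruning yields only an upper and/or lower bound on each node's minimax value, the update pushes $H(s; \vec{w})$ back inside the bound interval only when it falls outside it (the one-sided loss). The `(state, lower, upper)` triples are assumed to come from the transposition table after the search:

```python
import numpy as np

def treestrap_alphabeta_update(w, bounds, phi, lr=1e-3):
    """TreeStrap(alpha-beta) backup with a one-sided loss.
    `bounds` is assumed to yield (state, lower, upper) triples
    from the transposition table: alpha-beta proves
    lower <= V(s) <= upper, where either bound may be infinite.
    H(s; w) is corrected only when it violates a proven bound."""
    delta = np.zeros_like(w)
    for state, lower, upper in bounds:
        h = phi(state) @ w
        if h < lower:                   # provably too pessimistic
            delta += lr * (lower - h) * phi(state)
        elif h > upper:                 # provably too optimistic
            delta += lr * (upper - h) * phi(state)
    return w + delta
```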
Experimental Setup

- Heuristic evaluation consists of a weighted linear combination of 1800 features
- 1m 1s Fischer time controls used for training and evaluation (∼5 mins)
- Time taken for updates reduced the overall thinking time
- Over 25000 training games played to learn weights
- Over 16000 games played (time consuming!) in the local evaluation tournament
- 2000 games used for online evaluation
Comparison to existing methods on Chess

[Figure: Learning from self-play — rating (Elo) versus number of training games, on a log scale from 10^1 to 10^4 games, for TreeStrap(alpha-beta), RootStrap(alpha-beta), TreeStrap(minimax), TD-Leaf, and an untrained baseline; ratings span roughly 0 to 2500 Elo]
Performance at the Internet Chess Club

Blitz performance at the Internet Chess Club:

Algorithm       Training Partner   Rating
TreeStrap(αβ)   Self Play          1950-2197
TreeStrap(αβ)   Shredder           2154-2338

- Self-play weights correspond to expert/weak-master level play
- Strong-opponent weights correspond to master level play
- Scored 13.5/15 against International Master opposition online
- Learning by playing a strong opponent helps, but the effect is not as pronounced as with TD-Leaf
Highlights

- The TreeStrap(·) method was introduced, an alternative to TD-based approaches for self-play training in games.

With respect to Chess:

- Order of magnitude reduction in training time vs TD-Leaf
- Simple greedy move selection is sufficient for training
- First successful self-play result starting from entirely random weights!
Questions / Marketing

Thank you for listening. Please visit us at W38 this evening, especially if you are interested in talking about:

- algorithmic details, e.g. TreeStrap(αβ)
- details of the chess-specific features
- how playing strength is measured
- the relationship of TreeStrap to other Reinforcement Learning techniques
- ways in which this work can be extended