Introduction to Real-Life Reinforcement Learning

Michael L. Littman
Rutgers University, Department of Computer Science

Brief History

The idea for this symposium came out of a discussion I had with Satinder Singh at ICML 2003 (Washington, DC). We were both starting new labs, wanted to highlight an important challenge in RL, and felt we could help create some momentum by bringing together like-minded researchers.

Attendees (Part I)

• Myriam Abramson
• James Bagnell
• Darrin Bentivegna
• Douglas Blank
• Lashon Booker
• Carlos Diuk
• Andrew Fagg
• Peggy Fidelman
• Dieter Fox
• Geoffrey Gordon
• Lloyd Greenwald
• Matthew Grounds
• Nicholas Jong
• Terran Lane
• John Langford
• Dave LeRoux
• Michael Littman
• Mary McGlohon
• Amy McGovern
• Lisa Meeden

Attendees (More)

• Risto Miikkulainen
• David Musliner
• Jan Peters
• Joelle Pineau
• Scott Proper

Definitions

What is “reinforcement learning”?
• Decision making driven to maximize a measurable performance objective.

What is “real life”?
• “Measured” experience: data doesn’t come from a model with known or predefined properties/assumptions.

Multiple Lives

• Real-life learning (us): use real data, possibly small (even toy) problems
• Life-sized learning (Kaelbling): large state spaces, possibly artificial problems
• Life-long learning (Thrun): same learning system, different problems (somewhat orthogonal)

Find The Ball

Learn:
• which way to turn
• to minimize steps
• to see goal (ball)
• from camera input
• given experience.

The RL Problem

Input: a stream of sensations, actions, and rewards (e.g., ⟨camera image⟩, right, ⟨camera image⟩, +1), …, up to the current sensation st.
Output: actions at chosen to maximize the discounted sum of rewards ri.
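As a concrete note on the objective (an illustration, not part of the original slides), the quantity being maximized is the discounted sum Σi γ^i ri. A minimal Python sketch, assuming a discount factor γ = 0.9:

```python
# Illustrative sketch (not from the slides): the learner's objective is the
# discounted sum of rewards r_0, r_1, r_2, ... for some discount gamma < 1.
def discounted_return(rewards, gamma=0.9):
    """Compute sum_i gamma**i * rewards[i] for a finite reward sequence."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Example: the agent first sees the ball (reward +1) on its third step.
print(discounted_return([0, 0, 1]))  # 0.81 when gamma = 0.9
```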

Problem Formalization: MDP

Most popular formalization: Markov decision process.

Assume:
• States/sensations and actions are discrete.
• Transitions and rewards are stationary and Markov.
• Transition function: Pr(s'|s,a) = T(s,a,s').
• Reward function: E[r|s,a] = R(s,a).

Then:
• Optimal policy π*(s) = argmax_a Q*(s,a),
• where Q*(s,a) = R(s,a) + γ Σ_s' T(s,a,s') max_a' Q*(s',a').

Find the Ball: MDP Version

• States: orientation
• Actions: rotate left/right
• Reward: +1 for facing ball, 0 otherwise
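To make the formalization concrete, here is a minimal sketch that encodes a toy version of the Find-the-Ball MDP and solves the Q* equation above by repeated Bellman backups. The specific details (8 discrete orientations, deterministic rotation, ball at orientation 0) are assumptions for illustration, not from the slides:

```python
import numpy as np

# Toy Find-the-Ball MDP (illustrative discretization, not from the slides):
# 8 orientations on a ring; actions rotate left/right deterministically;
# reward +1 when the rotation leaves the agent facing the ball (state 0).
N_STATES, GAMMA = 8, 0.9
ACTIONS = (-1, +1)  # rotate left, rotate right

def T(s, a, s2):
    """Pr(s'|s,a): deterministic rotation on the ring of orientations."""
    return 1.0 if s2 == (s + a) % N_STATES else 0.0

def R(s, a):
    """E[r|s,a]: +1 for rotating into the ball-facing orientation."""
    return 1.0 if (s + a) % N_STATES == 0 else 0.0

# Bellman backups for Q*(s,a) = R(s,a) + gamma * sum_s' T(s,a,s') max_a' Q*(s',a')
Q = np.zeros((N_STATES, len(ACTIONS)))
for _ in range(200):
    for s in range(N_STATES):
        for i, a in enumerate(ACTIONS):
            Q[s, i] = R(s, a) + GAMMA * sum(
                T(s, a, s2) * Q[s2].max() for s2 in range(N_STATES))

# Optimal policy pi*(s) = argmax_a Q*(s,a): which direction to rotate.
print([ACTIONS[int(Q[s].argmax())] for s in range(N_STATES)])
```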

It Can Be Done: Q-learning

Since the optimal Q function is sufficient, use experience to estimate it (Watkins & Dayan, 1992).

Given ⟨s, a, r, s'⟩: Q(s,a) ← Q(s,a) + α_t (r + γ max_a' Q(s',a') − Q(s,a))

If:
• all (s,a) pairs are updated infinitely often,
• Pr(s'|s,a) = T(s,a,s') and E[r|s,a] = R(s,a),
• Σ α_t = ∞ and Σ α_t² < ∞,
then Q(s,a) → Q*(s,a).
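A tabular sketch of this update rule on the same toy Find-the-Ball MDP (again, the 8-orientation discretization and ball-at-state-0 details are assumptions for illustration). Using per-pair step sizes α_t = 1/t satisfies the two learning-rate conditions:

```python
import random

# Tabular Q-learning (Watkins & Dayan, 1992) on the toy Find-the-Ball MDP.
N_STATES, GAMMA = 8, 0.9
ACTIONS = (-1, +1)  # rotate left, rotate right

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
visits = {}  # per-(s,a) update counts, so alpha_t = 1/t meets the conditions

s = random.randrange(N_STATES)
for _ in range(50000):
    a = random.choice(ACTIONS)            # exploratory (uniform random) behavior
    s2 = (s + a) % N_STATES               # sampled next state
    r = 1.0 if s2 == 0 else 0.0           # +1 once the agent faces the ball
    visits[(s, a)] = visits.get((s, a), 0) + 1
    alpha = 1.0 / visits[(s, a)]
    target = r + GAMMA * max(Q[(s2, a2)] for a2 in ACTIONS)
    Q[(s, a)] += alpha * (target - Q[(s, a)])  # the Q-learning update
    s = s2

# Greedy policy read off the learned Q values: which way to turn in each state.
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)})
```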

Real-Life Reinforcement Learning

Emphasize learning with real* data. Q-learning is good, but might not be right here…

Mismatches to the “Find the Ball” MDP:
• Efficient exploration: data is expensive
• Rich sensors: never see the same thing twice
• Aliasing: different states can look similar
• Non-stationarity: details change over time

* Or, if simulated, from simulators developed outside the AI community.

RL2: A Spectrum

• Unmodified physical world
• Controlled physical world
• Electronic-only world
• Pure math world
• Detailed simulation
• Lab-created simulation

(Diagram: the spectrum is annotated with “RL2” and “RL” regions and a gray zone between them.)

Unmodified Physical World
• helicopter (Bagnell)
• weight loss (BodyMedia)

Controlled Physical World
• Mahadevan and Connell, 1990

Electronic-only World
• Recovery from corrupted network interface configuration (Java/Windows XP: minimize time to repair; shown after 95 failure episodes). Littman, Ravi, Fenson, Howard, 2004.
• Learning to sort fast. Littman & Lagoudakis.

Pure Math World
• backgammon (Tesauro)

Detailed Simulation
• Independently developed
• RARS video game
• elevator control (Crites, Barto)
• Robocup Simulator

Lab-created Simulation
• Car on the Hill
• Taxi World

The Plan: Talks, Panels

• Talk slot: 30 minutes; shoot for 25 minutes to leave time for switchover, questions, etc. Try plugging in during a break.
• Panel slot: 5 minutes per panelist (slides optional); will use the discussion time.

Friday, October 22nd, AM

9:00 Michael Littman, Introduction to Real-Life Reinforcement Learning
9:30 Darrin Bentivegna, Learning From Observation and Practice Using Primitives
10:00 Jan Peters, Learning Motor Primitives with Reinforcement Learning
10:30 break
11:00 Dave LeRoux, Instance-Based Reinforcement Learning on the Sony Aibo Robot
11:30 Bill Smart, Applying Reinforcement Learning to Real Robots: Problems and Possible Solutions
12:00 HUMAN-LEVEL AI PANEL, Roy
12:30 lunch break

Friday, October 22nd, PM

2:00 Andy Fagg, Learning Dexterous Manipulation Skills Using the Control Basis
2:30 Dan Stronger, Simultaneous Calibration of Action and Sensor Models on a Mobile Robot
3:00 Dieter Fox, Reinforcement Learning for Sensing Strategies
3:30 break
4:00 Roberto Santiago, What is Real Life? Using Simulation to Mature Reinforcement Learning
4:30 OTHER MODELS PANEL, Diuk, Greenwald, Lane
5:00 Gerry Tesauro, RL-Based Online Resource Allocation in Multi-Workload Computing Systems
5:30 session ends

Saturday, October 23rd, AM
Joint with Artificial Multi-Agent Learning

9:00 Drew Bagnell, Practical Policy Search
9:30 John Moody, Learning to Trade via Direct Reinforcement
10:00 Risto Miikkulainen, Learning Robust Control and Complex Behavior Through Neuroevolution
10:30 break
11:00 Michael Littman, Real Life Multiagent Reinforcement Learning
11:30 MULTIAGENT PANEL, Stone, Riedmiller, Moody, Bowling
12:00 HIERARCHY/STRUCTURED REPRESENTATIONS PANEL, Tadepalli, McGovern, Jong, Grounds
12:30 lunch break

Saturday, October 23rd, PM
Joint with Cognitive Robotics

2:00 Lisa Meeden, Self-Motivated, Task-Independent Reinforcement Learning for Robots
2:30 Marge Skubic and David Noelle, A Biologically Inspired Adaptive Working Memory for Robots
3:00 COGNITIVE ROBOTICS PANEL, Blank, Noelle, Booksbaum
3:30 break
4:00 Peggy Fidelman, Learning Ball Acquisition and Fast Quadrupedal Locomotion on a Physical Robot
4:30 John Langford, Real World Reinforcement Learning Theory
5:00 OTHER TOPICS PANEL, Abramson, Proper, Pineau
5:30 session ends

Sunday, October 24th, AM

9:00 Satinder Singh, RL for Human Level AI
9:30 Geoff Gordon, Learning Valid Predictive Representations
10:00 Yasutake Takahashi, Abstraction of State/Action based on State Value Function
10:30 break
11:00 Martin Riedmiller/Stephan Timmer, RL for technical process control
11:30 Matthew Taylor, Speeding Up Reinforcement Learning with Behavior Transfer
12:00 Discussion: Wrap Up, Future Plans
12:30 symposium ends

Plenary

Saturday (tomorrow) night, 6pm-7:30pm.
• Each symposium gets a 10-minute slot. Ours: video.
• I need today's speakers to join me for lunch and also immediately after the session today.

Darrin’s Summary

• extract features
• domain knowledge
• function approximators
• bootstrap learning/behavior transfer
• improve current skill
• learn skill initially using other methods
• start with low-level skills

What Next?

• Collect successes to point to
  – Contribute to newly created page: http://neuromancer.eecs.umich.edu/cgi-bin/twiki/view/Main/SuccessesOfRL
  – We’re already succeeding (ideas are spreading)
  – Rejoice: control theorists are scared of us
• Sources of information
  – This workshop web site: http://www.cs.rutgers.edu/~mlittman/rl3/rl2/
  – Will include pointers to slides and papers
  – Can include twiki links or a pointer from the RL repository
  – Michael is requesting slides / URLs / videos (up front)
  – Newly created Myth Page: http://neuromancer.eecs.umich.edu/cgi-bin/twiki/view/Main/MythsofRL

Other Activities

• Possible publication activities
  – special issue of a journal (JMLR? JAIR?)
  – edited book
  – other workshops
  – guidebook for newbies
  – textbook?
• Benchmarks
  – Upcoming NIPS workshop on benchmarks
  – We need to push for including real-life examples
  – Greater set of domains; make an effort to widen applications

Future Challenges

• How can we better talk about the inherent problem difficulty? Problem classes?
• Can we clarify the distinction between control theory and AI problems?
• Stress making sequential decisions (outside robotics as well).
• What about structure? Can we say more?
• Need to encourage a fresh perspective.
• Help convey how to see problems as RL problems.