Introduction to Real-Life Reinforcement Learning
Michael L. Littman
Rutgers University
Department of Computer Science

Brief History
Idea for the symposium came out of a discussion I had with Satinder Singh @ ICML 2003 (DC). Both of us were starting new labs. Wanted to highlight an important challenge in RL. Felt we could help create some momentum by bringing together like-minded researchers.
Attendees (Part I)
ABRAMSON, MYRIAM; BAGNELL, JAMES; BENTIVEGNA, DARRIN; BLANK, DOUGLAS; BOOKER, LASHON; DIUK, CARLOS; FAGG, ANDREW; FIDELMAN, PEGGY; FOX, DIETER; GORDON, GEOFFREY; GREENWALD, LLOYD; GROUNDS, MATTHEW; JONG, NICHOLAS; LANE, TERRAN; LANGFORD, JOHN; LEROUX, DAVE; LITTMAN, MICHAEL; MCGLOHON, MARY; MCGOVERN, AMY; MEEDEN, LISA

Attendees (More)
MIIKKULAINEN, RISTO; MUSLINER, DAVID; PETERS, JAN; PINEAU, JOELLE; PROPER, SCOTT
Definitions
What is “reinforcement learning”?
• Decision making driven to maximize a measurable performance objective.
What is “real life”?
• “Measured” experience. Data doesn’t come from a model with known or predefined properties/assumptions.

Multiple Lives
• Real-life learning (us): use real data, possibly small (even toy) problems
• Life-sized learning (Kaelbling): large state spaces, possibly artificial problems
• Life-long learning (Thrun): same learning system, different problems (somewhat orthogonal)
Find The Ball
Learn:
• which way to turn
• to minimize steps to see goal (ball)
• from camera input
• given experience.
[figure: example experience as a sequence of camera images, each paired with an action (e.g., right) and a reward (e.g., +1)]

The RL Problem
Input: ⟨s_1, a_1, r_1⟩, ⟨s_2, a_2, r_2⟩, …, s_t
Output: a_t's to maximize the discounted sum of r_i's.
Problem Formalization: MDP
Most popular formalization: Markov decision process
Assume:
• States/sensations, actions discrete.
• Transitions, rewards stationary and Markov.
• Transition function: Pr(s'|s,a) = T(s,a,s').
• Reward function: E[r|s,a] = R(s,a).
Then:
• Optimal policy π*(s) = argmax_a Q*(s,a)
• where Q*(s,a) = R(s,a) + γ Σ_s' T(s,a,s') max_a' Q*(s',a')

Find the Ball: MDP Version
• Actions: rotate left/right
• States: orientation
• Reward: +1 for facing ball, 0 otherwise
(A small value-iteration sketch of this MDP follows below.)
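To make the formalization concrete, here is a minimal sketch (not from the talk) that encodes a Find-the-Ball-style MDP and solves the Bellman optimality equation above by value iteration. The number of orientations, the discount factor, and the deterministic rotate dynamics are illustrative assumptions; with deterministic transitions the sum over s' collapses to the single successor.

```python
# Illustrative sketch only: a tiny Find-the-Ball MDP solved by value iteration on Q*.
# The discretization (8 orientations, ball at orientation 0) and gamma are assumed.
N = 8                      # assumed number of discrete orientations
ACTIONS = (-1, +1)         # rotate left / rotate right
GAMMA = 0.9                # assumed discount factor

def T(s, a):
    """Deterministic transition: rotating changes orientation by one step."""
    return (s + a) % N

def R(s, a):
    """Reward +1 when the resulting orientation faces the ball, 0 otherwise (assumed form)."""
    return 1.0 if T(s, a) == 0 else 0.0

# Value iteration: Q(s,a) <- R(s,a) + gamma * max_a' Q(T(s,a), a')
Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}
for _ in range(100):
    Q = {(s, a): R(s, a) + GAMMA * max(Q[(T(s, a), b)] for b in ACTIONS)
         for s in range(N) for a in ACTIONS}

# Greedy policy: which way to turn from each orientation
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N)}
print(policy)
```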
It Can Be Done: Q-learning
Since the optimal Q function is sufficient, use experience to estimate it (Watkins & Dayan 92).
Given ⟨s, a, r, s'⟩:
  Q(s,a) ← Q(s,a) + α_t ( r + γ max_a' Q(s',a') − Q(s,a) )
If:
• all (s,a) pairs updated infinitely often
• Pr(s'|s,a) = T(s,a,s'), E[r|s,a] = R(s,a)
• Σ_t α_t = ∞, Σ_t α_t² < ∞
Then: Q(s,a) → Q*(s,a)
(A tabular sketch of this update follows below.)
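As a companion to the update rule above, here is a minimal tabular Q-learning sketch. The epsilon-greedy exploration, the 1/visit-count step size, and the gym-style environment interface (reset/step returning (s', r, done)) are assumptions of this sketch, not part of the slide.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: Q(s,a) <- Q(s,a) + alpha_t (r + gamma max_a' Q(s',a') - Q(s,a)).
    Assumes env.reset() -> s and env.step(a) -> (s', r, done); both are assumed here."""
    Q = defaultdict(float)          # Q[(s, a)], defaults to 0
    visits = defaultdict(int)       # per-(s,a) visit counts for the step-size schedule
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration (one simple choice among many)
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            s2, r, done = env.step(a)
            visits[(s, a)] += 1
            alpha = 1.0 / visits[(s, a)]   # satisfies sum(alpha) = inf, sum(alpha^2) < inf
            target = r + gamma * max(Q[(s2, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```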
Real-Life Reinforcement Learning
Emphasize learning with real* data. Q-learning is good, but might not be right here…
Mismatches to the “Find the Ball” MDP:
• Efficient exploration: data is expensive
• Rich sensors: never see the same thing twice (see the sketch below)
• Aliasing: different states can look similar
• Non-stationarity: details change over time
* Or, if simulated, from simulators developed outside the AI community
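The “rich sensors” mismatch suggests moving beyond a table. Below is a minimal sketch of one standard workaround (linear function approximation for Q); the feature map phi and the parameter values are assumptions for illustration, not something the slides prescribe.

```python
import numpy as np

def linear_q_update(w, phi, s, a, r, s2, actions, alpha=0.05, gamma=0.9):
    """One semi-gradient Q-learning step with a linear approximator Q(s,a) = w . phi(s,a).
    phi maps (state, action) to a feature vector; it stands in for hand-built feature
    extraction and is an assumption of this sketch."""
    q_sa = w @ phi(s, a)
    q_next = max(w @ phi(s2, b) for b in actions)
    td_error = r + gamma * q_next - q_sa
    return w + alpha * td_error * phi(s, a)
```

Note that, unlike the tabular case, convergence is no longer guaranteed in general once off-policy updates are combined with function approximation.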
RL2: A Spectrum
• Unmodified physical world
• Controlled physical world
• Electronic-only world
• Pure math world
• Detailed simulation
• Lab-created simulation

Unmodified Physical World
• weight loss (BodyMedia)
• helicopter (Bagnell), in the gray zone between unmodified and controlled
Controlled Physical World
• Mahadevan and Connell, 1990

Electronic-only World
• Recovery from corrupted network interface configuration (Littman, Ravi, Fenson, Howard, 2004)
• Java/Windows XP: minimize time to repair
• [figure: repair performance after 95 failure episodes]

Pure Math World
• Learning to sort fast (Littman & Lagoudakis)

Detailed Simulation
• Independently developed
• backgammon (Tesauro)
• RARS video game
• elevator control (Crites, Barto)
• Robocup Simulator
Lab-created Simulation
• Car on the Hill
• Taxi World

The Plan
Talks, Panels
• Talk slot: 30 minutes; shoot for 25 minutes to leave time for switchover, questions, etc. Try plugging in during a break.
• Panel slot: 5 minutes per panelist (slides optional); will use the discussion time.
Friday, October 22nd, AM
9:00 Michael Littman, Introduction to Real-life Reinforcement-learning
9:30 Darrin Bentivegna, Learning From Observation and Practice Using Primitives
10:00 Jan Peters, Learning Motor Primitives with Reinforcement Learning
10:30 break
11:00 Dave LeRoux, Instance-Based Reinforcement Learning on the Sony Aibo Robot
11:30 Bill Smart, Applying Reinforcement Learning to Real Robots: Problems and Possible Solutions
12:00 HUMAN-LEVEL AI PANEL, Roy
12:30 lunch break

Friday, October 22nd, PM
2:00 Andy Fagg, Learning Dexterous Manipulation Skills Using the Control Basis
2:30 Dan Stronger, Simultaneous Calibration of Action and Sensor Models on a Mobile Robot
3:00 Dieter Fox, Reinforcement Learning for Sensing Strategies
3:30 break
4:00 Roberto Santiago, What is Real Life? Using Simulation to Mature Reinforcement Learning
4:30 OTHER MODELS PANEL, Diuk, Greenwald, Lane
5:00 Gerry Tesauro, RL-Based Online Resource Allocation in Multi-Workload Computing Systems
5:30 session ends

Saturday, October 23rd, AM (joint with Artificial Multi-Agent Learning)
9:00 Drew Bagnell, Practical Policy Search
9:30 John Moody, Learning to Trade via Direct Reinforcement
10:00 Risto Miikkulainen, Learning Robust Control and Complex Behavior Through Neuroevolution
10:30 break
11:00 Michael Littman, Real Life Multiagent Reinforcement Learning
11:30 MULTIAGENT PANEL, Stone, Riedmiller, Moody, Bowling
12:00 HIERARCHY/STRUCTURED REPRESENTATIONS PANEL, Tadepalli, McGovern, Jong, Grounds
12:30 lunch break

Saturday, October 23rd, PM (joint with Cognitive Robotics)
2:00 Lisa Meeden, Self-Motivated, Task-Independent Reinforcement Learning for Robots
2:30 Marge Skubic and David Noelle, A Biologically Inspired Adaptive Working Memory for Robots
3:00 COGNITIVE ROBOTICS PANEL, Blank, Noelle, Booksbaum
3:30 break
4:00 Peggy Fidelman, Learning Ball Acquisition and Fast Quadrupedal Locomotion on a Physical Robot
4:30 John Langford, Real World Reinforcement Learning Theory
5:00 OTHER TOPICS PANEL, Abramson, Proper, Pineau
5:30 session ends

Sunday, October 24th, AM
9:00 Satinder Singh, RL for Human Level AI
9:30 Geoff Gordon, Learning Valid Predictive Representations
10:00 Yasutake Takahashi, Abstraction of State/Action based on State Value Function
10:30 break
11:00 Martin Riedmiller/Stephan Timmer, RL for technical process control
11:30 Matthew Taylor, Speeding Up Reinforcement Learning with Behavior Transfer
12:00 Discussion: Wrap Up, Future Plans
12:30 symposium ends

Plenary
Saturday (tomorrow) night, 6pm-7:30pm
• Each symposium gets a 10-minute slot. Ours: video.
• I need today’s speakers to join me for lunch and also immediately after the session today.
Darrin’s Summary
• extract features
• domain knowledge
• function approximators
• bootstrap learning/behavior transfer
• improve current skill
• learn skill initially using other methods
• start with low-level skills

What Next?
• Collect successes to point to
  – Contribute to newly created page: http://neuromancer.eecs.umich.edu/cgibin/twiki/view/Main/SuccessesOfRL
  – We’re already succeeding (ideas are spreading)
  – Rejoice: control theorists are scared of us
• Sources of information
  – This workshop web site: http://www.cs.rutgers.edu/~mlittman/rl3/rl2/
  – Will include pointers to slides, papers
  – Can include twiki links or a pointer from the RL repository
  – Michael requesting slides / URLs / videos (up front)
  – Newly created Myth Page: http://neuromancer.eecs.umich.edu/cgibin/twiki/view/Main/MythsofRL
Other Activities
• Possible publication activities
  – special issue of a journal (JMLR? JAIR?)
  – edited book
  – other workshops
  – guidebook for newbies
  – textbook?
• Benchmarks
  – Upcoming NIPS workshop on benchmarks
  – We need to push for including real-life examples
  – Greater set of domains; make an effort to widen applications

Future Challenges
• How can we better talk about the inherent problem difficulty? Problem classes?
• Can we clarify the distinction between control theory and AI problems?
• Stress making sequential decisions (outside robotics as well).
• What about structure? Can we say more?
• Need to encourage a fresh perspective.
• Help convey how to see problems as RL problems.