Reinforcement Learning and Visual Object Recognition
Lucas Paletta
Computer Vision Group, Institute for Computer Graphics and Vision, Technical University Graz
Münzgrabenstraße 11, A-8010 Graz, Austria
Email:
[email protected]
http://www.icg.tu-graz.ac.at/lpaletta
1 Introduction
This presentation provides an introduction to reinforcement learning methods and a proposal for dissertation work on using this concept for visual object recognition (figure 1). The first section of the presentation (figure 2) is concerned with the theoretical foundations of Markov decision problems (MDPs). Two different solutions are considered, dynamic programming and reinforcement learning. Object recognition is then described in the context of MDPs, and a summary stresses the most important ideas. While focusing on methods to find optimal solutions of sequential decision problems in the framework of MDPs (figure 3), several applications are presented, current research is discussed, and the core ideas of the proposal are described.

2 Markov Decision Process
2.1 Policies

Imagine a mobile robot with the task to find a battery charger in an office (figure 4). Assume that decisions for aiming at certain directions are based on visual information. At each time step t, a visual pattern, i.e. an image of brightness values, is captured from a camera mounted on the platform of the robot (figure 5). It is assigned an entry of a lookup table representing the states xi on the way to the goal. The diagram to the right illustrates all possible states of the task by pictorial cells; the adjacency of cells can be interpreted as temporal vicinity in visiting the corresponding states. When the robot occupies a particular state, it has the choice between different actions a (figure 6), i.e. transitions according to the four directions to states in the neighborhood: the north (aN), south (aS), east (aE) or west (aW) state. The set of fixed decisions for every choice of actions, i.e. in each state, is called a policy π. Hence π is a mapping from the set of states X to the set of actions A, π : X → A. The optimal strategy (red) leads to the goal by the shortest path, whereas a suboptimal one (blue) needs an additional number of actions to attain it.
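As a minimal sketch of the policy concept, assume a 3×3 grid with the goal in one corner; the layout, state names and the particular policy below are illustrative, not the office task itself:

```python
# A deterministic grid-world policy pi: X -> A as a plain mapping.
# Grid layout and action names are illustrative assumptions.

GOAL = (2, 2)
ACTIONS = {"aN": (-1, 0), "aS": (1, 0), "aE": (0, 1), "aW": (0, -1)}

# A policy assigns exactly one fixed action to each non-goal state.
policy = {
    (0, 0): "aE", (0, 1): "aE", (0, 2): "aS",
    (1, 0): "aE", (1, 1): "aE", (1, 2): "aS",
    (2, 0): "aE", (2, 1): "aE",
}

def rollout(policy, start, goal=GOAL, max_steps=100):
    """Follow the policy from `start`; return the number of steps to the goal."""
    state, steps = start, 0
    while state != goal and steps < max_steps:
        dr, dc = ACTIONS[policy[state]]
        state = (state[0] + dr, state[1] + dc)
        steps += 1
    return steps

print(rollout(policy, (0, 0)))  # 4: the shortest path from the far corner
```

A suboptimal policy would simply be another mapping whose rollouts need more steps to attain the same goal.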
2.2 Mathematical Background
A formal description of an MDP (figure 7) consists of the set of possible states X and the set of possible actions A, together with the transition function δ describing the change from state xi to state xj when executing action a. In deterministic MDPs the subsequent state is reached with probability 1, whereas in nondeterministic MDPs, i.e. the more interesting ones, the transition is probabilistic according to a distribution over the following states. With each action a, the decision-maker, i.e. the agent, receives a payoff or reward r, which contributes to the definition of a utility function. An optimal strategy can be found by means of a value function (figure 8). It is defined for every state as the cumulative reward received in subsequent steps until attaining the task goal when following a particular decision strategy π. There exist tasks that require optimization of a behavior instead of searching a path to the goal. For this purpose, a discount factor γ is introduced to exponentially decrease the contribution of future rewards so as to keep the resulting sum finite. In nondeterministic MDPs, one computes the expected sum over all possible successor states.

How can an optimal strategy be retrieved (figure 9)? A first solution, dynamic programming, requires knowledge of all rewards r and transition descriptions δ. When starting from an arbitrary state, neither the number of steps nor the path to the goal is known in advance. Fortunately it suffices to optimize the immediate next step, so a global solution is found by recursion: the value of a state xt is described by the value of the successor state xt+1 plus the reward r received during the transition to the next state. If we already knew the optimal values V*, i.e. the cumulative rewards of the optimal future action sequence, we could perform an optimal strategy by selecting the action which maximizes the sum of the reward and the successor value, i.e. the action that promises maximum reward. The optimal value function is computed using the Bellman equation, a recurrence relation from dynamic programming: starting with an arbitrary estimate Vk, the value function is recursively updated until it eventually converges to V*, the optimal value function.

3 Solving the MDP
3.1 Dynamic Programming
Dynamic programming methods (figure 10) are preferably applied to problems that possess optimal substructure, i.e. global solutions are recursively computed from solutions of specified subproblems. In contrast to divide-and-conquer methods, they take advantage of solutions of commonly shared subproblems, so that these solutions are computed once but can be used multiple times thereafter. Dynamic programming provides a solution to the MDP (figure 9) if only the values V* are unknown; the optimal strategy π* is then derived from them.
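The recursive Bellman backup can be sketched as value iteration on an assumed 3×3 grid world with reward −1 per move and discount γ = 1; grid size, rewards and iteration count are illustrative choices, not values from the text:

```python
# Value iteration: V_k(x) = max_a [ r(x,a) + gamma * V_{k-1}(delta(x,a)) ]
# on a small deterministic grid world (assumed layout, reward -1 per move).

GOAL = (2, 2)
SIZE = 3
ACTIONS = [(-1, 0), (1, 0), (0, 1), (0, -1)]  # N, S, E, W
GAMMA = 1.0

def step(state, action):
    """Deterministic transition delta(x, a); off-grid moves leave x unchanged."""
    r, c = state[0] + action[0], state[1] + action[1]
    return (r, c) if 0 <= r < SIZE and 0 <= c < SIZE else state

states = [(r, c) for r in range(SIZE) for c in range(SIZE)]
V = {s: 0.0 for s in states}

# Repeat the Bellman backup until the values stop changing.
for _ in range(50):
    V_new = {}
    for s in states:
        if s == GOAL:
            V_new[s] = 0.0
        else:
            V_new[s] = max(-1.0 + GAMMA * V[step(s, a)] for a in ACTIONS)
    if V_new == V:
        break
    V = V_new

print(V[(0, 0)])  # -4.0: four steps from the far corner to the goal
```

Each state's value converges to the negated length of its shortest path, and the optimal policy is read off by choosing the action with the best backup in each state.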
3.2 Reinforcement Learning
In most applications, the parameters r and δ are initially unknown and thus have to be learned. In the framework of reinforcement learning (figure 11), a so-called agent computes these quantities by statistical evaluation of its experience while executing its task. It tries different actions, observes their consequences and adjusts its strategy accordingly, e.g. in correlation to deviations from its expectations. A second solution to an MDP is thus provided by temporal difference (TD) learning (figure 12), a particular method of reinforcement learning. A consistent value function should obey the consistency condition described above (figure 9). Starting with an arbitrary estimator Vn of the value function, there may result an error Δ caused by a deviation from the consistency equation. The quantities r and V will not be known in advance, thus an estimate Δ̂ of this error is used instead; it contributes to the update of the current values of the value function, and the estimator provably converges to the optimal values V* [12]. In figure 13, the diagrams illustrate results of the reinforcement learning process in a simple application. Top left, the cells depict the optimal values of the corresponding states. Top right, the optimal action for each state is shown by arrows pointing to the successor state; starting from any state, the arrows guide the agent to the goal of the task. The diagram bottom left represents the learning effect by plotting the decreasing lengths of trials, i.e. the number of steps to the goal, while the estimators of the value function are continuously updated. The diagram bottom right shows the optimal strategy when some transitions are not permitted, i.e. the case of an obstacle, which changes the policy for some of the states. Current research in reinforcement learning is focused on finding universal function approximators (figure 14) for estimating the value function [3].
To date, rigorous convergence proofs exist only for lookup-table approaches, although several successful implementations of generalizing estimators are reported in the literature [14, 3, 7]. Another issue is to balance exploration and exploitation: the state space has to be explored to register the payoffs received by executing actions; to avoid an exhaustive search over the state space, this discovery should be efficient, visiting only those states that are needed for a sufficiently precise estimate. The knowledge about these state transitions can be exploited to define a strategy, which is evaluated in turn. Multi-agent learning deals with the communication and organization of sets of agents performing subtasks in a hierarchy of goals.
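A sketch combining the temporal-difference update with ε-greedy exploration, using tabular Q-learning (a close relative of the V-based TD rule above, swapped in here because it learns optimal values from exploratory behavior) on an assumed 3×3 grid world; all parameter values are illustrative:

```python
# Tabular Q-learning with epsilon-greedy exploration on an assumed grid world.
# The TD error (target - Q) mirrors the delta-hat estimator in the text.
import random

random.seed(0)
GOAL, SIZE, GAMMA, ALPHA, EPS = (2, 2), 3, 1.0, 0.5, 0.2
ACTIONS = [(-1, 0), (1, 0), (0, 1), (0, -1)]  # N, S, E, W

def step(state, action):
    """Deterministic transition; moves off the grid leave the state unchanged."""
    r, c = state[0] + action[0], state[1] + action[1]
    return (r, c) if 0 <= r < SIZE and 0 <= c < SIZE else state

Q = {((r, c), a): 0.0 for r in range(SIZE) for c in range(SIZE) for a in ACTIONS}

for _ in range(500):                                # learning trials
    x = (0, 0)
    while x != GOAL:
        if random.random() < EPS:                   # explore ...
            a = random.choice(ACTIONS)
        else:                                       # ... or exploit current estimates
            a = max(ACTIONS, key=lambda b: Q[(x, b)])
        y = step(x, a)
        target = -1.0 + GAMMA * max(Q[(y, b)] for b in ACTIONS)
        Q[(x, a)] += ALPHA * (target - Q[(x, a)])   # temporal-difference update
        x = y

V = {s: max(Q[(s, a)] for a in ACTIONS) for s, _ in Q}
print(round(V[(0, 0)], 1))  # approaches -4.0, the optimal value from the far corner
```

The decreasing trial lengths described for figure 13 emerge here as well: early trials wander, later trials follow the greedy policy almost directly to the goal.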
3.3 Applications of Reinforcement Learning
The most cited application (figure 15) using temporal difference learning methods is TD-Gammon, a neural network learning to play the game of backgammon [13]. The network achieves master-level play and has won against the best human players in the world. Another important example is the elevator dispatcher for use in skyscrapers [7]. In computer vision, the breakthrough of reinforcement learning methods has not been achieved so far. Draper [8, 9] describes the optimal assembly of visual procedures by reinforcement learning, and Peng [10] finds the best parameter setting for color image segmentation. Bandera et al. [5] use the framework to find saccade sequences of shortest length for the purpose of 2-D object recognition. The most important research labs currently working on this topic are the Center for Visual Science at Rochester University (Ballard, Whitehead), the Center for Automated Learning and Discovery at Carnegie Mellon University (Thrun, Davies), the University of California (Peng), the University of Massachusetts at Amherst (Sutton, Draper), MIT (Singh, Jordan), etc.

4 Object Recognition
4.1 Recognition Process
The theory of reinforcement learning is now applied to visual object recognition (figure 16). Object recognition is the task of classifying a certain pattern as an instance of an object class out of a database of known objects. In many cases, interpretation of single 2-D patterns does not suffice for a confident decision, thus the information from multiple views should be integrated to achieve an improved global classification. Object recognition in this context induces the task of attaining a most reliable decision with minimal time costs. The decision process is defined on the basis of visual information, i.e. for each 2-D view we are interested in finding the action that provides access to the most discriminative next view. The dynamics of the recognition process (figure 17) emerges from the interpretation of a sequence of subsequent 2-D views. The visual patterns induce corresponding probability distributions on the object hypotheses. A distribution which integrates the information of all previous, local hypotheses is computed by information fusion. In parallel, the information from the sequence of visual patterns is fused to a sequence of recognition states which reflect the perceptual progress during a trial. If the resulting object hypotheses attain sufficient confidence, the agent reaches the goal of the task. Reinforcement learning then provides a mapping from recognition states to actions, e.g. camera movements, evaluated by the corresponding increase in the confidence of the object hypotheses. An optimal strategy selects exactly those actions that directly lead to the goal, represented by a predefined level of entropy in the posterior distribution of object hypotheses. Reinforcement learning not only finds the optimal actions but actually learns a resulting mapping which is performed as reactive behavior by autonomous systems.
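The fusion step and the entropy-based confidence criterion can be sketched as follows; the per-view distributions over three object hypotheses and the entropy threshold are invented for illustration, and Bayesian product fusion is one plausible instance of the information fusion the text leaves unspecified:

```python
# Sequential fusion of per-view hypothesis distributions p(Oi|x);
# the entropy of the fused posterior serves as the stopping criterion.
import math

def normalize(p):
    z = sum(p)
    return [v / z for v in p]

def fuse(prior, likelihood):
    """Multiply the running posterior with the new view's distribution."""
    return normalize([a * b for a, b in zip(prior, likelihood)])

def entropy(p):
    return -sum(v * math.log(v) for v in p if v > 0)

views = [[0.4, 0.35, 0.25],     # an ambiguous first view
         [0.5, 0.2, 0.3],
         [0.7, 0.2, 0.1]]       # a discriminative view

posterior = [1 / 3] * 3          # uniform prior over 3 object hypotheses
for p_view in views:
    posterior = fuse(posterior, p_view)
    print([round(v, 3) for v in posterior], round(entropy(posterior), 3))
    if entropy(posterior) < 0.5:  # goal: sufficiently confident hypothesis
        break
```

Each fused view sharpens the posterior; the trial ends as soon as the entropy drops below the predefined confidence level, which is exactly the goal state of the recognition MDP.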
Necessary prerequisites (figure 18) for implementing reinforcement learning methods at the ICG are the Active Vision Laboratory, which enables visual experiments with multiple degrees of freedom (control of illumination, controlled rotation and translation of the objects, etc.), and the theory on the integration of sensor information provided by the results of the active fusion research group.
4.2 Objectives
The following objectives (figure 19) are identified to describe a project on applying reinforcement methods to automated object recognition. Learning optimal fusion strategies: by reinforcement learning, the optimal action sequence can be found. Perception is fused by statistical inference, and the recognition system becomes adaptive to changes in the probabilistic environment. Eventually the reactive system performs optimal control in real time. Learning selective perception: for large object databases, the size of the state space explodes, thus reduction by clustering techniques or extraction of discriminative features should improve the scaling of the method. Scene exploration: a complex scene, consisting of a set of objects, should be interpreted by a complex behavioral architecture. The goal is to make the system learn to interact with the environment on a complex level. One important problem to face is occlusion. Hence particular reinforcement methods should be exploited or even developed to structure the strategic concept.
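The state-space reduction by clustering mentioned under learning selective perception could, for instance, be sketched with a plain k-means pass over view features; the data, feature dimension and cluster count below are invented for illustration:

```python
# Hypothetical state-space reduction: cluster view feature vectors so that
# many similar views share one recognition state. All data is synthetic.
import random

random.seed(0)

def kmeans(points, k, iters=20):
    """Plain k-means: return one centroid per cluster."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids

# 100 two-dimensional "view features" scattered around two viewpoints
views = [(random.gauss(0, 0.1), random.gauss(0, 0.1)) for _ in range(50)] + \
        [(random.gauss(3, 0.1), random.gauss(3, 0.1)) for _ in range(50)]
states = kmeans(views, k=2)   # 100 raw views collapse to 2 recognition states
print(len(states))
```

Instead of one lookup-table entry per raw view, the recognition MDP then operates on the much smaller set of cluster labels.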
We now answer the following four important questions (figure 20). 1. What is the original contribution of the work? The intention is to apply global optimization to the task of object recognition, in contrast to heuristic assumptions about the reasoning. Thus a global evaluation function contributes to the emergence of a complex behavioral architecture which is in accordance with the purpose of the task. Reinforcement learning has not yet been implemented for three-dimensional object recognition, though there exist valuable frameworks for dealing with 2-D objects [5] from which some ideas can be transferred. 2. Why is the work important? We follow the paradigm of purposive vision [1, 4], i.e. to use the representations and methods that are necessary to perform the task at hand, without constructing a general-purpose system. The objective parameter, i.e. the reward, plays the role of evaluating computational structures for the purpose of discrimination between the different object models. MDPs provide a mathematical framework to outline the problems on a quantitative basis. 3. What is the most related work? Learning to recognize objects has already been outlined in the broad framework of aspect representations [11]. Sequential recognition minimizing perceptual entropy measures for the special case of three-dimensional, predefined geometric shape models is described in [6, 2] in the framework of active recognition. Reinforcement learning was used to find optimal saccade sequences in 2-D object recognition [5]. To the knowledge of the author, no work has been done so far on (1) optimal recognition of (2) arbitrary 3-D objects (3) from appearance. 4. Who benefits from this work? Time-critical systems depend on minimal execution time, which is guaranteed by the proposed methods. Once the optimal strategy is learned, the system follows a mapping from perceptual states to actions that promise processing of the most discriminating features.
Reinforcement learning not only finds the most distinguishing views by incorporating knowledge about future payoffs into the decision, but also learns the mapping, i.e. the policy, for automatic recognition without any reasoning. The mathematical framework should provide further insight into the mechanisms underlying object recognition.

5 Conclusion
The Markov decision task (figure 21) was described as a fundamental problem class of object recognition, while reinforcement learning was outlined as an efficient tool to find an optimal strategy without having a model of the environment. Object recognition is thus a decision process where the described mathematical framework enables the acquisition of optimal fusion strategies. Current work (figure 22) is focused on finding an optimal strategy for discriminating wire models. After a preprocessing stage of background subtraction, edge detection and normalization in brightness and scale, the digital image, which is considered a high-dimensional vector of pixel brightness values, is projected onto a low-dimensional eigenspace by principal component analysis (PCA). The eigenspace representation of the object is interpreted probabilistically by a radial basis function (RBF) network which performs classification by a conditional distribution on the object hypotheses. The information of each perception is fused to an integrated probability distribution from which an entropy is computed. The loss of entropy between two subsequent distributions is used by reinforcement methods to reinforce actions that lead to more discriminative views. The actions considered are rotations of a turn-table by shifts of ±k · 30°.

References

1. Y. Aloimonos. Purposive and qualitative active vision. In International Conference on Pattern Recognition, pages 346-360, 1990.
2. T. Arbel and F. P. Ferrie. Informative views and sequential recognition. In European Conference on Computer Vision, pages 469-481, 1996.
3. L. Baird. Residual algorithms: Reinforcement learning with function approximation. In 12th International Conference on Machine Learning, pages 30-37, 1995.
4. D. H. Ballard and C. H. Brown. Principles of animate vision. CVGIP: Image Understanding, 56(1):3-21, 1992.
5. C. Bandera, F. J. Vico, J. M. Bravo, M. E. Harmon, and L. C. Baird III. Residual Q-learning applied to visual attention. In 13th International Conference on Machine Learning, pages 20-27, 1996.
6. F. G. Callari and F. P. Ferrie. Autonomous recognition: driven by ambiguity. In Conference on Computer Vision and Pattern Recognition, pages 701-707, 1996.
7. R. H. Crites and A. G. Barto. Improving elevator performance using reinforcement learning. In Advances in Neural Information Processing Systems, volume 8, pages 1017-1023. The MIT Press, 1996.
8. B. A. Draper. Learning grouping strategies for 2D and 3D object recognition. In Proceedings ARPA Image Understanding Workshop, pages 1447-1454, 1996.
9. B. A. Draper. Learning control strategies for object recognition. In K. Ikeuchi and M. Veloso, editors, Symbolic Visual Learning, chapter 3, pages 49-76. Oxford University Press, New York, 1997.
10. J. Peng and B. Bhanu. Closed-loop object recognition using reinforcement learning. In Conference on Computer Vision and Pattern Recognition, pages 538-543, 1996.
11. M. Seibert and A. M. Waxman. Adaptive 3-D object recognition from multiple views. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):107-124, 1992.
12. R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44, 1988.
13. G. Tesauro. Practical issues in temporal difference learning. Machine Learning, 8:257-277, 1992.
14. G. Tesauro. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6(2):215-219, 1994.
Figure 1: Title slide — Reinforcement Learning and Visual Object Recognition (Lucas Paletta).
Figure 2: Overview — Markov decision process, dynamic programming, reinforcement learning, object recognition, summary.
Figure 3: Benefits — solving sequential decision tasks, reinforcement learning applications, dissertation proposal.
Figure 4: Robot task.
Figure 5: Visual state space — camera images are assigned to entries xi of a lookup table of task states, with the goal state marked.
Figure 6: Policies — states x0, x1, …, xG; actions a0, a1, …, aG−1 chosen among aN, aW, aS, aE; rewards r0, r1, …, rG−1; policy π : X → A, a = π(x); optimal policy π*.
Figure 7: Markov decision process — MDP Θ = {X, A, δ, r}, with X = {x0, x1, …, xG}, A = {aN, aW, aS, aE}, δ(xi, a) = xj, r(xi, a) = r; a deterministic transition reaches xj with probability 1.0, a nondeterministic transition distributes probability over successors xj1, xj2, xj3 (e.g. 0.7, 0.2, 0.1).
Figure 8: Value function — deterministic case V^π(x) = r_t = r_{t+1} + r_{t+2} + r_{t+3} + … + r_G; discounted case r_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … = Σ_{k=0}^∞ γ^k r_{t+k+1}; nondeterministic case V^π(x) = E{r_t | x} = E{Σ_{k=0}^∞ γ^k r_{t+k+1}}; illustrated on a grid world with r = −1 per step, comparing values under the optimal policy π* and a random policy π_rand.
Figure 9: Solution 1, dynamic programming — optimal policy π*(x) = arg max_a (r_{t+1} + γ V*(y)); Bellman optimality equation (Bellman 61) V*(x) = r_{t+1} + γ V*(y); value iteration V_k(x) = max_a (r_{t+1} + γ V_{k−1}(y)), e.g. V(x1) = r2 + r3 + r4 + ….
Figure 10: Dynamic programming — optimal substructure, overlapping subproblems, recursive solution; global dynamic programming contrasted with local divide and conquer.
Figure 11: Reinforcement learning — agent-environment loop: the agent holds π and V, the environment holds δ and r; the agent emits an action a and receives back the state x and reward r.
Figure 12: Solution 2, temporal difference (TD) learning — estimation error Δ̂ = [r_{t+1} + γ V_n(y)] − V_n(x) for the temporal difference between x_t and x_{t+1} = y; convergence proved by Sutton 88.
Figure 13: Demo, shortest path — (A) learned value function and greedy policy on a grid world; (B) the number of transitions per trial decreases over about 200 trials, and the policy adapts when an obstacle blocks some transitions.
Figure 14: Research in reinforcement learning — generalization in state space by function approximation V(xi) = f(φj(xi), Wj); exploration strategies (exploration vs. exploitation); multi-agent learning (subtasks, cooperation).
Figure 15: Applications — general: TD-Gammon (Tesauro 92), elevator dispatching (Crites/Barto 96); computer vision: assembling visual procedures (Draper 96), image segmentation (Peng/Bhanu 96), visual attention (Bandera et al. 96); research labs: Rochester University (perception), Carnegie Mellon University (robotics), Univ. of Calif., Univ. of Mass., MIT, etc.
Figure 16: Dissertation proposal — reinforcement learning in visual object recognition; object recognition as optimization task: confident object hypotheses, active vision, decision process.
Figure 17: Recognition process — visual patterns x0, x1, …, xn induce local object hypotheses h0, h1, … as distributions p(Oi|x); information fusion integrates them into fused hypotheses H1, …; the resulting recognition states and rewards r1, r2, …, rn drive the decision process.
Figure 18: Prerequisites — Active Vision Lab: xz-table, controlled illumination, camera on pan-tilt head, turn-table; Active Fusion (FWF Task 3.1): image understanding (Pinz 92), probability theory (Prantl 95), evidence theory (Ganster 96), fuzzy control (Borotschnig 96).
Figure 19: Objectives — learning optimal fusion strategies: optimal action sequence, task-dependent adaptation, reactive system; learning selective perception: reduced state space, extraction of discriminative features; scene exploration: occlusion detection, control of scene modeling. States: information fusion, parameter set. Actions: camera motion, illumination, camera parameters, processing parameters. Rewards/costs: confidence measure, illumination.
Figure 20: Why? — 1. original contribution: global optimization, reinforcement learning in object recognition; 2. importance: task-specific optimization, mathematical framework for object recognition; 3. related work: Seibert/Waxman 92, Callari/Arbel/Ferrie 96, Bandera et al. 96; 4. benefits: applications in time-critical systems, improved object recognition systems.
Figure 21: Summary — Markov decision process, reinforcement learning, object recognition process, learning optimal fusion strategies.
Figure 22: Current work — processing pipeline: background subtraction, Canny edge detection, normalization, PCA, RBF classification into object hypotheses p(Oi|e), information fusion, entropy, reward r to an RL controller selecting views ±k·30° on the turn-table; assumptions: vertical rotation, constant illumination, constant distance to object, sensor noise model.