Description and Acquirement of Macro-Actions in Reinforcement Learning

Takeshi Yoshikawa
Yuki Kanazawa
Masahito Kurihara

Graduate School of Information Science and Technology, Hokkaido University,
Kita 14, Nishi 9, Kita-ku, Sapporo, 060-0814, Japan
Email: [email protected]

Fuji Xerox Co., Ltd.

Graduate School of Information Science and Technology, Hokkaido University,
Kita 14, Nishi 9, Kita-ku, Sapporo, 060-0814, Japan
Email: [email protected]
Abstract— Reinforcement learning is a framework that enables agents to learn from interaction with their environments. Research has generally focused on Markov decision process (MDP) domains, but real-world domains may be non-Markovian. In this paper, we develop a new description of macro-actions for non-Markov decision process (NMDP) domains in reinforcement learning. A macro-action is an action control structure that lets an agent apply a collection of related microscopic actions as a single action unit. We also propose a method for dynamically acquiring macro-actions from the experiences of agents during the reinforcement learning process.

I. INTRODUCTION

Reinforcement learning is a learning method that does not rely on supervised examples and has been investigated in control systems, multi-agent systems, and other areas. With this method, an agent can acquire a policy for its actions from interaction with its environment.

In this paper, we develop a new description of macro-actions. A macro-action is a group or sequence of actions, an action control structure that provides an agent with control. McGovern [4] investigated the properties of macro-actions experimentally. Randlov [7] defined a macro-action as a mapping from actions to actions. We describe a macro-action based on the basic control structures of structured programming, which gives macro-actions rich expressiveness. Moreover, we propose a method for dynamically acquiring macro-actions from the experiences gathered during the reinforcement learning process, and we show experimentally that the proposed method is effective.

II. REINFORCEMENT LEARNING WITH MACRO-ACTIONS

Fig. 1. Framework of reinforcement learning with macro-actions (agent, environment, experience memory, macro-actions; observation, reward, action; acquire, select, action control).
We show the framework of reinforcement learning with macro-actions in Fig. 1. Let O be a finite set of observations and A a finite set of actions taken by each agent. We denote by A(o) the set of actions available at observation o ∈ O; then A = ∪_{o∈O} A(o). At each time step t = 1, 2, . . ., the agent receives an observation o_t ∈ O and takes an action a_t ∈ A(o_t). After one time step, the agent receives a reward r_{t+1} ∈ ℜ.

Let u_1, u_2, . . . be the time steps at which the agent selects a macro-action. We denote by M the set of macro-actions and by M(o) the set of macro-actions that the agent can select at observation o ∈ O; then M = ∪_{o∈O} M(o). In this framework, the policy π by which each agent selects a macro-action is

π : O × M → [0, 1],   (1)

where, for any observation o, Σ_{m∈M(o)} π(o, m) = 1. The aim of each agent is to maximize the return

R_{u_i} = r_{u_i+1} + γ r_{u_i+2} + γ^2 r_{u_i+3} + · · · = Σ_{k=0}^{∞} γ^k r_{u_i+k+1},   (2)

that is, to acquire an appropriate policy, where γ ∈ [0, 1) denotes the discount rate. The evaluation function is the following macro-action value function Q^π(o, m) for observation o, macro-action m, and policy π:

Q^π(o, m) = E{R_u | o_u = o, m_u = m, π}.   (3)

In this paper, we update the value function by the Sarsa(λ) method:

Q_{u_{i+1}}(o, m) = Q_{u_i}(o, m) + α δ_{u_i} e_{u_i}(o, m),   (4)

where

δ_{u_i} = r_{u_i+1} + γ r_{u_i+2} + γ^2 r_{u_i+3} + · · · + γ^{u_{i+1}−u_i} Q_{u_i}(o_{u_{i+1}}, m_{u_{i+1}}) − Q_{u_i}(o_{u_i}, m_{u_i}),

e_{u_i}(o, m) = 1 if o = o_{u_i} and m = m_{u_i};  0 if o = o_{u_i} and m ≠ m_{u_i};  γλ e_{u_{i−1}}(o, m) if o ≠ o_{u_i},

and α ∈ [0, 1) denotes the learning rate. We name this scheme Macro/Sarsa(λ).
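As a concrete illustration, the following is a minimal Python sketch of the update (4) with the eligibility traces defined above, assuming a tabular value function indexed by (observation, macro-action) pairs; the names Q, e, and update_macro_sarsa are illustrative and not taken from the paper.

    # Minimal sketch of the Macro/Sarsa(lambda) value update (4), assuming a
    # tabular Q and the eligibility-trace definition given above.
    # All names here (Q, e, update_macro_sarsa, ...) are illustrative.
    from collections import defaultdict

    Q = defaultdict(float)  # Q[(o, m)]: macro-action value
    e = defaultdict(float)  # e[(o, m)]: eligibility trace

    def update_macro_sarsa(o, m, o_next, m_next, discounted_reward, k,
                           alpha=0.9, gamma=0.9, lam=0.2):
        """One update at a macro-action boundary.

        discounted_reward: r_{u_i+1} + gamma*r_{u_i+2} + ... collected while m ran
        k: number of primitive time steps the macro-action m lasted
        """
        delta = discounted_reward + (gamma ** k) * Q[(o_next, m_next)] - Q[(o, m)]

        # Set the trace of the pair that was actually taken, reset the other
        # macro-actions at the same observation, and decay the rest by gamma*lambda.
        for (obs, mac) in list(e.keys()):
            if obs == o:
                e[(obs, mac)] = 1.0 if mac == m else 0.0
            else:
                e[(obs, mac)] *= gamma * lam
        e[(o, m)] = 1.0

        # Update every visited pair in proportion to its eligibility.
        for key, trace in e.items():
            Q[key] += alpha * delta * trace

The whole discounted reward collected during the macro-action is passed in at once, so the update is applied only at macro-action boundaries, as in (4).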
Fig. 2. Macro-actions: (a) a single macro-action (a root node labeled with an action a); (b) a complex macro-action (a root macro-action m_0 with children m_1, . . . , m_n on edges labeled o_1, . . . , o_n).
III. MACRO-ACTIONS

A. Definition

A macro-action m for observation o is defined as the following labeled tree (Fig. 2):
1) single macro-action: a tree that consists of a single root node labeled with an action a ∈ A(o);
2) complex macro-action: a tree whose root node is labeled with a macro-action m_0 ∈ M(o) for observation o and whose leaf nodes are labeled with macro-actions m_k ∈ M(o_k) for observations o_k (k = 1, 2, . . . , n; n > 0; o_k ≠ o). The edge (m_0, m_k) is labeled by o_k.

We define the length |m| of a macro-action m as follows:
1) if m is a single macro-action, |m| = 1;
2) if m is a complex macro-action, |m| = |m_0| + max_k |m_k|.
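For illustration, a macro-action tree as defined above can be represented, for example, by the following minimal Python sketch; the class name MacroAction and its attributes are hypothetical and not part of the paper.

    # Minimal sketch of the macro-action labeled tree defined above.
    # A single macro-action holds one primitive action; a complex macro-action
    # holds a root macro-action and children indexed by the observation label
    # on the edge. Class and attribute names are illustrative only.
    class MacroAction:
        def __init__(self, action=None, root=None, children=None):
            # Exactly one of the two forms is used:
            #   single : action is a primitive action, root/children are unused
            #   complex: root is a MacroAction, children maps observation -> MacroAction
            self.action = action
            self.root = root
            self.children = children or {}

        def is_single(self):
            return self.action is not None

        def length(self):
            """Length |m| of the macro-action (Sec. III-A)."""
            if self.is_single():
                return 1
            return self.root.length() + max(c.length() for c in self.children.values())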
B. Procedure of action control

A macro-action controls the actions of the agent by the following procedure when it is selected by the agent or invoked by another macro-action.

1) Action control by a single macro-action:

procedure macro_action(m: single macro-action);
begin
  take action a;
end

If a single macro-action m is selected, the agent takes the action a ∈ A(o) of the root node. After this action, the procedure finishes.

2) Action control by a complex macro-action:

procedure macro_action(m: complex macro-action);
begin
  while observe o do macro_action(m_0);
  if observe o_1 then macro_action(m_1)
  else if observe o_2 then macro_action(m_2)
  ...
  else if observe o_n then macro_action(m_n);
end

If a complex macro-action m is selected, the agent first takes the macro-action m_0 ∈ M(o) of the root node and repeats it while it keeps receiving the observation o. When it receives another observation o_k, it takes the macro-action m_k; if m_k is itself a complex macro-action, it controls the agent's actions by this procedure recursively. If the agent receives an observation that does not appear as a label in the tree, the procedure finishes.
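The recursive control procedure above can be sketched as follows, reusing the hypothetical MacroAction representation from the sketch in Sec. III-A and assuming illustrative callbacks observe() and take_action(a) for interacting with the environment; this is a sketch of the procedure, not the paper's implementation.

    # Sketch of the action-control procedure of Sec. III-B, following the
    # pseudocode above. observe() returns the current observation and
    # take_action(a) executes a primitive action; both are assumed callbacks.
    def run_macro_action(m, o, observe, take_action):
        """Control the agent with macro-action m, selected at observation o."""
        if m.is_single():
            take_action(m.action)      # take the primitive action of the root node
            return
        # Complex macro-action: repeat the root macro-action while observing o.
        while observe() == o:
            run_macro_action(m.root, o, observe, take_action)
        o_now = observe()
        if o_now in m.children:
            # Recurse into the child labeled by the new observation.
            run_macro_action(m.children[o_now], o_now, observe, take_action)
        # Otherwise the observation is not a label of the tree: the procedure finishes.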
C. Learning procedure with macro-actions

We show the procedure of Macro/Sarsa(λ) in Fig. 3.

Fig. 3. Procedure of Macro/Sarsa(λ):
(1) Initialize the set of macro-actions M, the macro-action value function Q, and the eligibility traces e.
(2) Repeat for each episode:
  1. receive the observation o;
  2. select a macro-action m ∈ M(o) by a certain rule;
  3. determine the action a by the macro-action m;
  4. repeat for each step of the episode:
    (a) take the action a and receive r, o′;
    (b) if the macro-action m ∈ M(o′): determine the action a′ by the macro-action m;
    (c) else:
      select a macro-action m′ ∈ M(o′) by a certain rule;
      δ ← r + γ^k Q(o′, m′) − Q(o, m), where r is the discounted sum of rewards collected while taking m and k is the number of time steps of m;
      update the eligibility traces;
      for each o, m: Q(o, m) ← Q(o, m) + α δ e(o, m); e(o, m) ← γ λ e(o, m);
      o ← o′; m ← m′.

In the initialization of this procedure, a single macro-action is generated for every action the agent can take, i.e., for every observation o ∈ O and every action a ∈ A(o),
1) m ← the single macro-action whose root node is labeled with a;
2) M(o) ← M(o) ∪ {m}.

After the agent has taken actions under the macro-action m for k (> 0) time steps and the action control procedure has finished, the discounted sum of rewards r is calculated:

r = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + · · · + γ^{k−1} r_{t+k}.   (5)
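A possible sketch of how the discounted sum (5) and the step count k are accumulated while a macro-action is running is given below; the environment call step(a) and the helper choose_primitive are assumptions for illustration only, and the returned tuple is what the boundary update of Sec. II consumes.

    # Sketch of one macro-action execution inside the Macro/Sarsa(lambda)
    # procedure of Fig. 3: primitive rewards collected while m is running are
    # folded into the discounted sum of (5), and k counts the time steps.
    def execute_and_accumulate(m, o, step, choose_primitive, gamma=0.9):
        """Run macro-action m from observation o.

        choose_primitive(m, o) returns the primitive action the macro-action
        prescribes at o, or None when m no longer applies at o.
        Returns (discounted_reward, k, o_final) for the boundary update.
        """
        discounted_reward, k = 0.0, 0
        a = choose_primitive(m, o)
        while a is not None:
            r, o = step(a)                          # take a, receive r_{t+1}, o_{t+1}
            discounted_reward += (gamma ** k) * r   # r_{t+1} + gamma*r_{t+2} + ...
            k += 1
            a = choose_primitive(m, o)              # m keeps control while o is in its tree
        return discounted_reward, k, o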
The macro-action value function is then updated by Sarsa(λ).

D. Generation and acquirement of macro-actions

In this section, we propose a method of acquiring macro-actions based on experience. At each time u_i, the agent accumulates the received observation o_{u_i}, the selected macro-action m_{u_i}, its number of time steps k_{u_i}, and the discounted reward r_{u_i+1} received during those time steps. Let E_{u_i} be the tuple ⟨o_{u_i}, m_{u_i}, k_{u_i}, r_{u_i+1}⟩, called the experience at u_i.
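For illustration, such an experience memory could be kept, for example, as a bounded buffer of tuples; the names Experience, record_experience, and N_MAX below are illustrative only.

    # Sketch of the experience record of Sec. III-D: at each macro-action
    # boundary u_i the agent stores <o, m, k, discounted reward>, keeping at
    # most N_max records.
    from collections import deque, namedtuple

    Experience = namedtuple("Experience", ["o", "m", "k", "discounted_reward"])

    N_MAX = 1000                          # upper bound N_max on stored experiences
    experience_memory = deque(maxlen=N_MAX)

    def record_experience(o, m, k, discounted_reward):
        experience_memory.append(Experience(o, m, k, discounted_reward))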
Fig. 4. Joint procedure of macro-actions.

Fig. 5. Maze. Each grid cell corresponds to a state, and at cells with the same number the agent receives the same observation. G is the goal.
We assume that the number of stored experiences is finite, with upper bound N_max. For each accumulation of N_m experiences:

Step 1. We generate a complex macro-action m whose root node is labeled with m_0 ∈ M(o) and whose leaf node is labeled with m_1 ∈ M(o_1) (o_1 ≠ o).

Step 2. For each complex macro-action generated in Step 1, we check the following inequalities:

Q(o, m) > β_1 Q(o, m_0) + C_1,   (6)

Q(o, m) > β_2 Q(o_1, m_1) + C_2,   (7)
where the parameters β_1, β_2, C_1, C_2 > 0, and Q(o, m) is estimated from the stored experiences. If both conditions are satisfied, we add the complex macro-action m to the set of macro-actions M.

Step 3. For each macro-action m added in Step 2, if there exists a macro-action m′ ∈ M(o) whose root node is labeled with m_0 and whose edge labels o′_k satisfy o′_k ≠ o_1 (k = 1, 2, . . . , n; n > 0), we generate the macro-action m′′ whose root node is labeled with m_0 and whose leaf nodes are labeled with m_1, m′_1, m′_2, . . . , m′_n (Fig. 4). We add this macro-action m′′ to the set of macro-actions M and delete m and m′ from M.

We adopt the following two limitations in this paper.

1. Limitation of the number of macro-actions. The upper bound on the number of macro-actions of length d is ⌊|A| η^{d−1}⌋, where 0 < η < 1 and |A| is the number of actions. Then the number of macro-actions |M| satisfies

|M| ≤ |A| + |A| η + · · · + |A| η^{i−1} + · · · = |A| / (1 − η).

2. Limitation of the generation of macro-actions. If both (6) and (7) are satisfied, we generate the macro-action only with probability P = exp(−1/T). The initial value of T is T_init, and T decreases at the rate σ (0 < σ < 1).
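A minimal sketch of the acceptance test combining conditions (6), (7) and the two limitations might look as follows; Q_est, n_with_length_d, and the other argument names are illustrative assumptions, and the estimation of Q(o, m) from experiences is left abstract.

    # Sketch of the acceptance test for a candidate complex macro-action m with
    # root m0 (at observation o) and leaf m1 (at observation o1), following
    # inequalities (6), (7), the generation probability P = exp(-1/T), and the
    # length-based cap of limitation 1. Q_est is an assumed estimate of Q(o, m)
    # obtained from the stored experiences.
    import math
    import random

    def accept_candidate(Q_est, Q, o, m0, o1, m1, d, n_with_length_d, n_actions,
                         beta1=1.0, beta2=1.0, C1=4.0, C2=1.0, eta=0.5, T=1000.0):
        """Return True if the candidate macro-action should be added to M."""
        # Limitation 1: at most floor(|A| * eta^(d-1)) macro-actions of length d.
        if n_with_length_d >= math.floor(n_actions * eta ** (d - 1)):
            return False
        # Conditions (6) and (7): the candidate must beat both of its parts.
        if not (Q_est > beta1 * Q[(o, m0)] + C1 and Q_est > beta2 * Q[(o1, m1)] + C2):
            return False
        # Limitation 2: generate only with probability P = exp(-1/T).
        return random.random() < math.exp(-1.0 / T)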
IV. SIMULATION

We compare the following three agent systems:
(A1) Sarsa(λ);
(A2) Macro/Sarsa(λ), generating macro-actions;
(A3) Macro/Sarsa(λ), given appropriate macro-actions.

A. Maze problem

We deal with the maze problem of [3], modified into a partially observable MDP (Fig. 5). The agent starts from one of the four corners at random. The actions available to the agent are up, down, left, and right, and the state transitions are deterministic. The agent's observation is the presence of walls around itself. The agent receives a reward of +5.0 at the goal G, −1.0 when it crashes into a wall, and −0.1 otherwise. Each agent updates its policy with the Gibbs distribution

π(o, m) = exp(Q(o, m)/τ) / Σ_{m′∈M(o)} exp(Q(o, m′)/τ),

where τ = 0.2. The other parameters are set to discount rate γ = 0.9, learning rate α = 0.9, and λ = 0.2.
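For illustration, macro-action selection under this Gibbs distribution can be sketched as follows; the helper select_macro_action and its arguments are illustrative.

    # Sketch of macro-action selection with the Gibbs (softmax) distribution
    # used in the maze experiment, tau = 0.2. Q is the tabular value function
    # keyed by (observation, macro-action) and M_o the list of macro-actions
    # available at observation o.
    import math
    import random

    def select_macro_action(Q, o, M_o, tau=0.2):
        # Subtract the maximum for numerical stability; the distribution is unchanged.
        q = [Q[(o, m)] / tau for m in M_o]
        q_max = max(q)
        weights = [math.exp(v - q_max) for v in q]
        return random.choices(M_o, weights=weights, k=1)[0]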
The parameter settings for agent (A2) are shown in Table I, and the macro-actions given to agent (A3) are shown in Fig. 6.

TABLE I
PARAMETER SETTINGS FOR AGENT (A2)
N_max = 1000, N_m = 500, C_1 = 4.0, C_2 = 1.0,
β_1 = 1.0, β_2 = 1.0, T_init = 1000, σ = 0.9998, η = 0.5

Fig. 7 shows the average reward per action. The results imply that macro-actions are effective in the POMDP and that agent (A2) acquires appropriate macro-actions.

V. CONCLUSION

In this paper, we developed a description of macro-actions based on the basic control structures of structured programming and proposed a method for acquiring macro-actions during the reinforcement learning process. The effectiveness of our macro-actions and of the proposed acquisition method was shown by experiments on a partially observable MDP.
Fig. 6. Given macro-actions for agent (A3) (U: up, D: down, L: left, R: right).

Fig. 7. Average of reward for one action in the maze problem, plotted against trials (×100), for (A1), (A2), and (A3).

REFERENCES

[1] Littman, M. L., Cassandra, A. R., and Kaelbling, L. P.: Learning Policies for Partially Observable Environments: Scaling Up, Proceedings of the 12th International Conference on Machine Learning, pp. 362–370 (1995).
[2] Loch, J., and Singh, S.: Using Eligibility Traces to Find the Best Memoryless Policy in Partially Observable Markov Decision Processes, Proceedings of the 15th International Conference on Machine Learning, pp. 141–150 (1998).
[3] McCallum, R. A.: Instance-Based Utile Distinctions for Reinforcement Learning with Hidden State, Proceedings of the 12th International Conference on Machine Learning, pp. 387–395 (1995).
[4] McGovern, A., Sutton, R. S., and Fagg, A. H.: Roles of Macro-Actions in Accelerating Reinforcement Learning, Proceedings of the 1997 Grace Hopper Celebration of Women in Computing, pp. 13–18 (1997).
[5] McGovern, A., and Barto, A. G.: Automatic Discovery of Subgoals in Reinforcement Learning Using Diverse Density, Proceedings of the 18th International Conference on Machine Learning (2001).
[6] Pendrith, M. D., and McGarity, M. J.: An Analysis of Direct Reinforcement Learning in Non-Markovian Domains, Proceedings of the 15th International Conference on Machine Learning, pp. 421–429 (1998).
[7] Randlov, J.: Learning Macro-Actions in Reinforcement Learning, Advances in Neural Information Processing Systems 11 (1999).
[8] Singh, S. P.: Scaling Reinforcement Learning Algorithms by Learning Variable Temporal Resolution Models, Proceedings of the 9th International Conference on Machine Learning, pp. 406–415 (1992).
[9] Sutton, R. S., and Barto, A. G.: Reinforcement Learning: An Introduction, MIT Press (1998).
[10] Sutton, R. S., Precup, D., and Singh, S.: Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning, Artificial Intelligence, 112, pp. 181–211 (1999).
[11] Theocharous, G., and Kaelbling, L. P.: Approximate Planning in POMDPs with Macro-Actions, Neural Information Processing Systems (2003).
[12] Wang, G., and Mahadevan, S.: Hierarchical Optimization of Policy-Coupled Semi-Markov Decision Processes, Proceedings of the 16th International Conference on Machine Learning, pp. 464–473 (1999).