Model-based Reinforcement Learning Approach for Planning in Self-Adaptive Software System

Han Nguyen Ho
Dept. of Electrical and Computer Engineering, Sungkyunkwan University, Korea.
[email protected]

Eunseok Lee
Dept. of Electrical and Computer Engineering, Sungkyunkwan University, Korea.
[email protected]

ABSTRACT

Policy-based adaptation is one of the interesting topics in the self-adaptive software research community. Recent works in the field have proposed the notion of policy evolution, which concentrates on tackling the impact of environmental uncertainty on adaptation decisions. These works adopt the advances of Reinforcement Learning (RL) to continuously optimize system behavior at run-time. However, several issues remain primitive in current research, especially the arbitrary exploitation–exploration trade-off and random exploration, which can lead to slow learning and, hence, frail decisions in exceptional situations. With a model-free approach, these works cannot leverage knowledge about the underlying system, which is essential and plentiful in software engineering, to enhance their learning. In this paper, we introduce the advantages of model-based RL. By utilizing engineering knowledge, the system maintains a model of its interaction with the environment and predicts the consequences of its actions, in order to improve and guarantee system performance. We also discuss the engineering issues and propose a procedure for adopting model-based RL to build self-adaptive software and bring policy evolution closer to real-world applications.

Categories and Subject Descriptors D.2.8 [Software Engineering]: Design—Methodologies

General Terms Design

Keywords Policy evolution, reinforcement learning, Bayesian inference, model-based RL

1. INTRODUCTION

The self-adaptive capability of a software system has become more important recently, as changes in the dynamic run-time context and in users' needs have been increasing administrative and maintenance overheads [1, 2].

Researchers in this domain have widely accepted MAPE-K by IBM [3] as a reference architectural blueprint for engineering self-adaptive software systems (SAS). MAPE-K, built around central Knowledge, incorporates a control loop of Monitoring, Analyzing, Planning and Executing over the managed system. One of the most interesting and challenging tasks in MAPE-K is Planning, in which the system infers context information and utilizes its knowledge to make appropriate adaptation decisions.

Policy-based adaptation [1, 2] has recently proved to be an efficient approach for engineering adaptation knowledge, owing to its simplicity and reliability. In this approach, system experts analyze and design a policy repository, which indicates possible situations that the system may encounter and the corresponding strategies or actions to take in order to meet the users' requirements. A clear issue of early works adopting the policy-based approach is that static rules cannot deal with emergent or exceptional situations that are unknown to engineers at design time. Since running environments are gradually becoming more open and varied, learning becomes an essential capability of self-adaptive software systems, including policy-based ones. Kim et al. [9, 10] proposed the concept of online planning, which makes use of reinforcement learning to dynamically adjust the configuration of the system based on real-time execution performance. Later, in 2012, Gu introduced IDES [11], an extension of Rainbow that can learn and update the policy repository at run-time. IDES assigns a predefined preference value to every policy available at design time, and uses an Actor–Critic algorithm to examine the run-time reward and then adjust the preference of the previously executed policy.

While promising a notable improvement over the static policy approach, there remain primitive issues that have not been well addressed yet. The most important problem in reinforcement learning is finding an appropriate compromise between exploitation, where the system utilizes its experience to maximize the expected reward, and exploration, where the system makes trial decisions to enrich its knowledge. Current works [9, 10, 11] commonly adopt a model-free approach, in which the exploitation–exploration trade-off is solved in a heuristic manner, such as a predefined ratio (ε-greedy) or a gradual increase of exploitation (Boltzmann policy). In the ε-greedy algorithm [10], the system chooses an arbitrary random action when it decides to explore, and hence takes a long time to converge.

The Boltzmann policy reduces the exploration ratio when its experience is supposed to be sufficient, and thus cannot promptly deal with new situations that occur later on. Clearly, by not leveraging knowledge about the underlying system, heuristic exploration gives poor performance, especially when the state–action space is relatively large or there is an insufficient amount of training time.

In this paper, we introduce the advances of model-based Reinforcement Learning [13, 14, 15, 16], which learns an explicit model of the system–environment interaction and represents uncertainty in the parameters of this model, to improve learning performance over the primitive techniques addressed above. Model-based and model-free RL have the same primary goal: to improve the behavioral policy. However, the model-based RL agent simultaneously attempts to learn a model of its environment, which allows the agent to predict the consequences of actions before they are taken. Further, a model of the environment is not directly tied to the goal currently being pursued. For example, consider a robot that navigates to collect dust. While performing this task, the robot can learn (or be provided beforehand with) a structural representation of the house. If its goal changes, for example, to finding the recharge station, the previously learned policy (the sequence of turns to locate the dust) is of no help, but the learned model of the house structure can still aid the robot in achieving its new goal. Generally, the model-based approach brings two key advantages suited to the SAS domain: a) domain knowledge can be encoded to speed up learning; b) the learned model can aid the system under new goals [18].

We also discuss the mapping between an RL problem and a SAS problem, in which the main difference is the concept of situation. This issue was stated in [10] but remains unsolved. Since RL only depends on the underlying MDP and does not take environmental situations into account, an arbitrary solution is to design different MDPs for different situations. We propose here an extension of the MDP that supports the situation concept, by which we simplify design and maintenance for system engineers.

The remainder of this paper presents in detail how to adopt the advantages of model-based RL for planning in self-adaptive software systems. Section 2 provides some background information. The main approach and algorithm are addressed in Section 3. We explain the implementation and engineering aspects of the proposed learning method in Section 4. Sections 5 and 6 present a case study, discuss experimental results, and outline some future directions.

2. RELATED WORK

Policy-based adaptation has proved to be an efficient and reliable technique for planning in self-adaptive software systems [1, 2, 4, 5, 6]. Over the past few years, continuous improvements have been introduced to make adaptation more flexible and robust. In this section, we briefly review the road map and the results published recently on this topic.

2.1 Policy-based Adaptation

Generally, a policy is a mapping from a situation or condition to an appropriate action, strategy or reconfiguration. The policy-based approach allows engineers to decouple the system's adaptation logic from the knowledge about how to react when adaptation is required to occur. It can be applied in various engineering styles. In Rainbow [5], Garlan et al. introduced architecture-based adaptation, in which the system chooses a new architectural reconfiguration based on predefined rules. Khakpour et al. provided a Policy-based Self-Adaptive Model (PobSAM) [4] that uses policies, represented formally in algebra, as a mechanism to direct and adapt the behavior of self-adaptive systems. Other model-oriented and feature-oriented works [7, 8] also use policies as their main planning method. As pointed out in [6, 10, 11], policy-based adaptation suffers from the shortcoming that statically predefined policies cannot deal with unseen situations or uncertainties. The system hence gives frail decisions, as there is no automatic mechanism to react when exceptions occur, and human intervention is usually needed.

Figure 1. The IBM MAPE-K structure with the Learning process.

2.2 Policy Evolution

To tackle the issues mentioned in Section 2.1, the notions of online planning and policy evolution were introduced [10, 11]. The main approach is to model the underlying system as a Markov Decision Process: a 4-tuple (S, A, T, R), where:

• S is a finite set of states,
• A is a finite set of actions,
• T_a(s, s') is the probability that action a in state s will lead the system to state s',
• R_a(s, s') is the immediate reward received after the transition from state s to s',


and employ reinforcement learning to find a policy: a function π that specifies the optimal action a = π(s) in state s. Reinforcement learning pursues the same philosophy as self-adaptive software systems, since it concerns the problem of an agent interacting with its environment to achieve a goal by discovering how to behave in order to obtain the most reward [12]. There are many algorithms realizing RL, which can be roughly classified into two trends: model-free, which directly learns a value function to determine the policy, and model-based, which estimates the transition probabilities and the reward function and then derives a value function from the approximate MDP [19].
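As a concrete illustration of this formulation, the following is a minimal Python sketch of such an MDP with hypothetical states, actions, probabilities and rewards (none of them taken from the cited works); a deterministic policy is then simply a table mapping each state to an action.

```python
import numpy as np

# Hypothetical MDP with 3 states and 2 actions.
# T[a, s, s2] is the transition probability T_a(s, s');
# R[a, s, s2] is the immediate reward R_a(s, s').
n_states, n_actions = 3, 2
T = np.zeros((n_actions, n_states, n_states))
R = np.zeros((n_actions, n_states, n_states))

T[0, 0] = [0.1, 0.9, 0.0]        # action 0 in state 0 mostly moves to state 1
T[0, 1] = [0.0, 0.5, 0.5]
T[0, 2] = [0.0, 0.0, 1.0]
T[1] = np.eye(n_states)          # action 1 keeps the current state
R[0, 0, 1] = 1.0                 # reward for reaching state 1 via action 0

# A deterministic policy pi: S -> A, here as a lookup table.
pi = np.array([0, 0, 1])
```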

Kim et al. [9, 10] made use of Q-Learning in an architecture-based adaptation style. The system is represented as an abstract architecture model consisting of components and their relations. For every reconfiguration of the system (architectural variation), they specify a Q-value that indicates its efficiency in the current execution context. In the spirit of model-free reinforcement learning, by examining the real-time performance (i.e. the reward value), the Q-value is updated directly using a function that estimates the quality of a state–action combination, Q : S × A → R. With β = 1 − α, α < 1, γ < 1, the Q-value at step t + 1 is given by:

Q_{t+1}(s_t, a_t) = \beta \, Q_t(s_t, a_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q_t(s_{t+1}, a) \right]
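For reference, a minimal sketch of this tabular Q-learning update, with illustrative table sizes and parameter values of our own choosing, could look as follows:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One update: Q(s,a) <- beta*Q(s,a) + alpha*[r + gamma*max_a' Q(s',a')], beta = 1 - alpha."""
    beta = 1.0 - alpha
    Q[s, a] = beta * Q[s, a] + alpha * (r + gamma * np.max(Q[s_next]))
    return Q

# Usage: a Q-table over 10 hypothetical configurations (states) and 4 reconfiguration actions.
Q = np.zeros((10, 4))
Q = q_learning_update(Q, s=0, a=2, r=1.0, s_next=5)
```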

To avoid falling into a local optimum, they used a predefined ratio ε between exploitation (i.e. utilizing experience to find the best reconfiguration so far) and exploration, where the system chooses a reconfiguration randomly (ε-greedy), to seek potential changes that may increase system performance. Although aiming at global optimization, this method is very arbitrary and impractical if there is a large number of possible reconfigurations or insufficient training/execution time. Moreover, heuristic exploration is sensitive to the choice of learning parameters (ε, α, γ) [17], which makes it more difficult for the system designer to determine the best learning configuration.

In IDES [11], an extension of Rainbow by Gu (2012), preference values are assigned to every policy available at design time, considering the cost–benefit factors of each policy as shown in Table 1. At run-time, the learning process (Fig. 1) uses an actor–critic algorithm to examine the received reward and then adjusts the preference of the previously executed policy. The agent increases or decreases the preference of a policy if it receives a positive or negative reward, respectively. There is no exploration in this algorithm, resulting in slow reaction to exceptions and a high probability of falling into a local optimum.

Table 1. The policy preference table of IDES, with the benefit and cost dimensions of each policy.

  Situation    Action     Benefit   Cost    Pref.
  Congested    Replace    7/10      5/10    1.4
  Congested    Add new    8/10      8/10    1.0
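A rough sketch of this kind of preference adjustment, our own simplified reading of the mechanism described above rather than the actual IDES implementation, is shown below:

```python
def update_preference(preferences, executed_policy, reward, step=0.1):
    """Simplified actor-critic style update: raise the preference of the last
    executed policy after a positive reward, lower it after a negative one."""
    if reward > 0:
        preferences[executed_policy] += step
    elif reward < 0:
        preferences[executed_policy] -= step
    return preferences

# Usage with the two policies of Table 1 (initial preferences 1.4 and 1.0).
prefs = {"Replace": 1.4, "Add new": 1.0}
prefs = update_preference(prefs, "Replace", reward=-1.0)
```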

Generally, model-free reinforcement learning directly reflects the uncertainty of the executing environment in the policy preference values. In this way, we can neither store nor utilize engineering knowledge about the system to enhance the learning. In contrast, knowledge about the underlying system, i.e. its state transitions and uncertainties, is essential and plentiful in the software engineering domain.

Adopting reinforcement learning for planning in self-adaptive software systems is natural, as the philosophy of both problems is to learn the agent–environment interaction in order to optimize the returned reward (i.e. maximize system utility). However, the mapping between an RL problem and a SAS problem still has an essential difference: the concept of situation. As stated in [10], RL problems rely on the Markov Decision Process, which does not care about environmental situations. An arbitrary solution to this gap is to design different MDPs for the different situations that may be encountered by the process [9]. However, this may bring more complexity to both the system designer and the system maintainer, since the system's state transitions are divided into different sequences. The next sections aim to solve the issues discussed here by introducing the advances of model-based reinforcement learning and showing how to apply them to the planning process in a policy-based self-adaptive system.

3. MODEL-BASED REINFORCEMENT LEARNING

In the AI community, reinforcement learning has been widely accepted as a successful technique for learning via interaction without an explicit teacher: an agent independently tries to optimize its behavior in an operating environment with several sources of uncertainty [12]. Pursuing a similar nature, recent research in self-adaptive software systems has adopted reinforcement learning techniques with remarkable success [10, 11]. Unfortunately, RL often exhibits extremely slow learning in complex problems. As stated in Section 2, the current works adopting RL in the self-adaptive domain remain very primitive, resulting in impractical performance. An ongoing challenge in machine learning is developing learning approaches that share the advantages of RL but learn in a more intelligent manner.

In this section, we present an approach that extends the capabilities of RL algorithms, known as model-based reinforcement learning. Model here means a model of the dynamics of the environment. In the simplest case, this is just an estimate of the state transition probabilities and expected immediate rewards of the environment; in general, it means any prediction about the environment's future conditional effect on the agent's behavior [12]. In particular, we present the model-based Bayesian RL approach [14, 16], which is one of the most successful techniques in the field [18]. In Bayesian learning, uncertainty is represented by a prior distribution over unknown parameters, and learning is achieved by computing a posterior distribution based on observations. This yields several benefits: a) domain knowledge can be naturally encoded in the prior distribution to speed up learning; and b) the exploration–exploitation trade-off can be naturally optimized [18]. As depicted in Fig. 2, a typical model-based reinforcement learner can be decomposed into two parallel processes: (a) estimating a model of the underlying system; and (b) determining optimal behavior from the estimated model [14].

Figure 2. The two parallel processes behind model-based Reinforcement Learning: (a) estimate a model from interaction with the environment; (b) determine a policy from the estimated model.

Following mainly the work of N. Vlassis et al. [18], the next subsections introduce this approach step by step.

3.1 Model Representation

The most essential task in model-based RL is how to represent the estimated model of the underlying system, which will impact the later processes of model estimation and policy derivation. Generally, a system (or agent) is modeled as a typical Markov Decision Process with state space S, action space A, state transition function T_a(s, s') and reward function R_a(s, s'). To represent uncertainty (here, the unknown state transition probabilities), we can formulate a hyper-state set S_P as the hybrid set of states defined by the cross product of the nominal MDP states s and the model parameters θ_a^{s,s'}. The transition function is modified to T_P(s, θ, a, s', θ') = P(s, θ, a, s', θ'), and can be factored into two conditional distributions: P(s'|s, a) = θ_a^{s,s'} for the MDP states, and P(θ'|θ) = δ_θ(θ') for the unknown model parameters [18].

We can formulate a belief-state MDP by defining beliefs over the unknown parameters θ_a^{s,s'}. A natural representation of beliefs is via Dirichlet distributions (Fig. 3), as Dirichlets are conjugate densities of multinomials. A Dirichlet distribution Dir(p; n) ∝ \prod_i p_i^{n_i - 1} over a multinomial p is parameterized by positive numbers n_i, such that n_i − 1 can be interpreted as the number of times that the p_i-probability event (i.e. a state transition) has been observed.

Figure 3. Plots of the Dirichlet distribution Dir(p; 0.2k, 0.8k) with different levels of confidence (no confidence, k=10, k=100).

3.2 Model Learning

The prior density and the evidence available from observations can be combined to derive the posterior probability density over the model parameters. Belief monitoring corresponds to Bayesian updating of the beliefs based on observed state transitions. For a prior belief b(θ) = Dir(θ; n) over some transition parameter θ, when a specific (s, a, s') transition is observed from the environment, the posterior belief is obtained by Bayes' rule as follows:

b'(θ) ∝ θ_a^{s,s'} b(θ)

If we represent a belief state by a tuple ⟨s, {n_a^{s,s'}}⟩ consisting of the current state s and the hyper-parameters n_a^{s,s'} of each Dirichlet, the update simply amounts to setting the current state to s' and incrementing by one the hyper-parameter n_a^{s,s'} that matches the observed transition (s, a, s'). This formulation of Bayesian reinforcement learning also provides a natural way to reason about the exploration–exploitation trade-off. Beliefs encode all the information gained by the learner, and an optimal policy is a mapping from beliefs to actions that maximizes the expected total reward. For that reason, an optimal policy naturally optimizes the exploration–exploitation trade-off.
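A minimal sketch of this Dirichlet belief representation and its counting update, assuming a small finite state and action space (the class name and structure are our own illustration), could look as follows:

```python
import numpy as np

class DirichletBelief:
    """Belief over transition models: one Dirichlet per (state, action) pair,
    parameterized by pseudo-counts n_a^{s,s'}."""

    def __init__(self, n_states, n_actions, prior_counts=1.0):
        # prior_counts = 1.0 corresponds to the uninformative Dir(1, ..., 1) prior.
        self.counts = np.full((n_states, n_actions, n_states), prior_counts)

    def update(self, s, a, s_next):
        # Bayesian update: observing (s, a, s') increments n_a^{s,s'} by one.
        self.counts[s, a, s_next] += 1.0

    def mean_model(self):
        # Posterior mean estimate of T_a(s, s') for every (s, a).
        return self.counts / self.counts.sum(axis=2, keepdims=True)

    def sample_model(self, rng=None):
        # Draw one plausible transition model T[s, a, s'] from the current belief.
        if rng is None:
            rng = np.random.default_rng()
        T = np.empty_like(self.counts)
        for s in range(self.counts.shape[0]):
            for a in range(self.counts.shape[1]):
                T[s, a] = rng.dirichlet(self.counts[s, a])
        return T

# Usage: belief over a hypothetical 12-state, 4-action system.
belief = DirichletBelief(n_states=12, n_actions=4)
belief.update(s=0, a=2, s_next=5)      # one observed transition
```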

3.3 Derive Policy from Model

A common technique to derive a policy from the model is to use dynamic programming to approximate Bellman's equation for the optimal value function in the belief state, as follows:

V_s^*(b) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s'|s, b, a) \, V_{s'}^*(b_a^{s,s'}) \right]

Here s is the current nominal MDP state, b is the current belief over the model parameters θ, and b_a^{s,s'} is the updated belief after the transition (s, a, s'). Online algorithms attempt to approximate the Bayes-optimal action by reasoning over the current belief, which often results in short-sighted action selection strategies [13].

Early approximate online RL algorithms were based on the value of perfect information (VPI) criterion for action selection. It involves estimating the distribution of optimal Q-values for the MDPs in the support of the current belief, which is then used to compute the expected 'gain' of switching from one action to another. Instead of building an explicit distribution over Q-values, we can use the distribution over models P(θ) to sample models and compute the optimal Q-values of each model. This yields a sample of Q-values that approximates the underlying distribution over Q-values. The value of perfect information can then be approximated as follows:

VPI(s, a) \approx \frac{1}{\sum_i w_\theta^i} \sum_i w_\theta^i \, Gain_{s,a}(q_{s,a}^i)


where the w_θ^i are the importance weights of the sampled models, which depend on the proposal distributions used. Several efficient procedures exist to sample models from proposal distributions that may be easier to work with than P(θ). Another action selection strategy is Thompson sampling, which involves sampling one MDP from the current belief, solving this MDP, and executing its optimal action at the current state. We may achieve a better action selection strategy by computing a near-optimal policy in the belief-state MDP. As this is a standard MDP, we may use any approximate solver. Some studies have followed this idea by applying the sparse sampling algorithm to the belief-state MDP. This approach carries out an explicit look-ahead to the effective horizon starting from the current belief, backing up rewards through the tree by dynamic or linear programming, resulting in a near Bayes-optimal exploratory action. However, a new tree has to be generated at each step, which can be costly in practice.

Due to space limitations, in this paper we only briefly introduce the generic approaches to model-based Bayesian reinforcement learning. More detailed algorithms, refinements and discussions can be found in recent works [13, 14, 16, 18, 19].
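As one concrete illustration of the Thompson sampling strategy mentioned above, the following sketch samples one MDP from the belief (reusing the illustrative DirichletBelief class from Section 3.2), solves it by value iteration, and acts greedily; the reward representation R[s, a] is an assumption of this sketch.

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-6):
    """Solve a sampled MDP: T[s, a, s'] transition probabilities, R[s, a] expected
    rewards. Returns the optimal Q-values."""
    n_states, n_actions, _ = T.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * np.einsum("sat,t->sa", T, V)   # Bellman backup
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q
        V = V_new

def thompson_action(belief, R, current_state, gamma=0.95):
    """Thompson sampling: sample one MDP from the current belief, solve it,
    and execute the action that is optimal for that sample."""
    T = belief.sample_model()            # one plausible transition model
    Q = value_iteration(T, R, gamma)     # optimal Q-values of the sampled MDP
    return int(np.argmax(Q[current_state]))
```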


4. ONLINE PLANNING IN SAS

Model-based reinforcement learning shows various advantages and promises a notable improvement for planning in the self-adaptive software domain, since it can leverage knowledge about the underlying system to enhance the learning progress. In this section, we discuss the engineering issues of applying this approach when building a software system.

4.1 Mapping RL to SAS Problem

The reinforcement learning problem aims to solve the problem of learning from interaction to achieve a goal [12]. Specifically, the agent interacts with the environment over a sequence of time steps. At each time step, the agent observes the environment's state and on that basis selects an action. At the next step, as a consequence of its action, the agent receives a numerical reward and makes a transition to a new state. Planning in software systems can be represented as such an interaction between an agent and an environment, as depicted in Fig. 4. In this interaction model, the software system plays the role of the agent. The system monitors the current state of the environment via its sensors. Based on some experience or knowledge (i.e. policy), it chooses an action and evaluates the feedback reward. The reward in SAS is usually evaluated through measurements of various sensed factors in the form of utility functions [10, 11].

Figure 4. The difference in system-environment interaction between (a) the RL and (b) the SAS problem setting.

However, one of the essential concepts of SAS, the situation, is missing in the original RL setting. The situation allows the system to know when it must monitor the state and reward, and also when it must take the action corresponding to the state (i.e. it triggers the adaptation process) [10]. Besides, the important role of the situation is to provide the agent with information about environmental events, so that it can behave in the proper direction. An agent may stay in the same state, but should take different actions in different situations. In other words, the situation limits the actions that can be selected. To make this clear, consider a super-computer that can automatically manage its resources. In a situation of increasing workload, it should increase the amount of resources to meet the performance requirement. In contrast, resources should be turned off to reduce power consumption when the workload is decreasing. To apply RL to SAS, we propose an extension of the Markov Decision Process that supplements it with the definition of situation:

Definition: I = {i_1, i_2, ..., i_n} is the situation set. A situation i_t ∈ I indicates an environmental or internal event that the agent may encounter and that triggers its adaptation.

As a result, A|_{s_t, i_t} indicates the possible action set when the agent is in state s_t and situation i_t occurs. Let S* be the combination of the state set S and the situation set I. Using this notation, a new MDP consisting of {S*, A, T, R} can directly apply reinforcement learning without any gap. On the other hand, the new situation concept allows system engineers to design the policy repository with one more dimension that obeys the standards of self-adaptive systems, and also maintains the global unification of the system.
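A minimal sketch of this situation-extended state space S* = S × I, with hypothetical state, situation and action names chosen only for illustration, could be:

```python
from itertools import product

# Hypothetical nominal states, situations and actions.
states = ["normal_load", "high_load"]
situations = ["workload_increasing", "workload_decreasing"]
actions = ["add_resource", "release_resource", "no_op"]

# Extended state set S* = S x I: the learner operates directly on these pairs.
extended_states = list(product(states, situations))

# The situation restricts the available action set A|_{s,i}.
restricted_actions = {
    ("high_load", "workload_increasing"): ["add_resource"],
    ("normal_load", "workload_decreasing"): ["release_resource", "no_op"],
}

def actions_for(state, situation):
    # Fall back to the full action set when no restriction was designed.
    return restricted_actions.get((state, situation), actions)
```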

4.2 Model Construction Process

Model-based reinforcement learning provides stable performance by maintaining an explicit model of the operating environment to predict the consequences of actions before they are taken. Therefore, encoding engineering knowledge about the underlying system into a prior model strongly benefits the learning progress. This is a domain-specific problem, but it shares common key activities. Here, we discuss the technical process of constructing a model (i.e. the system state transitions) through requirement analysis and system specification. Generally, we propose the following 4-step procedure:

1. Discover system States, Actions and Situations
2. Context analysis and uncertainty modeling
3. Action–uncertainty mapping
4. Derive the state transition prior distribution


Determining the state and action sets is a fundamental step in designing an online-planning self-adaptive software system. In the early phase of the development process, by discovering and analyzing user requirements, analysts derive the goals and operating scenarios of the system. Many works have provided guidelines in this requirement engineering domain. Rolland et al. [20] described a procedure to represent and analyze scenarios to discover system states, actions and parameters, as depicted in Fig. 5. A State is usually a combination of important conditions of the system, while an Action is a function or reconfiguration process that transforms the system from one state to another [10]. Finally, a Situation is the event that triggers the scenario and requires the agent to adapt. After this step, we obtain a preliminary version of the state transition model without any prior knowledge (as illustrated in Fig. 6).

Figure 5. The scenario structure, which describes its states, actions and parameters [20].

In the second step, we analyze the operating context to model the uncertainties that may affect the system's run-time operation. Over the past decade, many researchers have put much effort into dealing with uncertainty by managing the dynamic environment. Early studies, including [20], used the concept of exception, while later works provided more detailed analyses of this topic. Esfahani et al. [21] provided a survey of recent uncertainty studies that classifies uncertainty into different categories: model drift, sensor noise, operation parameters, human-in-the-loop, and decentralization. Utilizing this classification along with the context modeling guideline of Villegas et al. [22], we can determine the uncertainty sources of a software system in a systematic way. Moreover, in the design phases (system and functional design), by analyzing the dependency relationships between functions or components, we need to take into account other factors of the internal structure/implementation that may also affect the actions. This step plays the most crucial role in building the model, since any missing factor could have a big impact on the final estimation.

Mapping the discovered contextual uncertainties to the affected actions and analyzing their impact to derive a probability distribution over the preliminary model are the final steps. From an action's parameters (resources, dependent objects), we first determine the corresponding context factors in the context model. By looking at the association between uncertainties and context factors, we then clarify which uncertainties will affect the action. Designing the probability distributions over the model requires a deep analysis (or simulated experience) of the impact of the combined uncertainties. The Dirichlet distribution allows designers to encode their knowledge into the model with a level of confidence (Fig. 3). Hence, if there is not sufficient confidence about some factors, they will be learned during the run-time of the system by exploration.

5. CASE STUDY

To illustrate the engineering procedure described above and verify the performance of the proposed learning approach, in this section we consider a case study inspired by a recent technology trend: Cloud Computing. By virtualizing computational resources (CPU power, RAM, HDD storage), a cloud server can provide different service packages based on customers' requirements. In recent years it has proved to be an efficient solution, reducing several costs on the user side (initial deployment, hardware maintenance, software licenses, etc.).

Let us consider a case where a financial company wants to deploy its website using a cloud service. The website operates in two modes: graphic mode provides more information and costs more resources than textual mode. The goal is to provide customers with real-time market information as fast as possible. Besides, given the dynamics of cloud-based resources, the company also wants to reduce the cost paid to the cloud provider by minimizing the hired resources whenever they are not necessary.

It is clear that this website should be engineered as a self-adaptive software system, with response time as the most important factor to be adapted. Through requirement discovery, we found two adaptation scenarios for the website when response time goes below expectation:

1. Add more resources to increase system performance.

2. Switch to Textual Mode to reduce computational time.

For the second requirement (reducing unnecessary resources), we need to consider another situation, the decrease of the website's traffic, with the possible scenarios:

3. Reduce unnecessary resources to save cost.

4. Switch to Graphic Mode to provide more information.

Following the state-action discovery process, we designed the states of this system as combinations of the three key conditions mentioned in the requirements:

• Mode = {Graphic, Textual}
• ResponseTime = {Normal, Low}
• Resource = {Small, Medium, Large}

There are four actions according to the scenarios addressed above:

• Add or Release resources
• Switch to Textual or Graphic mode

The reward function of the website strongly depends on response time, as this is our main goal. Other side factors also affect the earned reward, including whether the operating mode is graphic (more preferred) or textual, and whether the resource in use is small (more preferred) or large. After this step, we obtained a preliminary version of the state-action transition model, as depicted in Fig. 6.

The next step is to model and discover uncertainty. This is a domain-specific process that needs expert analysis. Here, for simplicity, we model only one uncertainty that affects the system: the availability of cloud resources. As resources are limited, there are cases in which the cloud provider cannot provide the requested amount of resources on demand, with a probability of 2%, resulting in delays or unexpected state transitions of the underlying system. Mapping this uncertainty to the system's actions, we found that the affected action is Adding more resources; hence, we assigned a belief distribution over every state transition caused by this action. Also for simplicity, we ignore the internal uncertainty of the system and leave the other transitions with a certain probability of 1.0.

To evaluate the performance of model-based reinforcement learning in finding the optimal policy (maximizing the system's accumulated reward), we implemented and simulated this problem and observed the total reward the system received using the Beliefbox framework [23], an open-source software platform for evaluating reinforcement learning algorithms.
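A minimal sketch of how these states, actions and the 2% resource-availability uncertainty could be encoded as a Dirichlet prior is shown below; the confidence level k and the helper function are our own illustrative choices, not part of the Beliefbox experiment itself.

```python
from itertools import product

# The 12 case-study states: Mode x ResponseTime x Resource.
modes = ["Graphic", "Textual"]
response_times = ["Normal", "Low"]
resources = ["Small", "Medium", "Large"]
states = list(product(modes, response_times, resources))

actions = ["add_resource", "release_resource", "switch_textual", "switch_graphic"]

# Prior pseudo-counts for the 'add_resource' action: with confidence k, encode a
# 98% chance of reaching the intended state and a 2% chance of staying put
# (the cloud provider cannot deliver the requested resource on demand).
k = 100

def add_resource_prior(current, intended_next):
    counts = {s: 0.0 for s in states}
    counts[intended_next] += 0.98 * k
    counts[current] += 0.02 * k
    return counts

# Example: from (Graphic, Low, Small), adding resource should reach (Graphic, Low, Medium).
prior = add_resource_prior(("Graphic", "Low", "Small"), ("Graphic", "Low", "Medium"))
```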

Figure 6. The case study scenario of the Cloud-based Adaptive Web server: states are combinations of Mode, ResponseTime and Resource; transitions correspond to the Add/Remove resource and Switch mode actions.

We compared model-based RL with some well-known model-free algorithms: Q-Learning, Sarsa and TdBma. The result in Fig. 7, as expected and as observed in other comparisons from the reinforcement learning community [14, 16, 17], shows that, with the encoded knowledge of uncertainty, model-based reinforcement learning allows the system to receive a higher long-term accumulated reward, hence providing better and more stable performance compared to model-free approaches. When we reduced the number of interactions in each simulation (to shorten the training time), the results depicted in Fig. 8 show little difference among the algorithms in terms of total reward earned. However, the model-based approach provided stable performance compared to the others and is thus more reliable for guaranteeing the system's QoS.

The most important drawback of model-based compared to model-free RL is computational complexity. In our experiment, on a 2.4 GHz CPU machine running Linux, the model-based RL algorithm required 16% CPU, 1.8 MB RAM and 6,480,000 system clocks (0.027 seconds) to finish the simulation shown in Fig. 7, while the remaining model-free algorithms only required around 4% CPU, 0.8 MB RAM and 70,000 system clocks to do the same. However, since more and more SAS systems are deployed in the real world to serve commercial purposes that require a high level of QoS, this computational trade-off is acceptable relative to the benefit it provides. Computational complexity is one of the key issues in the model-based RL research community, and much progress has been made in recent years. Current works in the field have demonstrated the feasibility of model-based RL in large and complex problems, using factored models to reduce the state-action search space, or approximate functions to reduce the overhead of policy-deriving algorithms [16, 19].

Figure 7. Simulation results for the Cloud-based Adaptive Web server using different reinforcement learning algorithms. We ran 100 simulations (x-axis) for 4 algorithms (each consisting of 1000 interactions) and observed the accumulated reward of the system (y-axis).

Figure 8. Results when we ran 100 simulations, each consisting of 500 interactions (shorter training time). Accumulated rewards did not differ much among the algorithms; however, model-based RL showed more stable performance than the others.

6. CONCLUSIONS

The development of software-aided technologies has achieved many successes, and building intelligent and reliable systems is becoming more and more substantial. In self-adaptive software especially, reliability is the most fundamental goal to achieve. Recent studies have shown that an agent that wants to act competently in real-world environments requires an explicit representation of knowledge that predicts the consequences of actions. In this paper, we (1) analyzed the limitations of current research on policy evolution in self-adaptive software, (2) introduced the advantages of model-based reinforcement learning, and (3) proposed an engineering procedure for adopting the technique when building a self-adaptive software system. By surveying the latest research, we described an efficient and widely accepted solution to the problem, the Bayesian-based approach, which allows system engineers to naturally encode knowledge of uncertainty and obtain an optimal exploitation–exploration rate [14, 16, 18]. With the plentiful information resources of the software development process (e.g. requirement engineering, software specification, system design), we explained a step-by-step procedure to construct a model and design the prior knowledge. The case study and experimental results showed that, with the model-based RL approach and the provided knowledge (prior model), a self-adaptive software system can achieve higher and more stable performance than previous works.

Our future direction is to continue improving policy evolution by analyzing and adopting state-of-the-art approaches from the AI community to deal with recent issues of SAS, concentrating on performance and QoS. Transfer learning is one of the promising directions: it allows the system to transfer the knowledge (model) and experience (posterior) earned during the execution of one task to another, and hence speed up the learning of a new task in a multiple-goal system.

7. ACKNOWLEDGMENT

This work has been supported in part by the Next Generation Information Computing Development Program through the National Research Foundation of Korea (NRF), funded by the Korea Ministry of Education, Science and Technology (No. 2012033347).

8. REFERENCES

[1] Elisabetta Di Nitto, Carlo Ghezzi, Andreas Metzger, Mike Papazoglou, Klaus Pohl. A journey to highly dynamic, self-adaptive service-based applications. Automated Software Engineering, Volume 15, Issue 3-4, pp. 313-341, December 2008.
[2] Mazeiar Salehie, Ladan Tahvildari. Self-adaptive software: Landscape and research challenges. ACM Transactions on Autonomous and Adaptive Systems, Volume 4, Issue 2, May 2009.
[3] IBM. An architectural blueprint for autonomic computing. Technical report, IBM, 2003.
[4] Narges Khakpour, Ramtin Khosravi, Marjan Sirjani, Saeed Jalili. Formal analysis of policy-based self-adaptive systems. In SAC '10: Proceedings of the 2010 ACM Symposium on Applied Computing, pp. 2536-2543, 2010.
[5] David Garlan, Bradley Schmerl, and Shang-Wen Cheng. Software Architecture-Based Self-Adaptation. In Autonomic Computing and Networking, Springer Science+Business Media, LLC, 2009.
[6] David Garlan, Bradley Schmerl, and Shang-Wen Cheng. Software Architecture-Based Self-Adaptation. Autonomic Computing and Networking, pp. 31-56, 2009.
[7] Ahmed Elkhodary, Naeem Esfahani, Sam Malek. FUSION: a framework for engineering self-tuning self-adaptive software systems. In Proceedings of the Eighteenth ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE '10), pp. 7-16, 2010.
[8] German H. Alferez, Vicente Pelechano. Dynamic Evolution of Context-Aware Systems with Models at Runtime. MODELS 2012, LNCS 7590, pp. 70-86, Springer-Verlag Berlin Heidelberg, 2012.

[9] Dongsun Kim and Sooyong Park. AlchemistJ: A Framework for Self-adaptive Software. In International Federation for Information Processing, LNCS 3824, pp. 98-109, 2005.
[10] Dongsun Kim and Sooyong Park. Reinforcement Learning-Based Dynamic Adaptation Planning Method for Architecture-based Self-Managed Software. In SEAMS '09, Vancouver, Canada, May 18-19, 2009.
[11] Xiaodong Gu. IDES: Self-adaptive Software with Online Policy Evolution Extended from Rainbow. In Computer and Information Science 2012, SCI 429, pp. 181-195, Springer-Verlag Berlin Heidelberg, 2012.
[12] Sutton, R., Barto, A. Reinforcement Learning: An Introduction. MIT Press, 1998.
[13] Dearden, R., Friedman, N., Andre, D. Model based Bayesian exploration. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, San Francisco, 1999.
[14] Malcolm Strens. A Bayesian Framework for Reinforcement Learning. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML-2000), Stanford University, California, June 29-July 2, 2000.
[15] Pablo Samuel Castro. Bayesian exploration in Markov decision processes. Master's Thesis, McGill University, Montreal, Quebec, Canada, 2007.
[16] Ross, S., Pineau, J. Model-based Bayesian reinforcement learning in large structured domains. arXiv preprint arXiv:1206.3281, 2012.
[17] Christopher G. Atkeson and Juan Carlos Santamaria. A Comparison of Direct and Model-based Reinforcement Learning. In International Conference on Robotics and Automation, 1997.
[18] Nikos Vlassis, Mohammad Ghavamzadeh, Shie Mannor, and Pascal Poupart. Bayesian Reinforcement Learning. In Reinforcement Learning: State-of-the-Art, Adaptation, Learning, and Optimization, Volume 12, Springer, 2012.
[19] Istvan Szita, Csaba Szepesvari. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010.
[20] Colette Rolland, Carine Souveyet, and Camille Ben Achour. Guiding Goal Modeling Using Scenarios. IEEE Transactions on Software Engineering, Vol. 24, No. 12, December 1998.
[21] Naeem Esfahani and Sam Malek. Uncertainty in Self-Adaptive Software Systems. Self-Adaptive Systems, LNCS 7475, pp. 214-238, Springer, 2013.
[22] Norha M. Villegas, Hausi A. Muller. Managing Dynamic Context to Optimize Smart Interactions and Services. The Smart Internet, LNCS 6400, pp. 289-318, Springer, 2010.
[23] Christos Dimitrakakis, Nikolaos Tziortziotis, and Aristide Tossou. Beliefbox: A framework for statistical methods in sequential decision making. http://code.google.com/p/beliefbox/
