Reinforcement Learning-Based Dynamic Adaptation Planning Method for Architecture-based Self-Managed Software
Dongsun Kim and Sooyong Park
Sogang University, South Korea
SEAMS 2009, May 18-19, Vancouver, Canada
Outline
• Introduction
• Planning Approaches in Architecture-based Self-Management
  – Offline vs. Online
• Q-learning-based Self-Management
• Case Study
• Conclusions
Introduction
Previous Work: Fully Manual Reconfiguration
(in Architecture-Based Runtime Software Evolution, Oreizy et al. 1998)
[Figure: the architectural configuration evolves from T1 to T2]
• A system administrator can reconfigure the architecture using a console, or rely on hard-coded reconfiguration plans.
→ Difficult to react to rapid and dynamic changes.
→ What if the plan does not meet environmental and situational changes?
Previous Work: Invariants and Strategies
(in Rainbow: Architecture-Based Self-Adaptation with Reusable Infrastructure, Garlan et al. 2004)
• Invariants indicate situational changes.
• Strategies imply architectural reconfigurations for adaptation.
→ What if the strategies do not resolve the deviation of invariants?
Previous Work: Policy-Based Reconfiguration
(in Policy-Based Self-Adaptive Architectures: A Feasibility Study in the Robotics Domain, Georgas and Taylor, 2008)
• Adaptation policies pair adaptation conditions with adaptation behavior.
→ What if policies deviate from what developers (or system administrators) anticipated?
Problems
• Existing approaches
  – Statically (offline) determine adaptation policies.
  – Administrators can manually change policies when the policies are not appropriate for environmental changes.
  – Administrators must know the relationship between situations and reconfiguration strategies. → Human-in-the-loop!
  – These approaches are appropriate for applications in which adaptive actions and their consequences are readily predictable.
Requirements on Dynamic Adaptation Planning
• Low chance of human intervention
  – Some systems cannot be updated by administrators because of physical distance or low connectivity.
• High level of autonomy
  – Even if administrators can access the system during execution, in some cases they may not readily make an adaptation plan for the changed situation.
Goals
[Figure: each situation (Situation-1, Situation-2, Situation-3, …, Situation-n) is mapped to an architectural configuration (Architecture-1, Architecture-2, Architecture-3, …, Architecture-m), each composed of robot components such as Navigation Control, Wheel Controller, Localizer, PathPlanner, and Vision-based MapBuilder]
Planning Approaches to Architecture-based Self-Management
S/W System and Environment Interaction
[Figure: the software system, in state s_t and facing situation i_t, applies a reconfiguration action a_t to the environment; the environment returns a reward r_t+1 and a new state s_t+1]
Formulation
• Policy (P) (or a set of plans): indicates which configuration should be selected when a situation occurs.
• Situations (S): environmental states that the system is concerned with.
• Configurations (C): possible architectural configurations that the system can take.
• Value Function (V): evaluates the utility of a selected configuration in a given situation.
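The formulation above can be restated compactly as follows (a hedged restatement in standard notation; the paper's exact symbols may differ):

```latex
% Restatement of the formulation slide; notation is assumed, not copied verbatim from the paper.
\begin{align*}
  S &: \text{situations the system is concerned with} \\
  C &: \text{architectural configurations the system can take} \\
  V &: S \times C \to \mathbb{R}
      && \text{utility of configuration } c \text{ in situation } s \\
  P(s) &= \operatorname*{arg\,max}_{c \in C} V(s, c)
      && \text{the policy selects the best-valued configuration for } s
\end{align*}
```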
Offline Planning in Self-Management
• Predefined policies
  – Developers or administrators define the policies of the system.
  – They generate policies based on already known information prior to runtime.
  – However, in general, this information is incomplete and may become invalid as the environment changes.
  – It is difficult to monitor all details of the environment.
Online Planning in Self-Management
• Adaptive policy improvement
  – Allows incomplete information about the environment.
  – Gradually LEARNS the environment during execution.
• Minimizing human intervention
  – The system can autonomously adapt (update) its policies to the environment.
Research Questions
• RQ1: What kind of information should the system monitor?
• RQ2: How can the system determine whether its policies are ineffective?
• RQ3: How can the system change (update) its policies?
Q-learning-based Self-Management
On-line Evolution Process
[Figure: the cycle Detection → Planning → Execution → Evaluation → Learning]
• Detection: the current situation and state, together with the current architectural configuration, are passed to the planner.
• Planning: using prior experience (i.e., learning data), the planner chooses the best-so-far configuration, c_best-so-far = arg max_c Q(s, c).
• Execution: the chosen reconfiguration action is applied to the architecture.
• Evaluation: fitness functions evaluate the reconfigured architecture and produce a reward.
• Learning: after execution, the evaluated reward is used to update the existing Q(s, c_i) value (i.e., reinforcement).
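A minimal Java sketch of one pass through this cycle, assuming a tabular Q-function indexed by (situation, configuration) pairs; the class and method names (EvolutionCycle, detectSituation, applyConfiguration, and so on) are illustrative placeholders, not the authors' implementation:

```java
import java.util.HashMap;
import java.util.Map;

/** One pass through the detect-plan-execute-evaluate-learn cycle (illustrative sketch, not the authors' code). */
public class EvolutionCycle {
    // Q(s, c): learned utility of architectural configuration c in situation s,
    // keyed here as "situation|configuration" for simplicity.
    private final Map<String, Double> q = new HashMap<>();

    public void step() {
        String situation = detectSituation();        // Detection: monitor the environment
        String config = chooseBestSoFar(situation);  // Planning: arg max_c Q(s, c), plus occasional exploration
        applyConfiguration(config);                  // Execution: architectural reconfiguration
        double reward = evaluateFitness();           // Evaluation: fitness functions yield a reward
        reinforce(situation, config, reward);        // Learning: update Q(s, c) with the observed reward
    }

    // Placeholders standing in for monitoring, planning, reconfiguration, and evaluation.
    private String detectSituation() { return "signal-lost"; }
    private String chooseBestSoFar(String situation) { return "config-1"; }
    private void applyConfiguration(String config) { /* apply the reconfiguration action */ }
    private double evaluateFitness() { return 0.0; }

    // Simplified reinforcement; the full Q-learning rule appears later in the talk.
    private void reinforce(String s, String c, double r) {
        q.merge(s + "|" + c, r, (old, reward) -> old + 0.3 * (reward - old));
    }
}
```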
Detecting Situations and States
• Situations are a set of events that the system is interested in (RQ1).
  – e.g., 'signal-lost', 'hostile-detected'
• States are a set of status information observable from the environment (RQ1).
  – e.g., distance = {near, far}, battery = {low, mid, full}
• Identifying states and situations
  – They can be identified from system execution scenarios.
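As a concrete illustration, the example situations and state variables above could be encoded as plain enumerations (a sketch; the paper does not prescribe this representation):

```java
/** Example encodings of the situations and state variables above (illustrative; not prescribed by the paper). */
enum Situation { SIGNAL_LOST, HOSTILE_DETECTED /* ... other events of interest */ }

enum Distance { NEAR, FAR }
enum Battery  { LOW, MID, FULL }

/** An observable environment state is one value per state variable. */
record EnvironmentState(Distance distance, Battery battery) { }
```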
Planning
• When a situation is detected,
  – the system should determine what it must do (i.e., reconfiguration actions);
  – state information may influence the decision.
• Exploration vs. Exploitation
  – Usually the system uses the current policy.
  – With some probability, it must try another action, because the current policy may not be the best one or the environment may have changed.
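A common way to balance exploration and exploitation is an ε-greedy rule: with probability ε pick a random configuration, otherwise exploit the best-so-far one. A sketch, assuming a Q-table keyed by "situation|configuration" strings (the class name EpsilonGreedyPlanner is hypothetical; ε = 0.5 matches the online-policy experiments):

```java
import java.util.List;
import java.util.Map;
import java.util.Random;

/** epsilon-greedy configuration selection (illustrative sketch). */
class EpsilonGreedyPlanner {
    private final double epsilon;   // exploration probability, e.g. 0.5 in the online experiments
    private final Random rng = new Random();

    EpsilonGreedyPlanner(double epsilon) { this.epsilon = epsilon; }

    String choose(String situation, List<String> configs, Map<String, Double> q) {
        if (rng.nextDouble() < epsilon) {                    // explore: the current policy may not be the best,
            return configs.get(rng.nextInt(configs.size())); // or the environment may have changed
        }
        String best = configs.get(0);                        // exploit: arg max_c Q(situation, c)
        for (String c : configs) {
            if (q.getOrDefault(situation + "|" + c, 0.0)
                    > q.getOrDefault(situation + "|" + best, 0.0)) {
                best = c;
            }
        }
        return best;
    }
}
```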
Execution
• Reconfiguration
  – According to the policy chosen in the planning phase, the system changes its configuration.
Evaluation
• Fitness (reward)
  – After a series of executions with the reconfigured architecture, the system should evaluate its effectiveness (RQ2).
  – The fitness function can be identified from the goal of the system. → e.g., maximize survivability, minimize latency.
  – Systems can have several fitness functions, and they can be merged. → e.g., a weighted sum.
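One standard way to merge several fitness functions into a single reward is a weighted sum, F = Σ_i w_i · f_i. A sketch with hypothetical objectives (survivability to maximize, latency to minimize); the names and weights are illustrative, and in the case study the fitness is simply the Robocode score:

```java
/** Weighted-sum combination of several fitness functions (illustrative). */
class WeightedFitness {
    // Hypothetical objectives and weights, not taken from the paper.
    static double combine(double survivability, double latency, double wSurvive, double wLatency) {
        // Higher survivability is better; lower latency is better, so it enters with a negative sign.
        return wSurvive * survivability - wLatency * latency;
    }

    public static void main(String[] args) {
        double reward = combine(0.9, 120.0, 1.0, 0.01);
        System.out.println("combined fitness = " + reward);
    }
}
```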
Learning
• Updating policies (RQ3)
  – REINFORCES the current knowledge on selecting reconfiguration actions for observed situations.
• Updating rule (standard Q-learning):
  Q(s, c) ← Q(s, c) + α [ r + γ · max_c' Q(s', c') − Q(s, c) ]
  where r is the observed fitness (reward), s the detected situation, and c the chosen reconfiguration action.
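The update rule above maps directly onto a tabular implementation; a sketch in Java, where alpha and gamma correspond to the learning-rate and discount parameters reported in the experiments (the class name QLearner and the string-keyed table are illustrative assumptions):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Tabular Q-learning update over (situation, configuration) pairs (illustrative sketch). */
class QLearner {
    private final Map<String, Double> q = new HashMap<>();
    private final double alpha;   // learning rate, e.g. 0.3 in the case study
    private final double gamma;   // discount factor, e.g. 0.7 in the case study

    QLearner(double alpha, double gamma) { this.alpha = alpha; this.gamma = gamma; }

    /** Q(s,c) <- Q(s,c) + alpha * (r + gamma * max_c' Q(s',c') - Q(s,c)) */
    void update(String s, String c, double reward, String sNext, List<String> configs) {
        double old = q.getOrDefault(s + "|" + c, 0.0);
        double bestNext = configs.stream()
                .mapToDouble(cn -> q.getOrDefault(sNext + "|" + cn, 0.0))
                .max().orElse(0.0);
        q.put(s + "|" + c, old + alpha * (reward + gamma * bestNext - old));
    }
}
```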
Case Study
Implementation
• Robots in Robocode
• 4 situations and 4 state variables
• 16 reconfiguration actions
• Fitness: score
• Environments: enemy robots
• Experiments outline
  – Static (offline) policy
  – Adaptive online policy
  – Online policy in changing environments: offline (known environment) → offline (new environment) → online
With Static (Offline) Policy (alpha=0.3, gamma=0.7, epsilon=1.0)
[Chart: Score vs. Round (×10) for A Robot and B Robot over rounds 1–10; trend line y = -19.0x + 2,721.1]
With Online Policy (alpha=0.3, gamma=0.7, epsilon=0.5)
[Chart: Score vs. Round (×10) for A Robot and B Robot over rounds 1–10; trend line y = 74.9x + 2,361.5]
Revisited (alpha=0.3, gamma=0.7, epsilon=0.0)
[Chart: Score vs. Round (×10) for A Robot and B Robot over rounds 1–20; trend line y = -4.7x + 2,813.3]
Enemy Changed (Static Policy) (alpha=0.3, gamma=0.7, epsilon=0.0)
[Chart: Score vs. Round (×10) for A Robot and C Robot over rounds 1–20; trend line y = 4.8x + 2,234.7]
Online Policy Again (alpha=0.3, gamma=0.7, epsilon=0.5)
[Chart: Score vs. Round (×10) for A Robot and C Robot over rounds 1–20; trend line y = 12.4x + 2,534.0]
Conclusions
• Making a decision in architecture-based adaptive software systems
  – Offline vs. Online
• An RL-based (Q-learning) approach
  – Enables adaptive policy control.
  – When the environment has changed, the system can adapt its policies for architectural configuration to the changed environment.
Further Study
• Better learning
  – Q-learning might not be the best choice for adaptive software systems.
  – Learning techniques have a large number of parameters, e.g., balancing 'exploration' and 'exploitation'.
• Better representations
  – Of situations, configurations, and rewards.
  – A better representation may lead to faster and more effective adaptation.
• Scalability
  – Avoiding state explosion.
• Need for best practices
  – Comprehensive and common examples.
Q&A