2014 IEEE International Conference on Web Services

Adaptive and Dynamic Service Composition via Multi-Agent Reinforcement Learning

Hongbing Wang∗, Qin Wu∗, Xin Chen∗, Qi Yu†, Zibin Zheng‡, Athman Bouguettaya§
∗ School of Computer Science and Engineering, Southeast University, Nanjing, China, 211189
[email protected], [email protected], [email protected]
† College of Computing and Information Sciences, Rochester Institute of Technology, Rochester, NY, USA, [email protected]
‡ Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China, [email protected]
§ School of Computer Science and Information Technology, RMIT, Australia, [email protected]

Abstract—In the era of big data, data-intensive applications have posed new challenges to the field of service composition, namely composition efficiency and scalability. How to compose massive and evolving services in such dynamic scenarios is a vital problem demanding prompt solutions. We therefore propose a new model for large-scale adaptive service composition in this paper. The model integrates reinforcement learning, to address adaptability in a highly dynamic environment, with game theory, used to coordinate the agents' behavior toward a common task. In particular, a multi-agent Q-learning algorithm for service composition based on this model is also proposed. The experimental results demonstrate the effectiveness and efficiency of our approach and show better performance compared with the single-agent Q-learning method.

overcome these difficulties for large-scale adaptive service composition in a highly dynamic environment. In recent years, many researchers have shown considerable interest in adaptive service composition and have proposed solutions based on integer programming, graph planning, reinforcement learning (RL), and so on. [1] proposed a multichannel model based on integer programming to optimize service selection for adaptive service composition. However, such a method demands a prohibitive amount of computation time for large-scale problems. [2] modeled service composition as a partially effective planning graph and replaced undesired services during the execution of the composition. However, the continuous appearance and disappearance of services requires a sustained search for feasible services to update the corresponding planning graph, which is ill-suited to a highly dynamic environment. [3] implemented adaptation during composition execution so as to retain an optimal and user-satisfying combination scheme. [4] tried to address large-scale service composition via multiple agents. Nevertheless, these approaches struggle to systematically satisfy large-scale, dynamic, and adaptive application demands, so more feasible and effective methods are urgently needed. RL [5] learns by trial-and-error interaction with a dynamic environment and thus offers good self-adaptability; incorporating it into the execution of service composition allows the system to dynamically choose the best service without complete knowledge of the environment. However, in complex workflows with massive numbers of candidate services, conventional RL methods cannot guarantee good efficiency in a distributed environment. Multi-agent technologies then arise as a viable solution. To achieve self-adaptability in a dynamic environment while maintaining efficiency in the face of massive candidate services, we naturally resort to multi-agent reinforcement learning (MARL), closely related to game theory, which has been developed in the field of Distributed Artificial

Keywords-service composition; multi-agent systems; reinforcement learning; game theory

I. INTRODUCTION
Web services are a promising means of implementing service-oriented architecture (SOA) thanks to their inherent features, such as being self-contained and self-describing, and the fact that they can be published and invoked over the Internet. With web services increasing dramatically, service composition has gained considerable momentum as a means of combining simple services to accomplish complex tasks that meet practical demands. In practical applications, under the premise of satisfying functional demands, composition efficiency and adaptability are particularly important. On the one hand, certain quality of service (QoS) attributes evolve continuously, so a composition method must constantly adapt to those changes; that is, adaptability is necessary to achieve valid service composition in a highly dynamic environment. On the other hand, massive numbers of candidate services consume great computing power, especially in a constantly changing environment, so efficiency is an inevitable and urgent problem. In brief, we should invest enormous efforts to

1 This work is partially supported by NSFC Key Project (No. 61232007) and the Doctoral Fund of the Ministry of Education of China (No. 20120092110028).



composition based on reinforcement learning combined with logic preferences, which is not applicable to large-scale, complicated composition cases. [4] proposed a multi-agent learning model based on an MDP (Markov Decision Process), which is not well suited to a multi-agent solution. In the field of DAI, [11] proposed a minimax-Q learning algorithm for two-player zero-sum games with strict constraints; consequently, it does not perform well in general-sum scenarios. [12] proposed the Nash-Q learning algorithm, a generalization of Q-learning to general-sum games. Yet Nash-Q faces the difficult problem of deterministically selecting a unique equilibrium when multiple Nash equilibria exist. To solve the problem of equilibrium selection, researchers have come up with many ingenious methods. [13] proposed the joint action learner (JAL) algorithm by introducing fictitious play, in which the probability of a given action in the next stage game is assumed to be its past empirical frequency, but without any guarantee of converging to the optimal equilibrium. [14] proposed optimal adaptive learning (OAL), which is guaranteed to converge to an optimal equilibrium, but its computational cost limits its practical application. In this paper, we employ a new model named Web Service Composition Team Markov Game (WSC-TMG), based on MARL, to implement adaptive service composition. It differs significantly from similar works in that we utilize the coordination equilibrium and a fictitious play process to ensure that the agents converge to a unique equilibrium, and we propose a multi-agent algorithm tailored to multi-agent service composition.

Intelligence (DAI). There have been some attempts to incorporate MARL into service composition. Besides [4], [6] proposed a model based on Markov games and a hierarchical goal structure, but that method may not work well for a highly complicated goal with many mutual dependencies between sub-goals, as its agents are fixed to certain service classes. In this paper, to achieve better adaptability and scalability for service composition in a highly dynamic environment, we propose a new model based on team Markov games. We utilize the coordination equilibrium and a fictitious play process to ensure that the agents converge to a unique equilibrium when multiple equilibria exist, which is the common problem of agent coordination and equilibrium selection in multi-agent scenarios. We also propose a multi-agent Q-learning algorithm, and prove its convergence, to implement this model. Our contributions can be listed as follows: 1) We introduce a new model for web service composition in highly dynamic and complex environments. 2) We propose a multi-agent Q-learning algorithm that fits the multi-agent service composition scenario and achieves better performance. 3) The entire framework caters to the trend toward distributed environments and the big-data era. The remainder of this paper is organized as follows. Section II briefly reviews the state of the art related to our work. Section III introduces the WSC-TMG model. Section IV presents the experimental analysis. Finally, Section V concludes the paper and discusses future work.

II. RELATED WORK

III. ABOUT THE WSC-TMG MODEL


In our previous work [10], we modeled service composition as a WSC-MDP (Web Service Composition Markov Decision Process), which can be visualized as a transition graph. Due to the limited space of this paper, the WSC-MDP is not described in detail here. It is worth noting that the WSC-MDP transition graph can be created either manually by engineers or by automatic composition approaches.
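For concreteness, the sketch below shows one possible in-memory representation of such a WSC-MDP transition graph. The paper does not prescribe a concrete encoding, so the class name, fields, and the toy vacation-plan fragment are illustrative assumptions only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class WSCMDPGraph:
    """Illustrative container for a WSC-MDP transition graph (assumed layout)."""
    initial_state: str
    terminal_states: set
    # transitions[state] -> list of (candidate service, resulting state)
    transitions: Dict[str, List[Tuple[str, str]]] = field(default_factory=dict)

    def services(self, state: str) -> List[str]:
        """Candidate services (actions) that can be invoked at a state."""
        return [svc for svc, _ in self.transitions.get(state, [])]


# A toy fragment loosely inspired by the vacation-plan example of Fig. 1.
graph = WSCMDPGraph(
    initial_state="s0",
    terminal_states={"s9"},
    transitions={
        "s0": [("VacationPlace", "s1")],
        "s1": [("Airfare", "s2"), ("LuxuryCost", "s3")],
    },
)
print(graph.services("s1"))  # ['Airfare', 'LuxuryCost']
```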

In this section, we give an overview of related work that integrates service composition with RL, software agents, or multi-agent system techniques. [7] combined a simple architecture with a novel RL-based algorithm to enable open, distributed, multi-criteria-driven service composition at runtime according to specialized requirements. [3] proposed a service composition algorithm based on RL and preference logic reasoning, according to services' functionality and QoS, but the solution is less efficient when faced with massive numbers of candidate services. [8] integrated web service and software agent technologies into one cohesive entity to address the distributed nature of web service composition. [3], [9] used agents both as the learners of the learning algorithm and as the executors of the service composition. [6] adopted a Markov game for a cooperative multi-agent system (MAS) and Q-learning with a hierarchical goal structure to accelerate the search of the state space during learning, but the efficiency of the learning procedure remains an intractable problem. [10] proposed an adaptive service

A. WSC-TMG
To improve efficiency and scalability when faced with a large number of candidate services, we propose a new model, WSC-TMG, tailored to the multi-agent scenario. Before giving the definition of this model, we first introduce some relevant definitions.
Definition 1 (Candidate Initial State). The joint state S0 = s1 × s2 × ... × sn (where si is the state of the ith agent in the team, i = 1, ..., n) is a candidate initial state iff si = s0 (1 ≤ i ≤ n), where s0 is the initial state of the original WSC-MDP transition graph.


Definition 2 (Possible Terminal State). The joint state Sx = s1 × s2 × ... × sn (si is the state of the ith agent in the team, i=1,...,n) is a possible terminal state iff si = sτ (1 ≤ i ≤ n), where sτ is among terminal states of the original WSC-MDP transition graph.

When the set of services corresponding to the joint action is invoked and the environment has changed into the resulting state s′, the team in charge of the service composition receives an immediate reward R(s′ | s, A(s)) according to the feedback of this execution.

Definition 3 (Passed State Set). The set Sp is a passed state set iff Sp contains all the states that agents in the team have passed by.
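To illustrate how Definitions 2 and 3 are used later in the terminal-condition check of Algorithm 1, the sketch below walks backward from a possible terminal state through the passed state set until a candidate initial state is reached. The function name, the predecessor map, and the string encoding of joint states are assumptions made for illustration.

```python
def episode_terminated(possible_terminal, passed_states, predecessors, initial_states):
    """Backward traversal (cf. step 3 of Algorithm 1): check whether the possible
    terminal state is connected back to a candidate initial state via passed states."""
    frontier = {possible_terminal}
    visited = set()
    while frontier:
        prev = set()
        for state in frontier:
            prev |= predecessors.get(state, set())
        if prev & initial_states:
            return True                      # reached a candidate initial state
        visited |= frontier
        frontier = (prev & passed_states) - visited
    return False                             # traversal died out: not a true termination


# Toy usage with hypothetical joint states encoded as strings.
preds = {"S3": {"S2"}, "S2": {"S1"}, "S1": {"S0"}}
print(episode_terminated("S3", {"S0", "S1", "S2", "S3"}, preds, {"S0"}))  # True
```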

For a better understanding of this model, consider Fig. 1, which shows part of the WSC-TMG transition graph for a vacation plan in a 3-agent scenario. A WSC-TMG can be visualized as a multi-dimensional transition network evolved from the WSC-MDP transition graph. The upper part of Fig. 1 is a three-dimensional transition graph that represents the current states of the three agents and corresponds to the bottom part, which is the primitive WSC-MDP transition graph. The Candidate Initial State is s = s0 × s2 × s7, where Agent 1 is at state s0 (which is also the initial state of the WSC-MDP), Agent 2 is at state s2, and Agent 3 is at state s7. A(S1) indicates the actions (e.g., Vacation Place) that the agent can choose to execute, and A(S1) × A(S3) × A(S8) constitutes the joint action set. A traditional service composition finally corresponds to a unique workflow determined by a deterministic policy [10]. Our objective is to choose an optimal one among multiple workflows. A service workflow is defined in Definition 5 and illustrated in Fig. 2 for the multi-agent scenario.

According to the above definitions, we can now give our composition model.


Definition 4 (WSC-TMG). A Web Service Composition Team Markov Game is a 7-tuple WSC-TMG = <a, S, S0, Sx, A, T, R>, where
—a is the set of agents.
—S is the discrete set of environment states.
—S0 is the set of all candidate initial states. An execution of the service composition starts from one candidate initial state in S0.
—Sx is the set of all possible terminal states. Upon arriving at one possible terminal state in Sx, an execution of the service composition has a possibility to terminate.
—A(s) = A1(S1) × A2(S2) × ... × An(Sn) is the finite set of joint actions that can be executed in joint state s ∈ S, where Ai(Si) (i = 1, ..., n) is the discrete set of actions available to the ith agent at its current state Si.
—T: S × A × S → [0, 1] is the state transition probability function. When the set of services corresponding to a joint action is invoked, the world makes a transition from its current state s to the resulting state s′; the probability of this transition is denoted P(s′ | s, A(s)).
—R: S × A → ℝ is the reward function for all the agents in the team; in other words, they share a common payoff.
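A minimal sketch of how the 7-tuple of Definition 4 might be held in code follows; the type aliases and field names are assumptions made for illustration, not notation from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Tuple

JointState = Tuple[str, ...]    # one component state per agent
JointAction = Tuple[str, ...]   # one invoked service per agent


@dataclass
class WSCTMG:
    """Illustrative container for the 7-tuple <a, S, S0, Sx, A, T, R>."""
    agents: Tuple[str, ...]
    states: FrozenSet[JointState]
    candidate_initial_states: FrozenSet[JointState]
    possible_terminal_states: FrozenSet[JointState]
    joint_actions: Dict[JointState, FrozenSet[JointAction]]             # A(s)
    transition: Callable[[JointState, JointAction, JointState], float]  # T(s, A(s), s')
    reward: Callable[[JointState, JointAction], float]                  # shared team reward R
```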


Figure 1. A part of the WSC-TMG of a composite service for a vacation plan.


Definition 5 (Service Workflow). WF is a service workflow iff there is at most one joint action that can be invoked at each state; in other words, ∀s ∈ WF, |A(s) ∩ WF| ≤ 1.
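To make the workflow condition of Definition 5 concrete, the check below verifies that a candidate workflow retains at most one joint action per state. The dictionary encoding of the workflow and of A(s) is an assumption for illustration.

```python
def is_service_workflow(workflow_actions, joint_actions):
    """Definition 5 (sketch): a workflow keeps at most one invocable joint action per state.

    workflow_actions: dict mapping each joint state in the workflow to the set of
                      joint actions the workflow retains there (assumed encoding).
    joint_actions:    dict mapping each joint state s to its full action set A(s).
    """
    return all(
        len(joint_actions.get(state, set()) & kept) <= 1
        for state, kept in workflow_actions.items()
    )


# Toy check: state "s" offers two joint actions, the workflow keeps exactly one.
print(is_service_workflow({"s": {("a1", "b1")}},
                          {"s": {("a1", "b1"), ("a2", "b2")}}))  # True
```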

Figure 2. One workflow contained by the WSC-TMG in Fig. 1.

B. Reward Assessment
Each deterministic policy uniquely determines a workflow of a WSC-TMG, and the service customer receives a certain amount of reward by executing that workflow, namely the cumulative reward of all the executed services. The optimal policy identifies the workflow that attains the maximum cumulative reward; in this way, our composition solution can always determine an optimal policy in a highly dynamic environment. To maximize user satisfaction, we construct the reward function on the basis of various QoS values. First, we compute the reward of a single service invoked by one agent in the team. Given that the units differ across QoS attributes, we need to standardize the values


of different attributes and map them into the interval [0, 1]; the standardized value is denoted V′. Eq. 1 gives a single agent's reward value aggregating the various QoS attributes, where m is the number of QoS attributes, wi is the weight of each attribute according to the user's preference (the weights sum to 1), and V′i is the standardized value of the ith QoS attribute.

r(s) = \sum_{i=1}^{m} w_i \cdot V'_i    (1)

Then we define the joint reward function for the agent team in Eq. 2, which simply aggregates the reward values of the services invoked by all agents in the team at one step, where n is the number of agents and s is the current joint state of the team. As learning progresses, this value changes with the evolution of the joint state. For n agents, we adopt a reward matrix r_{i_1, i_2, \ldots, i_n} to store the immediate reward computed by Eq. 2, where i_k (1 ≤ k ≤ n) indicates the individual action chosen by the kth agent.

R(s) = \sum_{i=1}^{n} \sum_{j=1}^{m} w_{ij} \cdot V'_{ij}    (2)

In the conventional Q-learning method, the Q-value Q(s, a) provides an estimate of the value of performing an (individual or joint) action, and the learner updates its estimate Q(s, a) as follows:

Q(s, a) \leftarrow (1 - \alpha) \cdot Q(s, a) + \alpha \cdot \bigl( r + \gamma \cdot \max_{a'} Q(s', a') \bigr)    (3)

In the multi-agent scenario, we need to modify the form of Q(s, a) employed in the WSC-TMG. First, to facilitate the formal representation, we give the form of r_{i_1, i_2, \ldots, i_n} at joint state s in Eq. 4, where \bar{i}_k (1 ≤ k ≤ n) denotes the reduced joint action formed by every agent's action except that of agent i_k, and \Phi(i_k) is the collection of the reduced joint actions of agent i_k's teammates.

r_{i_1, i_2, \ldots, i_n}(s) = r_{i_1, \bar{i}_1}(s) = r_{i_1, j}(s), \quad j \in \Phi(i_1)    (4)
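A minimal sketch of the reward construction in Eqs. 1 and 2 above, assuming the QoS values have already been mapped to [0, 1]; the function and variable names are illustrative.

```python
def single_agent_reward(qos_values, weights):
    """Eq. 1 (sketch): weighted sum of standardized QoS values for one invoked service.
    qos_values and weights are parallel lists; the weights are assumed to sum to 1."""
    return sum(w * v for w, v in zip(weights, qos_values))


def team_reward(per_agent_qos, per_agent_weights):
    """Eq. 2 (sketch): aggregate the rewards of the services invoked by all agents
    of the team at one step; this value would populate the reward matrix."""
    return sum(
        single_agent_reward(qos, w)
        for qos, w in zip(per_agent_qos, per_agent_weights)
    )


# Two agents, two QoS attributes each (already normalized to [0, 1]).
print(round(team_reward([[0.8, 0.6], [0.5, 0.9]],
                        [[0.7, 0.3], [0.5, 0.5]]), 2))  # 1.44
```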

Then we give the Q(s, a) function for the multi-agent scenario in Eq. 5, where s′ is the successor state of s:

Q_{i,j}(s) \leftarrow (1 - \alpha) \cdot Q_{i,j}(s) + \alpha \bigl( r_{i,j}(s) + \gamma \cdot \max Q_{i,j}(s') \bigr)    (5)

In addition, to take future benefits into consideration, we need to compute and record the cumulative rewards, just like Q(s, a) in Q-learning. A cumulative reward matrix Q_{i_1 i_2 \ldots i_n}(s) therefore comes into being; it is computed by Eq. 5 and represented as Q_{i_1, \bar{i}_1}(s) or Q_{i_1, j}(s), analogously to the reward matrix.

C. Equilibrium Coordination

How to define a deterministic optimal policy in our framework is a nontrivial challenge. For many Markov games there is no undominated policy, as the performance depends critically on the behavior of the other agents in the environment. Let π(s, a) denote the probability assigned to action a in state s by policy π, and let Q^π(s, a) denote the expected discounted future reward of an agent that starts from state s, executes action a for one step, and follows π thereafter. In an n-player Markov game, π_1, ..., π_n are the stationary policies of the players (agents). The Q-function of agent i (1 ≤ i ≤ n) can be represented as Q_i^π(s, a_1, ..., a_n). Further, the best-response Q-function of the ith agent can be derived from Q_i^π(s, a_1, ..., a_n) as follows:

Q_i^*(s, a_1, \ldots, a_n) = R_i(s, a_1, \ldots, a_n) + \gamma \sum_{s' \in S} T(s, a_1, \ldots, a_n, s') \cdot
    \max_{a'_i \in A_i} \sum_{a'_1, \ldots, a'_{i-1}, a'_{i+1}, \ldots, a'_n} \pi_1(s', a'_1) \cdots \pi_{i-1}(s', a'_{i-1}) \, \pi_{i+1}(s', a'_{i+1}) \cdots \pi_n(s', a'_n) \, Q_i^*(s', a'_1, \ldots, a'_n)    (6)

The key idea of Q_i^π(s, a_1, ..., a_n) is to hold all policies fixed except π_i and allow the ith agent to choose optimal actions that maximize its reward, which in fact resembles a single-agent setting. If each agent's strategy choice is a best response to the others' choices, the agents are said to have reached a Nash equilibrium [15]; namely, the value obtained by each agent from any state equals its best response. A coordination equilibrium is a form of Nash equilibrium in which all agents work toward a common goal to maximize the team's payoff. Clearly, a team Markov game admits a coordination equilibrium, since all the players involved strive for a common task: for every state s and all a_1 ∈ A_1, a_2 ∈ A_2, ..., the rewards satisfy r_1(s, a_1, ..., a_n) = ... = r_n(s, a_1, ..., a_n). Consequently, a single reward function R(s, a_1, ..., a_n) = r_1(s, a_1, ..., a_n) = ... = r_n(s, a_1, ..., a_n) can be used as the common payoff of the whole team. The optimal policy π* can then be characterized through the coordination equilibrium, for all states s of the team Markov game, as in Eq. 7:

\pi^*(s, a_1, \ldots, a_n) = \sum_{a_1, \ldots, a_n} \pi_1(s, a_1) \cdots \pi_n(s, a_n) \, Q(s, a_1, \ldots, a_n) = \max_{a_1, \ldots, a_n} Q(s, a_1, \ldots, a_n)    (7)

According to these derivations and formulas, the problem of seeking the optimal policy in team Markov games can be turned into the familiar problem of maximizing the Q-value, which is already solved by the Q-learning algorithm. Therefore, the key problem becomes how to ensure that the team Markov game converges to a unique coordination equilibrium deterministically.
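The sketch below shows, under stated assumptions, a tabular update in the spirit of Eq. 5 together with the greedy joint-action choice implied by maximizing the Q-value. Indexing the table by (own action, reduced joint action of the teammates) and all identifiers are illustrative assumptions, not the paper's exact data structures.

```python
from collections import defaultdict


def q_update(Q, state, own_action, others_action, reward, next_state, alpha, gamma):
    """Tabular update in the spirit of Eq. 5; Q[state][(own action, reduced joint action)]."""
    best_next = max(Q[next_state].values(), default=0.0)
    key = (own_action, others_action)
    Q[state][key] = (1 - alpha) * Q[state][key] + alpha * (reward + gamma * best_next)


def greedy_joint_action(Q, state):
    """Pick the (own action, reduced joint action) pair with maximal Q-value,
    i.e. the joint action a coordinated team policy would agree on (cf. Eq. 7)."""
    return max(Q[state], key=Q[state].get)


# Q[state][(own_action, others_action)] -> value
Q = defaultdict(lambda: defaultdict(float))
q_update(Q, "s0", "Airfare", ("Hotel",), reward=0.74, next_state="s1",
         alpha=0.6, gamma=0.9)
print(greedy_joint_action(Q, "s0"))  # ('Airfare', ('Hotel',))
```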


Nash's Existence Theorem [16] already guarantees the existence of a Nash equilibrium (and hence of a coordination equilibrium) in our scenario. So how do we determine a unique coordination equilibrium? Here we utilize an indirect coordination method to solve the problem of equilibrium selection, which biases action selection toward actions that are likely to result in good rewards. The likelihood of good rewards is evaluated by models of the other agents estimated by the learner, by statistics of the values observed in the past, and so on. With the help of such methods, the team participants can coordinate and converge to a unique coordination equilibrium. Fictitious play is a simple and well-understood indirect coordination method for equilibrium selection in game theory [13]. By estimating the empirical distribution of the other agents' past actions, an agent can figure out a best response so far. Suppose that C_j^{a_j} is the frequency with which agent j has invoked action a_j in the past, where a is the set of agents, j ∈ a, and a_j ∈ A_j (A_j is the set of actions available to the jth agent). Then agent i assumes that agent j plays action a_j with the following probability:

\Pr_i^{a_j} = \frac{C_j^{a_j}}{\sum_{b_j \in A_j} C_j^{b_j}}    (8)

Algorithm 1: Multi-agent Q-learning based on WSC-TMG

Initialization: ∀s ∈ S, s = s1 × s2 × ... × sn, where si (1 ≤ i ≤ n) is the individual state of agent i and n is the number of agents.
  Initialize the reward matrix: r_{i_1 i_2 \ldots i_n}(s)_{m_1 × m_2 × ... × m_n} = 0.
  Initialize the cumulative reward matrix: Q_{i_1 i_2 \ldots i_n}(s)_{m_1 × m_2 × ... × m_n} = 0.
  Initialize the number of visits n(s) to 1 (to avoid "ln 0").
  Set the length of the queue for action storage to m and initialize the queue.

Repeat (for each episode):
  1. Learning of the coordination policy.
     If n(s) ≤ m, select an action randomly; otherwise select action x at state s with exploitation probability
       \Pr(x \mid s, t, Q) = \frac{e^{\frac{\ln n_t(s)}{C_t(s)} WEQ(s, x)}}{\sum_{y \in A} e^{\frac{\ln n_t(s)}{C_t(s)} WEQ(s, y)}},
       where WEQ(s, A_i) = \sum_{A_j \in \Psi(s, A_{-i}),\, 1 \le j \le n,\, j \ne i} \frac{K_t^m(A_j)}{k} \, Q_{i,j}(s).
  2. Off-policy learning of the game structure Q.
     Observe the others' reduced joint action y, i.e. the joint action excluding the chosen action x.
     Observe the state transition s → s′ and the payoff r_{i,j}(s) under the joint action A = <x, y>.
     If A_i = x and the reduced joint action equals y, update
       Q_{i,j}(s) ← (1 − α) Q_{i,j}(s) + α (r_{i,j}(s) + γ max Q_{i,j}(s′)),
     where α is the learning rate, defined as α = \frac{Const}{Const + n(s)} β with 0 < β < 1 and Const a positive constant; the subscript j denotes the reduced joint action of agent i's teammates.
     Update the storage queue; add each si (1 ≤ i ≤ n) of s to the passed state set Sp; set n(s) = n(s) + 1.
  3. Terminal-condition check.
     If s′ is a possible terminal state, s′ = s1 × s2 × ... × sn with si ∈ τ (1 ≤ i ≤ n):
       Create a set Temp = {s′} and a set Prev containing all the previous states of any element in Temp.
       While Sp ∩ Temp ≠ ∅ and s0 ∉ Prev:
         Temp ← Sp ∩ Temp;
         Prev ← all the previous states of any element in Temp.
       If Prev contains a candidate initial state S0, the episode ends.
Until the cumulative reward matrix converges.



The set of strategies estimated through the fictitious play process forms a reduced profile of the opponents, against which the ith agent can adopt a best-response action. After the current round of play, agent i updates its counts C_j^{a_j} according to the actions taken by the others in the last round. In a sense, the C_j^{a_j} reflect the beliefs an agent holds given the historical choices of the others. Dov Monderer gives the definition of the Fictitious Play Property and proves the following theorem in his work [17].
Definition 6 (Fictitious Play Property). A game has the fictitious play property (FPP) if every fictitious play process converges in beliefs to an equilibrium.
Theorem 1: Every game with identical payoff functions has the fictitious play property.


Due to space limitations, we omit the proof. In view of Theorem 1, we can deduce the corollary that a team game in which the agents have common interests has the fictitious play property. Hence, the fictitious play process can be applied in team Markov games and helps the agents converge to a unique equilibrium despite the existence of multiple Nash equilibria.
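A minimal sketch of the empirical opponent model behind Eq. 8 and the belief update it implies; the class and method names are assumptions for illustration.

```python
from collections import Counter, defaultdict


class FictitiousPlayModel:
    """Illustrative empirical model of the opponents' action frequencies (cf. Eq. 8)."""

    def __init__(self):
        self.counts = defaultdict(Counter)   # counts[opponent][action] = C_j^{a_j}

    def observe(self, opponent, action):
        """After each round, record the action the opponent actually played."""
        self.counts[opponent][action] += 1

    def probability(self, opponent, action):
        """Eq. 8 (sketch): estimated probability that the opponent plays `action` next."""
        total = sum(self.counts[opponent].values())
        return self.counts[opponent][action] / total if total else 0.0


model = FictitiousPlayModel()
for a in ["Airfare", "Airfare", "LuxuryCost"]:
    model.observe("Agent2", a)
print(model.probability("Agent2", "Airfare"))  # 0.666...
```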

To improve the efficiency of the fictitious play process, [18] proposed a modified method and proved its validity. Taking advantage of Young's approach, we propose a new


formula, represented in Eq. 9 and called the weighted expectation of the Q-value (abbr. WEQ; the weights are the probabilities of the opponents' possible joint actions at joint state s), for estimating the cumulative reward of a joint action in the WSC-TMG model. It combines the Q-value and the fictitious play process:

WEQ(s, A_i) = \sum_{A_j \in \Psi(s, A_{-i}),\, 1 \le j \le n,\, j \ne i} \frac{K_t^m(A_j)}{k} \, Q_{i,j}(s)    (9)

Here K_t^m(A_j)/k is a probability model for agent i at joint state s based on the fictitious play process: it indicates the frequency with which agent i's opponents take the reduced joint action A_j among the k samples, t is the number of visits to state s, and m is the length of the queue that stores the reduced joint actions of agent i's opponents in chronological order. Ψ(s, A_{-i}) is the best response to the joint action of agent i's opponents at joint state s, and Q_{i,j}(s) is the cumulative reward of the team's joint action at state s. We now incorporate the WEQ function into the learning policy for the team Markov game so as to select appropriate behavior at each step of the learning process. In the field of RL, the learning policy cannot be chosen arbitrarily, as it must meet certain constraints. [19] pointed out that if the learning policy of an algorithm satisfies the GLIE requirement ("greedy in the limit with infinite exploration"), then each action is executed infinitely often in every state that is visited infinitely often, and in the limit the policy greedily chooses the current best response. The Boltzmann learning policy is a commonly used GLIE policy and fits our coordination mechanism and equilibrium selection technique well. To fit the WSC-TMG model, the modified Boltzmann exploration with the fictitious play process involved is given in Eq. 10, where β_t(s) = ln n_t(s) / C_t(s) is the exploration coefficient at time t for state s (the reason is given later); it controls the rate of exploration in the learning policy and must grow to infinity in the limit, but not too fast. WEQ(s, a) is the cumulative reward of joint action a in state s.

\Pr(a \mid s, t, Q) = \frac{e^{\beta_t(s)\, WEQ(s, a)}}{\sum_{b \in A} e^{\beta_t(s)\, WEQ(s, b)}}    (10)
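A rough sketch of Eqs. 9 and 10 follows. The opponent model is passed in as a plain probability dictionary, C_t(s) is taken as a fixed constant, and all identifiers are assumptions made for illustration.

```python
import math
import random


def weq(q_row, opponent_model):
    """Eq. 9 (sketch): expectation of Q over the opponents' predicted reduced joint
    actions; q_row maps reduced joint actions to Q_{i,j}(s), opponent_model maps the
    same keys to estimated probabilities (assumed to sum to 1)."""
    return sum(opponent_model.get(j, 0.0) * q for j, q in q_row.items())


def boltzmann_choice(weq_values, visits):
    """Eq. 10 (sketch): exploration with coefficient beta_t(s) = ln(n_t(s)) / C_t(s);
    here C_t(s) is taken as a fixed constant (10) purely for illustration."""
    beta = math.log(max(visits, 2)) / 10.0
    weights = {a: math.exp(beta * v) for a, v in weq_values.items()}
    total = sum(weights.values())
    r, acc = random.random() * total, 0.0
    for action, w in weights.items():
        acc += w
        if acc >= r:
            return action


values = {"Airfare": 0.9, "LuxuryCost": 0.4}
print(boltzmann_choice(values, visits=50))  # stochastic, biased toward 'Airfare'
```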


The Q-values are updated after each execution of a joint action using Eq. 5. Note that the learning rate α, which controls the speed of learning, needs to strike a balance: a very fast learning rate may lead to a local optimum. The positive constant Const therefore moderates the effect of the growing visit count n_t(s) (the number of times state s has been visited by time t), so that the convergence condition on the learning rate for Q-learning [13] is satisfied.2 Finally, we need to check whether the possible terminal state reached is a true terminal state of this episode. By recording all the state nodes passed by the agents in the team, we can traverse backward from the possible terminal state to the initial state to ensure an unambiguous termination of the episode.
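For reference, the decaying step size α = Const · β / (Const + n(s)) used above can be computed as in the tiny sketch below; the constants match the values discussed in the experimental section (Const = 3000, β = 0.6), and the function name is an assumption.

```python
def learning_rate(const, beta, visits):
    """Decaying step size alpha = Const * beta / (Const + n(s)) used in Algorithm 1."""
    return const * beta / (const + visits)


for n in (0, 1000, 10000):
    print(n, round(learning_rate(3000, 0.6, n), 3))  # 0.6, 0.45, 0.138
```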

IV. E XPERIMENTAL A NALYSIS Simulation study is conducted to validate the feasibility and effectiveness of the W SC − T M G model. The PC configuration is as follows: Intel(R) Core(TM) i3-2120 3.30GHz with 4GB RAM. The input W SC − M DP transition graph is produced randomly, which may be a complex network structure. The number of state nodes in a W SC − M DP graph ranges from 100 to 400, services that can be invoked by each state varies from 1,000 to 4,000, and agents grow from 4 to 16. We mainly take four QoS attributes into consideration which are ResponseT ime, T hroughput, Availability, Reliability from QWS Dataset3 . Then we expand the dataset to satisfy randomly assignment for each service in W SC − M DP transition graph one QoS data record. Experiment parameters are set as follows without special announcement, the discount factor γ of the Q-learner was 0.9, the value  of Greedy policy is 0.6. We first verify the selection of Const related with learning rate in multi-agent Q-learning, then prove the effectiveness and adaptability of the Algorithm 1, finally demonstrate the scalability by comparing different number of agent, state and service. A. Effectiveness and Adaptability Before validating the correctness and effectiveness of the algorithm, we should determine the fixed constant Const in the first place. Here, we fixed the number of simulated W SC − T M G state to 100, services for each state is 1,000 Const and agents to 4. Our learning rate is α = Const+n(s) β, which may influence convergence to the optimality. According to our previous work [10], a higher learning rate can accelerate the learning process, whereas a smaller learning rate is helpful to avoid local optimality. Consequently, 0.6 is a compromise choice for learning rate. Then, we set β=0.6, Const the initial value of α = Const+n(s) β ≈ 0.6 (n(s) ≥ 0). When we vary the value of Const from 1,000 to 5,000,

(10)

b∈A

D. Algorithm In Algorithm 1, we should first determine Candidate Initial Service s, then initialize the reward matrix, the cumulative reward matrix and some related value. An episode is a learning process which move from initial state to the terminal state. How to choose the actions is up to learning policy. One episode ends only if reaching the terminal state of W SC − T M G model. During one episode, the Q-value

2 Due to limited space, the convergence of the algorithm will be proved in our corresponding journal paper.
3 The meaning of each attribute is described at http://www.uoguelph.ca/~qmahmoud/qws/


Figure 3. Discounted cumulative reward vs. episodes: (a) Const selection; (b) comparative effectiveness; (c) adaptive testing.

Figure 4. Scalability results: (a) different numbers of agents (discounted cumulative reward vs. episodes); (b) different numbers of states (cumulative time in minutes vs. state number); (c) different numbers of services (discounted cumulative reward vs. episodes).

the results in Fig. 3(a) show that Const = 3000 offers a good trade-off between learning speed and learning quality: a larger Const yields a comparatively faster learning rate but a higher risk of being trapped in a local optimum, while a smaller Const learns more slowly but achieves a higher cumulative reward thanks to fuller exploration by the agents.


Fig. 3(b) demonstrates the correctness and effectiveness of the multi-agent Q-learning (abbr. Multi-Q) algorithm. Specifically, we fixed the learning rate of general Q-learning (abbr. General-Q) to 0.6, and for Multi-Q we set Const to 3,000 and β to 0.6. Before episode 2500, General-Q performs better, but Multi-Q then converges and achieves higher cumulative rewards. This is not surprising: the General-Q learner follows the ε-greedy policy throughout, whereas the multi-agent Q-learner's Boltzmann exploration resembles random selection at the beginning. After further learning, the cooperation of the 4 agents and the advantages of the Boltzmann policy overcome General-Q's early merits. We also ran an adaptability test to see how changes in the environment affect the algorithm. The basic settings are the same as before, and we randomly change the QoS of 5% of the services at a fixed period. As Fig. 3(c) shows, the changes do not prevent final convergence; they only postpone it. However, the figure clearly displays the difference between the two methods: Multi-Q adapts faster to the dynamic environment compared

with General-Q, converging sooner and retaining higher cumulative rewards thanks to the power of teamwork.

B. Scalability
Fig. 4 highlights the scalability of the algorithm. First, when we vary the number of agents from 4 to 16 in Fig. 4(a), the convergence speed and quality change correspondingly. Multiple agents collaborate toward the common goal and achieve significantly improved efficiency. However, more agents are not always better: 12 agents achieve the best result in our setting. Although more agents allow more adequate exploration of the space, the communication overhead between agents also affects the result; that is, more agents mean more elapsed time. To assess the time consumption of Algorithm 1, we assume that the execution cost of each service is 1 ms and the response time is 0.1 ms, and we report the total time accumulated until the algorithm converges. Fig. 4(b) shows that more state nodes lead to longer overall composition time; moreover, Multi-Q performs much better than General-Q, with a relatively small increase and lower time consumption, which further corroborates the adaptability of Multi-Q. Fig. 4(c) demonstrates that an increase in service volume affects both the convergence time and the cumulative reward. Generally, increasing the volume of services raises the time consumption but does not necessarily bring a matching improvement


[5] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. Cambridge Univ Press, 1998, vol. 1, no. 1.

of service quality. In a word, large service volume is bound to delay the convergence.

[6] W. Xu, J. Cao, H. Zhao, and L. Wang, “A multi-agent learning model for service composition,” in Services Computing Conference (APSCC), 2012 IEEE Asia-Pacific. IEEE, 2012, pp. 70–75.

V. CONCLUSION AND FUTURE WORK
In this paper, we propose a novel framework for service composition with MARL. The agents involved in the composition team work toward the same goal and share a common payoff. To ensure the convergence of the multi-agent algorithm tailored to the WSC-TMG model, we introduce the fictitious play process, which guarantees a unique equilibrium during equilibrium selection, and incorporate it into the Boltzmann learning policy. Our experiments demonstrate the feasibility and effectiveness of the approach. Nevertheless, several drawbacks remain. First, the agents in our framework could be assigned to fixed regions to avoid wandering over the whole transition graph, which would reduce redundant effort. Second, we assume that each candidate service already meets its fundamental QoS requirements and do not consider global QoS constraints. Third, the fictitious play process presumes that each agent has access to global knowledge, which may be a salient limitation. Fourth, our notion of adaptability only covers variation in services' non-functional properties, whereas candidate services may deteriorate and leave the composition, or new services may join. In the future, we will devote more effort to optimizing this framework, or even to replacing it with a new solution. We can borrow the task decomposition ideas of hierarchical reinforcement learning to make the agents more independent and even capable of their own decision making. Moreover, how to further safeguard global QoS constraints remains challenging, and it is an open question whether a novel method could guarantee a unique coordination equilibrium under fewer restrictions. We could also introduce prediction and repair mechanisms to promptly replace services that no longer meet the requirements. The road ahead is long, and we will keep working on it.

[7] I. J. Jureta, S. Faulkner, Y. Achbany, and M. Saerens, “Dynamic web service composition within a service-oriented architecture,” in IEEE International Conference on Web Services, 2007. ICWS 2007. IEEE, 2007, pp. 304–311. [8] H. Tong, J. Cao, S. Zhang, and M. Li, “A distributed algorithm for web service composition based on service agent model,” IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 12, pp. 2008–2021, 2011. [9] H. Wang and X. Guo, “Preference-aware web service composition using hierarchical reinforcement learning,” in IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technologies, 2009. WIIAT’09., vol. 3. IET, 2009, pp. 315–318. [10] H. Wang, X. Zhou, X. Zhou, W. Liu, W. Li, and A. Bouguettaya, “Adaptive service composition based on reinforcement learning,” in Service-Oriented Computing. Springer, 2010, pp. 92–107. [11] M. L. Littman, “Markov games as a framework for multiagent reinforcement learning.” in ICML, vol. 94, 1994, pp. 157–163. [12] J. Hu and M. P. Wellman, “Multiagent reinforcement learning: Theoretical framework and an algorithm,” in Proceedings of the Fifteenth International Conference on Machine Learning, ser. ICML ’98. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1998, pp. 242–250. [13] C. Claus and C. Boutilier, “The dynamics of reinforcement learning in cooperative multiagent systems,” in Proceedings of the Fifteenth National/Tenth Conference on Artificial Intelligence/Innovative Applications of Artificial Intelligence, ser. AAAI ’98/IAAI ’98. Menlo Park, CA, USA: American Association for Artificial Intelligence, 1998, pp. 746–752. [14] X. Wang and T. Sandholm, “Reinforcement learning to play an optimal nash equilibrium in team markov games,” in Advances in Neural Information Processing Systems. MIT Press, 2002, pp. 1571–1578.

REFERENCES
[1] D. Ardagna and B. Pernici, “Adaptive service composition in flexible processes,” IEEE Transactions on Software Engineering, vol. 33, no. 6, pp. 369–384, 2007.

[15] M. L. Littman, “Value-function reinforcement learning in markov games,” Cognitive Systems Research, vol. 2, no. 1, pp. 55–66, 2001.

[2] Y. Yan, P. Poizat, and L. Zhao, “Self-adaptive service composition through graphplan repair,” in Web Services (ICWS), 2010 IEEE International Conference on, 2010, pp. 624–627.

[16] J. Nash, “Non-cooperative games,” Annals of Mathematics, vol. 54, no. 2, pp. 286–295, 1951. [17] D. Monderer and L. S. Shapley, “Fictitious play property for games with identical interests,” Journal of Economic Theory, vol. 68, no. 1, pp. 258–265, 1996.

[3] H. Wang and P. Tang, “Preference-aware web service composition by reinforcement learning,” in 20th IEEE International Conference on Tools with Artificial Intelligence, 2008. ICTAI’08., vol. 2. IEEE, 2008, pp. 379–386.

[18] H. P. Young, “The evolution of conventions,” Econometrica, vol. 61, no. 1, pp. 57–84, 1993.

[4] H. Wang and X. Wang, “A novel approach to large-scale services composition,” in Web Technologies and Applications. Springer, 2013, pp. 220–227.

[19] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: A survey,” arXiv preprint cs/9605103, 1996.
