A New QoS Provisioning Method for Adaptive Multimedia in Cellular Wireless Networks

Fei Yu, Vincent W.S. Wong and Victor C.M. Leung
Department of Electrical and Computer Engineering
The University of British Columbia
2356 Main Mall, Vancouver, BC, Canada V6T 1Z4
E-Mail: {feiy, vincentw, vleung}@ece.ubc.ca

Abstract – Third generation cellular wireless networks are designed to support adaptive multimedia by controlling individual ongoing flows to increase or decrease their bandwidth in response to changes in traffic load. There is growing interest in quality of service (QoS) provisioning under this adaptive multimedia framework, in which a bandwidth adaptation algorithm needs to be used in conjunction with the call admission control algorithm. This paper presents a novel method for QoS provisioning via the use of average reward reinforcement learning, which can maximize the network revenue subject to several predetermined QoS constraints. By considering handoff dropping probability, average allocated bandwidth and intra-class fairness simultaneously, our algorithm formulation guarantees that these QoS parameters are kept within predetermined constraints. Unlike other model-based algorithms, our scheme does not require explicit state transition probabilities and therefore the assumptions behind the underlying system model are more realistic than those in previous schemes. Moreover, by considering the status of neighboring cells, the proposed scheme can dynamically adapt to changes in traffic condition. Simulation results demonstrate the effectiveness of the proposed approach in adaptive multimedia cellular networks.

Keywords – QoS; adaptive multimedia; cellular wireless networks; mathematical programming/optimization
I. INTRODUCTION With the growing demand for bandwidth-intensive multimedia applications (e.g., video) in cellular wireless networks, quality of service (QoS) provisioning is becoming more and more important. An efficient call admission control (CAC) scheme is crucial to guarantee the QoS and to maximize the network revenue simultaneously. Most of the CAC strategies proposed in the literature only consider nonadaptive traffic and non-adaptive networks [1], [2]. However, in recent years, the scarcity and large fluctuations of link bandwidth in wireless networks have motivated the development of adaptive multimedia applications where the bandwidth of a connection can be dynamically adjusted to adapt to the highly variable communication environments. Examples of adaptive multimedia traffic include Motion Picture Experts Group (MPEG) - 4 [3] and H.263+ [4] coding for audiovisual contents, which are expected to be used extensively in future cellular wireless networks. Accordingly, advanced cellular networks are designed to provide flexible radio resource allocation capabilities that can efficiently
support adaptive multimedia traffic. For example, the third generation (3G) universal mobile telecommunications services (UMTS) can reconfigure the bandwidth of ongoing calls [5]. Under this adaptive multimedia framework, a bandwidth adaptation (BA) algorithm needs to be used in conjunction with the CAC algorithm for QoS provisioning. CAC decides the admission or rejection of new and handoff calls, whereas BA reallocates the bandwidth of ongoing calls. Recently, QoS provisioning for adaptive multimedia services in cellular wireless networks has been a very active area of research [6]-[12], [14], [15]. Channel sub-rating scheme for telephony services is proposed in [6]. In [7], an analytical model is derived for one class of adaptive service. The extension of these schemes designed for one traffic class to the case of multiple traffic classes in real cellular wireless networks may not be an easy task. Talukdar et al. [8] study the trade-offs between network overload and fairness in bandwidth adaptation for multiple classes of adaptive multimedia. A near optimal scheme is proposed in [9]. Markov decision process formulation and linear programming are used in [10]. Degradation ratio and degradation ratio degree are considered in [11]. Authors in [12] use simulated annealing algorithm to find the optimal call-mix selection. The shortcomings of [6]-[12] are that only the status of the local cell is considered in QoS provisioning. However, due to increasing handoffs between cells that are shrinking in size, the status of neighboring cells has an increased influence on the QoS of the local cell in future multimedia cellular wireless networks [13], and therefore, information on neighboring cell traffic is very important for the effectiveness of QoS provisioning methods that can adapt to changes in the traffic pattern [2]. Authors in [14], [15] make fine attempts to consider the status information of neighboring cells. However, only one class of traffic is studied and they do not consider maximizing network revenue. This paper introduces a novel average reward reinforcement learning (RL) approach to solve the QoS provisioning problem for adaptive multimedia in cellular wireless networks, which aims to maximize the network revenue while satisfying several predetermined QoS constraints. The novelties of the proposed scheme are as follows: 1) The proposed scheme takes into account the effects of the status of neighboring cells with multiple classes of traffic,
enabling it to dynamically adapt to changes in the traffic condition. 2) The underlying assumptions of the proposed scheme are more realistic than those in previous schemes. Particularly, the scheme does not need prior knowledge of system state transition probabilities, which are very difficult to estimate in practice due to irregular network topology, different propagation environment and random user mobility. 3) The algorithm can control the adaptation frequency effectively by accounting for the cost of bandwidth adaptation in the model. It is observed in [7], [8] that frequent bandwidth switching among different levels may consume a lot of resources and may be even worse than a large degradation ratio. The proposed scheme can control the adaptation frequency more effectively than previous schemes. 4) Handoff dropping probability, average allocated bandwidth and intra-class fairness are considered simultaneously as QoS constraints in our scheme and can be guaranteed. 5) Trading off action space with state space is proposed in our scheme. As mentioned in [10], the large action space problem may hinder the deployment of this scheme in real systems. With the approach of trading off action space with state space, the large action space problem in QoS provisioning can be solved. Recently, RL has been used to solve CAC and routing problems in wireline networks [16], [17] and channel allocation problem in wireless networks [18], [19]. This paper focuses on applications of RL to solve the QoS provisioning problem in adaptive cellular wireless networks. We compare our scheme with two existing non-adaptive and adaptive QoS provisioning schemes for adaptive multimedia in cellular wireless networks. Extensive simulation results show that the proposed scheme outperforms the others by maximizing the network revenue while satisfying the QoS constraints. The rest of this paper is organized as follows. Section II describes the QoS provisioning problems in the adaptive framework. Section III describes the average reward RL algorithm. Our new approach to solve the QoS provisioning problem is presented in Section IV. Section V discusses some implementation issues. Section VI presents and discusses the simulation results. Finally, we conclude this study in Section VII. II. QOS PROVISIONING IN ADAPTIVE FRAMEWORK A. Adaptive Multimedia Applications In adaptive multimedia applications, a multimedia connection or stream can dynamically change its bandwidth requirement throughout its lifetime. For example, using the layered coding technique, a raw video sequence can be compressed into three layers [20]: a base layer and two
enhancement layers. The base layer can be independently decoded to provide basic video quality, whereas the enhancement layers can only be decoded together with the base layer to further refine the quality of the base layer. Therefore, a video stream compressed into three layers can adapt to three levels of bandwidth usage.

B. Adaptive Cellular Wireless Networks
Due to the severe fluctuation of resources in wireless links, the ability to adapt to the communication environment is very important in future cellular wireless networks. For example, in UMTS systems, a radio bearer established for a call can be dynamically reconfigured during the call session [5]. Fig. 1 shows the signalling procedure between a user terminal (UE) and the serving universal terrestrial radio access network (UTRAN) in radio bearer reconfiguration. The radio bearer in UMTS includes most of the layer 2 and layer 1 protocol information for that call. By reconfiguring the radio bearer, the bandwidth of a call can be changed dynamically.

Fig. 1. Radio bearer reconfiguration in UMTS [5]

C. QoS Provisioning Functions and Constraints
We consider two important functions for QoS provisioning, CAC and BA, in this paper. The problem of QoS provisioning in an adaptive multimedia framework is to determine CAC and BA policies to maximize the long-term network revenue and guarantee QoS constraints. To reduce network signalling overhead, we assume that BA is invoked only when a call arrival or departure occurs. That is, BA will not be used when congestion occurs briefly due to channel fading. Low-level mechanisms such as error correction coding and efficient packet scheduling are usually used to handle brief throughput variations of wireless links. Smaller cells (micro/pico-cells) will be employed in future cellular wireless networks to increase capacity. Therefore, the number of handoffs during a call's lifetime is likely to be increased and the status of neighbouring cells has an increased influence on the QoS of the local cell. In order to adapt to changes in traffic pattern, the status information of neighbouring cells should be considered in QoS provisioning.

We consider three QoS constraints in this paper. Since forced call terminations due to handoff dropping are generally more objectionable than new call blocking, an important call-level QoS constraint in cellular wireless networks is P_hd, the probability of handoff dropping. As it is impractical to eliminate handoff call dropping completely, the best one can do is to keep P_hd below a target level. In addition, although adaptive applications can tolerate decreased bandwidth, it is desirable for some applications to have a bound on the
average allocated bandwidth. Therefore, we need another QoS parameter to quantify the average bandwidth received by a call. The normalized average allocated bandwidth of class i calls, denoted as AB_i, is the ratio of the average bandwidth received by class i calls to the bandwidth with un-degraded service. In order to guarantee the QoS of adaptive multimedia, AB_i should be kept above a target value. Finally, due to bandwidth adaptation, some calls may operate at very high bandwidth levels, whereas some calls within the same class may operate at very low bandwidth levels. This is undesirable from the users' perspective. Therefore, the QoS provisioning scheme should be fair to all calls within one class, and intra-class fairness is defined as another QoS constraint in this paper. These constraints will be formulated in Section IV.

We formulate the QoS provisioning problem as a semi-Markov decision process (SMDP) [21]. There are several well-known algorithms, such as policy iteration, value iteration and linear programming [21], that find the optimal solution of an SMDP. However, these traditional model-based solutions to SMDPs require prior knowledge of state transition probabilities and hence suffer from two "curses": the curse of dimensionality and the curse of modeling. The curse of dimensionality is that the complexity in these algorithms increases exponentially as the number of states increases. QoS provisioning involves a very large state space that makes model-based solutions infeasible. The curse of modeling is that in order to apply model-based methods, it is first necessary to express state transition probabilities explicitly. This is in practice a very difficult proposition for cellular wireless networks due to the irregular network topology, different propagation environments and random user mobility.

D. Average Reward Reinforcement Learning
In recent years, RL has become a topic of intensive research as an alternative approach to solve SMDPs. This method has two distinct advantages over model-based methods. The first is that it can handle problems with complex transitions. Secondly, RL can integrate within it various function approximation methods (e.g., neural networks), which can be used to approximate the value function over a large state space.

Most of the published research in RL is focused on the discounted sum of rewards as the optimality metric. Q-learning [22] is one of the most popular discounted reward RL algorithms. These techniques, however, cannot extend automatically to the average reward criterion. In QoS provisioning problems, performance measures may not suitably be described in economic terms, and hence it may be preferable to compare policies based on the time-averaged expected reward rather than the expected total discounted reward. Discounted RL methods can lead to sub-optimal behavior and may converge much more slowly than average reward RL methods [23]. An algorithm for average reward RL called SMART (Semi-Markov Average Reward Technique) [24]-[26] has emerged recently. The convergence analysis of this algorithm is given in [25] and it has been successfully applied to production inventory [24] and airline seat allocation [26]
problems. We use this average reward RL method to solve the QoS provisioning problem for adaptive wireless multimedia in this paper. III. SOLVING AVERAGE REWARD SMDP BY RL In this section, we describe the average reward SMDP. The optimality equation is introduced. We then describe the reinforcement learning approach. A. Average Reward Semi-Markov Decision Process For an SMDP, let S be a finite set of states and A be a set of possible actions. In state s ∈ S , when an action a ∈ A is chosen, a lump sum reward of k ( s, a) is received. Further accrual of reward occurs at a rate c( s ′, s, a) , s′ ∈ S , for the time the process remains in state s ′ between the decision epochs. The expected reward between two decision epochs, given that the system is in state s, and a is chosen at the first decision epoch, may be expressed as:
$$r(s, a) = k(s, a) + E\left[ \int_0^{\tau} c(W_t, s, a)\, dt \right], \qquad (1)$$

where $\tau$ is the transition time to the second decision epoch, $W_t$ denotes the state of the natural process and $E$ denotes the expectation. Starting from state s at time 0 and using a policy $\pi$, the average reward $g^{\pi}(s)$ can be given as:

$$g^{\pi}(s) = \lim_{N \to \infty} \frac{E_s^{\pi}\left\{ \sum_{n=0}^{N} \left[ k(s_n, a_n) + \int_{\sigma_n}^{\sigma_{n+1}} c(W_t, s_n, a_n)\, dt \right] \right\}}{E_s^{\pi}\left\{ \sum_{n=0}^{N} \tau_n \right\}}, \qquad (2)$$

where $\sigma_n$ represents the time of the (n+1)th decision epoch, $\tau_n = \sigma_{n+1} - \sigma_n$, and $E_s^{\pi}$ denotes the expectation with respect to policy $\pi$ and initial state s. The Bellman optimality equation for SMDPs [21] can be stated as follows.

THEOREM 1. For any finite unichain SMDP, there exists a scalar $g^*$ and a value function $R^*$ satisfying the system of equations

$$R^*(s) = \max_{a \in A} \left[ r(s, a) - g^* q(s, a) + \sum_{s' \in S} P_{ss'}(a) R^*(s') \right], \qquad (3)$$

where $q(s, a)$ is the average sojourn time in state s when action a is taken in it, and $P_{ss'}(a)$ is the probability of transition from state s to state $s'$ under action a. For a proof of Theorem 1, see Chapter 11 of [21].
B. Solution Using Reinforcement Learning
In the RL model depicted in Fig. 2, a learning agent selects an action for the system that leads the system along a unique path till another decision-making state is encountered. At this time, the system needs to consult with the learning agent for the next state. During a state transition, the agent gathers information about the new state, the immediate reward and the time spent during the state transition, based on which the agent updates its knowledge base using an algorithm and selects the next action. The process is repeated and the learning agent continues to improve its performance.

Fig. 2. A reinforcement learning model

Average reward RL uses an action value representation similar to that of its counterpart, Q-learning. The action value $R^{\pi}(s, a)$ represents the average adjusted value of choosing an action a in state s once, and then following policy $\pi$ subsequently [23]. Let $R^*(s, a)$ be the average adjusted value obtained by choosing actions optimally. The Bellman equation for average reward SMDPs (3) can be rewritten as:

$$R^*(s, a) = r(s, a) - g^* q(s, a) + \sum_{s' \in S} P_{ss'}(a) \max_{a' \in A} R^*(s', a'). \qquad (4)$$

The optimal policy is $\pi^*(s) = \arg\max_{a \in A} R^*(s, a)$. The average reward RL algorithm estimates action values on-line using a temporal difference method, and then uses them to define a policy.

The action value of the state-action pair (s, a) visited at the nth decision-making epoch is updated as follows. Assume that action a in state s results in a system transition to $s'$ at the subsequent decision epoch; then

$$R_{new}(s, a) = (1 - \alpha_n) R_{old}(s, a) + \alpha_n \left[ r_{act}(s', s, a) - \rho_n \tau_n + \max_{a' \in A} R_{old}(s', a') \right], \qquad (5)$$

where $\alpha_n$ is the learning rate parameter for updating the action value of a state-action pair at the nth decision epoch, and $r_{act}(s', s, a)$ is the actual cumulative reward earned between two successive decision epochs starting in state s (with action a) and ending in state $s'$. The reward rate, $\rho_n$, is calculated as:

$$\rho_n = (1 - \beta_{n-1}) \rho_{n-1} + \beta_{n-1} \frac{T(n-1)\, \rho_{n-1} + r_{act}(s', s, a)}{T(n)}, \qquad (6)$$

where $T(n)$ denotes the sum of the time spent in all states visited till the nth epoch, and $\beta_{n-1}$ is the learning rate parameter. If each action is executed in each state an infinite number of times on an infinite run and $\alpha_n$ and $\beta_n$ are decayed appropriately, the above learning algorithm will converge to optimality [25].
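To make the update concrete, the following Python sketch shows one possible tabular form of the SMART-style update in (5) and (6), with the Darken-Chang-Moody decay of Section IV-E folded in. The class name, the dictionary-based value store and the default constants are our own illustrative assumptions; the paper itself stores action values in a neural network (Section V).

import random
from collections import defaultdict

class SmartAgent:
    """Minimal tabular sketch of the average-reward (SMART) update in Eqs. (5)-(6)."""

    def __init__(self, alpha0=0.1, beta0=0.1, p0=0.1, theta_r=1e11):
        self.R = defaultdict(float)   # action values R(s, a), initialised to 0
        self.rho = 0.0                # current estimate of the average reward rate
        self.total_time = 0.0         # T(n): total sojourn time over visited states
        self.alpha0, self.beta0, self.p0, self.theta_r = alpha0, beta0, p0, theta_r

    def _decay(self, theta0, n):
        # Darken-Chang-Moody search-then-converge: Theta_n = Theta_0 / (1 + zeta_n),
        # with zeta_n = n^2 / (Theta_r + n)
        return theta0 / (1.0 + n * n / (self.theta_r + n))

    def choose_action(self, state, feasible_actions, n):
        # Exploration: with decaying probability p_n pick a random feasible action
        if random.random() < self._decay(self.p0, n):
            return random.choice(feasible_actions)
        return max(feasible_actions, key=lambda a: self.R[(state, a)])

    def update(self, s, a, s_next, feasible_next, r_act, tau, n, explored):
        alpha = self._decay(self.alpha0, n)
        best_next = max(self.R[(s_next, a2)] for a2 in feasible_next)
        # Eq. (5): temporal-difference update of the action value
        self.R[(s, a)] = (1 - alpha) * self.R[(s, a)] + alpha * (
            r_act - self.rho * tau + best_next)
        if not explored:
            # Eq. (6): update the average reward rate only on greedy (non-exploratory) steps
            beta = self._decay(self.beta0, n)
            new_total = self.total_time + tau
            self.rho = (1 - beta) * self.rho + beta * (
                (self.total_time * self.rho + r_act) / new_total)
            self.total_time = new_total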
IV. FORMULATION OF QOS PROVISIONING IN ADAPTIVE FRAMEWORK
In adaptive multimedia cellular wireless networks, we assume that call arrivals at a given cell, including new and handoff calls, follow a Poisson distribution. We further assume that each call needs a service time that is exponentially distributed and independent of the inter-arrival time distribution. The QoS provisioning problem for adaptive multimedia can be formulated as an SMDP. In order to utilize the average reward RL algorithm, it is necessary to identify the system state, actions, rewards, and constraints. The exploration scheme and the method to trade off action space with state space are also described in this section.

A. System States
Assume that there are K classes of services in the network. A class i call uses bandwidth among $\{b_{i1}, b_{i2}, \ldots, b_{ij}, \ldots, b_{iN_i}\}$, where $b_{ij} < b_{i(j+1)}$, $i = 1, 2, \ldots, K$, $j = 1, 2, \ldots, N_i$, and $N_i$ is the maximum bandwidth level which can be used by a class i call. At random times an event e can occur in a cell c (we assume that only one event can occur at any time instant), where e is either a new call arrival, a handoff call arrival, a call termination, or a call handoff to a neighboring cell. At this time, cell c is in a particular configuration x defined by the number of each type of ongoing calls in cell c. Let $x = (x_{11}, x_{12}, \ldots, x_{ij}, \ldots, x_{KN_K})$, where $x_{ij}$ denotes the number of ongoing calls of class i using bandwidth $b_{ij}$ in cell c for $1 \le i \le K$ and $1 \le j \le N_i$.

Since the status of neighboring cells is important for QoS provisioning, we also consider it in the state description. The status of neighboring cells y can be defined as the number of each type of ongoing calls in all neighboring cells of cell c. Let $y = (y_{11}, y_{12}, \ldots, y_{ij}, \ldots, y_{KN_K})$, where $y_{ij}$ denotes the number of ongoing calls of class i using bandwidth $b_{ij}$ in all neighboring cells of cell c. We assume that the status of neighboring cells is available in cell c by the exchange of status information between cells. Note that this assumption is common among dynamic QoS provisioning schemes [2]. The configurations and the event together determine the state, s = (x, y, e). We assume that each cell has a fixed channel capacity C and cell c has M neighboring cells. The state space is defined as:
$$S = \left\{ s = (x, y, e) : \sum_{i=1}^{K} \sum_{j=1}^{N_i} x_{ij} b_{ij} \le C; \ \sum_{i=1}^{K} \sum_{j=1}^{N_i} y_{ij} b_{ij} \le MC; \ 1 \le i \le K,\ 1 \le j \le N_i \right\}.$$
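As an illustration only, the Python sketch below shows one way a state s = (x, y, e) and the membership test for S could be represented. The class and helper names are hypothetical; the bandwidth levels are those of Table I, and the capacity C = 2000 kbps and M = 6 neighbours are taken from the simulation setting of Section VI.

from dataclasses import dataclass
from typing import Dict, Tuple

# Illustrative bandwidth levels b_ij (kbps), as in Table I
LEVELS = {1: (128, 192, 256), 2: (64, 96, 128)}

@dataclass
class State:
    x: Dict[Tuple[int, int], int]   # x[(i, j)]: class-i calls at level j in cell c
    y: Dict[Tuple[int, int], int]   # y[(i, j)]: same counts summed over all neighboring cells
    event: str                      # 'new', 'handoff_in', 'termination', or 'handoff_out'

def used_bandwidth(counts: Dict[Tuple[int, int], int]) -> int:
    # sum_i sum_j counts_ij * b_ij
    return sum(n * LEVELS[i][j - 1] for (i, j), n in counts.items())

def in_state_space(s: State, C: int = 2000, M: int = 6) -> bool:
    # Membership test for S: local load <= C and neighborhood load <= M*C
    return used_bandwidth(s.x) <= C and used_bandwidth(s.y) <= M * C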
B. Actions
When an event occurs, the agent must choose an action according to the state. An action can be denoted as $a = (a_a, a_d, a_u)$, where $a_a$ stands for the admission decision, i.e., admit ($a_a = 1$), reject ($a_a = 0$), or no action due to call departures ($a_a = -1$); $a_d$ stands for the action of bandwidth degradation when a call is accepted; and $a_u$ stands for the action of bandwidth upgrade when there is a departure (call termination or handoff to a neighboring cell) from cell c. $a_d$ has the form

$$a_d = \left\{ \left( d_{12}^1, \ldots, d_{ij}^n, \ldots, d_{KN_K}^{N_K-1} \right),\ 1 \le i \le K,\ 1 < j \le N_i,\ 1 \le n < j \right\},$$

where $d_{ij}^n$ denotes the number of ongoing class i calls using bandwidth $b_{ij}$ that are degraded to bandwidth $b_{in}$. $a_u$ has the form

$$a_u = \left\{ \left( u_{11}^2, \ldots, u_{ij}^n, \ldots, u_{KN_K-1}^{N_K} \right),\ 1 \le i \le K,\ 1 \le j < N_i,\ j < n \le N_i \right\},$$

where $u_{ij}^n$ denotes the number of ongoing class i calls using bandwidth $b_{ij}$ that are upgraded to bandwidth $b_{in}$. After the action of bandwidth degradation, configuration $(x_{11}, x_{12}, \ldots, x_{ij}, \ldots, x_{KN_K})$ becomes

$$\left( x_{11} + \sum_{m=2}^{N_1} d_{1m}^1,\ x_{12} + \sum_{m=3}^{N_1} d_{1m}^2 - d_{12}^1,\ \ldots,\ x_{ij} + \sum_{m=j+1}^{N_i} d_{im}^j - \sum_{m=1}^{j-1} d_{ij}^m,\ \ldots,\ x_{KN_K} - \sum_{m=1}^{N_K-1} d_{KN_K}^m \right).$$
Similarly, after the action of bandwidth upgrade, the configuration $(x_{11}, x_{12}, \ldots, x_{ij}, \ldots, x_{KN_K})$ becomes

$$\left( x_{11} - \sum_{m=2}^{N_1} u_{11}^m,\ x_{12} + u_{11}^2 - \sum_{m=3}^{N_1} u_{12}^m,\ \ldots,\ x_{ij} + \sum_{m=1}^{j-1} u_{im}^j - \sum_{m=j+1}^{N_i} u_{ij}^m,\ \ldots,\ x_{KN_K} + \sum_{m=1}^{N_K-1} u_{Km}^{N_K} \right).$$
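To make the bookkeeping explicit, here is a small Python sketch (with our own, hypothetical encoding of d and u as dictionaries keyed by (class, from-level, to-level)) that applies a degradation or upgrade action to a configuration x held as counts x[(i, j)]; it performs exactly the x_ij adjustments written out above.

from typing import Dict, Tuple

Config = Dict[Tuple[int, int], int]   # x[(i, j)] = number of class-i calls at level j

def apply_degradation(x: Config, d: Dict[Tuple[int, int, int], int]) -> Config:
    """d[(i, j, n)] with n < j: class-i calls degraded from level j down to level n."""
    new_x = dict(x)
    for (i, j, n), count in d.items():
        assert n < j and new_x.get((i, j), 0) >= count, "infeasible degradation"
        new_x[(i, j)] = new_x.get((i, j), 0) - count
        new_x[(i, n)] = new_x.get((i, n), 0) + count
    return new_x

def apply_upgrade(x: Config, u: Dict[Tuple[int, int, int], int]) -> Config:
    """u[(i, j, n)] with n > j: class-i calls upgraded from level j up to level n."""
    new_x = dict(x)
    for (i, j, n), count in u.items():
        assert n > j and new_x.get((i, j), 0) >= count, "infeasible upgrade"
        new_x[(i, j)] = new_x.get((i, j), 0) - count
        new_x[(i, n)] = new_x.get((i, n), 0) + count
    return new_x

# Example: degrade two class-1 calls from level 3 (256 kbps) to level 2 (192 kbps)
x = {(1, 3): 4, (1, 2): 1}
print(apply_degradation(x, {(1, 3, 2): 2}))   # {(1, 3): 2, (1, 2): 3}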
C. Rewards
Based on the action taken in a state, the network earns deterministic revenue due to the carried traffic in the cell. On the other hand, extra signaling overhead is required for bandwidth adaptation, which will consume radio and wireline bandwidth, as well as the battery power in the mobile. It is observed in [7], [8] that frequent bandwidth switching among different levels may consume a lot of resources and may be even worse than a large degradation ratio. Thus, there is a trade-off between the network resources utilized by the calls and the signaling and processing load incurred by the bandwidth adaptation operation. We use a function to model the cost due to the action of bandwidth adaptation. The definition of the cost function depends on the specific traffic, user terminal, and network architecture in real networks. One intuitive definition is that the cost is proportional to the number of bandwidth adaptation operations, which is used in this paper.
Let $r_{ij}$ be the reward rate of a class i call using bandwidth $b_{ij}$, $c_a$ be the cost of one bandwidth adaptation operation, and $N_a(a)$ be the total number of bandwidth adaptation operations in action a. The actual cumulative reward, $r_{act}(s', s, a)$, between two successive decision epochs starting in state s (with action a) and ending in state $s'$ can be calculated as:

$$r_{act}(s', s, a) = \tau_{act}(s', s, a) \sum_{i=1}^{K} \sum_{j=1}^{N_i} x'_{ij} r_{ij} - N_a(a)\, c_a,$$

where $\tau_{act}(s', s, a)$ is the actual sojourn time between the decision epochs and $x'_{ij}$ is the number of ongoing class i calls using bandwidth $b_{ij}$ after action a is taken.
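A hedged Python sketch of this reward computation follows; the function and argument names are our own, and the post-action configuration x' and the reward rates r_ij are assumed to be available as dictionaries keyed by (class, level).

from typing import Dict, Tuple

def actual_reward(x_next: Dict[Tuple[int, int], int],
                  reward_rate: Dict[Tuple[int, int], float],
                  sojourn_time: float,
                  num_adaptations: int,
                  adaptation_cost: float) -> float:
    # r_act(s', s, a) = tau_act * sum_ij x'_ij * r_ij  -  N_a(a) * c_a
    carried = sum(n * reward_rate[key] for key, n in x_next.items())
    return sojourn_time * carried - num_adaptations * adaptation_cost

# Example with the linear reward function r_ij = b_ij (kbps) and c_a = 30 as in Section VI:
rates = {(1, 1): 128.0, (1, 2): 192.0, (1, 3): 256.0}
print(actual_reward({(1, 3): 2, (1, 2): 1}, rates, sojourn_time=2.0,
                    num_adaptations=1, adaptation_cost=30.0))   # 2*(2*256 + 192) - 30 = 1378.0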
By formulating the cost of the bandwidth adaptation operation in the model, we can control the adaptation operation frequency effectively. Note that all ongoing calls in the cell, including those that have been degraded or upgraded, contribute to the reward $r_{act}(s', s, a)$. Therefore, we do not need an extra term to formulate the penalty related to the bandwidth degradation.

D. Constraints
For a general SMDP with L constraints, the optimal policy for at most L of the states is randomized [27]. Since L is much smaller than the total number of states in the QoS provisioning problem considered in this paper, the non-randomized stationary policy learned by RL is often a good approximation to the optimal policy [28]. To avoid the complications of randomization, we concentrate on non-randomized policies in this study.
As mentioned in Section II, the first QoS constraint is related to the handoff dropping probability. Let $P_{hd}(s)$ be the measured handoff dropping ratio and $TP_{hd}$ denote the target maximum allowed handoff dropping probability. The constraint associated with $P_{hd}$ can be formulated as:

$$\lim_{N\to\infty} \frac{\sum_{n=0}^{N} P_{hd}(s_n)\, \tau_n}{\sum_{n=0}^{N} \tau_n} \le TP_{hd}.$$
The Lagrange multiplier formulation relating the constrained optimization to an unconstrained optimization [29], [30] is used in this paper to deal with the handoff dropping constraint. To fit into this formulation, we need to include the history information in our state descriptor. The new state descriptor is $\tilde{s} = (N_{hr}, N_{hd}, \tau, s)$, where $N_{hr}$ and $N_{hd}$ are the total number of handoff call requests and handoff call drops, respectively, $\tau$ is the time interval between the last and the current decision epochs, and s is the original state
descriptor. In order to make the state space finite, quantized values of $P_{hd} = N_{hd}/N_{hr}$ and $\tau$ are used. A Lagrange multiplier $\omega$ is used for the parameterized reward

$$\tilde{r}_{act}(\tilde{s}', \tilde{s}, a) = r_{act}(\tilde{s}', \tilde{s}, a) - \omega\, z(\tilde{s}', \tilde{s}, a),$$

where $r_{act}(\tilde{s}', \tilde{s}, a)$ is the original reward function and $z(\tilde{s}', \tilde{s}, a) = P_{hd}(\tilde{s})\, \tau_{act}(\tilde{s}', \tilde{s}, a)$ is the cost function associated with the constraint. A nice monotonicity property associated with $\omega$ shown in [29] facilitates the search for a suitable $\omega$.

The second QoS constraint is related to $AB_i$, the normalized average allocated bandwidth of class i calls. Let $B_i$ denote the bandwidth allocated to class i calls; $AB_i$ can be defined as the mean of $B_i / b_{iN_i}$ over all class i calls in the current cell. Recall that $b_{iN_i}$ is the bandwidth of a class i call with un-degraded service.
$$AB_i = \frac{\sum_{j=1}^{N_i} x_{ij} b_{ij}}{b_{iN_i} \sum_{j=1}^{N_i} x_{ij}}, \qquad i = 1, \ldots, K.$$

$AB_i$ should be kept larger than the target value $TAB_i$:

$$AB_i \ge TAB_i, \qquad i = 1, \ldots, K.$$
The third QoS constraint is the intra-class fairness constraint, which can be defined in many ways. In this paper, we use the variance of $B_i / b_{iN_i}$ over all class i calls in the current cell, $VB_i$, to characterize the intra-class fairness:

$$VB_i = \mathrm{var}\left\{ B_i / b_{iN_i} \right\} = \frac{\sum_{j=1}^{N_i} x_{ij} \sum_{j=1}^{N_i} x_{ij} (b_{ij})^2 - \left( \sum_{j=1}^{N_i} x_{ij} b_{ij} \right)^2}{b_{iN_i}^2 \left( \sum_{j=1}^{N_i} x_{ij} \right)^2}, \qquad i = 1, \ldots, K.$$

$VB_i$ reflects the difference between the bandwidth of individual class i calls and the average bandwidth. For absolute fairness, $VB_i$ should be kept to zero all the time. However, this is very difficult to achieve in practice as bandwidth is adjusted in discrete steps. Therefore, it is better to keep $VB_i$ below a target value $TVB_i$:

$$VB_i \le TVB_i, \qquad i = 1, \ldots, K.$$

$AB_i$ and $VB_i$ are intrinsic properties of a state. With the current state and action information $(\tilde{s}, a)$, we can forecast $AB_i$ and $VB_i$ in the next state $\tilde{s}'$, i.e., $AB_i(\tilde{s}')$ and $VB_i(\tilde{s}')$. If $AB_i(\tilde{s}') \ge TAB_i$ and $VB_i(\tilde{s}') \le TVB_i$, $i = 1, \ldots, K$, the action is feasible. Otherwise, this action should be eliminated from the feasible action set $A(\tilde{s})$.
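The following Python sketch (helper names are ours) shows how AB_i and VB_i could be computed for a forecast configuration and screened against the targets; the bandwidth levels are those of Table I, and the default targets TAB_i = 0.7 and TVB_i = 0.03 are the values used in the simulations of Section VI.

from typing import Dict, Tuple

LEVELS = {1: (128, 192, 256), 2: (64, 96, 128)}   # b_ij in kbps (Table I)

def ab_vb(x: Dict[Tuple[int, int], int], i: int):
    """Normalized average bandwidth AB_i and variance VB_i of class i in configuration x."""
    counts = [(LEVELS[i][j - 1], n) for (cls, j), n in x.items() if cls == i]
    total = sum(n for _, n in counts)
    if total == 0:
        return 1.0, 0.0                     # no class-i calls: treat as undegraded and fair
    b_top = LEVELS[i][-1]                   # b_iN_i, the undegraded bandwidth
    mean = sum(b * n for b, n in counts) / (b_top * total)
    second = sum((b / b_top) ** 2 * n for b, n in counts) / total
    return mean, second - mean ** 2         # AB_i, and VB_i = E[(B/b)^2] - (E[B/b])^2

def feasible(x_next: Dict[Tuple[int, int], int], tab: float = 0.7, tvb: float = 0.03) -> bool:
    # An action is kept only if AB_i >= TAB_i and VB_i <= TVB_i for every class i
    return all(ab >= tab and vb <= tvb
               for ab, vb in (ab_vb(x_next, i) for i in LEVELS))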
E. Exploration
Each action should be executed in each state an infinite number of times to guarantee the convergence of an RL algorithm. This is called exploration [31]. Exploration plays an important role in ensuring that all the states of the underlying Markov chain are visited by the system and all the potentially beneficial actions in each state are tried out. Therefore, with a small probability $p_n$ upon the nth decision-making epoch, decisions other than that with the highest action value should be taken.

In this paper, we use the Darken-Chang-Moody search-then-converge procedure [32] to decay the learning rates $\alpha_n$, $\beta_n$ and the exploration rate $p_n$. In the following expression, $\Theta$ can be substituted by $\alpha$, $\beta$ and $p$ for learning and exploration, respectively. We use the equation $\Theta_n = \Theta_0 / (1 + \zeta_n)$, where $\zeta_n = n^2 / (\Theta_r + n)$, and $\Theta_0$ and $\Theta_r$ are constants.

F. Trading off Action Space Complexity with State Space Complexity
We can see that the action space in our formulation is quite large. In this paper, we propose a method to trade off action space complexity with state space complexity in the QoS provisioning scheme using a scheme described in [31]. The advantages of doing this are that the action space will be reduced and the extra state space complexity may still be dealt with by using the function approximation described in Section V.

Suppose that a call arrival event occurs in a cell with state $\tilde{s}$; the action that can be chosen from is

$$a = \left( a_a, d_{12}^1, \ldots, d_{ij}^n, \ldots, d_{KN_K}^{N_K-1} \right),$$

where there are at most $V = 1 + \sum_{i=1}^{K} \sum_{j=2}^{N_i} (j-1)$ components.

We can break down the action a into a sequence of V controls $a_a, d_{12}^1, \ldots, d_{ij}^n, \ldots, d_{KN_K}^{N_K-1}$, and introduce some artificial intermediate "states" $(\tilde{s}, a_a)$, $(\tilde{s}, a_a, d_{12}^1)$, ..., $(\tilde{s}, a_a, d_{12}^1, \ldots, d_{ij}^n, \ldots, d_{KN_K}^{N_K-1})$, and the corresponding transitions to model the effect of these actions. In this way, the action space is simplified at the expense of introducing V−1 additional layers of states and V−1 additional action values $R(\tilde{s}, a_a)$, $R(\tilde{s}, a_a, d_{12}^1)$, ..., $R(\tilde{s}, a_a, d_{12}^1, \ldots, d_{ij}^n, \ldots, d_{KN_K}^{N_K-2})$ in addition to $R(\tilde{s}, a_a, d_{12}^1, \ldots, d_{ij}^n, \ldots, d_{KN_K}^{N_K-1})$. Actually, we view the problem as a deterministic dynamic programming problem with V stages. For $v = 1, \ldots, V$, we can have a v-solution (a partial solution involving just v components) for the vth stage of the problem. The terminal state corresponds to the V-solution (a complete solution with V components). Moreover, instead of selecting the controls in a fixed order, it is possible to leave the order subject to choice.
In the reformulated problem, at any given state $\tilde{s} = (N_{hr}, N_{hd}, \tau, x, y, e)$ where e is a call arrival of class i, the control choices are:
1) Reject the call, in which case the configuration x does not evolve.
2) Admit the call and no bandwidth adaptation is needed, in which case the configuration x evolves to $(x_{11}, x_{12}, \ldots, x_{ij}, \ldots, x_{iN_i} + 1, \ldots, x_{KN_K})$.
3) Admit the call and bandwidth adaptation is needed. In this case, the problem can be divided into V stages. On the vth stage ($v = 1, \ldots, V$), one particular call type that has not been selected in previous stages, say the one using bandwidth $b_{ij}$ with $x_{ij} > 0$, can be selected, and there are the following options:
a) Degrade one call using bandwidth $b_{ij}$ one level, in which case the configuration x evolves to $(x_{11}, x_{12}, \ldots, x_{i(j-1)} + 1, x_{ij} - 1, \ldots, x_{iN_i} + 1, \ldots, x_{KN_K})$.
b) Degrade two calls using bandwidth $b_{ij}$ one level, in which case the configuration x evolves to $(x_{11}, x_{12}, \ldots, x_{i(j-1)} + 2, x_{ij} - 2, \ldots, x_{iN_i} + 1, \ldots, x_{KN_K})$.
c) Increase the number of calls being degraded until the call arrival can be accommodated. The number of options depends on the specific selected call type and the class of the call arrival.
A similar trade-off can be applied when a call departure event occurs.

V. ALGORITHM IMPLEMENTATION
A. Approximate Representation of Action Values
In practice, an important issue is how to store the action value R(s, a). An approximate representation should be used to break the curse of dimensionality in the face of very large state spaces. A neural network is an efficient method to represent the action values. A popular neural network architecture is the multi-layer perceptron (MLP) with a single hidden layer [31]. Under this architecture, the state-action pair (s, a) is encoded as a vector and transformed linearly through the input layer involving coefficients in this layer to give several scalars. Then, each of these scalars becomes the input to a sigmoidal function in the hidden layer. Finally, the outputs of the sigmoidal functions are linearly combined using coefficients, known as weights of the network, to produce the final output.
The network is trained in a supervised fashion using the back-propagation algorithm. This means that during training both network inputs and target outputs are used. An input pattern is applied to the network to generate an output, which is compared to the corresponding target output to produce an error that is propagated back through the network. The network weights are adjusted to minimize the sum of the squared errors.
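As a hedged illustration of this architecture, the sketch below uses NumPy to implement a single-hidden-layer MLP (sized 31-20-1, as in the simulation setting of Section VI) whose output is the action value and whose weights are adjusted by back-propagating a temporal-difference error such as R_new − R_old. The class name, learning rate and weight initialization are our own assumptions, not the paper's.

import numpy as np

class ActionValueMLP:
    """31 inputs -> 20 sigmoid hidden units -> 1 linear output (sizes as in Section VI)."""

    def __init__(self, n_in=31, n_hidden=20, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, n_hidden)
        self.b2 = 0.0
        self.lr = lr

    def value(self, sa_vec):
        # Forward pass: linear input layer, sigmoid hidden layer, linear output
        h = 1.0 / (1.0 + np.exp(-(self.W1 @ sa_vec + self.b1)))
        return float(self.W2 @ h + self.b2), h

    def train(self, sa_vec, target):
        # Back-propagate the error (target - output), e.g. R_new - R_old from the RL update
        out, h = self.value(sa_vec)
        err = target - out
        grad_h = err * self.W2 * h * (1.0 - h)
        self.W2 += self.lr * err * h
        self.b2 += self.lr * err
        self.W1 += self.lr * np.outer(grad_h, sa_vec)
        self.b1 += self.lr * grad_h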
Fig. 3. The structure of the QoS provisioning scheme

initialize iteration count n := 0, action value R(s, a) := 0, cumulative reward CR := 0, total time T := 0, reward rate ρ_0 := 0
while n < MAX_STEPS do
    calculate p_n, α_n, β_n using iteration count n
    with probability (1 − p_n), trade off action space with state space and choose an action a_n ∈ A that maximizes R(s_n, a_n); otherwise, choose a random (exploratory) action from A
    execute the chosen action
    wait for the next event e
    update R_new(s_n, a_n) = (1 − α_n) R_old(s_n, a_n) + α_n [ r_act(s_{n+1}, s_n, a_n) − ρ_n τ_act(s_{n+1}, s_n, a_n) + max_{a_{n+1} ∈ A} R_old(s_{n+1}, a_{n+1}) ]
    if the action value is stored in a neural network
        use R_new(s_n, a_n) − R_old(s_n, a_n) as the back-propagated error to learn the weight parameters of the neural network
    endif
    if an exploratory action was not chosen
        update CR = CR + r_act(s_{n+1}, s_n, a_n), T_{n+1} = T_n + τ_act(s_{n+1}, s_n, a_n), and
        ρ_{n+1} = (1 − β_n) ρ_n + β_n [ T_n ρ_n + r_act(s_{n+1}, s_n, a_n) ] / T_{n+1}
    endif
    update iteration count n = n + 1
endwhile

Fig. 4. A pseudo-code of the QoS provisioning scheme
B. Structure and Pseudo-code
The structure of the RL-based QoS provisioning scheme is shown in Fig. 3. When an event (either a call arrival or departure) occurs, a state s is identified by getting the status of the local and neighbouring cells. Then, a set of feasible actions {a} is found according to the state. The state and action information is fed into the neural network to get the action values. With probability 1 – p_n, the action with the largest action value is chosen. Otherwise, exploration is performed and an action is chosen randomly. When the next event occurs, the action value is updated and the process is repeated. A pseudo-code description of the proposed scheme is given in Fig. 4.
VI. SIMULATION RESULTS AND DISCUSSIONS A cellular network of 19 cells is used in our simulations, as shown in Fig. 5. To avoid the edge effect of the finite network size, wrap-around is applied to the edge cells so that each cell has six neighbours. Each cell has a fixed bandwidth
of 2 Mbps. Two classes of flows are considered (see Table I). Class 1 traffic has three different bandwidth levels, 128, 192 and 256 kbps. The three possible bandwidth levels of class 2 traffic are 64, 96 and 128 kbps. Two reward functions are used in simulations, as shown in Table I. Reward function 1 represents the scenario in which the reward generated by a call is a linear growing function with the bandwidth assigned to the call. Specifically, rij = bij . In reward function 2, a convex
function $r_{ij} = \left(b_{\max}^2 - (b_{ij} - b_{\max})^2\right) / b_{\max}$ is used, where $b_{\max}$ is the maximum bandwidth used by a call in the network. We assume that the highest possible bandwidth level is requested by the call arrival. That is, a call arrival of class 1 always requests 256 kbps and a call arrival of class 2 always requests 128 kbps. Then the network will make the CAC decision and decide which bandwidth level the call can use if it is admitted. 30% of the offered traffic is from class 1. Moreover, call holding time and cell residence time are assumed to follow exponential distributions with mean values of 180 seconds and 150 seconds, respectively. The probability of a user handing off to any adjacent cell is equally likely. The target maximum allowed handoff dropping probability, $TP_{hd}$, is 1% for both classes. Other QoS constraints are changed in the simulations for evaluation purposes.
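As a quick sanity check (a sketch, assuming b_max = 256 kbps), the convex function reproduces the Reward Function 2 column of Table I:

def convex_reward(b, b_max=256):
    # r_ij = (b_max^2 - (b_ij - b_max)^2) / b_max
    return (b_max ** 2 - (b - b_max) ** 2) / b_max

print([convex_reward(b) for b in (128, 192, 256)])   # [192.0, 240.0, 256.0]  (class 1)
print([convex_reward(b) for b in (64, 96, 128)])     # [112.0, 156.0, 192.0]  (class 2)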
The action values are learnt by running the simulation for 30 million steps with a constant new call arrival rate of 0.1 calls/second. The constants used in the Darken-Chang-Moody decaying scheme for the learning and exploration rates are chosen as $\alpha_0 = \beta_0 = p_0 = 0.1$ and $\alpha_r = \beta_r = p_r = 10^{11}$. A monotonicity property associated with $\omega$ is used to search for a suitable $\omega$, which is 157560 in the simulations. A multi-layer neural network is used in the approximate representation of action values, in which there are 31 input units representing the state and action, 20 hidden units with sigmoid functions, and one output unit representing the action value. The neural network is trained on-line using the back-propagation algorithm in conjunction with the reinforcement learning. The trained network is then used to make CAC and BA decisions with different call arrival rates. Two QoS provisioning schemes are used for comparisons: the guard channel (GC) scheme [1] for non-adaptive traffic and the ZCD02 scheme [12] for adaptive multimedia. 256 kbps is reserved for handoff calls in the GC scheme. In ZCD02, an optimal call mix selection scheme is derived using simulated annealing. The proposed scheme is called RL in the following. The linear reward function is used in all simulation experiments except those in Subsection VI.C, where the convex reward function is used.

A. Uniform Traffic
We first use a uniform traffic distribution in the simulations, where the traffic load is the same among all 19 cells. Call arrivals of both classes to each cell follow a Poisson process with mean $\lambda$.
Fig. 5. Cellular network configuration used in simulations

Table I. Experimental Parameters
Traffic Class   Bandwidth Level (kbps)   Reward Function 1   Reward Function 2
Class 1         b_11: 128                128                 192
Class 1         b_12: 192                192                 240
Class 1         b_13: 256                256                 256
Class 2         b_21: 64                 64                  112
Class 2         b_22: 96                 96                  156
Class 2         b_23: 128                128                 192

The average rewards of different schemes normalized by that of the GC scheme are shown in Fig. 6. Average allocated
bandwidth and intra-class fairness constraints are not considered here. We can see that RL and ZCD02 yield more rewards than GC. In the GC scheme, a call will be rejected if the free bandwidth available is not sufficient to satisfy the request. Both RL and ZCD02 have a bandwidth adaptation function and therefore can yield more reward than GC. In Fig. 6, the reward of the proposed scheme is similar to that in ZCD02, because both of them can maximize network revenue in QoS provisioning. We can also observe that at low traffic load, as the new call arrival rate increases, the gain becomes more significant. This is because the heavier the offered load, the more bandwidth adaptation is needed when the cell is not saturated. However, when the traffic is high and the cell is becoming saturated, the performance gain of RL and ZCD02 over GC is less significant. The cost of the adaptation operation is not considered, i.e., $c_a = 0$, in Fig. 6.

Fig. 7 shows the effects of $c_a$, the cost of the adaptation operation, when the new call arrival rate is 0.067 calls/second. The reward of ZCD02 drops quickly as $c_a$ increases, and even becomes less than that of GC when $c_a = 150$. In contrast, the reward drops slowly in RL. Since RL formulates $c_a$ in the reward function, it eliminates those actions requiring a large number of adaptation operations when $c_a$ is high by comparing the action values of different actions. Therefore, the proposed scheme can control the adaptation cost, and therefore the adaptation frequency, effectively. We use $c_a = 30$ in the following simulation experiments.

Fig. 8 shows that RL maintains an almost constant handoff dropping probability for a large range of new call arrival rates. In contrast, neither ZCD02 nor GC can enforce the QoS guarantee for the handoff dropping probability. We can reduce the handoff dropping probability in the GC scheme by increasing the number of guard channels, and in ZCD02 by increasing the "virtual gain function" of handoff calls. However, this will further reduce the reward earned in these two schemes. Fig. 9 and Fig. 10 show the new call blocking probabilities of class 1
and class 2 traffic, respectively. Both ZCD02 and RL have lower blocking probabilities compared with GC, because both of them have adaptation capability and can accept more new calls.

Fig. 11 shows the normalized allocated average bandwidths. $TAB_i = 0.7$ is considered here. We can observe that as the new call arrival rate increases, the average bandwidths of both classes in ZCD02 and RL decrease. This is the result of the bandwidth adaptation. For some applications, it may be desirable to have a bounded average allocated bandwidth. From Fig. 11, it is shown that the normalized allocated average bandwidth can be bounded by the target value in RL. In contrast, ZCD02 cannot guarantee this average bandwidth QoS constraint. The average bandwidth of GC is always 1, because no adaptation operation is done in GC. Note that the lowest possible normalized average bandwidth is 0.5 for both classes. This can be seen from Table I, where the lowest bandwidth level is half of the highest bandwidth level for both classes.

The normalized bandwidth variance, VB, an indicator of intra-class fairness, is shown in Fig. 12. We can see that RL can keep the bandwidth variance below the target value. Since the bandwidth in GC cannot be changed, the bandwidth variance is always 0 in GC. The achievement of higher QoS requirements comes at a cost to the system. The effects of different values of TAB and TVB on the average reward are shown in Fig. 13 and Fig. 14, respectively. We can see that a higher TAB, which is preferred from the users' point of view, will reduce the reward. Similarly, a lower TVB, which means higher intra-class fairness, will reduce the reward as well.

B. Non-Uniform Traffic
In the non-uniform traffic situation, the cells in the second ring, i.e., cells 2, 3, ..., 7 in Fig. 5, have 1.5 times the new call arrival rate of those cells in the outer ring, i.e., cells 8, 9, ..., 19. The central cell has 2 times the new call arrival rate of the cells in the outer ring. Since the method of predicting the handoff rate from neighboring cells is not given in ZCD02, a static predicted handoff rate is used in the revenue function, and we call it ZCD02-static. Fig. 15 shows that RL yields more reward than the ZCD02-static and GC schemes. The performance gain of RL over GC and the difference between RL and ZCD02-static are significant in the non-uniform traffic situation. This is because our RL method takes into account the status of neighbouring cells, and therefore it can dynamically adapt to different traffic patterns.

C. A Different Reward Function
A convex reward function $r_{ij} = \left(b_{\max}^2 - (b_{ij} - b_{\max})^2\right) / b_{\max}$ is used in this situation. The reward rate for each bandwidth level of each class is shown in Table I. The simulation results using the convex reward function show a similar pattern to those using the linear reward function, and therefore only one figure is provided here. Fig. 16 shows the average rewards of different QoS schemes with non-uniform traffic. We can see that Fig. 16 is similar to Fig. 15.
D. Computation Complexity
ZCD02 uses simulated annealing to find the optimal call-mix, in which a variable called temperature is decreased periodically by employing a monotone descending cooling function. We follow the example given in ZCD02, where 90 temperature steps are used and each step is repeated 100 times. In each of the 9000 steps, the revenue and the constraints are re-evaluated. In RL, since a neural network is used in the approximate representation, the major operations required to make the CAC and BA decisions come from retrieving action values and comparing these action values. We run the simulations with a fixed call arrival rate of 0.1 calls/second for 1000 call arrivals and departures, and calculate the average number of operations (additions, multiplications and comparisons) required to make one decision. The number of operations is about $1.8 \times 10^5$ in ZCD02 and $1 \times 10^4$ in RL. This shows that ZCD02 will be more expensive than RL in terms of computation resources in practice. However, training is needed for the RL approach, whereas ZCD02 and GC do not need any training.
VII. CONCLUSIONS
In this paper, we have proposed a new QoS provisioning method for adaptive multimedia in cellular wireless networks. The large number of states and the difficulty of estimating the state transition probabilities in practical systems motivate us to choose a model-free average reward reinforcement learning solution to solve this problem. By considering the status of neighboring cells, the proposed scheme can dynamically adapt to the changes in traffic condition. Three QoS constraints, namely handoff dropping probability, average allocated bandwidth and intra-class fairness, have been considered. Simulation results have been presented to show the effectiveness of the proposed scheme in adaptive multimedia cellular networks. Further study is in progress to reduce or eliminate the signaling overhead of exchanging status information by some feature extraction and local estimation functions. It is also very interesting to consider other average reward reinforcement learning algorithms [17], [33].

ACKNOWLEDGMENT
This work was supported by the Canadian Natural Sciences and Engineering Research Council through grants RGPIN 261604-03 and OGP0044286.

REFERENCES
[1] D. Hong and S. S. Rappaport, "Traffic model and performance analysis for cellular mobile radio telephone systems with prioritized and nonprioritized handoff procedures," IEEE Trans. Veh. Technol., vol. VT-35, pp. 77-92, Aug. 1986.
[2] S. Wu, K. Y. M. Wong, and B. Li, "A dynamic call admission policy with precision QoS guarantee using stochastic control for mobile wireless networks," IEEE/ACM Trans. Networking, vol. 10, no. 2, pp. 257-271, Apr. 2002.
[3] ISO/IEC 14496-2, "Information technology - coding of audio-visual objects: visual," Committee draft, Oct. 1997.
[4] ITU-T H.263, "Video coding for low bit rate communications," 1998.
[5] 3GPP, "RRC protocol specification," 3G TS 25.331 version 3.12.0, Sept. 2002.
[6] Y. B. Lin, A. Noerpel, and D. Harasty, "The sub-rating channel assignment strategy for PCS hand-offs," IEEE Trans. Veh. Technol., vol. 45, no. 1, pp. 122-130, 1996.
[7] C. Chou and K. G. Shin, "Analysis of combined adaptive bandwidth allocation and admission control in wireless networks," in Proc. IEEE INFOCOM'02, June 2002.
[8] A. K. Talukdar, B. R. Badrinath, and A. Acharya, "Rate adaptive schemes in networks with mobile hosts," in Proc. ACM/IEEE MOBICOM'98, Oct. 1998.
[9] T. Kwon, J. Choi, Y. Choi, and S. Das, "Near optimal bandwidth adaptation algorithm for adaptive multimedia services in wireless/mobile networks," in Proc. IEEE VTC'99-Fall, Sept. 1999.
[10] Y. Xiao, P. Chen, and Y. Wang, "Optimal admission control for multiclass of wireless adaptive multimedia services," IEICE Trans. Commun., vol. E84-B, no. 4, pp. 795-804, April 2001.
[11] Y. Xiao, C. L. P. Chen, and B. Wang, "Bandwidth degradation QoS provisioning for adaptive multimedia in wireless/mobile networks," Computer Commun., vol. 25, pp. 1153-1161, 2002.
[12] G. V. Zaruba, I. Chlamtac, and S. K. Das, "A prioritized real-time wireless call degradation framework for optimal call mix selection," Mobile Networks and Applications, vol. 7, pp. 143-151, April 2002.
[13] T. S. Rappaport, Wireless Communications: Principles and Practice. Englewood Cliffs, NJ: Prentice Hall, 1996.
[14] T. Kwon, Y. Choi, C. Bisdikian, and M. Naghshineh, "QoS provisioning in wireless/mobile multimedia networks using an adaptive framework," Wireless Networks, vol. 9, pp. 51-59, 2003.
[15] S. Ganguly and B. Nath, "QoS provisioning for adaptive services with degradation in cellular network," in Proc. IEEE WCNC'03, New Orleans, Louisiana, March 2003.
[16] H. Tong and T. X. Brown, "Adaptive call admission control under quality of service constraints: a reinforcement learning solution," IEEE J. Select. Areas Commun., vol. 18, no. 2, pp. 209-221, 2000.
[17] P. Marbach, O. Mihatsch, and J. N. Tsitsiklis, "Call admission control and routing in integrated services networks using neuro-dynamic programming," IEEE J. Select. Areas Commun., vol. 18, no. 2, pp. 197-208, 2000.
[18] S. P. Singh and D. P. Bertsekas, "Reinforcement learning for dynamic channel allocation in cellular telephone systems," in M. Mozer et al. (Eds.), Advances in NIPS 9, pp. 974-980, 1997.
[19] J. Nie and S. Haykin, "A Q-learning based dynamic channel assignment technique for mobile communication systems," IEEE Trans. Veh. Technol., vol. 48, no. 5, pp. 1676-1687, Sept. 1999.
[20] D. Wu, Y. T. Hou, and Y.-Q. Zhang, "Scalable video coding and transport over broadband wireless networks," Proc. of the IEEE, vol. 89, no. 1, pp. 6-20, Jan. 2001.
[21] M. L. Puterman, Markov Decision Processes, Wiley Interscience, New York, USA, 1994.
[22] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, pp. 279-292, 1992.
[23] S. Mahadevan, "Average reward reinforcement learning: Foundations, algorithms, and empirical results," Machine Learning, vol. 22, pp. 159-196, 1996.
[24] T. K. Das, A. Gosavi, S. Mahadevan, and N. Marchalleck, "Solving semi-Markov decision problems using average reward reinforcement learning," Management Science, vol. 45, no. 4, pp. 560-574, 1999.
[25] A. Gosavi, An Algorithm for Solving Semi-Markov Decision Problems Using Reinforcement Learning: Convergence Analysis and Numerical Results, Ph.D. Dissertation, University of South Florida, 1999.
[26] A. Gosavi, N. Bandla, and T. K. Das, "A reinforcement learning approach to airline seat allocation for multiple fare classes with overbooking," IIE Transactions on Operations Engineering, vol. 34, pp. 729-742, 2002.
[27] E. Altman, Constrained Markov Decision Processes, Chapman and Hall, London, 1999.
[28] Z. Gabor, Z. Kalmar, and C. Szepesvari, "Multi-criteria reinforcement learning," in Proc. Int'l Conf. Machine Learning, Madison, WI, Jul. 1998.
[29] F. J. Beutler and K. W. Ross, "Optimal policies for controlled Markov chains with a constraint," J. Math. Anal. Appl., vol. 112, pp. 236-252, 1985.
[30] F. J. Beutler and K. W. Ross, "Time-average optimal constrained semi-Markov decision processes," Adv. Appl. Prob., vol. 18, pp. 341-359, 1986.
[31] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, MA, 1996.
[32] C. Darken, J. Chang, and J. Moody, "Learning rate schedules for faster stochastic gradient search," in Proc. IEEE Workshop on Neural Networks for Signal Processing, Sept. 1992.
[33] J. Abounadi, D. Bertsekas, and V. S. Borkar, "Learning algorithms for Markov decision processes with average cost," SIAM J. Contr. Optimization, vol. 40, no. 3, pp. 681-698, 2001.
Fig. 6. Normalized average rewards
Fig. 7. Normalized average rewards vs. adaptation cost
Fig. 8. Handoff dropping probabilities
Fig. 9. New call blocking probabilities of class 1 calls
Fig. 10. New call blocking probabilities of class 2 calls
Fig. 11. Normalized average bandwidths
Fig. 12. Normalized bandwidth variances
Fig. 13. Normalized average rewards for different average bandwidth requirements
Fig. 14. Normalized average rewards for different bandwidth variance requirements
Fig. 15. Normalized average rewards with non-uniform traffic
Fig. 16. Normalized average rewards with convex reward function and non-uniform traffic