A Learning-based Network Selection Method in Heterogeneous Wireless Systems

Haleh Tabrizi, Golnaz Farhadi, and John Cioffi
Department of Electrical Engineering, Stanford University
Email: {htabrizi, gfarhadi, cioffi}@stanford.edu
Abstract—With the coexistence of various wireless technologies, next generation wireless communications will likely consist of an integrated system of networks in which Access Points (APs) and Base Stations (BSs) work together to maximize the mobile user's Quality of Service (QoS). In such a heterogeneous environment, where handheld devices with multiple access technologies are not uncommon, it should be possible to select networks and seamlessly switch from one AP/BS to another in order to improve user performance. In this paper, this type of network selection and handover mechanism with the goal of maximizing QoS is formulated as a Markov Decision Process (MDP). An algorithm based on Reinforcement Learning (RL) is then obtained that selects the best network based not only on the current network load but also on potential future network states. The algorithm aims at balancing the number of handovers against the achievable QoS. The results illustrate that the QoS performance of the proposed algorithm is comparable to that of the optimum opportunistic selection algorithm while requiring fewer network handovers on average. In addition, compared to existing predefined network selection strategies with no handover, the MDP-based algorithm offers significantly better QoS.
I. INTRODUCTION

The coexistence of several wireless access technologies, such as WiFi and WiMAX as broadband access networks and LTE and UMTS as cellular networks, gives rise to a heterogeneous wireless access environment. Such an environment gives mobile devices with sufficient access capabilities the opportunity to select among networks (Figure 1). Selecting networks intelligently plays an important role in improving system performance and QoS. Currently, if a mobile device is capable of connecting to all the available networks, it selects a single network based on a simple predefined strategy, which usually consists of choosing (if available) the WiFi network that provides the largest signal strength. Furthermore, the device stays connected to this network for the entire desired connection duration or until the QoS degrades so far that the connection is dropped. There are two main problems with this strategy. First, the WiFi network with the largest signal strength does not guarantee the best service, because it might be congested with many users. For example, a congested WiFi network at a hotspot might provide the largest signal strength (compared to neighboring WiFi APs) to a nearby user, but not the best possible QoS; the existing 3G network or a more distant neighboring WiFi network might serve that user better.
Fig. 1. Select which network?
Second, the QoS experienced by a mobile device during its connection to a wireless network is time-varying. Because of the dynamics of user arrivals and departures, the behavior of each new connection, and channel conditions, the state of the heterogeneous system evolves with time. Hence the network selected at one instant might not provide the best service for the entire duration of the connection. Therefore, a decision-making algorithm is required that allows the mobile device to select and hand over to different networks with the purpose of maximizing QoS. Such a selection strategy should depend not only on the system state at the time of selection, but also on the potential future states during the connection. This paper proposes a network selection algorithm based on reinforcement learning. Its goal is to maximize QoS while minimizing the number of handovers, in order to reduce communication overhead.

Significant work investigating the problem of network selection across heterogeneous wireless networks can be found in the literature. An analytic hierarchy process based on weighted evaluation factors is proposed in [1]. In [2], Grey relational analysis is studied for selecting the best network; in particular, multi-attribute decision-making algorithms are used jointly to assist the mobile device in selecting the most suitable network. An algorithm for dual-mode WiFi/WiMAX network selection is analyzed in [3], aiming to maximize throughput while maintaining a smooth connection. In [4], an adaptive centralized network selection algorithm based on current user and network context information is proposed. A median-based network selection considering different decision factors is discussed in [5]. None of these papers, however, considers the network dynamics. In [6], a centralized algorithm for dynamic selection among heterogeneous networks is examined; however, this algorithm is based on exhaustive search and considers a large number of users and optimization variables along with various radio access technologies.
In [7] and [8], distributed network access mechanisms based on reinforcement learning are developed. However, the number of parameters involved in the reward function can be large, and in some cases obtaining such attributes at the mobile users may not be feasible; solving the MDP-based problem in a distributed manner then becomes computationally cumbersome or impossible. Furthermore, in [9] a distributed MDP-based vertical handover decision algorithm is proposed that selects networks based on the available bandwidth and delay of each network. However, the variation of these parameters for a WLAN is obtained through simulations at the mobile device, which requires large storage and intensive computation. A central entity authorized to collect network state information has more powerful computing capability to learn the parameters and solve the network selection optimization problem. Thus, in this paper, reinforcement learning with a reward based on user throughput is applied at such an entity to generate a network selection policy.

For evaluation, two selection strategies, namely the predefined selection strategy and opportunistic selection, are compared with the proposed MDP-based algorithm in terms of the number of handovers and the target user's throughput. The predefined selection strategy, as explained above, selects the WiFi network that provides the largest signal strength, independent of the presence of other networks and the state of the system. The second method, the opportunistic selection strategy, selects the network that provides the best QoS at every instant. Lacking knowledge of future network states, the opportunistic method switches from one network to another whenever the conditions of the latter outperform those of the current network. Such a strategy results in a large number of handovers when the environment changes rapidly. The proposed MDP-based network selection algorithm, on the other hand, dynamically selects networks by balancing the number of required handovers against QoS. This performance gain is, however, achieved at the expense of a central controller learning the network statistics and providing the information to the users.

II. SYSTEM MODEL

The heterogeneous wireless model considered here consists of $N$ different networks that can include WiMAX, WiFi, LTE, and UMTS. Mobile users arrive at each network $n \in \{1, 2, \ldots, N\}$ according to a Poisson process of rate $\lambda_n$ users per minute and depart the network according to a Poisson process of rate $\nu_n$ users per minute. A mobile device with multiple transceivers is capable of connecting to all radio access networks and can hand over to a different network every $\tau$ units of time; it uses only one interface for communication during any time interval of length $\tau$. Time is slotted with slot duration $\tau$, and the times at the beginnings of slots are denoted $t_1, t_2, t_3, \ldots$. At time $t_1$ the target user requests a connection, and the problem is to determine the sequence of networks $n_1, n_2, n_3, \ldots$ that the target user should connect to at time instants $t_1, t_2, t_3, \ldots$.
Fig. 2. Markov decision process.
Furthermore, at every $t_k$, a reward equivalent to the QoS of the selected network $n_k$ is received. The goal is to exercise network selection in such a way that the long-term reward (defined below) is maximized. The decision about the selected network at each time $t_k$ is based on the current state, $s_k$, and the future states or learned statistics of the heterogeneous system. It is assumed that the central controller that manages handovers between networks is also capable of learning the network statistics.

III. DYNAMIC PROGRAMMING PROBLEM FORMULATION

The network selection problem in a heterogeneous wireless setting can be modeled as a discounted Markov Decision Process. The state of the system is described by the $(N+1)$-tuple $s = \{s(n),\ n = 1, \ldots, N+1\}$, where $s(n)$ for $n = 1, 2, \ldots, N$ denotes the number of users currently using network $n$, and $s(N+1)$ represents the network currently associated with the target user. Hence the state space $\mathcal{S}$ is given by

$$\mathcal{S} = \left\{ s \in \mathbb{Z}^{N+1} \;\middle|\; s(n) \le s_{\max}(n),\ n = 1, 2, \ldots, N;\ s(N+1) \in \{0, 1, 2, \ldots, N\} \right\} \quad (1)$$
where $s_{\max}(n)$ is the maximum number of users that network $n$ can tolerate. For the WiFi networks, this maximum is calculated as the largest number of connections for which the achieved throughput remains above a certain threshold. The last state element, $s(N+1)$, identifies which of the $N$ networks the target user is connected to in the current state; it takes the value 0 only at the start of a connection, when no network has been selected yet: $s_1(N+1) = 0$. The control action is denoted by $a$ and determines the network selected among the existing $N$. Let $\mathcal{A}$ denote the set of all possible actions:

$$\mathcal{A} = \{ a \mid a \in \{1, 2, \ldots, N\} \}. \quad (2)$$
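As a concrete illustration of (1) and (2), the short Python sketch below enumerates the state and action spaces for a small instance; the capacities in `s_max` are assumed values chosen for illustration, not parameters taken from the paper.

```python
import itertools

# Sketch: enumerate the state space S of (1) and action set A of (2).
N = 3                  # number of networks
s_max = [4, 4, 16]     # assumed per-network capacities s_max(n)

# A state is (s(1), ..., s(N), s(N+1)): per-network user counts plus the
# target user's current network; 0 means "not yet connected".
states = [counts + (assoc,)
          for counts in itertools.product(*(range(m + 1) for m in s_max))
          for assoc in range(N + 1)]
actions = list(range(1, N + 1))   # a in {1, ..., N}

print(len(states), "states,", len(actions), "actions")
```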
Figure 2 represents this decision process as an MDP. The process begins at state $s_1 \in \mathcal{S}$, and according to this state, some action $a_1 \in \mathcal{A}$ is selected. As a result of this action, the state of the MDP randomly transitions to a successor state $s_2$ drawn according to $s_2 \sim P_{s_1, a_1}$. At state $s_2$, another action $a_2$ is selected, the state transitions again according to $s_3 \sim P_{s_2, a_2}$, and so forth. Let $s_k$ be the state of the system in the interval $(t_k, t_{k+1}]$. In the network selection scenario, if the system is in state $s_k$, the next state $s_{k+1}$ is determined by the user arrival and departure rates, $\lambda_n$ and $\nu_n$, and by the action $a_k$ that the target user takes. Hence, given state $s_k \in \mathcal{S}$ and control action $a_k \in \mathcal{A}$, the next state is given by a stochastic function $f: \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ such that $s_{k+1} = f(s_k, a_k)$, to be defined shortly. The last element of the state vector, $s_{k+1}(N+1)$, which corresponds to the network associated with the target user, is directly determined by the previous action of the MDP, i.e., $s_{k+1}(N+1) = a_k$.
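This interaction loop is easy to express generically. The sketch below rolls out one trajectory given a policy, a transition sampler, and a reward function; all three callables are placeholders to be instantiated with the quantities defined in the remainder of this section.

```python
def rollout(s1, policy, sample_next, reward, D):
    """Simulate s_1, a_1, s_2, a_2, ... for D decision epochs.
    policy(s) -> a; sample_next(s, a) -> s' (i.e., s' ~ P_{s,a});
    reward(s, a) -> float. Returns the accumulated reward."""
    s, total = s1, 0.0
    for _ in range(D):
        a = policy(s)
        total += reward(s, a)
        s = sample_next(s, a)   # s_{k+1} ~ P_{s_k, a_k}
    return total
```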
Associated with each state $s_k$ and action $a_k$, $k = 1, 2, \ldots$, is a reward function $R(s_k, a_k)$ defined as follows:

$$R(s_k, a_k) = (1 - \alpha)\, Q(s_k \mid a_k = n)\, \mathbf{1}\{a_k \neq a_{k-1}\} + \alpha\, Q(s_k \mid a_k = n)\, \mathbf{1}\{a_k = a_{k-1}\} \quad (3)$$
where $\mathbf{1}\{\cdot\}$ is the indicator function and $Q(s_k \mid a_k = n)$ is the target user's QoS in state $s_k$ given that it selects network $n$. The initial action is defined as $a_0 = s_1(N+1) = 0$. The parameter $\alpha \in (0.5, 1]$ represents the relative importance of the second term on the right-hand side over the first, and its purpose is to reduce the number of handovers. With the two indicator functions, the reward equals exactly one of the two terms: if the current action $a_k$ equals the previous action $a_{k-1}$, the reward is given by the second term; otherwise it is given by the first term. Since $\alpha > 0.5$, the reward is higher when taking action $a_k$ does not result in a network handover.
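A minimal Python sketch of the handover-penalized reward in (3) follows; `q_value` stands for $Q(s_k \mid a_k = n)$, to be supplied by the QoS metric of Section III-B, and the default $\alpha$ is one of the values used later in the simulations.

```python
def reward(q_value, a_k, a_prev, alpha=0.75):
    """Reward (3): staying on the same network (a_k == a_prev) weights the
    QoS by alpha > 0.5; switching weights it by (1 - alpha), so handovers
    are discouraged."""
    return alpha * q_value if a_k == a_prev else (1.0 - alpha) * q_value
```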
With a stationary policy $\mu : \mathcal{S} \to \mathcal{A}$, the discounted reward-to-go function at state $s$ is given by

$$V_\mu(s) = \mathbb{E}\left[ \sum_{k=1}^{D} \gamma^k R(s_k, a_k) \;\middle|\; s_1 = s \right] \quad (4)$$

where $\gamma$ is the discount factor and $D$ is the length of the MDP. In the case of an infinite horizon MDP, $D = \infty$; in the case of a finite horizon MDP, $D$ is the number of time steps considered. In both cases, the reward-to-go function is the expected sum of rewards that can be obtained throughout the process when starting from state $s$. Equation (4) can be rewritten in the form of the Bellman equation as follows:

$$V(s) = \max_{a \in \mathcal{A}} \left[ R(s, a) + \gamma \sum_{s'} P_{sa}(s') V(s') \right] \quad (5)$$

where $\gamma$ is again the discount factor and $P_{sa}(s')$ is the probability of transitioning from state $s$ to $s'$ when action $a$ is taken. This transition probability is defined as

$$P_{sa}(s') = \prod_{n=1}^{N} e^{-(\lambda_n + \nu_n)\tau} \left( \frac{\lambda_n}{\nu_n} \right)^{K_n / 2} I_{|K_n|}\!\left( 2\tau \sqrt{\lambda_n \nu_n} \right) \quad (6)$$

where the parameter $K_n$ is the difference in the number of users of network $n$ between states $s$ and $s'$, and $I_K(z)$ is the modified Bessel function of the first kind. Equation (6) is the product of $N$ Skellam distributions; the Skellam distribution is the discrete probability distribution of the difference of two statistically independent Poisson random variables [10]. Each Skellam factor in the product represents the difference between the Poisson arrival and departure distributions of network $n$ and determines the change in the number of users in that network. The arrival and departure processes of each network are assumed to be independent.
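Each factor in (6) is a Skellam probability with means $\lambda_n \tau$ and $\nu_n \tau$, so it can be evaluated directly with `scipy.stats.skellam`. A minimal sketch follows, using the arrival and departure rates later assumed in Section IV; the deterministic update of the association element $s(N+1)$ to the chosen action is omitted here.

```python
import numpy as np
from scipy.stats import skellam

def transition_prob(s, s_next, lam, nu, tau):
    """P_{sa}(s') from (6): product over networks of Skellam pmfs of the
    change K_n in the user count of network n over one slot of length tau.
    s, s_next: per-network user counts s(1..N); lam, nu: rates per network.
    (The element s(N+1) transitions deterministically to the action a.)"""
    p = 1.0
    for n in range(len(lam)):
        k_n = s_next[n] - s[n]  # K_n: change in users of network n
        # Skellam = difference of independent Poissons with means
        # mu1 = lam*tau (arrivals) and mu2 = nu*tau (departures).
        p *= skellam.pmf(k_n, lam[n] * tau, nu[n] * tau)
    return p

# Example with the Section IV rates (per minute) and tau = 10 minutes:
lam = np.array([0.4, 0.3, 0.25])
nu = np.array([0.5, 0.4, 0.3])
print(transition_prob([3, 2, 5], [2, 2, 6], lam, nu, tau=10.0))
```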
Solving the Bellman equation (5) requires determining the reward function $R(s, a)$ (i.e., the QoS metric) and a method for obtaining the optimal policy; both are explained in the next two subsections.

A. Obtaining an Optimal Policy

There are various algorithms for solving MDPs. For the infinite horizon MDP with finite state and action spaces, value iteration is an efficient algorithm that works as follows:

1. For each state $s \in \mathcal{S}$, initialize $V(s) := 0$.
2. Repeat until convergence: for every state $s \in \mathcal{S}$, update

$$V(s) := \max_{a \in \mathcal{A}} \left[ R(s, a) + \gamma \sum_{s'} P_{sa}(s') V(s') \right]. \quad (7)$$

After obtaining the value function $V(s)$, the user determines its action $a_k$, given its current state $s_k$, according to

$$a_k = \arg\max_{a \in \mathcal{A}} \left[ R(s_k, a) + \gamma \sum_{s'} P_{s_k, a}(s') V(s') \right]. \quad (8)$$

In a finite horizon MDP, the value function is time dependent, i.e., a different value function $V_k$ is generated for every time step $k$. The finite horizon MDP problem has a simple solution method based on dynamic programming that uses backward induction:

1. For each state $s \in \mathcal{S}$, initialize

$$V_D(s) := \max_{a \in \mathcal{A}} R(s, a). \quad (9)$$

2. For $k = D-1, D-2, \ldots, 1$ and $\forall s \in \mathcal{S}$, let

$$V_k(s) := \max_{a \in \mathcal{A}} \left[ R(s, a) + \gamma \sum_{s'} P_{sa}(s') V_{k+1}(s') \right] \quad (10)$$

$$a_k(s) := \arg\max_{a \in \mathcal{A}} \left[ R(s, a) + \gamma \sum_{s'} P_{sa}(s') V_{k+1}(s') \right]. \quad (11)$$

The user at state $s_k$ then selects action $a_k(s_k)$.
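Both solvers are direct to implement once $R$ and $P_{sa}$ are tabulated. In the sketch below, `R` is an $|\mathcal{S}| \times |\mathcal{A}|$ array and `P` an $|\mathcal{A}| \times |\mathcal{S}| \times |\mathcal{S}|$ array, assumed precomputed; note that because $s(N+1)$ encodes the previously selected network, the handover-dependent reward (3) is expressible as a function of $(s, a)$ alone. The array shapes and names are implementation choices, not the paper's.

```python
import numpy as np

def value_iteration(R, P, gamma=0.995, tol=1e-6):
    """Infinite horizon MDP: update (7), then policy extraction (8).
    R: (S, A) reward table; P: (A, S, S) transition matrices."""
    S, A = R.shape
    V = np.zeros(S)
    while True:
        # Q[s, a] = R(s, a) + gamma * sum_{s'} P_sa(s') V(s')
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # value function and policy (8)
        V = V_new

def backward_induction(R, P, D, gamma=0.995):
    """Finite horizon MDP: initialization (9), recursion (10)-(11).
    Returns one policy array per time step (0-indexed)."""
    policies = [None] * D
    V = R.max(axis=1)                        # V_D(s) := max_a R(s, a)  (9)
    policies[-1] = R.argmax(axis=1)
    for k in range(D - 2, -1, -1):
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        V = Q.max(axis=1)                    # V_k(s)   (10)
        policies[k] = Q.argmax(axis=1)       # a_k(s)   (11)
    return policies
```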
B. QoS Metric

One basic measure of QoS for each user is the achieved throughput. WiFi networks based on the IEEE 802.11 standard employ the Carrier Sense Multiple Access with Collision Avoidance (CSMA/CA) protocol for packet transmissions. Since the user throughput is affected by the actions of the other, random, users in the network, calculating it is a tedious task. However, a common method of throughput calculation in persistent CSMA systems, introduced in [11] and further extended in [3], uses the following equation:

$$T = \frac{N}{D} \quad (12a)$$

where

$$N = pM \sum_{l=0}^{\infty} \left[ \frac{(1-p)^{l} - (1-g)^{l + (1/a)}}{p} \left( \frac{p(1-p)^{l} - g(1-g)^{l}}{p-g} \right)^{\!M-1} - \frac{(1-p)^{l+1} - (1-g)^{l+1+(1/a)}}{p} \left( \frac{p(1-p)^{l+1} - g(1-g)^{l+1}}{p-g} \right)^{\!M-1} \right] \quad (12b)$$

and

$$D = 1 + a + a \sum_{l=1}^{\infty} \frac{(1-p)^{l} - (1-g)^{l + (1/a)}}{p} \left( \frac{p(1-p)^{l} - g(1-g)^{l}}{p-g} \right)^{\!M}. \quad (12c)$$

In this calculation, slotted time with slot duration $a$ is considered, $M$ is the number of users in the network, $p$ is the probability that a user starts transmission at a slot boundary, and $g = \min\{1, aG/M\}$ is the packet arrival rate for total offered load $G$. The reader is referred to [11] for further detail.
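Because $p, g \in (0, 1)$, the tails of the infinite sums in (12b) and (12c) decay geometrically, so truncation gives a practical approximation. The sketch below evaluates the reconstruction of (12) given above, with an assumed truncation at 200 terms; it illustrates the formula's structure and should be checked against [11] before being relied upon.

```python
def wifi_throughput(M, a=0.01, G=10.0, p=0.5, L_MAX=200):
    """Normalized WiFi throughput T = N/D from (12a)-(12c) for M users in a
    persistent CSMA network; a = slot duration, G = total offered load,
    p = transmission probability. The infinite sums are truncated at L_MAX."""
    if M < 1:
        return 0.0
    g = min(1.0, a * G / M)  # per-user packet arrival rate

    def factors(l):
        # The two recurring fractions of (12b) and (12c) at index l.
        f1 = ((1 - p) ** l - (1 - g) ** (l + 1 / a)) / p
        f2 = (p * (1 - p) ** l - g * (1 - g) ** l) / (p - g)
        return f1, f2

    N = 0.0
    for l in range(L_MAX):                    # sum in (12b)
        f1, f2 = factors(l)
        h1, h2 = factors(l + 1)
        N += f1 * f2 ** (M - 1) - h1 * h2 ** (M - 1)
    N *= p * M

    D = 1.0 + a                               # sum in (12c)
    for l in range(1, L_MAX):
        f1, f2 = factors(l)
        D += a * f1 * f2 ** M
    return N / D
```

With the Section IV parameters ($a = 0.01$, $G = 10$, $p = 0.5$), scaling the normalized result by each WiFi network's maximum data rate (6 Mbps for 802.11g, 7.2 Mbps for 802.11n) would give per-user throughput in Mbps; that scaling step is my assumption rather than something the paper states explicitly.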
The WiMAX network throughput depends on the OFDMA signal structure, the permutation scheme, and the channel bandwidth [12]. The OFDMA technology supports multiple modulation schemes depending on a user's distance within the cell, and the modulation schemes and their attendant code-rate variations deliver varying bandwidth capabilities by channel size. WiMAX throughput for different modulation schemes and channel bandwidths has been reported in the literature [12], [13]. Similarly, the LTE network throughput depends on the OFDMA signal structure, modulation, coding, and channel bandwidth. In contrast to WiFi, LTE and WiMAX provide dedicated bandwidth per user, so performance does not degrade as the number of network users increases.

IV. SIMULATION RESULTS

The heterogeneous system model considered here consists of only WiFi and WiMAX networks, since the interoperability and seamless transition between these two have been extensively analyzed in the literature [3], [14]. However, the work presented here can be extended to other networks, provided that handover among all the networks is possible. Commonalities between 802.11n/g WiFi and 802.16 Mobile WiMAX enable a smooth integration of the two and hence are the focus of the analysis presented here. Both WiFi and WiMAX are open IEEE wireless standards built for Internet Protocol-based applications. Furthermore, the WiMAX and WiFi (802.11g and 802.11n) technologies are both based on an OFDM air interface, and IEEE 802.11n and WiMAX employ MIMO antenna mechanisms and can share the same antennas. Achieving a transparent handover between WiFi and WiMAX requires a controller, which can be realized through the existing Access Service Network Gateway [14].
A. Simulation Parameters

Simulations are performed for a heterogeneous network consisting of one WiMAX network and two WiFi networks. It is assumed that the WiMAX uplink throughput is 16 Mbps for a single cell, achievable using 16-QAM modulation with code rate 3/4 [15]. For simplicity, ignoring channel conditions, the WiMAX network provides a data rate of 1 Mbps to each user independent of the number of users present; however, it will not accept more than 16 users. WiFi network 1 is based on 802.11g with a maximum data rate of 6 Mbps, and WiFi network 2, based on 802.11n, provides a maximum data rate of 7.2 Mbps. Equation (12a) is employed for calculating the WiFi throughput with $a = 0.01$ sec; all users are assumed to behave similarly, with the same values $G = 10$ packets and $p = 0.5$. A one-hour connection time is considered, consisting of 6 time steps of duration $\tau = 10$ minutes each. This finite-duration decision process is modeled as a finite horizon MDP. For comparison purposes, an infinite horizon MDP is also employed as an approximation to the finite horizon model. To solve the infinite horizon model and find the optimum network selection policy, value iteration with discount factor $\gamma = 0.995$ is used; simulations show that this algorithm requires 15 iterations, on average, to converge to the optimum. The finite horizon MDP is solved using backward induction as discussed in Section III-A. For evaluation purposes, the network is simulated $10^3$ times, starting with a random state for each trial. The states of the networks evolve according to Skellam distributions with $\lambda_n = 0.4, 0.3, 0.25$ and $\nu_n = 0.5, 0.4, 0.3$ for WiFi 1, WiFi 2, and the WiMAX network, respectively. With these values, the departure rate of users in every network exceeds the corresponding arrival rate over the simulated period. The location of the target user is chosen such that the largest WiFi signal strength it receives always comes from WiFi 2; therefore, WiFi 2 is always selected by the predefined selection method. The value of $\alpha$ in the reward function is varied to evaluate the algorithm's performance in terms of the number of handovers.

B. Discussion

The QoS results (averaged over $10^3$ simulations) are depicted in Figure 3. The horizontal axis represents time, and the vertical axis represents the average throughput resulting from the four selection methods described: dynamic selection based on an Infinite horizon MDP (IMDP), dynamic selection based on a Finite horizon MDP (FMDP), the opportunistic algorithm, and fixed selection of WiFi 2. The results of the FMDP and IMDP algorithms evaluated at $\alpha = 0.55$, 0.75, and 0.95 appear in this figure. The general increasing trend in throughput for all selection methods results from the departure rates being greater than the arrival rates: decreasing the number of users in a WiFi network increases the average throughput per user. The average number of handovers over the 6 time steps appears in Table I.
TABLE I
AVERAGE NUMBER OF HANDOVERS: COMPARISON OF DIFFERENT STRATEGIES

α               FMDP     IMDP
0.55            1.074    1.088
0.75            0.452    0.507
0.95            0.145    0.200
Opportunistic   1.285
Figure 3 verifies that the QoS performance of the MDP-based algorithms comes close to that of the optimum opportunistic algorithm, while the number of handovers shown in Table I, and hence the communication overhead, is reduced. The FMDP and IMDP average-throughput curves for $\alpha = 0.55$ almost overlap the opportunistic curve while, as seen in Table I, requiring fewer handovers on average. At the other extreme, while the IMDP and FMDP curves for $\alpha = 0.95$ move further from the optimum QoS performance, the average number of handovers drops dramatically. Varying $\alpha$ thus trades off the number of handovers against average throughput, and the desired balance can be achieved by setting $\alpha$ appropriately. Furthermore, Figure 3 and Table I indicate that the FMDP and IMDP algorithms perform similarly for each value of $\alpha$, in terms of both QoS and number of handovers. The slight difference between the FMDP and IMDP results stems from the FMDP algorithm knowing that there are only 6 time steps over which the actions must be optimized, whereas the IMDP model finds an optimal policy assuming the states and actions continue forever. Figure 3 demonstrates that at the final time steps, the FMDP with $\alpha = 0.55$ performs almost as well as the optimum opportunistic algorithm, while the FMDP with $\alpha = 0.75$ and $0.95$ deviates from the corresponding IMDP performance and hence from the optimum throughput. This behavior arises because the FMDP algorithm knows it is in the final time steps of its decision process and makes decisions that optimize the final reward independent of the future. The FMDP with $\alpha = 0.55$ approaches the optimum throughput because its reward is weighted mostly by throughput performance rather than by the number of handovers ($\alpha$ close to 0.5). Conversely, the FMDP with $\alpha = 0.95$ heavily favors reducing the number of handovers and hence deviates from the optimum QoS while doing so.

V. SUMMARY

This paper examines network selection in a heterogeneous wireless setting through dynamic programming. The goal is to maximize user QoS and minimize the number of handovers by dynamically selecting networks based on the current and future states of the network. The user QoS obtained through the proposed MDP algorithm is found to be comparable to that of the optimum QoS-achieving opportunistic selection algorithm, while the number of handovers is significantly reduced without sacrificing high QoS.
Fig. 3. QoS comparison of network selection algorithms: average throughput (Mbps) over time steps $t_1, \ldots, t_6$ for IMDP and FMDP at $\alpha = 0.55, 0.75, 0.95$, the opportunistic algorithm, and fixed WiFi selection.
REFERENCES

[1] Q. Y. Song and A. Jamalipour, "Network selection in an integrated wireless LAN and UMTS environment using mathematical modeling and computing techniques," IEEE Trans. Wireless Commun., vol. 12, no. 3, pp. 42-48, Jun. 2005.
[2] F. Bari and V. C. M. Leung, "Automated network selection in a heterogeneous wireless network environment," IEEE Network, vol. 21, no. 1, pp. 34-40, Jan. 2007.
[3] Z. Dai, R. Fracchia, J. Gosteau, P. Pellati, and G. Vivier, "Vertical handover criteria and algorithm in IEEE 802.11 and 802.16 hybrid networks," in Proc. IEEE Int. Conf. Commun. (ICC), Beijing, China, pp. 2480-2484, May 2008.
[4] W. Luo and E. Bodanese, "Radio access network selection in a heterogeneous communication environment," in Proc. IEEE Wireless Commun. and Networking Conf. (WCNC), Budapest, Hungary, pp. 1-6, Apr. 2009.
[5] Y. Wang, L. Zheng, J. Yuan, and W. Sun, "Median based network selection in heterogeneous wireless networks," in Proc. IEEE Wireless Commun. and Networking Conf. (WCNC), Budapest, Hungary, pp. 1-5, Apr. 2009.
[6] P. Demestichas, G. Dimitrakopoulos, K. Tsagkaris, K. Demestichas, and J. Adamopoulou, "Reconfigurations selection in cognitive, beyond 3G, radio infrastructures," in Proc. Int. Conf. Cognitive Radio Oriented Wireless Networks and Commun., pp. 1-5, Jun. 2006.
[7] Y. Xue, Y. Lin, Z. Feng, H. Cai, and C. Chi, "Autonomic joint session scheduling strategies for heterogeneous wireless networks," in Proc. IEEE Wireless Commun. and Networking Conf. (WCNC), Las Vegas, NV, pp. 2045-2050, Apr. 2008.
[8] L. Saker, S. B. Jemaa, and S. E. Elayoubi, "Q-learning for joint access decision in heterogeneous networks," in Proc. IEEE Wireless Commun. and Networking Conf. (WCNC), Budapest, Hungary, pp. 1-5, Apr. 2009.
[9] E. Stevens-Navarro, Y. Lin, and V. W. S. Wong, "An MDP-based vertical handoff decision algorithm for heterogeneous wireless networks," IEEE Trans. Veh. Technol., vol. 57, no. 2, pp. 1243-1254, Feb. 2008.
[10] J. G. Skellam, "The frequency distribution of the difference between two Poisson variates belonging to different populations," J. Royal Stat. Soc., Series A, vol. 109, no. 3, p. 296, 1946.
[11] H. Takagi and L. Kleinrock, "Throughput analysis for persistent CSMA systems," IEEE Trans. Commun., vol. COM-33, no. 7, pp. 627-638, Jul. 1985.
[12] WiMAX Forum, "Mobile WiMAX Part I: A technical overview and performance evaluation," 2006.
[13] J. G. Andrews, A. Ghosh, and R. Muhamed, Fundamentals of WiMAX: Understanding Broadband Wireless Networking, Pearson Education, 2007.
[14] Motorola and Intel, "WiMAX and WiFi together: Deployment models and user scenarios," white paper.
[15] D. Pareit, V. Petrov, B. Lannoo, E. Tanghe, W. Joseph, I. Moerman, P. Demeester, and L. Martens, "A throughput analysis at the MAC layer of mobile WiMAX," in Proc. IEEE Wireless Commun. and Networking Conf. (WCNC), Sydney, Australia, Apr. 2010.