Wireless Personal Communications (ISSN 0929-6212), DOI 10.1007/s11277-013-1028-9
Reinforcement Learning for Multiple Access Control in Wireless Sensor Networks: Review, Model, and Open Issues

Mohammad Fathi · Vafa Maihami · Parham Moradi
Department of Electrical and Computer Engineering, University of Kurdistan, Sanandaj, Iran
e-mail: [email protected] (M. Fathi); [email protected] (V. Maihami); [email protected] (P. Moradi)
© Springer Science+Business Media New York 2013
Abstract Wireless sensor networking is a viable communication technology among low-cost and energy-limited sensor nodes deployed in an environment. Owing to its operational advantages, the application area of this technology has expanded significantly, but with some energy-related challenges. One main cause of energy waste in these networks is idle listening, characterized by periods with no communication activity. This drawback can be mitigated by means of energy-efficient multiple access control schemes that minimize idle listening. In this paper, we discuss the applicability of distributed learning algorithms, namely reinforcement learning, to multiple access control (MAC) in wireless sensor networks. We perform a comparative review of relevant work in the literature and then present a cooperative multi-agent reinforcement learning framework for MAC design in wireless sensor networks. Accordingly, the paper concludes with some major challenges and open issues of distributed MAC design using reinforcement learning.

Keywords Multiple access control · Reinforcement learning · Scheduling · Wireless sensor networks · Optimization
1 Introduction Wireless sensor networks (WSNs) are becoming a promising technology in environmental monitoring and distributed data processing [1]. A WSN consists of a number of small and low-cost nodes deployed in an environment. These nodes sense and forward target data to a central sink in a collaborative and distributed manner. WSNs are typically characterized
by energy-constrained nodes [2]. In general, these nodes are battery-powered and cannot be recharged after deployment. This issue highlights the importance of energy-efficient communication protocols. One key protocol in wireless networking is the medium access control (MAC) protocol, which coordinates the nodes' access to a shared channel, in either a centralized or a distributed manner, in order to avoid interference. Following a MAC protocol, a WSN node switches its radio on to enter the active mode, in which it can transmit, receive, or be in idle listening. After a period of time with no activity, the node switches its radio off and enters the sleep mode to save energy. Ideally, a sensor node should be active only while transmitting or receiving. This can be achieved by means of an energy-aware MAC protocol that dynamically determines the operation mode of individual radios within a WSN [3]. Towards this end, distributed learning algorithms have been employed in the literature [4].

Distributed learning in the context of WSNs can be employed from the viewpoint of both application and networking issues. The former deals with providing the network with a set of distributed training data for applications such as distributed estimation, inference and detection [4,5]. The latter deals with distributed decision-making approaches that provide a communication infrastructure handling data transmission in the network. Distributed learning from the application perspective requires efficient communication protocols that respect energy, bandwidth and delay limits. Recognizing this demand, the question is how to customize existing network protocols, or to design new ones, for WSNs. Inspired by the idea of using distributed rather than centralized learning in WSNs [4], we evaluate in particular the possibility of employing distributed learning approaches in the MAC layer of WSNs. The question is how the constraints of energy, bandwidth and delay can be addressed in the context of this learning.

In this paper, we first revisit the basic protocols for MAC design in WSNs. Having discussed distributed learning, we focus on reinforcement learning as a model-free learning scheme. Proposed MAC designs using reinforcement learning are surveyed, and accordingly we propose a cooperative multi-agent reinforcement learning framework. Finally, major challenges and open issues in the area are presented.

The rest of this paper is organized as follows. In Sect. 2, the required background, including multiple access techniques, basic MAC protocols in WSNs, and learning approaches, is described. Section 3 describes how reinforcement learning can be employed in WSNs. MAC protocols in WSNs based on this learning are then surveyed, and a cooperative multi-agent reinforcement learning framework is proposed in Sect. 4. The paper is concluded with open issues and challenges in Sect. 5.
2 Background

2.1 MAC Protocol Concepts

Multiple access techniques in wireless networks are classified into contention-free and contention-based schemes. In contention-free access, an entity is in charge of allocating radio resources such as time, frequency, and code to the network nodes. To enable simultaneous transmissions in the network, resource units should be scheduled so that co-channel interference is avoided. The contention-free schemes are as follows [6]:
• Time-division multiple access (TDMA): In this scheme, time is typically divided into fixed-length slots. These slots are then allocated to different nodes to eliminate the
interference between simultaneous transmissions. A common difficulty in TDMA is that network nodes are required to be synchronized with each other.
• Frequency-division multiple access (FDMA): This scheme divides the entire frequency bandwidth into a set of orthogonal channels. These channels are allocated to network nodes for interference-free transmission.
• Code-division multiple access (CDMA): In CDMA, the data transmitted by each node is encoded using a unique spreading code. Spreading codes for different users are orthogonal to each other. A receiver is able to decode its own data successfully if the signal-to-interference-plus-noise ratio (SINR) is maintained above a threshold value. In contrast to TDMA and FDMA, multiple nodes can transmit simultaneously on the same channel in CDMA. The number of supported users is limited by the number of orthogonal codes.

Unlike contention-free channel access, in contention-based access, network nodes compete with each other in a distributed manner to access the channel. In the case of simultaneous transmissions from several nodes, a collision occurs and all ongoing transmissions fail. A node may try to retransmit its data later, when the channel is sensed to be idle, following a specific MAC protocol. On the other hand, if there is no collision with a signal from another node, the transmitted packet is received successfully. Compared to contention-free access, contention-based access is not only robust to the network topology and traffic load, but also scalable with the network size and node density. Two basic contention-based schemes are summarized below [6]:
• ALOHA: Once a node has a packet to send, it accesses the channel for transmission. If the packet collides with packets from another transmitter, the node tries to retransmit the packet later. The ALOHA protocol can also be operated in a slotted fashion (slotted-ALOHA), in which time is divided into slots and packet transmissions are triggered at the beginning of each time slot.
• Carrier sense multiple access (CSMA): CSMA is a probabilistic medium access method in which a node senses the status of the channel, to see whether it is idle or busy, prior to initiating a transmission. If the channel is idle, the node owns it for a certain amount of time. Otherwise, if the channel is busy or a collision occurs during transmission, the node waits for a random time interval and senses the channel again. Two enhanced versions of CSMA are CSMA with collision detection (CSMA/CD) and CSMA with collision avoidance (CSMA/CA). In CSMA/CD, assuming that a node is able to detect a collision, the ongoing transmission is terminated once a collision is detected. Furthermore, CSMA/CA makes use of backoff methods to extend the retransmission interval and thereby avoid collisions.

2.2 Basic MAC Protocols in WSNs

Below is a review of basic MAC protocols in WSNs.
• S-MAC: The first seminal work in this area is the sensor-MAC (S-MAC) scheme proposed in [7]. As shown in Fig. 1, S-MAC divides time into frames, where each frame in turn consists of equal-length active and sleep intervals with time slots denoted by t_s. This scheme partitions the entire network into virtual clusters. Each cluster consists of a number of neighboring nodes, so as to set up a common active-sleep schedule within each cluster. Every node follows the schedule of its own cluster head and uses RTS/CTS signalling for transmission in the active mode. Due to its static scheduling, S-MAC is energy-efficient in that it avoids idle listening, but conversely it results in high
Fig. 1 S-MAC scheduling scheme
latency, especially in multihop transmissions. This is due to the fact that fixed scheduling intervals cannot be adapted to a variable traffic load. In WSNs, nodes closer to the sink node are expected to relay high and time-varying traffic loads.
• T-MAC: To overcome the drawbacks of S-MAC, timeout-MAC (T-MAC), proposed in [8], adaptively ends the listening period within each frame and enters the sleep mode if no activation event has occurred after a timeout interval. This property makes T-MAC suitable for heavy traffic loads. Despite this, the early sleep problem might occur when a node backs off at the beginning of a frame due to high contention.

A comparative study of the aforementioned schemes has been done in [9] and more recently in [10]. As a result of the distributed environment, scheduling schemes in WSNs require modeling the interactions between nodes. Due to the difficulty of such modeling, and given the advantage of RL in eliminating the need for an environment model, this type of learning has recently been used for distributed MAC protocols in WSNs.

2.3 Learning Approaches

Learning is the ability to improve a certain behaviour based on past experiences from an environment. Machine learning in general is learning in which an intelligent agent uses previous observations to extend its knowledge for better decision making. Three learning approaches are as follows:
• Supervised learning: In this approach, an agent utilizes a stream of training data from a supervisor to learn a specific behavior. This type of learning generates a function from labeled training data that maps inputs to desired outputs. For example, in a classification problem, the learner approximates a function mapping a vector into some classes by looking at input-output examples of the function. In supervised learning, each data item is a pair consisting of an input object and a desired output value. This method is also called learning with a teacher, because the training data is labeled by human experts. Artificial neural networks, decision trees, Bayesian networks, support vector machines, and learning automata are well-known supervised learning algorithms.
• Unsupervised learning: In this approach, an agent learns without any human intervention or feedback from the environment. In other words, this learning refers to the problem of discovering hidden structure in an unlabeled data set. Since the data given to the learner is unlabeled, there is no error signal with which to evaluate a potential solution. Unsupervised learning is also called learning without a teacher, because the desired output is not available in the training data. In fact, the basic task of this learning is to develop classification labels automatically. Unsupervised algorithms seek out similarities between pieces of data in order to determine whether they can be characterized as forming a group or a cluster. The goal of these algorithms is to discover the structure of the data.
• Reinforcement learning (RL): This type of learning is inspired by principles of human and animal learning and is concerned with how an intelligent agent should take its actions in
an unknown environment in order to maximize its cumulative reward. In this type of learning, an agent becomes intelligent in selecting actions through continual interaction with an environment. The agent takes actions that affect the current state of the environment. The value of executing an action is conveyed to the agent via a scalar reinforcement signal, called the reward. As a result of the rewards received from the environment, the agent asymptotically reaches the optimal policy [11]. The agent utilizes this information towards better decision making, translated into maximizing the state value function, i.e., the expected total reward an agent can accumulate in the long term. This type of learning sits between supervised and unsupervised learning, as there are only inputs with rewards, instead of training data containing input-output pairs.

There are two main approaches to finding the optimal policy in RL. The first approach is model-based and requires known reward and transition functions; the optimal policy is found using dynamic programming or value iteration techniques. The second approach is model-free, with the assumption of unknown reward and transition functions; here, an agent tries to find the optimal policy by interacting with the environment. In many applications dealing with complex environments, it is not possible for the system designer to provide the reward and transition functions of the environment, so the second approach is more applicable in such environments. Temporal difference, Monte Carlo, and Q-learning are well-known model-free reinforcement learning techniques [12]. Q-learning is the most popular technique and converges to the optimal policy with probability one. Therefore, it has recently been used to design MAC protocols in WSNs [13].

Generally, in Q-learning, the interaction of an agent and the environment can be represented using a Markov decision process (MDP). A finite MDP is a tuple ⟨S, A, P, R⟩, where S is a finite set of states, A is a finite set of actions, P : S × A × S → [0, 1] is the state transition probability function, and R : S × A → ℝ is the reward function. Q-learning pseudocode based on this MDP for a single agent is described in Algorithm 1, where T is the number of iterations. At each decision time t, the agent observes a state s_t ∈ S and executes the action a_t ∈ A with the maximum Q-value Q(s_t, a_t). This action results in a state transition to s_{t+1} ∈ S at time t + 1. The agent obtains a scalar reward r_t ∈ ℝ that is a function of the current state and the action performed by the agent. The agent's goal is to find a map from states to actions, called a policy, that maximizes the expected discounted reward over time, E[Σ_{t=0}^{∞} γ^t r_t], where 0 < γ < 1 is a discount factor. This factor gives preference either to immediate rewards (γ close to 0) or to rewards in the future (γ close to 1). In line 6, based on the obtained reward r_t, the agent updates the corresponding Q-value; 0 < α < 1 is the learning rate, which tunes how strongly new experience overrides the previous Q-value estimate.
Algorithm 1 Single-agent Q-learning
1: Set t = 0 and initialize the Q-values Q(s, a) for all s ∈ S and a ∈ A.
2: while t < T do
3:   Observe the current state s_t.
4:   Select the next action a_t = arg max_{a' ∈ A} Q(s_t, a').
5:   Apply a_t, observe the next state s_{t+1} and the reward r_t = r(s_t, a_t).
6:   Update the Q-value: Q(s_t, a_t) ← (1 − α) Q(s_t, a_t) + α [ r_t + γ max_{a' ∈ A} Q(s_{t+1}, a') ].
7:   t = t + 1.
8: end while
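To make Algorithm 1 concrete, the following Python sketch implements tabular Q-learning under some assumptions of ours: the environment object with reset()/step() methods, the toy environment, the numerical parameter values, and the added ε-greedy exploration are illustrative choices rather than part of the algorithm as stated above.

```python
import random
from collections import defaultdict

def q_learning(env, n_actions, T=2000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning following Algorithm 1 (epsilon-greedy exploration added)."""
    Q = defaultdict(float)                      # line 1: Q(s, a) initialized to zero
    s = env.reset()                             # assumed environment interface
    for t in range(T):                          # line 2: run for T iterations
        if random.random() < epsilon:           # exploration step (our assumption)
            a = random.randrange(n_actions)
        else:                                   # line 4: greedy action selection
            a = max(range(n_actions), key=lambda x: Q[(s, x)])
        s_next, r = env.step(a)                 # line 5: apply action, observe s_{t+1}, r_t
        best_next = max(Q[(s_next, x)] for x in range(n_actions))
        # line 6: Q(s,a) <- (1 - alpha) Q(s,a) + alpha [ r + gamma max_a' Q(s',a') ]
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
        s = s_next                              # line 7: advance to the next decision time
    return Q

# Usage with a purely illustrative two-state, two-action environment.
class ToyEnv:
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        r = 1.0 if (self.s == 0 and a == 1) else 0.0   # reward only for action 1 in state 0
        self.s = (self.s + a) % 2
        return self.s, r

Q = q_learning(ToyEnv(), n_actions=2)
```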
3 Reinforcement Learning for MAC Design in WSNs

Learning with low and incremental information from an environment is a desirable property of RL that is well suited to wireless networks [14]. RL has been applied to a variety of problems such as routing [15–17], resource management and QoS provisioning [18,19], and dynamic channel selection [20]. The possibility of exchanging enough information in WSNs for distributed learning has been investigated in [21], which characterizes the limits of distributed learning. An RL approach based on Q-learning has been employed in [22] to dynamically choose a radio in multi-radio sensor nodes based on link layer statistics. A review of employing RL to provide intelligence in wireless networks has been presented in [23].

3.1 Components of Q-Learning

Due to its model-free property, Q-learning can also be employed for MAC design in WSNs. As a typical scenario, Fig. 2 illustrates the interaction between a WSN node as an agent and the rest of the network as an environment.

Fig. 2 Reinforcement learning in a WSN

The choice of Q-learning components for this scenario depends entirely on the network operator. However, the following are some guidelines towards this end:
• State: In a heterogeneous network, the state at each node depends on its priorities. One well-known measure is the queue length of packets generated by the node or received from other nodes. Another possible measure is the amount of available energy in the node, representing the node lifetime.
• Event: Each node is able to sense the channel for a certain period of time. One option to consider as the event is the proportion of this time in which the channel is idle. The number of reserved time slots in this period is another option.
• Action: The most probable actions for a sensor node are to transmit, to receive, or to go to sleep. Choosing one depends on the Q-value of the observed (state, action) pair.
• Reward: The reward is the effect of an action. It can be evaluated in terms of some performance measures, e.g., throughput as the number of successfully transmitted packets with positive weight and the energy consumed for this transmission with negative weight. The amount of energy needed to transmit a packet is a function of the packet size, channel statistics, the number of retransmission attempts, etc.

A sample state diagram of Q-learning in WSNs with three states is shown in Fig. 3. In each state, there are three possible actions: the node transmits (a_t), receives (a_r), or switches off its radio for sleeping (a_s). The effect of taking an action is evaluated using some performance measure or reward function. Details are provided in Sect. 4. In the following, we first survey existing RL-based MAC protocols in WSNs and then propose a cooperative multi-agent RL model for distributed MAC.

Fig. 3 State diagram

3.2 Survey of RL-Based MAC Protocols

Below is a list of contributions to RL-based MAC in WSNs.
• Adaptive power and rate allocation: In [14], reinforcement learning is employed to adapt the transmission parameters to the incoming traffic, buffer and channel conditions. Each node chooses the modulation level and the transmit power as its action to maximize the achieved throughput per unit of consumed energy. Reported results demonstrate a near-optimal transmission strategy.
• RL-MAC: This scheme, proposed in [24], is similar to the frame-based, duty-cycled structure of S-MAC but with an adaptive active duration per node, as illustrated in Fig. 4. At the beginning of every frame, each node reserves a number of time slots as its active mode and switches to the sleep mode afterwards. RL-MAC formulates the optimization of the active period as a Markov decision process and solves it using a Q-learning algorithm. In this context, the state is the number of packets queued for transmission and the action is the number of reserved time slots within each frame. Reasonably, the reward function is the ratio of successfully transmitted/received packets to the total reserved active time. Reported results demonstrate the effectiveness of RL-MAC compared to S-MAC and T-MAC.

Fig. 4 RL-MAC scheduling scheme
Table 1 Comparison of MAC protocols

Scheme          | Performance measure                                           | Active/sleep mode                                              | Algorithm
S-MAC           | Energy efficiency                                             | Static and periodic schedule                                   | Predetermined
T-MAC           | Latency                                                       | Dynamic active duration, adapted to variable traffic load      | Time-out based mechanism
RL-MAC          | Energy efficiency (high throughput per total consumed energy) | Dynamic active duration, adapted to the node's traffic load    | Q-learning
SSA             | Energy efficiency and latency                                 | Probabilistic transition between transmission and sleep modes  | Q-learning
Scheme in [25]  | Energy efficiency (maximize the network lifetime)             | Learn the probability distribution of the sleep mode           | Distributed learning
• SSA: As an underlying assumption in this scheme, when a node is in the idle state, it either enters the sleep mode or remains idle. In the self-learning scheduling approach (SSA) proposed in [11], each node learns its transmission probability when it is in idle listening. This learning is performed using a Q-learning algorithm with the objective of reducing energy consumption and achieving low latency. Reported results are claimed to outperform the S-MAC scheme in terms of energy consumption and latency.
• Probabilistic modeling: The authors in [25] employ a probabilistic model to determine the sleep duration. They define a measure to evaluate the average energy efficiency of a node within each frame. The node then employs a distributed learning algorithm to update the probability distribution of the sleep duration based on the energy efficiency achieved in the previous frames.

In summary, almost all the proposed schemes follow a contention-based channel access method such as CSMA together with a synchronous duty-cycling schedule for the active/sleep modes. This approach is reasonable in WSNs, as it is more flexible with respect to the network topology and node density and, more importantly, it avoids the energy waste caused by idle listening and overhearing. In Table 1, we list the characteristics of the above-mentioned schemes, except the scheme in [14], whose context is different (choosing the optimal transmit rate and power). The characteristics are expressed in terms of the network performance measure and the way a given node determines the active and sleep durations per duty cycle. The algorithms employed towards this end are also reported.

As two prominent measures in WSNs, energy efficiency and latency are investigated in Table 1. Reported results in SSA and [15] indicate the improvement of both energy efficiency and latency simultaneously. To the authors' knowledge, this achievement might not be possible, as there is a trade-off between these performance measures. Indeed, low latency is possible only if the network nodes are mostly in the active and listening modes; this scheduling is of course energy consuming, due to the large amount of possible idle listening and overhearing. Moreover, except for S-MAC, the length of the active or sleep mode within each duty cycle is dynamic so as to adapt to a variable traffic load, which is a common case in WSNs. To model the stochastic nature of the network, the last two schemes in Table 1 employ probabilistic models that are updated over time. Finally, as implied by the last column, the distributed and dynamic
nature of the proposed scheduling schemes is achieved through distributed learning algorithms such as the well-known Q-learning algorithm.

4 Cooperative Multi-Agent RL for MAC Design in WSNs

In the single-agent RL model for WSNs, each agent takes its actions regardless of the other agents' actions. In this model, if two agents wish to transmit data, both of them achieve a positive reward, whereas in a WSN a collision occurs and both transmitted packets are corrupted; indeed, they should receive negative rewards. Therefore, the single-agent RL model for WSNs possibly does not converge to the optimal Q-values. To overcome this difficulty, the actions of the other agents should be considered in the model of a given agent. Consequently, a cooperative multi-agent RL model is considered for MAC design in WSNs in the following.

RL has originally been treated using an MDP in which a single agent has to learn a policy that maximizes the long-term reward in a stochastic environment. Since this framework does not allow multiple agents working in the same environment, the MDP model has been extended to the multi-agent MDP (MAMDP). In this framework, several agents with different reward functions make decisions in order to maximize their own long-term rewards. MAMDPs can be categorized into cooperative and competitive schemes. In the competitive case, the reward functions of the competing agents are different, while in the cooperative case the reward functions of all agents are the same. A cooperative MAMDP is considered for MAC design in WSNs, with the same reward model for all nodes. This is based on the fact that all sensor nodes are supposed to cooperate with each other to send a maximum number of packets to the sink node.

The MAMDP extends the MDP framework to a set of N agents. Each agent n has its own set of local states S_n and actions A_n. A MAMDP is a tuple ⟨N, S, A, P, R⟩, where N is the set of agents, S is the global finite set of states, and A is the joint action space. The joint action space is the Cartesian product of the action sets of all N agents, i.e., A = A_1 × A_2 × ... × A_N. P : S × A × S → [0, 1] is the state transition probability function and R : S × A → ℝ is the reward function. The single-agent Q-learning of Algorithm 1 is extended to cooperative multi-agent Q-learning in Algorithm 2. The procedure is mostly similar to that of Algorithm 1, but with some message passing among the nodes. Agent n selects its own action as
a_t^n = arg max_{a^n ∈ A_n} Σ_{n=1}^{N} Q^n(s_t, (a^1, a^2, ..., a^n, ..., a^N)).    (1)
The main difference from single-agent Q-learning is that the Q-values Q(s, ā) should be calculated for all joint actions of all agents. Moreover, agent n updates its table of Q-values as

Q^n(s_t, ā_t) ← (1 − α) Q^n(s_t, ā_t) + α [ r(s_t, ā_t) + γ max_{ā ∈ A} Q^n(s_{t+1}, ā) ],    (2)
where s_t is the current state of agent n, ā_t = (a_t^1, a_t^2, ..., a_t^N) is the joint action of all agents at time t, 0 < α < 1 is the learning rate, and 0 < γ < 1 is the discount factor. The agents should provide each other with the joint actions by message passing. If two agents try to transmit data concurrently (in the same frame), a collision occurs and both data packets are corrupted, so in this case a negative reward is assigned to both agents.
Algorithm 2 Multi-agent Q-learning at agent n
1: Set t = 0 and initialize the Q-values Q^n(s, ā) for all s ∈ S and ā ∈ A.
2: while t < T do
3:   Observe the current state s_t.
4:   Select the next action a_t^n = arg max_{a^n ∈ A_n} Σ_{n=1}^{N} Q^n(s_t, (a^1, a^2, ..., a^n, ..., a^N)).
5:   Apply a_t^n, observe the next state s_{t+1} and the reward r(s_t, ā_t).
6:   Update Q^n(s_t, ā_t) ← (1 − α) Q^n(s_t, ā_t) + α [ r(s_t, ā_t) + γ max_{ā ∈ A} Q^n(s_{t+1}, ā) ].
7:   t = t + 1.
8: end while
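The sketch below is a minimal Python rendering of Algorithm 2 and Eqs. (1)–(2), not a definitive implementation: we assume that each node keeps its own Q-table indexed by (state, joint action), that the other agents' announced actions are available through the message passing described above, and that Eq. (1) is evaluated with those announced actions held fixed; the class name, the transmit/receive/sleep action labels, and the parameter values are our own illustrative choices.

```python
import itertools
from collections import defaultdict

class CooperativeAgent:
    """One sensor node running the per-agent cooperative Q-update of Algorithm 2."""
    def __init__(self, n, action_sets, alpha=0.1, gamma=0.9):
        self.n = n                                   # this agent's index
        self.action_sets = action_sets               # action sets A_1, ..., A_N of all agents
        self.joint_actions = list(itertools.product(*action_sets))
        self.alpha, self.gamma = alpha, gamma
        self.Q = defaultdict(float)                  # Q^n(s, joint action), zero-initialized

    def select_action(self, s, others_actions, all_q_tables):
        # Eq. (1), read as: hold the other agents' announced actions fixed (message passing)
        # and pick the local action a^n maximizing the sum of the agents' Q-values.
        def joint_with(a_n):
            ja = list(others_actions)                # other agents' actions, in index order
            ja.insert(self.n, a_n)
            return tuple(ja)
        return max(self.action_sets[self.n],
                   key=lambda a_n: sum(Q[(s, joint_with(a_n))] for Q in all_q_tables))

    def update(self, s, joint_a, reward, s_next):
        # Eq. (2): Q^n(s, a) <- (1 - alpha) Q^n(s, a)
        #                        + alpha [ r(s, a) + gamma max_{a'} Q^n(s', a') ]
        best_next = max(self.Q[(s_next, ja)] for ja in self.joint_actions)
        self.Q[(s, joint_a)] = ((1 - self.alpha) * self.Q[(s, joint_a)]
                                + self.alpha * (reward + self.gamma * best_next))

# Illustrative instantiation: three nodes, each with transmit/receive/sleep actions.
action_sets = [('tx', 'rx', 'sleep')] * 3
agents = [CooperativeAgent(n, action_sets) for n in range(3)]
```

In a frame-based simulation, each node would call select_action at the start of a frame, exchange the chosen actions with the other nodes, and then call update once the common reward of the frame is known.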
The reward of agent n with action a_t^n in state s_t in the different situations is as follows:

r^n(s_t, a_t^n) = r(s_t, ā_t) =
  α b − β p̂    if a_t^n = a_t (transmit) and there is no collision with node n,
  −β p         if a_t^n = a_t (transmit) and there is at least one collision with node n,
  α b̂ − β p̂    if a_t^n = a_r (receive),
  0            if a_t^n = a_s (sleep),                                              (3)

where ā_t is the joint action at time t, b is the number of data packets transmitted in the frame and p is the power consumed by the sensor in the frame. Moreover, b̂ is the number of data packets received in the frame and p̂ is the power consumed for this reception. α and β are controlling parameters that weight the effect of the data packets and of the consumed power, respectively, in calculating the reward.

In summary, RL turns the MAC protocol into a distributed decision-making process. Even though this process is not as optimal as a centralized scheduling scheme, the required computational complexity is distributed over all the network nodes. This helps to increase the network lifetime, as the processing load on individual nodes decreases.

5 Challenges and Open Issues

Modeling the interactions between nodes in a distributed environment such as a WSN is highly complex. This difficulty generally arises as a result of time-varying communication channels, stochastic arrival traffic and heterogeneous quality of service requirements. Besides some general issues such as collision avoidance, quality of service, mobility, and security, there are several open issues directly related to employing RL in WSNs, as follows:
• Space complexity: This is the amount of memory required by sensor nodes to store their Q-values. In single-agent Q-learning, the size of the Q-table is O(|S|K), where |S| is the number of states and K is the number of actions. However, in cooperative Q-learning with N nodes, the complexity is exponential in the number of nodes, as each sensor node should take into account the actions performed by all other nodes. Indeed, the size of the Q-table of each sensor node is O(|S|K^N). This requires that each sensor node have enough memory to store the Q-table values. In WSNs with resource-limited nodes, this requirement is highly challenging.
• Time complexity: According to the pseudocode of Algorithms 1 and 2, the Q-learning procedure stops after T time steps or iterations, i.e., the length of the time horizon. The time complexity of the algorithm therefore depends on the value of T, i.e., it is O(T). The value of T should be large enough to reach the optimal policy; it is well known that the optimal policy is achieved as the number of iterations goes to infinity [26].
• Updating the Q-table: The second challenge concerns updating the Q-tables in each sensor node. To update its Q-table, each sensor node needs to be aware of the actions that have been performed by the other sensor nodes. Each node should therefore send its executed action to the other N − 1 sensor nodes. In other words, N(N − 1) messages are exchanged in each frame, which results in energy consumption and decreases the WSN lifetime.
• Reward function: The third challenge is to define an appropriate reward function model. Suppose that if two sensor nodes in a WSN transmit data concurrently, both of them get a negative reward. But when two sensor nodes are sufficiently far from each other, no collision occurs, so both of them should receive a positive reward. It is therefore necessary to define a collision area in the reward model that takes into account the positions of the sensor nodes. Achieved throughput with a positive weight, and consumed energy and delay with negative weights, are known options for the reward function.
• Step size: Another challenge is step size selection. The step size, as the learning rate in the Q-learning algorithm, is used to tune the speed of learning and accordingly to achieve better convergence and stability. If it is too small, learning converges too slowly; if it is too large, learning becomes unstable. In the case of WSNs, step size selection depends entirely on the network dynamics, such as channel variation, traffic conditions, node mobility, and network topology [27].
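As a rough illustration of the space-complexity and Q-table update overheads discussed above, the short Python sketch below evaluates the per-node Q-table size O(|S| K^N) and the per-frame message count N(N − 1) for a few network sizes; the chosen figures (|S| = 10 states, K = 3 actions) are assumptions made only for illustration.

```python
def qtable_entries(num_states, num_actions, num_agents):
    # Cooperative multi-agent case: each node stores one Q-value per (state, joint action),
    # i.e. |S| * K^N entries; the single-agent case would need only |S| * K.
    return num_states * num_actions ** num_agents

def messages_per_frame(num_agents):
    # Each node announces its chosen action to the other N - 1 nodes.
    return num_agents * (num_agents - 1)

# Assumed figures: 10 local states and 3 actions (transmit, receive, sleep) per node.
for n in (2, 5, 10):
    print(n, qtable_entries(10, 3, n), messages_per_frame(n))
# N = 10 already requires 10 * 3**10 = 590,490 Q-entries per node and 90 control
# messages per frame, which quickly exceeds the memory available on typical sensor nodes.
```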
References

1. Boushaba, M., Hafid, A., & Benslimane, A. (2009). High accuracy localization method using AoA in sensor networks. Computer Networks, 53(18), 3076–3088.
2. Akyildiz, I. F., Su, W., Sankarasubramaniam, Y., & Cayirci, E. (2002). Wireless sensor networks: A survey. Computer Networks, 38(4), 393–422.
3. Akyildiz, I. F., & Vuran, M. (2010). Wireless sensor networks. London: Wiley.
4. Predd, J. B., Kulkarni, S. R., & Poor, H. V. (2006). Distributed learning in wireless sensor networks. IEEE Signal Processing Magazine, 23(4), 56–69.
5. Nguyen, X., Wainwright, M. J., & Jordan, M. I. (2005). Nonparametric decentralized detection using kernel methods. IEEE Transactions on Signal Processing, 53(11), 4053–4066.
6. Akkarajitsakul, K., Hossain, E., Niyato, D., & Kim, D. (2011). Game theoretic approaches for multiple access in wireless networks: A survey. IEEE Communications Surveys and Tutorials, 13(3), 372–395.
7. Ye, W., Heidemann, J., & Estrin, D. (2004). Medium access control with coordinated adaptive sleeping for wireless sensor networks. IEEE/ACM Transactions on Networking, 12(3), 493–506.
8. Dam, T. V., & Langendoen, K. (2003, November). An adaptive energy-efficient MAC protocol for wireless sensor networks. In 1st ACM Conference on Embedded Networked Sensor Systems, Los Angeles, CA.
9. Demirkol, I., Ersoy, C., & Alagöz, F. (2006). MAC protocols for wireless sensor networks: A survey. IEEE Communications Magazine, 44(4), 115–121.
10. Huang, P., Xiao, L., Soltani, S., Mutka, M., & Xi, N. (2012). The evolution of MAC protocols in wireless sensor networks: A survey. IEEE Communications Surveys and Tutorials, PP(99), 1–20.
11. Niu, J., & Deng, Z. (2011, in press). Distributed self-learning scheduling approach for wireless sensor network. Ad Hoc Networks.
12. Kaelbling, L. P., & Littman, M. L. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237–285.
13. Kulkarni, R. V., Förster, A., & Venayagamoorthy, G. K. (2011). Computational intelligence in wireless sensor networks: A survey. IEEE Communications Surveys and Tutorials, 13(1), 68–96.
14. Pandana, C., & Liu, K. J. R. (2005). Near-optimal reinforcement learning framework for energy-aware sensor communications. IEEE Journal on Selected Areas in Communications, 23(4), 788–797.
15. Chang, Y. H., Ho, T., & Kaelbling, L. P. (2004). Mobilized ad-hoc networks: A reinforcement learning approach. In Proceedings of the International Conference on Autonomic Computing.
16. Dong, S., Agrawal, P., & Sivalingam, K. (2007). Reinforcement learning based geographic routing protocol for UWB wireless sensor network. In Proceedings of the Global Telecommunications Conference.
17. Förster, A., & Murphy, A. L. (2007). FROMS: Feedback routing for optimizing multiple sinks in WSN with reinforcement learning. In Proceedings of the International Conference on Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP).
18. Yu, F. R., Wong, V., & Leung, V. (2008). A new QoS provisioning method for adaptive multimedia in wireless networks. IEEE Transactions on Vehicular Technology, 57(3), 1899–1909.
19. Alexandri, E., Martinez, G., & Zeghlache, D. (2002). A distributed reinforcement learning approach to maximize resource utilization and control handover dropping in multimedia wireless networks. In Proceedings of the IEEE International Symposium on Personal, Indoor and Mobile Radio Communications.
20. Yau, K., Komisarczuk, P., & Teal, P. (2010). Enhancing network performance in distributed cognitive radio networks using single-agent and multi-agent reinforcement learning. In Proceedings of the Conference on Local Computer Networks.
21. Predd, J. B., Kulkarni, S. R., & Poor, H. V. (2006). Consistency in models for distributed learning under communication constraints. IEEE Transactions on Information Theory, 52(1), 52–63.
22. Gummeson, J., Ganesan, D., Corner, M. D., & Shenoy, P. (2010). An adaptive link layer for heterogeneous multi-radio mobile sensor networks. IEEE Journal on Selected Areas in Communications, 28(7), 1094–1104.
23. Yau, K. A., Komisarczuk, P., & Teal, P. D. (2012). Reinforcement learning for context awareness and intelligence in wireless networks: Review, new features and open issues. Journal of Network and Computer Applications, 35(1), 253–267.
24. Liu, Z., & Elhanany, I. (2006). RL-MAC: A reinforcement learning based MAC protocol for wireless sensor networks. International Journal of Sensor Networks, 1(3/4), 117–124.
25. Mihaylov, M., Tuyls, K., & Nowé, A. (2009). Decentralized learning in wireless sensor networks. In Proceedings of the Adaptive and Learning Agents Workshop, Budapest, Hungary.
26. Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
27. Ye, W., Heidemann, J., & Estrin, D. (2002). Mobility increases the capacity of ad hoc wireless networks. IEEE/ACM Transactions on Networking, 10(4), 477–486.
Author Biographies

Mohammad Fathi received the M.Sc. and Ph.D. degrees in electrical engineering from Amirkabir University of Technology, Tehran, Iran, in 2003 and 2010, respectively. From 2003 to 2006, he was a Lecturer with the Department of Electrical Engineering, University of Kurdistan, Sanandaj, Iran, where he is currently working as an Assistant Professor. He conducted part of his Ph.D. research work in the Communications and Networking Theory Laboratory, Royal Institute of Technology, Stockholm, Sweden, from February 2010 to November 2010. His current research interests include network resource allocation, power scheduling, and smart grid control.
Vafa Maihami was born in 1987 and received the M.S. degree in computer engineering from the University of Kurdistan, Sanandaj, Iran, in September 2012. Currently, he is a lecturer at the Maad Institute and other institutes in Sanandaj. His research interests include wireless sensor networks, machine learning, machine vision and image processing.
Parham Moradi received his Ph.D. degree in Computer Science from Amirkabir University of Technology, Tehran, Iran, in March 2011. He received his M.Sc. and B.S. degrees in Software Engineering and Computer Science from Amirkabir University of Technology in 1998 and 2005, respectively. He conducted part of his Ph.D. research work in the Laboratory of Nonlinear Systems, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland, from September 2009 to March 2010. Currently he is working as an Assistant Professor in the Department of Computer Engineering and Information Technology, University of Kurdistan, Sanandaj, Iran. His current research areas include reinforcement learning, graph clustering, social network analysis and recommender systems.