Globecom 2013 Workshop - Broadband Wireless Access
Reinforcement Learning based Secondary User Transmissions in Cognitive Radio Networks
†♭ Senthuran Arunthavanathan, †♭ Sithamparanathan Kandeepan and ♭ Robin J. Evans
† School of Electrical Engineering, RMIT University, Melbourne, Australia
♭ National ICT Australia, Victoria Research Laboratory, Melbourne, Australia
Email: [email protected], [email protected], [email protected]
Abstract: In this paper, we address the decision-making criteria by which a secondary user (SU) in a cognitive radio network (CRN) decides whether or not to transmit after performing spectrum sensing to detect the presence of any primary user (PU) in the environment. We propose a reinforcement learning (RL) based approach built on a Markov process at the SU node and present novel analytical methods to analyze the performance of such approaches. In particular, we define the probability of interference Pi and the probability of wastage Pw, and compare these metrics for an RL based and a non-RL based approach to SU transmission. The simulations show a tradeoff between the two probability metrics Pw and Pi, based on the Markov process. The simulation results are compared in the form of transmitter operating characteristic (TOC) curves. Using our approach, one can control the interference to the PU by trading it off against spectral wastage.
978-1-4799-2851-4/13/$31.00 ©2013 IEEE

I. INTRODUCTION
Cognitive radio (CR) is defined as an intelligent wireless communication system that is able to sense the operational electromagnetic spectrum and dynamically adjust its radio operating parameters, thereby improving system operation through throughput maximization, interference reduction, facilitation of inter-operability, and opportunistic access to unused spectral portions called spectral holes [1]-[4]. The notion that the electromagnetic spectrum is available in unlimited supply is an illusion. For communication technology, the spectrum is a finite natural resource: the number of transmitters and receivers is limited and they are licensed by major authorities, which leads to a physical shortage of spectrum access [4][5]. The Federal Communications Commission (FCC) states that fixed spectrum allocation is not always efficient and that licensed spectrum remains unoccupied for long durations of time. Therefore, the concept of CR based on Dynamic Spectrum Access (DSA) was considered. This allows the secondary user (SU) in the environment to dynamically access the licensed spectrum allocated to the primary user (PU) for temporary periods when that spectrum is not being utilized.

According to Haykin [3], CRs are projected to be brain-empowered wireless devices aimed at improving the utilization of the electromagnetic spectrum. A CR should be aware of its surrounding environment and identify the various devices and activities in it. This is done through spectrum sensing. Various spectrum sensing methods have been proposed in the literature, such as energy detection, cyclostationary detection, waveform detection, covariance detection, and cooperative distributed sensing; moreover, the performance of these techniques is well presented under various assumptions, models and scenarios [6]-[24]. However, the current literature provides little on the use of intelligent learning applied to the information obtained through spectrum sensing.

Reinforcement learning (RL) is a method of programming agents by reward and punishment without the need to specify how the task should be accomplished in a specific scenario [25]. The agent, which is the SU, must learn the behaviour of the system through trial-and-error interactions with a dynamic environment, which here consists of the PU transmissions in the wireless environment. The scenario is approached with a Markov process that leads to an optimal solution by applying RL. In this paper, we propose that, within a wireless system consisting of PUs deployed in a given frequency band, an independent SU with CR capabilities dynamically fills the spectrum holes.

II. LITERATURE REVIEW
Past research has not focused heavily on the combination of RL and cognitive radio. However, a few papers consider the concept of RL in cognitive networks. One paper that strongly focuses on this area is that by U. Berthold et al. [1]. It addresses the detection of the PU system's allocation and of spectral resources. The detection is performed frequently on all sub-channels of the PU transmission spectrum using the Fast Fourier Transform (FFT). The paper mainly focuses on the allocation and management of the frequency bands. A paper by B. Lo and I. Akyildiz applies RL to co-operative cognitive ad hoc networks [27] in order to solve the overhead problem of the co-operative cognitive network. Its major focus is on the ability of the SU to minimize sensing delay and find an optimal set of cooperating neighbours.
A few other papers consider RL in wireless networks [28][29]. K. Yau et al. presented a paper on the application of context awareness in wireless networks [29]. J. Okansen et al. presented a paper on RL in energy efficient networks using a sensing policy optimization methodology [28]. The key difference of our paper is that it applies RL to the periodic sensing of the cognitive radio, which detects the PU periodically and determines the presence of its transmission before allowing SU transmission to begin. This allows one to consider the wastage of available time slots and the interference between the PU and SU in the model.

III. SYSTEM AND NETWORK MODEL
This section describes the system and network model considered in this paper. The scenario is a single PU and a single SU operating in the same frequency band. In order to include the RL strategy within the SU transmission, we consider periodic sensing [10] with energy detection [6]. The SU feeds its spectrum sensing decisions to the RL strategy in order to decide whether or not to transmit. We describe the PU temporal model and the SU transmission model, each as a Markov process, below.

A. Temporal Behavior of the PU
The PU temporal behavior is modelled as a two-state Markov process, as illustrated in Figure-1, representing the binary hypotheses H0 and H1 defined below.

$$
H_0 : \text{PU does not transmit}, \qquad H_1 : \text{PU transmits} \tag{1}
$$

Fig. 1. PU System Markov Model

The state transitions are modelled as exponential random processes with a mean λ for the transition from H0 to H1 and with a mean of µ for the transition from H1 to H0. In other words, the PU transmission arrival rate is given by λ and the death rate is given by µ, assuming a simple spectrum occupancy model [26].

B. Periodic Detection of the Cognitive Radio
The SU performs periodic spectrum sensing as illustrated in Figure-2, which shows the sensing period and the sensing duration used by the SU to detect PU transmission.

Fig. 2. Different scenarios of PU detection by CR, where the CR is represented as blue boxes and PU transmission as red

The SU node has a sensing period of Tw seconds and a fixed sensing duration of δt seconds. The SU detects the presence of PU transmission only during δt within each Tw. The SU senses the spectrum periodically in order to detect the presence of the PU in the environment. The energy detector used for sensing is not presented here, and readers are referred to [6], [8] for further details. We are only interested in the detection and false alarm probabilities of the energy detector in an AWGN channel with periodic sensing, which are theoretically quantified in the literature [6], [8]. The detection (Pd) and false alarm (Pf) probabilities are defined by

$$
P_d = \Pr[\text{deciding that PU is present} \mid H_1], \qquad P_f = \Pr[\text{deciding that PU is present} \mid H_0] \tag{2}
$$

Note that the SU decides whether a PU is present or not only during the sensing duration δt. The decision is maintained throughout Tw. Therefore, the SU is unable to learn the PU transmission states during the non-sensing durations.

C. Secondary User model without Reinforcement Learning
Based on the PU model described in Figure-1, we consider a Markov process to describe the behaviour of SU transmission. The action set A for the SU comprises only two actions, defined by

$$
A_0 : \text{SU does not transmit} \tag{3}
$$

$$
A_1 : \text{SU transmits} \tag{4}
$$

The set of states S consists of the four states of the SU transmission model under the PU states H0 and H1, as given below.

$$
\begin{aligned}
S_{00} &: \text{SU does not transmit given } H_0 \\
S_{01} &: \text{SU transmits given } H_0 \\
S_{10} &: \text{SU does not transmit given } H_1 \\
S_{11} &: \text{SU transmits given } H_1
\end{aligned} \tag{5}
$$

Fig. 3. SU system Markov Model for H1 & H0 - Absence of RL
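As a hedged illustration of the four states in (5), the sketch below writes the no-RL SU behaviour as two 2x2 transition matrices, one per PU hypothesis. It assumes, as the paper describes for Figure-3, that without RL the SU transmits exactly when the detector reports no PU, so each row depends only on Pd or Pf. The function names and the (silent, transmit) ordering are illustrative choices, not taken from the paper.

```python
def no_rl_transition_matrices(p_d, p_f):
    """Return (under_h0, under_h1) transition matrices for the no-RL SU.

    Rows are the current state and columns the next state, both ordered
    (SU silent, SU transmits), i.e. (S00, S01) under H0 and (S10, S11) under H1.
    """
    p_m = 1.0 - p_d  # miss detection probability
    # Under H1 the SU goes silent with probability Pd and transmits on a miss.
    under_h1 = [[p_d, p_m], [p_d, p_m]]
    # Under H0 the SU goes silent only on a false alarm.
    under_h0 = [[p_f, 1.0 - p_f], [p_f, 1.0 - p_f]]
    return under_h0, under_h1


def stationary(matrix, iterations=200):
    """Approximate the stationary distribution by repeated multiplication."""
    dist = [0.5, 0.5]
    for _ in range(iterations):
        dist = [sum(dist[i] * matrix[i][j] for i in range(2)) for j in range(2)]
    return dist


if __name__ == "__main__":
    h0, h1 = no_rl_transition_matrices(p_d=0.95, p_f=0.37)
    print("P[SU transmits | H1] =", stationary(h1)[1])
    print("P[SU silent    | H0] =", stationary(h0)[0])
```

Because both rows of each matrix are identical, the long-run probability of transmitting under H1 is simply the miss probability 1 − Pd, and the probability of staying silent under H0 is Pf.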
As shown in Figure-3, under H1 the SU transits from state S11 to S10, or remains in state S10, with a transition probability of Pd. In other words, the SU stops transmitting when it detects the PU, which happens with the detection probability Pd. On the other hand, the SU transmits if it misses the PU; this is described by the transitions from S10 to S11 and from S11 to S11, which occur with the miss detection probability Pm = 1−Pd. Under H0, the SU transmission depends entirely on the false alarm probability Pf, as further shown in Figure-3: the SU transmits when it decides that a PU is not present in the environment and does not transmit when it falsely decides that a PU is present. We note that the Markov model for the SU transmission described in Figure-3 changes when we include the RL strategy, as described in Section IV.

IV. REINFORCEMENT LEARNING ALGORITHM
In this section, we describe the proposed RL strategy that assists the SU in deciding whether or not to transmit upon detecting the presence of the PU. Since we adopt a single PU, single SU environment, the proposed RL model is a single agent model with centralized decision making (i.e. the decision making is also done locally at the SU node). Before describing the RL model, let us explain how the SU would perform its transmissions in the absence of the RL strategy. In that case, the SU simply transmits whenever a PU is not detected and stops, or does not transmit, upon the detection of a PU, as described by the Markov model depicted in Figure-3. The RL strategy helps the SU decide whether or not to transmit by learning the PU's temporal behavior. We define a threshold Γ and a cost function C(τ), described in detail later, where the cost function is derived from the detection of the PU. The SU decides to transmit only if the cost function C(τ) does not exceed the threshold Γ. In other words, the SU transmission decision criterion is given as follows:

$$
\text{If } C(\tau) > \Gamma \text{ then the SU does not transmit}; \qquad \text{if } C(\tau) \le \Gamma \text{ then the SU transmits} \tag{6}
$$

Therefore, under the proposed approach, the SU does not transmit as soon as it sees a vacant spot in the spectrum; instead, it waits for some time as defined by the RL algorithm and then transmits (when C(τ) ≤ Γ). We propose the following cost function in this paper.

$$
C(\tau) =
\begin{cases}
\cos(\pi\tau/\beta) & \text{for } 0 \le \tau \le \beta/2 \\
0 & \text{for } \tau > \beta/2
\end{cases} \tag{7}
$$

where β > 0 is termed the shape parameter and τ ∈ N0 (i.e. τ belongs to the set of natural numbers including zero). The cost function takes values in [0, 1], which means that the cost of action is high when C(τ) = 1 and low when C(τ) = 0. With the form in (7), the argument πτ/β grows more slowly for larger β, so in general the cost function decreases slowly with τ when β is large and decreases faster when β is small. The variation of C(τ) with respect to β is shown in Figure-4.

Fig. 4. The cost function C(τ) for various values of the shape parameter β. (Axes: C(τ) versus τ; curves for β = 10, 8, 5, 2.)

We now describe the way the cost function is updated by the RL method. Every time the SU detects the presence of a PU, the variable τ is reset to 0, which gives a cost of C(τ) = 1. Whenever the SU does not detect a PU, the cost function is updated by incrementing τ. By this means, when the condition C(τ) ≤ Γ is satisfied, the SU performs its transmission. The cost function is updated every sensing period according to whether the PU was detected or not. Figure 5 depicts the algorithm for performing RL based SU transmissions.

Fig. 5. RL algorithm for performing SU transmissions

It should be noted that the cost function is a metric for predicting the likelihood of the presence of the PU in the successive sensing periods based on the prior detection knowledge using RL. In other words, the cost function can also be described as an indicator of the two consequences of SU actions, namely (i) interference to the PU and (ii) spectral wastage. A particular threshold value Γ implies a trade-off between the two cost indicators, which we describe further in the subsequent section. Figures 6 and 7 show typical scenarios for the interference and the spectral wastage respectively. We reiterate that the SU maintains the decision made in the sensing duration δt for the entire period Tw.
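The update rule described in this section can be sketched in a few lines. The example below is a minimal illustration, assuming the cost function exactly as printed in (7) and the rule stated in the text that the SU transmits once C(τ) ≤ Γ. The function names, and the choice to start with τ = 0 before the first sensing period, are assumptions made here for illustration.

```python
import math


def cost(tau, beta):
    """Cost function of Eq. (7): cos(pi*tau/beta) up to tau = beta/2, then 0."""
    if tau < 0 or beta <= 0:
        raise ValueError("tau must be >= 0 and beta must be > 0")
    if tau <= beta / 2:
        return math.cos(math.pi * tau / beta)
    return 0.0


def rl_decisions(detections, beta, gamma):
    """Map per-period sensing decisions to SU actions (True = transmit).

    detections[k] is True when the SU decided that a PU is present in
    sensing period k. tau counts periods since the last detection.
    """
    tau = 0  # assumed initial value: behave as if a PU was just detected
    actions = []
    for detected in detections:
        tau = 0 if detected else tau + 1   # reset on detection, else increment
        actions.append(cost(tau, beta) <= gamma)
    return actions


if __name__ == "__main__":
    sensed = [True, False, False, False, True, False]
    print(rl_decisions(sensed, beta=8, gamma=0.5))
```

With β = 8 and Γ = 0.5 the SU in this sketch stays silent for two clear periods after a detection and transmits from the third, since cos(3π/8) is the first cost value below 0.5.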
Fig. 6. Scenario for the SU interfering with the PU whilst performing periodic sensing
V. PERFORMANCE INDICATORS FOR RL BASED SU TRANSMISSIONS
The performance indicators for the RL based SU transmission strategy are the two cost indicators mentioned above, interference to the PU and spectral wastage. We quantify them by the corresponding probabilities, the interference probability Pi and the spectral wastage probability Pw, defined as

$$
P_i = \Pr[\text{SU transmitting in a period } T_w \mid \text{PU transmits at least once in } T_w] \tag{8}
$$

and

$$
P_w = \Pr[\text{SU not transmitting in a period } T_w \mid \text{no PU transmissions in } T_w] \tag{9}
$$

Based on the definitions of Pi and Pw, we can then refine the SU Markov decision process model accordingly. The modified SU transmission model is shown for the two PU scenarios H1 and H0 in Figure 8. Including the RL strategy means that both prior information and the current status of the PU transmissions are used. The modified Markov decision model follows the same transitions as the traditional SU model, but with different transition probabilities. For clarity, we define a slot as the time between successive CR scans, Tw, which is the same as the sensing period. The interference Pi between the PU and SU measures the number of slots in which PU and SU transmissions take place simultaneously at least once within a sensing period Tw, with respect to the overall number of slots considered. The wastage Pw, on the other hand, measures the number of slots that are available for SU transmission but in which the SU does not transmit. This occurs because the cost C(τ) is higher than the decision threshold Γ, as described in the previous section.
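To make the slot-counting interpretation concrete, the sketch below estimates Pi and Pw from per-slot records, following the conditional definitions in (8) and (9). The function name and the two boolean input lists are hypothetical; they stand for whatever per-slot bookkeeping a simulation provides.

```python
def estimate_pi_pw(pu_active, su_transmits):
    """Estimate (Pi, Pw) from per-slot observations.

    pu_active[k]    : True if the PU transmitted at least once in slot k
    su_transmits[k] : True if the SU transmitted in slot k
    Pi is the fraction of PU-occupied slots in which the SU transmitted,
    Pw is the fraction of PU-free slots in which the SU stayed silent.
    """
    if len(pu_active) != len(su_transmits):
        raise ValueError("inputs must have the same number of slots")
    busy = sum(1 for p in pu_active if p)
    idle = len(pu_active) - busy
    if busy == 0 or idle == 0:
        raise ValueError("need at least one occupied and one free slot")
    interfered = sum(1 for p, s in zip(pu_active, su_transmits) if p and s)
    wasted = sum(1 for p, s in zip(pu_active, su_transmits) if not p and not s)
    return interfered / busy, wasted / idle


if __name__ == "__main__":
    pu = [True, True, False, False, False]
    su = [True, False, False, True, True]
    print(estimate_pi_pw(pu, su))
```

In the small example, one of the two occupied slots sees an SU transmission and one of the three free slots goes unused.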
VI. REINFORCEMENT LEARNING SIMULATION
Simulations were conducted to emulate the network model for PU and SU transmissions with periodic spectrum sensing in a noisy sensing environment. The energy detector was maintained, for all of our simulations, at the operating point Pd = 0.95 and Pf = 0.37. Figure-9 shows the cost function C(τ) together with the PU transmissions. An example value of the decision threshold Γ is also indicated in the figure for reference. As expected, we observe that the cost decreases rapidly from C(τ) = 1 once PU transmissions have been absent for successive SU sensing periods. We use the simulations to study the dependency of the cost indicators Pi and Pw on factors such as δt, Tw, λ and µ, as we present subsequently.
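The paper does not list its simulation code, so the following is only a rough sketch of how such a setup could be emulated. It rests on several assumptions that are not from the paper: λ and µ are treated as rates of exponential holding times in H0 and H1, the sensing outcome is drawn from Pd or Pf according to the PU state at the start of each sensing period (the duration δt is not modelled), and the function name `simulate_sensing` is hypothetical.

```python
import random


def simulate_sensing(lam, mu, t_w, n_slots, p_d, p_f, seed=0):
    """Return (detections, pu_active), one boolean per sensing period.

    ASSUMPTIONS: lam is the rate of leaving H0, mu the rate of leaving H1;
    the detector sees the PU state at the start of each period only.
    """
    rng = random.Random(seed)
    horizon = n_slots * t_w
    present_at_start = [False] * n_slots   # PU state at each sensing instant
    pu_active = [False] * n_slots          # PU transmits at least once in slot
    t, busy = 0.0, False                   # the PU starts in H0
    while t < horizon:
        hold = rng.expovariate(mu if busy else lam)
        if busy:
            end = min(t + hold, horizon)
            first = min(int(t // t_w), n_slots - 1)
            last = min(int(end // t_w), n_slots - 1)
            for k in range(first, last + 1):
                pu_active[k] = True
                if t <= k * t_w < end:
                    present_at_start[k] = True
        t += hold
        busy = not busy
    # Detection with probability Pd under H1, false alarm with Pf under H0.
    detections = [rng.random() < (p_d if present else p_f)
                  for present in present_at_start]
    return detections, pu_active


if __name__ == "__main__":
    det, act = simulate_sensing(lam=1.0, mu=0.6, t_w=0.5, n_slots=20,
                                p_d=0.95, p_f=0.37, seed=1)
    print(det)
    print(act)
```

The two returned lists are in the per-slot form needed to apply an SU transmission rule and to count interfered and wasted slots.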
Fig. 7. Scenario for the SU wasting transmission slots whilst performing periodic sensing with RL

Fig. 8. Modified SU transmission model based on Markov decision process under H1 and H0 respectively

Fig. 9. Simulation results showing the variation in the cost function C(τ) based on PU transmissions. (Axes: C(τ) versus time in seconds; the threshold Γ and the PU transmissions are marked; noisy case with Pd = 0.95, Pf = 0.37.)
Figure-10 illustrates the relationship between Pi and Γ for different values of δt. From the figure we observe that the interference is drastically reduced by lowering the transmission decision threshold Γ, and that increasing the sensing duration δt further reduces Pi, since an increase in δt reduces the miss detection of the PU. Figure-11 shows the relationship between Pw and Γ for varying δt. From the figure we observe that the spectral wastage can be controlled by changing the transmission threshold Γ, and that in this instance the change in δt does not affect Pw since it does not impact the false alarm probability. Note that even though the sensing duration δt was changed in the simulations, we still maintain the same detection and false alarm probabilities for the energy detector (see footnote 1). Figures 10 and 11 show that Pi increases with increasing Γ and Pw decreases with increasing Γ, because the SU is able to transmit after fewer successive sensing periods under the RL strategy.
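The dependence of Pi and Pw on Γ can be illustrated by sweeping the threshold over a fixed record of sensing decisions and PU activity. The sketch below is not the authors' simulator: it reuses the cost function as printed in (7), the rule of transmitting when C(τ) ≤ Γ, and the conditional definitions (8) and (9). The function name `toc_points` and the hand-made input sequences are illustrative assumptions.

```python
import math


def cost(tau, beta):
    """Cost function of Eq. (7)."""
    return math.cos(math.pi * tau / beta) if tau <= beta / 2 else 0.0


def toc_points(detections, pu_active, beta, gammas):
    """Return a list of (gamma, Pw, Pi), one entry per threshold value.

    detections[k]: SU decided 'PU present' in slot k
    pu_active[k] : PU transmitted at least once in slot k
    """
    busy = sum(1 for p in pu_active if p)
    idle = len(pu_active) - busy
    if busy == 0 or idle == 0:
        raise ValueError("need at least one occupied and one free slot")
    points = []
    for gamma in gammas:
        tau = 0
        interfered = wasted = 0
        for detected, active in zip(detections, pu_active):
            tau = 0 if detected else tau + 1        # RL update of tau
            transmit = cost(tau, beta) <= gamma     # transmit when cost is low
            if active and transmit:
                interfered += 1
            if not active and not transmit:
                wasted += 1
        points.append((gamma, wasted / idle, interfered / busy))
    return points


if __name__ == "__main__":
    det = [True, False, False, False, False, True, False, False]
    act = [True, False, False, False, True, True, False, False]
    for gamma, p_w, p_i in toc_points(det, act, beta=4, gammas=[0.5, 0.8]):
        print(f"Gamma={gamma}: Pw={p_w:.3f}, Pi={p_i:.3f}")
```

In this toy record, raising Γ from 0.5 to 0.8 removes the wasted slots while the single missed PU slot keeps Pi unchanged; longer simulated records would be needed to see the smooth trade-off reported in the figures.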
At this point we also define the SU Transmitter Operating Characteristics (TOC) curve as the plot of Pi with respect to Pw for various values of Γ. Figures 12, 13 and 14 provide the SU TOC curves for the parameters Tw, λ and µ respectively. The TOC comprises Pi and Pw, and the curves show that Pi decreases as Pw increases, displaying the tradeoff between the two. From the results we observe that when λ or µ increases the interference probability increases due to the increased occupancy of the PU. Changing the threshold Γ can therefore control the interference obtained for a given wastage probability.

Fig. 10. Plot of Pi vs. Γ for various Sensing durations. (Curves for δt = 0.2, 0.3, 0.4; noisy case: Pd = 0.95, Pf = 0.37; other factors: µ = 0.6, λ = 1, TW = 0.5.)

Fig. 11. Plot of Pw vs. Γ for various Sensing durations. (Curves for δt = 0.2, 0.3, 0.4; noisy case: Pd = 0.95, Pf = 0.37; other factors: λ = 1, µ = 0.5, TW = 0.5.)

Fig. 12. TOC for various sensing periods. (Pi versus Pw; curves for TW = 0.4, 0.6, 0.8; noisy case: Pd = 0.95, Pf = 0.37; other factors: λ = 1, µ = 0.6, δt = 0.2.)

Fig. 13. TOC for various arrival rates. (Pi versus Pw; curves for λ = 0.5, 0.8, 1.0; noisy case: Pd = 0.95, Pf = 0.37; other factors: µ = 0.6, δt = 0.2, TW = 0.5.)

1 The detection and false alarm probabilities can be maintained for different sensing durations for an energy detector by changing the received signal to noise ratio or the detection threshold [6].

VII. CONCLUSION
In this paper, we presented the methodology and performance of SU transmissions, using CRs, in spectrum left vacant by the PU. Continuous periodic scanning is used to detect the temporal behavior of the PU. RL has been applied to the SU model so that the SU transmits when the PU is not transmitting. The results show that the RL model improves the performance of SU transmissions by taking into account the possibility of interference and wastage. The probabilities of interference and wastage depend on the scanning periods and sensing durations of the SU and on the arrival and death rates of the PU transmissions. This paper provides a method to control the interference to the PU by controlling the transmission threshold. Our future work in this space includes theoretical modeling of the interference and wastage probabilities and developing similar solutions for networks with multiple SUs and multiple PUs.

Fig. 14. TOC for various death rates. (Pi versus Pw; curves for µ = 0.2, 0.6, 1.0; noisy case: Pd = 0.95, Pf = 0.37; other factors: λ = 1, δt = 0.2, TW = 0.5.)

ACKNOWLEDGMENTS
The research was partially funded by the National ICT Australia (NICTA) and the ABSOLUTE project from the European Commission's Seventh Framework Programme (FP7-2011-8) under the Grant Agreement FP7-ICT-318632. NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence Program.

REFERENCES
[1] U. Berthold, F. Fu, M. Schaar, F. Jondral, Detection of Spectral Resources in Cognitive Radios using Reinforcement Learning, IEEE DySPAN, pp. 1-5, Oct. 2008.
[2] M. Bkassiny, Y. Li and S. Jayaweera, A Survey on Machine-Learning Techniques in Cognitive Radios, IEEE Commun. Surveys and Tutorials, vol. , 99, pp. 1-24, Oct. 2012.
[3] S. Haykin, Cognitive radio: brain-empowered wireless communications, IEEE Journal on Selected Areas of Communications, vol. 23(2), pp. 201-220, 2005.
[4] J. Mitola and G. Maguire Jr., Cognitive Radio: Making Software Radios more Personal, IEEE Personal Comms, vol. 6, no. 4, pp. 13-18, Aug. 1999.
[5] Federal Communications Commission, Facilitating Opportunities for Flexible, Efficient, and Reliable Spectrum Use Employing Cognitive Radio Technologies, NPRM and Order, ET Docket No. 03-322, Dec. 2003.
[6] H. Urkowitz, Energy Based Detection of Unknown Deterministic Signals, IEEE Proceedings, vol. 55, no. 4, pp. 523-531, April 1967.
[7] T. Yucek and H. Arslan, A Survey of Spectrum Sensing Algorithms for Cognitive Radio Applications, IEEE Comm on Surveys & Tut., vol. 11, pp. 116-130, March 2009.
[8] S. Kandeepan and A. Giorgetti, Cognitive Radio Techniques: Spectrum Sensing, Interference Mitigation and Localization, Artech House, London, 2013.
[9] S. Kandeepan, A. Giorgetti and M. Chiani, Periodic Spectrum Sensing Performance and Requirements for Legacy Users with Temporal and noise Statistics in Cognitive Radios, GLOBECOM Workshops IEEE, pp. 1-4, Dec. 2009.
[10] S. Kandeepan, R. Piesiewicz, T. Aysal, A. Biswas and I. Chlamtac, Spectrum Sensing for Cognitive Radios with Transmission Statistics: Considering Linear Frequency Sweeping, EURASIP Journal on Wireless Communications and Networking, vol. 2010, January 2010, Article No. 6.
[11] S. Kandeepan, L. Nardis, M. G. Benedetto, G. Corazza, Alessandro, Cognitive Satellite Terrestrial Radios, IEEE Globecom, Dec. 2010, Florida.
[12] M. Mueck et al., ETSI Reconfigurable Radio Systems Status and Future Directions on Software Defined Radio and Cognitive Radio Standards, IEEE Communications Magazine, Sep. 2010.
[13] S. Kandeepan et al., Experimentally Detecting IEEE 802.11n WiFi Based on Cyclostationarity Features for Ultra-Wide Band Cognitive Radios, IEEE PIMRC 2009, Tokyo.
[14] S. Kandeepan et al., Time-Divisional and Time-Frequency Divisional Cooperative Spectrum Sensing, IEEE Crowncom, Hannover, June 2009.
[15] T. Aysal, S. Kandeepan, R. Piesiewicz, Cooperative Spectrum Sensing over Imperfect Channels, IEEE BWA-WS, Globecom, 2008.
[16] T. Aysal, S. Kandeepan, R. Piesiewicz, Cooperative Spectrum Sensing with Noisy Hard Decision Transmissions, International Conference on Communications (ICC), 14-18 June, Dresden, 2009.
[17] A. Rahim, T. Aysal, S. Kandeepan, Dzmitry K., R. Piesiewicz, Cooperative Shared Spectrum Sensing for Dynamic Cognitive Radio Network, International Conference on Communications (ICC), 14-18 June, Dresden, 2009.
[18] S. Kandeepan et al., Periodic Sensing in Cognitive Radios for Detecting UMTS/HSDPA Based on Experimental Spectral Occupancy Statistics, IEEE WCNC, April 2010.
[19] S. Kandeepan et al., Distributed ring-around Spectrum Sensing for Cognitive Radio Networks, IEEE ICC 2011, Kyoto.
[20] S. Kandeepan et al., Preliminary Experimental Results on the Spectrum Sensing Performances for UWB-Cognitive Radios for Detecting IEEE 802.11n Wi-Fi Systems, IEEE ISWCS, Sienna, Sep. 2009.
[21] S. Kandeepan et al., Spectrum Sensing for Cognitive Radios with Transmission Statistics: Considering Linear Frequency Sweeping, JWCN EURASIP, April 2010.
[22] S. Kandeepan et al., Bayesian Tracking in Cooperative Localization for Cognitive Radio Networks, IEEE VTC, 26-29 April, Barcelona, 2009.
[23] A. Mariani, S. Kandeepan, A. Giorgetti and M. Chiani, Cooperative Weighted Centroid Localization for Cognitive Radio Networks, IEEE ISCIT, 2-5 October 2012, Gold Coast.
[24] S. Arunthavanathan, S. Kandeepan, R. J. Evans, Spectrum Sensing and Detection of Incumbent-UEs in Secondary-LTE based Aerial-Terrestrial Networks for Disaster Recovery, IEEE CAMAD, Sept. 2013, Berlin.
[25] L. Kaelbling, M. Littman and A. Moore, Reinforcement Learning: A Survey, Journal of Artificial Intelligence Research, vol. 4, pp. 237-285, May 1996.
[26] S. Kay, Intuitive Probability and Random Processes using MATLAB, Springer, New York, 2006.
[27] B. Lo and I. Akyildiz, Reinforcement Learning-based Cooperative Sensing in Cognitive Ad Hoc Networks, IEEE PIMRC, Sept. 2010.
[28] J. Okansen, J. Lunden, and V. Koivunen, Reinforcement Learning based sensing policy for energy efficient cognitive radio networks, NeuroComputing, ELSEVIER, Mar. 2012.
[29] K. Yau, P. Komisarczuk, and P. Teal, Reinforcement Learning for context awareness and intelligence in wireless networks: Review, new features and open issues, Journal of Network and Computer Applications, ELSEVIER, Jan. 2012.