Dynamic Robust Games in MIMO Systems
Hamidou Tembine
Abstract—In this paper, we study dynamic robust power-allocation games in multiple-input–multiple-output systems under the imperfectness of the channel-state information at the transmitters. Using a robust pseudopotential-game approach, we show the existence of robust solutions in both discrete and continuous action spaces under suitable conditions. Considering the imperfectness in terms of the payoff measurement at the transmitters, we propose a COmbined fully DIstributed Payoff and Strategy Reinforcement Learning (CODIPAS-RL) in which each transmitter learns its payoff function, as well as the associated optimal covariance matrix strategies. Under the heterogeneous CODIPAS-RL, the transmitters can use different learning patterns (heterogeneous learning) and different learning rates. We provide sufficient conditions for the almost-sure convergence of the heterogeneous learning to ordinary differential equations. Extensions of the CODIPAS-RL to Itô's stochastic differential equations are discussed.
Index Terms—Dynamic games, learning, multiple-input–multiple-output (MIMO), robust games.
I. INTRODUCTION
MULTIPLE-input–multiple-output (MIMO) links use antenna arrays at both ends of a link to transmit multiple streams in the same time and frequency channel. Signals transmitted and received by array elements at different physical locations are spatially separated by array-processing algorithms. It is well known that, depending on the channel conditions and under specific assumptions, MIMO links can yield large gains in the capacity of wireless systems [1], [2]. Multiple links, each with a different transmitter–receiver pair, are allowed to transmit in a given range, possibly through multiple streams per link. Such a multiaccess network with MIMO links is referred to as a MIMO interference system.
The Shannon rate maximization is an important signal-processing problem for power-constrained multiuser systems. It involves solving the power-allocation problem for mutually interfering transmitters operating across multiple frequencies. The classical approach to the Shannon rate maximization has been finding globally optimal solutions based on waterfilling [1], [3]. However, the major drawback of this approach is that these solutions require centralized control and knowledge of full information. These solutions are inherently unstable in a competitive multiuser scenario since a gain in the performance
for one transmitter may result in a loss of the performance for others. Instead, a distributed game-theoretic approach is desirable and has been increasingly adopted over the past decade. The seminal works on the competitive Shannon rate maximization use a game-theoretic approach to design a decentralized algorithm for dynamic power allocation. These works proposed a sequential iterative waterfilling algorithm for reaching the Nash equilibrium in a distributed manner. A Nash equilibrium of the rate-maximization game is a power-allocation configuration such that, given the power allocations of the other transmitters, no transmitter can further increase its achieved information rate unilaterally. However, most existing works on power allocation assume perfect channel-state information (CSI). This is a very strong requirement and cannot generally be met by practical wireless systems. The traditional game-theoretic solution for systems with imperfect information is the Bayesian game model, which uses a probabilistic approach to model the uncertainty in the system. However, a Bayesian approach is often intractable, and the results strongly depend on the nature of the probability distribution functions. Thus, a relaxation of the use of the initial probability distribution is needed. There are two classes of models frequently used to characterize the imperfect CSI: the stochastic and deterministic models. One of the deterministic approaches is the pessimistic or maximin robust approach, modeled by an extended game in which nature chooses the channel states. The pessimistic model views nature as a player who minimizes over all possible states (the worst-case state for the transmitters). A similar approach for incomplete-information finite games has been modeled as a distribution-free robust game, where the transmitters use a robust approach to the bounded payoff uncertainty [4]. This robust game model also introduced a distribution-free equilibrium concept called the robust equilibrium. However, the results in [4] for the robust game model are limited to finite games, and they need to be adapted to continuous power allocation (when the action space is a continuous set). The authors in [5] proposed a robust equilibrium for the Shannon rate maximization under bounded channel uncertainty. However, the discrete-power-allocation problem is not addressed in [5]. Moreover, their uniqueness conditions mainly depend on the choice of the norm of the correspondence operator; the best-response correspondence can be expansive (not a contraction), and multiple equilibria may exist. At this point, it is important to mention that the uniqueness of an equilibrium does not necessarily imply convergence to this equilibrium (an example of nonconvergence is given in Section II). We argue that robust game theory [4] and robust optimization [6] are more appropriate for analyzing the achievable equilibrium rates under imperfectness and time-varying channel states.
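To make the classical waterfilling benchmark concrete, the following is a minimal sketch of single-user waterfilling over parallel Gaussian subchannels (the diagonal special case discussed later); the gains, noise levels, and power budget are illustrative values and are not taken from the paper.

```python
import numpy as np

def waterfilling(inv_gains, p_max, tol=1e-9):
    """Classical waterfilling: maximize sum_k log(1 + p_k / inv_gains[k])
    subject to sum_k p_k <= p_max, p_k >= 0.
    inv_gains[k] = noise_k / |h_k|^2 (inverse channel gain)."""
    lo, hi = 0.0, np.max(inv_gains) + p_max   # bracket on the water level mu
    while hi - lo > tol:
        mu = 0.5 * (lo + hi)
        if np.maximum(0.0, mu - inv_gains).sum() > p_max:
            hi = mu
        else:
            lo = mu
    return np.maximum(0.0, lo - inv_gains)

# illustrative values (not from the paper)
inv_gains = np.array([0.5, 1.0, 2.0])
p = waterfilling(inv_gains, p_max=2.0)
print(p, p.sum())   # power fills the better subchannels first; total equals p_max
```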
Many works have been done under the following assumptions: 1) perfect CSI is assumed at both the transmitter and receiver sides of each link; 2) each receiver is also assumed to measure without error the covariance matrix of the noise plus the multiuser interference generated by the other transmitters. However, these assumptions may not be satisfied in many wireless scenarios. It is natural to ask if some of these assumptions can be relaxed without losing the expected performance. The motivations to consider the imperfect channel state are the following. The CSI is usually estimated at the receiver by using a training sequence or semiblind estimation methods. Obtaining the CSI at the transmitter requires either a feedback channel from the receiver to the transmitter or exploiting the channel reciprocity, such as in time-division duplexing systems. While it is a reasonable approximation to assume perfect CSI at the receiver, the CSI at the transmitter usually cannot be assumed perfect due to many factors, such as inaccurate channel estimation, erroneous or outdated feedback, and time delays or frequency offsets between the reciprocal channels. Therefore, the imperfectness of the channel state on the transmitter side has to be taken into consideration in many practical communication systems. We address the following question: Given a MIMO-interference dynamic environment, is there a way to achieve the expected equilibria without coordination and with minimal information and memory? To answer this question, we adopt learning and dynamical-system approaches. We say that a learning scheme is fully distributed if the updating scheme of a player needs only its own action and perceived payoff, also called the target value, which is a numerical noisy (and possibly delayed) value. In particular, transmitter j knows neither the mathematical structure of its payoff function nor its own channel state. The actions and the payoffs of the other players are unknown, too. Under such conditions, the question becomes: Is there a fully distributed learning scheme that achieves an expected payoff and an equilibrium in the MIMO-interference dynamic environment? In the next sections, we provide a partial answer to this question.
A. Overview: Robust Approaches and Learning in the MIMO
Waterfilling approaches for the MIMO Gaussian-interference channel have been studied extensively in the literature (see [7]–[9] and the references therein). A particular structure that is widely used is the parallel Gaussian-interference channel, which corresponds to the case of diagonal transfer matrices [3], [10]. In the continuous-power-allocation scenario, the existence of equilibria has been established (we recall this result in Section II). A classical sufficient condition for uniqueness is the strict-contraction condition, which relies on the Banach–Picard fixed-point theorem. Another sufficient condition is the strict monotonicity of the gradient payoffs. Most of these works do not consider the robust approach, which accounts for the uncertainty of the channel state.
The diagonal case has been recently analyzed in [5], which shows an improvement of the spectrum efficiency of the network by adopting a maximin approach. However, a fully distributed learning algorithm is not proposed in their work. In contrast to the classical approach, in which the payoff is assumed to be perfectly measured and time delays are neglected, in this paper, imperfectness and time delays in the payoff measurement are considered. Delayed evolutionary game dynamics have been studied in [11], but in continuous time. The authors in [11] showed that an evolutionarily stable strategy (which is robust to invasions by a small fraction of users) can be unstable under time delays, and they provided sufficient conditions for the stability of delayed Aloha-like systems under replicator dynamics. However, the stability conditions in continuous time differ from the stability of the discrete-time models that are considered in this paper. This paper generalizes the work in [12]–[14] on the heterogeneous COmbined fully DIstributed Payoff and Strategy Reinforcement Learning (CODIPAS-RL) in the parallel interference case and for zero-sum network security games.
B. Objective
The objective of this paper is twofold. First, we present a maximin robust game formulation for the rate-maximization game between noncooperative transmitters and formulate best responses to the interference under channel uncertainty. This approach is distribution free in the sense that the interference considered by the transmitters is the worst case over all the possible channel-state distributions. Hence, the resulting payoff function is independent of the channel-state distributions. Our interest in studying the maximin robust power-allocation scenarios stems from the fact that, in the decentralized MIMO-interference channel, the robust equilibrium does not use the perfect-CSI assumption and can increase the network spectral efficiency compared with the Bayesian Nash equilibrium. The network spectral efficiency at the maximin robust equilibrium can be higher than that at the Nash equilibrium because users are more conservative about causing interference under uncertainty, which encourages a better partitioning of the transmit-covariance matrix among the users. Then, an expected robust game without private CSI at the transmitters is considered. The existence of equilibria and CODIPAS-RL [15] algorithms are provided. The second objective of this paper is heterogeneous learning under the uncertain channel state and with delayed observations. Our motivation for heterogeneous learning comes from the heterogeneity of current wireless systems. The rapid growth in the density of wireless access devices, accompanied by increasing heterogeneity in wireless technologies, requires the adaptive allocation of resources. The traditional learning schemes use homogeneous behaviors and do not allow for different behaviors associated with the types of the users and the technologies. These shortcomings call for new learning schemes, for example, schemes with different learning speeds and/or different learning patterns. The heterogeneity leads to a new class of evolutionary game dynamics with different behaviors [16], [17]. In this context, we use the generic term dynamic robust game to refer to a long-run
game under uncertainty. We develop a heterogeneous fully distributed learning framework for the dynamic robust power-allocation games. The idea of the CODIPAS-RL goes back to Bellman [18] for the joint computation of the optimal value and the optimal strategies. The idea has been extensively developed in [19] with fewer information requirements. Since then, different versions of the CODIPAS-RLs have been studied. These are Bush–Mosteller-based CODIPAS-RLs, Boltzmann–Gibbs-based CODIPAS-RLs, imitative Boltzmann–Gibbs-based CODIPAS-RLs, multiplicative weighted imitative CODIPAS-RLs, weakened fictitious-play-based CODIPAS-RLs, logit-based payoff learning, imitative Boltzmann–Gibbs-based payoff learning, multiplicative weighted payoff learning, no-regret-based CODIPAS-RLs, pairwise-comparison-based CODIPAS-RLs, projection-based CODIPAS-RLs, excess-payoff-based CODIPAS-RLs, etc. Our CODIPAS-RL scheme considers imperfectness in the payoff measurement and time delays. The instantaneous payoffs are not available. Each transmitter has only a numerical value of its delayed noisy payoff. Although transmitters observe only the delayed noisy payoffs of the specific action chosen in a past time slot, the observed values depend on parameters determined by the other transmitters' choices, revealing implicit information about the system. The natural question is whether such a learning algorithm, based on a minimal piece of information, may be sufficient to induce coordination and make the system stabilize to an equilibrium. The answer to this question is positive for some classes of power-allocation games under channel uncertainty.
C. Contribution
Our main contributions can be summarized as follows.
• We overview the existence and the uniqueness (or the nonuniqueness) of the pure and mixed Nash equilibria for discrete and continuous power allocations with any number of transmitters and any number of receivers. This result is obtained by exploiting the fact that the games modeling both scenarios are in the class of robust pseudopotential games, which is an extension of the work by Monderer [20] on q-potential games. We extend the results for static power-allocation games to the dynamic robust power-allocation-game context.
• Learning schemes with minimal feedback are introduced such that transmitters can achieve the robust solutions.
a) In the continuous-power-allocation case, it turns out that our algorithm considerably improves on the classical iterative waterfilling algorithm, where the receiver uses successive interference cancelation and the decoding order is known by all transmitters. Also, our algorithm considerably improves on the standard algorithms based on gradient descent/ascent, Jacobi/Gauss iterations, and best responses, where the payoff function is assumed to be perfectly known by the transmitters and the exact value of the gradient and its projection are known. These assumptions are relaxed in this paper. When the gradient vector is not observed/measured, an alternative iterative scheme is proposed. Under additional
assumptions, the strategy learning is well adapted to the continuous dynamic power-allocation game if only a numerical noisy and delayed value of its own payoff is observed at the transmitter. This extends the standard strategy-reinforcement learning, in which payoff learning is not considered. We argue that the imperfectness in the measurement of payoff functions is more relevant in many wireless scenarios and should be exploited more when modeling dynamic wireless scenarios. In the continuous-power-allocation case, the classical gradient-based method (plus a projection if constrained) can be used. If only a delayed estimated gradient of the payoff is available, the convergence is conditional, i.e., the estimated gradients should be good enough and the time delays should be small enough. An extension to the case where an estimated gradient is not available is also discussed. Then, we propose a combined delayed payoff and strategy learning (CODIPAS-RL) to capture the delayed feedback and uncertainty in more realistic wireless scenarios.
b) In the discrete dynamic robust power-allocation game, we examine the heterogeneous and delayed CODIPAS-RL with different timescales. Using the Dulac criterion and the Poincaré–Bendixson theorem [21] for planar dynamical systems, we show that the heterogeneity in the learning schemes can help convergence in a generic setting. The result is directly applicable to two-transmitter MIMO systems with two actions [22] or to three-transmitter MIMO with symmetric constraints and noises. Numerical examples are illustrated with and without feedback delay, for both homogeneous and heterogeneous learning, in the two-transmitter two-/three-receiver case.
D. Structure
The rest of this paper is organized as follows. In Section II, we present the signal model and introduce the robust game theory and the reinforcement learning. In Section III, we overview static power-allocation games in MIMO-interference systems. After that, we present dynamic power-allocation games under channel uncertainty and the heterogeneous learning framework in Section IV. Numerical examples are illustrated in Section V. Section VI discusses extensions, and Section VII concludes this paper.
II. PRELIMINARIES
Here, we introduce the robust game theory and the reinforcement learning, and present the signal model. We first introduce some of the notations in Table I.
A. Model
We consider a J-link communication network, which can be modeled by a MIMO Gaussian-interference channel. Each link is associated with a transmitter–receiver pair. Each transmitter and receiver is equipped with $n_t$ and $n_r$ antennas, respectively. The set of transmitters is denoted by $\mathcal{J}$. The cardinality of $\mathcal{J}$ is $J$. Transmitter $j$ transmits a complex signal vector
TABLE I: SUMMARY OF NOTATIONS
$\tilde{s}_{j,t} \in \mathbb{C}^{n_t}$ of dimension $n_t$. Consequently, a complex baseband signal vector of dimension $n_r$, denoted by $\tilde{y}_{j,t}$, is received as output. The vector of received signals at receiver $j$ is
\[
\tilde{y}_{j,t} = H_{j,j,t}\,\tilde{s}_{j,t} + \sum_{j'\neq j} H_{j,j',t}\,\tilde{s}_{j',t} + z_{j,t}
\]
where $t$ is the time index, $\forall j \in \mathcal{J}$, $H_{j,j',t}$ is the complex channel matrix of dimension $n_r \times n_t$ from transmitter $j'$ to receiver $j$, and the vector $z_{j,t}$ represents the noise observed at receiver $j$; it is a zero-mean circularly symmetric complex Gaussian noise vector with an arbitrary nonsingular covariance matrix $R_j$. For all $j \in \mathcal{J}$, the matrix $H_{j,j,t}$ is assumed to be nonzero. We write $H_t = (H_{j,j',t})_{j,j'}$. The vector of transmitted symbols $\tilde{s}_{j,t}$, $\forall j \in \mathcal{J}$, is characterized in terms of power by the covariance matrix $Q_{j,t} = \mathbb{E}(\tilde{s}_{j,t}\tilde{s}_{j,t}^{\dagger})$, which is a Hermitian (self-adjoint) positive semidefinite matrix. Now, since the transmitters are power limited, we have, $\forall j \in \mathcal{J}$, $\forall t \geq 0$,
\[
\mathrm{tr}(Q_{j,t}) \leq p_{j,\max}. \tag{1}
\]
Note that the constraint at each time slot can be relaxed to a long-term time-average power budget. We define a transmit power covariance matrix for transmitter $j \in \mathcal{J}$ as a matrix $Q_j \in \mathcal{M}^{+}$ satisfying (1), where $\mathcal{M}^{+}$ denotes the set of Hermitian positive semidefinite matrices. The payoff function of $j$ is its mutual information $I(\tilde{s}_j;\tilde{y}_j)(H, Q_1,\ldots,Q_J)$. Under the above assumptions, the maximum information rate [1] is
\[
\log\det\!\left(I + H_{jj}^{\dagger}\,\Gamma_j^{-1}(Q_{-j})\,H_{jj}\,Q_j\right)
\]
where $\Gamma_j(Q_{-j}) = R_j + \sum_{j'\neq j} H_{jj'} Q_{j'} H_{jj'}^{\dagger}$ is the multiuser interference plus the noise observed at $j$, and $Q_{-j} = (Q_k)_{k\neq j}$ is the collection of the users' covariance matrices, except the $j$th one. The robust individual optimization problem of player $j$ is then
\[
\sup_{Q_j \in \mathcal{Q}_j}\ \inf_{H}\ I(\tilde{s}_j;\tilde{y}_j)(H, Q_1,\ldots,Q_J), \qquad j \in \mathcal{J}
\]
where
\[
\mathcal{Q}_j := \left\{ Q_j \in \mathbb{C}^{n_t\times n_t} \ \middle|\ Q_j \in \mathcal{M}^{+},\ \mathrm{tr}(Q_j) \leq p_{j,\max} \right\}.
\]
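As a concrete reading of the payoff just defined, the following sketch evaluates the mutual-information rate $\log\det(I + H_{jj}^{\dagger}\Gamma_j^{-1}H_{jj}Q_j)$ for given covariance matrices. The channel realizations, dimensions, and the indexing convention H[j][l] (channel from transmitter l to receiver j) are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

def rate(j, H, Q, R):
    """Mutual-information payoff of link j:
    log det(I + H_jj^H Gamma_j^{-1} H_jj Q_j), with
    Gamma_j = R_j + sum_{l != j} H_jl Q_l H_jl^H (interference plus noise)."""
    Gamma = R[j].astype(complex)
    for l in range(len(Q)):
        if l != j:
            Gamma = Gamma + H[j][l] @ Q[l] @ H[j][l].conj().T
    A = np.eye(H[j][j].shape[1]) + H[j][j].conj().T @ np.linalg.solve(Gamma, H[j][j]) @ Q[j]
    return np.linalg.slogdet(A)[1]          # determinant is real and positive here

# toy example with two links and nt = nr = 2 (illustrative values)
rng = np.random.default_rng(0)
H = [[rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2)) for _ in range(2)] for _ in range(2)]
Q = [np.eye(2) * 0.5 for _ in range(2)]      # tr(Q_j) = 1 = p_max
R = [np.eye(2) for _ in range(2)]
print([rate(j, H, Q, R) for j in range(2)])
```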
B. One-Shot Game in Strategic Form
The basic components of a strategic game with complete information are the following: (a) the set of players $\mathcal{J} = \{1,\ldots,J\}$, with $J$ being the number of players; (b) the action spaces $\mathcal{Q}_1,\ldots,\mathcal{Q}_J$; and (c) the preference structure of the players. We represent the latter by the payoff (cost, utility, reward, benefit, etc.) functions $U_1,\ldots,U_J$. In this paper, both discrete and continuous action spaces are considered. This paper covers scenarios where players (users, mobile devices, and transmitters) are able to choose their actions by themselves. An action is typically a covariance matrix. The payoff function is the maximum transmission rate. If the game is played once, it is called a one-shot game, and its strategic form (or normal form) is represented by the collection $\bar{\mathcal{G}} = (\mathcal{J}, \{\mathcal{Q}_j\}_{j\in\mathcal{J}}, \{U_j\}_{j\in\mathcal{J}})$.
In the case of a discrete action space (a set of fixed covariance matrices from the set $\mathcal{Q}_j$), when player $j \in \mathcal{J}$ chooses action $s_j \in \mathcal{Q}_j$ according to a probability distribution $x_j = (x_j(s_j))_{s_j\in\mathcal{Q}_j}$ over $\mathcal{Q}_j$, the choice of $x_j$ is called a mixed strategy of the one-shot game. When $x_j$ is a vertex of the simplex $\mathcal{X}_j = \Delta(\mathcal{Q}_j)$, the mixed strategy boils down to a pure strategy, i.e., the deterministic choice of an action. Since there are random variables $H_{j,j'}$ that determine the state of the game, we add a state space $\mathcal{H}$, and the payoff function is defined on the product space $\mathcal{H} \times \prod_j \mathcal{X}_j$. We denote by $\mathcal{G}(H)$ the normal-form mixed game $(\{H\}, \mathcal{J}, \{\mathcal{Q}_j\}_{j\in\mathcal{J}}, \{U_j(H,\cdot)\}_{j\in\mathcal{J}})$. The extended payoff function is defined on $\prod_j \mathcal{X}_j$, i.e., $u_j(H, x_1,\ldots,x_J) = \mathbb{E}_{x_1,\ldots,x_J} U_j(H, Q_1,\ldots,Q_J)$. Then, an action profile $(Q_1^*,\ldots,Q_J^*) \in \prod_{j\in\mathcal{J}} \mathcal{Q}_j$ is a (pure) Nash equilibrium of the one-shot game $\mathcal{G}(H)$ if, $\forall j \in \mathcal{J}$,
\[
U_j\!\left(H, Q_j, Q_{-j}^*\right) \leq U_j\!\left(H, Q_j^*, Q_{-j}^*\right), \quad \forall Q_j \in \mathcal{Q}_j. \tag{2}
\]
Here, we have identified the set of mappings from $\{H\}$ to $\mathcal{Q}_j$ with the state-independent action space. At this point, the knowledge of $H$ may be required to compute the payoff function. We do not assume that in our analysis. A strategy profile $(x_1^*,\ldots,x_J^*) \in \prod_{j\in\mathcal{J}} \mathcal{X}_j$ is a (mixed) Nash equilibrium of the one-shot game $\mathcal{G}(H)$ if, $\forall j \in \mathcal{J}$,
\[
u_j\!\left(H, x_j, x_{-j}^*\right) \leq u_j\!\left(H, x_j^*, x_{-j}^*\right), \quad \forall x_j \in \mathcal{X}_j. \tag{3}
\]
Following [8] and [9], given $H$, a Nash equilibrium of $\mathcal{G}(H)$ is given by the MIMO waterfilling solution described as follows.
• The term $H_{jj}^{\dagger}\Gamma_j^{-1}(Q_{-j})H_{jj}$ is written as $E_j(Q_{-j})\,D_j(Q_{-j})\,E_j^{\dagger}(Q_{-j})$ by eigendecomposition, where $E_j(Q_{-j}) \in \mathbb{C}^{n_t\times n_t}$ is a unitary matrix containing the eigenvectors and $D_j(Q_{-j})$ is a diagonal matrix with the $n_t$ positive eigenvalues.
• Nash equilibria are characterized by the solutions $(Q_1^*,\ldots,Q_J^*)$ of the MIMO waterfilling operator [9], [23], [24], i.e.,
\[
\mathrm{WF}_j(Q_{-j}) = E_j(Q_{-j})\left[\mu_j I - D_j^{-1}(Q_{-j})\right]^{+} E_j^{\dagger}(Q_{-j})
\]
where $\mu_j$ is chosen in order to satisfy
\[
\mathrm{tr}\!\left(\left[\mu_j I - D_j^{-1}(Q_{-j})\right]^{+}\right) = p_{j,\max}
\]
and $x^{+} = \max(0, x)$.
• Note that $\mathrm{WF}_j(Q_{-j})$ is exactly the best response to $Q_{-j}$, i.e., $\mathrm{BR}_j(Q_{-j}) = \mathrm{WF}_j(Q_{-j})$. This operator is continuous in the sense of the 2-norm or the sup norm (in the finite-dimensional vector space considered, they are topologically equivalent).
• The existence of a solution $Q^*$ of the above fixed-point equation is guaranteed by the Brouwer fixed-point theorem,^1 which states that a continuous mapping from a nonempty compact convex set into itself has at least one fixed point, i.e., there exists $Q^*$ with $Q_j^* \in \mathcal{Q}_j$ and $Q_j^* = \mathrm{WF}_j(Q_{-j}^*)$ for all $j$.
• In specific cases where $\mathrm{WF}$ is a strict contraction, one can use the Banach–Picard iterative procedure to show convergence. However, in general, the best response (namely, the iterative waterfilling methods, i.e., the simultaneous, sequential, and asynchronous versions) may not converge or may require additional information, such as the own channel state and the total interference. A well-known and simple example of nonconvergence is obtained when considering a cycling behavior between receivers, i.e., $J = 3$, $n_r = 2$, $p_{1,\max} = p_{2,\max} = p_{\max}$, and the transfer matrices
\[
H_1 = H_2 = \begin{pmatrix} 1 & 0 & 2 \\ 2 & 1 & 0 \\ 0 & 2 & 1 \end{pmatrix}, \qquad R_1 = (\sigma^2), \quad R_2 = (\sigma^2 + p_{\max}).
\]
Starting from the first channel, the three players cycle between the two channels indefinitely. In this paper, we show how to eliminate this cycling phenomenon using the CODIPAS-RL. A numerical sketch of the waterfilling best response and of its iterative use is given below.
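The following is a sketch, under stated assumptions, of the waterfilling best response $\mathrm{WF}_j(Q_{-j})$ via the eigendecomposition above, together with a two-link iterative-waterfilling sweep; as noted in the text, such iterations need not converge in general, and the channel values here are illustrative.

```python
import numpy as np

def mimo_waterfill(Hjj, Gamma, p_max, tol=1e-9):
    """Best response WF_j(Q_{-j}): with H_jj^H Gamma_j^{-1} H_jj = E D E^H,
    return E (mu I - D^{-1})^+ E^H, where mu is set so that the trace is p_max."""
    M = Hjj.conj().T @ np.linalg.solve(Gamma, Hjj)
    d, E = np.linalg.eigh((M + M.conj().T) / 2)      # Hermitian part guards rounding
    inv_d = 1.0 / np.maximum(d, 1e-12)               # 1/eigenvalues (zero modes get no power)
    lo, hi = 0.0, np.min(inv_d) + p_max              # bracket for the water level mu
    while hi - lo > tol:
        mu = 0.5 * (lo + hi)
        if np.maximum(0.0, mu - inv_d).sum() > p_max:
            hi = mu
        else:
            lo = mu
    return E @ np.diag(np.maximum(0.0, lo - inv_d)) @ E.conj().T

# one iterative-waterfilling run for two links (illustrative channels, R_j = I)
rng = np.random.default_rng(5)
H = [[rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2)) for _ in range(2)] for _ in range(2)]
Q = [np.eye(2) * 0.5, np.eye(2) * 0.5]
for _ in range(20):
    for j in range(2):
        Gamma = np.eye(2) + H[j][1 - j] @ Q[1 - j] @ H[j][1 - j].conj().T
        Q[j] = mimo_waterfill(H[j][j], Gamma, p_max=1.0)
print(np.trace(Q[0]).real, np.trace(Q[1]).real)      # both traces equal p_max
```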
C. Static Robust Game Formulations
We now examine the static robust game. The robust game with state space $\mathcal{H}$ is given by $(\mathcal{H}, \mathcal{J}, \{\mathcal{Q}_j\}_{j\in\mathcal{J}}, \{U_j(H,\cdot)\}_{j\in\mathcal{J}})$. We develop two approaches to the robust one-shot power-allocation game.
• The first one is based on the expectation over the channel states, called the expected robust game, with a payoff function defined by $u_j^1(x_1,\ldots,x_J) := \mathbb{E}_H \mathbb{E}_{x_1,\ldots,x_J} U_j(H, Q_1,\ldots,Q_J)$ in the discrete power allocation and $v_j^1(Q_1,\ldots,Q_J) = \mathbb{E}_H U_j(H, Q_1,\ldots,Q_J)$ for the continuous power allocation. We denote the associated static games by $\mathcal{G}^{1,d}$ for the discrete power allocation and $\mathcal{G}^{1,c}$ for the continuous power allocation. A strategy profile $(x_1^*,\ldots,x_J^*) \in \prod_{j\in\mathcal{J}} \mathcal{X}_j$ is a state-independent Nash equilibrium of the expected game $\mathcal{G}^{1,d}$ if, $\forall j \in \mathcal{J}$,
\[
\mathbb{E}_H\, u_j\!\left(H, x_j, x_{-j}^*\right) \leq \mathbb{E}_H\, u_j\!\left(H, x_j^*, x_{-j}^*\right), \quad \forall x_j \in \mathcal{X}_j. \tag{4}
\]
The existence of a solution of (4) is equivalent to the existence of a solution of the following variational inequality problem: find $x^*$ such that
\[
\left\langle x^* - x,\, V(x^*) \right\rangle \geq 0, \quad \forall x \in \prod_j \mathcal{X}_j
\]
1 See also Kakutani, Glicksberg, Ky Fan, and Debreu fixed-point theorems for the set-valued fixed points.
where $\langle\cdot,\cdot\rangle$ is the inner product, $V(x) = [V_1(x),\ldots,V_J(x)]$, and $V_j(x) = [\mathbb{E}_H u_j(H, e_{s_j}, x_{-j})]_{s_j\in\mathcal{Q}_j}$. An equilibrium of the expected game $\mathcal{G}^{1,c}$ is defined similarly. For the continuous-action case, it is well known that, if the payoff functions are continuous and the action spaces are compact subsets of finite-dimensional spaces, then the existence of mixed equilibria follows.
• The second approach is a pessimistic approach, also called the maximin robust approach. It consists in considering the payoff functions $u_j^2(x) = \inf_{H\in\mathcal{H}} \mathbb{E}_{x_1,\ldots,x_J} U_j(H, Q_1,\ldots,Q_J)$ and $v_j^2(Q) = \inf_{H\in\mathcal{H}} U_j(H, Q)$. We denote the associated static maximin robust games by $\mathcal{G}^{2,d}$ for the discrete power allocation and $\mathcal{G}^{2,c}$ for the continuous power allocation. A profile $(x_j^*, x_{-j}^*) \in \prod_j \mathcal{X}_j$ is a maximin robust equilibrium of $\mathcal{G}^{2,d}$ if, $\forall j \in \mathcal{J}$, $\inf_H u_j(H, x_j, x_{-j}^*) \leq \inf_H u_j(H, x_j^*, x_{-j}^*)$, $\forall x_j \in \mathcal{X}_j$.
Next, we define the dynamic robust power-allocation game. In it, the transmitters play several times, all under channel-state uncertainty. We consider the behavioral-strategy case, where each transmitter $j$ chooses the probability distribution $x_{j,t}$ at each time slot $t$ based on its history up to $t$. In the dynamic game with uncertainty, the joint channel states change randomly from one time slot to another. In the robust power-allocation scenarios, the state corresponds to the current joint channel states, e.g., the matrix of channel-state matrices $H_t = (H_{j,j',t})_{j,j'} \in \mathcal{H}$. In that case, we denote the instantaneous payoff function by $u_j(H_t, x_t)$. In our setting, the payoff function of player $j$ is the mutual information $I(\tilde{s}_j;\tilde{y}_j)(H, Q_1,\ldots,Q_J)$. We would like to mention that, in the learning part of this paper, each player is assumed to follow a heterogeneous CODIPAS-RL scheme but does not need to know whether the other players are present or not, or whether they are rational or not.
D. Standard Reinforcement Learning Algorithm
The payoff function at a given game step depends on the current state matrices $H_t$ and on the actions played by the different players. Denoting by $Q_{j,t}$ the action played by $j$ at time slot $t$, the payoff for $j$ writes $U_j(H_t, Q_{1,t},\ldots,Q_{J,t})$. We denote the perceived payoff of $j$ at time slot $t$ by $U_{j,t}$, and $x_{j,t}(s_j) = \Pr[Q_{j,t} = s_j]$, $s_j \in \mathcal{Q}_j$. The classical reinforcement learning of [25]–[28] consists in updating the probability distribution over the possible actions as follows: $\forall j \in \mathcal{J}$ and $\forall s_j \in \mathcal{Q}_j$,
\[
x_{j,t+1}(s_j) = x_{j,t}(s_j) + \lambda_{j,t}\, u_{j,t}\left(\mathbb{1}_{\{Q_{j,t}=s_j\}} - x_{j,t}(s_j)\right) \tag{5}
\]
where $\mathbb{1}_{\{\cdot\}}$ is the indicator function and $\lambda_{j,t} > 0$ is the learning rate (step size) of player $j$ at time $t$, satisfying $0 \leq \lambda_{j,t} u_{j,t} \leq 1$. The learning rate can be constant or time varying. The term $u_{j,t}$ is a numerical value of the measured payoff of $j$ at time $t$. The increment in the probability of each action $s_j$ depends on the corresponding observed or measured payoff and on the learning rate. More importantly, note that, in (5), for each player, only the value of its individual payoff function at time slot $t$ is required. Therefore, the knowledge of the mathematical expression of the payoff function $U_j(\cdot)$ is not assumed for implementing the algorithm. In addition, the random state $H_t$ is unknown to the players. This is one of the reasons why gradient-like techniques are not directly applicable here.
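A minimal sketch of one step of the update rule (5); the action set, payoff value, and learning rate below are illustrative choices.

```python
import numpy as np

def rl_step(x, chosen, payoff, lam):
    """One step of the classical strategy-reinforcement rule (5):
    x(s) <- x(s) + lam * payoff * (1{chosen == s} - x(s)).
    Requires 0 <= lam * payoff <= 1 so that x stays in the simplex."""
    e = np.zeros_like(x)
    e[chosen] = 1.0
    return x + lam * payoff * (e - x)

x = np.array([0.5, 0.5])
x = rl_step(x, chosen=0, payoff=0.8, lam=0.1)   # illustrative values
print(x, x.sum())                                # still a probability vector
```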
The update scheme has the following generic form:
\[
\text{New estimate} \ \longleftarrow\ \text{Old estimate} + \text{Step size}\times(\text{Target} - \text{Old estimate})
\]
where the target plays the role of the current strategy. The expression (Target − Old estimate) is an error in the estimation. It is reduced by taking a step toward the target. The target is presumed to indicate a desirable direction of movement.

III. STATIC ROBUST POWER-ALLOCATION GAMES
A. Robust Pseudopotential Approach
Following the work of [29], we define a potential game in a context with randomness. The game $\mathcal{G}(H)$ is an exact potential game if there exists a function $\phi(H, Q)$, defined for all $Q \in \prod_j \mathcal{Q}_j$, such that, for every player $j \in \mathcal{J}$,
\[
U_j(H, Q) = \phi(H, Q) + \tilde{B}_j(H, Q_{-j})
\]
for some function $\tilde{B}_j(\cdot)$. We say that the game $\mathcal{G}(H)$ is a best-response potential game if there exists a function $\phi(H, Q)$ such that, $\forall j$, $\forall Q_{-j}$, $\arg\max_{Q_j} \phi(H, Q_j, Q_{-j}) = \arg\max_{Q_j} U_j(H, Q_j, Q_{-j})$, and the game $\mathcal{G}(H)$ is a pseudopotential game if there exists a function $\phi(H, Q)$ such that, $\forall j$, $\forall Q_{-j}$, $\arg\max_{Q_j} \phi(H, Q_j, Q_{-j}) \subseteq \arg\max_{Q_j} U_j(H, Q_j, Q_{-j})$. A direct consequence is that an exact potential game is a best-response potential game, which is in turn a pseudopotential game. We define a robust pseudopotential game as follows.
Definition 1 (Robust Pseudopotential Game): The family of games indexed by $H$ is a robust pseudopotential game if there exists a function $\phi$ defined on $\mathcal{H} \times \prod_{j\in\mathcal{J}} \mathcal{Q}_j$ such that, $\forall j$, $\forall Q_{-j}$,
\[
\arg\max_{Q_j}\ \mathbb{E}_H\, \phi(H, Q) \ \subseteq\ \arg\max_{Q_j}\ \mathbb{E}_H\, U_j(H, Q). \tag{6}
\]
Particular cases of robust pseudopotential games are pseudopotential games that are ordinal potential (sign preserving in Definition 1), which are indexed by a singleton state.
Proposition 1: Assume that the payoff functions are absolutely integrable with respect to $H$.
• Every finite robust potential game has at least one pure Nash equilibrium.
• Assume that the action spaces are compact and nonempty. Then, a robust pseudopotential game with a continuous function $\phi$ has at least one pure Nash equilibrium.
• In addition, assume that the action spaces are convex. Then, almost surely, every robust potential concave game with a continuously differentiable potential function is a stable robust game, i.e., the operator $-\mathbb{E}_H D\phi(H,\cdot)$ is monotone, where $D$ is the differential operator with respect to the second variable. Moreover, if $\mathbb{E}_H \phi(H,\cdot)$ is strictly concave in the joint actions, then global convergence to the unique equilibrium holds.
Proof:
• Since the joint action space is finite, there exists an action profile that maximizes the term $\mathbb{E}_H \phi(H,\cdot)$. This is an equilibrium of the expected game.
• Since $Q \longmapsto \phi(H, Q)$ is continuous on a compact and nonempty set, for any fixed value of $H$, the function attains its maximum. By the absolute integrability of $\phi(\cdot, Q)$, a global maximizer of $Q \longmapsto \mathbb{E}_H \phi(H, Q)$ is an equilibrium of the expected game.
• Now, the concavity of $Q \longmapsto \mathbb{E}_H \phi(H, Q) = \int_{\mathcal{H}} \phi(H, Q)\,\tilde{\nu}(dH)$, where $\tilde{\nu}$ is the probability law of the states, gives the monotonicity (not necessarily strict) of $D\{\mathbb{E}_H \phi(H, Q)\}$. By exchanging the order of $D$ and $\mathbb{E}_H$ and using the fact that $\phi$ is continuously differentiable, one obtains the monotonicity of $\mathbb{E}_H D\{\phi(H, Q)\}$. Hence, it is a robust stable game [16]. If the expected potential function is strictly concave in the joint actions, then one gets a strictly stable robust game. Hence, global convergence to the unique (state-independent) equilibrium follows.
Note that robust pseudopotential games are more general than standard pseudopotential games. As a corollary, one has the following result by choosing a Dirac measure on a single state.
• Assume that the action spaces are compact and nonempty. Then, a pseudopotential game with a continuous function $\phi$ has at least one Nash equilibrium in pure strategies.
• In addition, assume that the action spaces are convex. Then, a potential concave game is a stable game.
In the context of the static power-allocation game, the authors in [9] showed that the best response $\mathrm{BR}$ may not be monotone. Similarly, we define a maximin robust potential game. A numerical illustration of Proposition 1 for a toy finite game is sketched below, after Definition 2.
Definition 2 (Maximin Robust Pseudopotential Game): The family of games indexed by $H$ is a maximin robust pseudopotential game if there exists a function $\xi$ defined on $\prod_{j\in\mathcal{J}} \mathcal{Q}_j$ such that, $\forall j \in \mathcal{J}$, $\forall Q_{-j}$,
\[
\arg\max_{Q_j}\ \xi(Q) \ \subseteq\ \arg\max_{Q_j}\ \inf_{H}\ U_j(H, Q). \tag{7}
\]
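The following toy sketch illustrates Definition 1 and the first bullet of Proposition 1 on a randomly generated finite exact robust potential game (it is not the MIMO game): own-action maximizers of $\mathbb{E}_H\phi$ are own-action maximizers of each player's expected payoff, and a joint maximizer of $\mathbb{E}_H\phi$ is a pure equilibrium of the expected game. All numerical values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, n_states = 3, 3, 4
phi = rng.normal(size=(n_states, n1, n2))       # potential phi(H, a1, a2)
B1 = rng.normal(size=(n_states, n2))            # B_1 depends only on (H, a2)
B2 = rng.normal(size=(n_states, n1))            # B_2 depends only on (H, a1)
U1 = phi + B1[:, None, :]                       # U_1(H, a1, a2) = phi + B_1(H, a2)
U2 = phi + B2[:, :, None]                       # U_2(H, a1, a2) = phi + B_2(H, a1)

Ephi, EU1, EU2 = phi.mean(0), U1.mean(0), U2.mean(0)   # equally likely states

# Definition 1: own-action maximizers of E_H phi maximize the own expected payoff
for a2 in range(n2):
    assert np.argmax(Ephi[:, a2]) == np.argmax(EU1[:, a2])
for a1 in range(n1):
    assert np.argmax(Ephi[a1, :]) == np.argmax(EU2[a1, :])
print("pure equilibrium of the expected game (joint maximizer of E_H phi):",
      np.unravel_index(np.argmax(Ephi), Ephi.shape))
```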
Static Power Allocation Under the Channel Uncertainty: Here, we focus on the static robust power-allocation games. By rewriting the payoff as $U_j(H, Q) = \phi(H, Q) + \tilde{B}_j(H, Q_{-j})$, where
\[
\phi(H, Q) = \log\det\!\left(R + \sum_{j\in\mathcal{J}} H_{j,t}\,Q_{j,t}\,H_{j,t}^{\dagger}\right), \qquad
\tilde{B}_j(H, Q_{-j}) = -\log\det\!\left(R + \sum_{l\neq j} H_{l,t}\,Q_{l,t}\,H_{l,t}^{\dagger}\right) \tag{8}
\]
we deduce that the power-allocation game is a robust pseudopotential game with the potential function $\mathbb{E}_H \phi$.
Corollary 1: The robust power allocation is a robust pseudopotential game with a potential function given by
\[
\xi : Q \longmapsto \mathbb{E}_H \phi(H, Q) = \int_{\mathcal{H}} \phi(H, Q)\,\nu(dH).
\]
Proof: By taking the expectation of (8), one gets, $\forall j \in \mathcal{J}$, $\forall Q_{-j}$,
\[
\arg\max_{Q_j}\ \mathbb{E}_H\, I(\tilde{s}_j;\tilde{y}_j)(H, Q) = \arg\max_{Q_j}\ \mathbb{E}_H\, \phi(H, Q).
\]
Using the fact that the function $\log\det$ is concave on positive matrices, the function $\phi$ is continuously differentiable and concave in $Q = (Q_1,\ldots,Q_J)$. Thus, the following corollary holds.
Corollary 2 (Existence in $\mathcal{G}^{1,d}$, $\mathcal{G}^{1,c}$): Assume that all the random variables $H_{j,j'}$ have a compact support in $\mathbb{C}^{n_r\times n_t}$; then the robust power allocation (discrete or continuous) has at least one pure robust equilibrium (stationary and state independent).
Proof: By the compactness assumption and by the continuity of the function $\phi(\cdot,\cdot)$, the mapping $\xi$ defined by $\xi : Q \longmapsto \mathbb{E}_H \phi(H, Q)$ is continuous over $\prod_j \mathcal{Q}_j$. Thus, $\xi$ has a maximizer $Q^*$, which is a pure robust equilibrium. Note that we do not need $\mathcal{H}$ to be a compact set; the uniform integrability of the payoffs with respect to $H$ is sufficient.
We now focus on the maximin robust solutions. A maximin robust solution is a solution of
\[
\sup_{Q_j \in \mathcal{Q}_j}\ \inf_{H}\ I(\tilde{s}_j;\tilde{y}_j)(H, Q), \qquad j \in \mathcal{J}.
\]
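As an illustration of the maximin solution concept, the following sketch replaces the uncertainty set by a finite sample of channel states and selects, for a fixed action of the other user, the discrete allocation with the best worst-case rate. The two-band diagonal interference model and all numerical values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
p_max, sigma2 = 1.0, 0.1
actions = [np.array([p_max, 0.0]), np.array([0.0, p_max])]   # all power on band 1 or band 2
gains = rng.uniform(0.2, 1.5, size=(50, 2, 2))               # sampled gains[state, user, band]

def rate(j, g, p_j, p_other):
    """Parallel-interference rate of user j for one realization g
    (an illustrative diagonal model, not the general MIMO payoff)."""
    return np.sum(np.log(1.0 + g[j] * p_j / (sigma2 + g[1 - j] * p_other)))

def maximin_action(j, p_other):
    """argmax over own actions of the worst-case rate over the sampled set."""
    return max(range(len(actions)),
               key=lambda a: min(rate(j, g, actions[a], p_other) for g in gains))

print("maximin action of user 0 when user 1 floods band 1:", maximin_action(0, actions[0]))
```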
Proposition 2 (Existence in $\mathcal{G}^{2,d}$, $\mathcal{G}^{2,c}$): Assume that all the random variables $H_{j,j'}$ have a compact support in $\mathbb{C}^{n_r\times n_t}$; then the maximin power allocation has at least one pure maximin robust equilibrium.
Proof: By the compactness assumptions, $\inf_H I(\tilde{s}_j;\tilde{y}_j)(H, Q) = \min_H I(\tilde{s}_j;\tilde{y}_j)(H, Q) = I(\tilde{s}_j;\tilde{y}_j)(H^*, Q) = v_j^2(Q)$, and by the continuity of the function $\tilde{\xi} : Q \longmapsto \phi(H^*, Q)$ over $\prod_j \mathcal{Q}_j$, there is a maximizer $\tilde{Q}^*$, which is also a pure maximin robust equilibrium.
Proposition 3 (Existence in $\mathcal{G}^{1,d}$, $\mathcal{G}^{2,d}$): Any robust finite game with a bounded and closed uncertainty set has an equilibrium in mixed strategies.
The proof immediately follows from Theorem 2 in [4]. As a corollary, the existence of maximin solutions of the robust discrete-power-allocation game is guaranteed if the support of the distribution of channel states is a compact set (in finite-dimensional vector spaces, a set is compact if and only if it is bounded and closed). From the fact that our game is a robust pseudopotential game with the potential function $\mathbb{E}_H \phi$, we deduce that the local maximizers of $\mathbb{E}_H \phi$ are equilibria of the expected robust game.

IV. LEARNING IN DYNAMIC ROBUST POWER-ALLOCATION GAMES
Here, we develop a combined and fully distributed learning framework for the dynamic robust power-allocation games. We consider a class of dynamic robust games indexed by $\mathcal{G}(H_t)$, $t \geq 0$. Since the transmitters do not observe the past actions of the other transmitters, we consider strategies used by the players that depend only on their current perceived own payoffs and past own histories. Denote by $x_{j,t}(s_j)$ the probability of transmitter $j$ choosing the power allocation
$s_j$ at time $t$, and let $x_{j,t} = [x_{j,t}(s_j)]_{s_j\in\mathcal{Q}_j} \in \mathcal{X}_j$ be the mixed state-independent strategy of transmitter $j$. The payoffs $U_{j,t}$ are random variables, and the payoff functions are unknown to the transmitters. We assume that the distribution (law) of the possible payoffs is also unknown. We do not use any Bayesian assumption on initial beliefs formed over the possible states. We propose a CODIPAS-RL to learn the expected payoffs simultaneously with the optimal strategies during a long-run interaction, i.e., a dynamic game. The dynamic robust game is described as follows.
• At time slot $t = \tau$, each transmitter $j$ chooses a power allocation $Q_{j,\tau} \in \mathcal{Q}_j$ and perceives a numerical noisy value of its payoff, which corresponds to a realization of a random variable depending on the actions of the other transmitters, the channel state, etc. It initializes its estimation to $\hat{u}_{j,0}$.
• At time slot $t$, each transmitter $j$ has an estimation of its own payoffs, chooses an action $Q_{j,t}$ based on its own experience, and experiments with a new strategy. Each transmitter $j$ receives a delayed output $u_{j,t-\tau_j}$ from the old experiment. Based on this target $u_{j,t-\tau_j}$, transmitter $j$ updates its estimation vector $\hat{u}_{j,t}$ and builds the strategy $x_{j,t+1}$ for the next time slot. The strategy $x_{j,t+1}$ is a function only of $x_{j,t}$, $\hat{u}_{j,t}$, and the target value. Note that the exact value of the channel state at time $t$ is unknown to transmitter $j$, the exact values of the delayed own payoffs are unknown, and the past strategies $x_{-j,t-1} := (x_{k,t-1})_{k\neq j}$ of the other transmitters and their past payoffs $u_{-j,t-1} := (u_{k,t-1})_{k\neq j}$ are also unknown to transmitter $j$.
• The game moves to $t + 1$.
We focus on the limit of the average payoff, i.e., the limit of $F_{j,T} = \frac{1}{T}\sum_{t=1}^{T} u_{j,t}$. First, we introduce preliminary notions that define the dynamic game.
Histories: A transmitter's information consists of its past own actions and perceived own payoffs. A private history of length $t$ for transmitter $j$ is the collection $h_{j,t} = (Q_{j,1}, u_{j,1},\ldots,Q_{j,t}, u_{j,t}) \in M_{j,t} = (\mathcal{Q}_j \times \mathbb{R})^t$.
Behavioral strategy: A strategy for transmitter $j$ is a mapping $\sigma_j : \bigcup_{t\geq 0} M_{j,t} \longrightarrow \mathcal{X}_j$. The set of complete histories of the dynamic game after $t$ stages is $M_t = (\mathcal{H} \times \prod_j \mathcal{Q}_j \times \mathbb{R}^J)^t$; it describes the states, the chosen actions, and the received payoffs of all the transmitters at all past stages before $t$. A strategy profile $\sigma = (\sigma_j)_{j\in\mathcal{J}}$ and an initial state $H$ induce a probability distribution $\mathbb{P}_{H,\sigma}$ on the set of plays $M_{\infty} = (\mathcal{H} \times \prod_j \mathcal{Q}_j \times \mathbb{R}^J)^{\infty}$. Given an initial state $H$ and a strategy profile $\sigma$, the payoff of transmitter $j$ is the superior limit of the Cesàro-mean payoff $\mathbb{E}_{H,\sigma} F_{j,T}$. We assume the ergodicity of the payoff $\mathbb{E}_{H,\sigma} F_{j,T}$.
Stationary strategies: A simple class of strategies is the class of stationary strategies. A strategy profile $(\sigma_j)_{j}$ is stationary if, $\forall j$, $\sigma_j(M_{j,t}, H_t)$ depends only on the current state $H_t$. A stationary strategy of player $j$ can be identified with an element of the product space $\prod_{H\in\mathcal{H}} \Delta(\mathcal{Q}_j(H))$. In our setting, $\mathcal{Q}_j(H) = \mathcal{Q}_j$ (independent of $H$). Since the value of $H$ is not observed
by the player, a state-independent stationary strategy of player $j$ is an element of $\mathcal{X}_j$.
A. CODIPAS-RL in the MIMO Under the Channel Uncertainty
Inspired by the heterogeneous combined learning for two-player zero-sum stochastic games with incomplete information developed in [13] and [15], and by the Boltzmann–Gibbs-based reinforcement learning, we develop a heterogeneous, delayed CODIPAS-RL framework for the discrete power allocation under uncertainty and delayed feedback. In this paper, the general learning pattern has the following form:
\[
\text{(player }j\text{)}\quad
\begin{cases}
x_{j,t+1} = f_j\!\left(\lambda_{j,t}, Q_{j,t}, u_{j,t-\tau_j}, \hat{u}_{j,t}, x_{j,t}\right)\\[2pt]
\hat{u}_{j,t+1} = g_j\!\left(\nu_{j,t}, u_{j,t-\tau_j}, x_{j,t}, \hat{u}_{j,t}\right)\\[2pt]
\forall j \in \mathcal{J},\ t \geq 0,\ Q_{j,t} \in \mathcal{Q}_j
\end{cases}
\]
where
• The functions $f_j$ and the rates $\lambda_j$ act on the estimated payoff and the perceived measured payoff (with delay) in such a way that the invariance of the simplex is preserved. The function $f_j$ defines the strategy-learning pattern of transmitter $j$, and $\lambda_j$ is its learning rate. If at least two of the functions $f_j$ are different, then we refer to heterogeneous learning. We will assume that $\lambda_{j,t} \geq 0$, $\sum_t \lambda_{j,t} = \infty$, and $\sum_t \lambda_{j,t}^2 < \infty$, that is, $\lambda_j \in \ell^2\setminus\ell^1$. If all the $f_j$ are identical but the learning rates $\lambda_j$ are different, we refer to learning with different speeds, i.e., slow learners, medium learners, fast learners, etc.
• The functions $g_j$ and the rates $\nu_j$ are chosen so as to obtain a good estimation of the payoffs. We assume that $\nu_j \in \ell^2\setminus\ell^1$ and that $\tau_j \geq 0$ is the feedback delay associated with the payoff of transmitter $j$.
Let us give two examples of delayed fully distributed learning algorithms [see (9) and (10) at the bottom of this page], where $\tilde{\beta}_{j,\epsilon_j} : \mathbb{R}^{|\mathcal{Q}_j|} \longrightarrow \mathcal{X}_j$,
\[
\tilde{\beta}_{j,\epsilon_j}(\hat{u}_{j,t})(s_j) = \frac{e^{\frac{1}{\epsilon_j}\hat{u}_{j,t}(s_j)}}{\sum_{s'_j} e^{\frac{1}{\epsilon_j}\hat{u}_{j,t}(s'_j)}}
\]
is the Boltzmann–Gibbs strategy. An example of the heterogeneous learning with two
transmitters is then obtained by combining (CRL0) and (CRL1) [see (11) at the bottom of the page].
Convergence to an ordinary differential equation: The stochastic fully distributed reinforcement learning has been studied in [25], [28], and [30]. These works used stochastic-approximation techniques [31]–[35] to derive ordinary differential equations (ODEs) that are equivalent to the adjusted replicator dynamics [36]. By studying the orbits of the replicator dynamics, one can obtain convergence, divergence, and stability properties of the system. However, in general, the replicator dynamics may not lead to approximate equilibria, even in simple games [11]. Convergence properties for special classes of games, such as weakly acyclic games and best-response potential games, can be found in [37]. Most often, the limiting behaviors of the stochastic iterative schemes are related to well-known evolutionary game dynamics, i.e., multitype replicator dynamics, Maynard Smith replicator dynamics, Smith dynamics, projection dynamics, etc. Evolutionary game-dynamic approaches have been applied to IEEE 802.16 [38], wireless mesh networks [39], resource pricing [40], P2P soft-security incentive mechanisms [41], access control [42], [43], hybrid rate control [44], and power control [45].
Homogeneous learning: The strategies $\{x_{j,t}\}_{t\geq 0}$ generated by these learning schemes are in the class of behavioral strategies $\sigma$ described above.
Proposition 4: The ODE of the CODIPAS-RL (CRL0) is
\[
\begin{cases}
\dot{x}_{j,t}(s_j) = \tilde{\beta}_{j,\epsilon_j}(\hat{u}_{j,t})(s_j) - x_{j,t}(s_j)\\[2pt]
\frac{d}{dt}\hat{u}_{j,t}(s_j) = x_{j,t}(s_j)\left(\mathbb{E}_H U_j(H, e_{s_j}, x_{-j,t}) - \hat{u}_{j,t}(s_j)\right)\\[2pt]
s_j \in \mathcal{Q}_j,\ j \in \mathcal{J}.
\end{cases}
\]
Moreover, if the payoff learning rate is faster than the strategy learning rate, then the system of ODEs reduces to
\[
\dot{x}_{j,t}(s_j) = \tilde{\beta}_{j,\epsilon_j}\!\left(\mathbb{E}_H U_j(H, e_{s_j}, x_{-j,t})\right)(s_j) - x_{j,t}(s_j), \qquad s_j \in \mathcal{Q}_j,\ j \in \mathcal{J}.
\]
Proof: The proof follows the same lines as that of Proposition 6, using multiple-timescale stochastic-approximation techniques.
\[
\text{(CRL0)}\quad
\begin{cases}
x_{j,t+1} = (1-\lambda_{j,t})\,x_{j,t} + \lambda_{j,t}\,\tilde{\beta}_{j,\epsilon_j}(\hat{u}_{j,t})\\[2pt]
\hat{u}_{j,t+1}(s_j) = \hat{u}_{j,t}(s_j) + \nu_{j,t}\,\mathbb{1}_{\{Q_{j,t}=s_j\}}\left(u_{j,t-\tau_j} - \hat{u}_{j,t}(s_j)\right)\\[2pt]
j \in \mathcal{J},\ s_j \in \mathcal{Q}_j
\end{cases} \tag{9}
\]
\[
\text{(CRL1)}\quad
\begin{cases}
x_{j,t+1}(s_j) = x_{j,t}(s_j) + \lambda_{j,t}\,u_{j,t-\tau_j}\left(\mathbb{1}_{\{Q_{j,t}=s_j\}} - x_{j,t}(s_j)\right)\\[2pt]
\hat{u}_{j,t+1}(s_j) = \hat{u}_{j,t}(s_j) + \nu_{j,t}\,\mathbb{1}_{\{Q_{j,t}=s_j\}}\left(u_{j,t-\tau_j} - \hat{u}_{j,t}(s_j)\right)\\[2pt]
s_j \in \mathcal{Q}_j,\ j \in \mathcal{J}
\end{cases} \tag{10}
\]
\[
\text{(HCRL)}\quad
\begin{cases}
x_{1,t+1} = (1-\lambda_{1,t})\,x_{1,t} + \lambda_{1,t}\,\tilde{\beta}_{1,\epsilon_1}(\hat{u}_{1,t})\\[2pt]
\hat{u}_{1,t+1}(s_1) = \hat{u}_{1,t}(s_1) + \nu_{1,t}\,\mathbb{1}_{\{Q_{1,t}=s_1\}}\left(u_{1,t-\tau_1} - \hat{u}_{1,t}(s_1)\right)\\[2pt]
x_{2,t+1}(s_2) = x_{2,t}(s_2) + \lambda_{2,t}\,u_{2,t-\tau_2}\left(\mathbb{1}_{\{Q_{2,t}=s_2\}} - x_{2,t}(s_2)\right)\\[2pt]
\hat{u}_{2,t+1}(s_2) = \hat{u}_{2,t}(s_2) + \nu_{2,t}\,\mathbb{1}_{\{Q_{2,t}=s_2\}}\left(u_{2,t-\tau_2} - \hat{u}_{2,t}(s_2)\right)
\end{cases} \tag{11}
\]
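A sketch of one (CRL0) iteration from (9), assuming a two-action set, a delayed noisy scalar payoff, and illustrative learning rates; the Boltzmann–Gibbs map is implemented with the usual max-shift for numerical stability.

```python
import numpy as np

def boltzmann_gibbs(u_hat, eps):
    """beta_{j,eps}(u_hat): softmax of the payoff estimates with temperature eps."""
    z = np.exp((u_hat - u_hat.max()) / eps)      # max-shift for numerical stability
    return z / z.sum()

def crl0_step(x, u_hat, chosen, delayed_payoff, lam, nu, eps):
    """One (CRL0) iteration, cf. (9): the strategy moves toward the Boltzmann-Gibbs
    response to the current payoff estimate; only the estimate of the action
    actually played is corrected by the delayed noisy payoff."""
    x_new = (1.0 - lam) * x + lam * boltzmann_gibbs(u_hat, eps)
    u_new = u_hat.copy()
    u_new[chosen] += nu * (delayed_payoff - u_hat[chosen])
    return x_new, u_new

# illustrative two-action run against a fixed noisy environment
rng = np.random.default_rng(0)
x, u_hat = np.array([0.5, 0.5]), np.zeros(2)
for t in range(1, 200):
    a = rng.choice(2, p=x)
    payoff = [0.3, 0.7][a] + 0.05 * rng.normal()              # noisy target value
    x, u_hat = crl0_step(x, u_hat, a, payoff, lam=1.0 / t, nu=1.0 / t ** 0.6, eps=0.1)
print(x, u_hat)   # x concentrates on the better action; u_hat tracks the expected payoffs
```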
Note that, for any $x_j(s_j) > 0$, the second system of ODEs, i.e.,
\[
\frac{d}{dt}\hat{u}_{j,t}(s_j) = x_j(s_j)\left(\mathbb{E}_H U_j(H, e_{s_j}, x_{-j}) - \hat{u}_{j,t}(s_j)\right)
\]
is globally convergent to $\mathbb{E}_H U_j(H, x)$, which is the ergodic capacity under the strategy $x$. We conclude that the expected payoff is learned when $t$ is sufficiently large.
The asymptotic behaviors of the CODIPAS-RL (CRL1) are related to the multitype replicator dynamics [36] combined with a payoff ODE, i.e.,
\[
\begin{cases}
\dot{x}_{j,t}(s_j) = x_{j,t}(s_j)\left(\hat{u}_{j,t}(s_j) - \sum_{s'_j\in\mathcal{Q}_j} x_{j,t}(s'_j)\,\hat{u}_{j,t}(s'_j)\right)\\[2pt]
\frac{d}{dt}\hat{u}_{j,t}(s_j) = x_{j,t}(s_j)\left(\mathbb{E}_H U_j(H, e_{s_j}, x_{-j,t}) - \hat{u}_{j,t}(s_j)\right)\\[2pt]
s_j \in \mathcal{Q}_j,\ j \in \mathcal{J}.
\end{cases}
\]
By choosing fast learning rates $\nu$, the system reduces to
\[
\dot{x}_{j,t}(s_j) = x_{j,t}(s_j)\left(\mathbb{E}_H U_j(H, e_{s_j}, x_{-j,t}) - \sum_{s'_j\in\mathcal{Q}_j} x_{j,t}(s'_j)\,\mathbb{E}_H U_j(H, e_{s'_j}, x_{-j,t})\right), \quad s_j \in \mathcal{Q}_j,\ j \in \mathcal{J}.
\]
For the one-player case, an explicit solution of the replicator dynamics is given by
\[
x_{j,t}(s_j) = \frac{x_{j,0}(s_j)\,e^{t\,\mathbb{E}_H u_j(H, e_{s_j})}}{\sum_{s'_j} x_{j,0}(s'_j)\,e^{t\,\mathbb{E}_H u_j(H, e_{s'_j})}} = \tilde{\beta}_{j,\frac{1}{t}}\!\left(\mathbb{E}_H u_j(H,\cdot)\right)(s_j).
\]
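A quick numerical check of the closed-form expression above for the one-player replicator dynamics, with illustrative expected payoffs: forward-Euler integration of the ODE stays close to the logit-type formula.

```python
import numpy as np

a = np.array([0.2, 0.5, 0.9])                  # expected payoffs E_H u_j(H, e_s) (illustrative)
x0 = np.array([0.6, 0.3, 0.1])

def closed_form(t):
    w = x0 * np.exp(t * a)
    return w / w.sum()

# forward-Euler integration of the one-player replicator ODE
x, dt, T = x0.copy(), 1e-3, 5.0
for _ in range(int(T / dt)):
    x = x + dt * x * (a - x @ a)
print(np.max(np.abs(x - closed_form(T))))      # small: the explicit formula solves the ODE
```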
An important result in evolutionary game theory is the so-called folk theorem (evolutionary version). It states that, under the replicator dynamics of the expected two-player game, one has the following properties.
Proposition 5 (Folk Theorem): Consider the replicator dynamics for a two-player bilinear game.
• Every Nash equilibrium of the expected game is a rest point.
• Every strict Nash equilibrium of the expected game is asymptotically stable.
• Every stable rest point is a Nash equilibrium of the expected game.
• If an interior orbit converges, its limit is a Nash equilibrium of the expected game.
For a proof of all these statements, we apply [46] and [47] to the expected game. We use this result for the strategy-reinforcement learning of (CRL1) and obtain the following properties of the ODEs.
• If the starting point is a relative interior point of the simplex, the dominated strategies are eliminated.
• If the starting point is in the relative interior and the trajectory goes to the boundary, then the outcome is an equilibrium.
• If there is a cyclic orbit of the dynamics, the limit cycle contains an equilibrium in its interior.
• Moreover, the expected payoff is learned if the CODIPAS-RL (CRL1) is used; $x_j(s_j) > 0$ implies that $\hat{u}_{j,t}(s_j) \longrightarrow \mathbb{E}_H u_j(H, e_{s_j}, x_{-j})$ when $t$ goes to infinity.
Heterogeneous learning: By combining the standard reinforcement-learning algorithm with the Boltzmann–Gibbs learning, for which the rest points are approximate equilibria, we prove the convergence of the nondelayed heterogeneous learning to hybrid dynamics.
Proposition 6: Assume that $\tau_j = 0$, $\forall j$. Then, the asymptotic pseudotrajectory of (HCRL) is given by the system of differential equations in (12) (shown at the bottom of the page). Moreover, if $\lambda_{j,t}/\nu_{j,t} \longrightarrow 0$, then the system reduces to (13) (shown at the bottom of the page).
Proof: We first look at the case of the same learning rate $\lambda_t$ for the strategies but different from $\nu_t$. Assume that the ratio $\lambda_t/\nu_t \longrightarrow 0$. The scheme can be written as $x_{t+1} = x_t + \lambda_t[\tilde{f}(x_t,\hat{u}_t) + M^{(1)}_{t+1}]$ and $\hat{u}_{t+1} = \hat{u}_t + \nu_t[\tilde{g}(x_t,\hat{u}_t) + M^{(2)}_{t+1}]$, where $M^{(k)}_{t+1}$, $k \in \{1,2\}$, are noise terms. By rewriting the first equation as
\[
x_{t+1} = x_t + \nu_t\,\frac{\lambda_t}{\nu_t}\left[\tilde{f}(x_t,\hat{u}_t) + M^{(1)}_{t+1}\right] = x_t + \nu_t\,\tilde{M}^{(1)}_{t+1}
\]
where $\tilde{M}^{(1)}_{t+1} = (\lambda_t/\nu_t)\big(\tilde{f}(x_t,\hat{u}_t) + M^{(1)}_{t+1}\big)$, and by taking the conditional expectation, we obtain $(x_{t+1}-x_t)/\nu_t = \tilde{M}^{(1)}_{t+1}$ and $(\hat{u}_{t+1}-\hat{u}_t)/\nu_t = \tilde{g}(x_t,\hat{u}_t) + M^{(2)}_{t+1}$. For $t$ sufficiently large, it is plausible to view the mapping $x_t$ as quasi-constant when
\[
\begin{cases}
\frac{d}{dt}\hat{u}_{1,t}(s_1) = x_{1,t}(s_1)\left(\mathbb{E}_H u_1(H, e_{s_1}, x_{2,t}) - \hat{u}_{1,t}(s_1)\right)\\[2pt]
\dot{x}_{1,t} = \tilde{\beta}_{1,\epsilon_1}(\hat{u}_{1,t}) - x_{1,t}\\[2pt]
\dot{x}_{2,t}(s_2) = k_2\,x_{2,t}(s_2)\left(u_2^1(x_{1,t}, e_{s_2}) - \sum_{s'_2\in\mathcal{Q}_2} u_2^1(x_{1,t}, e_{s'_2})\,x_{2,t}(s'_2)\right)\\[2pt]
s_1 \in \mathcal{Q}_1,\ s_2 \in \mathcal{Q}_2
\end{cases} \tag{12}
\]
\[
\begin{cases}
\dot{x}_{1,t}(s_1) = \tilde{\beta}_{1,\epsilon_1}\!\left(\mathbb{E}_H u_1(H, e_{s_1}, x_{2,t})\right)(s_1) - x_{1,t}(s_1)\\[2pt]
\dot{x}_{2,t}(s_2) = k_2\,x_{2,t}(s_2)\left(u_2^1(x_{1,t}, e_{s_2}) - \sum_{s'_2\in\mathcal{Q}_2} u_2^1(x_{1,t}, e_{s'_2})\,x_{2,t}(s'_2)\right)\\[2pt]
s_1 \in \mathcal{Q}_1,\ s_2 \in \mathcal{Q}_2
\end{cases} \tag{13}
\]
analyzing the behavior of $\hat{u}_t$, i.e., the drift (the expected change in one time slot), as
\[
\begin{cases}
\mathbb{E}\!\left[\dfrac{x_{t+1}-x_t}{\nu_t}\ \middle|\ \mathcal{F}_t\right] = \mathbb{E}\!\left[\tilde{M}^{(1)}_{t+1}\ \middle|\ \mathcal{F}_t\right] \longrightarrow 0\\[6pt]
\mathbb{E}\!\left[\dfrac{\hat{u}_{t+1}-\hat{u}_t}{\nu_t}\ \middle|\ \mathcal{F}_t\right] = \mathbb{E}\!\left[\tilde{g}(x_t,\hat{u}_t) + M^{(2)}_{t+1}\ \middle|\ \mathcal{F}_t\right] \longrightarrow \mathbb{E}\,\tilde{g}(x_t,\hat{u}_t)
\end{cases}
\]
where $\mathcal{F}_t$ is the filtration generated by $\{x_{t'}, u_{t'}, H_{t'}, \hat{u}_{t'}\}_{t'\leq t}$. Equivalently, $\dot{x}_t = 0$ and $\frac{d}{dt}\hat{u}_t = \mathbb{E}\,\tilde{g}(x_t,\hat{u}_t)$. Since the component $s_j$ of the function $\tilde{g}$ is $x_{j,t}(s_j)$ times $\mathbb{E}_H U_j(H, e_{s_j}, x_{-j,t}) - \hat{u}_t$, the second system is globally convergent to $\mathbb{E}_H U_j(H, e_{s_j}, x_{-j})$. Then, one gets that the sequences $(x_t,\hat{u}_t)_t$ converge to the set $\{(x, \mathbb{E}_H u_j(H, e_{s_j}, x_{-j})),\ x \in \prod_j \mathcal{X}_j\}$. Now, consider the first equation $x_{t+1} = x_t + \lambda_t(\tilde{f}(x_t,\hat{u}_t) + M^{(1)}_{t+1})$. This can be rewritten as $x_{j,t+1} = x_{j,t} + \lambda_t\big(\tilde{f}(x_t, \mathbb{E}_H u_j(H, e_{s_j}, x_{-j,t})) + \tilde{f}(x_t,\hat{u}_t) - \tilde{f}(x_t, \mathbb{E}_H u_j(H, e_{s_j}, x_{-j,t})) + M^{(1)}_{t+1}\big)$. Denoting $M^{(3)}_{t+1} := \tilde{f}(x_t,\hat{u}_t) - \tilde{f}(x_t, \mathbb{E}_H u_j(H, e_{s_j}, x_{-j,t})) + M^{(1)}_{t+1}$, which goes to zero when taking the conditional expectation with respect to $\mathcal{F}_t$, the equation can be asymptotically approximated by $x_{j,t+1} = x_{j,t} + \lambda_t\big(\tilde{f}(x_t, \mathbb{E}_H U_j(H, e_{s_j}, x_{-j,t})) + M^{(3)}_{t+1}\big)$. This last learning scheme has the same asymptotic pseudotrajectory as the ODE $\dot{x}_j = \tilde{f}(x_t, \mathbb{E}_H u_j(H, e_{s_j}, x_{-j,t}))$. For equal or proportional learning rates $\lambda$ and $\nu$, the dynamics are multiplied by the corresponding ratio. Hence, the announced results follow: the first equation is obtained for $\tilde{f}_1 = \tilde{\beta}_{1,\epsilon_1}(\hat{u}_{1,t}) - x_{1,t}$, and the second equation is obtained for $\tilde{f}_2(s_2) = k_2\,x_{2,t}(s_2)\big[u_2^1(x_{1,t}, e_{s_2}) - \sum_{s'_2\in\mathcal{Q}_2} u_2^1(x_{1,t}, e_{s'_2})\,x_{2,t}(s'_2)\big]$. This completes the proof.
The convergence to the ODE for small time delays follows the same lines. Since our power-allocation game is a robust pseudopotential game, the almost-sure convergence to equilibria follows.
1) Expected Robust Games With Two Actions: For two-player expected robust games with two actions, i.e., $\mathcal{A}_1 = \{s_1^1, s_1^2\}$ and $\mathcal{A}_2 = \{s_2^1, s_2^2\}$, one can transform the system of ODEs of the strategy learning into a planar system of the form
\[
\dot{\alpha}_1 = Q_1(\alpha_1, \alpha_2), \qquad \dot{\alpha}_2 = Q_2(\alpha_1, \alpha_2) \tag{14}
\]
where we let $\alpha_j = x_j(s_j^1)$. The dynamics for transmitter $j$ can be expressed in terms of $\alpha_1$ and $\alpha_2$ only, since $x_1(s_1^2) = 1 - x_1(s_1^1)$ and $x_2(s_2^2) = 1 - x_2(s_2^1)$. We use the Poincaré–Bendixson theorem and the Dulac criterion [21] to establish a convergence result for (14).
Theorem 1 [21]: For an autonomous planar vector field as in (14), the Dulac criterion states the following. Let $\gamma(\cdot)$ be a scalar function defined on the unit square $[0,1]^2$. If $\frac{\partial[\gamma(\alpha)\dot{\alpha}_1]}{\partial\alpha_1} + \frac{\partial[\gamma(\alpha)\dot{\alpha}_2]}{\partial\alpha_2}$ is not identically zero and does not change sign in $[0,1]^2$, then there are no cycles lying entirely in $[0,1]^2$.
Corollary 3: Consider a two-player two-action game. Assume that each transmitter adopts the Boltzmann–Gibbs CODIPAS-RL with $\lambda_{i,t}/\nu_{i,t} = \lambda_t/\nu_t \longrightarrow 0$. Then, the asymptotic pseudotrajectory reduces to a planar system of the form $\dot{\alpha}_1 = \tilde{\beta}_{1,\epsilon_1}(u_1(e_{s_1}, \alpha_2)) - \alpha_1$ and $\dot{\alpha}_2 = \tilde{\beta}_{2,\epsilon_2}(u_2(\alpha_1, e_{s_2})) - \alpha_2$. Moreover, the system satisfies the Dulac criterion.
Proof: We apply Theorem 1 with $\gamma(\cdot) \equiv 1$ and obtain a divergence of $-2$, which is strictly negative. Hence, the result follows.
Note that, for the replicator dynamics, the Dulac criterion reduces to $(1-2\alpha_1)\big(u_1(e_{s_1^1}, \alpha_2) - u_1(e_{s_1^2}, \alpha_2)\big) + (1-2\alpha_2)\big(u_2(\alpha_1, e_{s_2^1}) - u_2(\alpha_1, e_{s_2^2})\big)$, which vanishes for $(\alpha_1, \alpha_2) = (1/2, 1/2)$. It is possible to have oscillating behaviors and limit cycles in replicator dynamics, and the Dulac criterion does not apply in general. However, the stability of the replicator dynamics can be studied directly in the two-action case by identifying the game with one of the following types: coordination, anticoordination, prisoner's dilemma, and hawk-and-dove (or chicken) games [46]. The following corollary follows from Theorem 1.
Corollary 4 (Heterogeneous Learning): If transmitter 1 uses the Boltzmann–Gibbs CODIPAS-RL and transmitter 2 uses a CODIPAS-RL scheme leading to replicator dynamics, then the Dulac criterion for convergence reduces to $(1-2\alpha_2)\big(u_2(\alpha_1, e_{s_2^1}) - u_2(\alpha_1, e_{s_2^2})\big) < 1$ for all $(\alpha_1, \alpha_2)$.

V. NUMERICAL INVESTIGATION
Here, we provide some numerical results illustrating our theoretical findings. We start with the two-receiver case and illustrate the convergence to the global optimum under the heterogeneous CODIPAS-RL (HCRL). Next, we study the impact of delayed feedback in the three-receiver case.
A. Two Receivers
In order to illustrate the algorithm, a simple example with two transmitters and two channels is considered. The discrete set of actions of each transmitter is described as follows. Each transmitter chooses among two possible actions, i.e., $s_1 = \mathrm{diag}[p_{\max}, 0]$ and $s_2 = \mathrm{diag}[0, p_{\max}]$, where $\mathrm{diag}$ denotes the diagonal matrix. Each transmitter follows the CODIPAS-RL algorithm as described in Section IV. The only feedback received by the transmitter is the one-step-delayed noisy payoff. A mixed strategy $x_{j,t}$ in this case corresponds to the probabilities of selecting elements in $\mathcal{Q}_1 = \mathcal{Q}_2 = \{s_1, s_2\}$, while the payoff perceived by transmitter $j$, i.e., $\hat{u}_{j,t}$, is the achievable capacity. We normalize the payoffs to $[0,1]$. The transmitters learn the estimated payoffs and strategies as described, with $\tilde{\beta}_{j,\epsilon_j}$ being the Boltzmann–Gibbs distribution with $\epsilon_j = 0.1$; $\lambda_t$ and $\nu_t$ are given by $\lambda_t = 1/(1+t)$ and $\nu_t = 1/(1+t)^{3/5}$, respectively. It is clear that the game has several equilibria, i.e., $(s_1, s_2)$, $(s_2, s_1)$, and $((1/2,1/2),(1/2,1/2))$. The action profiles $(s_1, s_2)$ and $(s_2, s_1)$ are the global optima of the normalized expected game. We observe below the convergence to one of the global optima using the heterogeneous learning.
Heterogeneous learning CODIPAS-RL (HCRL): The ODE convergence of the strategies and of the payoffs is shown in Figs. 1 and 3, respectively, where the game is played several times. We observe that, when the two transmitters use different learning patterns as in (HCRL), the convergence times are different, as well as the outcome of the game. It is important to notice that, in this example,
Fig. 1. Heterogeneous CODIPAS-RL. Convergence of the ODE of Strategies. Tx2 learns faster than Tx1.
the CODIPAS-RL converges to the global optimum of the robust game, which is also a strong equilibrium (resilient to coalitions of transmitters of any size). A simulation sketch in the spirit of this experiment is given after the figure captions below.
B. Three Receivers
Here, we illustrate the learning algorithm with two transmitters and three channels. The discrete set of actions of each transmitter is described as follows. Each transmitter chooses among three possible actions, i.e., $s_1^* = \mathrm{diag}[p_{j,\max}, 0, 0]$, $s_2^* = \mathrm{diag}[0, p_{j,\max}, 0]$, and $s_3^* = \mathrm{diag}[0, 0, p_{j,\max}]$. These strategies correspond to the case where each transmitter puts its total power on one of the channels. Each transmitter follows the CODIPAS-RL algorithm as described in Section IV. The only feedback received by the transmitter is the one-step-delayed noisy payoff, which is obtained after allocating power to the pair of receivers. A mixed strategy $x_{j,t}$ in this case corresponds to the probability of selecting an element in $\mathcal{Q}_j = \{s_1^*, s_2^*, s_3^*\}$, while the payoff perceived by transmitter $j$, i.e., $\hat{u}_{j,t}$, is the imperfect achievable capacity. We fix the parameters $n_t = 2$, $n_r = 3$, $T = 300$, $\lambda_t = 2/T$, and $\tau_j = 1$. In Fig. 2, we represent the strategy evolution of transmitters Tx1 and Tx2. We observe that the CODIPAS-RL converges to a global optimum of the expected long-run interaction. The total number of iterations needed to guarantee a small error tolerance is relatively small. In the long term, transmitter Tx1 puts its maximum power on frequency 1, which corresponds to action $s_1^*$, and Tx2 uses frequency 3. For a small fraction of time, frequency 2 is used. Thus, the transmitters do not interfere, and the equilibrium is learned (see Fig. 3).
Impact of Time-Delayed Noisy Payoffs: Next, we keep the same parameters but change the time delay to $\tau_j = 2$. In Fig. 4, we represent the strategy evolution of transmitters Tx1 and Tx2 under the delayed CODIPAS-RL. As we can see, the convergence time and the stability of the system change. The transmitters use action $s_2^*$ more often than in the scenario of Fig. 2. This is because the estimated payoff under two-step time delays is more uncertain, and the prediction is not good enough compared with the actual payoffs. The horizon needed to obtain a good prediction is much larger than in the first scenario (2000 versus 300 iterations). This scenario tells us how important the feedback delay is at the transmitter: the time delay $\tau$ can change the outcome of the interaction.
Fig. 2. CODIPAS-RL. Convergence to equilibria. The global optimum of the expected game is achieved.
Fig. 3. CODIPAS-RL. Convergence of payoff estimations. Tx2 learns faster than Tx1. (Below) Zoom.
VI. DISCUSSIONS

Here, we discuss how to extend our algorithm when an approximated gradient is not available. In other words, the question is the following: Can we extend the CODIPAS-RL to dynamic robust games with continuous action spaces, nonlinear payoffs, and only the observation of the numerical value of one's own payoff? To answer this question, we observe that, if, instead of the numerical value of the payoffs, a value of the gradient of the payoff is observed, then a descent–ascent and projection-based method can be used. Under monotone gradient payoffs, stochastic gradient-like algorithms are known to be convergent (almost surely or weakly, depending on the learning rates). However, if the gradient is not available, these techniques cannot be used. Sometimes, one needs to estimate the gradient from past numerical values, as in the Robbins–Monro procedure. Alternatively, the following CODIPAS-RL scheme can be used for the unconstrained problem:
$$\hat Q_{j,kl,t+1} = \hat Q_{j,kl,t} + \lambda_{j,t}\,\epsilon_j k_j P_{j,kl,t}\,u_{j,t} + \epsilon_j \lambda_{j,t}\,\sigma_j Z_{j,t} \qquad (15)$$
$$P_{j,kl,t} = a_{j,kl}\sin(w_{j,kl}\,t + \phi_{j,kl}),\qquad Q_{j,kl,t} = \hat Q_{j,kl,t} + P_{j,kl,t} \qquad (16)$$
$$\hat u_{j,t+1} = \hat u_{j,t} + \nu_t\,(u_{j,t} - \hat u_{j,t}) \qquad (17)$$
where $Q_{j,kl,t}$ denotes entry $(k,l)$ of the matrix $Q_{j,t}$; $Z_{j,t}$ is an independent and identically distributed Gaussian process; and $a_j$, $w_j$, and $\phi_j$ are positive real-valued matrices. We do not have a general convergence proof of this new CODIPAS-RL scheme and leave it for future work.
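As a concrete illustration of how (15)–(17) operate, the following sketch applies the same perturbation-correlation idea to a scalar toy problem in which only noisy numerical payoff values are observed. The quadratic payoff, the gains playing the roles of $\epsilon_j$, $k_j$, and $\sigma_j$, the dither parameters, and the learning-rate exponents are all illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def measured_payoff(q):
    # unknown concave payoff; the learner only observes noisy numerical values
    return 1.0 - (q - 0.7) ** 2 + 0.01 * rng.standard_normal()

# illustrative constants playing the roles of epsilon_j, k_j, sigma_j and a, w, phi in (15)-(16)
eps_j, k_j, sigma_j = 0.5, 10.0, 0.01
a_amp, w_freq, phi = 0.2, 1.3, 0.0

Q_hat, u_hat = 0.0, 0.0                        # learned entry and payoff estimate
for t in range(1, 20001):
    lam = 0.2 / (1.0 + t) ** 0.6               # lambda_{j,t}
    nu = 1.0 / (1.0 + t) ** 0.9                # nu_t
    P = a_amp * np.sin(w_freq * t + phi)       # (16): sinusoidal perturbation
    Q = Q_hat + P                              # (16): entry actually played
    u = measured_payoff(Q)
    # (15): correlating the observed payoff with the dither acts as a stochastic gradient estimate
    Q_hat += lam * eps_j * k_j * P * u + eps_j * lam * sigma_j * rng.standard_normal()
    u_hat += nu * (u - u_hat)                  # (17): payoff learning

print(round(Q_hat, 3), round(u_hat, 3))        # Q_hat should drift toward the maximizer 0.7
```

The scheme is in the spirit of extremum-seeking and simultaneous-perturbation methods: the sinusoidal dither $P$ plays the role of the probing signal, and multiplying it by the observed payoff yields, on average, an ascent direction for the unknown payoff.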
Fig. 4. CODIPAS-RL under two-step delayed payoffs. Effect of time delays.

VII. CONCLUSION AND FUTURE WORK

In this paper, we have proposed novel robust game-theoretic formulations for one of the challenging and unsolved power-allocation problems in wireless communication systems, namely, how to enable, in a decentralized way, communication over the MIMO Gaussian interference channel among multiple transmitters under uncertain channel states and delayed noisy Shannon rates (delayed imperfect payoffs). We have provided a heterogeneous delayed CODIPAS-RL algorithm for the corresponding dynamic robust games, developed an ODE approach, and illustrated the CODIPAS-RL numerically.

A number of further issues are under consideration. It would be of great interest to develop theoretical bounds on the rate of convergence of CODIPAS-RL schemes. It would also be natural to extend the analysis of our CODIPAS-RL algorithms to more classes of wireless games, including nonpotential games, outage probability under uncertain channel states, and dynamic robust games with an energy-efficiency payoff function. We also aim to generalize the CODIPAS-RL in the context of Itô's stochastic differential equations (SDEs). Typically, a strategy-learning rule of the form $x_{t+1} = x_t + \lambda_t\big(f(x_t,\hat u_t) + M_{t+1}\big) + \sqrt{\lambda_t}\,\sigma(x_t,\hat u_t)\,\xi_{t+1}$, with $\xi_{t+1}$ a standard Gaussian noise term, can be seen as an Euler scheme of the Itô SDE $dx_{j,t} = f_j(x_t,\hat u_t)\,dt + \sigma_j(x_t,\hat u_t)\,dB_{j,t}$, where $B_{j,t}$ is a standard Brownian motion in $\mathbb{R}^{|Q_j|}$. Note that the distribution of the above SDE can be expressed as a solution of a Fokker–Planck–Kolmogorov forward equation [48].
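As a quick illustration of this Euler-scheme connection, the sketch below integrates such an SDE by the Euler–Maruyama method on a two-action simplex; the replicator-type drift $f$, the constant diffusion $\sigma$, and the payoff vector are illustrative stand-ins rather than the paper's specific choices.

```python
import numpy as np

rng = np.random.default_rng(3)

u_vec = np.array([1.0, 0.6])                   # assumed expected payoffs of the two actions

def drift(x):
    return x * (u_vec - x @ u_vec)             # f(x): replicator-type drift

def diffusion(x):
    return 0.05 * np.ones_like(x)              # sigma(x): constant diffusion for the sketch

x = np.array([0.5, 0.5])
dt, n_steps = 1e-3, 20000
for _ in range(n_steps):
    dB = np.sqrt(dt) * rng.standard_normal(2)  # Brownian increments dB_t
    x = x + drift(x) * dt + diffusion(x) * dB  # Euler-Maruyama step of dx = f dt + sigma dB
    x = np.clip(x, 1e-6, None)
    x /= x.sum()                               # crude projection back onto the simplex

print(np.round(x, 3))                          # mass concentrates on the higher-payoff action
```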
ACKNOWLEDGMENT

The author would like to thank the three anonymous reviewers and Prof. T. Vasilakos for their valuable comments toward the improvement of this paper.

REFERENCES

[1] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.
[2] E. Telatar, "Capacity of multi-antenna Gaussian channels," Bell Labs., Murray Hill, NJ, Tech. Rep., 1995.
[3] W. Yu, W. Rhee, S. Boyd, and J. Cioffi, "Iterative water-filling for Gaussian vector multiple-access channels," IEEE Trans. Inf. Theory, vol. 50, no. 1, pp. 145–152, Jan. 2004.
[4] M. Aghassi and D. Bertsimas, "Robust game theory," Math. Program., vol. 107, no. 1/2, pp. 231–273, Jun. 2006.
[5] A. J. Anandkumar, A. Anandkumar, S. Lambotharan, and J. Chambers, "Robust rate maximization game under bounded channel uncertainty," in Proc. IEEE ICASSP, Mar. 2010, pp. 3158–3161.
[6] Y. Wu, K. Yang, J. Huang, X. Wang, and M. Chiang, "Distributed robust optimization part II: Wireless power control," J. Optim. Eng., 2009, submitted for publication.
[7] G. Scutari, D. P. Palomar, J. S. Pang, and F. Facchinei, "Flexible design of cognitive radio wireless systems: From game theory to variational inequality theory," IEEE Signal Process. Mag., vol. 26, no. 5, pp. 107–123, Sep. 2009.
[8] G. Arslan, M. F. Demirkol, and Y. Song, "Equilibrium efficiency improvement in MIMO interference systems: A decentralized stream control approach," IEEE Trans. Wireless Commun., vol. 6, no. 8, pp. 2984–2993, Aug. 2007.
[9] G. Arslan, M. F. Demirkol, and S. Yüksel, "Power games in MIMO interference systems," in Proc. GameNets, Istanbul, Turkey, May 2009, pp. 52–59.
[10] L. Lai and H. El Gamal, "The water-filling game in fading multiple-access channels," IEEE Trans. Inf. Theory, vol. 54, no. 5, pp. 2110–2122, May 2008.
[11] H. Tembine, E. Altman, R. El-Azouzi, and Y. Hayel, "Evolutionary games in wireless networks," IEEE Trans. Syst., Man, Cybern. B, Cybern. (Special Issue on Game Theory), vol. 40, no. 3, pp. 634–646, Jun. 2010.
[12] H. Tembine, A. Kobbane, and M. El Koutbi, "Robust power allocation games under channel uncertainty and time delays," in Proc. IFIP Wireless Days, 2010, pp. 1–5.
[13] Q. Zhu, H. Tembine, and T. Başar, "Heterogeneous learning in zero-sum stochastic games with incomplete information," in Proc. 49th IEEE CDC, 2010, pp. 1–6.
[14] S. M. Perlaza, H. Tembine, and S. Lasaulce, "How can ignorant but patient cognitive terminals learn their strategy and utility," in Proc. IEEE SPAWC, 2010, pp. 1–5.
[15] H. Tembine, "Distributed strategic learning in dynamic robust games: Dynamics, algorithms and applications," Lecture Notes, Supélec, Jan. 2010.
[16] H. Tembine, E. Altman, R. El-Azouzi, and W. H. Sandholm, "Evolutionary game dynamics with migration for hybrid power control in wireless communications," in Proc. 47th IEEE CDC, Dec. 2008, pp. 4479–4484.
[17] H. Tembine, "Population games with networking applications," Ph.D. dissertation, Univ. Avignon, Avignon, France, Sep. 2009.
[18] R. Bellman, "A Markov decision process," J. Math. Mech., vol. 6, pp. 679–684, 1957.
[19] A. Barto, R. Sutton, and C. Anderson, "Neuron-like adaptive elements that can solve difficult learning control problems," IEEE Trans. Syst., Man, Cybern., vol. SMC-13, no. 5, pp. 834–846, Sep./Oct. 1983.
[20] D. Monderer, "Multipotential games," in Proc. 20th IJCAI, 2007, pp. 1422–1427.
[21] J. Guckenheimer and P. Holmes, Nonlinear Oscillations, Dynamical Systems, and Bifurcations of Vector Fields. New York: Springer-Verlag, 1983.
[22] E. Belmega, H. Tembine, and S. Lasaulce, "Learning to precode in outage minimization games over MIMO interference channels," in Proc. IEEE Asilomar Conf. Signals, Syst., Comput., Pacific Grove, CA, Nov. 2010, pp. 1–6.
[23] G. Scutari, D. Palomar, and S. Barbarossa, "The MIMO iterative waterfilling algorithm," IEEE Trans. Signal Process., vol. 57, no. 5, pp. 1917–1935, May 2009.
[24] G. Scutari, D. P. Palomar, and S. Barbarossa, "Asynchronous iterative waterfilling for Gaussian frequency-selective interference channels," IEEE Trans. Inf. Theory, vol. 54, no. 7, pp. 2868–2878, Jul. 2008.
[25] M. Thathachar, P. Sastry, and V. V. Phansalkar, "Decentralized learning of Nash equilibria in multiperson stochastic games with incomplete information," IEEE Trans. Syst., Man, Cybern., vol. 24, no. 5, pp. 769–777, May 1994.
[26] W. Arthur, "On designing economic agents that behave like human agents," J. Evol. Econ., vol. 3, no. 1, pp. 1–22, Feb. 1993.
[27] T. Borgers and R. Sarin, "Learning through reinforcement and replicator dynamics," Mimeo, Univ. College London, London, U.K., 1993.
[28] A. Roth and I. Erev, "Learning in extensive form games: Experimental data and simple dynamic models in the intermediate term," Games Econ. Behavior, vol. 8, no. 1, pp. 164–212, 1995.
[29] D. Monderer and L. S. Shapley, "Potential games," Games Econ. Behavior, vol. 14, pp. 124–143, 1996.
[30] Y. Xing and R. Chandramouli, "Stochastic learning solution for distributed discrete power control game in wireless data networks," IEEE/ACM Trans. Netw., vol. 16, no. 4, pp. 932–944, Aug. 2008.
[31] H. J. Kushner and G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, 2nd ed. New York: Springer-Verlag, 2003.
[32] H. J. Kushner and G. Yin, Stochastic Approximation Algorithms and Applications, 1st ed. New York: Springer-Verlag, 1997.
[33] M. Benaïm, "Dynamics of stochastic approximations," in Le Séminaire de Probabilités. New York: Springer-Verlag, 1999, pp. 1–68.
[34] V. Borkar, "Stochastic approximation: A dynamical systems viewpoint," in Texts and Readings in Mathematics 48. New Delhi, India: Hindustan Book Agency, 2008.
[35] D. S. Leslie and E. J. Collins, "Convergent multiple timescales reinforcement learning algorithms in normal form games," Ann. Appl. Probab., vol. 13, no. 4, pp. 1231–1251, 2003.
[36] P. D. Taylor and L. B. Jonker, "Evolutionarily stable strategies and game dynamics," Math. Biosci., vol. 40, no. 1/2, pp. 145–156, Jul. 1978.
[37] J. R. Marden, H. P. Young, G. Arslan, and J. S. Shamma, "Payoff-based dynamics for multi-player weakly acyclic games," SIAM J. Control Optim., vol. 48, no. 1, pp. 373–396, 2009.
[38] M. P. Anastasopoulos, D. K. Petraki, R. Kannan, and A. V. Vasilakos, "TCP throughput adaptation in WiMax networks using replicator dynamics," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 40, no. 3, pp. 647–655, Jun. 2010.
[39] A. V. Vasilakos and M. P. Anastasopoulos, "Application of evolutionary game theory to wireless mesh networks," in Advances in Evolutionary Computing for System Design, L. Jain, Ed. New York: Springer-Verlag, 2007.
[40] B. An, A. V. Vasilakos, and V. Lesser, "Evolutionary stable resource pricing strategies," in Proc. ACM SIGCOMM, Barcelona, Spain, Aug. 2009.
[41] Y. Wang, A. Nakao, A. V. Vasilakos, and J. Ma, "P2P soft security: On evolutionary dynamics of P2P incentive mechanism," Comput. Commun. (COMCOM), Feb. 2010. [Online]. Available: http://dx.doi.org/10.1016/j.comcom.2010.01.021
[42] H. Tembine, E. Altman, R. El-Azouzi, and Y. Hayel, "Evolutionary games with random number of interacting players with application to access control," in Proc. WiOpt, 2008, pp. 344–351.
[43] H. Tembine, E. Altman, and R. El-Azouzi, "Delayed evolutionary game dynamics applied to medium access control," in Proc. 4th IEEE Int. Conf. MASS, Pisa, Italy, 2007, pp. 1–6.
[44] Q. Zhu, H. Tembine, and T. Başar, "Evolutionary games for hybrid additive white Gaussian noise multiple access control," in Proc. IEEE GLOBECOM, 2009, pp. 1–6.
[45] E. Altman, R. El-Azouzi, Y. Hayel, and H. Tembine, "Evolutionary power control games in wireless networks," in Proc. 7th Int. IFIP-TC6 Netw. Conf. AdHoc Sens. Netw., Wireless Netw., Next Gener. Internet, 2008, pp. 930–942.
[46] J. Weibull, Evolutionary Game Theory. Cambridge, MA: MIT Press, 1995.
[47] J. Hofbauer and K. Sigmund, Evolutionary Games and Population Dynamics. Cambridge, U.K.: Cambridge Univ. Press, 1998.
[48] H. Tembine, "Mean field stochastic games: Simulation, dynamics and network applications," Lecture notes, Unpublished manuscript, Supélec, Oct. 2010.
Hamidou Tembine received two M.S. degrees, one from the Ecole Polytechnique, Paris, France, and one from Joseph Fourier University, Grenoble, France, both in 2006, and the Ph.D. degree in computer science from the University of Avignon, Avignon, France, in 2009. His Ph.D. thesis was entitled "Population Games With Networking Applications." From 2007 to 2009, he was a Research Assistant with the Department of Computer Science, University of Avignon, and a Teaching Assistant with Aix-Marseille University, Aix-en-Provence, France. He is currently an Assistant Professor with the Ecole Supérieure d'Electricité (Supélec), Paris, France. His main research interests are population games, mean-field stochastic games, differential population games, and their applications.