
Stochastic Online Learning for Network Optimization under Random Unknown Weights

Keqin Liu, Qing Zhao

Abstract—We consider network optimization problems under random weights with unknown distributions. We first consider the shortest path problem that aims to optimize the quality of communication between a source and a destination through adaptive path selection. Due to the randomness and uncertainties in the network dynamics, the state of each communication link varies over time according to a stochastic process with unknown distributions. The link states are not directly observable. The aggregated end-to-end cost of a path from the source to the destination is revealed after the path is chosen for communication. The objective is to design an adaptive path selection algorithm that minimizes regret, defined as the additional cost over the ideal scenario where the best path is known a priori. This problem can be cast as a variation of the classic multi-armed bandit (MAB) problem with each path as an arm and arms dependent through common links. We show that by exploiting arm dependencies, a regret polynomial with the network size can be achieved while maintaining the optimal logarithmic order with time. This is in sharp contrast with the exponential regret order with the network size offered by a direct application of the classic MAB policies that ignores arm dependencies. Furthermore, these results are obtained under a general model of link state distributions (including heavy-tailed distributions). We then extend the results to general network optimization problems (e.g., minimum spanning tree and dominating set) and stochastic online linear optimization problems. These results find applications in cognitive radio and ad hoc networks with unknown and dynamic communication environments.

This work was supported by the National Science Foundation under Grant CCF-0830685, by the Army Research Office under Grant W911NF-08-1-0467, and by the Army Research Lab under Grant W911NF-09-D-0006. Part of the results was presented at the 10th International Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks and 2012 Information Theory and Applications Workshop. Keqin Liu and Qing Zhao are with the Department of Electrical and Computer Engineering, University of California, Davis, CA, 95616, USA. Email: {kqliu, qzhao}@ucdavis.edu.


I. INTRODUCTION

We consider network optimization problems under random weights with unknown distributions. We first focus on finding the shortest path between a source and a destination and then extend the results to general network optimization problems including minimum spanning tree and minimum (connected) dominating set. Consider a communication network where the state of each communication link is modeled as a random cost (or reward) that evolves according to an i.i.d. random process with an unknown distribution. At each time, a path from the source to the destination is selected as the communication route, and the total end-to-end cost, given by the sum of the costs of all links on the path, is subsequently observed. The cost of each individual link is not observable. The objective is to design an optimal sequential path selection policy to minimize the long-term total cost.

A. Stochastic Online Learning based on Multi-Armed Bandit

The above problem can be modeled as a variation of the classic Multi-Armed Bandit (MAB) problem. In the classic MAB [1]–[5], there are N independent arms and a player needs to decide which arm to play at each time. An arm, when played, incurs a random cost drawn from an unknown distribution. The performance of a sequential arm selection policy is measured by regret, defined as the difference in expected total cost with respect to the optimal policy in the ideal scenario with known cost models where the player always plays the best arm. The optimal regret was shown by Lai and Robbins to be logarithmic with time [1]. Since arms are assumed independent, observations from one arm do not provide information about other arms. The optimal regret thus grows linearly with the number of arms.

The adaptive shortest-path routing problem considered here can be cast as an MAB by treating each path as an arm. The difference is that paths are dependent through common links. While the dependency across paths can be ignored in learning and the policies for the classic MAB proposed in [1]–[5] directly apply, such a naive approach yields poor performance, with a regret growing linearly with the number of paths and thus exponentially with the network size (i.e., the number of links, which determines the number of unknowns) in the worst case. In this paper, we show that by exploiting the structure of the path dependencies, a regret polynomial with the network size can be achieved while preserving the optimal logarithmic order with time. Specifically, we propose an algorithm that achieves $O(m d^3 \log T)$ regret for
all light-tailed cost distributions, where m is the number of edges, d the dimension of the path set (which is upper bounded by m), and T the time horizon length. We further show that by sacrificing an arbitrarily small regret order with time, the proposed algorithm achieves regret linear with d. Specifically, the algorithm offers $O(d\, f(T) \log T)$ regret, where f(T) is an arbitrarily slowly diverging function with f(T) → ∞ as T → ∞. The proposed algorithm thus offers a performance tradeoff in terms of the network size and the time horizon length. For heavy-tailed cost distributions, a regret linear with the size of the network and sublinear with time can be achieved. Specifically, the algorithm offers $O(d T^{1/q})$ regret when the moments of the cost distributions exist up to the qth order. We point out that any regret sublinear with time implies the convergence of the time-average cost to that of the best path. These results also apply to the general problem of finding a minimum-weight subgraph satisfying certain properties (e.g., minimum spanning tree and minimum connected dominating set) in a network. The results thus find a broad area of applications, e.g., broadcasting in wireless networks.

We further generalize the weight minimization problem to stochastic online linear optimization problems, where the action space is a compact subset of the vector space $\mathbb{R}^d$ with dimension d ≥ 1. At each time, an action is chosen from the action space and a random cost with an unknown distribution (depending on the action) is incurred. The expectation of the random cost is given by an unknown linear function over the action space. We propose a policy that achieves $O(d\, T^{2/3} \log^{1/3} T)$ regret for all light-tailed cost distributions. In cases when the expected cost of the optimal action is bounded away from that of other actions (e.g., when the action set is a polytope or finite), the same regret orders obtained for the weight minimization problem can be achieved.

B. Applications

One application example of the shortest-path problem is adaptive routing in cognitive radio networks where secondary users communicate by exploiting channels temporarily unoccupied by primary users [6]. In this case, the availability of each link dynamically varies according to the communication activities of nearby primary users. The delay on each link can thus be modeled as a stochastic process unknown to the secondary users. The objective is to route through the path with the smallest latency (i.e., the lightest primary traffic) through stochastic online learning. The application of the general weight minimization problem includes broadcasting/multicasting
in communication networks [7]. For example, consider a wireless ad hoc network where link states vary stochastically due to channel fading or random contentions with other users. Under this scenario, the objective is to find the minimum-weight spanning tree such that the message from a node can be most efficiently delivered to every node in the network. Note that we can also define a weight on each node instead of each link. Another application example is the minimum-weight dominating set problem where the weight of a dominating set is given by the sum of the weights of all nodes in the set [8]. In an ad hoc network, such a problem arises in adaptive selection of the communications backbone while each node is subject to random degradations and/or intrusions. In a more general context, the stochastic linear optimization problem finds applications when the action space is uncountable, e.g., power control in energy and communication systems or dynamic pricing in social economic networks [9].

C. Related Work

Several classic policies exist for the classic MAB with independent arms [1]–[4]. These policies achieve the optimal logarithmic regret order for certain specific light-tailed cost/reward distributions. In particular, Auer et al. proposed in [4] the so-called UCB policy that achieves the logarithmic regret order for all distributions with bounded support. In [10], [11], the UCB policy was extended to all light-tailed distributions. In [5], we proposed a learning algorithm based on a deterministic sequencing of exploration and exploitation (DSEE) that achieves the optimal logarithmic regret order for all light-tailed distributions and a sublinear regret order for heavy-tailed distributions. While the policies proposed in this paper build upon the basic structure of DSEE, the problems addressed in this paper are more general and more complex than the classic MAB problems considered in [5]. Specifically, we relax two main assumptions—the independence of arms and the finiteness of the action space—in the classic MAB and address a more general class of online learning problems particularly relevant to communication networks.

In the context of the general weight minimization problem, most existing work focuses on deterministic and known weight/cost functions (e.g., [7], [8], [12]–[17]). The work most relevant to this paper is [18], where the link states are considered as random with unknown distributions. Under the assumption of link-level observability and weight distributions with bounded support, a learning algorithm was proposed in [18] based on the UCB policy in [4] to achieve $O(m^4 \log T)$
regret, where m is the number of links in the network. The learning algorithm proposed in this paper improves the regret order under a less informative observation model and a more general class of cost distributions. Furthermore, our algorithm directly applies to the model with link-level observability as adopted in [18] and achieves $O(md \log T)$ regret. In [19], the problem was addressed under an adversarial bandit model in which the link costs are chosen by an adversary and are treated as arbitrary bounded deterministic quantities. An algorithm was proposed to achieve regret sublinear with time and polynomial with the network size [19].

In the context of the general stochastic online linear optimization problem, the work most relevant to this paper is [20], [21]. In [20], the regret was shown to be lower bounded by the order of $d\sqrt{T}$, and an algorithm with $O(d^{3/2} \sqrt{T} \log^{3/2} T)$ regret was proposed for cost distributions with bounded support. A more complicated algorithm, which includes a subroutine of solving an NP-hard problem, was proposed to achieve an $O(d \sqrt{T} \log^{3/2} T)$ regret [20]. Both algorithms proposed in [20] require an optimization oracle as a subroutine and have high complexity in general. In [21], an algorithm was proposed to achieve an $O(d \sqrt{T})$ regret when the cost distributions are sub-Gaussian and the action space satisfies certain smoothness and boundedness conditions. By relaxing the smoothness condition, an algorithm was proposed to achieve an $O(d \sqrt{T} \log^{3/2} T)$ regret based on Dani et al.'s algorithms. In this paper, we focus on the general compact action space and all light-tailed (i.e., locally sub-Gaussian) cost distributions and propose a low-complexity algorithm to achieve an $O(d\, T^{2/3} \log^{1/3} T)$ regret. Dani et al. [20] also considered the special case where the action space is a polytope or finite. An algorithm was proposed to achieve an $O(d^2 \log^3 T)$ regret given a nontrivial and known lower bound on the difference in expected cost between the best and other actions [20]. The policy proposed in this paper with $O(d\, f(T) \log T)$ regret improves the regret order in both T and d without any knowledge of the cost models.

The problem can also be cast in the broader context of the continuum-armed bandit with an uncountable number of arms [22]–[25]. In [22], Agrawal proposed a policy with $O(T^{3/4})$ regret for d = 1 under the assumption that the expected cost is Lipschitz-continuous in the action. Note that this condition is satisfied in the problem considered in this paper. In [23], Kleinberg improved Agrawal's policy to achieve an $O(T^{2/3} \log^{1/3} T)$ regret for d = 1. For the general case of d > 1, an $O(d^3 T^{3/4})$ regret is achieved under certain conditions that are not necessarily satisfied in the problem at hand. The policy proposed in this paper offers the same regret order as Kleinberg's policy for d = 1. For d > 1, the proposed policy has
a better regret order in both T and d. The comparison, however, cannot be done on an equal footing since the two policies are developed under different assumptions on the cost model. Under certain more restrictive and complex conditions, Auer et al. [24] developed policies with a regret order close to $T^{1/2}$ (up to an additional sublogarithmic term) for d = 1, and Cope [25] closed the sublogarithmic gap after imposing additional assumptions.

In a broader context, this paper studies a variation of the classic MAB by addressing learning from arm dependencies. Other variations of MABs that have been studied recently include decentralized MAB with multiple players [26]–[29] and MABs with restless Markovian rewards [30]–[33].

II. PROBLEM FORMULATION

A. Adaptive Shortest-Path Routing

Consider the communication between a source s and a destination r in a network. Let G = (V, E) denote the directed graph consisting of all simple paths from s to r (see Fig. 1). Let m and n denote, respectively, the number of edges and vertices in graph G.

Fig. 1. The network model for adaptive shortest-path routing.

Let $W_e(t)$ denote the weight of edge e at time t. We assume that for each edge e, $W_e(t)$ is an i.i.d. process in time with an unknown distribution. Edge weights may have different distributions and arbitrary unknown dependencies. At the beginning of each time slot t, a path i ∈ P (i ∈ {1, 2, ..., |P|}) from s to r is chosen, where P is the set of all paths from s to r in G. Subsequently, the total end-to-end cost $C_i(t)$ of path i, given by the sum of the weights of all edges on the path, is revealed at the end of the slot. The edge cost $W_e(t)$ is not directly observable. The regret of a path selection policy π is defined as the expected additional cost incurred over time T compared to the optimal policy that always selects the best path (i.e., the path with
the minimum expected total end-to-end cost). The objective is to design a path selection policy π to minimize the growth rate of the regret with respect to both the horizon length T and the network size m (the number of unknowns). Let σ be a permutation of all paths such that $E[C_{\sigma(1)}(t)] \le E[C_{\sigma(2)}(t)] \le \cdots \le E[C_{\sigma(|P|)}(t)]$. We define the regret as
$$R^\pi(T) \triangleq E\left[\sum_{t=1}^{T} \left(C_\pi(t) - C_{\sigma(1)}(t)\right)\right],$$
where $C_\pi(t)$ denotes the total end-to-end cost of the selected path under policy π at time t.

B. Vector Representation

We represent each path i as a vector $p_i$ with m entries consisting of 0's and 1's representing whether or not each edge is on the path. The vector space of all paths is embedded in a d-dimensional (d ≤ m) subspace¹ of $\mathbb{R}^m$. The cost of path i at time t is thus given by the linear function
$$C_i(t) = \langle [W_1(t), W_2(t), \ldots, W_m(t)],\ p_i \rangle.$$

The vector space of all paths is a compact subset of $\mathbb{R}^d$ with dimension d ≤ m. For any compact subset of $\mathbb{R}^d$, there exists a barycentric spanner (defined below), which can be efficiently constructed [19].

Definition 1: A set $B = \{p_1, \ldots, p_d\}$ is a barycentric spanner for a d-dimensional set P (B ⊂ P) if every p ∈ P can be expressed as a linear combination of elements of B using coefficients in [−1, 1].
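As a toy illustration of the vector representation and of Definition 1 (our own example on a hypothetical four-edge network; the paper gives no code), the snippet below expresses one path in a candidate basis of paths, checks that the coefficients lie in [−1, 1], and verifies that the expected path cost is the corresponding linear combination of the basis path costs.

```python
import numpy as np

# Hypothetical network with four edges; paths are 0/1 edge-incidence vectors.
p1 = np.array([1, 1, 0, 0])   # path using edges {e1, e2}
p2 = np.array([0, 0, 1, 1])   # path using edges {e3, e4}
p3 = np.array([1, 0, 0, 1])   # path using edges {e1, e4}
p4 = np.array([0, 1, 1, 0])   # path using edges {e2, e3}

# Candidate spanner B = {p1, p2, p3}; the path space here has dimension d = 3.
B = np.stack([p1, p2, p3], axis=1)          # m x d matrix with basis paths as columns

# Express p4 in the basis and check the barycentric-spanner condition.
a, *_ = np.linalg.lstsq(B, p4, rcond=None)  # here a = (1, 1, -1)
assert np.allclose(B @ a, p4)               # p4 = sum_k a_k p_k
assert np.all(np.abs(a) <= 1 + 1e-9)        # coefficients lie in [-1, 1]

# The expected path cost is linear in the edge weights: C_i = <w, p_i>,
# so the cost of p4 interpolates from the costs of the basis paths.
w = np.array([0.5, 1.2, 0.3, 0.9])          # hypothetical mean edge weights
assert np.isclose(w @ p4, (w @ B) @ a)
```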

III. ADAPTIVE SHORTEST-PATH ROUTING ALGORITHM

In this section, we propose an online learning algorithm for adaptive routing under unknown edge weights and analyze the regret performance of the proposed algorithm in both the network size and the time horizon length.

¹If graph G is acyclic, then d = m − n + 2.


A. The Algorithm

The algorithm first partitions time into interleaving exploration and exploitation sequences. In the exploration sequence, we sample the d basis vectors in the barycentric spanner in a round-robin fashion to obtain an empirical mean of the cost of each basis path. In the exploitation sequence, we obtain an estimate of the cost of each path as a linear combination of the empirical means of the basis paths and select the path with the minimum estimated cost. Specifically, in a slot t that belongs to the exploitation sequence, let $\theta_{p_k}(t)$ denote the current empirical mean of the cost of the basis path $p_k$ (k = 1, 2, ..., d), where we have labeled the paths in the barycentric spanner as the first d paths in P. For any path $p_i \in P$, let $\{a_k : |a_k| \le 1\}_{k=1}^{d}$ be the coefficients such that $p_i = \sum_{k=1}^{d} a_k p_k$. The estimated cost of this path is then given by
$$\theta_{p_i}(t) = \sum_{k=1}^{d} a_k \theta_{p_k}(t). \qquad (1)$$
The path $\hat{p}^*(t)$ with
$$\hat{p}^*(t) = \arg\min\{\theta_{p_i}(t),\ i = 1, 2, \ldots, |P|\} \qquad (2)$$
is then chosen at this exploitation time t. A detailed algorithm is given in Fig. 2.

With the above structure, the only remaining issue is the design of the cardinality of the exploration sequence, which balances the tradeoff between learning and exploitation. Specifically, we need sufficient exploration time to guarantee the learning accuracy. However, the regret order is also lower bounded by the cardinality of the exploration sequence, since a fixed portion of this sequence would be spent on suboptimal paths. To minimize the regret growth rate, the cardinality of the exploration sequence should be set to the minimum that ensures the cost incurred in the exploitation sequence (by choosing a wrong path) has an order no higher than the cardinality of the exploration sequence. As shown in the next subsection, the cardinality of the exploration sequence can be chosen according to the concentration of the edge weight distributions to achieve the optimal logarithmic regret order with T for all light-tailed distributions and a sublinear regret order with T for heavy-tailed distributions. For both light-tailed and heavy-tailed distributions, the regret order is polynomial with the network size.


Adaptive Shortest-Path Routing Algorithm

• Notations and Inputs: Construct a barycentric spanner $B = \{p_1, p_2, \ldots, p_d\}$ of the vector space of all paths. Let A(t) denote the set of time indices that belong to the exploration sequence up to (and including) time t, with A(1) = {1}. Let |A(t)| denote the cardinality of A(t). Let $\theta_{p_k}(t)$ denote the sample mean of path $p_k$ (k ∈ {1, ..., d}) computed from the past cost observations on the path. For two positive integers k and l, define $k \oslash l \triangleq ((k - 1) \bmod l) + 1$, which is an integer taking values from 1, 2, ..., l.

• At time t,
  1. if t ∈ A(t), choose path $p_k$ with $k = |A(t)| \oslash d$;
  2. if t ∉ A(t), estimate the cost of each path $p_i \in P$ according to (1) and choose path $\hat{p}^*(t)$ as given in (2).

Fig. 2. The adaptive shortest-path routing algorithm.
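To make the procedure in Fig. 2 concrete, the following minimal Python sketch implements the same exploration/exploitation structure. It is an illustration under our own interface assumptions (the path incidence matrix, spanner indices, barycentric coefficients, and the exploration-schedule rule are supplied externally), not the authors' implementation.

```python
import numpy as np

class AdaptiveShortestPathRouting:
    """Sketch of the DSEE-based routing algorithm in Fig. 2 (illustrative only).

    paths:            |P| x m binary matrix of edge-incidence vectors, one row per path.
    spanner:          indices (into paths) of the d basis paths forming a barycentric spanner.
    coeffs:           |P| x d matrix with |coeffs[i, k]| <= 1 such that
                      paths[i] = sum_k coeffs[i, k] * paths[spanner[k]].
    explore_schedule: callable (t, n_explored) -> bool deciding whether slot t is an
                      exploration slot (e.g., one of the rules in Theorems 1-3).
    """

    def __init__(self, paths, spanner, coeffs, explore_schedule):
        self.paths = paths
        self.spanner = list(spanner)
        self.coeffs = coeffs
        self.explore_schedule = explore_schedule
        d = len(self.spanner)
        self.sums = np.zeros(d)      # cumulative observed cost of each basis path
        self.counts = np.zeros(d)    # number of exploration plays of each basis path
        self.n_explored = 0          # |A(t)|: exploration slots used so far
        self._last_basis = None      # basis index played in the current slot, if any

    def select_path(self, t):
        """Return the index (into paths) of the path to route on in slot t."""
        d = len(self.spanner)
        if self.explore_schedule(t, self.n_explored):
            # Exploration: play the basis paths in round-robin order (k = |A(t)| ⊘ d).
            self.n_explored += 1
            k = (self.n_explored - 1) % d
            self._last_basis = k
            return self.spanner[k]
        # Exploitation: interpolate all path costs from basis sample means, Eqs. (1)-(2).
        self._last_basis = None
        theta_basis = self.sums / np.maximum(self.counts, 1)   # unvisited bases default to 0
        theta_paths = self.coeffs @ theta_basis
        return int(np.argmin(theta_paths))

    def observe_cost(self, end_to_end_cost):
        """Record the observed end-to-end cost of the path chosen in the current slot."""
        if self._last_basis is not None:     # this sketch uses exploration observations only
            self.sums[self._last_basis] += end_to_end_cost
            self.counts[self._last_basis] += 1
```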

B. Regret Analysis

1) Light-Tailed Edge Weight Distributions: We recall the definition of light-tailed distributions below.

Definition 2: A random variable X is light-tailed if its moment-generating function exists, i.e., there exists a $u_0 > 0$ such that for all $|u| \le u_0$,
$$M(u) \triangleq E[\exp(uX)] < \infty.$$
Otherwise X is heavy-tailed. For a zero-mean light-tailed random variable X, we have [34], for all $|u| \le u_0$,
$$M(u) \le \exp(\zeta u^2 / 2), \qquad (3)$$
where $\zeta \ge \sup\{M^{(2)}(u),\ -u_0 \le u \le u_0\}$ with $M^{(2)}(\cdot)$ denoting the second derivative of $M(\cdot)$ and $u_0$ the parameter specified in Definition 2. From (3), we have the following extended
Chernoff-Hoeffding bound on the deviation of the sample mean from the true mean for light-tailed random variables [5].

Lemma 1 [5]: Let $\{X(t)\}_{t=1}^{\infty}$ be i.i.d. light-tailed random variables. Let $\overline{X}_s = (\sum_{t=1}^{s} X(t))/s$ and $\theta = E[X(1)]$. We have, for all $\delta \in [0, \zeta u_0]$ and $a \in (0, 1/(2\zeta)]$,
$$\Pr(|\overline{X}_s - \theta| \ge \delta) \le 2\exp(-a\delta^2 s). \qquad (4)$$
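As a quick numerical sanity check of Lemma 1 (our own snippet, not from the paper), the following Monte Carlo experiment compares the empirical deviation frequency of the sample mean of standard normal variables with the bound in (4), using constants that satisfy the stated conditions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Standard normal samples are light-tailed: M(u) = exp(u^2 / 2). With u0 = 1,
# M''(u) = (1 + u^2) exp(u^2 / 2) is maximized at u = 1, so zeta = 2 * exp(0.5)
# works in (3), and any a <= 1 / (2 * zeta) is admissible in (4).
zeta = 2 * np.exp(0.5)
a = 1 / (2 * zeta)
delta, s, trials = 0.5, 50, 20000            # delta <= zeta * u0 as required

sample_means = rng.standard_normal((trials, s)).mean(axis=1)
empirical = np.mean(np.abs(sample_means) >= delta)   # here theta = 0
bound = 2 * np.exp(-a * delta ** 2 * s)

print(f"empirical {empirical:.4f} <= bound {bound:.4f}")  # the bound should dominate
```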

Note that the path cost is the sum of the costs of all edges on the path. Since the number of edges in a path is upper bounded by m, the bound on the moment-generating function in (3) holds for the path cost with ζ replaced by mζ, and so does the Chernoff-Hoeffding bound in (4).

Theorem 1: Construct an exploration sequence as follows. Let a, ζ, $u_0$ be the constants such that (4) holds for each edge cost. Let d ≤ m be the dimension of the path set. Choose a constant $b > 2m/a$, a constant
$$c \in \Big(0,\ \min_{j:\, E[C_{\sigma(j)}(t) - C_{\sigma(1)}(t)] > 0} \{E[C_{\sigma(j)}(t) - C_{\sigma(1)}(t)]\}\Big),$$
and a constant $w \ge \max\{b/(m d \zeta u_0)^2,\ 4b/c^2\}$. For each t > 1, if $|A(t-1)| < d\lceil d^2 w \log t \rceil$, then include t in A(t). Under this exploration sequence, the adaptive routing algorithm has regret $R(T) \le A m d^3 \log T$ for some constant A independent of d and m.

Proof: Since $|A(T)| \le d\lceil d^2 w \log T \rceil$, the regret caused in the exploration sequence is at the order of $m d^3 \log T$. Now we consider the regret caused in the exploitation sequence. Let $E_k$ denote the kth exploitation period, which is the kth contiguous segment in the exploitation sequence, and let $t_k > 1$ denote its starting time. Similar to the proof of Theorem 3.1 in [5], we have
$$|E_k| \le h t_k \qquad (5)$$
for some constant h independent of d and m. Next, we show that by applying the Chernoff-Hoeffding bound in (4) on the path cost, for any t in the kth exploitation period and i = 1, ..., d, we have
$$\Pr(|E[C_{p_i}(t)] - \theta_{p_i}(t)| \ge c/(2d)) \le 2 t_k^{-ab/m}.$$

To show this, we define the parameter $\epsilon_i(t) \triangleq \sqrt{b \log t / \tau_i(t)}$, where $\tau_i(t)$ is the number of times that path i has been sampled up to time t. From the definition of parameter b, we have
$$\epsilon_i(t) \le \min\{m \zeta u_0,\ c/(2d)\}. \qquad (6)$$
Applying the Chernoff-Hoeffding bound, we arrive at
$$\Pr(|E[C_{p_i}(t)] - \theta_{p_i}(t)| \ge c/(2d)) \le \Pr(|E[C_{p_i}(t)] - \theta_{p_i}(t)| \ge \epsilon_i(t)) \le 2 t_k^{-ab/m}.$$
In the exploitation sequence, the expected number of times that at least one path in B has a sample mean deviating from its true mean cost by c/(2d) is thus bounded by
$$\sum_{k=1}^{\infty} 2 d\, t_k^{-ab/m} \cdot h t_k \le \sum_{t=1}^{\infty} 2 h d\, t^{1 - ab/m} = g d \qquad (7)$$
for some constant g independent of d and m. Based on the property of the barycentric spanner, the best path would not be selected in the exploitation sequence only if one of the basis vectors in B has a sample mean deviating from its true mean cost by at least c/(2d). We thus proved the theorem.

We point out that under the model with link-level observability as adopted in [18], a slight modification to the proof of Theorem 1 leads to $O(md \log T)$ regret². In Theorem 1, we need a lower bound (parameter c) on the difference in the expected costs of the best and the second best paths. We also need to know the bounds on parameters ζ and $u_0$ such that the Chernoff-Hoeffding bound (4) holds. These bounds are required in defining w, which specifies the minimum leading constant of the logarithmic cardinality of the exploration sequence necessary for identifying the best path. Similar to [5], we can show that without any knowledge of the cost models, increasing the cardinality of the exploration sequence of $\pi^*$ by an arbitrarily small amount leads to a regret linear with d and arbitrarily close to the logarithmic order with time.

²Specifically, after each basis vector has been played O(log T) times, each path has also been observed the same number of times. We can thus choose a constant (linear with d) in the O(log T) cardinality to bound the error of the sample mean of each path by c/2 with high probability.
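Under the link-level observation model of [18] mentioned above, the estimation step becomes even simpler, since per-edge sample means can be maintained directly and any path cost follows by summation. A minimal sketch (our own, with our variable names):

```python
import numpy as np

def estimate_path_costs_link_level(edge_sums, edge_counts, paths):
    """Estimate all path costs when individual edge costs are observable.

    edge_sums, edge_counts: length-m arrays of cumulative observed edge costs and counts.
    paths:                  |P| x m binary edge-incidence matrix.
    """
    edge_means = edge_sums / np.maximum(edge_counts, 1)
    return paths @ edge_means   # estimated end-to-end cost of every path
```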


Theorem 2: Let f(t) be any positive increasing sequence with f(t) → ∞ as t → ∞. Define the exploration sequence in the adaptive routing algorithm as follows: include t (t > 1) in A(t) if $|A(t-1)| < d\lceil f(t) \log t \rceil$. The resulting regret is given by
$$R(T) = O(d\, f(T) \log T).$$
Proof: It is sufficient to show that the regret caused in the exploitation sequence is bounded by O(d), independent of T. Since the exploration sequence is denser than the logarithmic order in Theorem 1, it is not difficult to show that the bound on $|E_k|$ given in (5) still holds with a different value of h. We consider any positive increasing sequence b(t) such that b(t) = o(f(t)) and b(t) → ∞ as t → ∞. By replacing b in the proof of Theorem 1 with b(t), we notice that after some finite time $T_0$, the parameter $\epsilon_i(t)$ will be small enough to ensure that (6) holds and b(t) will be large enough to ensure that (7) holds. The proof thus follows.

2) Heavy-Tailed Edge Weight Distributions: We now consider heavy-tailed cost distributions whose moment-generating functions do not exist and whose moments exist up to the q-th (q > 1) order. From [5], we have the following bound on the deviation of the sample mean from the true mean for heavy-tailed cost distributions.

Lemma 2 [5]: Let $\{X(t)\}_{t=1}^{\infty}$ be i.i.d. random variables drawn from a distribution with a finite q-th moment (q > 1). Let $\overline{X}_t = (\sum_{k=1}^{t} X(k))/t$ and $\theta = E[X(1)]$. We have, for all ε > 0,
$$\Pr(|\overline{X}_t - \theta| > \epsilon) = o(t^{1-q}). \qquad (8)$$

Based on Lemma 2, we arrive at the following theorem that states the sublinear regret order of the DSEE-based adaptive routing algorithm with a properly chosen exploration sequence.

Theorem 3: Construct an exploration sequence as follows. Choose a constant v > 0. For each t > 1, if $|A(t-1)| < v d t^{1/q}$, then include t in A(t). Then the DSEE-based adaptive routing algorithm has regret $R(T) \le D d T^{1/q}$ for some constant D independent of d and T.

Proof: Based on the construction of the exploration sequence, it is sufficient to show that the regret in the exploitation sequence is $o(T^{1/q}) \cdot d$. From (8), we have, for any i = 1, ..., d,
$$\Pr(|E[C_{p_i}(t)] - \theta_{p_i}(t)| \ge c/(2d)) = o(|A(t)/d|^{1-q}).$$
For any exploitation slot t ∉ A(t), we have $|A(t)| \ge v d t^{1/q}$. We arrive at
$$\Pr(|E[C_{p_i}(t)] - \theta_{p_i}(t)| \ge c/(2d)) = o(t^{(1-q)/q}).$$
Since the best path will not be chosen only if at least one of the basis vectors has a sample mean deviating from the true mean by c/(2d), the regret in the exploitation sequence is thus bounded by
$$\sum_{t=1}^{T} o(t^{(1-q)/q}) \cdot d = o(T^{1/q}) \cdot d.$$
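The exploration-sequence designs in Theorems 1–3 differ only in the target cardinality of A(t). The sketch below (our own illustration; the constants w, f, v, and q must satisfy the conditions stated in the respective theorems) shows the membership test "include t in A(t) if |A(t−1)| is below the target".

```python
import math

def in_exploration(t, n_explored, d, schedule="thm1", w=1.0, f=None, v=1.0, q=2.0):
    """Return True if slot t (t >= 2) should be an exploration slot, i.e. if the
    exploration count so far, n_explored = |A(t-1)|, is below the target cardinality.

    schedule="thm1": |A(t-1)| < d * ceil(d^2 * w * log t)   (light-tailed, Theorem 1)
    schedule="thm2": |A(t-1)| < d * ceil(f(t) * log t)      (no cost-model knowledge, Theorem 2)
    schedule="thm3": |A(t-1)| < v * d * t**(1/q)            (heavy-tailed, Theorem 3)
    """
    if schedule == "thm1":
        target = d * math.ceil(d * d * w * math.log(t))
    elif schedule == "thm2":
        target = d * math.ceil(f(t) * math.log(t))
    elif schedule == "thm3":
        target = v * d * t ** (1.0 / q)
    else:
        raise ValueError(f"unknown schedule: {schedule}")
    return n_explored < target
```

For instance, `in_exploration(t, n, d, schedule="thm2", f=math.log)` uses f(t) = log t as the diverging function of Theorem 2.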

IV. GENERAL WEIGHT MINIMIZATION PROBLEMS

In this section, we extend the results in Sec. III to the general weight minimization problem. Consider an arbitrary class of subgraphs (e.g., paths, spanning trees) of a network. The weight of a subgraph is given by the sum of the weights of all edges in this subgraph. Similar to Sec. III, the weight of each edge evolves i.i.d. over time with an unknown distribution. After a subgraph is chosen, its weight is observed. The weight of each individual edge (if not the only edge) in the subgraph may not be directly observable. The objective is to minimize the regret, which is similarly defined. An example is shown in Fig. 3-(a), where a spanning tree of a graph needs to be selected at each time. We point out that the weight minimization problem also includes the case where each node is associated with a weight, e.g., the dominating set problem as illustrated in Fig. 3-(b).

Fig. 3. (a) The spanning-tree problem; (b) The dominating-set problem.
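As a toy numerical illustration (ours, with made-up weights, not from the paper), both problems reduce to minimizing a linear function of an unknown weight vector over a set of 0/1 indicator vectors, so the algorithm of Sec. III applies unchanged.

```python
import numpy as np

# Toy triangle graph with edges (e1, e2, e3) and nodes (v1, v2, v3).
edge_weights = np.array([2.0, 1.0, 3.0])      # hypothetical realized W_e(t)
node_weights = np.array([0.5, 1.5, 1.0])      # hypothetical realized W_v(t)

# A spanning tree is a 0/1 vector over edges; a dominating set is one over nodes.
spanning_tree = np.array([1, 1, 0])           # uses edges e1 and e2
dominating_set = np.array([0, 1, 0])          # node v2 dominates the triangle

# In both cases the subgraph weight is linear in the unknown weights.
print(edge_weights @ spanning_tree)           # 3.0
print(node_weights @ dominating_set)          # 1.5
```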


The algorithm in Sec. III can be directly applied to the general weight minimization problem. Specifically, given a class of subgraphs in a network with m links (or nodes), we can represent each subgraph as a vector in $\mathbb{R}^d$ (d ≤ m) indicating whether or not each edge (or each node) is in this subgraph. Subsequently, we can construct the barycentric spanner of the vector space consisting of all subgraphs and explore only the barycentric spanner in the exploration sequence. It is not hard to show that Theorem 1, Theorem 2, and Theorem 3 still hold.

V. STOCHASTIC ONLINE LINEAR OPTIMIZATION PROBLEM

We now consider the general Stochastic Online Linear Optimization (SOLO) problem that has a general action space (an arbitrary compact set) and includes the weight-minimization problem as a special case. In a SOLO problem, we are given a compact set $U \subset \mathbb{R}^d$ with dimension d, referred to as the action set. At each time t, we choose a point x ∈ U and

observe a cost $\langle C(t), x \rangle$, where $C(t) \in \mathbb{R}^d$ is drawn from an unknown distribution³. We assume that $\{C(t)\}_{t \ge 1}$ are i.i.d. over t and the cost $\langle C(t), x \rangle$ is light-tailed for each x. The objective is to design a sequential action selection policy π to minimize the regret $R^\pi(T)$, defined as the total difference in the expected cost over T slots compared to the optimal action $x^* \triangleq \arg\min_{x \in U} E[\langle C(t), x \rangle]$:
$$R^\pi(T) = E\left[\sum_{t=1}^{T} \left(\langle C(t), x(\pi) \rangle - \langle C(t), x^* \rangle\right)\right],$$
where x(π) denotes the action under policy π at time t.

The structure of the policy remains the same as in the adaptive routing problem that has a finite action space. Specifically, time is partitioned into an exploration sequence and an exploitation sequence. In the exploration sequence, the d actions in the barycentric spanner are chosen in a round-robin fashion and an estimated cost of each action in U is computed by linearly interpolating the sample means of these d actions. In the exploitation sequence, the action $\hat{x}^* \in U$ with the minimum estimated cost is chosen.

The main difference is in the design of the cardinality of the exploration sequence. With an uncountable action space, it is much more difficult to learn which action is the optimal one. A perfect identification of the optimal action requires a perfect estimate of the expected cost of each basis vector, which cannot be achieved over any finite horizon. As illustrated in Fig. 4, the basic idea is to ensure that the estimated best action approaches the true one at a sufficiently fast rate as the exploration time increases. Equivalently, we need the estimation error of each basis vector to approach zero at a sufficiently fast rate by choosing a sufficiently dense exploration sequence. This is significantly different from the special case of finite actions, where a constant estimation error is allowed to identify the best action. Consequently, we require a denser exploration sequence, as detailed in the following theorem.

³The following result holds in the more general scenario that only requires the expected cost to be linear with x.

Fig. 4. The online linear optimization algorithm.

Theorem 4: Construct an exploration sequence as follows. Recall the parameters a, ζ, $u_0$ defined in (4). Choose $b > 2/a$. Define $c \triangleq d \log^{1/3} T / T^{1/3}$. Choose a constant $w \ge \max\{b/(d \zeta u_0)^2,\ 4b/c^2\}$. For each t > 1, if $|A(t-1)| < d\lceil d^2 w \log T \rceil$, then include t in A(t). Under this exploration sequence, the resulting policy $\pi^s$ has regret
$$R^{\pi^s}(T) = O(d\, T^{2/3} \log^{1/3} T).$$
Proof: Since the cardinality of the exploration phase is $O(d T^{2/3} \log^{1/3} T)$, the regret caused by exploration is bounded by the same order. We consider the regret incurred in the exploitation sequence. Similar to the proof of Theorem 1, in the exploitation phase, the expected number of times that
at least one action in B has a sample mean deviating from its true mean cost by c/(2d) is bounded by O(d). Based on the property of the barycentric spanner, the expected number of times that the chosen action is not in the c-neighborhood of $x^*$ (consisting of all x such that $|\langle C, x - x^* \rangle| \le c$) is upper bounded by O(d), i.e., the regret caused by choosing an action outside the c-neighborhood is O(d). By further noticing that the regret caused by choosing an action inside the c-neighborhood is $O(cT) = O(d T^{2/3} \log^{1/3} T)$, we proved the theorem.

Similar to the weight minimization problem, in Theorem 4 we need to know the bounds on parameters a, ζ and $u_0$ such that the Chernoff-Hoeffding bound (4) holds. In the following, we show that without any knowledge of the cost models, increasing the cardinality of the exploration phase of $\pi^s$ by an arbitrarily small amount leads to a regret arbitrarily close to that given in Theorem 4.

Theorem 5: Let f(t) be any positive increasing sequence with f(t) → ∞ as t → ∞. Revise policy $\pi^s$ in Theorem 4 as follows: if $|A(t-1)| < d\lceil f(T) T^{2/3} \log^{1/3} T \rceil$, then include t in A(t). Under this exploration sequence, the resulting policy $\pi^{s'}$ has regret
$$R^{\pi^{s'}}(T) = O(d\, f(T) T^{2/3} \log^{1/3} T).$$

We point out that for a special class of the problem, when there exists a nonzero gap in expected cost between the best action and the rest of the action space, Theorem 1, Theorem 2, and Theorem 3 hold. One example in this class is when the action space is given by a polytope. It is then sufficient to focus on the set of extremal points, since the optimal point must be an extremal point due to the linearity of the expected cost functions [20]. We can thus reduce the action space to a finite set, and there must exist a nonzero gap in expected cost between the best and the second best actions.
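As an illustration of the exploitation step for such a reduced finite action set (our own sketch, not the authors' implementation), the unknown mean cost vector can be estimated from the sample means of the d spanner actions by solving a linear system, which is equivalent to the interpolation in (1), and then minimized over the candidate actions, e.g., the extreme points of a polytope action set.

```python
import numpy as np

def exploit_action(spanner, sample_means, candidates):
    """Exploitation step of the SOLO policy (illustrative sketch).

    spanner:      d x d matrix whose rows are the d basis actions in U.
    sample_means: length-d vector of sample means of <C(t), x_k> for each basis action x_k.
    candidates:   n x d matrix of candidate actions (e.g., extreme points of a polytope U).
    """
    # E[<C(t), x>] is linear in x, so the basis sample means determine an estimate
    # c_hat of the mean cost vector through: spanner @ c_hat = sample_means.
    c_hat = np.linalg.solve(spanner, sample_means)
    estimated_costs = candidates @ c_hat
    return candidates[int(np.argmin(estimated_costs))]
```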


VI. CONCLUSION

In this paper, we considered the adaptive shortest-path routing problem in networks with unknown and stochastically varying link states, where only the total end-to-end cost of a path is observable after the path is selected for routing. By exploiting the dependency among the paths, we proposed a policy that reduces the learning complexity of identifying the optimal action among a large set of actions to sampling from only a small set of basis vectors. The proposed policy achieves a regret polynomial with the problem size and logarithmic with time for all light-tailed cost distributions. The regret order can be improved to linear with the problem size by an arbitrarily small sacrifice in the regret order with time. For heavy-tailed cost distributions, a regret linear with the problem size and sublinear with time was achieved. These results hold for the general weight minimization problems and a special class of stochastic online linear optimization problems. For the general stochastic online linear optimization problems, we proposed a policy that achieves a regret linear with the problem size and sublinear with time for all light-tailed cost distributions.

REFERENCES

[1] T. Lai and H. Robbins, "Asymptotically Efficient Adaptive Allocation Rules," Advances in Applied Mathematics, vol. 6, no. 1, pp. 4-22, 1985.
[2] T. Lai, "Adaptive Treatment Allocation and The Multi-Armed Bandit Problem," Ann. Statist., vol. 15, pp. 1091-1114, 1987.
[3] R. Agrawal, "Sample Mean Based Index Policies with O(log n) Regret for the Multi-armed Bandit Problem," Advances in Applied Probability, vol. 27, pp. 1054-1078, 1995.
[4] P. Auer, N. Cesa-Bianchi, P. Fischer, "Finite-time Analysis of the Multiarmed Bandit Problem," Machine Learning, vol. 47, pp. 235-256, 2002.
[5] K. Liu and Q. Zhao, "Multi-Armed Bandit Problems with Heavy Tail Reward Distributions," in Proc. of Allerton Conference on Communications, Control, and Computing, September 2011.
[6] Q. Zhao and B. Sadler, "A Survey of Dynamic Spectrum Access," IEEE Signal Processing Magazine, vol. 24, pp. 79-89, May 2007.
[7] C. Oliveira, P. Pardalos, "A Survey of Combinatorial Optimization Problems in Multicast Routing," Computers and Operations Research, vol. 32, no. 8, pp. 1953-1981, August 2005.
[8] B. Das and V. Bharghavan, "Routing in Ad Hoc Networks Using Minimum Connected Dominating Sets," in Proc. of IEEE International Conference on Communications (ICC), June 1997.
[9] M. Roozbehani, M. Dahleh, S. Mitter, "Dynamic Pricing and Stabilization of Supply and Demand in Modern Electric Power Grids," in Proc. of the First IEEE International Conference on Smart Grid Communications (SmartGridComm), pp. 543-548, Oct. 2010.
[10] R. Kleinberg, "Online Decision Problems with Large Strategy Sets," Ph.D. Thesis, MIT, 2005.
[11] K. Liu and Q. Zhao, "Extended UCB1 for Light-Tailed Reward Distributions," available at http://arxiv.org/abs/1112.1768.
[12] B. Awerbuch, "Optimal Distributed Algorithms for Minimum Weight Spanning Tree, Counting, Leader Election, and Related Problems," in Proc. of the 19th Annual ACM Symposium on Theory of Computing, 1987.
[13] J. Yu, J. Chong, "A Survey of Clustering Schemes for Mobile Ad Hoc Networks," IEEE Communications Surveys and Tutorials, vol. 7, no. 1, pp. 32-48, 2005.
[14] X. Cheng, B. Narahari, R. Simha, M. Cheng, D. Liu, "Strong Minimum Energy Topology in Wireless Sensor Networks: NP-Completeness and Heuristics," IEEE Transactions on Mobile Computing, vol. 2, no. 3, pp. 248-256, 2003.
[15] D. Li, X. Jia, H. Liu, "Energy Efficient Broadcast Routing in Static Ad Hoc Wireless Networks," IEEE Transactions on Mobile Computing, vol. 3, no. 2, pp. 144-151, 2004.
[16] P. Wan, K. M. Alzoubi, and O. Frieder, "Distributed Construction of Connected Dominating Set in Wireless Ad Hoc Networks," Mobile Networks and Applications, vol. 9, no. 2, pp. 141-149, 2004.
[17] F. Dai, J. Wu, "An Extended Localized Algorithm for Connected Dominating Set Formation in Ad Hoc Wireless Networks," IEEE Transactions on Parallel and Distributed Systems, vol. 15, no. 10, pp. 908-920, 2004.
[18] Y. Gai, B. Krishnamachari, and R. Jain, "Combinatorial Network Optimization with Unknown Variables: Multi-Armed Bandits with Linear Rewards and Individual Observations," IEEE/ACM Transactions on Networking, vol. 20, no. 5, 2012.
[19] B. Awerbuch, R. Kleinberg, "Online Linear Optimization and Adaptive Routing," Journal of Computer and System Sciences, pp. 97-114, 2008.
[20] V. Dani, T. Hayes, S. Kakade, "Stochastic Linear Optimization under Bandit Feedback," in Proc. of the 21st Annual Conference on Learning Theory, 2008.
[21] P. Rusmevichientong, J. N. Tsitsiklis, "Linearly Parameterized Bandits," Mathematics of Operations Research, vol. 35, no. 2, pp. 395-411, 2010.
[22] R. Agrawal, "The Continuum-Armed Bandit Problem," SIAM J. Control and Optimization, vol. 33, no. 6, pp. 1926-1951, 1995.
[23] R. Kleinberg, "Nearly Tight Bounds for the Continuum-Armed Bandit Problem," Advances in Neural Information Processing Systems 17 (NIPS), pp. 697-704, 2004.
[24] P. Auer, R. Ortner, C. Szepesvári, "Improved Rates for the Stochastic Continuum-Armed Bandit Problem," Lecture Notes in Computer Science, vol. 4539, pp. 454-468, 2007.
[25] E. W. Cope, "Regret and Convergence Bounds for a Class of Continuum-Armed Bandit Problems," IEEE Transactions on Automatic Control, vol. 54, no. 6, Jun. 2009.
[26] K. Liu and Q. Zhao, "Distributed Learning in Multi-Armed Bandit with Multiple Players," IEEE Transactions on Signal Processing, vol. 58, no. 11, pp. 5667-5681, November 2010.
[27] A. Anandkumar, N. Michael, A. Tang, and A. Swami, "Distributed Algorithms for Learning and Cognitive Medium Access with Logarithmic Regret," IEEE JSAC on Advances in Cognitive Radio Networking and Communications, vol. 29, no. 4, pp. 781-745, Apr. 2011.
[28] Y. Gai and B. Krishnamachari, "Decentralized Online Learning Algorithms for Opportunistic Spectrum Access," IEEE Global Communications Conference (GLOBECOM), December 2011.
[29] D. Kalathil, N. Nayyar, and R. Jain, "Decentralized Learning for Multi-Player Multi-Armed Bandits," in Proc. of Information Theory and Applications Workshop (ITA), Feb. 2012.
[30] H. Liu, K. Liu, and Q. Zhao, "Learning and Sharing in A Changing World: Non-Bayesian Restless Bandit with Multiple Players," in Proc. of Information Theory and Applications Workshop (ITA), January 2011.
[31] W. Dai, Y. Gai, B. Krishnamachari, and Q. Zhao, "The Non-Bayesian Restless Multi-Armed Bandit: a Case of Near-Logarithmic Regret," in Proc. of International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2011.
[32] Y. Gai, B. Krishnamachari, and M. Liu, "Online Learning for Combinatorial Network Optimization with Restless Markovian Rewards," in Proc. of the 9th Annual IEEE Communications Society Conference on Sensor, Mesh and Ad Hoc Communications and Networks (SECON), June 2012.
[33] C. Tekin and M. Liu, "Online Learning of Rested and Restless Bandits," IEEE Trans. on Information Theory, vol. 58, no. 8, pp. 5588-5611, August 2012.
[34] P. Chareka, O. Chareka, S. Kennendy, "Locally Sub-Gaussian Random Variable and the Strong Law of Large Numbers," Atlantic Electronic Journal of Mathematics, vol. 1, no. 1, pp. 75-81, 2006.