Probability Bounds for Polynomial Functions in Random Variables

Simai HE∗    Bo JIANG†    Zhening LI‡    Shuzhong ZHANG§

April 12, 2012

Abstract

Random sampling is a simple but powerful method in statistics and the design of randomized algorithms. In a typical application, random sampling can be applied to estimate an extreme value, say the maximum, of a function f over a set S ⊆ R^n. To do so, one may select a simpler (even finite) subset S_0 ⊆ S, randomly take some samples over S_0 a number of times, and pick the best sample. The hope is to find a good approximate solution with reasonable chance. This paper sets out to present a number of scenarios for f, S and S_0 where certain probability bounds can be established, leading to a quality assurance of the procedure. In our setting, f is a multivariate polynomial function. We prove that if f is a d-th order homogeneous polynomial in n variables, F is its corresponding super-symmetric tensor, and ξ_i (1 ≤ i ≤ n) are i.i.d. Bernoulli random variables taking 1 or −1 with equal probability, then Prob{ f(ξ_1, ξ_2, ..., ξ_n) ≥ c_1 n^{−d/2} ‖F‖_1 } ≥ c_2, where c_1, c_2 > 0 are two universal constants. Several new inequalities concerning probabilities of the above nature are presented in this paper. Moreover, we show that the bounds are tight in most cases. Applications of our results in optimization are discussed as well.

Keywords: random sampling; probability bound; tensor form; polynomial function; polynomial optimization; approximation algorithm.

Mathematics Subject Classification: 90C10, 90C11, 90C59, 15A69, 90C26.



∗ Department of Management Sciences, City University of Hong Kong, Kowloon Tong, Hong Kong. Email: [email protected]
† Industrial and Systems Engineering Program, University of Minnesota, Minneapolis, MN 55455. Email: [email protected]
‡ Department of Mathematics, Shanghai University, Shanghai 200444, China. Email: [email protected]
§ Industrial and Systems Engineering Program, University of Minnesota, Minneapolis, MN 55455. Email: [email protected]

1 Introduction

Let f(x) : R^n → R be a function, and let S ⊆ R^n be a given set; consider the problem max_{x∈S} f(x). A possible generic approximation method for solving this problem is randomization and sampling. In particular, we may proceed as follows: (i) choose a suitable and well-structured subset S_0 ⊆ S; (ii) design a suitable probability distribution ξ on S_0; (iii) take some random samples and pick the best solution. The quality of this approach, of course, depends on the chance of hitting some 'good solutions' by the random sampling. In other words, a bound in the following format is of crucial importance to us:

    Prob_{ξ∼S_0} { f(ξ) ≥ τ max_{x∈S} f(x) } ≥ θ,    (1)

where τ > 0 and 0 < θ < 1 are certain constants. In another situation, the original problem of interest is max_{x∈S_0} f(x). Replacing the constraint set by x ∈ S is a relaxation, which can create an easier problem to analyze. In this setting, a bound like (1) is useful in terms of deriving an approximate solution for the problem. A good example of this approach is the max-cut formulation of Goemans and Williamson [6], where S_0 is the set of rank-one positive semidefinite matrices with diagonal elements being all ones, and S is S_0 with the rank-one restriction dropped. In [18, 17, 9], this technique helped in the design of efficient randomized approximation algorithms for solving quadratically constrained quadratic programs by semidefinite programming (SDP) relaxation.

Motivated mainly by its generic interest and importance, primarily in optimization, the current paper is devoted to the establishment of inequalities of type (1) under various assumptions. Of course such a probability estimate cannot hold in general, unless some structures are in place. However, once (1) indeed holds, then with probability θ we will get a solution whose value is no worse than τ times the best possible value of f(x) over S. In other words, with probability θ we will be able to generate a τ-approximate solution. In particular, if we independently draw m trials of ξ on S_0 and pick the one with the largest function value, then this process is a randomized approximation algorithm with approximation ratio τ, where the probability of reaching this quality solution is at least 1 − (1 − θ)^m. If m = (ln(1/ε))/θ, then 1 − (1 − θ)^m ≥ 1 − ε, and this randomized algorithm indeed runs in polynomial time in the problem dimensions. In fact, the framework of our investigation, viz. the probability bound (1), is sufficiently rich to include some highly nontrivial results beyond optimization as well.
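The sample-and-pick-the-best procedure described above is easy to sketch. The following is a minimal illustration (not the authors' code): it draws m samples from a toy S_0 = B^n = {1, −1}^n, keeps the best one, and computes the repetition count m = ln(1/ε)/θ that boosts a per-trial success probability θ to an overall confidence of 1 − ε.

```python
import math
import random

def best_of_m(f, sample, m, rng):
    """Draw m candidates from S0 via sample(rng) and keep the best one."""
    best_x = sample(rng)
    best_v = f(best_x)
    for _ in range(m - 1):
        x = sample(rng)
        v = f(x)
        if v > best_v:
            best_x, best_v = x, v
    return best_x, best_v

def trials_needed(theta, eps):
    """m = ln(1/eps)/theta guarantees 1 - (1 - theta)^m >= 1 - eps."""
    return math.ceil(math.log(1.0 / eps) / theta)

# Toy instance: f(x) = sum(x) over the binary hypercube B^n.
n = 8
f = sum
sample = lambda rng: [rng.choice((-1, 1)) for _ in range(n)]
m = trials_needed(theta=0.1, eps=0.01)   # 47 repetitions suffice here
x, v = best_of_m(f, sample, m, random.Random(0))
```

The point of the sketch is the amplification step: a single trial may succeed with only a small probability θ, but repeating ln(1/ε)/θ independent trials drives the failure probability below ε.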
As an example, let f(x) = a^T x be a linear function, and let S = S_0 = B^n := {1, −1}^n be the binary hypercube. Khot and Naor in [11] derived the following probability bound, which can be seen as a nontrivial instance of (1). For every δ ∈ (0, 1/2), there is a constant c(δ) > 0 with the following property: fix a = (a_1, a_2, ..., a_n)^T ∈ R^n and let ξ_1, ξ_2, ..., ξ_n be i.i.d. symmetric Bernoulli random

variables (taking ±1 with equal probability); then

    Prob { ∑_{j=1}^n a_j ξ_j ≥ √(δ ln n / n) ‖a‖_1 } ≥ c(δ)/n^δ.    (2)

Since max_{x∈B^n} a^T x = ‖a‖_1, (2) is of type (1), with τ = √(δ ln n / n) and θ = c(δ)/n^δ. This bound indeed gives rise to a Θ(√(ln n / n))-approximation algorithm for the binary constrained trilinear form maximization problem:

    max  F(x, y, z) := ∑_{i,j,k=1}^n a_{ijk} x_i y_j z_k
    s.t. x, y, z ∈ B^n.

To see why, denote its optimal solution by (x*, y*, z*) = arg max_{x,y,z∈B^n} F(x, y, z). By letting a = F(·, y*, z*) ∈ R^n and ξ_1, ξ_2, ..., ξ_n be i.i.d. symmetric Bernoulli random variables, it follows from (2) that

    Prob { F(ξ, y*, z*) ≥ √(δ ln n / n) ‖F(·, y*, z*)‖_1 } ≥ c(δ)/n^δ.    (3)

Notice that by the optimality of (x*, y*, z*), we have ‖F(·, y*, z*)‖_1 = F(x*, y*, z*). Besides, for any fixed ξ, the problem max_{y,z∈B^n} F(ξ, y, z) is a binary constrained bilinear form maximization problem, which admits a deterministic approximation algorithm with approximation ratio 0.03 (see Alon and Naor [1]). Thus we are able to find two vectors y_ξ, z_ξ ∈ B^n in polynomial time such that

    F(ξ, y_ξ, z_ξ) ≥ 0.03 max_{y,z∈B^n} F(ξ, y, z) ≥ 0.03 F(ξ, y*, z*),

which by (3) implies

    Prob { F(ξ, y_ξ, z_ξ) ≥ 0.03 √(δ ln n / n) F(x*, y*, z*) } ≥ c(δ)/n^δ.

Now we may independently draw ξ_1, ξ_2, ..., ξ_n, followed by the algorithm proposed in [1] to solve max_{y,z∈B^n} F(ξ, y, z). If we apply this procedure (n^δ/c(δ)) ln(1/ε) times and pick the solution with the largest objective value, then we have a polynomial-time randomized approximation algorithm with approximation ratio 0.03 √(δ ln n / n) = Θ(√(ln n / n)), whose chance of reaching this quality bound is at least 1 − ε.

The scope of applications for results of type (1) is certainly beyond optimization per se; it is significant in the nature of probability theory itself. Recall that most classical results in probability theory are upper bounds on the tail of a distribution (e.g., the Markov inequality and the Chebyshev inequality), say Prob{ξ ≥ a} ≤ b. In other words, these are upper bounds for the probability


of a random variable beyond a threshold value. However, in some applications a lower bound for such a probability can be relevant, in the form of

    Prob {ξ ≥ a} ≥ b.    (4)

One interesting example is a result due to Ben-Tal, Nemirovskii, and Roos [2], where they proved a lower bound of 1/(8n^2) for the probability that a homogeneous quadratic form of n i.i.d. symmetric Bernoulli random variables lies above its mean. More precisely, they proved the following: if F ∈ R^{n×n} is a symmetric matrix and ξ = (ξ_1, ξ_2, ..., ξ_n)^T are i.i.d. symmetric Bernoulli random variables, then Prob { ξ^T F ξ ≥ tr(F) } ≥ 1/(8n^2). As a matter of fact, the authors went on to conjecture in [2] that the lower bound can be as high as 1/4; hitherto this conjecture remains open. A significant progress on this conjecture is due to He et al. [9], where the authors improved the lower bound of 1/(8n^2) to 0.03. Note that the result of He et al. [9] also holds for ξ_i's being i.i.d. standard normal random variables. Luo and Zhang [16] provided a constant lower bound for the probability that a homogeneous quartic function of a zero-mean multivariate normal distribution lies above its mean, which was a first attempt to extend such probability bounds to functions of random variables beyond quadratic. For a univariate random variable, bounds of type (4) and various extensions can be found in a recent paper by He, Zhang, and Zhang [10]. A well-known result of Grünbaum [5] can also be put in the category of probability inequality (4). Grünbaum's theorem asserts: if S ⊆ R^n is convex and ξ is uniformly distributed on S, then for any c ∈ R^n, Prob { c^T ξ ≥ c^T E ξ } ≥ e^{−1}.

The current paper aims at providing various new lower bounds for inequalities of type (1), when f is a multivariate polynomial function. To enable the presentation of our results, let us first briefly introduce the notation adopted in this paper.
For any given d-th order tensor F ∈ R^{n_1×n_2×···×n_d}, we denote by F(x^1, x^2, ..., x^d) the multilinear form induced by the tensor F, i.e.,

    F(x^1, x^2, ..., x^d) := ∑_{1≤i_1≤n_1, 1≤i_2≤n_2, ..., 1≤i_d≤n_d} F_{i_1 i_2 ··· i_d} x^1_{i_1} x^2_{i_2} ··· x^d_{i_d} = F • (x^1 ⊗ x^2 ⊗ ··· ⊗ x^d),

where x^k ∈ R^{n_k} for k = 1, 2, ..., d. If F ∈ R^{n^d} is super-symmetric (each component is invariant under permutation of its indices), we denote by f(x) the homogeneous polynomial function of x ∈ R^n induced by the super-symmetric tensor F, i.e.,

    f(x) := F(x, x, ..., x) = ∑_{1≤i_1,i_2,...,i_d≤n} F_{i_1 i_2 ··· i_d} x_{i_1} x_{i_2} ··· x_{i_d} = F • (x ⊗ x ⊗ ··· ⊗ x),

with x repeated d times on both sides.

For any given set S ⊆ R^n, ξ ∼ S stands for a multivariate uniform distribution ξ with support S. Two types of support sets are frequently used in this paper, namely

    B^n := {1, −1}^n  and  S^n := {x ∈ R^n : ‖x‖_2 = 1}.

It is easy to verify the following equivalences:

1. ξ = (ξ_1, ξ_2, ..., ξ_n)^T ∼ B^n is equivalent to ξ_1, ξ_2, ..., ξ_n being i.i.d. symmetric Bernoulli random variables (taking ±1 with equal probability);

2. ξ = (ξ_1, ξ_2, ..., ξ_n)^T ∼ S^n is equivalent to ξ = η/‖η‖_2, with η = (η_1, η_2, ..., η_n)^T and the η_i's being i.i.d. standard normal random variables.

To simplify the presentation, the notation Ω(f(n)) signifies that there are positive universal constants α and n_0 such that Ω(f(n)) ≥ α f(n) for all n ≥ n_0; the notation O(f(n)) signifies that there are positive universal constants α and n_0 such that O(f(n)) ≤ α f(n) for all n ≥ n_0; and the notation Θ(f(n)) signifies that it is both Ω(f(n)) and O(f(n)). Besides, the notation Φ(n) signifies that there are positive universal constants α, β and n_0 such that Φ(n) ≤ α n^β for all n ≥ n_0, i.e., a function bounded by a polynomial in n. To avoid confusion, the term constant sometimes also refers to a parameter depending only on the degree of a polynomial function, which is a given number independent of the input data of the problem.

The paper is organized as follows. In Section 2, we present probability inequalities of type (1) where f is a multilinear form, and ξ is either a random vector of i.i.d. symmetric Bernoulli random variables or a uniform distribution over the hypersphere. Then in Section 3, we present another set of probability bounds for homogeneous polynomial functions over a general class of independent random variables, which includes symmetric Bernoulli random variables and the uniform distribution over the hypersphere as special cases. We discuss some polynomial optimization problems where these probability bounds can be directly applied in Section 4. Finally, we summarize the main results of the paper in Section 5, with concluding remarks.
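Both distributions are straightforward to sample, and bound (2) can be checked empirically. The sketch below is a Monte Carlo illustration (not part of the paper): it samples B^n and S^n as described above, and estimates the left-hand side of (2) for the vector a = (1, ..., 1)^T, for which ‖a‖_1 = n.

```python
import math
import random

def sample_Bn(n, rng):
    """xi ~ B^n: i.i.d. +-1 entries with equal probability."""
    return [rng.choice((-1, 1)) for _ in range(n)]

def sample_Sn(n, rng):
    """eta ~ S^n: a vector of i.i.d. standard normals, normalized."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    r = math.sqrt(sum(v * v for v in g))
    return [v / r for v in g]

def kn_tail_estimate(a, delta, trials=20000, seed=1):
    """Estimate Prob{ sum_j a_j xi_j >= sqrt(delta ln n / n) * ||a||_1 }."""
    rng = random.Random(seed)
    n = len(a)
    thr = math.sqrt(delta * math.log(n) / n) * sum(abs(v) for v in a)
    hits = 0
    for _ in range(trials):
        xi = sample_Bn(n, rng)
        if sum(ai * xii for ai, xii in zip(a, xi)) >= thr:
            hits += 1
    return hits / trials

p = kn_tail_estimate([1.0] * 100, delta=0.25)
# (2) guarantees p >= c(0.25)/100^0.25 for some universal constant c(0.25) > 0.
```

For n = 100 and δ = 1/4, the threshold is about 10.7, and the empirical tail probability stays well away from zero, as (2) predicts.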

2 Multilinear Tensor Function in Random Variables

In this section we present the following result, which provides tight probability bounds for multilinear forms in two different sets of random variables.

Theorem 2.1 Let ξ^i ∼ B^{n_i} (i = 1, 2, ..., d) be independent of each other, and let η^i ∼ S^{n_i} (i = 1, 2, ..., d) be independent of each other. For any d-th order tensor F ∈ R^{n_1×n_2×···×n_d}, we have

    Prob { F(ξ^1, ξ^2, ..., ξ^d) ≥ σ_d ‖F‖_1 } ≥ 1/Φ(∏_{i=1}^d n_i),    (5)

    Prob { F(η^1, η^2, ..., η^d) ≥ σ_d ‖F‖_2 } ≥ 1/Φ(∏_{i=1}^d n_i),    (6)

where σ_d = Θ(√( ln(∏_{i=1}^d n_i) / ∏_{i=1}^d n_i )). Moreover, the order of σ_d cannot be improved, if the bound is to remain at least 1/Φ(∏_{i=1}^d n_i).

We remark here that the degree d is deemed a fixed constant in our discussion. If we let S = B^{n_1×n_2×···×n_d} and S_0 = {X ∈ B^{n_1×n_2×···×n_d} | rank(X) = 1}, then (5) is in the form of (1). Similarly, if we let S = S^{n_1×n_2×···×n_d} and S_0 = {X ∈ S^{n_1×n_2×···×n_d} | rank(X) = 1}, then (6) is in the form of (1). For clarity, we shall prove (5) and (6) separately in the following two subsections.
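For intuition, the d = 2 case of (5) can be probed numerically. The sketch below is an illustration only: the Θ(·) in σ_2 hides a universal constant, which is arbitrarily set to 1 here, so the estimate simply probes the order of the threshold rather than verifying the theorem.

```python
import math
import random

def bilinear(F, x, y):
    """F(x, y) = sum_{i,j} F[i][j] * x_i * y_j (the d = 2 multilinear form)."""
    return sum(F[i][j] * x[i] * y[j]
               for i in range(len(x)) for j in range(len(y)))

def empirical_prob_5(F, trials=20000, seed=7):
    """Estimate Prob{ F(xi1, xi2) >= sigma_2 * ||F||_1 } for xi1, xi2 ~ B^n."""
    rng = random.Random(seed)
    n1, n2 = len(F), len(F[0])
    N = n1 * n2
    norm1 = sum(abs(v) for row in F for v in row)
    sigma2 = math.sqrt(math.log(N) / N)    # sigma_2 with the constant set to 1
    hits = 0
    for _ in range(trials):
        x = [rng.choice((-1, 1)) for _ in range(n1)]
        y = [rng.choice((-1, 1)) for _ in range(n2)]
        if bilinear(F, x, y) >= sigma2 * norm1:
            hits += 1
    return hits / trials

rng = random.Random(3)
F = [[rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(6)]
p = empirical_prob_5(F)
```

Theorem 2.1 guarantees that p is at least an inverse polynomial in n_1 n_2 once the hidden constant is set correctly.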

2.1 Multilinear Tensor Function in Bernoulli Random Variables

This subsection is dedicated to the proof of the first part of Theorem 2.1, namely (5). Let us first settle the tightness of the bound. Notice that Khot and Naor [11] proved (5) for the case d = 1, i.e., (2), the ratio being precisely σ_1 = Θ(√(ln n / n)). They also argued the tightness of this ratio. In [11] they applied (2) to prove that the so-called L_1 diameter of convex bodies can be approximated within a factor of σ_1 (see Theorem 3.1 in [11]). By an inapproximability result on the L_1 diameter of convex bodies (see [3]), it follows that the ratio σ_1 cannot be improved in general. In fact, σ_1 is a tight bound not only for ξ ∼ B^n but also for other distributions on the support set B^n, again due to the inapproximability of the L_1 diameter of convex bodies. Now, for any given degree d, if we denote n = ∏_{i=1}^d n_i, then (5) is essentially

    Prob { F • (ξ^1 ⊗ ξ^2 ⊗ ··· ⊗ ξ^d) ≥ Θ(√(ln n / n)) ‖F‖_1 } ≥ 1/Φ(n).    (7)

Denote ξ = ξ^1 ⊗ ξ^2 ⊗ ··· ⊗ ξ^d, which is clearly an implementable distribution on the support B^n. Thus (7) can be regarded as the form of (5) in the case d = 1. Due to the tightness of σ_1, the bound σ_d = Θ(√( ln(∏_{i=1}^d n_i) / ∏_{i=1}^d n_i )) = Θ(√(ln n / n)) for general d in (5), once established, is a tight one too. It is interesting to note that the difference between a completely free ξ and the more restrictive ξ = ξ^1 ⊗ ξ^2 ⊗ ··· ⊗ ξ^d lies in the fact that the latter is rank-one. Hence, the establishment of (5) actually implies that, as far as the randomized solution is concerned, the rank-one restriction is immaterial.

To proceed, let us start with some technical preparations. First, we have the following immediate probability estimate.

Lemma 2.2 If ξ ∼ B^n, then for any vector a ∈ R^n, E|a^T ξ| ≥ θ_1 ‖a‖_2, where θ_1 = 16/(25√5) ≈ 0.286.

Proof. Denote z = |a^T ξ|, and observe that

    E z^2 = E ( ∑_{i=1}^n ξ_i a_i )^2 = E ( ∑_{i=1}^n a_i^2 + 2 ∑_{1≤i<j≤n} ξ_i ξ_j a_i a_j ) = ‖a‖_2^2,

since E ξ_i ξ_j = 0 for all i ≠ j.

Proposition 2.4 Let ξ and η be independent random vectors, and let W', V and V' ⊆ V be given sets. For any δ > 0, if

    Prob_ξ { (ξ, y) ∈ W' } ≥ δ  for all y ∈ V

and

    Prob_η { η ∈ V' } > 0,

then the joint conditional probability satisfies

    Prob_{(ξ,η)} { (ξ, η) ∈ W' | η ∈ V' } ≥ δ.

Proof. Notice that the first assumption is equivalent to

    Prob_{(ξ,η)} { (ξ, η) ∈ W' | η = y } ≥ δ  for all y ∈ V.    (8)

Suppose that η has a density g(·) in V; then

    Prob_{(ξ,η)} { (ξ, η) ∈ W' | η ∈ V' } = Prob_{(ξ,η)} { (ξ, η) ∈ W', η ∈ V' } / Prob_η { η ∈ V' }
        = ( ∫_{V'} Prob_{(ξ,η)} { (ξ, η) ∈ W' | η = y } g(y) dy ) / Prob_η { η ∈ V' }
        ≥ ( δ ∫_{V'} g(y) dy ) / Prob_η { η ∈ V' } = δ.

The case where η is a discrete random variable can be handled similarly. □
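Lemma 2.2 is easy to check by simulation. The sketch below (an illustration, not part of the paper) estimates E|a^T ξ| for a small fixed vector a and compares it against θ_1‖a‖_2; for this particular a, exhaustive enumeration of the 16 sign patterns gives the exact value 25/8 = 3.125.

```python
import math
import random

def mean_abs_linear(a, trials=20000, seed=5):
    """Monte Carlo estimate of E|a^T xi| for xi ~ B^n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        s = sum(v if rng.random() < 0.5 else -v for v in a)
        total += abs(s)
    return total / trials

a = [1.0, -2.0, 0.5, 3.0]
theta1 = 16 / (25 * math.sqrt(5))            # ~ 0.286, the constant in Lemma 2.2
norm2 = math.sqrt(sum(v * v for v in a))     # ||a||_2 ~ 3.775
est = mean_abs_linear(a)                     # exact value is 25/8 = 3.125
# Lemma 2.2: est should comfortably exceed theta1 * norm2 ~ 1.08.
```

The lemma only asserts a lower bound; as the example shows, the actual mean can be much larger than θ_1‖a‖_2.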

We are now ready to prove (5).

Proof of (5) in Theorem 2.1. Without loss of generality, we assume that n_1 ≤ n_2 ≤ ··· ≤ n_d. Notice that d is considered a fixed parameter, and therefore σ_d = Θ(√( ln(∏_{i=1}^d n_i) / ∏_{i=1}^d n_i )) = Θ(√( ln n_d / ∏_{i=1}^d n_i )). The proof is based on induction on d. The case d = 1 has been established by Khot and Naor [11]. Suppose the inequality holds for d − 1; by treating ξ^1 as a given parameter and taking F(ξ^1, ·, ·, ..., ·) as a tensor of order d − 1, one has

    Prob_{(ξ^2,ξ^3,...,ξ^d)} { F(ξ^1, ξ^2, ..., ξ^d) ≥ Θ(√( ln n_d / ∏_{i=2}^d n_i )) ‖F(ξ^1, ·, ·, ..., ·)‖_1 } ≥ 1/Φ(∏_{i=2}^d n_i).

Define the event E_1 = { ‖F(ξ^1, ·, ·, ..., ·)‖_1 ≥ Θ(1/√n_1) ‖F‖_1 }. By applying Proposition 2.4 with ξ = (ξ^2, ξ^3, ..., ξ^d) and η = ξ^1, we have

    Prob_{(ξ^1,ξ^2,...,ξ^d)} { F(ξ^1, ξ^2, ..., ξ^d) ≥ Θ(√( ln n_d / ∏_{i=2}^d n_i )) ‖F(ξ^1, ·, ·, ..., ·)‖_1 | E_1 } ≥ 1/Φ(∏_{i=2}^d n_i).    (9)

The desired probability can be lower bounded as follows:

    Prob { F(ξ^1, ξ^2, ..., ξ^d) ≥ Θ(√( ln n_d / ∏_{i=1}^d n_i )) ‖F‖_1 }
        ≥ Prob_{(ξ^1,ξ^2,...,ξ^d)} { F(ξ^1, ξ^2, ..., ξ^d) ≥ Θ(√( ln n_d / ∏_{i=2}^d n_i )) ‖F(ξ^1, ·, ·, ..., ·)‖_1 | E_1 } · Prob {E_1}
        ≥ 1/Φ(∏_{i=2}^d n_i) · 1/Φ(n_1) = 1/Φ(∏_{i=1}^d n_i),

where the last inequality is due to (9) and Lemma 2.3. □

2.2 Multilinear Tensor Function over Hyperspheres

In this subsection we shall prove the second part of Theorem 2.1, namely (6). The main construction is analogous to that of the proof of (5). First we establish a counterpart of inequality (2), i.e., we prove (6) for d = 1, which is essentially the following Lemma 2.5: if we uniformly and independently draw two vectors in S^n, then with nontrivial probability their inner product is at least Θ(√(ln n / n)).

Lemma 2.5 If a, x ∼ S^n are drawn independently, then

    Prob { a^T x ≥ Θ(√(ln n / n)) } ≥ Θ(√(1/(n ln n))).

Proof. By the symmetry of S^n, we may without loss of generality assume that a is a given vector in S^n, e.g., a = (1, 0, ..., 0)^T. Let η_k (k = 1, 2, ..., n) be i.i.d. standard normal random variables; then x = η/‖η‖_2 and a^T x = η_1/‖η‖_2. It suffices to show that

    Prob { η_1/‖η‖_2 ≥ Θ(√(ln n / n)) } ≥ Θ(√(1/(n ln n))).

First, we have for n ≥ 2,

    Prob { η_1 ≥ √(ln n) } = (1/√(2π)) ∫_{√(ln n)}^{+∞} e^{−x^2/2} dx
        ≥ (1/√(2π)) ∫_{√(ln n)}^{2√(ln n)} e^{−x^2/2} dx
        ≥ (1/√(2π)) ∫_{√(ln n)}^{2√(ln n)} (x/(2√(ln n))) e^{−x^2/2} dx
        = (1/√(8π ln n)) (1/√n − 1/n^2) ≥ Θ(√(1/(n ln n))).

Second, we have

    Prob { ‖η‖_2 ≥ √(5n) } ≤ e^{−n}.    (10)

To see why (10) holds, we may use a result on χ^2-distribution tail estimates by Laurent and Massart (Lemma 1 of [15]): for any vector b = (b_1, b_2, ..., b_n)^T with b_k ≥ 0 (k = 1, 2, ..., n), denote z = ∑_{k=1}^n b_k (η_k^2 − 1); then for any t > 0,

    Prob { z ≥ 2‖b‖_2 √t + 2‖b‖_∞ t } ≤ e^{−t}.

Thus (10) follows immediately by letting b be the all-one vector and t = n. By these two inequalities, we conclude that

    Prob { a^T x ≥ √(ln n / (5n)) } = Prob { η_1/‖η‖_2 ≥ √(ln n / (5n)) }
        ≥ Prob { η_1 ≥ √(ln n), ‖η‖_2 ≤ √(5n) }
        ≥ Prob { η_1 ≥ √(ln n) } − Prob { ‖η‖_2 ≥ √(5n) }
        ≥ Θ(√(1/(n ln n))) − e^{−n} ≥ Θ(√(1/(n ln n))).

The lemma is proven. □
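A quick simulation agrees with Lemma 2.5. The sketch below is an illustration only; the threshold √(ln n/(5n)) is the explicit one appearing in the proof above, rather than a generic Θ(·).

```python
import math
import random

def random_unit(n, rng):
    """Draw x ~ S^n by normalizing i.i.d. standard normals."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    r = math.sqrt(sum(v * v for v in g))
    return [v / r for v in g]

def inner_tail(n, trials=20000, seed=11):
    """Estimate Prob{ a^T x >= sqrt(ln n / (5n)) } for independent a, x ~ S^n."""
    rng = random.Random(seed)
    thr = math.sqrt(math.log(n) / (5 * n))
    hits = 0
    for _ in range(trials):
        a = random_unit(n, rng)
        x = random_unit(n, rng)
        if sum(u * v for u, v in zip(a, x)) >= thr:
            hits += 1
    return hits / trials

p = inner_tail(50)
# Lemma 2.5 guarantees p is at least on the order of 1/sqrt(n ln n) ~ 0.07 here.
```

By rotational symmetry the inner product of two independent uniform unit vectors has the same law as a single coordinate of one of them, which is what the proof exploits.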

For any vector a ∈ R^n, since a/‖a‖_2 ∈ S^n, we have for x ∼ S^n

    Prob { a^T x ≥ Θ(√(ln n / n)) ‖a‖_2 } = Prob { (a/‖a‖_2)^T x ≥ Θ(√(ln n / n)) } ≥ Θ(√(1/(n ln n))),    (11)

which implies that (6) holds when d = 1. To proceed to the higher order cases, let us introduce the following intermediate result, which is analogous to Lemma 2.3 in the previous subsection.

Lemma 2.6 If x ∼ S^n, then for any matrix A ∈ R^{m×n},

    Prob { ‖Ax‖_2 ≥ Θ(1/√n) ‖A‖_2 } ≥ Θ(1/rank(A)).

Proof. Without loss of generality we may assume that m = rank(A) ≤ n: if this were not the case, then by applying a Cholesky decomposition to A^T A, we would have A^T A = B^T B with B ∈ R^{rank(A)×n}, and we could replace A by B, since

    ‖Bx‖_2^2 = x^T B^T B x = x^T A^T A x = ‖Ax‖_2^2  and  ‖B‖_2^2 = tr(B^T B) = tr(A^T A) = ‖A‖_2^2.

Let x = η/‖η‖_2, where η = (η_1, η_2, ..., η_n)^T are i.i.d. standard normal random variables. Denote by a^i ∈ R^n (i = 1, 2, ..., m) the i-th row vector of the matrix A. Then

    E ‖Aη‖_2^2 = E [ ∑_{i=1}^m (η^T a^i)^2 ] = ∑_{i=1}^m E (η^T a^i)^2 = ∑_{i=1}^m ‖a^i‖_2^2 = ‖A‖_2^2.

For each 1 ≤ i ≤ m, since η^T a^i is normally distributed with mean 0 and variance ‖a^i‖_2^2, its kurtosis is E (η^T a^i)^4 / ‖a^i‖_2^4 = 3, and so

    E ‖Aη‖_2^4 = E ( ∑_{i=1}^m (η^T a^i)^2 )^2 ≤ E [ m ∑_{i=1}^m (η^T a^i)^4 ] = m ∑_{i=1}^m E (η^T a^i)^4 = m ∑_{i=1}^m 3‖a^i‖_2^4 ≤ 3m ‖A‖_2^4.

By the Paley–Zygmund inequality we have for any θ ∈ (0, 1)

    Prob { ‖Aη‖_2^2 ≥ θ ‖A‖_2^2 } = Prob { ‖Aη‖_2^2 ≥ θ E ‖Aη‖_2^2 } ≥ (1 − θ)^2 (E ‖Aη‖_2^2)^2 / E ‖Aη‖_2^4 ≥ (1 − θ)^2/(3m).

Using (10) we have

    Prob { ‖Ax‖_2 ≥ ‖A‖_2/√(10n) } = Prob { ‖Aη‖_2/‖η‖_2 ≥ ‖A‖_2/√(10n) }
        ≥ Prob { ‖Aη‖_2^2 ≥ ‖A‖_2^2/2, ‖η‖_2 ≤ √(5n) }
        ≥ Prob { ‖Aη‖_2^2 ≥ ‖A‖_2^2/2 } − Prob { ‖η‖_2 ≥ √(5n) }
        ≥ 1/(12m) − e^{−n} ≥ 1/(12m) − e^{−m} ≥ Θ(1/rank(A)),

and the desired inequality follows. □

Now we are ready to prove (6) by induction.

Proof of (6) in Theorem 2.1. Without loss of generality, we assume that n_1 ≤ n_2 ≤ ··· ≤ n_d. Thus σ_d = Θ(√( ln(∏_{i=1}^d n_i) / ∏_{i=1}^d n_i )) = Θ(√( ln n_d / ∏_{i=1}^d n_i )). The proof is based on induction on the degree d, and inequality (11) establishes the case d = 1. Suppose (6) holds for d − 1; by treating η^1 as a given parameter and taking F(η^1, ·, ·, ..., ·) as a tensor of order d − 1, one has

    Prob_{(η^2,η^3,...,η^d)} { F(η^1, η^2, ..., η^d) ≥ Θ(√( ln n_d / ∏_{i=2}^d n_i )) ‖F(η^1, ·, ·, ..., ·)‖_2 } ≥ 1/Φ(∏_{i=2}^d n_i).

Now define the event E_2 = { ‖F(η^1, ·, ·, ..., ·)‖_2 ≥ Θ(1/√n_1) ‖F‖_2 }. Applying Proposition 2.4 with ξ = (η^2, η^3, ..., η^d) and η = η^1, we have

    Prob_{(η^1,η^2,...,η^d)} { F(η^1, η^2, ..., η^d) ≥ Θ(√( ln n_d / ∏_{i=2}^d n_i )) ‖F(η^1, ·, ·, ..., ·)‖_2 | E_2 } ≥ 1/Φ(∏_{i=2}^d n_i).    (12)

Therefore,

    Prob { F(η^1, η^2, ..., η^d) ≥ Θ(√( ln n_d / ∏_{i=1}^d n_i )) ‖F‖_2 }
        ≥ Prob_{(η^1,η^2,...,η^d)} { F(η^1, η^2, ..., η^d) ≥ Θ(√( ln n_d / ∏_{i=2}^d n_i )) ‖F(η^1, ·, ·, ..., ·)‖_2 | E_2 } · Prob {E_2}
        ≥ 1/Φ(∏_{i=2}^d n_i) · Θ(1/n_1) = 1/Φ(∏_{i=1}^d n_i),

where the last inequality is due to (12) and Lemma 2.6.

With respect to the tightness of σ_d in (6), the argument is essentially the same as the one used for the tightness of (5). The only difference is that we use the inapproximability result for the L_2 diameter of convex bodies, instead of the L_1 diameter. The inapproximability of the L_2 diameter of convex bodies states that one cannot expect to find an approximation of any order better than Ω(√(ln n / n)) in polynomial time unless P = NP (cf. [3]). □
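As with the Bernoulli case, the d = 2 instance of (6) can be probed numerically. The sketch below is an illustration only (the hidden constant in σ_2 is again set to 1): it estimates the probability that a random rank-one spherical evaluation exceeds σ_2‖F‖_2.

```python
import math
import random

def random_unit(n, rng):
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    r = math.sqrt(sum(v * v for v in g))
    return [v / r for v in g]

def empirical_prob_6(F, trials=20000, seed=23):
    """Estimate Prob{ eta1^T F eta2 >= sigma_2 ||F||_2 } for eta1, eta2 ~ S^n."""
    rng = random.Random(seed)
    n1, n2 = len(F), len(F[0])
    N = n1 * n2
    norm2 = math.sqrt(sum(v * v for row in F for v in row))
    sigma2 = math.sqrt(math.log(N) / N)    # sigma_2 with the constant set to 1
    hits = 0
    for _ in range(trials):
        e1 = random_unit(n1, rng)
        e2 = random_unit(n2, rng)
        val = sum(F[i][j] * e1[i] * e2[j]
                  for i in range(n1) for j in range(n2))
        if val >= sigma2 * norm2:
            hits += 1
    return hits / trials

rng = random.Random(6)
F = [[rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(6)]
p = empirical_prob_6(F)
```

The estimate is small but positive, consistent with a lower bound of the form 1/Φ(n_1 n_2) in (6).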

3 Homogeneous Polynomial Function in Random Variables

The previous section is concerned with tensor forms of independent entry vectors. One important aspect of tensors is their connection to polynomial functions. As is well known, a homogeneous d-th degree polynomial uniquely determines a super-symmetric tensor of d entry vectors. In this section we shall discuss probability bounds for polynomial functions of random variables. In our discussion, the notion of a square-free tensor plays an important role. Essentially, in the matrix case, square-free means that the diagonal elements are all zero. For a general tensor F = (a_{i_1 i_2 ··· i_d}), square-free means that a_{i_1 i_2 ··· i_d} = 0 whenever at least two of the indices are equal.

Theorem 3.1 Let F ∈ R^{n^d} be a square-free super-symmetric tensor of order d, and let f(x) = F(x, x, ..., x) be the homogeneous polynomial function induced by F. If ξ = (ξ_1, ξ_2, ..., ξ_n)^T are independent random variables with E ξ_i = 0, E ξ_i^2 = 1, and E ξ_i^4 ≤ κ for i = 1, 2, ..., n, then

    Prob { f(ξ) ≥ c_2 ‖F‖_2 } ≥ c_1,    (13)

    Prob { f(ξ) ≥ τ_d ‖F‖_1 } ≥ c_1,    (14)

where τ_d = Θ(n^{−d/2}), and the constants are c_1 = (2√3 − 3)/(9d^2 (d!)^2 36^d κ^d) and c_2 = √(d!/(16κ)).

Compared to Theorem 2.1 in the previous section, Theorem 3.1 only requires the random variables to be independent of each other, each with a bounded kurtosis; this includes Bernoulli random variables and normal random variables as special cases. It is easy to verify that, under the square-free property of F, together with the assumptions that E ξ_i = 0 and the ξ_i's are independent of each other (i = 1, 2, ..., n), we have E[f(ξ)] = 0. Since E ξ_i^2 = 1 (i = 1, 2, ..., n), we compute that Var(f(ξ)) = Θ(‖F‖_2^2). This means that the standard deviation of f(ξ) is of the same order as ‖F‖_2. Assertion (13) essentially states that, given any set of independent random variables with bounded kurtosis, any square-free polynomial of these random variables has a certain thickness of the tail at some point. The proof of Theorem 3.1 is technically involved, and we delegate the details to the appendix.

Although our main results in Theorem 3.1 are valid for arbitrary random variables of this kind, it is interesting to discuss the implications when the random variables are uniform distributions on B^n and S^n. In the case of a quadratic polynomial in Bernoulli random variables, we have the following:

Proposition 3.2 If F is a diagonal-free symmetric matrix and ξ ∼ B^n, then

    Prob { ξ^T F ξ ≥ ‖F‖_2/(2√30) } ≥ (2√3 − 3)/135.

The proof of this proposition will also be discussed in the appendix. We remark that Proposition 3.2 is an extension of the result of Ben-Tal, Nemirovskii, and Roos [2], where it was shown that Prob { x^T F x ≥ 0 } ≥ 1/(8n^2), and of the result of He et al. [9], where it was shown that Prob { x^T F x ≥ 0 } ≥ 0.03. Essentially, Proposition 3.2 concerns the probability of a strict tail, rather than the probability of lying above the mean.
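Proposition 3.2 can be checked by simulation. The sketch below (illustrative only) builds a random diagonal-free symmetric matrix and estimates the tail probability; the empirical value far exceeds the guaranteed constant (2√3 − 3)/135 ≈ 0.0034.

```python
import math
import random

def quad_form(F, x):
    """x^T F x for a matrix F given as a list of rows."""
    n = len(F)
    return sum(F[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

def prop32_tail(F, trials=20000, seed=17):
    """Estimate Prob{ xi^T F xi >= ||F||_2 / (2 sqrt(30)) } for xi ~ B^n."""
    rng = random.Random(seed)
    n = len(F)
    norm2 = math.sqrt(sum(v * v for row in F for v in row))
    thr = norm2 / (2 * math.sqrt(30))
    hits = 0
    for _ in range(trials):
        x = [rng.choice((-1, 1)) for _ in range(n)]
        if quad_form(F, x) >= thr:
            hits += 1
    return hits / trials

# A random diagonal-free (zero-diagonal) symmetric matrix.
rng = random.Random(4)
n = 10
F = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        F[i][j] = F[j][i] = rng.gauss(0.0, 1.0)
p = prop32_tail(F)
```

Since ξ^T F ξ has mean zero and standard deviation √2 ‖F‖_2 for a diagonal-free symmetric F, the threshold ‖F‖_2/(2√30) sits close to the mean, which is why the empirical probability is so comfortable.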

Proposition 3.3 Let F ∈ R^{n^d} be a square-free super-symmetric tensor of order d, and let f(x) = F(x, x, ..., x) be the homogeneous polynomial function induced by F. If ξ ∼ B^n, then

    Prob { f(ξ) ≥ τ_d ‖F‖_1 } ≥ c_3,

where τ_d = Θ(n^{−d/2}) and the constant c_3 = (2√3 − 3)/(9d^2 (d!)^2 36^d). Moreover, the bound τ_d cannot be improved for d = 2, 4.

As a remark, Proposition 3.3 can be seen as an instance of (1) in the case f(X) = F • X, S = {X ∈ B^{n^d} : X is super-symmetric} and S_0 = {X ∈ S : rank(X) = 1}. The probability bound in Proposition 3.3 follows directly from (14), since E ξ_i = 0 and E ξ_i^2 = E ξ_i^4 = 1 for all i = 1, 2, ..., n. It remains to show that even in this special case the bounds are tight when d = 2 and d = 4, which is illustrated by the following examples.

Example 3.4 For the case d = 2, define F = I − E, where I is the identity matrix and E is the all-one matrix. In this case, for any x ∈ B^n, x^T F x = n − (e^T x)^2 ≤ n and ‖F‖_1 = n^2 − n. Therefore x^T F x / ‖F‖_1 ≤ 1/(n − 1) for any x ∈ B^n, implying that the ratio τ_2 cannot be better than Θ(1/n) for any positive probability.

Example 3.5 For the case d = 4, define F to be the square-free tensor of order 4 with all its square-free components being −1. It is obvious that ‖F‖_1 = Θ(n^4), and for any x ∈ B^n

    F(x, x, x, x) = ∑_{i=1}^n x_i^4 + 12 ∑_{i≠j, j≠k, i≠k} x_i^2 x_j x_k + 6 ∑_{i≠j} x_i^2 x_j^2 + 4 ∑_{i≠j} x_i^3 x_j − (∑_{i=1}^n x_i)^4
        = n + 12(n − 2) ∑_{j≠k} x_j x_k + 3n(n − 1) + 4 ∑_{i≠j} x_i x_j − (∑_{i=1}^n x_i)^4
        = 3n^2 − 2n + (6n − 10) ∑_{j≠k} 2 x_j x_k − (∑_{i=1}^n x_i)^4
        = 3n^2 − 2n + (6n − 10)((∑_{i=1}^n x_i)^2 − n) − (∑_{i=1}^n x_i)^4
        = 3n^2 − 2n − n(6n − 10) + (3n − 5)^2 − ((∑_{i=1}^n x_i)^2 − (3n − 5))^2
        ≤ 6n^2 − 22n + 25.

Thus F(x, x, x, x)/‖F‖_1 ≤ Θ(1/n^2), implying that the ratio τ_4 cannot be better than Θ(1/n^2) for any positive probability.

We believe that examples of the above type exist for any given d ≥ 4; however, so far we are unable to construct a general example explicitly.

Let us now specialize the random variables to be uniformly distributed on the hypersphere. Since the components of a unit vector are not independent, we cannot directly apply Theorem 3.1. After some treatment, we get a similar result, as follows.

Proposition 3.6 Let F ∈ R^{n^d} be a square-free super-symmetric tensor of order d, and let f(x) = F(x, x, ..., x) be the homogeneous polynomial function induced by F. If η ∼ S^n, then

    Prob { f(η) ≥ τ_d ‖F‖_2 } ≥ c_4 − e^{−n},

where τ_d = Θ(n^{−d/2}) and c_4 = (2√3 − 3)/(9d^2 (d!)^2 108^d).

Proof. Let η = ξ/‖ξ‖_2 with ξ = (ξ_1, ξ_2, ..., ξ_n)^T being i.i.d. standard normal random variables. Since E ξ_i = 0, E ξ_i^2 = 1, and E ξ_i^4 = 3 for all 1 ≤ i ≤ n, by applying (13) in Theorem 3.1 with κ = 3, we have

    Prob { f(ξ) ≥ √(d!/48) ‖F‖_2 } ≥ c_4.

Together with (10), we have

    Prob { f(η) ≥ τ_d ‖F‖_2 } = Prob { f(ξ/‖ξ‖_2) ≥ τ_d ‖F‖_2 }
        ≥ Prob { f(ξ) ≥ √(d!/48) ‖F‖_2, ‖ξ‖_2 ≤ √(5n) }
        ≥ Prob { f(ξ) ≥ √(d!/48) ‖F‖_2 } − Prob { ‖ξ‖_2 ≥ √(5n) }
        ≥ c_4 − e^{−n}. □

Before concluding this section, we remark that Proposition 3.6 can also be categorized as being of the type of (1), with f(X) = F • X, S = {X ∈ S^{n^d} : X is super-symmetric} and S_0 = {X ∈ S : rank(X) = 1}. Luo and Zhang [16] offered a constant lower bound for the probability that a homogeneous quartic form of a zero-mean multivariate normal distribution lies above its mean. In particular, by restricting the distributions to i.i.d. standard normals and the quartic form to be square-free, and applying Theorem 3.1 in the case d = 4, we obtain a constant bound for the probability that the quartic form lies above its mean plus some constant times the standard deviation. We may view this as a strengthening of the result in [16].
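Returning to the tightness claim of Example 3.4, the ratio bound x^T(I − E)x/‖I − E‖_1 ≤ 1/(n − 1) is easy to verify exhaustively for small n. The sketch below is a brute-force check, not part of the paper.

```python
from itertools import product

def example34_max_ratio(n):
    """max over x in B^n of x^T (I - E) x / ||I - E||_1.

    For F = I - E, x^T F x = n - (sum_i x_i)^2 and ||F||_1 = n^2 - n,
    since the diagonal of I - E is zero and the off-diagonal entries are -1.
    """
    norm1 = n * n - n
    best = max(n - sum(x) ** 2 for x in product((-1, 1), repeat=n))
    return best / norm1

ratios = {n: example34_max_ratio(n) for n in (3, 4, 5, 6)}
# Every ratio is at most 1/(n - 1), with equality for even n
# (where a balanced +-1 vector achieves sum_i x_i = 0).
```

The exhaustive search confirms that no sign vector beats the 1/(n − 1) ratio, which is exactly why τ_2 = Θ(1/n) cannot be improved.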

4 Applications in Polynomial Function Optimization

As discussed in the introduction, probability bounds of the form (1) have immediate applications in optimization. In particular, in this section we apply the bounds derived in Sections 2 and 3 to polynomial optimization problems. We derive polynomial-time randomized approximation algorithms whose approximation ratios improve the existing ones in the literature.

4.1 Polynomial Optimization in Binary Variables

The general unconstrained binary polynomial optimization model is max_{x∈B^n} p(x), where p(x) is a multivariate polynomial function. He, Li, and Zhang [8] proposed a polynomial-time randomized approximation algorithm with a relative performance ratio. When the polynomial p(x) is homogeneous, this problem has many applications in graph theory, e.g., the max-cut problem [6] and the matrix cut-norm problem [1]. In particular, we shall discuss two models in this subsection:

    (B_1) max  F(x^1, x^2, ..., x^d)
          s.t. x^k ∈ B^{n_k}, k = 1, 2, ..., d;

    (B_2) max  f(x) = F(x, x, ..., x)
          s.t. x ∈ B^n.

When d = 2, (B_1) is the problem of computing the matrix ∞ ↦ 1 norm, which is related to the so-called matrix cut-norm problem; the current best approximation ratio is 0.56, due to Alon and Naor [1]. When d = 3, (B_1) is a slight generalization of the model considered by Khot and Naor [11], where F was assumed to be super-symmetric and square-free. The approximation bound for the optimal value estimated in [11] is Ω(√(ln n / n)), while no polynomial-time procedure is offered there to actually find an approximate solution with that assured quality. Recently, He, Li, and Zhang [8] proposed a polynomial-time randomized approximation algorithm for (B_1) for any fixed degree d, with approximation performance ratio Ω(∏_{i=1}^{d−2} √(1/n_i)). The results in this subsection improve this approximation ratio for fixed d, thanks to Theorem 2.1.

Algorithm B1 (Randomized Algorithm for (B_1))

1. Sort and rename the dimensions if necessary, so that n_1 ≤ n_2 ≤ ··· ≤ n_d.
2. Randomly and independently generate ξ^k ∼ B^{n_k} for k = 1, 2, ..., d − 2.
3. Solve the bilinear form optimization problem
       max  F(ξ^1, ξ^2, ..., ξ^{d−2}, x^{d−1}, x^d)
       s.t. x^{d−1} ∈ B^{n_{d−1}}, x^d ∈ B^{n_d}
   using the deterministic algorithm of Alon and Naor [1], and get an approximate solution (ξ^{d−1}, ξ^d).
4. Compute the objective value F(ξ^1, ξ^2, ..., ξ^d).
5. Repeat the above procedure Φ(∏_{k=1}^d n_k) ln(1/ε) times and choose the solution whose objective value is the largest.
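A toy rendition of Algorithm B1 for d = 3 is sketched below. It is an illustration under simplifying assumptions, not the authors' implementation: for tiny dimensions the Alon–Naor rounding in Step 3 is replaced by exhaustive search over (y, z), which keeps the sketch self-contained.

```python
import random
from itertools import product

def trilinear(F, x, y, z):
    """F(x, y, z) = sum_{i,j,k} F[i][j][k] x_i y_j z_k."""
    return sum(F[i][j][k] * x[i] * y[j] * z[k]
               for i in range(len(x))
               for j in range(len(y))
               for k in range(len(z)))

def algorithm_B1_d3(F, repeats=20, seed=19):
    """Sample xi ~ B^{n1}, solve the remaining bilinear problem exactly
    (standing in for the Alon-Naor step), and keep the best value found."""
    rng = random.Random(seed)
    n1, n2, n3 = len(F), len(F[0]), len(F[0][0])
    best = None
    for _ in range(repeats):
        xi = [rng.choice((-1, 1)) for _ in range(n1)]
        for y in product((-1, 1), repeat=n2):
            for z in product((-1, 1), repeat=n3):
                v = trilinear(F, xi, y, z)
                if best is None or v > best[0]:
                    best = (v, xi, list(y), list(z))
    return best

rng = random.Random(8)
F = [[[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(3)]
     for _ in range(3)]
v, x, y, z = algorithm_B1_d3(F)
# v > 0 here: for any fixed xi, the bilinear maximum over (y, z) is
# nonnegative (flipping y flips the sign), and it vanishes only if the
# bilinear form is identically zero.
```

In a full implementation the inner exhaustive search would be replaced by the 0.03-approximation rounding of Alon and Naor, and the outer loop would be repeated Φ(∏_k n_k) ln(1/ε) times as in Step 5.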

Theorem 4.1 Algorithm B1 solves (B1) in polynomial time with probability at least $1-\epsilon$, and approximation performance ratio $\Theta\left(\prod_{i=1}^{d-2}\sqrt{\frac{\ln n_i}{n_i}}\right)$.

The proof is based on induction on d. Essentially, if an algorithm solves (B1) of order d − 1 approximately with an approximation ratio τ, then there is an algorithm that solves (B1) of order d approximately with an approximation ratio $\tau \cdot \Theta\left(\sqrt{\frac{\ln n}{n}}\right)$, where n is the dimension of the additional order.

Proof. Suppose $(\xi^1, \xi^2, \cdots, \xi^d)$ is the approximate solution generated by Algorithm B1. For $t = 2, 3, \ldots, d$, we treat $(\xi^1, \xi^2, \cdots, \xi^{d-t})$ as given parameters and define the following problems

$$(D_t) \quad \max \; F(\xi^1, \xi^2, \cdots, \xi^{d-t}, x^{d-t+1}, x^{d-t+2}, \cdots, x^d) \quad \text{s.t.} \; x^k \in B^{n_k}, \; k = d-t+1, d-t+2, \ldots, d,$$

whose optimal value is denoted by $v(D_t)$. By applying Algorithm B1 to $(D_t)$, $(\xi^{d-t+1}, \xi^{d-t+2}, \cdots, \xi^d)$ can be viewed as an approximate solution generated. In the remaining, we shall prove by induction that for each $t = 2, 3, \ldots, d$,

$$\operatorname{Prob}_{(\xi^{d-t+1}, \xi^{d-t+2}, \cdots, \xi^d)} \left\{ F(\xi^1, \xi^2, \cdots, \xi^d) \ge \Theta\left(\prod_{k=d-t+1}^{d-2} \sqrt{\frac{\ln n_k}{n_k}}\right) v(D_t) \right\} \ge \frac{1}{\Phi\left(\prod_{k=d-t+1}^{d-2} n_k\right)}. \quad (15)$$

In other words, $(\xi^{d-t+1}, \xi^{d-t+2}, \cdots, \xi^d)$ has a non-trivial probability to be a $\Theta\left(\prod_{k=d-t+1}^{d-2} \sqrt{\frac{\ln n_k}{n_k}}\right)$-approximate solution of $(D_t)$.

For the initial case t = 2, the deterministic algorithm of Alon and Naor [1] (Step 3 of Algorithm B1) guarantees a constant ratio, i.e., $F(\xi^1, \xi^2, \cdots, \xi^d) \ge \Theta(1)\, v(D_2)$, implying (15). Suppose now (15) holds for t − 1. To prove that (15) holds for t, we notice that $(\xi^1, \xi^2, \cdots, \xi^{d-t})$ are given fixed parameters. Denote $(z^{d-t+1}, z^{d-t+2}, \cdots, z^d)$ to be an optimal solution of $(D_t)$, and define the following events

$$E_3 = \left\{ z \in B^{n_{d-t+1}} \,\middle|\, F(\xi^1, \cdots, \xi^{d-t}, z, z^{d-t+2}, \cdots, z^d) \ge \Theta\left(\sqrt{\frac{\ln n_{d-t+1}}{n_{d-t+1}}}\right) v(D_t) \right\};$$

$$E_4 = \left\{ \xi^{d-t+1} \in E_3,\; \xi^{d-t+2} \in B^{n_{d-t+2}}, \cdots, \xi^d \in B^{n_d} \,\middle|\, F(\xi^1, \cdots, \xi^d) \ge \Theta\left(\prod_{k=d-t+2}^{d-2} \sqrt{\frac{\ln n_k}{n_k}}\right) F(\xi^1, \cdots, \xi^{d-t}, \xi^{d-t+1}, z^{d-t+2}, \cdots, z^d) \right\}.$$

Then we have

$$\operatorname{Prob}_{(\xi^{d-t+1}, \cdots, \xi^d)} \left\{ F(\xi^1, \cdots, \xi^d) \ge \Theta\left(\prod_{k=d-t+1}^{d-2} \sqrt{\frac{\ln n_k}{n_k}}\right) v(D_t) \right\}$$
$$\ge \operatorname{Prob}_{(\xi^{d-t+1}, \cdots, \xi^d)} \left\{ (\xi^{d-t+1}, \cdots, \xi^d) \in E_4 \,\middle|\, \xi^{d-t+1} \in E_3 \right\} \cdot \operatorname{Prob}_{\xi^{d-t+1}} \left\{ \xi^{d-t+1} \in E_3 \right\}. \quad (16)$$

To lower bound (16), first note that $(z^{d-t+2}, \cdots, z^d)$ is a feasible solution of $(D_{t-1})$, and so we have

$$\operatorname{Prob}_{(\xi^{d-t+1}, \cdots, \xi^d)} \left\{ (\xi^{d-t+1}, \cdots, \xi^d) \in E_4 \,\middle|\, \xi^{d-t+1} \in E_3 \right\}$$
$$\ge \operatorname{Prob}_{(\xi^{d-t+1}, \cdots, \xi^d)} \left\{ F(\xi^1, \cdots, \xi^d) \ge \Theta\left(\prod_{k=d-t+2}^{d-2} \sqrt{\frac{\ln n_k}{n_k}}\right) v(D_{t-1}) \,\middle|\, \xi^{d-t+1} \in E_3 \right\} \ge \frac{1}{\Phi\left(\prod_{k=d-t+2}^{d-2} n_k\right)},$$

where the last inequality is due to the induction assumption on t − 1, and Proposition 2.4 for the joint conditional probability with $\xi = (\xi^{d-t+2}, \cdots, \xi^d)$ and $\eta = \xi^{d-t+1}$. Secondly, we have

$$\operatorname{Prob}_{\xi^{d-t+1}} \left\{ \xi^{d-t+1} \in E_3 \right\} = \operatorname{Prob}_{\xi^{d-t+1}} \left\{ F(\xi^1, \cdots, \xi^{d-t+1}, z^{d-t+2}, \cdots, z^d) \ge \Theta\left(\sqrt{\frac{\ln n_{d-t+1}}{n_{d-t+1}}}\right) F(\xi^1, \cdots, \xi^{d-t}, z^{d-t+1}, \cdots, z^d) \right\}$$
$$= \operatorname{Prob}_{\xi^{d-t+1}} \left\{ F(\xi^1, \cdots, \xi^{d-t+1}, z^{d-t+2}, \cdots, z^d) \ge \Theta\left(\sqrt{\frac{\ln n_{d-t+1}}{n_{d-t+1}}}\right) \left\| F(\xi^1, \cdots, \xi^{d-t}, \cdot, z^{d-t+2}, \cdots, z^d) \right\|_1 \right\} \ge \frac{1}{\Phi(n_{d-t+1})},$$

where the last inequality is due to Theorem 2.1 for the case d = 1. With the above two facts, we can lower bound the right hand side of (16), and conclude

$$\operatorname{Prob}_{(\xi^{d-t+1}, \cdots, \xi^d)} \left\{ F(\xi^1, \cdots, \xi^d) \ge \Theta\left(\prod_{k=d-t+1}^{d-2} \sqrt{\frac{\ln n_k}{n_k}}\right) v(D_t) \right\} \ge \frac{1}{\Phi\left(\prod_{k=d-t+2}^{d-2} n_k\right)} \cdot \frac{1}{\Phi(n_{d-t+1})} = \frac{1}{\Phi\left(\prod_{k=d-t+1}^{d-2} n_k\right)}.$$

As $(D_d)$ is exactly (B1), Algorithm B1 solves (B1) approximately with probability at least $1-\epsilon$. □

We remark that in theory we may get a better approximate solution by using the 0.56-randomized algorithm in [1] to replace the subroutine in Step 3 of Algorithm B1, though that algorithm is quite complicated.

In a similar vein, we obtain approximation results for (B2).

Algorithm B2 (Randomized Algorithm for (B2))

1. Randomly generate $\xi \sim B^n$ and compute $f(\xi)$.
2. Repeat this procedure $\Theta\left(\frac{1}{c_3}\right) \ln \frac{1}{\epsilon}$ times and choose the solution whose objective value is the largest.
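Algorithm B2 is straightforward to sketch in code. In the snippet below the test tensor — a symmetrized Gaussian tensor with all repeated-index entries zeroed out so that it is square-free — is our own illustrative construction, not taken from the paper.

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(1)

def f_value(F, x):
    """Evaluate the homogeneous form f(x) = F(x, x, ..., x)."""
    val = F
    for _ in range(F.ndim):
        val = np.tensordot(x, val, axes=(0, 0))
    return float(val)

def algorithm_B2(F, trials):
    """Algorithm B2: best of `trials` uniform samples from {-1,1}^n."""
    best_val, best_x = -np.inf, None
    for _ in range(trials):
        x = rng.choice((-1.0, 1.0), size=F.shape[0])
        v = f_value(F, x)
        if v > best_val:
            best_val, best_x = v, x
    return best_val, best_x

# Build a random square-free super-symmetric tensor of order d = 4, n = 6.
n, d = 6, 4
A = rng.standard_normal((n,) * d)
F = sum(np.transpose(A, p) for p in itertools.permutations(range(d)))
F /= math.factorial(d)
for idx in np.ndindex(F.shape):      # square-free: kill repeated indices
    if len(set(idx)) < d:
        F[idx] = 0.0

best_val, best_x = algorithm_B2(F, trials=200)
norm1 = np.abs(F).sum()              # ||F||_1 upper-bounds the optimal value
print(best_val, n ** (-d / 2) * norm1)
```

The second printed quantity is the $\Theta(n^{-d/2})\|F\|_1$ benchmark from Theorem 4.2; since every sample satisfies $|f(\xi)| \le \|F\|_1$, the best value never exceeds the trivial upper bound.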


The model (B2) has been studied extensively in the quadratic case, i.e., d = 2. Goemans and Williamson [6] gave a 0.878-approximation ratio for the case of F being the Laplacian of a given graph. Later, Nesterov [18] gave a 0.63-approximation ratio for the case of F being positive semidefinite. For diagonal-free F, the best possible approximation ratio is $\Omega(1/\ln n)$, due to Charikar and Wirth [4], which is also known to be tight. For d = 3 and square-free F, Khot and Naor [11] gave an $\Omega\left(\sqrt{\frac{\ln n}{n}}\right)$-approximation bound for the optimal value of (B2). For general d, He, Li, and Zhang [8] proposed polynomial-time randomized approximation algorithms with approximation ratio $\Omega\left(n^{-\frac{d-2}{2}}\right)$ when F is square-free, for odd d; for even d, however, they could only propose a relative approximation ratio $\Omega\left(n^{-\frac{d-2}{2}}\right)$. Now, by virtue of Theorem 3.1 (more precisely Proposition 3.3), since $\|F\|_1$ is an upper bound for the optimal value of (B2), absolute approximation ratios are also established when d is even, as shown below.

Theorem 4.2 When d is even and F is square-free, Algorithm B2 solves (B2) in polynomial time with probability at least $1-\epsilon$, and approximation performance ratio $\Theta\left(n^{-\frac{d}{2}}\right)$.

4.2 Polynomial Optimization over Hyperspheres

Polynomial function optimization over hyperspheres has many applications in biomedical engineering, material sciences, numerical linear algebra, among many others. Readers are referred to [7, 13, 14] and the references therein for more information. Let us consider:

$$(S_1) \quad \max \; F(x^1, x^2, \cdots, x^d) \quad \text{s.t.} \; x^k \in S^{n_k}, \; k = 1, 2, \ldots, d;$$
$$(S_2) \quad \max \; f(x) = F(\underbrace{x, x, \cdots, x}_{d}) \quad \text{s.t.} \; x \in S^n.$$

When d = 2, (S1) and (S2) reduce to computing the matrix spectral norm and can be solved in polynomial time. However, they are NP-hard when d ≥ 3. For general d, (S2) is to compute the largest eigenvalue of the tensor F. As far as approximation algorithms are concerned, He, Li, and Zhang [7] proposed polynomial-time approximation algorithms for (S1) with approximation ratio $\Omega\left(\prod_{i=1}^{d-2}\sqrt{\frac{1}{n_i}}\right)$. In [7], a generic linkage relating (S2) and (S1) is established. This linkage enables one to get a solution with the same approximation ratio (a relative ratio for even d, though) for (S2) whenever a solution with an approximation ratio for (S1) is available. Therefore, let us now focus on (S1). For (S1), recently So [20] improved the result of [7] from $\Omega\left(\prod_{i=1}^{d-2}\sqrt{\frac{1}{n_i}}\right)$ to $\Omega\left(\prod_{i=1}^{d-2}\sqrt{\frac{\ln n_i}{n_i}}\right)$. Unfortunately, the method in [20] relies on the equivalence (polynomial-time reduction) between convex optimization and membership oracle queries using the ellipsoid method, and it is computationally impractical. On the other hand, the algorithm that we present below is straightforward, while retaining the same quality of approximation as the result in [20].

Algorithm S1 (Randomized Algorithm for (S1))

1. Sort and rename the dimensions if necessary, so as to satisfy $n_1 \le n_2 \le \cdots \le n_d$.
2. Randomly and independently generate $\eta^k \sim S^{n_k}$ for $k = 1, 2, \ldots, d-2$.
3. Solve the largest singular value problem
$$\max \; F(\eta^1, \eta^2, \cdots, \eta^{d-2}, x^{d-1}, x^d) \quad \text{s.t.} \; x^{d-1} \in S^{n_{d-1}}, \; x^d \in S^{n_d},$$
and get its optimal solution $(\eta^{d-1}, \eta^d)$.
4. Compute the objective value $F(\eta^1, \eta^2, \cdots, \eta^d)$.
5. Repeat the above procedures $\Phi\left(\prod_{k=1}^{d} n_k\right) \ln \frac{1}{\epsilon}$ times and choose the solution whose objective value is the largest.
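A minimal sketch of Algorithm S1: spherical samples are drawn as normalized Gaussians, and the Step 3 subproblem is solved exactly by an SVD, since the largest singular value is precisely the maximum of the contracted bilinear form over two unit spheres.

```python
import numpy as np

rng = np.random.default_rng(2)

def random_unit(n):
    """Uniform sample from the unit sphere (normalized Gaussian)."""
    v = rng.standard_normal(n)
    return v / np.linalg.norm(v)

def algorithm_S1(F, trials=100):
    """Algorithm S1: random spherical sampling plus an exact SVD step."""
    d = F.ndim
    best_val, best_sol = -np.inf, None
    for _ in range(trials):
        etas = [random_unit(nk) for nk in F.shape[:d - 2]]
        M = F
        for e in etas:           # contract the first d-2 indices
            M = np.tensordot(e, M, axes=(0, 0))
        U, s, Vt = np.linalg.svd(M)
        val = s[0]               # largest singular value = max bilinear form
        if val > best_val:
            best_val = val
            best_sol = etas + [U[:, 0], Vt[0]]
    return best_val, best_sol

F = rng.standard_normal((4, 5, 6))
val, sol = algorithm_S1(F)
print(val)
```

The returned value always lies between 0 and the Frobenius norm $\|F\|_2$, which is the natural upper bound for (S1).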

Theorem 4.3 Algorithm S1 solves (S1) in polynomial time with probability at least $1-\epsilon$, and approximation performance ratio $\Theta\left(\prod_{k=1}^{d-2}\sqrt{\frac{\ln n_k}{n_k}}\right)$.

The proof is similar to that of Theorem 4.1, and is omitted here.

4.3 Polynomial Function Mixed Integer Programming

This last subsection deals with the optimization of polynomial functions in binary variables and variables with spherical constraints mixed up. Such problems have applications in the matrix combinatorial problem and the vector-valued maximum cut problem; see, e.g., [8]. In [8], the authors considered

$$(M_1) \quad \max \; F(x^1, x^2, \cdots, x^d, y^1, y^2, \cdots, y^{d'}) \quad \text{s.t.} \; x^k \in B^{n_k}, \; k = 1, 2, \ldots, d; \; y^\ell \in S^{m_\ell}, \; \ell = 1, 2, \ldots, d';$$
$$(M_2) \quad \max \; F(\underbrace{x, x, \cdots, x}_{d}, \underbrace{y, y, \cdots, y}_{d'}) \quad \text{s.t.} \; x \in B^n, \; y \in S^m;$$
$$(M_3) \quad \max \; F(\underbrace{x^1, x^1, \cdots, x^1}_{d_1}, \cdots, \underbrace{x^s, x^s, \cdots, x^s}_{d_s}, \underbrace{y^1, y^1, \cdots, y^1}_{d'_1}, \cdots, \underbrace{y^t, y^t, \cdots, y^t}_{d'_t})$$
$$\qquad\qquad \text{s.t.} \; x^k \in B^{n_k}, \; k = 1, 2, \ldots, s; \; y^\ell \in S^{m_\ell}, \; \ell = 1, 2, \ldots, t;$$

and proposed polynomial-time randomized approximation algorithms when the tensor F is square-free in x (the binary part). In fact, (M3) is a generalization of (M1) and (M2), and it can be regarded as a generalization of (B1), (B2), (S1) and (S2) as well. Essentially, the approximation results can be carried over by using the linkage we mentioned earlier (see [7]) once an approximation result for (M1) is established. In fact, (M1) plays the role of a cornerstone for the whole construction.

The approximation ratio for (M1) derived in [8] is $\Omega\left(\prod_{k=1}^{d-1}\sqrt{\frac{1}{n_k}} \prod_{\ell=1}^{d'-1}\sqrt{\frac{1}{m_\ell}}\right)$. The results in Section 2 lead to the following improvements:

Theorem 4.4 Denote N to be the set of the d + d′ − 2 smallest numbers in $\{n_1, \cdots, n_d, m_1, \cdots, m_{d'}\}$. (M1) admits a polynomial-time randomized approximation algorithm with approximation performance ratio $\Theta\left(\prod_{u \in N}\sqrt{\frac{\ln u}{u}}\right)$.

The method for solving (M1) is similar to that for solving (B1) and (S1), and we shall not repeat the detailed discussion. Basically, we conduct multiple trials to get a good solution with high probability. For each of the d + d′ − 2 numbers in N, if it is the dimension of a binary constraint, the algorithm uniformly picks a vector in the discrete hypercube; and if it is the dimension of a spherical constraint, the algorithm uniformly picks a vector in the hypersphere. All the randomized procedures are carried out independently of each other. As the first inductive step, we then come across a bilinear function optimization problem in one of three possible cases:

• $\max_{x \in B^n, y \in B^m} x^{\mathrm{T}} F y$, which can be solved by the algorithm proposed in Alon and Naor [1] to get a solution with a guaranteed constant approximation ratio;
• $\max_{x \in B^n, y \in S^m} x^{\mathrm{T}} F y$, which can be solved by the algorithm proposed in He, Li, and Zhang [8] to get a solution with a guaranteed constant approximation ratio;
• $\max_{x \in S^n, y \in S^m} x^{\mathrm{T}} F y$, which can be solved by computing the largest singular value of the matrix F.
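The mixed binary/spherical base case has a convenient closed-form inner step, which makes a small illustration easy: for fixed x, $\max_{y \in S^m} x^{\mathrm{T}} F y$ is attained at $y = F^{\mathrm{T}}x / \|F^{\mathrm{T}}x\|_2$ with value $\|F^{\mathrm{T}}x\|_2$. The sketch below solves a tiny instance exactly by enumerating x; it stands in for the constant-ratio algorithm of [8], which we do not reproduce here.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)

def mixed_bilinear_max(F):
    """max of x^T F y over x in {-1,1}^n, y on the unit sphere (tiny n only).

    For fixed x, the inner maximum over the sphere is attained at
    y = F^T x / ||F^T x||_2, with value ||F^T x||_2, so the problem
    reduces to maximizing ||F^T x||_2 over the hypercube.
    """
    n, m = F.shape
    best_val, best_x = -np.inf, None
    for x in itertools.product((-1.0, 1.0), repeat=n):
        x = np.array(x)
        val = np.linalg.norm(F.T @ x)
        if val > best_val:
            best_val, best_x = val, x
    y = F.T @ best_x / best_val
    return best_val, best_x, y

F = rng.standard_normal((8, 5))
val, x, y = mixed_bilinear_max(F)
print(val)
```

A simple sanity check: averaging $x^{\mathrm{T}} F F^{\mathrm{T}} x$ over all sign vectors gives $\operatorname{tr}(FF^{\mathrm{T}}) = \|F\|_2^2$, so the exact optimum is always at least the Frobenius norm of F.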

5 Summary and Concluding Remark

To put the paper in perspective, in this section let us highlight and briefly summarize our new findings. We set out to explore probability bounds of the form $\operatorname{Prob}_{\xi \sim S_0}\{f(\xi) \ge \tau \max_{x \in S} f(x)\} \ge \theta$ with $S_0 \subseteq S$, denoted by (1) in this paper. The function f in our discussion is either a multilinear tensor form or a homogeneous polynomial function. To enable probability bounds of the form (1), we need some structure in place. In particular, we consider the choices of the structural sets S0 and S respectively as follows:

1. Consider $S = B^{n_1 \times n_2 \times \cdots \times n_d}$ and $S_0 = \{X \in S \mid \mathrm{rank}(X) = 1\}$, and $F \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_d}$. If we draw ξ uniformly over S0, then

$$\operatorname{Prob}\left\{ F \bullet \xi \ge \sigma_d \max_{X \in S} F \bullet X = \sigma_d \|F\|_1 \right\} \ge \frac{1}{\Phi\left(\prod_{i=1}^{d} n_i\right)},$$

where $\sigma_d = \Omega\left(\sqrt{\frac{\ln \prod_{i=1}^d n_i}{\prod_{i=1}^d n_i}}\right)$, and Φ(n) denotes a function for which there are positive universal constants α, β and $n_0$ such that $\Phi(n) \le \alpha n^{\beta}$ for all $n \ge n_0$. Moreover, $\sigma_d$ cannot be improved if the bound is required to be at least $\frac{1}{\Phi\left(\prod_{i=1}^d n_i\right)}$.

2. Consider $S = \{X \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_d} \mid X \bullet X = 1\}$ and $S_0 = \{X \in S \mid \mathrm{rank}(X) = 1\}$, and $F \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_d}$. If we draw ξ uniformly over S0, then

$$\operatorname{Prob}\left\{ F \bullet \xi \ge \sigma_d \max_{X \in S} F \bullet X = \sigma_d \|F\|_2 \right\} \ge \frac{1}{\Phi\left(\prod_{i=1}^{d} n_i\right)}.$$

Moreover, $\sigma_d$ cannot be improved if the bound is required to be at least $\frac{1}{\Phi\left(\prod_{i=1}^d n_i\right)}$.

3. Consider $S = \{X \in B^{n^d} \mid X \text{ is super-symmetric}\}$ and $S_0 = \{X \in S \mid \mathrm{rank}(X) = 1\}$, and a square-free super-symmetric tensor $F \in \mathbb{R}^{n^d}$. If we draw ξ uniformly over S0, then there exists a constant c > 0 such that

$$\operatorname{Prob}\left\{ F \bullet \xi \ge \tau_d \max_{X \in S} F \bullet X = \tau_d \|F\|_1 \right\} \ge c,$$

where $\tau_d = \Omega\left(n^{-\frac{d}{2}}\right)$. Moreover, when d = 2 or d = 4, $\tau_d$ cannot be improved for any positive bound.

4. Consider $S = \{X \in \mathbb{R}^{n^d} \mid X \bullet X = 1, X \text{ is super-symmetric}\}$ and $S_0 = \{X \in S \mid \mathrm{rank}(X) = 1\}$, and a square-free super-symmetric tensor $F \in \mathbb{R}^{n^d}$. If we draw ξ uniformly over S0, then there exists a constant c > 0 such that

$$\operatorname{Prob}\left\{ F \bullet \xi \ge \tau_d \max_{X \in S} F \bullet X = \tau_d \|F\|_2 \right\} \ge c.$$

Applying the results straightforwardly, we obtain polynomial-time randomized approximation algorithms for solving various polynomial optimization models with high probability. Specifically, our results include:

1. $\Omega\left(\prod_{i=1}^{d-2}\sqrt{\frac{\ln n_i}{n_i}}\right)$-approximation ratio for
$$\max \; F(x^1, x^2, \cdots, x^d) \quad \text{s.t.} \; x^k \in B^{n_k}, \; k = 1, 2, \ldots, d.$$
This ratio improves that of $\Omega\left(\prod_{i=1}^{d-2}\sqrt{\frac{1}{n_i}}\right)$ proposed by He, Li, and Zhang [8].

2. $\Omega\left(n^{-\frac{d}{2}}\right)$-approximation ratio for
$$\max \; f(x) := F(\underbrace{x, x, \cdots, x}_{d}) \quad \text{s.t.} \; x \in B^n,$$
where f(x) is a homogeneous polynomial function with the tensor F being square-free. This ratio is new. In the literature, when d ≥ 4 and is even, the only previous approximation ratio for this model was in He, Li, and Zhang [8]; however, the ratio there is a relative one.

3. $\Omega\left(\prod_{i=1}^{d-2}\sqrt{\frac{\ln n_i}{n_i}}\right)$-approximation ratio for
$$\max \; F(x^1, x^2, \cdots, x^d) \quad \text{s.t.} \; x^k \in S^{n_k}, \; k = 1, 2, \ldots, d.$$
This improves the $\Omega\left(\prod_{i=1}^{d-2}\sqrt{\frac{1}{n_i}}\right)$ ratio in [7], and achieves the same theoretical bound as in So [20]. However, the algorithm proposed here is straightforward to implement, while the one in [20] is very involved.

4. $\Omega\left(\prod_{u \in N}\sqrt{\frac{\ln u}{u}}\right)$-approximation ratio for
$$\max \; F(x^1, x^2, \cdots, x^d, y^1, y^2, \cdots, y^{d'}) \quad \text{s.t.} \; x^k \in B^{n_k}, \; k = 1, 2, \ldots, d, \; y^\ell \in S^{m_\ell}, \; \ell = 1, 2, \ldots, d',$$
where N is the set of the d + d′ − 2 smallest numbers in $\{n_1, \cdots, n_d, m_1, \cdots, m_{d'}\}$. This ratio improves that of $\Omega\left(\prod_{k=1}^{d-1}\sqrt{\frac{1}{n_k}} \prod_{\ell=1}^{d'-1}\sqrt{\frac{1}{m_\ell}}\right)$ proposed in [8].

Before concluding the whole paper, let us finally remark that the results in this paper are also connected to Khintchine's inequality [12], which asserts that if $\xi = (\xi_1, \xi_2, \cdots, \xi_n)^{\mathrm{T}}$ are i.i.d. symmetric Bernoulli random variables, then for any p > 0, there exist constants $b_p$ and $c_p$ such that for any vector $a \in \mathbb{R}^n$,

$$b_p \|a\|_2 \le \left( \operatorname{E} \left| \xi^{\mathrm{T}} a \right|^p \right)^{\frac{1}{p}} \le c_p \|a\|_2.$$

(17)
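The p = 1 case of (17) is easy to probe numerically. In the sketch below the constants are only checked empirically; the upper bound $c_1 = 1$ follows from the Cauchy–Schwarz inequality, and $1/\sqrt{2}$ is the classical sharp value of $b_1$ (a result of Szarek, not proved in this paper).

```python
import numpy as np

rng = np.random.default_rng(4)

# Monte Carlo check of Khintchine's inequality (17) for p = 1:
# b_1 ||a||_2 <= E|xi^T a| <= c_1 ||a||_2, with c_1 = 1 and b_1 = 1/sqrt(2).
n, samples = 10, 200_000
a = rng.standard_normal(n)
xi = rng.choice((-1.0, 1.0), size=(samples, n))  # i.i.d. symmetric Bernoulli
est = np.abs(xi @ a).mean()                      # estimates E|xi^T a|
norm2 = np.linalg.norm(a)
print(est / norm2)   # should land in [1/sqrt(2), 1], typically near 0.8
```

The estimated ratio hovers near $\sqrt{2/\pi} \approx 0.8$ for a generic a, consistent with the central limit behavior of $\xi^{\mathrm{T}} a$.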

Since Khintchine’s result in early 1923, much effort has been on determining the sharp values of the constants bp and cp or some sort of extensions of Khintchine’s inequality. In particular, similar inequality in the matrix case (the vector a is replaced by a matrix) has been established. Recently, So [21] proved a conjecture proposed by Nemirovski through matrix version Khintchine’s inequality. Observe that Lemma 2.2 is exactly the lower bound part of Khintchine’s inequality with p = 1. Furthermore our new probability inequality (Theorem 3.1) also implies the lower bound part of Khintchine’s inequality where the random variables are endowed with some dependent structures. d

Corollary 5.1 Suppose F ∈ Rn is a square-free super-symmetric tensor of order d, and Ξ = ξ ⊗ ξ ⊗ · · · ⊗ ξ , where ξ = (ξ1 , ξ2 , · · · , ξn )T are independent random variables with E ξi = 0, E ξi2 = | {z } d

23

1, E ξi4

 ≤ κ. Then for any p > 0, there exists a constant bp =

√ 2 3−3 p p 4p+d 91+d d2 (d!)2− 2 κ(d+ 2 )

1

p

, such that

1

bp kF k2 ≤ (E |F • Ξ|p ) p . To see why, by Theorem 3.1 we have Prob √

( p

E |F • Ξ| ≥ Prob

p

|F • Ξ| ≥

d!kF k2 √ 4 κ

n |F • Ξ| ≥

√ o d!kF k2 √ 4 κ



!p

!p ) ·

d!kF k2 √ 4 κ

(18) ≥

√ 2 3−3 9d2 (d!)2 36d κd

. Therefore

√ 2 3−3 ≥ 2 9d (d!)2 36d κd



d!kF k2 √ 4 κ

!p ,

which implies (18).

References

[1] N. Alon and A. Naor, Approximating the Cut-Norm via Grothendieck's Inequality, SIAM Journal on Computing, 35, 787–803, 2006.

[2] A. Ben-Tal, A. Nemirovskii, and C. Roos, Robust Solutions of Uncertain Quadratic and Conic-Quadratic Problems, SIAM Journal on Optimization, 13, 535–560, 2002.

[3] A. Brieden, P. Gritzmann, R. Kannan, V. Klee, L. Lovász, and M. Simonovits, Deterministic and Randomized Polynomial-Time Approximation of Radii, Mathematika, 48, 63–105, 2003.

[4] M. Charikar and A. Wirth, Maximizing Quadratic Programs: Extending Grothendieck's Inequality, The 45th Annual IEEE Symposium on Foundations of Computer Science, 54–60, 2004.

[5] B. Grünbaum, Partitions of Mass-Distributions and of Convex Bodies by Hyperplanes, Pacific Journal of Mathematics, 10, 1257–1271, 1960.

[6] M.X. Goemans and D.P. Williamson, Improved Approximation Algorithms for Maximum Cut and Satisfiability Problems using Semidefinite Programming, Journal of the ACM, 42, 1115–1145, 1995.

[7] S. He, Z. Li, and S. Zhang, Approximation Algorithms for Homogeneous Polynomial Optimization with Quadratic Constraints, Mathematical Programming, 125, 353–383, 2010.

[8] S. He, Z. Li, and S. Zhang, Approximation Algorithms for Discrete Polynomial Optimization, Technical Report SEEM2010-01, Department of Systems Engineering & Engineering Management, The Chinese University of Hong Kong, 2010.

[9] S. He, Z.-Q. Luo, J. Nie, and S. Zhang, Semidefinite Relaxation Bounds for Indefinite Homogeneous Quadratic Optimization, SIAM Journal on Optimization, 19, 503–523, 2008.

[10] S. He, J. Zhang, and S. Zhang, Bounding Probability of Small Deviation: A Fourth Moment Approach, Mathematics of Operations Research, 35, 208–232, 2010.

[11] S. Khot and A. Naor, Linear Equations Modulo 2 and the L1 Diameter of Convex Bodies, The 48th Annual IEEE Symposium on Foundations of Computer Science, 318–328, 2007.

[12] A. Khintchine, Über Dyadische Brüche, Mathematische Zeitschrift, 18, 109–116, 1923.

[13] Z. Li, Polynomial Optimization Problems—Approximation Algorithms and Applications, Ph.D. Thesis, The Chinese University of Hong Kong, June 2011.

[14] Z. Li, S. He, and S. Zhang, Approximation Methods for Polynomial Optimization: Models, Algorithms, and Applications, SpringerBriefs in Optimization, Springer, New York, 2012.

[15] B. Laurent and P. Massart, Adaptive Estimation of a Quadratic Functional by Model Selection, The Annals of Statistics, 28, 1302–1338, 2000.

[16] Z.-Q. Luo and S. Zhang, A Semidefinite Relaxation Scheme for Multivariate Quartic Polynomial Optimization with Quadratic Constraints, SIAM Journal on Optimization, 20, 1716–1736, 2010.

[17] Z.-Q. Luo, N.D. Sidiropoulos, P. Tseng, and S. Zhang, Approximation Bounds for Quadratic Optimization with Homogeneous Quadratic Constraints, SIAM Journal on Optimization, 18, 1–28, 2007.

[18] Yu. Nesterov, Semidefinite Relaxation and Nonconvex Quadratic Optimization, Optimization Methods and Software, 9, 141–160, 1998.

[19] R. Paley and A. Zygmund, A Note on Analytic Functions in the Unit Circle, Mathematical Proceedings of the Cambridge Philosophical Society, 28, 266–272, 1932.

[20] A.M.-C. So, Deterministic Approximation Algorithms for Sphere Constrained Homogeneous Polynomial Optimization Problems, Mathematical Programming, 129, 357–382, 2011.

[21] A.M.-C. So, Moment Inequalities for Sums of Random Matrices and Their Applications in Optimization, Mathematical Programming, 130, 125–151, 2011.

A Proofs of Theorem 3.1 and Proposition 3.2

The whole appendix is devoted to the proof of Theorem 3.1, in the course of which Proposition 3.2 is proved as a byproduct. First, we observe that $\|F\|_2 \ge n^{-\frac{d}{2}} \|F\|_1$ since $F \in \mathbb{R}^{n^d}$, and thus (14) can be immediately derived from (13). Hence we shall focus on (13).

Furthermore, we observe that Theorem 3.1 is almost equivalent to the fact that any homogeneous polynomial function of independent random variables with bounded kurtosis should also have a bounded kurtosis itself, formulated as follows:

Theorem A.1 Let $F \in \mathbb{R}^{n^d}$ be a square-free super-symmetric tensor of order d, and let $f(x) = F(x, x, \cdots, x)$ be the homogeneous polynomial function induced by F. If $\xi = (\xi_1, \xi_2, \cdots, \xi_n)^{\mathrm{T}}$ are independent random variables with $\operatorname{E} \xi_i = 0$, $\operatorname{E} \xi_i^2 = 1$, $\operatorname{E} \xi_i^4 \le \kappa$ for all $i = 1, 2, \ldots, n$, then

$$\operatorname{E} f^4(\xi) \le d^2 (d!)^2\, 36^d \kappa^d \left( \operatorname{E} f^2(\xi) \right)^2.$$

Before proving the theorem, let us note another important fact required in the proof: if a random variable has a bounded kurtosis, then with a constant probability it exceeds its mean plus some constant proportion of its standard deviation.

Lemma A.2 For any random variable z with its kurtosis upper bounded by $\kappa > 0$, namely $\operatorname{E}(z - \operatorname{E} z)^4 \le \kappa \left( \operatorname{E}(z - \operatorname{E} z)^2 \right)^2$, we have

$$\operatorname{Prob}\left\{ z \ge \operatorname{E} z + \frac{\sqrt{\operatorname{Var}(z)}}{4\sqrt{\kappa}} \right\} \ge \frac{2\sqrt{3}-3}{9\kappa}.$$

Proof. By normalizing z, i.e., letting $y = (z - \operatorname{E} z)/\sqrt{\operatorname{Var}(z)}$, we have $\operatorname{E} y = 0$, $\operatorname{E} y^2 = 1$ and $\operatorname{E} y^4 \le \kappa$. Thus we only need to show $\operatorname{Prob}\left\{ y \ge \frac{1}{4\sqrt{\kappa}} \right\} \ge \frac{2\sqrt{3}-3}{9\kappa}$.

Denote $x = t - y$, where the constant $t > 0$ will be decided later. We have

$$\operatorname{E} x = t - \operatorname{E} y = t, \qquad \operatorname{E} x^2 = t^2 - 2t \operatorname{E} y + \operatorname{E} y^2 = t^2 + 1,$$
$$\operatorname{E} x^4 = t^4 - 4t^3 \operatorname{E} y + 6t^2 \operatorname{E} y^2 - 4t \operatorname{E} y^3 + \operatorname{E} y^4 \le t^4 + 6t^2 + 4t\sqrt{\kappa} + \kappa,$$

where $(\operatorname{E} y^3)^2 \le \operatorname{E} y^2 \operatorname{E} y^4 \le \kappa$ is applied in the last inequality. By applying Theorem 2.3 of [10], for any constant $v > 0$,

$$\operatorname{Prob}\{y \ge t\} = \operatorname{Prob}\{x \le 0\} \ge \frac{4(2\sqrt{3}-3)}{9} \left( -\frac{2\operatorname{E} x}{v} + \frac{3\operatorname{E} x^2}{v^2} - \frac{\operatorname{E} x^4}{v^4} \right)$$
$$\ge \frac{4(2\sqrt{3}-3)}{9} \left( -\frac{2t}{v} + \frac{3t^2+3}{v^2} - \frac{t^4 + 6t^2 + 4t\sqrt{\kappa} + \kappa}{v^4} \right)$$
$$= \frac{4(2\sqrt{3}-3)}{9} \left( -\frac{1}{2\kappa} + \frac{3}{16\kappa^2} + \frac{3}{\kappa} - \frac{1}{256\kappa^4} - \frac{6}{16\kappa^3} - \frac{1}{\kappa^2} - \frac{1}{\kappa} \right) \qquad \left( \text{letting } t = \frac{1}{4\sqrt{\kappa}} \text{ and } v = \sqrt{\kappa} \right)$$
$$= \frac{4(2\sqrt{3}-3)}{9} \left( \frac{24}{16\kappa} - \frac{13}{16\kappa^2} - \frac{6}{16\kappa^3} - \frac{1}{256\kappa^4} \right)$$
$$\ge \frac{4(2\sqrt{3}-3)}{9} \cdot \frac{4}{16\kappa} = \frac{2\sqrt{3}-3}{9\kappa} \qquad \left( \text{noticing } \kappa \ge \operatorname{E} y^4 \ge (\operatorname{E} y^2)^2 = 1 \right).$$
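Lemma A.2 can be probed numerically. The sketch below uses an exponential distribution — our illustrative choice; its centered fourth moment is 9, so κ ≈ 9 — and checks that the empirical tail frequency clears the guaranteed bound.

```python
import numpy as np

rng = np.random.default_rng(5)

# Numerical probe of Lemma A.2 with an exponential(1) sample:
# E z = 1, Var(z) = 1, E(z - Ez)^4 = 9, so the kurtosis bound is kappa = 9.
z = rng.exponential(1.0, size=500_000)
mu, sigma = z.mean(), z.std()
kappa = ((z - mu) ** 4).mean() / sigma ** 4    # empirical kurtosis, about 9
threshold = mu + sigma / (4 * np.sqrt(kappa))  # E z + sqrt(Var)/(4 sqrt(kappa))
freq = (z >= threshold).mean()                  # empirical tail probability
bound = (2 * np.sqrt(3) - 3) / (9 * kappa)      # Lemma A.2 lower bound
print(freq, bound)
```

For this distribution the actual tail probability (around 0.34) is far above the guaranteed bound (around 0.006), which is expected: the lemma must cover worst-case distributions of the given kurtosis.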

□

Let us now prove Theorem A.1. We start with a special case where d = 2 and ξ are symmetric Bernoulli random variables, which helps to illustrate the ideas underlying the proof for the general case.

Proposition A.3 Let $F \in \mathbb{R}^{n \times n}$ be a diagonal-free symmetric matrix, and let $f(x) = x^{\mathrm{T}} F x$. If $\xi \sim B^n$, then $\operatorname{E} f^4(\xi) \le 15 \left( \operatorname{E} f^2(\xi) \right)^2$.

Proof. Rewrite $y = f(\xi) = \sum_{\sigma} a_{\sigma} \xi^{\sigma}$, where $\sigma \in \Pi := \{(1,2), (1,3), \cdots, (n-1,n)\}$ and $\xi^{(i,j)} := \xi_i \xi_j$. Since $\operatorname{E} \xi_i^d = 0$ for odd d and $\operatorname{E} \xi_i^d = 1$ for even d, the non-zero terms in $\operatorname{E} y^4$ are all of the forms $a_{ij} a_{ij} a_{ij} a_{ij}$, $a_{ij} a_{ij} a_{ik} a_{ik}$, $a_{ij} a_{ij} a_{k\ell} a_{k\ell}$ and $a_{ij} a_{ik} a_{j\ell} a_{k\ell}$, where we assume $i, j, k, \ell$ are distinct. Let us count the different types of terms.

Type A: $a_{ij} a_{ij} a_{ij} a_{ij}$. The total number of such terms is $\binom{n}{2}$;

Type B: $a_{ij} a_{ij} a_{ik} a_{ik}$. The total number of such terms is $n \binom{n-1}{2} \binom{4}{2}$;

Type C: $a_{ij} a_{ij} a_{k\ell} a_{k\ell}$. The total number of such terms is $\binom{n}{4} \cdot 3 \cdot \binom{4}{2}$;

Type D: $a_{ij} a_{ik} a_{j\ell} a_{k\ell}$. The total number of such terms is $\binom{n}{4} \cdot 3 \cdot 4!$.
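The type counts can be cross-checked mechanically; the closed form $n(n-1)(n-2)(15n-33)/4$ used later in the proof is the sum of the Type B, C and D counts. A quick verification:

```python
from math import comb, factorial

# Cross-check the Type B + C + D count against the closed form
# n(n-1)(n-2)(15n-33)/4 used in the proof of Proposition A.3.
for n in range(4, 40):
    type_b = n * comb(n - 1, 2) * comb(4, 2)
    type_c = comb(n, 4) * 3 * comb(4, 2)
    type_d = comb(n, 4) * 3 * factorial(4)
    closed = n * (n - 1) * (n - 2) * (15 * n - 33) // 4
    assert type_b + type_c + type_d == closed
print("counts match")
```

(The integer division is exact here: $n(n-1)(n-2)(15n-33)$ is always divisible by 4.)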

Notice that

$$\left( \operatorname{E} y^2 \right)^2 = \left( \sum_{\sigma \in \Pi} a_{\sigma}^2 \right)^2 = \sum_{\sigma \in \Pi} a_{\sigma}^4 + 2 \sum_{\sigma_1 \ne \sigma_2} a_{\sigma_1}^2 a_{\sigma_2}^2 =: \text{`Part I'} + \text{`Part II'}.$$

Type A terms constitute exactly `Part I' in $(\operatorname{E} y^2)^2$; each term of Types B and C appears exactly once in `Part II' of $(\operatorname{E} y^2)^2$; each term of Type D can be bounded by an average of two terms in `Part II' of $(\operatorname{E} y^2)^2$, since $a_{ij} a_{ik} a_{j\ell} a_{k\ell} \le (a_{ij}^2 a_{k\ell}^2 + a_{ik}^2 a_{j\ell}^2)/2$. The number of terms of Types B, C and D is

$$n \binom{n-1}{2} \binom{4}{2} + \binom{n}{4} \cdot 3 \cdot \binom{4}{2} + \binom{n}{4} \cdot 3 \cdot 4! = \frac{n(n-1)(n-2)(15n-33)}{4} =: N,$$

and there are

$$\binom{n}{2} \cdot \left( \binom{n}{2} - 1 \right) = \frac{n(n-1)(n-2)(n+1)}{4} =: N'$$

terms in `Part II' of $(\operatorname{E} y^2)^2$. Clearly $N \le 15 N'$, which leads to $\operatorname{E} y^4 \le 15 (\operatorname{E} y^2)^2$. □
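Proposition A.3 itself can be verified exhaustively for small n, since the expectation over $\xi \sim B^n$ is just an average over all $2^n$ sign vectors. A sketch, with a random diagonal-free symmetric matrix:

```python
import itertools
import numpy as np

rng = np.random.default_rng(6)

# Exhaustive check of Proposition A.3 for a random diagonal-free
# symmetric matrix: E f(xi)^4 <= 15 (E f(xi)^2)^2 over xi ~ B^n.
n = 8
F = rng.standard_normal((n, n))
F = (F + F.T) / 2
np.fill_diagonal(F, 0.0)          # diagonal-free

m2 = m4 = 0.0
for signs in itertools.product((-1.0, 1.0), repeat=n):
    x = np.array(signs)
    v = x @ F @ x
    m2 += v ** 2
    m4 += v ** 4
m2 /= 2 ** n
m4 /= 2 ** n
print(m4 / m2 ** 2)               # kurtosis of f(xi); at most 15
```

As a side check, the exhaustive second moment equals $2\|F\|_2^2$, exactly as computed in the proof of Proposition 3.2 below.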

We are now in a position to prove Proposition 3.2, which follows from Proposition A.3 and Lemma A.2.

Proof of Proposition 3.2

Proof. Since F is diagonal-free and symmetric, it is easy to verify that $\operatorname{E}[\xi^{\mathrm{T}} F \xi] = 0$ and

$$\operatorname{Var}(\xi^{\mathrm{T}} F \xi) = \sum_{\sigma \in \Pi} a_{\sigma}^2 = 4 \sum_{\sigma \in \Pi} (a_{\sigma}/2)^2 = 2 \|F\|_2^2.$$

By Lemma A.2, we have

$$\operatorname{Prob}\left\{ \xi^{\mathrm{T}} F \xi \ge \frac{\sqrt{\operatorname{Var}(\xi^{\mathrm{T}} F \xi)}}{4\sqrt{15}} \right\} \ge \frac{2\sqrt{3}-3}{135},$$

and the desired inequality holds. □



Let us now come to the proof of the main theorem in the appendix.

Proof of Theorem A.1

Proof. Let $I := \{1, 2, \ldots, n\}$ be the index set, and let Π be the set containing all combinations of d distinct indices in I. Obviously $|\Pi| = \binom{n}{d}$. For any $\pi \in \Pi$, we denote $x^{\pi} := \prod_{i \in \pi} x_i$ and $x^{\pi_1 + \pi_2} := x^{\pi_1} x^{\pi_2}$ (e.g., $x^{\{1,2\}} = x_1 x_2$ and $x^{\{1,2\}+\{1,3\}} = x^{\{1,2\}} x^{\{1,3\}} = x_1 x_2 \cdot x_1 x_3 = x_1^2 x_2 x_3$). Since F is square-free and super-symmetric, y can be written as $\sum_{\pi \in \Pi} a_{\pi} x^{\pi}$, or simply $\sum_{\pi} a_{\pi} x^{\pi}$ (whenever we write a summation over π, it means the summation over all $\pi \in \Pi$). We thus have

$$\operatorname{E} y^2 = \operatorname{E}\left[ \sum_{\pi_1, \pi_2} a_{\pi_1} x^{\pi_1} a_{\pi_2} x^{\pi_2} \right] = \sum_{\pi_1, \pi_2} a_{\pi_1} a_{\pi_2} \operatorname{E} x^{\pi_1 + \pi_2} = \sum_{\pi_1 = \pi_2} a_{\pi_1} a_{\pi_2} \operatorname{E} x^{\pi_1 + \pi_2} = \sum_{\pi} a_{\pi}^2.$$

Our task is to bound

$$\operatorname{E} y^4 = \operatorname{E}\left[ \sum_{\pi_1, \pi_2, \pi_3, \pi_4} a_{\pi_1} x^{\pi_1} a_{\pi_2} x^{\pi_2} a_{\pi_3} x^{\pi_3} a_{\pi_4} x^{\pi_4} \right] = \sum_{\pi_1, \pi_2, \pi_3, \pi_4} a_{\pi_1} a_{\pi_2} a_{\pi_3} a_{\pi_4} \operatorname{E} x^{\pi_1 + \pi_2 + \pi_3 + \pi_4}. \quad (19)$$

For any combination quadruple $\{\pi_1, \pi_2, \pi_3, \pi_4\}$, there are in total 4d indices, with each index appearing at most 4 times. Suppose a of the indices appear 4 times, b of them appear 3 times, c of them appear twice, and g of them appear once. Clearly $4a + 3b + 2c + g = 4d$. In order to compute the summation of all the terms $a_{\pi_1} a_{\pi_2} a_{\pi_3} a_{\pi_4} \operatorname{E} x^{\pi_1 + \pi_2 + \pi_3 + \pi_4}$ over $\pi_1, \pi_2, \pi_3, \pi_4 \in \Pi$ in (19), we shall group them according to the different $\{a, b, c, g\}$.

1. $g \ge 1$: as $\operatorname{E} x_i = 0$ for all $i \in I$, all the terms in this group vanish.

2. $b = c = g = 0$: the summation of all the terms in this group is

$$\sum_{\pi_1 = \pi_2 = \pi_3 = \pi_4} a_{\pi_1} a_{\pi_2} a_{\pi_3} a_{\pi_4} \operatorname{E} x^{\pi_1 + \pi_2 + \pi_3 + \pi_4} = \sum_{\pi_1} a_{\pi_1}^4 \operatorname{E} x^{4\pi_1} \le \kappa^d \sum_{\pi} a_{\pi}^4.$$

3. $g = 0$ and $b + c \ge 1$: we shall classify all the terms in this group step by step. In the following, we assume $|\Pi| \ge 2$ and $n \ge d + 1$ to avoid triviality.

• It is clear that $4a + 3b + 2c = 4d$, $0 \le a \le d-1$, $0 \le b \le (4d-4a)/3$, and b must be even. In this group, the number of different triples $\{a, b, c\}$ is at most $\sum_{a=0}^{d-1} \left( 1 + \left\lfloor \frac{4d-4a}{6} \right\rfloor \right) \le d^2$.

• For any given triple $\{a, b, c\}$, there are in total $\binom{n}{a} \binom{n-a}{b} \binom{n-a-b}{c}$ distinctive ways to assign the indices. Clearly, we have $\binom{n}{a} \binom{n-a}{b} \binom{n-a-b}{c} \le \frac{n!}{(n-a-b-c)!} \le \frac{n!}{(n-2d)_+!}$.

• For any given a indices appearing 4 times, b indices appearing 3 times, and c indices appearing twice, we shall count how many distinctive ways they can form a particular combination quadruple $\{\pi_1, \pi_2, \pi_3, \pi_4\}$ (note that orders do count). The indices appearing 4 times have no choice but to be located in each of $\pi_1, \pi_2, \pi_3, \pi_4$ once; each index appearing 3 times has at most 4 choices; and each index appearing twice has at most 6 choices. Therefore, the total number of distinctive ways to form the combination quadruples is upper bounded by $4^b 6^c \le 6^{2d}$.

• For any given combination quadruple $\{\pi_1, \pi_2, \pi_3, \pi_4\}$, noticing that $(\operatorname{E} x_i^3)^2 \le \operatorname{E} x_i^2 \operatorname{E} x_i^4 \le \kappa$ for all $i \in I$, we have $|\operatorname{E} x^{\pi_1 + \pi_2 + \pi_3 + \pi_4}| \le \kappa^a \cdot (\sqrt{\kappa})^b \cdot 1^c = \kappa^{a+b/2} \le \kappa^d$.

• For any given combination quadruple $\{\pi_1, \pi_2, \pi_3, \pi_4\}$ in this group, each combination can appear at most twice. Specifically, if we assume $i \ne j$ (implying $\pi_i \ne \pi_j$), then the forms $\{\pi_1, \pi_1, \pi_1, \pi_2\}$ and $\{\pi_1, \pi_1, \pi_1, \pi_1\}$ do not appear. The only possible forms are $\{\pi_1, \pi_2, \pi_3, \pi_4\}$, $\{\pi_1, \pi_1, \pi_2, \pi_3\}$ and $\{\pi_1, \pi_1, \pi_2, \pi_2\}$. We notice that

$$a_{\pi_1} a_{\pi_2} a_{\pi_3} a_{\pi_4} \le \left( a_{\pi_1}^2 a_{\pi_2}^2 + a_{\pi_1}^2 a_{\pi_3}^2 + a_{\pi_1}^2 a_{\pi_4}^2 + a_{\pi_2}^2 a_{\pi_3}^2 + a_{\pi_2}^2 a_{\pi_4}^2 + a_{\pi_3}^2 a_{\pi_4}^2 \right)/6,$$
$$a_{\pi_1} a_{\pi_1} a_{\pi_2} a_{\pi_3} \le \left( a_{\pi_1}^2 a_{\pi_2}^2 + a_{\pi_1}^2 a_{\pi_3}^2 \right)/2,$$
$$a_{\pi_1} a_{\pi_1} a_{\pi_2} a_{\pi_2} = a_{\pi_1}^2 a_{\pi_2}^2.$$

Therefore, in any possible form, each $a_{\pi_1} a_{\pi_2} a_{\pi_3} a_{\pi_4}$ can on average be upper bounded by one item $a_{\pi_1}^2 a_{\pi_2}^2$ ($\pi_1 \ne \pi_2$) in $\sum_{\pi_1 \ne \pi_2} a_{\pi_1}^2 a_{\pi_2}^2$. Overall, in this group, the summation of all the terms is upper bounded by $d^2 \cdot \frac{n!}{(n-2d)_+!} \cdot 6^{2d} \cdot \kappa^d$ items of the form $a_{\pi_1}^2 a_{\pi_2}^2$ ($\pi_1 \ne \pi_2$). Notice that there are in total $|\Pi|(|\Pi|-1)/2 = \frac{1}{2}\binom{n}{d}\left(\binom{n}{d}-1\right)$ items in $\sum_{\pi_1 \ne \pi_2} a_{\pi_1}^2 a_{\pi_2}^2$, and each item is evenly distributed. By symmetry, the summation of all the terms in this group is upper bounded by

$$d^2 \cdot \frac{\frac{n!}{(n-2d)_+!} \cdot 6^{2d} \cdot \kappa^d}{\frac{1}{2}\binom{n}{d}\left(\binom{n}{d}-1\right)} \sum_{\pi_1 \ne \pi_2} a_{\pi_1}^2 a_{\pi_2}^2 \le d^2 (d!)^2\, 36^d \kappa^d \cdot 2 \sum_{\pi_1 \ne \pi_2} a_{\pi_1}^2 a_{\pi_2}^2.$$

Finally, we are able to bound $\operatorname{E} y^4$ by

$$\operatorname{E} y^4 \le \kappa^d \sum_{\pi} a_{\pi}^4 + d^2 (d!)^2\, 36^d \kappa^d \cdot 2 \sum_{\pi_1 \ne \pi_2} a_{\pi_1}^2 a_{\pi_2}^2 \le d^2 (d!)^2\, 36^d \kappa^d \left( \sum_{\pi} a_{\pi}^4 + 2 \sum_{\pi_1 \ne \pi_2} a_{\pi_1}^2 a_{\pi_2}^2 \right) = d^2 (d!)^2\, 36^d \kappa^d \left( \sum_{\pi} a_{\pi}^2 \right)^2 = d^2 (d!)^2\, 36^d \kappa^d \left( \operatorname{E} y^2 \right)^2.$$

Putting the pieces together, the theorem follows. □

Finally, combining Theorem A.1 and Lemma A.2, and noticing that $\operatorname{Var}(f(\xi)) = d! \|F\|_2^2$ in Theorem 3.1, leads us to the probability bound (13) in Theorem 3.1, which concludes the whole proof.

