Introduction to Probability for Electrical Engineering

Introduction to Probability for Electrical Engineering
Prapun Suksompong, Ph.D.
School of Information, Computer and Communication Technology
Sirindhorn International Institute of Technology
Thammasat University, Pathum Thani 12000, Thailand
[email protected]
June 22, 2010

Contents

1 Motivation
2 Mathematical Background
  2.1 Set Theory
  2.2 Enumeration / Combinatorics / Counting
  2.3 Dirac Delta Function
3 Classical Probability
  3.1 Independence and Conditional Probability
  3.2 Examples
4 Probability Foundations
  4.1 Algebra and σ-algebra
  4.2 Kolmogorov's Axioms for Probability
  4.3 Properties of Probability Measure
  4.4 Countable Ω
  4.5 Independence
5 Random Element
  5.1 Random Variable
  5.2 Distribution Function
  5.3 Discrete random variable
  5.4 Continuous random variable
  5.5 Mixed/hybrid Distribution
  5.6 Independence
  5.7 Misc
6 PMF Examples
  6.1 Random/Uniform
  6.2 Bernoulli and Binary distributions
  6.3 Binomial: B(n, p)
  6.4 Geometric: G(β)
  6.5 Poisson Distribution: P(λ)
  6.6 Compound Poisson
  6.7 Hypergeometric
  6.8 Negative Binomial Distribution (Pascal / Pólya distribution)
  6.9 Beta-binomial distribution
  6.10 Zipf or zeta random variable
7 PDF Examples
  7.1 Uniform Distribution
  7.2 Gaussian Distribution
  7.3 Exponential Distribution
  7.4 Pareto: Par(α), heavy-tailed model/density
  7.5 Laplacian: L(α)
  7.6 Rayleigh
  7.7 Cauchy
  7.8 More PDFs
8 Expectation
9 Inequalities
10 Random Vectors
  10.1 Random Sequence
11 Transform Methods
  11.1 Probability Generating Function
  11.2 Moment Generating Function
  11.3 One-Sided Laplace Transform
  11.4 Characteristic Function
12 Functions of random variables
  12.1 SISO case
  12.2 MISO case
  12.3 MIMO case
  12.4 Order Statistics
13 Convergences
  13.1 Summation of random variables
  13.2 Summation of independent random variables
  13.3 Summation of i.i.d. random variables
  13.4 Central Limit Theorem (CLT)
14 Conditional Probability and Expectation
  14.1 Conditional Probability
  14.2 Conditional Expectation
  14.3 Conditional Independence
15 Real-valued Jointly Gaussian
16 Bayesian Detection and Estimation
17 Monte Carlo Method
A Math Review
  A.1 Inequalities
  A.2 Summations
  A.3 Calculus
    A.3.1 Derivatives
    A.3.2 Integration
  A.4 Gamma and Beta functions

1 Motivation

1.1. Random phenomena arise because of [16]:
(a) our partial ignorance of the generating mechanism
(b) the laws governing the phenomena may be fundamentally random (as in quantum mechanics)
(c) our unwillingness to carry out exact analysis because it is not worth the trouble

1.2. Communication Theory [27]: The essence of communication is randomness.
(a) Random Source: The transmitter is connected to a random source, the output of which the receiver cannot predict with certainty.
  • If a listener knew in advance exactly what a speaker would say, and with what intonation he would say it, there would be no need to listen!
(b) Noise: There is no communication problem unless the transmitted signal is disturbed during propagation or reception in a random way.

1.3. Random numbers are used directly in the transmission and security of data over the airwaves or along the Internet.
(a) A radio transmitter and receiver could switch transmission frequencies from moment to moment, seemingly at random, but nevertheless in synchrony with each other.
(b) The Internet data could be credit-card information for a consumer purchase, or a stock or banking transaction secured by the clever application of random numbers.

1.4. Randomness is an essential ingredient in games of all sorts, computer or otherwise, to make for unexpected action and keen interest.

Definition 1.5. Important terms [16]:
(a) An experiment is called a random experiment if its outcome cannot be predicted precisely because the conditions under which it is performed cannot be predetermined with sufficient accuracy and completeness.
  • Tossing a coin, rolling a die, and drawing a card from a deck are some examples of random experiments.
(b) A random experiment may have several separately identifiable outcomes. We define the sample space Ω as the collection of all possible separately identifiable outcomes of a random experiment. Each outcome is an element, or sample point, of this space.
  • Rolling a die has six possible identifiable outcomes (1, 2, 3, 4, 5, and 6).
(c) Events are sets (or classes) of outcomes meeting some specifications.

1.6. Although the outcome of a random experiment is unpredictable, there is a statistical regularity about the outcomes.
• For example, if a coin is tossed a large number of times, about half the time the outcome will be "heads," and the remaining half of the time it will be "tails."
Let A be one of the events of a random experiment. If we conduct a sequence of N independent trials of this experiment, and if the event A occurs in N(A) out of these N trials, then the fraction
  f(A) = lim_{N→∞} N(A)/N
is called the relative frequency of the event A.
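To see this statistical regularity numerically, one can simulate a long sequence of fair-coin tosses and watch the relative frequency of heads settle near 1/2. A minimal MATLAB sketch (the sample size N and the use of rand are illustrative choices, not part of the text):

  N = 1e5;                            % number of independent tosses
  tosses = rand(1,N) < 0.5;           % 1 = heads, 0 = tails (fair coin)
  relFreq = cumsum(tosses)./(1:N);    % N(A)/N after each toss, A = {heads}
  plot(1:N, relFreq), ylim([0 1])
  xlabel('N'), ylabel('relative frequency of heads')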

2 Mathematical Background

2.1 Set Theory

2.1. Basic Set Identities:
• Double complement (involution): (A^c)^c = A
• Commutativity (symmetry): A ∪ B = B ∪ A, A ∩ B = B ∩ A
• Associativity:
  ◦ A ∩ (B ∩ C) = (A ∩ B) ∩ C
  ◦ A ∪ (B ∪ C) = (A ∪ B) ∪ C
• Distributivity:
  ◦ A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)
  ◦ A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
• de Morgan laws:
  ◦ (A ∪ B)^c = A^c ∩ B^c
  ◦ (A ∩ B)^c = A^c ∪ B^c

2.2. Basic Terminology:
• A ∩ B is sometimes written simply as AB.
• The set difference operation is defined by B \ A = B ∩ A^c.

Figure 1: Set Identities [21]

Commutative laws:       A ∩ B = B ∩ A;  A ∪ B = B ∪ A
Associative laws:       A ∩ (B ∩ C) = (A ∩ B) ∩ C;  A ∪ (B ∪ C) = (A ∪ B) ∪ C
Distributive laws:      A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C);  A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)
De Morgan's laws:       (A ∩ B)^c = A^c ∪ B^c;  (A ∪ B)^c = A^c ∩ B^c
Complement laws:        A ∩ A^c = ∅;  A ∪ A^c = Ω
Double complement law:  (A^c)^c = A
Idempotent laws:        A ∩ A = A;  A ∪ A = A
Absorption laws:        A ∩ (A ∪ B) = A;  A ∪ (A ∩ B) = A
Dominance laws:         A ∩ ∅ = ∅;  A ∪ Ω = Ω
Identity laws:          A ∪ ∅ = A;  A ∩ Ω = A

• Sets A and B are said to be disjoint (A ⊥ B) if and only if A ∩ B = ∅.
• A collection of sets (Ai : i ∈ I) is said to be pairwise disjoint or mutually exclusive [11, p. 9] if and only if Ai ∩ Aj = ∅ when i ≠ j.
• A collection Π = (Aα : α ∈ I) of subsets of Ω (in this case, indexed or labeled by α taking values in an index or label set I) is said to be a partition of Ω if
  (a) Ω = ∪_{α∈I} Aα, and
  (b) for all i ≠ j, Ai ⊥ Aj (pairwise disjoint).
In which case, the collection (B ∩ Aα : α ∈ I) is a partition of B. In other words, any set B can be expressed as B = ∪_α (B ∩ Aα) where the union is a disjoint union.


• The cardinality (or size) of a collection or set A, denoted |A|, is the number of elements of the collection. This number may be finite or infinite.
  ◦ Inclusion-Exclusion Principle:
    |∪_{i=1}^{n} Ai| = Σ_{∅≠I⊂{1,...,n}} (−1)^{|I|+1} |∩_{i∈I} Ai|.
  ◦ |∩_{i=1}^{n} Ai^c| = |Ω| + Σ_{∅≠I⊂{1,...,n}} (−1)^{|I|} |∩_{i∈I} Ai|.
  ◦ If ∀i, Ai ⊂ B (or, equivalently, ∪_{i=1}^{n} Ai ⊂ B), then
    |∩_{i=1}^{n} (B \ Ai)| = |B| + Σ_{∅≠I⊂{1,...,n}} (−1)^{|I|} |∩_{i∈I} Ai|.

• An infinite set A is said to be countable if the elements of A can be enumerated or listed in a sequence: a1 , a2 , . . . . Empty set and finite sets are also said to be countable. • By a countably infinite set, we mean a countable set that is not finite. Examples of such sets include ◦ The set N = {1, 2, , 3, . . . } of natural numbers. ◦ The set {2k : k ∈ N} of all even numbers.

◦ The set {2k + 1 : k ∈ N} of all odd numbers. ◦ The set Z of integers

◦ The set Q of all rational numbers

◦ The set Q+ of positive rational numbers

◦ The set of all finite-length sequences of natural numbers. ◦ The set of all finite subsets of the natural numbers.

• A singleton is a set with exactly one element.
• R = (−∞, ∞).
• For a set of sets, to avoid the repeated use of the word "set", we will call it a collection/class/family of sets.

Definition 2.3. Monotone sequence of sets
• The sequence of events (A1, A2, A3, . . .) is a monotone-increasing sequence of events if and only if A1 ⊂ A2 ⊂ A3 ⊂ . . .. In which case,
  ◦ ∪_{i=1}^{n} Ai = An
  ◦ lim_{n→∞} An = ∪_{i=1}^{∞} Ai.
  Put A = lim_{n→∞} An. We then write An ↗ A; that is, An ↗ A if and only if ∀n, An ⊂ An+1 and ∪_{n=1}^{∞} An = A.
• The sequence of events (B1, B2, B3, . . .) is a monotone-decreasing sequence of events if and only if B1 ⊃ B2 ⊃ B3 ⊃ . . .. In which case,
  ◦ ∩_{i=1}^{n} Bi = Bn
  ◦ lim_{n→∞} Bn = ∩_{i=1}^{∞} Bi.
  Put B = lim_{n→∞} Bn. We then write Bn ↘ B; that is, Bn ↘ B if and only if ∀n, Bn+1 ⊂ Bn and ∩_{n=1}^{∞} Bn = B.
Note that An ↘ A ⇔ An^c ↗ A^c.

2.4. An (event-)indicator function IA : Ω → {0, 1} is defined by IA(ω) = 1 if ω ∈ A and IA(ω) = 0 otherwise.
• Alternative notation: 1A.
• A = {ω : IA(ω) = 1}
• A = B if and only if IA = IB
• I_{A^c}(ω) = 1 − IA(ω)
• A ⊂ B ⇔ {∀ω, IA(ω) ≤ IB(ω)} ⇔ {∀ω, IA(ω) = 1 ⇒ IB(ω) = 1}
• I_{A∩B}(ω) = min(IA(ω), IB(ω)) = IA(ω) · IB(ω)
• I_{A∪B}(ω) = max(IA(ω), IB(ω)) = IA(ω) + IB(ω) − IA(ω) · IB(ω)

2.5. Suppose ∪_{i=1}^{N} Ai = ∪_{i=1}^{N} Bi for all finite N ∈ N. Then, ∪_{i=1}^{∞} Ai = ∪_{i=1}^{∞} Bi.

Proof. To show "⊂", suppose x ∈ ∪_{i=1}^{∞} Ai. Then, ∃N0 such that x ∈ A_{N0}. Now, A_{N0} ⊂ ∪_{i=1}^{N0} Ai = ∪_{i=1}^{N0} Bi ⊂ ∪_{i=1}^{∞} Bi. Therefore, x ∈ ∪_{i=1}^{∞} Bi. To show "⊃", use symmetry.

2.6. Let A1, A2, . . . be a sequence of disjoint sets. Define Bn = ∪_{i>n} Ai = ∪_{i≥n+1} Ai. Then, (1) Bn+1 ⊂ Bn and (2) ∩_{n=1}^{∞} Bn = ∅.

Proof. Bn = ∪_{i≥n+1} Ai = (∪_{i≥n+2} Ai) ∪ A_{n+1} = B_{n+1} ∪ A_{n+1}, so (1) is true. For (2), consider two cases. (2.1) For an element x ∉ ∪_i Ai, we know that x ∉ Bn for every n and hence x ∉ ∩_{n=1}^{∞} Bn. (2.2) For x ∈ ∪_i Ai, we know that ∃i0 such that x ∈ A_{i0}. Note that x cannot be in any other Ai because the Ai's are disjoint. So, x ∉ B_{i0} and therefore x ∉ ∩_{n=1}^{∞} Bn.


2.7. Any countable union can be written as a union of pairwise disjoint sets: Given any sequence of sets Fn, define a new sequence by A1 = F1 and, for n ≥ 2,
  An = Fn ∩ F_{n−1}^c ∩ · · · ∩ F1^c = Fn ∩ (∩_{i∈[n−1]} Fi^c) = Fn \ (∪_{i∈[n−1]} Fi).
Then, ∪_{i=1}^{∞} Fi = ∪_{i=1}^{∞} Ai, where the union on the RHS is a disjoint union.

Proof. Note that the An are pairwise disjoint. To see this, consider A_{n1}, A_{n2} where n1 ≠ n2. WLOG, assume n1 < n2. This implies that F_{n1}^c gets intersected in the definition of A_{n2}. So,
  A_{n1} ∩ A_{n2} = (F_{n1} ∩ ∩_{i=1}^{n1−1} Fi^c) ∩ (F_{n2} ∩ F_{n1}^c ∩ ∩_{i=1, i≠n1}^{n2−1} Fi^c) = ∅,
because it contains F_{n1} ∩ F_{n1}^c = ∅. Also, for finite N ≥ 1, we have ∪_{n∈[N]} Fn = ∪_{n∈[N]} An (*). To see this, note that (*) is true for N = 1. Suppose (*) is true for N = m and let B = ∪_{n∈[m]} An = ∪_{n∈[m]} Fn. Now, for N = m + 1, by definition, we have
  ∪_{n∈[m+1]} An = (∪_{n∈[m]} An) ∪ A_{m+1} = B ∪ (F_{m+1} ∩ B^c) = (B ∪ F_{m+1}) ∩ Ω = B ∪ F_{m+1} = ∪_{i∈[m+1]} Fi.
So, (*) is true for N = m + 1. By induction, (*) is true for all finite N. The extension to ∞ is done via (2.5).

For a finite union, we can modify the above statement by setting Fn = ∅ for n ≥ N. Then, An = ∅ for n ≥ N. By construction, {An : n ∈ N} ⊂ σ({Fn : n ∈ N}). However, in general, it is not true that {Fn : n ∈ N} ⊂ σ({An : n ∈ N}). For example, for a finite union with N = 2, we cannot get F2 back from set operations on A1, A2 because we lost the information about F1 ∩ F2. To create a disjoint union which preserves the information about the overlapping parts of the Fn's, we can define the A's by ∩_{n∈N} Bn where Bn is Fn or Fn^c. This is done in (2.8). However, this leads to uncountably many Aα, which is why we use the index α there instead of n. The uncountability problem does not occur if we start with a finite union. This is shown in the next result.

2.8. Decomposition:
• Fix sets A1, A2, . . . , An, not necessarily disjoint. Let Π be the collection of all sets of the form B = B1 ∩ B2 ∩ · · · ∩ Bn where each Bi is either Ai or its complement. There are 2^n of these, say B^(1), B^(2), . . . , B^(2^n). Then,

(a) Π is a partition of Ω, and
(b) Π \ {∩_{j∈[n]} Aj^c} is a partition of ∪_{j∈[n]} Aj.
Moreover, any Aj can be expressed as Aj = ∪_{i∈Sj} B^(i) for some Sj ⊂ [2^n]. More specifically, Aj is the union of all B^(i) which are constructed with Bj = Aj.
• Fix sets A1, A2, . . ., not necessarily disjoint. Let Π be the collection of all sets of the form B = ∩_{n∈N} Bn where each Bn is either An or its complement. There are uncountably many of these, hence we will index the B's by α; that is, we write B^(α). Let I be the set of all α after we eliminate all repeated B^(α); note that I can still be uncountable. Then,
(a) Π = {B^(α) : α ∈ I} is a partition of Ω, and
(b) Π \ {∩_{n∈N} An^c} is a partition of ∪_{n∈N} An.
Moreover, any Aj can be expressed as Aj = ∪_{α∈Sj} B^(α) for some Sj ⊂ I. Because I is uncountable, in general Sj can be uncountable. More specifically, Aj is the (possibly uncountable) union of all B^(α) which are constructed with Bj = Aj. The uncountability of Sj can be problematic because it implies that we need an uncountable union to get Aj back.

2.9. Let {Aα : α ∈ I} be a collection of disjoint sets where I is a nonempty index set. For any set S ⊂ I, define a mapping g on 2^I by g(S) = ∪_{α∈S} Aα. Then, g is a 1:1 function if and only if none of the Aα's is empty.

2.10. Let
  A = {(x, y) ∈ R² : (x + a1, x + b1) ∩ (y + a2, y + b2) ≠ ∅},
where ai < bi. Then,
  A = {(x, y) ∈ R² : x + (a1 − b2) < y < x + (b1 − a2)}
    = {(x, y) ∈ R² : a1 − b2 < y − x < b1 − a2}.

2.2 Enumeration / Combinatorics / Counting

There are many probability problems, especially those concerned with gambling, that can ultimately be reduced to questions about cardinalities of various sets. Combinatorics is the study of systematic counting methods, which we will be using to find the cardinalities of various sets that arise in probability. 2.11. See Chapter 2 of the handbook [21] for more details. 2.12. Rules Of Sum, Product, And Quotient:



Figure 2: The region A = {(x, y) ∈ R² : (x + a1, x + b1) ∩ (y + a2, y + b2) ≠ ∅}.

(a) Rule of sum: When there are m cases such that the ith case has ni options, for i = 1, ..., m, and no two of the cases have any options in common, the total number of options is n1 + n2 + · · · + nm.
  (i) In set-theoretic terms, if sets S1, . . . , Sm are finite and pairwise disjoint, then |S1 ∪ S2 ∪ · · · ∪ Sm| = |S1| + |S2| + · · · + |Sm|.
(b) Rule of product:
  • If one task has n possible outcomes and a second task has m possible outcomes, then the joint occurrence of the two tasks has n × m possible outcomes.

• When a procedure can be broken down into m steps, such that there are n1 options for step 1, and such that after the completion of step i − 1 (i = 2, . . . , m) there are ni options for step i, the number of ways of performing the procedure is n1 n2 · · · nm .

• In set-theoretic terms, if sets S1 , . . . , Sm are finite, then |S1 × S2 × · · · × Sm | = |S1 | · |S2 | · · · · · |Sm |. • For k finite sets A1 , ..., Ak , there are |A1 | · · · |Ak | k-tuples of the form (a1 , . . . , ak ) where each ai ∈ Ai .

(c) Rule of quotient: When a set S is partitioned into equal-sized subsets of m elements each, there are |S|/m subsets.

2.13. The four kinds of counting problems are:
(a) ordered sampling of r out of n items with replacement: n^r;
(b) ordered sampling of r ≤ n out of n items without replacement: (n)r;
(c) unordered sampling of r ≤ n out of n items without replacement: C(n, r);
(d) unordered sampling of r out of n items with replacement: C(n + r − 1, r).

• See 2.22 for bars and stars argument.
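The four formulas are easy to evaluate in MATLAB; a small check (the values n = 5 and r = 3 are arbitrary):

  n = 5; r = 3;
  orderedWith    = n^r;                  % ordered, with replacement
  orderedWithout = prod(n:-1:n-r+1);     % (n)_r = n(n-1)...(n-r+1)
  unordWithout   = nchoosek(n,r);        % n choose r
  unordWith      = nchoosek(n+r-1,r);    % bars and stars
  [orderedWith orderedWithout unordWithout unordWith]   % 125 60 10 35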

2.14. Given a set of n distinct items, select a distinct ordered sequence (word) of length r drawn from this set. • Sampling with replacement: µn,r = nr ◦ Meaning

∗ Ordered sampling of r out of n items with replacement.

◦ µn,1 = n ◦ µ1,r = 1

◦ Examples:

∗ There are n^r different sequences of r cards drawn from a deck of n cards with replacement; i.e., we draw each card, make a note of it, put the card back in the deck, and re-shuffle the deck before choosing the next card.
∗ Suppose A is a finite set; then the cardinality of its power set is |2^A| = 2^{|A|}.
∗ There are 2^r binary strings/sequences of length r.

• Sampling without replacement:
  (n)r = ∏_{i=0}^{r−1} (n − i) = n!/(n − r)! = n(n − 1) · · · (n − (r − 1))   (r terms),  r ≤ n.
  ◦ Ordered sampling of r ≤ n out of n items without replacement.
  ◦ "The number of possible r-permutations of n distinguishable objects."
  ◦ For integers r, n such that r > n, we have (n)r = 0.
  ◦ Extended definition: The definition in product form, (n)r = ∏_{i=0}^{r−1} (n − i) = n(n − 1) · · · (n − (r − 1)) (r terms), can be extended to any real number n and a non-negative integer r. We define (n)0 = 1. (This makes sense because we usually take the empty product to be 1.)
  ◦ (n)1 = n
  ◦ (n)r = (n − (r − 1))(n)_{r−1}. For example, (7)5 = (7 − 4)(7)4.
  ◦ (1)r = 1 if r = 1, and 0 if r > 1.

• Ratio:
  (n)r / n^r = ∏_{i=0}^{r−1} (n − i) / ∏_{i=0}^{r−1} n = ∏_{i=0}^{r−1} (1 − i/n) ≈ ∏_{i=0}^{r−1} e^{−i/n} = e^{−(1/n) Σ_{i=0}^{r−1} i} = e^{−r(r−1)/(2n)} ≈ e^{−r²/(2n)}.

2.15. Factorial and Permutation: The number of arrangements (permutations) of n ≥ 0 distinct items is (n)n = n!.
• For any integer n greater than 1, the symbol n!, pronounced "n factorial," is defined as the product of all positive integers less than or equal to n.
• 0! = 1! = 1
• n! = n(n − 1)!
• n! = ∫_0^∞ e^{−t} t^n dt
• Meaning: the number of ways that n distinct objects can be ordered.
• Computation:
  (a) MATLAB:
    ◦ Use factorial(n). Since double precision numbers only have about 15 digits, the answer is only exact for n ≤ 21. For larger n, the answer has the right magnitude and is accurate for the first 15 digits.
    ◦ Use perms(v), where v is a row vector of length n, to create a matrix whose rows consist of all possible permutations of the n elements of v. (So the matrix will contain n! rows and n columns.)
  (b) Google's web search box built-in calculator: n!
• Approximation: Stirling's Formula [8, p. 52]:
  n! ≈ √(2πn) n^n e^{−n} = √(2πe) e^{(n + 1/2) ln(n/e)}.   (1)
  ◦ The sign ≈ can be replaced by ∼ to emphasize that the ratio of the two sides tends to unity as n → ∞.
  ◦ ln n! = n ln n − n + o(n)
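A minimal MATLAB comparison of n! with Stirling's formula (1); the particular values of n are arbitrary:

  n = [5 10 20 50];
  exact    = factorial(n);
  stirling = sqrt(2*pi*n).*n.^n.*exp(-n);
  exact./stirling        % ratios tend to 1 as n grows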

2.16. Binomial coefficient:
  C(n, r) = (n)r / r! = n! / ((n − r)! r!)
(a) Read "n choose r".
(b) Meaning:
  (i) The number of subsets of size r that can be formed from a set of n elements (without regard to the order of selection).
  (ii) The number of combinations of n objects selected r at a time.
  (iii) The number of (unordered) sets of size r drawn from an alphabet of size n without replacement.
  (iv) Unordered sampling of r ≤ n out of n items without replacement.
(c) Computation:
  (i) MATLAB:
    • nchoosek(n,r), where n and r are nonnegative integers, returns C(n, r).
    • nchoosek(v,r), where v is a row vector of length n, creates a matrix whose rows consist of all possible combinations of the n elements of v taken r at a time. The matrix will contain C(n, r) rows and r columns.
  (ii) Use combin(n,r) in Mathcad. However, to do symbolic manipulation, use the factorial definition directly.
  (iii) In Maple, the binomial coefficient can be used directly.

  (iv) Google's web search box built-in calculator: n choose k
(d) Reflection property: C(n, r) = C(n, n − r).
(e) C(n, n) = C(n, 0) = 1.
(f) C(n, 1) = C(n, n − 1) = n.
(g) C(n, r) = 0 if n < r or r is a negative integer.
(h) max_r C(n, r) = C(n, ⌊(n + 1)/2⌋).
(i) Pascal's "triangle" rule: C(n, k) = C(n − 1, k) + C(n − 1, k − 1). This property divides the process of choosing k items into two steps, where the first step is to decide whether to choose the first item or not.
(j) Σ_{0≤k≤n, k even} C(n, k) = Σ_{0≤k≤n, k odd} C(n, k) = 2^{n−1}.
There are many ways to show this identity.

Figure 3: Pascal's Triangle.

  (i) Consider the number of subsets of S = {a1, a2, . . . , an} of n distinct elements. First choose a subset A of the first n − 1 elements of S. There are 2^{n−1} distinct A. Then, for each A, to get a set with an even number of elements, add the element an to A if and only if |A| is odd.
  (ii) Look at the binomial expansion of (x + y)^n with x = 1 and y = −1.
  (iii) For odd n, use the fact that C(n, r) = C(n, n − r).
(k) Σ_{k=0}^{min(n1,n2)} C(n1, k) C(n2, k) = C(n1 + n2, n1) = C(n1 + n2, n2).
  • This property divides the process of choosing k items from n1 + n2 items into two steps, where the first step is to choose from the first n1 items.
  • We can replace the min(n1, n2) in the first sum by n1 or n2 if we define C(n, k) = 0 for k > n.
  • Σ_{r=0}^{n} C(n, r)² = C(2n, n).

(l) Parallel summation:
  Σ_{m=k}^{n} C(m, k) = C(k, k) + C(k+1, k) + · · · + C(n, k) = C(n+1, k+1).
To see this, suppose we try to choose k + 1 items from the n + 1 items a1, a2, . . . , a_{n+1}. First, we choose whether to take a1. If so, then we need to choose the remaining k items from the n items a2, . . . , a_{n+1}; hence we have the C(n, k) term. Now, suppose we did not take a1. Then, we still need to choose k + 1 items from the n items a2, . . . , a_{n+1}. We then repeat the same argument on a2 instead of a1. Equivalently,
  Σ_{k=0}^{r} C(n+k, k) = Σ_{k=0}^{r} C(n+k, n) = C(n+r+1, n+1) = C(n+r+1, r).   (2)
To prove the middle equality in (2), use induction on r.
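Both Pascal's rule and the parallel-summation identity are easy to spot-check with nchoosek (the values n = 10 and k = 4 are arbitrary):

  n = 10; k = 4;
  [nchoosek(n,k), nchoosek(n-1,k)+nchoosek(n-1,k-1)]      % Pascal's rule: 210 210
  lhs = sum(arrayfun(@(m) nchoosek(m,k), k:n));           % sum_{m=k}^{n} C(m,k)
  [lhs, nchoosek(n+1,k+1)]                                % parallel summation: 462 462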

2.17. Binomial theorem:
  (x + y)^n = Σ_{r=0}^{n} C(n, r) x^r y^{n−r}.
(a) Let x = y = 1; then Σ_{r=0}^{n} C(n, r) = 2^n.
(b) Sum involving only the even terms (or only the odd terms):
  Σ_{r even} C(n, r) x^r y^{n−r} = (1/2)((x + y)^n + (y − x)^n), and
  Σ_{r odd} C(n, r) x^r y^{n−r} = (1/2)((x + y)^n − (y − x)^n).
In particular, if x + y = 1, then
  Σ_{r even} C(n, r) x^r y^{n−r} = (1/2)(1 + (1 − 2x)^n), and   (3a)
  Σ_{r odd} C(n, r) x^r y^{n−r} = (1/2)(1 − (1 − 2x)^n).   (3b)
(c) Approximation by the entropy function: Recall that H_b(p) = −p log_b(p) − (1 − p) log_b(1 − p). If we set b = 2 (binary), then H_2(p) = −p log_2(p) − (1 − p) log_2(1 − p). In which case,
  (1/(n + 1)) 2^{n H_2(r/n)} ≤ C(n, r) ≤ 2^{n H_2(r/n)}.
Hence, C(n, r) ≈ 2^{n H_2(r/n)}.
(d) By repeated differentiation with respect to x followed by multiplication by x, we have
  • Σ_{r=0}^{n} r C(n, r) x^r y^{n−r} = nx(x + y)^{n−1}, and
  • Σ_{r=0}^{n} r² C(n, r) x^r y^{n−r} = nx(x(n − 1)(x + y)^{n−2} + (x + y)^{n−1}).
For x + y = 1, we have
  • Σ_{r=0}^{n} r C(n, r) x^r (1 − x)^{n−r} = nx, and
  • Σ_{r=0}^{n} r² C(n, r) x^r (1 − x)^{n−r} = nx(nx + 1 − x).

Figure 4: Binomial coefficient identities [21]. The table lists the factorial expansion, symmetry, monotonicity, Pascal's identity, the binomial theorem, the counts of all subsets and of even/odd subsets, the sum of squares, the square of row sums, absorption/extraction, trinomial revision, parallel summation, diagonal summation, Vandermonde convolution, diagonal sums in Pascal's triangle (Fibonacci numbers), and several other common identities.

• Sum of squares: Choose a committee of size n from a group of n men and n women. The left side, rewritten as Σ_k C(n, k) C(n, n − k), describes the process of selecting committees according to the number of men, k, and the number of women, n − k, on the committee. The right side gives the total number of committees possible.
• Absorption/extraction: From a group of n people, choose a committee of size k and a person on the committee to be its chairperson. Equivalently, first select a chairperson from the entire group, and then select the remaining k − 1 committee members from the remaining n − 1 people.



All identities above can be verified easily via Mathcad.

2.18. Central binomial coefficient:
(a) The nth central binomial coefficient is C(2n, n) = (2n)!/(n!)².
  (i) They are called central since they show up exactly in the middle of the even-numbered rows in Pascal's triangle.
  (ii) The first few central binomial coefficients, starting at n = 0, are 1, 2, 6, 20, 70, 252, 924, 3432, 12870, 48620, . . .
(b) Approximation: From Stirling's formula (1), as n → ∞, C(2n, n) ≈ 2^{2n}/√(πn).
(c) Bounds: 4^n/√(4n) ≤ C(2n, n) ≤ 4^n/√(3n + 1).
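The approximation and the bounds can be checked numerically; a short MATLAB sketch (the range of n is arbitrary):

  n = 1:10;
  c     = arrayfun(@(m) nchoosek(2*m,m), n);   % central binomial coefficients
  appro = 4.^n./sqrt(pi*n);                    % Stirling-based approximation
  lower = 4.^n./sqrt(4*n);  upper = 4.^n./sqrt(3*n+1);
  all(lower <= c & c <= upper)                 % returns 1 (true)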

2.19. Multinomial Counting: The multinomial coefficient C(n; n1, n2, . . . , nr) is defined as
  C(n; n1, n2, . . . , nr) = ∏_{i=1}^{r} C(n − Σ_{k=0}^{i−1} nk, ni)   (4)
    = C(n, n1) C(n − n1, n2) C(n − n1 − n2, n3) · · · C(nr, nr) = n! / ∏_{i=1}^{r} ni!,
where n0 := 0. It is the number of ways that we can arrange n = Σ_{i=1}^{r} ni tokens when having r types of symbols and ni indistinguishable copies/tokens of a type i symbol.

2.20. Multinomial Theorem:
  (x1 + · · · + xr)^n = Σ (n! / (i1! i2! · · · ir!)) x1^{i1} x2^{i2} · · · xr^{ir},
where the sum is over all r-tuples of nonnegative integers (i1, . . . , ir) with i1 + · · · + ir = n.

◦ When n = 3, we have (7n − 1)/2 = 10.


Figure 9: pmf of Sn for n = 3 and n = 4.

3.9. Chevalier de Mere's Scandal of Arithmetic: Which is more likely, obtaining at least one six in 4 tosses of a fair die (event A), or obtaining at least one double six in 24 tosses of a pair of dice (event B)? We have
  P(A) = 1 − (5/6)^4 ≈ .518
and
  P(B) = 1 − (35/36)^24 ≈ .491.
Therefore, the first case is more probable.

Remark: Probability theory was originally inspired by gambling problems. In 1654, Chevalier de Mere invented a gambling system which bet even money on the second case above. However, when he began losing money, he asked his mathematician friend Blaise Pascal to analyze his gambling system. Pascal discovered that the Chevalier's system would lose about 51 percent of the time. Pascal became so interested in probability that, together with another famous mathematician, Pierre de Fermat, he laid the foundation of probability theory.
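Both probabilities are easy to confirm by simulation. A minimal MATLAB sketch (the number of simulated games is an arbitrary choice):

  Ngames = 1e5;
  A = any(randi(6,Ngames,4) == 6, 2);                            % at least one six in 4 tosses
  B = any(randi(6,Ngames,24) == 6 & randi(6,Ngames,24) == 6, 2); % at least one double six in 24 tosses
  [mean(A) mean(B)]                                              % approximately 0.518 and 0.491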

3.10. A random sample of size r with replacement is taken from a population of n elements. The probability of the event that in the sample no element appears twice (that is, no repetition in our sample) is
  (n)r / n^r.
The probability that at least one element appears twice is
  pu(n, r) = 1 − ∏_{i=1}^{r−1} (1 − i/n) ≈ 1 − e^{−r(r−1)/(2n)}.
In fact, when r − 1 < n/2, (A.3) gives
  e^{−(1/2)(r(r−1)/n)((3n+2r−1)/(3n))} ≤ ∏_{i=1}^{r−1} (1 − i/n) ≤ e^{−(1/2) r(r−1)/n}.
• From the approximation, to have pu(n, r) = p, we need
  r ≈ 1/2 + (1/2)√(1 − 8n ln(1 − p)).
• Probability of coincident birthdays: the probability that at least two people have the same birthday (ignoring February 29, which only comes in leap years) in a group of r persons is
  1 − (365/365)(364/365) · · · ((365 − (r − 1))/365)   (r factors), if 0 ≤ r ≤ 365,
and 1 if r ≥ 366.
  ◦ Birthday Paradox: In a group of 23 randomly selected people, the probability that at least two will share a birthday (assuming birthdays are equally likely to occur on any given day of the year) is about 0.5.
    ∗ At first glance it is surprising that the probability of 2 people having the same birthday is so large (in other words, that the group size needed is so small), since there are only 23 people compared with 365 days on the calendar. Some of the surprise disappears if you realize that there are C(23, 2) = 253 pairs of people who are going to compare their birthdays. [6, p. 9]
  ◦ Remark: The group size must be at least 253 people if you want a probability > 0.5 that someone will have the same birthday as you. [6, Ex. 1.13] (The probability is given by 1 − (364/365)^r.)
    ∗ A naive (but incorrect) guess is that ⌈365/2⌉ = 183 people will be enough. The "problem" is that many people in the group will have the same birthday, so the number of different birthdays is smaller than the size of the group.

If a random sample of size r is taken with replacement from a population of n elements (so that X1, . . . , Xr are drawn independently and uniformly), the probability that X1, . . . , Xr are not all distinct, i.e., that at least one element appears twice, is
  pu(n, r) = 1 − ∏_{i=1}^{r−1} (1 − i/n) ≈ 1 − e^{−r(r−1)/(2n)}.
(a) Special case: the birthday paradox above.
(b) The approximation comes from 1 + x ≈ e^x.

[Figure 10: pu(n, r), the probability that at least one element appears twice in a random sample of size r taken with replacement from a population of n elements. Left: pu(365, r) versus r, together with the approximation 1 − e^{−r(r−1)/(2n)} (r = 23 marked). Right: the required r/n to reach pu(n, r) = p versus n, for p = 0.1, 0.3, 0.5, 0.7, 0.9.]
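The exact birthday probability and its exponential approximation take one line each in MATLAB (here r = 23 and n = 365, as in the birthday paradox):

  n = 365; r = 23;
  exact  = 1 - prod(1 - (1:r-1)/n);        % 1 - prod_{i=1}^{r-1} (1 - i/n)
  approx = 1 - exp(-r*(r-1)/(2*n));        % 1 - e^{-r(r-1)/(2n)}
  [exact approx]                           % approximately 0.5073 and 0.5000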

3.11. Monty Hall's Game: The game starts by showing a contestant 3 closed doors, behind one of which is a prize. The contestant selects a door, but before that door is opened, Monty Hall, who knows which door hides the prize, opens one of the remaining doors (one that does not hide the prize). The contestant is then allowed either to stay with his original guess or to change to the other closed door. Question: is it better to stay or to switch? Answer: switch, because given that the contestant switches, the probability that he wins the prize is 2/3.
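The 2/3 answer is easy to confirm by simulation. A minimal MATLAB sketch of the switching strategy (the number of trials is an arbitrary choice):

  N = 1e5; wins = 0;
  for t = 1:N
      prize = randi(3);                            % door hiding the prize
      guess = randi(3);                            % contestant's first pick
      closed = setdiff(1:3, [guess prize]);        % doors Monty may open
      opened = closed(randi(numel(closed)));       % Monty opens one of them
      newGuess = setdiff(1:3, [guess opened]);     % contestant switches
      wins = wins + (newGuess == prize);
  end
  wins/N                                           % approximately 2/3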

4 Probability Foundations

To study formal definition of probability, we start with the probability space (Ω, A, P ). Let Ω be an arbitrary space or set of points ω. Viewed probabilistically, a subset of Ω is an event and an element ω of Ω is a sample point. Each event is a collection of outcomes which are elements of the sample space Ω. The theory of probability focuses on collections of events, called event algebras and typically denoted A (or F) that contain all the events of interest (regarding the random experiment E) to us, and are such that we have knowledge of their likelihood of occurrence. The probability P itself is defined as a number in the range [0, 1] associated with each event in A.

4.1 Algebra and σ-algebra

The class 2^Ω of all subsets can be too large for us to define probability measures consistently across all members of the class. (There is no problem when Ω is countable.) In this section, we present smaller classes which have several "nice" properties.

Definition 4.1. [9, Def 1.6.1 p38] An event algebra A is a collection of subsets of the sample space Ω such that it is (1) nonempty (this is equivalent to Ω ∈ A); (2) closed under complementation (if A ∈ A then A^c ∈ A); (3) and closed under finite unions (if A, B ∈ A then A ∪ B ∈ A). In other words, "a class is called an algebra on Ω if it contains Ω itself and is closed under the formation of complements and finite unions."

4.2. Examples of algebras:
• Ω = any fixed interval of R. A = {finite unions of intervals contained in Ω}.
• Ω = (0, 1]. B0 = the collection of all finite unions of intervals of the form (a, b] ⊂ (0, 1].
  ◦ Not a σ-algebra. Consider the set ∪_{i=1}^{∞} (1/2^{i+1}, 1/2^{i}], which is a countable union of members of B0 but is not itself a finite union of intervals of the form (a, b].

4.3. Properties of an algebra A:
(a) Nonempty: ∅ ∈ A, Ω ∈ A
(b) A ⊂ 2^Ω
(c) An algebra is closed under finite set-theoretic operations.
  • A ∈ A ⇒ A^c ∈ A

  • A, B ∈ A ⇒ A ∪ B ∈ A, A ∩ B ∈ A, A \ B ∈ A, A∆B = (A \ B) ∪ (B \ A) ∈ A
  • A1, A2, . . . , An ∈ A ⇒ ∪_{i=1}^{n} Ai ∈ A and ∩_{i=1}^{n} Ai ∈ A
(d) The collection of algebras in Ω is closed under arbitrary intersection. In particular, let A1, A2 be algebras of Ω and let A = A1 ∩ A2 be the collection of sets common to both algebras. Then A is an algebra.
(e) The smallest A is {∅, Ω}.
(f) The largest A is the set of all subsets of Ω, known as the power set and denoted by 2^Ω.
(g) Cardinality of algebras: an algebra of subsets of a finite set of n elements will always have a cardinality of the form 2^k, k ≤ n.

4.4. There is a smallest (in the sense of inclusion) algebra containing any given family C of subsets of Ω. For C ⊂ 2^Ω, the algebra generated by C is
  ∩ {G : G is an algebra and C ⊂ G},
i.e., the intersection of all algebras containing C. It is the smallest algebra containing C.

Definition 4.5. A σ-algebra A is an event algebra that is also closed under countable unions: (∀i ∈ N) Ai ∈ A ⇒ ∪_{i∈N} Ai ∈ A.
Remarks:
• A σ-algebra is also an algebra.
• A finite algebra is also a σ-algebra.

4.6. Because every σ-algebra is also an algebra, it has all the properties listed in (4.3). Extra properties of a σ-algebra A are as follows:
(a) A1, A2, . . . ∈ A ⇒ ∪_{j=1}^{∞} Aj ∈ A and ∩_{j=1}^{∞} Aj ∈ A

(b) A σ-field is closed under countable set-theoretic operations. (c) The collection of σ-algebra in Ω is closed under arbitrary intersection, i.e., let ATα be σ-algebra ∀α ∈ A where A is some index set, potentially uncountable. Then, Aα is still a σ-algebra. α∈A

(d) An infinite σ-algebra F on X is uncountable i.e. σ-algebra is either finite or uncountable.

(e) If A and B are σ-algebras in X, then it is not necessarily the case that A ∪ B is also a σ-algebra. For example, let E1, E2 ⊂ X be distinct and not complements of one another, and let Ai = {∅, Ei, Ei^c, X}. Then, E1, E2 ∈ A1 ∪ A2 but E1 ∪ E2 ∉ A1 ∪ A2.

Definition 4.7. Generation of σ-algebra: For C ⊂ 2^Ω, the σ-algebra generated by C, σ(C), is
  ∩ {G : G is a σ-algebra and C ⊂ G},
i.e., the intersection of all σ-algebras containing C. It is the smallest σ-algebra containing C.
• If the set Ω is not implicit, we will explicitly write σ_X(C).
• We will say that a set A can be generated by elements of C if A ∈ σ(C).

4.8. Properties of σ(C):

(a) σ (C) is a σ-field (b) σ (C) is the smallest σ-field containing C in the sense that if H is a σ-field and C ⊂ H, then σ (C) ⊂ H (c) C ⊂ σ (C) (d) σ (σ (C)) = σ (C) (e) If H is a σ-field and C ⊂ H, then σ (C) ⊂ H (f) σ ({∅}) = σ ({X}) = {∅, X} (g) σ ({A}) = {∅, A, Ac , X} for A ⊂ Ω. (h) σ ({A, B}) has at most 16 elements. They are ∅, A, B, A∩B, A∪B, A\B, B \A, A∆B, and their corresponding complements. See also (4.11). Some of these 16 sets can be the same and hence the use of “at-most” in the statement. (i) σ (A) = σ (A ∪ {∅}) = σ (A ∪ {X}) = σ (A ∪ {∅, X}) (j) A ⊂ B ⇒ σ (A) ⊂ σ (B) (k) σ (A) , σ (B) ⊂ σ (A) ∪ σ (B) ⊂ σ (A ∪ B) (l) σ (A) = σ (A ∪ {∅}) = σ (A ∪ {Ω}) = σ (A ∪ {∅, Ω}) 4.9. For the decomposition described in (2.8), let the starting collection be C1 and the decomposed collection be C2 . • If C1 is finite, then σ (C2 ) = σ (C1 ). • If C1 is countable, then σ (C2 ) ⊂ σ (C2 ). 4.10. Construction of σ-algebra from countable partition: An intermediate-sized σ-algebra (a σ-algebra smaller than 2Ω ) can be constructed by first partitioning Ω into subsets and then forming the power set of these subsets, with an individual subset now playing the role of an individual outcome ω. Given a countable5 partition Π = {Ai , i ∈ I} of Ω, we can form a σ-algebra A by including all unions of subcollections: A = {∪α∈S Aα : S ⊂ I} (10) S [9, Ex 1.5 p. 39] where we define Ai = ∅. It turns out that A = σ(Π). Of course, a i∈∅

σ-algebra is also an algebra. Hence, (10) is also a way to construct an algebra. Note that, from (2.9), the necessary and sufficient condition for distinct S to produce distinct element in (10) is that none of the Aα ’s are empty. In particular, for countable Ω = {xi : i ∈ N}, where xi ’s are distinct. If we want a σ-algebra which contains {xi } ∀i, then, the smallest one is 2Ω which happens to be the biggest one. So, 2Ω is the only σ-algebra which is “reasonable” to use. 5

In this case, Π is countable if and only if I is countable.

34

4.11. Generation of σ-algebra from finite partition: If a finite collection Π = {Ai , i ∈ I} of non-empty sets forms a partition of Ω, then the algebra generated by Π is the same as the σ-algebra generated by Π and it is given by (10). Moreover, |σ(Π)| is 2|I| = 2|Π| ; that is distinct sets S in (10) produce distinct member of the (σ-)algebra. Therefore, given a finite collection of sets C = {C1 , C2 , . . . , Cn }. To find an algebra or a σ-algebra generated by C, the first step is to use the Ci ’s to create a partition of Ω. Using (2.8), the partition is given by Π = {∩ni=1 Bi : Bi = Ci or Cic } .

(11)

By (4.9), we know that σ(π) = σ(C). Note that there are seemingly 2n sets in Π however, some of them can be ∅. We can eliminate the empty set(s) from Π and it is still a partition. So the cardinality of Π in (11) (after empty-set elimination) is k where k is at most 2n . The partition Π is then a collection of k sets which can be renamed as A1 , . . . , Ak , all of which are non-empty. Applying the construction in (10) (with I = [n]), we then have σ(C) n whose cardinality is 2k which is ≤ 22 . See also properties (4.8.7) and (4.8.8). Definition 4.12. In general, the Borel σ-algebra or Borel algebra B is the σ-algebra generated by the open subsets of Ω • Call BΩ the σ-algebra of Borel subsets of Ω or σ-algebra of Borel sets on Ω • Call set B ∈ BΩ Borel set of Ω (a) On Ω = R, the σ-algebra generated by any of the followings are Borel σ-algebra: (i) Open sets (ii) Closed sets (iii) Intervals (iv) Open intervals (v) Closed intervals (vi) Intervals of the form (−∞, a], where a ∈ R

i. Can replace R by Q ii. Can replace (−∞, a] by (−∞, a), [a, +∞), or (a, +∞) iii. Can replace (−∞, a] by combination of those in 1(f)ii.

(b) For Ω ⊂ R, BΩ = {A ∈ BR : A ⊂ Ω} = BR ∩ Ω where BR ∩ Ω = {A ∩ Ω : A ∈ BR } (c) Borel σ-algebra on the extended real line is the extended σ-algebra BR = {A ∪ B : A ∈ BR , B ∈ {∅, {−∞} , {∞} , {−∞, ∞}}} It is generated by, for example (i) A ∪ {{−∞} , {∞}} where σ (A) = BR 35

(ii)

nh

o nh io n io ˆ a ˆ, b , a ˆ, ˆb , a ˆ, ˆb

(iii) {[−∞, b]} , {[−∞, b)} , {(a, +∞]} , {[a, +∞]} nh io nh o (iv) −∞, ˆb , −∞, ˆb , {(ˆ a, +∞]} , {[ˆ a, +∞]}

Here, a, b ∈ R and a ˆ, ˆb ∈ R ∪ {±∞}

(d) Borel σ-algebra in Rk is generated by (i) the class of open sets in Rk (ii) the class of closed sets in Rk (iii) the class of bounded semi-open rectangles (cells) I of the form  I = x ∈ Rk : ai < xi ≤ bi , i = 1, . . . , k . k

Note that I = ⊗ (ai , bi ] where ⊗ denotes the Cartesian product × i=1

k (iv)  the class of “southwest regions” Sx of points “southwest” to x ∈ R , i.e. Sx = k y ∈ R : yi ≤ xi , i = 1, . . . , k

4.13. The Borel σ-algebra B of subsets of the reals is the usual algebra when we deal with real- or vector-valued quantities. Our needs will not require us to delve into these issues beyond being assured that events we discuss are constructed out of intervals and repeated set operations on these intervals and these constructions will not lead us out of B. In particular, countable unions of intervals, their complements, and much, much more are in B.

4.2 Kolmogorov's Axioms for Probability

Definition 4.14. Kolmogorov’s Axioms for Probability [15]: A set function satisfying K0–K4 is called a probability measure. K0 Setup: The random experiment E is described by a probability space (Ω, A, P ) consisting of an event σ-algebra A and a real-valued function P : A → R. K1 Nonnegativity: ∀A ∈ A, P (A) ≥ 0. K2 Unit normalization: P (Ω) = 1. K3 Finite additivity: If A, B are disjoint, then P (A ∪ B) = P (A) + P (B). K4 Monotone continuity: If (∀i > 1) Ai+1 ⊂ Ai and ∩i∈N Ai = ∅ (a nested series of sets shrinking to the empty set), then lim P (Ai ) = 0.

i→∞

36

K4′ Countable or σ-additivity: If (Ai) is a countable collection of pairwise disjoint (non-overlapping) events, then
  P(∪_{i=1}^{∞} Ai) = Σ_{i=1}^{∞} P(Ai).

• Note that there is never a problem with the convergence of the infinite sum; all partial sums of these non-negative summands are bounded above by 1. • K4 is not a property of limits of sequences of relative frequencies nor meaningful in the finite sample space context of classical probability. It is offered by Kolmogorov to ensure a degree of mathematical closure under limiting operations [9, p. 111]. • K4 is an idealization that is rejected by some accounts (usually subjectivist) of probability. • If P satisfies K0–K3, then it satisfies K4 if and only if it satisfies K40 [9, Theorem 3.5.1 p. 111]. Proof. To show “⇒”, consider disjoint A1 , A2 , . . .. Define Bn = ∪i>n Ai . Then, by (2.6), Bn+1 ⊂ Bn and ∩∞ n=1 Bn = ∅. So, by K4, lim P (Bn ) = 0. Furthermore, n→∞

∞ [

i=1

Ai = Bn ∪

n [

Ai

i=1

!

where all the sets on the RHS are disjoint. Hence, by finite additivity, ! n ∞ X [ P P (Ai ). Ai = P (Bn ) + i=1

i=1

Taking limiting as n → ∞ gives us K40 . To show “⇐”, see (4.17).

• K3 implies that K40 holds for a finite number of events, but for infinitely many events this is a new assumption. Not everyone believes that K40 (or its equivalent K4) should be used. However, without K40 (or its equivalent K4) the theory of probability becomes much more difficult and less useful. In many cases the sample space is finite, so K40 (or its equivalent K4) is not relevant anyway. Equivalently, in stead of K0-K4, we can define probability measure using P0-P2 below. Definition 4.15. A probability measure defined on a σ-algebra A of Ω is a (set) function (P0) P : A → [0, 1] that satisfies: (P1,K2) P (Ω) = 1 37

(P2,K40 ) Countable additivity countable sequence (An )∞ n=1 of disjoint ele ∞ : For  every ∞ S P ments of A, one has P An = P (An ) n=1

n=1

• P (∅) = 0

• The number P (A) is called the probability of the event A • The triple (Ω, A, P ) is called a probability measure space, or simply a probability space • A support of P is any A-set A for which P (A) = 1

4.3 Properties of Probability Measure

4.16. Properties of probability measures:
(a) P(∅) = 0
(b) 0 ≤ P ≤ 1: for any A ∈ A, 0 ≤ P(A) ≤ 1
(c) If P(A) = 1, A is not necessarily Ω.
(d) Additivity: A, B ∈ A, A ∩ B = ∅ ⇒ P(A ∪ B) = P(A) + P(B)
(e) Monotonicity: A, B ∈ A, A ⊂ B ⇒ P(A) ≤ P(B) and P(B − A) = P(B) − P(A)
(f) P(A^c) = 1 − P(A)
(g) P(A) + P(B) = P(A ∪ B) + P(A ∩ B) = P(A − B) + 2P(A ∩ B) + P(B − A).
    P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
(h) P(A ∪ B) ≥ max(P(A), P(B)) ≥ min(P(A), P(B)) ≥ P(A ∩ B)
(i) Inclusion-exclusion principle:
    P(∪_{k=1}^{n} Ak) = Σ_i P(Ai) − Σ_{i<j} P(Ai ∩ Aj) + Σ_{i<j<k} P(Ai ∩ Aj ∩ Ak) − · · · + (−1)^{n+1} P(A1 ∩ · · · ∩ An).

i t] R0 (t) f (t) d =− = = ln R(t). r (t) = lim δ→0 δ R(t) R (t) dt (i) r (t) δ ≈ P [T ∈ (t, t + δ] |T > t]

(ii) R(t) = e−

Rt 0

r(τ )dτ

(iii) f (t) = r(t)e−

Rt 0

.

r(τ )dτ

51

• For T ∼ E(λ), r(t) = λ. See also [11, section 5.7]. Definition 5.37. A random variable whose cdf is continuous but whose derivative is the zero function is said to be singular. • See Cantor-type distribution in [7, p. 35–36].

• It has no density. (Otherwise, the cdf is the zero function.) So, ∃ continuous random variable X with no density. Hence, ∃ random variable X with no density.

• Even when we allow the use of the delta function for the density, as in the case of a mixed r.v., it still has no density because there is no jump in the cdf.
• There exist singular r.v.'s whose cdf is strictly increasing.

Definition 5.38. fX,A(x) = (d/dx) FX,A(x). See also definition 5.17.

5.5 Mixed/hybrid Distribution

There are many occasion where we have a random variable whose distribution is a combination of a normalized linear combination of discrete and absolutely continuous distributions. For convenience, we use the Dirac delta function to link the pmf to a pdf as in definition 5.27. Then, we only have to concern about the pdf of a mixed distribution/r.v. 5.39. By allowing density functions toRcontain impulses, the cdfs of mixed random variables can be expressed in the form F (x) = (−∞,x] f (t)dt. 5.40. Given a cdf FX of a mixed random variable X, the density fX is given by X fX (x) = f˜X (x) + P [X = xk ]δ(x − xk ), k

where • the xi are the distinct points at which FX has jump discontinuities, and  0 FX (x) , FX is differentiable at x • f˜X (x) = 0, otherwise. In which case, E [g(X)] =

Z

g(x)f˜X (x)dx +

Note also that P [X = xk ] = F (xk ) −

X

g(xk )P [X = xk ].

k

F (x− k)

5.41. Suppose the cdf F can be expressed in the form F(x) = G(x)U(x − x0) for some function G. Then, the density is f(x) = G′(x)U(x − x0) + G(x0)δ(x − x0). Note that G(x0) = F(x0) = P[X = x0] is the jump of the cdf at x0. When the random variable is continuous, G(x0) = 0 and thus f(x) = G′(x)U(x − x0).
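For instance (an illustration made up for these notes, not an example from the text), take F(x) = (1 − (1/2)e^{−x}) U(x). Here G(x) = 1 − (1/2)e^{−x} and x0 = 0, so the jump at 0 is G(0) = 1/2 = P[X = 0] and f(x) = (1/2)e^{−x} U(x) + (1/2)δ(x). Using 5.40, E[X] = ∫_0^∞ x (1/2)e^{−x} dx + 0 · (1/2) = 1/2.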

5.6 Independence

Definition 5.42. A family of random variables {Xi : i ∈ I} is independent if ∀ finite J ⊂ I, the family of random variables {Xi : i ∈ J} is independent. In words, “an infinite collection of random elements is by definition independent if each finite subcollection is.” Hence, we only need to know how to test independence for finite collection. (a) (Ei , Ei )’s are not required to be the same. (b) The collection of random variables {1Ai : i ∈ I} is independent iff the collection of events (sets) {Ai : i ∈ I} is independent. Definition 5.43. Independence among finite collection of random variables: For finite I, the following statements are equivalent ≡ (Xi )i∈I are independent (or mutually independent [2, p 182]). Q ≡ P [Xi ∈ Hi ∀i ∈ I] = P [Xi ∈ Hi ]where Hi ∈ Ei 

i∈I



≡ P (Xi : i ∈ I) ∈ × Hi = ≡ P (Xi :i∈I)



i∈I

× Hi

i∈I



=

≡ P [Xi ≤ xi ∀i ∈ I] =

Q

Q

i∈I

P [Xi ∈ Hi ] where Hi ∈ Ei

i∈I

P Xi (Hi ) where Hi ∈ Ei

i∈I

P [Xi ≤ xi ]

Q

≡ [Factorization Criterion] F(Xi :i∈I) ((xi : i ∈ I)) = ≡ Xi and X1i−1 are independent ∀i ≥ 2  ≡ σ (Xi ) and σ X1i−1 are independent ∀i ≥ 2

Q

FXi (xi )

i∈I

≡ Discrete random variables Xi ’s with countable range E: Y P [Xi = xi ∀i ∈ I] = P [Xi = xi ] ∀xi ∈ E ∀i ∈ I i∈I

≡ Absolutely continuous Xi with density fXi : f(Xi :i∈I) ((xi : i ∈ I)) =

Q

fXi (xi )

i∈I

Definition 5.44. If the Xα , α ∈ I are independent and each has the same marginal distribution with distribution Q, we say that the Xα ’s are iid (independent and identically iid distributed) and we write Xα ∼ Q • The abbreviation can be IID [26, p 39]. Definition 5.45. A pairwise independent collection of random variables is a set of random variables any two of which are independent. (a) Any collection of (mutually) independent random variables is pairwise independent

53

(b) Some pairwise independent collections are not independent. See example (5.46). Example 5.46. Let suppose X, Y , and Z have the following joint probability distribution: pX,Y,Z (x, y, z) = 14 for (x, y, z) ∈ {(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)}. This, for example, can be constructed by starting with independent X and Y that are Bernoulli- 12 . Then set Z = X ⊕ Y = X + Y mod 2. (a) X, Y, Z are pairwise independent. Z.

|=

Z does not imply (X, Y )

|=

Z and Y

|=

(b) The combination of X
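A brute-force MATLAB check of Example 5.46 (the four rows below are just the joint pmf written out):

  xyz = [0 0 0; 0 1 1; 1 0 1; 1 1 0];  p = [1;1;1;1]/4;   % joint pmf of (X,Y,Z)
  X = xyz(:,1); Y = xyz(:,2); Z = xyz(:,3);
  % pairwise independence, e.g. P[X=1,Z=1] = P[X=1]P[Z=1]:
  [sum(p(X==1 & Z==1)), sum(p(X==1))*sum(p(Z==1))]          % both 0.25
  % but not mutually independent: P[X=1,Y=1,Z=1] = 0, while the product is 1/8
  [sum(p(X==1 & Y==1 & Z==1)), sum(p(X==1))*sum(p(Y==1))*sum(p(Z==1))]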

Definition 5.47. The convolution of probability measures µ1 and µ2 on (R, B) is the measure µ1 ∗ µ2 defined by Z (µ1 ∗ µ2 ) (H) = µ2 (H − x)µ1 (dx) H ∈ BR (a) µX ∗ µY = µY ∗ µX and µX ∗ (µY ∗ µZ ) = (µX ∗ µY ) ∗ µZ (b) If FX and GX are distribution functions corresponding to µX and µY , the distribution function corresponding to µX ∗ µY is (µX ∗ µY ) (−∞, z] =

Z

FY (z − x)µX (dx)

In this case, it is notationally convenientR to replace µX (dx) by dFX (x) (Stieltjes Integral.) Then, (µX ∗ µY ) (−∞, z] = FY (z − x)dFX (x). This is denoted by (FX ∗ FY ) (z). That is Z FZ (z) = (FX ∗ FY ) (z) = FY (z − x)dFX (x) (c) If densityfY exists, FX ∗ FY has density FX ∗ fY , where Z (FX ∗ fY ) (z) = fY (z − x)dFX (x) (i) If Y (or FY ) is absolutely continuous with density fY , then for any X (or F R X ), X + Y (or FX ∗ FY ) is absolutely continuous with density (FX ∗ fY ) (z) = fY (z − x)dFX (x)

If, in addition, FX has density fX . Then, Z Z fY (z − x)dFX (x) = fY (z − x)fX (x) dx This is denoted by fX ∗ fY

In other words, if densities fX , fY exist, then FX ∗ FY has density fX ∗ fY , where Z (fX ∗ fY ) (z) = fY (z − x)fX (x) dx 54

5.48. If random variables X and Y are independent and have distribution µX and µY , then X + Y has distribution µX ∗ µY 5.49. Expectation and independence (a) Let X and Y be nonnegative independent random variables on (Ω, A, P ), then E [XY ] = EXEY (b) If X1 , X2 , . . . , Xn are independent and gk ’s complex-valued measurablefunction. Then  n Q gk (Xk )’s are independent. Moreover, if gk (Xk ) is integrable, then E gk (Xk ) = n Q

k=1

E [gk (Xk )]

k=1

(c) If X1 , . . . , Xn are independent and

n P

Xk has a finite second moment, then all Xk

k=1

have finite second as well.  n moments  n P P Moreover, Var Xk = Var Xk . k=1

k=1

2

(d) If pairwise independent Xi ∈ L , then Var

5.7



k P

i=1



Xi =

k P

Var [Xi ]

i=1

Misc

5.50. The mode of a discrete probability distribution is the value at which its probability mass function takes its maximum value. The mode of a continuous probability distribution is the value at which its probability density function attains its maximum value. • the mode is not necessarily unique, since the probability mass function or probability density function may achieve its maximum value at several points.

6 PMF Examples

The following pmf will be defined on its support S. For Ω larger than S, we will simply put the pmf to be 0.

6.1 Random/Uniform

6.1. Rn, Un. When an experiment results in a finite number of "equally likely" or "totally random" outcomes, we model it with a uniform random variable. We say that X is uniformly distributed on [n] if
  P[X = k] = 1/n,  k ∈ [n].
We write X ∼ Un.

X ∼                     Support set SX        pX(k)                         ϕX(u)
Uniform Un              {1, 2, . . . , n}     1/n
Uniform U{0,1,...,n−1}  {0, 1, . . . , n−1}   1/n                           (1 − e^{iun}) / (n(1 − e^{iu}))
Bernoulli B(1, p)       {0, 1}                1 − p for k = 0; p for k = 1
Binomial B(n, p)        {0, 1, . . . , n}     C(n, k) p^k (1 − p)^{n−k}     (1 − p + pe^{iu})^n
Geometric G0(β)         N ∪ {0}               (1 − β) β^k                   (1 − β) / (1 − βe^{iu})
Geometric G1(β)         N                     (1 − β) β^{k−1}
Poisson P(λ)            N ∪ {0}               e^{−λ} λ^k / k!               e^{λ(e^{iu} − 1)}

Table 2: Examples of probability mass functions. Here, p, β ∈ (0, 1) and λ > 0.

• pi = 1/n for i ∈ S = {1, 2, . . . , n}.

• Examples ◦ classical game of chance / classical probability drawing at random

◦ fair gaming devices (well-balanced coins and dice, well shuffled decks of cards) ◦ experiment where

∗ there are only n possible outcomes and they are all equally probable ∗ there is a balance of information about outcomes

6.2. Uniform on a finite set: U(S). Suppose |S| = n; then p(x) = 1/n for all x ∈ S.

Example 6.3. For X uniform on [-M:1:M], we have EX = 0 and Var X = M(M + 1)/3.
For X uniform on [N:1:M], we have EX = (N + M)/2 and Var X = (1/12)(M − N)(M − N + 2).

Example 6.4. Set S = {0, 1, 2, . . . , M}; then the sum of two independent U(S) random variables has pmf
  p(k) = ((M + 1) − |k − M|) / (M + 1)²
for k = 0, 1, . . . , 2M. Note its triangular shape with maximum value p(M) = 1/(M + 1). To visualize the pmf in MATLAB, try

  M = 5;                          % any nonnegative integer
  k = 0:2*M;
  P = (1/(M+1))*ones(1,M+1);      % pmf of one U({0,...,M})
  P = conv(P,P);                  % pmf of the sum of two independent copies
  stem(k,P)

6.2 Bernoulli and Binary distributions

6.5. Bernoulli: B(1, p) or Bernoulli(p)
• S = {0, 1}
• p0 = q = 1 − p, p1 = p
• EX = 1 · p + 0 · (1 − p) = p
• E[X²] = 1² · p + 0² · (1 − p) = p, so Var X = E[X²] − (EX)² = p − p² = p(1 − p).
  Alternatively, Var X = (1 − p)² p + (0 − p)² (1 − p) = p(1 − p)((1 − p) + p) = p(1 − p).
  Note that the variance is maximized at p = 0.5.


Figure 13: The plot of p(1 − p).

6.6. Binary: Suppose X takes only two values a and b with b > a, and P[X = b] = p.
(a) X can be expressed as X = (b − a)I + a, where I is a Bernoulli random variable with P[I = 1] = p.
(b) Var X = (b − a)² Var I = (b − a)² p(1 − p). Note that it is still maximized at p = 1/2.
(c) Suppose a = −b. Then, X = 2bI + a = 2bI − b. In which case, Var X = 4b² p(1 − p).
(d) Suppose the Xk are independent random variables taking on the two values b and a with P[Xk = b] = p = 1 − P[Xk = a], where b > a. Then, the sum S = Σ_{k=1}^{n} Xk is "binomial" on {k(b − a) + an : k = 0, 1, . . . , n}, where the probability at the point k(b − a) + an is C(n, k) p^k (1 − p)^{n−k}.

6.3 Binomial: B(n, p)

6.7. Binomial distribution with size n and parameter p. p ∈ [0, 1].  (a) pi = ni pi (1 − p)n−i for i ∈ S = {0, 1, 2, . . . , n} • Use binopdf(i,n,p) in MATLAB.

(b) X is the number of success in n independent Bernoulli trials and hence the sum of n independent, identically distributed Bernoulli r.v. n

(c) ϕX (u) = (1 − p + peju )

57

(d) EX = np
(e) E[X²] = (np)² + np(1 − p)
(f) Var[X] = np(1 − p)
(g) Tail probability: Σ_{r=k}^{n} C(n, r) p^r (1 − p)^{n−r} = I_p(k, n − k + 1)

(h) The maximum probability value happens at kmax = mode X = ⌊(n + 1)p⌋ ≈ np.
• When (n + 1)p is an integer, the maximum is achieved at both kmax and kmax − 1.
(i) By (3),
P[X is even] = (1/2)(1 + (1 − 2p)^n), and
P[X is odd] = (1/2)(1 − (1 − 2p)^n).

(j) If we have E1, . . . , En, n unlinked repetitions of an experiment E and an event A for E, then the distribution B(n, p) describes the probability that A occurs k times in E1, . . . , En.
(k) Gaussian approximation for binomial probabilities: When n is large, the binomial distribution becomes difficult to compute directly because of the need to calculate factorial terms. We can use
P[X = k] ≈ (1/√(2πnp(1 − p))) e^{−(k−np)²/(2np(1−p))},    (16)
which comes from approximating X by a Gaussian Y with the same mean and variance and the relation P[X = k] ≈ P[X ≤ k] − P[X ≤ k − 1] ≈ P[Y ≤ k] − P[Y ≤ k − 1] ≈ fY(k). See also (13.23).

(l) Approximation: C(n, k) p^k (1 − p)^{n−k} = ((np)^k / k!) e^{−np} (1 + O(np² + k²/n)).

6.4

Geometric: G(β)

A geometric distribution is defined by the fact that for some β ∈ [0, 1), pk+1 = βpk for all k ∈ S, where S can be either N or N ∪ {0}.
• When its support is N, pk = (1 − β)β^{k−1}. This is referred to as G1(β) or geometric1(β). In MATLAB, use geopdf(k-1,1-β).
• When its support is N ∪ {0}, pk = (1 − β)β^k. This is referred to as G0(β) or geometric0(β). In MATLAB, use geopdf(k,1-β).
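As a hedged sanity check of this parameterization (MATLAB's geopdf(k,p) gives p(1 − p)^k for k = 0, 1, . . ., so the success-probability argument is 1 − β):

% G0(beta): pk = (1-beta)*beta^k on {0,1,2,...}
beta = 0.6; k = 0:10;
max(abs((1-beta)*beta.^k - geopdf(k,1-beta)))        % should be ~0
% G1(beta): pk = (1-beta)*beta^(k-1) on {1,2,...}
k1 = 1:10;
max(abs((1-beta)*beta.^(k1-1) - geopdf(k1-1,1-beta))) % should be ~0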

A Poisson random variable X with large mean is approximately normal: X ≈ N(λ, λ) for λ large. Some say that the normal approximation is good when λ > 5.

Figure 14: Poisson, Gamma, and Gaussian approximations to the Binomial distribution.

If g : Z+ → R is any bounded function and Λ ∼ P(λ), then E[λg(Λ + 1) − Λg(Λ)] = 0.

Proof. E[λg(Λ + 1) − Λg(Λ)] = e^{−λ} (Σ_{i≥0} g(i + 1) λ^{i+1}/i! − Σ_{i≥1} g(i) λ^i/(i − 1)!) = 0, since the two sums are identical after reindexing.

Any function f : Z+ → R for which E[f(Λ)] = 0 can be expressed in the form f(j) = λg(j + 1) − jg(j) for a bounded function g. Thus, conversely, if E[λg(Λ + 1) − Λg(Λ)] = 0 for all bounded g, then Λ has the Poisson distribution P(λ).

6.8. Consider X ∼ G0(β).
• pi = (1 − β)β^i for i ∈ N ∪ {0}, 0 ≤ β < 1
• β = m/(m + 1), where m = average waiting time/lifetime
• P[X = k] = P[k failures followed by a success] = (P[failure])^k P[success]
• P[X ≥ k] = β^k = the probability of having at least k initial failures = the probability of having to perform at least k + 1 trials.
• P[X > k] = β^{k+1} = the probability of having at least k + 1 initial failures.
• Memoryless property:
◦ P[X ≥ k + c | X ≥ k] = P[X ≥ c], k, c > 0.
◦ P[X > k + c | X ≥ k] = P[X > c], k, c > 0.
◦ If a success has not occurred in the first k trials (it has already failed k times), then the probability of having to perform at least j more trials is the same as the probability of initially having to perform at least j trials.
◦ Each time a failure occurs, the system "forgets" and begins anew as if it were performing the first trial.
◦ The geometric r.v. is the only discrete r.v. that satisfies the memoryless property.
• Examples:
◦ lifetimes of components, measured in discrete time units, when they fail catastrophically (without degradation due to aging)
◦ waiting times
∗ for the next customer in a queue
∗ between radioactive disintegrations
∗ between photon emissions
◦ the number of repeated, unlinked random experiments that must be performed prior to the first occurrence of a given event A
∗ number of coin tosses prior to the first appearance of a 'head'
∗ number of trials required to observe the first success
• The sum of independent G0(p) and G0(q) has pmf
(1 − p)(1 − q)(q^{k+1} − p^{k+1})/(q − p) for p ≠ q, and (k + 1)(1 − p)² p^k for p = q,
for k ∈ N ∪ {0}.

6.9. Consider X ∼ G1(β).
• P[X > k] = β^k

• Suppose independent Xi ∼ G1(βi). Then min(X1, X2, . . . , Xn) ∼ G1(∏_{i=1}^{n} βi).

6.5

Poisson Distribution: P(λ)

6.10. Characterized by
• pX(k) = P[X = k] = e^{−λ} λ^k / k!, k ∈ N ∪ {0}; or equivalently,
• ϕX(u) = e^{λ(e^{iu} − 1)},
where λ ∈ (0, ∞) is called the parameter or intensity parameter of the distribution. In MATLAB, use poisspdf(k,lambda).

6.11. Denoted by P(λ).

6.12. Instead of X, a Poisson random variable is usually denoted by Λ.

6.13. EX = Var X = λ.

6.14. Successive probabilities are connected via the relation k pX(k) = λ pX(k − 1).

6.15. mode X = ⌊λ⌋.
• Note that when λ ∈ N, there are two maxima, at λ − 1 and λ.
• When λ ≫ 1, pX(⌊λ⌋) ≈ 1/√(2πλ) via Stirling's formula (2.15).
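The recursion in 6.14 and the mode in 6.15 can be checked numerically; the sketch below uses an arbitrary λ and the poisspdf function mentioned above:

% Check 6.14 and 6.15 (illustrative value of lambda)
lambda = 4.7; k = 1:20;
max(abs(k.*poisspdf(k,lambda) - lambda*poisspdf(k-1,lambda)))  % ~0
[~,idx] = max(poisspdf(0:20,lambda));
idx-1                                  % mode, equals floor(lambda) = 4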

6.29. Poisson distribution can be obtained as a limit from negative binomial distributions. Thus, the negative binomial distribution with parameters r and p can be approximated by the Poisson distribution with parameter λ = rqp (mean-matching), provided that p is “sufficiently” close to 1 and r is “sufficiently” large. 6.30. Convergence of sum of bernoulli random variables to the Poisson Law Suppose that for each n ∈ N Xn,1 , Xn,2 , . . . , Xn,rn are independent; the probability space for the sequence may change with n. Such a collection is called a triangular array [1] or double sequence [10] which captures the nature


of the collection when it is arranged as  X1,1 , X1,2 , . . . , X1,r1 ,    X2,1 , X2,2 , . . . , X2,r2 ,    .. .. .. . . ··· .  Xn,1 , Xn,2 , . . . , Xn,rn ,      .. .. .. . . ··· .

where the random variables in each row are independent. Let Sn = Xn,1 + Xn,2 + · · · + Xn,rn be the sum of the random variables in the nth row.
Consider a triangular array of Bernoulli random variables Xn,k with P[Xn,k = 1] = pn,k. If max_{1≤k≤rn} pn,k → 0 and Σ_{k=1}^{rn} pn,k → λ as n → ∞, then the sums Sn converge in distribution to the Poisson law. In other words, the Poisson distribution is the rare-events limit of the binomial (large n, small p).
As a simple special case, consider a triangular array of Bernoulli random variables Xn,k with P[Xn,k = 1] = pn. If npn → λ as n → ∞, then the sums Sn converge in distribution to the Poisson law.
To show this special case directly, we bound the first i terms of n! to get (n − i)^i / i! ≤ C(n, i) ≤ n^i / i!. Using the upper bound,
C(n, i) pn^i (1 − pn)^{n−i} ≤ ((npn)^i / i!) (1 − pn)^{−i} (1 − npn/n)^n,
where (npn)^i → λ^i, (1 − pn)^{−i} → 1, and (1 − npn/n)^n → e^{−λ}. The lower bound gives the same limit because (n − i)^i = ((n − i)/n)^i n^i, where the first term → 1.
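Below is a brief MATLAB sketch of this rare-events limit (illustrative values of n and p; binopdf and poisspdf are the functions already used in this section):

% Rare-events limit: B(n,p) with large n, small p vs. Poisson(np)
n = 1000; p = 0.005; lambda = n*p;
k = 0:15;
pb = binopdf(k,n,p);             % exact binomial pmf
pp = poisspdf(k,lambda);         % Poisson approximation
max(abs(pb-pp))                  % maximum pointwise error
stem(k,pb); hold on; plot(k,pp,'o'); hold off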

6.6

Compound Poisson

Given an arbitrary probability measure µ and a positive real number λ, the compound Poisson distribution CP(λ, µ) is the distribution of the sum Σ_{j=1}^{Λ} Vj, where the Vj are i.i.d. with distribution µ and Λ is a P(λ) random variable, independent of the Vj. Sometimes it is written as POIS(λµ). The parameter λ is called the rate of CP(λ, µ) and µ is called the base distribution.

6.31. The mean and variance of CP(λ, L(V)) are λEV and λE[V²] respectively.

6.32. If Z ∼ CP(λ, q), then ϕZ(t) = e^{λ(ϕq(t) − 1)}.

An important property of the Poisson and compound Poisson laws is that their classes are closed under convolution (independent summation). In particular, we have the divisibility properties (6.22) and (6.32), which are straightforward to prove from their characteristic functions.

6.33. Divisibility property of the compound Poisson law: Suppose we have independent Λi ∼ CP(λi, µ^(i)); then Σ_{i=1}^{n} Λi ∼ CP(λ, (1/λ) Σ_{i=1}^{n} λi µ^(i)), where λ = Σ_{i=1}^{n} λi.

Proof.
ϕ_{Σ_{i=1}^{n} Zi}(t) = ∏_{i=1}^{n} e^{λi(ϕ_{qi}(t) − 1)} = exp(Σ_{i=1}^{n} λi(ϕ_{qi}(t) − 1)) = exp(Σ_{i=1}^{n} λi ϕ_{qi}(t) − λ) = exp(λ((1/λ) Σ_{i=1}^{n} λi ϕ_{qi}(t) − 1)).

We usually focus on the case when µ is a discrete probability measure on N = {1, 2, . . .}. In this case, we usually refer to µ by the pmf q on N; q is called the base pmf. Equivalently, CP(λ, q) is also the distribution of the sum Σ_{i∈N} i Λi, where the (Λi : i ∈ N) are independent with Λi ∼ P(λqi). Note that Σ_{i∈N} λqi = λ. The Poisson distribution is a special case of the compound Poisson distribution where we set q to be the point mass at 1.

6.34. The compound negative binomial [Bower, Gerber, Hickman, Jones, and Nesbitt, 1982, Ch 11] can be approximated by the compound Poisson distribution.
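A minimal simulation sketch of CP(λ, q) that checks the mean λEV and variance λE[V²] from 6.31; the base pmf q and the value of λ below are arbitrary choices, and randsample is assumed available (Statistics Toolbox):

% Simulate S = V_1 + ... + V_Lambda, Lambda ~ P(lambda), V_i i.i.d. ~ q
lambda = 3; v = [1 2 3]; q = [0.5 0.3 0.2];   % base pmf on {1,2,3}
nrep = 1e5; S = zeros(1,nrep);
for m = 1:nrep
    L = poissrnd(lambda);
    if L > 0
        S(m) = sum(randsample(v,L,true,q));   % i.i.d. draws from q
    end
end
[mean(S), lambda*(v*q')]                      % compare with lambda*E[V]
[var(S),  lambda*((v.^2)*q')]                 % compare with lambda*E[V^2]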

6.7

Hypergeometric

An urn contains N white balls and M black balls. One draws n balls without replacement, so n ≤ N + M. One gets X white balls and n − X black balls.
P[X = x] = C(N, x) C(M, n − x) / C(N + M, n) for 0 ≤ x ≤ N and 0 ≤ n − x ≤ M, and P[X = x] = 0 otherwise.

6.35. The hypergeometric distributions "converge" to the binomial distribution: Assume that n is fixed, while N and M increase to +∞ with lim N/(N + M) = p. Then
p(x) → C(n, x) p^x (1 − p)^{n−x} (binomial).
Note that the binomial corresponds to drawing balls with replacement:
p(x) = C(n, x) N^x M^{n−x} / (N + M)^n = C(n, x) (N/(N + M))^x (M/(N + M))^{n−x}.
Intuitively, when N and M are large, there is not much difference between drawing n balls with or without replacement.

6.36. Extension: If we have m colors and Ni balls of color i, the urn contains N = N1 + · · · + Nm balls. One draws n balls without replacement. Call Xi the number of balls of color i drawn among the n balls. (Of course, X1 + · · · + Xm = n.)
P[X1 = x1, . . . , Xm = xm] = C(N1, x1) C(N2, x2) · · · C(Nm, xm) / C(N, n) when x1 + · · · + xm = n and all xi ≥ 0, and 0 otherwise.
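For a quick numerical look at this convergence, the following MATLAB sketch compares the hypergeometric and binomial pmfs as the urn grows; it assumes the Statistics Toolbox parameterization hygepdf(x, population size, number of white balls, sample size):

% Hypergeometric -> Binomial as the urn grows with N/(N+M) -> p
n = 10; p = 0.3; x = 0:n;
for scale = [20 200 2000]
    N = round(p*scale); M = scale - N;        % N white, M black
    err = max(abs(hygepdf(x,N+M,N,n) - binopdf(x,n,N/(N+M))));
    fprintf('population %5d: max |hyper - binom| = %.2e\n', scale, err);
end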

6.8

Negative Binomial Distribution (Pascal / P´ olya distribution)

6.37. The probability that the rth success occurs on the (x + r)th trial is
C(x + r − 1, r − 1) p^{r−1} (1 − p)^x · p = C(x + r − 1, r − 1) p^r (1 − p)^x = C(x + r − 1, x) p^r (1 − p)^x,
i.e., among the first (x + r − 1) trials there are r − 1 successes and x failures.
• Fix r.
• ϕX(u) = p^r / (1 − (1 − p)e^{iu})^r
• EX = rq/p and Var[X] = rq/p², where q = 1 − p.
• Note that if we define C(−r, x) = (−r)(−r − 1) · · · (−r − (x − 1)) / (x(x − 1) · · · 1), in analogy with C(n, x) = n(n − 1) · · · (n − (x − 1)) / (x(x − 1) · · · 1), then
C(−r, x) = (−1)^x r(r + 1) · · · (r + (x − 1)) / (x(x − 1) · · · 1) = (−1)^x C(r + x − 1, x).
• If independent Xi ∼ NegBin(ri, p), then Σ_i Xi ∼ NegBin(Σ_i ri, p). This is easy to see from the characteristic function.
• When r = 1, we have the geometric distribution. Hence, when r is a positive integer, a negative binomial random variable is an independent sum of i.i.d. geometric random variables.
• p(x) = (Γ(r + x) / (Γ(r) x!)) p^r (1 − p)^x

6.38. A negative binomial distribution can arise as a mixture of Poisson distributions with the mean distributed as a gamma distribution Γ(q = r, λ = p/(1 − p)).
Let X be Poisson with mean λ. Suppose that the mean λ is chosen in accord with a probability distribution FΛ(λ). Then ϕX(u) = ϕΛ(−i(e^{iu} − 1)) [see compound Poisson distribution]. Here Λ ∼ Γ(q, λ0); hence ϕΛ(u) = 1/(1 − iu/λ0)^q. So
ϕX(u) = (1 − (e^{iu} − 1)/λ0)^{−q} = ((λ0/(λ0 + 1)) / (1 − (1/(λ0 + 1)) e^{iu}))^q,
which is negative binomial with p = λ0/(λ0 + 1).
So it is a.k.a. the Poisson-gamma distribution, or simply a compound Poisson distribution.
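The gamma-mixture representation in 6.38 can be checked by simulation. The sketch below assumes MATLAB's shape/scale convention for gamrnd (so the scale is (1 − p)/p) and that nbinpdf(x,r,p) = Γ(r + x)/(Γ(r)x!) p^r (1 − p)^x:

% Gamma-mixed Poisson vs. negative binomial pmf (6.38)
r = 4; p = 0.35; nrep = 2e5;
lam = gamrnd(r,(1-p)/p,1,nrep);      % Gamma(shape r, scale (1-p)/p), i.e. rate p/(1-p)
X = poissrnd(lam);                   % Poisson with random mean
x = 0:20;
emp = histc(X,x)/nrep;               % empirical pmf
plot(x,emp,'o',x,nbinpdf(x,r,p),'-')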

6.9

Beta-binomial distribution

A variable with a beta-binomial distribution is distributed as a binomial distribution with parameter p, where p is itself distributed with a beta distribution with parameters q1 and q2.
• P(k | p) = C(n, k) p^k (1 − p)^{n−k}
• f(p) = f_{β_{q1,q2}}(p) = (Γ(q1 + q2)/(Γ(q1)Γ(q2))) p^{q1−1} (1 − p)^{q2−1} 1_{(0,1)}(p)
• pmf: P(k) = C(n, k) (Γ(q1 + q2)/(Γ(q1)Γ(q2))) (Γ(k + q1)Γ(n − k + q2)/Γ(q1 + q2 + n)) = C(n, k) β(k + q1, n − k + q2)/β(q1, q2)
• EX = nq1/(q1 + q2)
• Var X = nq1q2(n + q1 + q2) / ((q1 + q2)²(1 + q1 + q2))

6.10

Zipf or zeta random variable

6.39. pmf: P[X = k] = (1/ξ(p)) (1/k^p), where k ∈ N, p > 1, and ξ is the zeta function defined in (A.13).

6.40. E[X^n] = ξ(p − n)/ξ(p) is finite for n < p − 1, and E[X^n] = ∞ for n ≥ p − 1.

7

PDF Examples

7.1

Uniform Distribution

7.1. Characterization for uniform[a, b]:
(a) f(x) = (1/(b − a)) U(x − a) U(b − x); that is, f(x) = 1/(b − a) for a ≤ x ≤ b and f(x) = 0 for x < a or x > b.
(b) F(x) = 0 for x < a; F(x) = (x − a)/(b − a) for a ≤ x ≤ b; F(x) = 1 for x > b.
(c) ϕX(u) = e^{iu(b+a)/2} sin(u(b − a)/2) / (u(b − a)/2)
(d) MX(s) = (e^{sb} − e^{sa}) / (s(b − a))

7.2. For most purposes, it does not matter whether the value of the density f at the endpoints is 0 or 1/(b − a).

Example 7.3.
• Phase of oscillators ⇒ [−π, π] or [0, 2π]
• Phase of received signals in incoherent communications → usual broadcast carrier phase φ ∼ U(−π, π)
• Mobile cellular communication: multipath → path phases φc ∼ U(−π, π)
• Use with caution to represent ignorance about a parameter taking value in [a, b].

7.4. EX = (a + b)/2, Var X = (b − a)²/12, E[X²] = (1/3)(b² + ab + a²).
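A quick simulation check of these moments (illustrative a and b):

% Check 7.4 by simulation for U(a,b)
a = -1; b = 3; U = a + (b-a)*rand(1,1e6);
[mean(U),   (a+b)/2]
[var(U),    (b-a)^2/12]
[mean(U.^2),(b^2+a*b+a^2)/3]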

X ∼ Uniform U(a, b): fX(x) = (1/(b − a)) 1_{[a,b]}(x); ϕX(u) = e^{iu(b+a)/2} sin(u(b − a)/2)/(u(b − a)/2)
X ∼ Exponential E(λ): fX(x) = λe^{−λx} 1_{[0,∞)}(x); ϕX(u) = λ/(λ − iu)
X ∼ Shifted Exponential (µ, s0): fX(x) = (1/(µ − s0)) e^{−(x−s0)/(µ−s0)} 1_{[s0,∞)}(x)
X ∼ Truncated Exponential: fX(x) = (α e^{−αx}/(e^{−αa} − e^{−αb})) 1_{[a,b]}(x)
X ∼ Laplacian L(α): fX(x) = (α/2) e^{−α|x|}; ϕX(u) = α²/(α² + u²)
X ∼ Normal N(m, σ²): fX(x) = (1/(σ√(2π))) e^{−(1/2)((x−m)/σ)²}; ϕX(u) = e^{ium − (1/2)σ²u²}
X ∼ Normal N(m, Λ): fX(x) = (1/((2π)^{n/2} √(det Λ))) e^{−(1/2)(x−m)^T Λ^{−1}(x−m)}; ϕX(u) = e^{ju^T m − (1/2)u^T Λu}
X ∼ Gamma Γ(q, λ): fX(x) = (λ^q x^{q−1} e^{−λx}/Γ(q)) 1_{(0,∞)}(x); ϕX(u) = 1/(1 − iu/λ)^q
X ∼ Pareto Par(α): fX(x) = αx^{−(α+1)} 1_{[1,∞)}(x)
X ∼ Par(α, c) = c Par(α): fX(x) = (α/c)(c/x)^{α+1} 1_{(c,∞)}(x)
X ∼ Beta β(q1, q2): fX(x) = (Γ(q1 + q2)/(Γ(q1)Γ(q2))) x^{q1−1}(1 − x)^{q2−1} 1_{(0,1)}(x)
X ∼ Beta prime: fX(x) = (Γ(q1 + q2)/(Γ(q1)Γ(q2))) x^{q1−1}/(x + 1)^{q1+q2} 1_{(0,∞)}(x)
X ∼ Rayleigh: fX(x) = 2αx e^{−αx²} 1_{[0,∞)}(x)
X ∼ Standard Cauchy: fX(x) = (1/π) 1/(1 + x²)
X ∼ Cau(α): fX(x) = (1/π) α/(α² + x²)
X ∼ Cau(α, d): fX(x) = (Γ(d)/(√π α Γ(d − 1/2))) 1/(1 + (x/α)²)^d
X ∼ Log Normal e^{N(µ,σ²)}: fX(x) = (1/(σx√(2π))) e^{−(1/2)((ln x − µ)/σ)²} 1_{(0,∞)}(x)

Table 3: Examples of probability density functions. Here c, α, q, q1, q2, σ, λ are all strictly positive and d > 1/2. γ = −ψ(1) ≈ 0.5772 is the Euler constant. ψ(z) = (d/dz) log Γ(z) = (log e) Γ′(z)/Γ(z) is the digamma function. B(q1, q2) = Γ(q1)Γ(q2)/Γ(q1 + q2) is the beta function.

7.5. The product X of two independent U[0, 1] random variables has fX(x) = −(ln x) 1_{[0,1]}(x) and FX(x) = x − x ln x on [0, 1]. This comes from
P[X > x] = ∫_0^1 (1 − F_U(x/t)) dt = ∫_x^1 (1 − x/t) dt.

7.2

Gaussian Distribution

7.6. Gaussian distribution, often called normal:
(a) Denoted by N(m, σ²).
(b) N(0, 1) is the standard Gaussian (normal) distribution.
(c) fX(x) = (1/(√(2π)σ)) e^{−(1/2)((x−m)/σ)²}.
(d) FX(x) = normcdf(x,m,sigma).
• The standard normal cdf is sometimes denoted by Φ(x). It inherits all properties of a cdf. Moreover, note that Φ(−x) = 1 − Φ(x).
(e) An arbitrary Gaussian random variable with mean m and variance σ² can be represented as σS + m, where S ∼ N(0, 1).
(f) ϕX(v) = E[e^{jvX}] = e^{jmv − (1/2)v²σ²}.
(g) MX(s) = e^{sm + (1/2)s²σ²}.
(h) Fourier transform: F{fX}(ω) = ∫_{−∞}^{∞} fX(x) e^{−jωx} dx = e^{−jωm − (1/2)ω²σ²}.
• Note that ∫_{−∞}^{∞} e^{−αx²} dx = √(π/α).
(i) P[X > x] = Q((x − m)/σ); P[X < x] = P[X ≤ x] = 1 − Q((x − m)/σ) = Q(−(x − m)/σ) = Φ((x − m)/σ).
• P[|X − µ| < σ] = 0.6827, P[|X − µ| > σ] = 0.3173;
P[|X − µ| < 2σ] = 0.9545, P[|X − µ| > 2σ] = 0.0455.

Figure 15: Probability density function of X ∼ N(m, σ²).

7.7. Properties
(a) P[|X − µ| < σ] = 0.6827; P[|X − µ| > σ] = 0.3173;
P[|X − µ| > 2σ] = 0.0455; P[|X − µ| < 2σ] = 0.9545.
(b) Moments and central moments:
(i) E[(X − µ)^k] = (k − 1) E[(X − µ)^{k−2}] = 0 for k odd, and = 1 · 3 · 5 · · · (k − 1) σ^k for k even.
(ii) E[|X − µ|^k] = √(2/π) · 2 · 4 · 6 · · · (k − 1) σ^k for k odd, and = 1 · 3 · 5 · · · (k − 1) σ^k for k even.
(iii) Var[X²] = 4µ²σ² + 2σ⁴.
n: 0, 1, 2, 3, 4
E[X^n]: 1, µ, µ² + σ², µ(µ² + 3σ²), µ⁴ + 6µ²σ² + 3σ⁴
E[(X − µ)^n]: 1, 0, σ², 0, 3σ⁴
(c) For N(0, 1) and k ≥ 1, E[X^k] = (k − 1) E[X^{k−2}] = 0 for k odd, and = 1 · 3 · 5 · · · (k − 1) for k even.
The first equality comes from integration by parts. Observe also that E[X^{2m}] = (2m)!/(2^m m!).
(d) Lévy–Cramér theorem: If the sum of two independent non-constant random variables is normally distributed, then each of the summands is normally distributed.
• Note that ∫_{−∞}^{∞} e^{−αx²} dx = √(π/α).

7.8 (Length bound). For X ∼ N(0, 1) and any (Borel) set B,
P[X ∈ B] ≤ ∫_{−|B|/2}^{|B|/2} fX(x) dx = 1 − 2Q(|B|/2),
where |B| is the length (Lebesgue measure) of the set B. This is because the probability is concentrated around 0. More generally, for X ∼ N(m, σ²),
P[X ∈ B] ≤ 1 − 2Q(|B|/(2σ)).
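The probabilities in 7.7(a) can be reproduced directly from the standard normal cdf; for instance, in MATLAB:

% The 68%/95% figures in 7.7(a)
p1 = normcdf(1) - normcdf(-1)        % P[|X - mu| < sigma]  = 0.6827
p2 = normcdf(2) - normcdf(-2)        % P[|X - mu| < 2sigma] = 0.9545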

7.9 (Stein's Lemma). Let X ∼ N(µ, σ²), and let g be a differentiable function satisfying E|g′(X)| < ∞. Then
E[g(X)(X − µ)] = σ² E[g′(X)]
[2, Lemma 3.6.5 p 124]. Note that this is simply integration by parts with u = g(x) and dv = (x − µ)fX(x) dx.
• E[(X − µ)^k] = E[(X − µ)^{k−1}(X − µ)] = σ²(k − 1) E[(X − µ)^{k−2}].

7.10. Q-function: Q(z) = ∫_z^∞ (1/√(2π)) e^{−x²/2} dx corresponds to P[X > z] where X ∼ N(0, 1); that is, Q(z) is the probability of the "tail" of N(0, 1). The Q function is then a complementary cdf (ccdf).


Figure 16: Q-function.

(a) Q is a decreasing function with Q(0) = 1/2.
(b) Q(−z) = 1 − Q(z) = Φ(z)
(c) Q^{−1}(1 − Q(z)) = −z
(d) Craig's formula: Q(x) = (1/π) ∫_0^{π/2} e^{−x²/(2 sin²θ)} dθ = (1/π) ∫_0^{π/2} e^{−x²/(2 cos²θ)} dθ, x ≥ 0.
To see this, consider i.i.d. X, Y ∼ N(0, 1). Then
Q(z) = ∬_{(x,y)∈(z,∞)×R} fX,Y(x, y) dx dy = 2 ∫_0^{π/2} ∫_{z/cos θ}^{∞} fX,Y(r cos θ, r sin θ) r dr dθ,
where we evaluate the double integral using polar coordinates [11, Q7.22 p 322].
(e) Q²(x) = (1/π) ∫_0^{π/4} e^{−x²/(2 sin²θ)} dθ

(f) (d/dx) Q(x) = −(1/√(2π)) e^{−x²/2}
(g) (d/dx) Q(f(x)) = −(1/√(2π)) e^{−(f(x))²/2} (d/dx) f(x)
(h) ∫ Q(f(x)) g(x) dx = Q(f(x)) ∫_a^x g(t) dt + ∫ (1/√(2π)) e^{−(f(x))²/2} ((d/dx) f(x)) (∫_a^x g(t) dt) dx
(i) P[X > x] = Q((x − m)/σ); P[X < x] = 1 − Q((x − m)/σ) = Q(−(x − m)/σ).
(j) Approximation:
(i) Q(z) ≈ (1/((1 − a)z + a√(z² + b))) (1/√(2π)) e^{−z²/2}, with a = 1/π and b = 2π;
(ii) (1 − 1/x²) (1/(x√(2π))) e^{−x²/2} ≤ Q(x) ≤ (1/2) e^{−x²/2};
(iii) Q(z) ≈ (1/(z√(2π))) e^{−z²/2} (1 − 0.7/z²), z > 2.
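In MATLAB, Q can be computed from erfc (see 7.11 below), and the approximation in (j)(i) can be checked against it; a short sketch with illustrative z values:

% Q(z) via erfc, compared with the approximation in (j)(i)
Q = @(z) 0.5*erfc(z/sqrt(2));
a = 1/pi; b = 2*pi;
Qapprox = @(z) 1./((1-a)*z + a*sqrt(z.^2+b)) .* exp(-z.^2/2)/sqrt(2*pi);
z = 0:0.5:5;
[Q(z); Qapprox(z)]                   % the two rows should be close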

7.11. Error function (MATLAB): erf(z) = (2/√π) ∫_0^z e^{−x²} dx = 1 − 2Q(√2 z)
(a) It is an odd function of z.
(b) For z ≥ 0, it corresponds to P[|X| < z] where X ∼ N(0, 1/2).
(c) lim_{z→∞} erf(z) = 1
(d) erf(−z) = −erf(z)
(e) Q(z) = (1/2) erfc(z/√2) = (1/2)(1 − erf(z/√2))
(f) Φ(x) = (1/2)(1 + erf(x/√2)) = (1/2) erfc(−x/√2)
(g) Q^{−1}(q) = √2 erfc^{−1}(2q)
(h) The complementary error function: erfc(z) = 1 − erf(z) = 2Q(√2 z) = (2/√π) ∫_z^∞ e^{−x²} dx

7.3

Exponential Distribution

7.12. Denoted by E(λ). It is in fact Γ(1, λ). λ > 0 is a parameter of the distribution, often called the rate parameter.

7.13. Characterized by
• fX(x) = λe^{−λx} U(x);

Figure 17: erf-function and Q-function.

• FX(x) = (1 − e^{−λx}) U(x);

• Survival-, survivor-, or reliability-function: P [X > x] = e−λx 1[0,∞) (x) + 1(−∞,0) (x);

• ϕX(u) = λ/(λ − iu)
• MX(s) = λ/(λ − s) for Re{s} < λ.

7.14. EX = σX = 1/λ, Var[X] = 1/λ².

7.15. median(X) = (1/λ) ln 2, mode(X) = 0, E[X^n] = n!/λ^n.

7.16. Coefficient of variation: CV = σX/EX = 1.

2 λ3

f (x) P [X>x]

and µ4 =

9 . λ4



7.22. h(X) = log λe . 7.23. Can be generated by X = − λ1 ln U where U ∼ U (0, 1). 7.24. MATLAB: • X = exprnd(1/lambda)

• fX (x) = exppdf(x,1/lambda)

• FX (x) = expcdf(x,1/lambda)

73

7.25. Memoryless property : The exponential r.v. is the only continuous r.v. on [0, ∞) that satisfies the memoryless property: P [X > s + x |X > s ] = P [X > x] for all x > 0 and all s > 0 [17, p. 157–159]. In words, the future is independent of the past. The fact that it hasn’t happened yet, tells us nothing about how much longer it will take before it does happen. • In particular, suppose we define the set B + x to be {x + b : b ∈ B}. For any x > 0 and set B ⊂ [0, ∞), we have P [X ∈ B + x|X > x] = P [X ∈ B] because

R

P [X ∈ B + x] = P [X > x]

λe−λt dt τ =t−x = e−λx

B+x

R

B

λe−λ(τ +x) dτ . e−λx

7.26. The difference of two independent E(λ) suppose X and Y √ p is L(λ). In particular, 2 1 2 are i.i.d. E(λ), then E|X − Y | = σX = λ and E [(X − Y ) ] = λ . P 7.27. Consider independent Xi ∼ E (λ). Let Sn = ni=1 Xi . (a) Sn ∼ Γ(n, λ), i.e. its has n-Erlang distribution.

(b) Let N = inf {n : Sn ≥ s}. Then, N ∼ Λ + 1 where Λ ∼ P (λs). 7.28. If independent Xi ∼ E (λi ), then   P (a) min Xi ∼ E λi . i

i

Recall order statistics. Let Y1 = min Xi

h i (b) P min Xi = j = i

λ Pj . λi

Note that this is

m X i=1

Si >

R∞ 0

i

i.i.d. 7.29. If Si , Ti ∼ E (α), then P

i

n X j=1

Tj

!

fXj (t)

Q

P [Xi > t]dt.

i6=j

m−1 X

  n+m−1 1 n+m−1 = i 2 i=0 m−1 X  n + i − 1   1 n+i = . i 2 i=0

Note that we can set up two Poisson processes. Consider the superposed process. We want the nth arrival from the T processes to come before the mth one of the S process.

74

7.4

Pareto: Par(α)–heavy-tailed model/density

7.30. Characterizations: Fix α > 0. (a) f (x) = αx−α−1 U (x − 1) (b) F (x) = 1 −

1 xα

Example 7.31.



U (x − 1) =



0 1−

1 xα

x 0 (a) Also known as Laplace or double exponential. (b) f (x) = α2 e−α|x|  1 αx e x t] = 1 − F (t) = (d) Use



2

e−αt , t ≥ 0 1, t 0. (a) fX (x) =

α 1 , π α2 +x2

α>0  (b) FX (x) = π1 tan− 1 λx + 12 . (c) ϕX (u) = e−α|u|

Note that Cau(α) is simply αX1 where X1 ∼ Cau(1). Also, because fX is even, f−X = fX and thus −X is still ∼ Cau(α). 7.40. Odd moments are not defined; even moments are infinite. • Because the first moment is not defined, central moments, including the variance, are not defined. • Mean and variance do not exist.

• Note that even though the pdf of Cauchy distribution is an even function, the mean is not 0. P 7.41. Suppose Xi are independent Cau(αi ). Then, i ai Xi ∼ Cau(|ai | αi ).  7.42. Suppose X ∼ Cau(α). X1 ∼ Cau α1 .

7.8

More PDFs

7.43. Beta (a) fβq1 ,q2 (x) =

Γ(q1 +q2 ) q1 −1 x Γ(q1 )Γ(q2 )

(1 − x)q2 −1 1(0,1) (x), x ∈ (0, 1)

(b) MATLAB: X = betarnd(q1,q2), fX (x) = betapdf(x,q1,q2), and FX (x) = betacdf(x,q1,q2) (c) For symmetric distributions q1 and q2 are the same. As their values increase the distributions become more peaked. (d) The uniform distribution has q1 = q2 = 1. h i Γ(q1 +i) Γ(q2 +j) 1 +q2 ) (e) E X i (1 − X)j = Γ(qΓ(q Γ(q2 ) 1 +q2 +i+j) Γ(q1 ) (f) E [X] =

q1 q1 +q2

and Var X =

q1 q2 . (q1 +q2 )2 (q1 +q2 +1)

(g) Parameters for random variable with mean m and variance σ 2 : (i) Suppose we want E [X] = m and Var X = σ 2 , then we need q1 = m   (1−m)m and q2 = (1 − m) −1 σ2



(1−m)m σ2

 −1

2 (ii) The support of Y =aX is [0, a]. Suppose we want E  [Y ] = m and  Var Y = σ ,  (a−m)m (a−m)m − 1 and q2 = 1 − m −1 then we set q1 = m a σ2 a σ2

77

7.44. Rice/Rician (a) Characterizations: fix v ≥ 0 and σ > 0,     r rv −(r2 + v 2 ) fR (r) = 2 exp I0 2 1[r > 0] 2 σ 2σ σ where I0 is the modified Bessel function7 of the first kind with order zero. (b) When v = 0, the distribution reduces to a Rayleigh distribution given in (21). p 2 (c) Suppose we have independent N ∼ N (m , σ ), then R = N12 + N22 is Rician with i i p v = m21 + m22 and the same σ. To see this, use (34) and the fact that we can express m1 = v cos φ and m2 = v sin φ for some φ. (d) E [R2 ] = 2σ 2 + v 2 , E [R4 ] = 8σ 4 + 8σ 2 v 2 + v 4 . 7.45. Weibull : For λ > 0 and p > 0, the Weibull(p, λ) distribution [11] is characterized by 1 (a) X = Yλ p where Y ∼ E(1). p

(b) fX (x) = λpxp−1 e−λx , x > 0 p

(c) FX (x) = 1 − e−λx , x > 0 (d) E [X n ] =

8

Γ(1+ n p) n

λp

.

Expectation

Consider probability space (Ω, A, P ) 8.1. Let X + = max (X, 0), and X − = − min (X, 0) = max (−X, 0). Then, X = X + − X − , and X + , X − are nonnegative r.v.’s. Also, |X| = X + + X − 8.2. A random variable X is integrable if and only if ≡ X has a finite expectation ≡ both E [X + ] and E [X − ] are finite ≡ E |X| is finite. ≡ EX is finite ≡ EX is defined. ≡ X∈L 7

I0 (z) =

1 π



ez cos θ dθ. In MATLAB, use besseli(0,z) for I0 (z).

0

78

≡ |X| ∈ L

In which case,

    EX = E X + − E X − Z Z = X (ω) P (dω) = XdP Z Z X = xdP (x) = xP X (dx)

and

Z

XdP = E [1A X] .

A

Definition 8.3. A r.v. X admits (has) an expectation if E [X + ] and E [X − ] are not both equal to +∞. Then, the expectation of X is still given by EX = E [X + ] − E [X − ] with the conventions +∞ + a = +∞ and −∞ + a = −∞ when a ∈ R 8.4. L 1 = L 1 (Ω, A, P ) = the set of all integrable random variables. 8.5. For 1 ≤ p < ∞, the following are equivalent: 1

(a) (E [X p ]) p = 0 (b) E [X p ] = 0 (c) X = 0 a.s. a.s.

8.6. X = Y ⇒ EX = EY

8.7. E [1B (X)] = P (X −1 (B)) = P X (B) = P [X ∈ B]   • FX (x) = E 1(−∞,x] (X)

8.8. Expectation rule: Let X be a r.v. on (Ω, A, P ), with values in (E, E), and distribution P X . Let h : (E, E) → (R, B) be measurable. If • X ≥ 0 or

• h (X) ∈ L 1 (Ω, A, P ) which is equivalent to h ∈ L 1 E, E, P X

then

R R • E [h (X)] = h (X (ω)) P (dω) = h (x) P X (dx) R R • h (X (ω)) P (dω) = h (x) P X (dx)



G

[X∈G]

8.9. Expectation of an absolutely continuous random variable: Suppose X has density fX , then h is P X -integrable if and only if h · fX is integrable w.r.t. Lebesgue measure. In which case, Z Z X E [h (X)] = h (x) P (dx) = h (x) fX (x) dx

and

Z G

X

h (x) P (dx) =

Z G

79

h (x) fX (x) dx

• Caution: Suppose h is an odd function and fX is an even function, we can not conclude that E [h(X)] = 0. One obvious odd-function h is h(x) = x. For example, in (7.40), when X is Cauchy, the expectation does not exist even though the pdf is an even funciton. Of course, in general, if we also know that h(X) is integrable, then E [h(X)] is 0. Expectation of a discrete random variable: Suppose x is a discrete random variable. X EX = xP [X = x] x

and E [g(X)] =

X

g(x)P [X = x].

x

Similarly, E [g(X, Y )] =

XX x

= g(x, y)P [X = x, Y = y].

y

These are called the law/rule of the lazy statistician (LOTUS) [26, Thm 3.6 p 48],[11, p. 149]. R R R R 8.10. P [X ≥ t] dt = P [X > t] dt and P [X ≤ t] dt = P [X < t] dt E

E

E

E

8.11. Expectation and cdf : (a) For nonnegative X, EX =

Z∞

P [X > y] dy =

0

Z∞

(1 − FX (y)) dy =

0

For p > 0, E [X p ] =

Z∞

Z∞

P [X ≥ y] dy

(22)

0

pxp−1 P [X > x]dx.

0

(b) For integrable X, EX =

Z∞

(1 − FX (x)) dx −

Z0

FX (x) dx

−∞

0

(c) For nonnegative integer-valued X, EX = Definition 8.12.

P∞

k=0

P [X > k] =

P∞

k=1

P [X ≥ k].

h i R   (a) Absolute moment: E |X|k = |x|k P X (dx), where we define E |X|0 = 1

h i   R (b) Moment: If E |X|k < ∞, then mk = E X k = xk P X (dx) = the k th moment of X 80

(c) Variance: If E [X 2 ] < ∞, then we define Z    2 Var X = E (X − EX) = (x − EX)2 P X (dx) = E X 2 − (EX)2 = E [X(X − EX)]

2 • Notation: DX , or σ 2 (X), or σX , or VX [26, p. 51] k  k  k k P P P P 2 2 • If Xi ∈ L , then Xi ∈ L , Var Xi exists, and E Xi = EXi i=1

i=1

i=1

i=1

2

• Suppose E [X ] < ∞. If Var X = 0, then X = EX a.s. p (d) Standard Deviation: σX = Var[X]. (e) Coefficient of Variation: CVX =

σX . EX

• It is the standard deviation of the “normalized” random variable • 1 for exponential. (f) Fano Factor (index of dispersion):

X . EX

Var X . EX

• 1 for Poisson. (g) Central Moments: the nth central moment is µn = E [(X − EX)n ]. (i) µ1 = E [X − EX] = 0.

2 (ii) µ2 = σX = Var X

(iii) Apply binomial expansion (X − EX)n =

n P

k=1



n k

X n−k (−EX)k , we have

n   X n µn = mn−k (−m1 )k . k k=1

(iv) By binomial expansion X n = (X − EX + EX)n = we have

n   X n mn = µn−k mk1 . k k=1

(h) Skewness coefficient: γX =

n P

k=1

(23) 

n k

(X − EX)n−k (EX)k , (24)

µ3 3 σX

(i) Describe the deviation of the distribution from a symmetric shape (around the mean) (ii) 0 for any symmetric distribution (i) Kurtosis: κX =

µ4 4 . σX

81

• κX = 3 for Gaussian distribution µ4 4 σX

−3

(k) Cumulants or semivariants: For one variable: γk =

ln (ϕX (v))

. v=0

= EX = m1 , = E [X − EX]2 = m2 − m21 = µ2 = E [X − EX]3 = m3 − 3m1 m2 + 2m31 = µ3 = m4 − 3m22 − 4m1 m3 + 12m21 m2 − 6m41 = µ4 − 3µ22 = γ1 = γ2 + γ12 = γ2 + 3γ1 γ2 + γ13 = γ4 + 3γ22 + 4γ1 γ3 + 6γ12 γ2 + γ14

14:27

(ii) m1 m2 m3 m4

July 28, 2005

(i) γ1 γ2 γ3 γ4

1 ∂k j k ∂v k

P1: Rakesh Tung.cls Tung˙C02

(j) Excess coefficient: εX = κX − 3 =

TABLE 2.1 Product-Moments of Random Variables

Moment Measure of First Second

Third

Fourth

Definition

Continuous variable

Central location

Mean, expected value E( X ) = µx

µx =

Dispersion

Variance, Var( X ) = µ2 = σx2

σx2 =

Asymmetry

Peakedness

∞

−∞

∞

−∞

Discrete variable

x f x (x) dx

µx =

(x − µx ) 2 f x (x) dx

σx2 =



σx =



x p(xk ) all x  s k





all x  s

Standard deviation, σx

σx =

Coefficient of variation, x

x = σx /µx

x = σx /µx

Skewness

µ3 =

µ3 =

Skewness coefficient, γx Kurtosis, κx

γx = µ3 /σx3 ∞ µ4 = −∞ (x

Excess coefficient, εx

κx = µ4 /σx4

κx = µ4 /σx4

εx = κx − 3

εx = κx − 3

Var( X )

∞

−∞

(x − µx ) 3 f x (x) dx



− µx ) 4 f x (x) dx

Var( X )



all x  s γx = µ3 /σx3

µ4 =

(xk − µx ) 2 Px (xk )



all x  s

Sample estimator x¯ =



s2 = s=

xi /n

1 n−1





1 n−1

Cv = s/x¯ (xk − µx ) 3 p x (xk )

m3 =

m4 =



(xi − x¯ ) 2

n (n−1)(n−2)

g = m3 (xk − µx ) 4 p x (xk )

(xi − x¯ ) 2

/s3



(xi − x¯ ) 3

n(n+1) (n−1)(n−2)(n−3)

k = m4 /s4



(xi − x¯ ) 4

Figure 18: Product-Moments of Random Variables [25]

37

8.13. • For c ∈ R, E [c] = c

• E [·] is a linear operator: E [aX + bY ] = aEX + bEY . • In general, Var[·] is not a linear operator.

8.14. All pairs of mean and variance are possible. A random variable X with EX = m and Var X = σ 2 can be constructed by setting P [X = m − a] = P [X = m + a] = 12 . Definition 8.15. • Correlation between X and Y : E [XY ]. 82

Model

E [X]

Var [X]

U{0,1,...,n−1} B(n, p) G(β) P(λ) U(a, b) E(λ)

n−1 2

n2 −1 12

np

np(1-p)

β 1−β

β (1−β)2

Par(α)



λ

λ

a+b 2 1 λ

(b−a)2 12 1 λ2

α , α−1

∞,

L(α) N (m, σ 2 ) N (m, Λ) Γ (q, λ)

α>1 0 0. Then, pX,Y (x0 , g(x0 )) = pX (x0 ). We only need to show that pY (g(x0 )) 6= 1 to show that X and Y are not independent. For example, let X be uniform on {±1, ±2} and Y = |X|. Consider the point x0 = 1. • Continuous: Let Θ be uniform on an interval of length 2π. Set X = cos Θ and Y = sin Θ. See (12.6). 8.17. Suppose X and Y are i.i.d. random variables. Then, p √ E [(X − Y )2 ] = 2σX .

8.18. See (5.49) for relationships between expectation and independence. Example 8.19 (Martingale betting strategy). Fix a > 0. Suppose X0 , X1 , X2 , . . . are independent random variables with P [Xi = 1] = p and P [Xi = 0] = 1 − p. Let N = inf {i : Xi = 1}. Also define   0,  N =0 N −1 P i L(N ) = r , N ∈N  a i=0

and G(N ) = arN − L(N ). To have G(N ) > 0, need

k−1 P i=0

ri < rk ∀k ∈ N which turns out to

require r ≥ 2. In fact, for r ≥ 2, we have G(N ) ≥ a ∀N ∈ N ∪ {0}. Hence, E [G(N )] ≥ a. It is exactly a when r = P 2. n rn −1 Now, E [L (N )] = a ∞ n=1 r−1 p (1 − p) = ∞ if and only if r(1 − p) ≥ 1. When 1 − p ≤ 12 , because we already have r ≥ 2, it is true that r(1 − p) ≥ 1.

9

Inequalities

9.1. Let (Ai : i ∈ I) be a finite family of events. Then 2  P ! P (Ai ) [ X i PP ≤P Ai ≤ P (Ai ) P (Ai ∩ Aj ) i i i

j

85

a.s. a) X n   X  X n  1 , X  1 , and   X n     X  . P b) X n   X   X n    X  .

9.2. Classical [23, p. 14]Probability

! 64) Interchanging summation.  nand 1) Birthdayintegration Paradox: In a group of\ n23 randomly n selected people, the probability that Y 1 at least two (assuming − will 1 −share a birthday ≤P Ani − birthdays P (Ai )are ≤ equally (n − 1)likely n− n−1to. occur  n day of the year) about 0.5. See (3).and (3) ↓ i=1 i=1 also (1)on anyXgiven converges a.s., is(2) a.e., , then n

a) If



f

n

↓ 1 − e ≈ −0.37

n 1

Events

k 1

k

g

gL

1

f n , ⎛ f n 1 ⎞Ln and⎛ n ⎞ f nnd     f n d − n.n−1 2) n− ⎜19. 1 − ⎟ ≤ P ⎜ ∩ Ain⎟ − ∏ P ( Ai ) n≤ ( n − 1) n See figure ⎝



n⎠

⎝ i =1



• |P(A−11 ≈− ∩0.37 A2 ) − P (A1 ) P (A2 )| ≤ 14 .

b) If

   X e

n 1

n

X

   , then

n1 1

[Szekely’86, p 14].

↓ 1

i =1

converges absolutely a.s., and integrable.

n

1

   Moreover,    X n      X n  .  n 1  x n 1 ⎛ ⎝

− ⎜ 1−

1⎞



0.5

x⎠ −x

Inequality

( x− 1) ⋅ x

x− 1

0

65) Let  Ai : i  I  be a finite −family of events. Then 1 e 2

a)

4 6 8 10  2   P  A1i   x 10    i        P A P  Ai  . i n n 1 T Q P A  A P ( A1 ∩ A2 Figure − P ( A119: Ai2 ) ≤ j for  P  i Ai  − i P (Ai). ) ) P( Bound i

4

j

i=1

i=1

Random Variable

1   X: P[|X| a  ≥ a]≤ X1 E,|X|, 66) Markov’s Inequality: a > a0.> 0. 9.3. 3) Markov’s Inequality a on a finite set {a1 ,… , an } . Then, the Let i.i.d. X 1 ,…, X r be uniformly distributed a r −1

− i⎞ ⎛ probability that X 1 ,…, X r are all distinct is pu ( nx, r ) = 1 − ∏ ⎜ 1 − ⎟ ≈ 1 − e n⎠ i =1 ⎝ a) Special case: the birthday paradox in (1).

r ( r −1) 2n

pu(n,r) for n = 365 1

a

0.9

pu ( n, r )

a1 x  a  

0.6



0.8 0.7

1− e

0.6



p = 0.9 p = 0.7 p = 0.5

0.5

r ( r −1) 2n

0.4

r n

0.5 0.4

r = 23

0.3

n = 365

0.2

0.2

0

5

10

15

20

25

30

35

40

45

50

x

0.1

a

0.1 0

r = 23 n = 365

p = 0.3 p = 0.1

0.3

55

0

0

50

100

150

r

200

250

300

350

n

20: from Proof Inequality a  Figure X comes a) Useless when . Hence, good the “tails” of a b) The approximation 1 +ofx Markov’s ≈for e x . bounding c) From the approximation, to have pu ( n, r ) = p , we need distribution.

b) Remark: P  X  a   P  X  a  . 86 1 c)   X  a X   , a > 0. a

.

(a) Useless when a ≤ E |X|. Hence, good for bounding the “tails” of a distribution. (b) Remark: P [|X| > a] ≤ P [|X| ≥ a] (c) P [|X| ≥ aE |X|] ≤ a1 , a > 0. (d) Suppose g is a nonnegative function. Then, ∀α > 0 and p > 0, we have (i) P [g (X) ≥ α] ≤

(E [(g (X))p ])

1 αp

(ii) P [g (X − EX) ≥ α] ≤

(E [(g (X − EX))p ])

1 αp

(e) Chebyshev’s Inequality : P [|X| > a] ≤ P [|X| ≥ a] ≤ (i) P [|X| ≥ α] ≤

1 αp

1 EX 2 , a2

a > 0.

(E [|X|p ])

(ii) P [|X − EX| ≥ α] ≤

2 σX ; α2

that is P [|X − EX| ≥ nσX ] ≤

• Useful only when α > σX

(iii) For a < b, P [a ≤ X ≤ b] ≥ 1 −

4 (b−a)2



2 σX

+ EX −

(f) One-sided Chebyshev inequalities: If X ∈ L2 , for a > 0, (i) If EX = 0, P [X ≥ a] ≤

1 n2

 a+b 2 2



EX 2 EX 2 +a2

(ii) For general X,

i. P [X ≥ EX + a] ≤ ii. P [X ≤ EX − a] ≤

2 σX 2 σX +a2 2 σX 2 σX +a2 2 2σX 2 +a2 σX 2 σX X a2

iii. P [|X − EX| ≥ a] ≤ better bound than

; that is P [X ≥ EX + nσX ] ≤

1 1+n2

; that is P [X ≤ EX − nσX ] ≤

1 1+n2

; that is P [|X − EX| ≥ nσX ] ≤

2 1+n2

This is a

iff σ > a

(g) Chernoff bounds: (i) P [X ≤ b] ≤ (ii) P [X ≥ b] ≤

E[e−θX ] e−θb E[eθX ] eθb

∀θ > 0

∀θ > 0

This can be optimized over θ 9.4. Suppose |X| ≤ M a.s., then P [|X| ≥ a] ≥ 9.5. X ≥ 0 and E [X 2 ] < ∞ ⇒ P [X > 0] ≥

E|X|−a ∀a M −a

∈ [0, M )

(EX)2 E[X 2 ]

Definition 9.6. If p and q are positive real numbers such that p + q = pq, or equivalently, 1 + 1q = 1, then we call p and qa pair of conjugate exponents. p • 1 < p, q < ∞

• As p → 1, q → ∞. Consequently, 1 and ∞ are also regarded as a pair of conjugate exponents. 87

9.7. H¨ older’s Inequality : X ∈ Lp , Y ∈ Lq , p > 1,

1 p

+

1 q

= 1. Then,

(a) XY ∈ L1 1

1

(b) E [|XY |] ≤ (E [|X|p ]) p (E [|Y |q ]) q with equality if and only if E [|Y |q ] |X (ω)|p = E [|X|p ] |Y (ω)|q a.s. 9.8. Cauchy-Bunyakovskii-Schwartz Inequality: If X, Y ∈ L2 , then XY ∈ L1 and   1   1 |E [XY ]| ≤ E [|XY |] ≤ E |X|2 2 E |Y |2 2 or equivalently,    2  E Y (E [XY ])2 ≤ E X 2

with equality if and only if E [Y 2 ] X 2 = E [X 2 ] Y 2 a.s. (a) |Cov (X, Y )| ≤ σX σY . (b) (EX)2 ≤ EX 2 (c) (P (A ∩ B))2 ≤ P (A)P (B)

1

9.9. Minkowski’s Inequality : p ≥ 1, X, Y ∈ Lp ⇒ X + Y ∈ Lp and (E [|X + Y |p ]) p ≤ 1 1 (E [|X|p ]) p + (E [|Y |p ]) p 9.10. p > q > 0 (a) E [|X|q ] ≤ 1 + E [|X|p ] 1

1

(b) Lyapounov’s inequality : (E [|X|q ]) q ≤ (E [|X|p ]) p q     1 • E [|X|] ≤ E |X|2 ≤ E |X|3 3 ≤ · · · q   • E|X − Y | ≤ E (X − Y )2

◦ If X and Y are independent, then the RHS is √ ◦ If X and Y are i.i.d., then the RHS is 2σX .

p 2 σX + σY2 + (EX − EY )2 .

9.11. Jensen’s Inequality : For a random variable X, if 1) X ∈ L1 (and ϕ (X) ∈ L1 ); 2) X ∈ (a, b) a.s.; and 3) ϕ is convex on (a, b), then ϕ (EX) ≤ E [ϕ (X)]  1 • For X > 0 (a.s.), E X1 ≥ EX . 9.12.

• For p ∈ (0, 1], E [|X + Y |p ] ≤ E [|X|p ] + E [|Y |p ].

• For p ≥ 1, E [|X + Y |p ] ≤ 2p−1 (E [|X|p ] + E [|Y |p ]).

9.13 (Covariance Inequality). Let X be any random variable and g and h any function such that E [g(X)], E [h(X)], and E [g(X)h(X)] exist. 88

• If g and h are either both non-decreasing or non-increasing, then E [g(X)h(X)] ≥ E [g(X)] E [h(X)] . In particular, for nondecreasing g, E [g(X)(X − EX)] ≥ 0.

• If g is non-decreasing and h is non-increasing, then

E [g(X)h(X)] ≤ E [g(X)] E [h(X)] . See also (25), (26), and [2, p. 191–192].

10

Random Vectors

In this article, a vector is a column matrix with dimension n × 1 for some n ∈ N. We use 1 to denote a vector with all element being 1. Note that 1(1T ) is a square matrix with all element being 1. Finally, for any matrix A and constant a, we define the matrix A + a to be the matrix A with each of the components are added by a. If A is a square matrix, then A + a = A + a1(1T ). Definition 10.1. Suppose I is an index set. When Xi ’s are random variables, we define a random vector XI by XI = (Xi : i ∈ I). For example, if I = [n], we have XI = (X1 , X2 , . . . , Xn ). Note also that X[n] is usually denoted by X1n . Sometimes, we simply write X to denote X1n . • For disjoint A, B, XA∪B = (XA , XB ).

• For vector xI , yI , we say x ≤ y if ∀i ∈ I, xi ≤ yi [9, p 206].

• When the dimension of X is implicit, we simply write X and x to represent X1n and xn1 , respectively.  • For random vector X, Y , we use (X, Y ) to represent the random vector X or Y  T T T equivalently X Y .

Definition 10.2. Half-open cell or bounded rectangle in Rk is set of the form Ia,b = k

{x : ai < xi ≤ bi , ∀i ∈ [k]} = × (ai , bi ]. For a real function F on Rk , the difference of F i=1

around the vertices of Ia,b is ∆Ia,b F =

X v

 X sgnIa,b (v) F (v) = (−1)|{i:vi =ai }| F (v)

(27)

v

where the sum extending over the 2k vertices v of Ia,b . (The ith coordinate of the vertex v could be either ai or bi .) In particular, for k = 2, we have F (b1 , b2 ) − F (a1 , b2 ) − F (b1 , a2 ) + F (a1 , a2 ) .

89

10.3 (Joint cdf). FX (x) = FX1k

xk1



h

i

= P [X1 ≤ x1 , . . . , Xk ≤ xk ] = P X ∈ Sxk1 = P X (Sx )

where Sx = {y : yi ≤ xi , i = 1, . . . , k} consists of the points “southwest” of x. • ∆Ia,b FX ≥ 0 • The set Sx is an orthanlike or semi-infinite corner with “northeast” vertex (vertex in the direction of the first orthant) specified by the point x [9, p 206]. C1 FX is nondecreasing in each variable. Suppose ∀i yi ≥ xi , then FX (y) ≥ FX (x) C2 FX is continuous from above: lim FX (x1 + h, . . . , xk + h) = FX (x) h&0

C3 xi → −∞ for some i (the other coordinates held fixed), then FX (x) → 0 If ∀i xi → ∞, then FX (x) → 1. • lim FX (x1 − h, . . . , xk − h) = P X (Sx◦ ) where Sx◦ = {y : yi < xi , i = 1, . . . , k} is the h&0

interior of Sx

• Given a ≤ b, P [X ∈ Ia,b ] = ∆Ia,b FX . This comes from (12) with Ai =[Xi ≤ ai ] and B = [∀ Xi ≤ bi ]. Note that ai , i ∈ I, P [∩i∈I Ai ∩ B] = F (v) where vi = bi , otherwise. • For any function F on Rk with satisfies (C1), (C2), and (C3), there is a unique probability measure µ on BRk such that ∀a, ∀b ∈ Rk with a ≤ b, we have µ (Ia,b ) = ∆Ia,b F (and ∀x ∈ Rk µ (Sx ) = F (x)). • TFAE: (a) FX is continuous at x (b) FX is continuous from below (c) FX (x) = P X (Sx◦ ) (d) P X (Sx ) = P X (Sx◦ ) (e) P X (∂Sx ) = 0 where ∂Sx = Sx − Sx◦ = {y : yi ≤ xi ∀i, ∃j yj = xj } • If k > 1, FX can have discontinuity points even if P X has no point masses. • FX can be discontinuous at uncountably many points. • The continuity points of FX are dense. • For any j, we have lim FX (x) = FXI\{j} (xI\{j} ) xj →∞

90

10.4 (Joint pdf). A function f is a multivariate or joint pdf (commonly called a density) if and only if it satisfies the following two conditions: (a) f ≥ 0; R (b) f (x)d(x) = 1.

• The integrability of the pdf f implies that for all i ∈ I lim f (xI ) = 0.

xi →±∞

• P [X ∈ A] =

R

A

fX (x)dx.

• Remarks: Roughly, we may say the following: (a) fX (x) =

lim

∀i, ∆xi →0

P [∀i, xiQ 0. such that pX x(0) > 0. Pick a point y(1) such that  Then, pX,Y x(0) , y (1) = 0 but pX x(0) pY y (1) > 0. Note that this technique does not always work. For example, if g is a constant function which maps all values of x to a constant c. Then, we will not be able to find y (1) . Of course, this is to be expected because we know that a constant is always independent of other random variables.

12.1

SISO case

There are many techniques for finding the cdf and pdf of Y = g(X). (a) One may first find FY (y) = P [g(X) ≤ y] first and then find fY from FY0 (y). In which case, the Leibniz’ rule in (56) will be useful. (b) Formula (31) below provides a convenient way of arriving at fY from fX without going through FY . 12.3 (Linear transformation). Y = aX + b where a 6= 0. (  FX y−b , a>0 a    FY (y) = y−b − 1 − FX , a0 b

Erlang x>0 b, n

X1 + ⋅⋅⋅ + XK Standard uniform 0 < x 0 n

n2 → ∞ n1 X X1/n1 X2 /n2

1 X

a=b=1

a=n

X1 + ⋅⋅⋅ + XK

n→∞

X1 X1 + X2

Gamma x>0 a, b

n = 2a a = 1/2

XK2

Beta 00 b

Weibull x>0 a, b

Triangular –1 < x < 1

Figure 2.15 Relationships among univariate distributions. (After Leemis, 1986.)

Figure 21: Relationships among univariate distributions [25] 104

a=0 b=1 Uniform a 0 0

[19, eq (9.13), p. 318)]. (e) For Z =

Y , X

Z Z y  y , y dy = |x| fX,Y (x, xz)dx. fZ (z) = fX,Y z z

Similarly, when Z =

X , Y

fZ (z) = (f) For Z =

Z

|y| fX,Y (yz, y)dy.

min(X,Y ) max(X,Y )

where X and Y are strictly positive, Z ∞ Z ∞ FZ (z) = FY |X (zx|x)fX (x)dx + FX|Y (zy|y)fY (y)dy, 0 0 Z ∞ Z ∞ fZ (z) = xfY |X (zx|x)fX (x)dx + yfX|Y (zy|y)fY (y)dy, 0

0

[11, Ex 7.17]. 107

0 < z < 1.

1 2



.

(34)

12.9 (Random sum). Let S =

PN

i=1

Vi where Vi ’s are i.i.d. ∼V independent of N .

(a) ϕS (u) = ϕN (−i ln (ϕV (u))).

• ϕS (u) = GN (ϕV (u)) • For non-negative integer-valued summands, we have GS (z) = GN (GV (z)) (b) ES = EN EV . (c) Var [S] = EN (Var V ) + (EV )2 (Var N ). Remark : If N ∼ P (λ), then ϕS (u) = exp (λ (ϕV (u) − 1)), the compound Poisson distribution CP (λ, L (V )). Hence, the mean and variance of CP (λ, L (V )) are λEV and λEV 2 respectively.

12.3

MIMO case

Definition 12.10 (Jacobian). In vector calculus, the Jacobian is shorthand for either the Jacobian matrix or its determinant, the Jacobian determinant. Let g be a function from a subset D of Rn to Rm . If g is differentiable at z ∈ D, then all partial derivatives exists at z and the Jacobian matrix of g at a point z ∈ D is   ∂g1 ∂g1 (z) (z) · · · ∂x   ∂x1 n ∂g ∂g   .. . . .. ..  = (z) , . . . , (z) . dg (z) =  . ∂x1 ∂xn ∂gm ∂gm (z) · · · ∂xn (z) ∂x1

∂(g1 ,...,gn ) [9, p 242], Jg (x) where the it Alternative notations for the Jacobian matrix are J, ∂(x 1 ,...,xn ) is assumed that the Jacobian matrix is evaluated at z = x = (x1 , . . . , xn ).

• Let A be an n-dimensional Q “box” defined by the corners x and x+∆x. The “volume” of the image g(A) is ( i ∆xi ) |det dg(x)|. Hence, the magnitude of the Jacobian determinant gives the ratios (scaling factor) of n-dimensional volumes (contents). In other words, ∂(y1 , . . . , yn ) dx1 · · · dxn . dy1 · · · dyn = ∂(x1 , . . . , xn ) • d(g −1 (y)) is the Jacobian of the inverse transformation. • In MATLAB, use jacobian. See also (A.16). 12.11 (Jacobian formulas). Suppose g is a vector-valued function of x ∈ Rn , and X is an Rn -valued random vector. Define Y = g(X). (Then, Y is also an Rn -valued random vector.) If X has joint density fX , and g is a suitable invertible mapping (such that the inversion mapping theorem is applicable), then fY (y) =

    1 −1 det d g −1 (y) fX g −1 (y) . f g (y) = X |det (dg (g −1 (y)))| 108

• Note that for any matrix A, det(A) = det(AT ). Hence, the formula above could tolerate the incorrect “Jacobian”. In general, let X = (X1 , X2 , . . . , Xn ) be a random vector with pdf fX (x). Let S = {x : fX (x) > 0}. Consider a new random vector Y = (Y1 , Y2 , . . . , Yn ), defined by Yi = gi (X). Suppose that A0 , A1 , . . . , Ar form a partition of S with these properties. The set A0 , which may be empty, satisfies P [X ∈ A0 ] = 0. The transformation Y = g(X) = (g1 (X), . . . , gn (X)) is a one-to-ont transformation from Ai onto some common set B for each i ∈ [k]. Then, for each i, the inverse functions from B to Ai can be found. Denote (k) the kth inverse x = h(k) (u) by xj = hj (y). This kth inverse gives, for y ∈ B, the unique x ∈ Ak such that y = g(x). Assuming that the Jacobians det(dh(k) (y)) do not vanish identically on B, we have fY (y) =

r X k=1

[2, p. 185].

fX (h(k) (y)) det(dh(k) (y)) ,

y∈B

• Suppose for some k, Yk is some functions of other Yi . In particular, suppose Yk = h(yI ) for some index set I and some deterministic function h. Then, the kth row of the Jacobian matrix is a linear combination of other rows. In particular,   ∂yi ∂yk X ∂ = h (yI ) . ∂xj ∂yi ∂xj i∈I Hence, the Jacobian determinant is 0. 12.12. Suppose Y = g(X) where both X and Y have the same dimension, then the joint density of X and Y is fX,Y (x, y) = fX (x)δ(y − g(x)). • In most cases, we can show that X and Y are not independent, pick a point x(0)   such that fX x(0) > 0. Pick a point y(1) such that y (1) 6= g x(0) and fY y (1) > 0.  Then, fX,Y x(0) , y (1) = 0 but fX x(0) fY y (1) > 0. Note that this technique does not always work. For example, if g is a constant function which maps all values of x to a constant c. Then, we will not be able to find y (1) . Of course, this is to be expected because we know that a constant is always independent of other random variables. Example 12.13. (a) For Y = AX + b, where A is a square, invertible matrix, fY (y) =

 1 fX A−1 (y − b) . |det A|

(35)

(b) Transformation between Cartesian coordinates (x, y) and polar coordinates (r, θ) 109

p  • x = r cos θ, y = r sin θ, r = x2 + y 2 , θ = tan−1 xy . ∂x ∂x cos θ −r sin θ ∂r ∂θ = r. (Recall that dxdy = rdrdθ). • ∂y ∂y = sin θ r cos θ ∂r ∂θ

We have

fR,Θ (r, θ) = rfX,Y (r cos θ, r sin θ) , and fX,Y (x, y) = p

1 x2 + y 2

fR,Θ

r ≥ 0 and θ ∈ (−π, π),

p  y  x2 + y 2 , tan−1 . x

If, furthermore, Θ is uniform on (0, 2π) and independent of R. Then,  p 1 1 p fX,Y (x, y) = x2 + y 2 . fR 2π x2 + y 2

(c) A related transformation is given by Z = X 2 + Y 2 and Θ = tan−1 √ √ X = Z cos Θ, Y = Z sin Θ, and

which gives (33).

(36)

 √ √ 1 fZ,Θ (z, θ) = fX,Y z cos θ, z sin θ 2

Y X



. In this case,

(37)

√ Y are 12.14. Suppose X, Y are i.i.d. N (0, σ 2 ). Then, R = X 2 + Y 2 and Θ = arctan X   2 1 r independent with R being Rayleigh fR (r) = σr2 e− 2 ( σ ) U (r) and Θ being uniform on [0, 2π]. 12.15 (Generation of a random sample of a normally distributed random variable). Let U1 , U2 be i.i.d. U(0, 1). Then, the random variables p X1 = −2 ln U1 cos(2πU2 ) p X2 = −2 ln U1 sin(2πU2 )

are i.i.d. N (0, 1). Moreover,

p −2σ 2 ln U1 cos(2πU2 ) p Z2 = −2σ 2 ln U1 sin(2πU2 ) Z1 =

are i.i.d. N (0, σ 2 ).

• The idea is to generate R and Θ according to (12.14) first. • det(dx(u)) = − 2π , u1 = e− u1

2 x2 1 +x2 2

.

12.16. In (12.11), suppose dim(Y ) = dim(g) ≤ dim(X). To find the joint pdf of Y , we introduce “arbitrary” Z = h(X) so that dim YZ = dim(X). 110

12.4

Order Statistics

Given a sample of n random variables X1 , . . . , Xn , reorder them so that Y1 ≤ Y2 ≤ · · · ≤ Yn . Then, Yi is called the ith order statistic, sometimes also denoted Xi:n , X hii , X(i) , Xn:i , Xi,n , or X(i)n . In particular • Y1 = X1:n = Xmin is the first order statistic denoting the smallest of the Xi ’s,

• Y2 = X2:n is the second order statistic denoting the second smallest of the Xi ’s . . ., and • Yn = Xn:n = Xmax is the nth order statistic denoting the largest of the Xi ’s. In words, the order statistics of a random sample are the sample values placed in ascending order [2, p 226]. Many results in this section can be found in [5]. 12.17. Events properties: [Xmin [Xmin [Xmin [Xmin

T ≥ y] = Ti [Xi > y] = Si [Xi ≤ y] = Si [Xi < y] = i [Xi

≥ y] > y] ≤ y] < y]

[Xmax [Xmax [Xmax [Xmax

S ≥ y] = Si [Xi > y] = Ti [Xi ≤ y] = Ti [Xi < y] = i [Xi

≥ y] > y] ≤ y] < y]

Let Ay = [Xmax ≤ y], By = [Xmin > y]. Then, Ay = [∀i Xi ≤ y] and By = [∀i Xi > y]. 12.18 (Densities). Suppose the Xi are absolutely continuous with joint density fX . Let Sy be the set of all n! vector which comes from permuting the coordinates of y. X fY (y) = fX (x), y1 ≤ y2 ≤ · · · ≤ yn . (38) x∈Sy

 Q ∆y is the probability that Yj is in the small interval of To see this, note that fY (y) j j length ∆yj around yj . This probability can be calculated from finding the probability that all Xk fall into the above small regions. From the joint density, we can find the joint pdf/cdf of YI for any I ⊂ [n]. However, in many cases, we can directly reapply the above technique to find the joint pdf of YI . This is especially useful when the Xi are independent or i.i.d. (a) The marginal density fYk can be found by approximating fYk (y) ∆y with n X

X

j=1 I∈([n]\{j}) k−1

P [Xj ∈ [y, y + ∆y) and ∀i ∈ I, Xi ≤ y and ∀r ∈ (I ∪ {k})c , Xr > y],

 where for any set A and integer ` ∈ |A|, we define A` to be the set of all k-element subsets of A. Note also that we assume (I ∪ {k})c = [n] \ (I ∪ {k}). To see this, we first choose the Xj that will be Yk with value around y. Then, we must have k − 1 of the Xi below y and have the rest of the Xi > y. 111

(b) For integers r < s, the joint density fYr ,Ys (yr , ys )∆yr ∆ys can be approximated by the probability that two of the Xi are inside small regions around yr and ys . To make them Yr and Ys , for the other Xi , r − 1 of them before yr , s − r − 1 of them between yr and ys , and n − s of them beyond ys . • fXmax ,Xmin (u, v)∆u∆v can be approximated by by X

(j,k)∈S

P [Xj ∈ [u, u+∆u), Xj ∈ [v, v+∆v), and ∀i ∈ [n]\{j, k} , v < Xi ≤ u],

v ≤ u,

where S is the set of all n(n − 1) pairs (j, k) from [n] × [n] with j 6= k. This is simply choosing the j, k so that Xj will be the maximum with value around u, and Xk will be the minimum with value around v. Of course, the rest of the Xi have to be between the min and max. ◦ When n = 2, we can use (38) to get fXmax ,Xmin (u, v) = fX1 ,X2 (u, v) + fX1 ,X2 (v, u),

v ≤ u.

Note that the joint density at point yI is0 if the the elements in yI are not arranged in the “right” order. 12.19 (Distribution functions). We note again the the cdf may be obtained by integration of the densities in (12.18) as well as by direct arguments valid also in the discrete case. (a) The marginal cdf is FYk (y) =

n X X

j=k I∈([n]) j

P [∀i ∈ I, Xi ≤ y and ∀r ∈ [n] \I, Xr > y].

This is because the event [Yk ≤ y] Pis the same as event that at least k of the Xi are ≤ y. In other words, let N (a) = ni=1 1[Xi ≤ a] be the number of Xi which are ≤ a. " " # # X [ X [Yk ≤ y] = N (y) ≥ k = N (y) = j , (39) i

j≥k

i

where the union is a disjoint union. Hence, we sum the probability that exactly j of the Xi are ≤ y for j ≥ k. Alternatively, note that the event [Yk ≤ y] can also be expressed as a disjoint union [ [Xi ≤ k and exactly k − 1 of the X1 , . . . , Xj−1 are ≤ y] . j≥k

This gives FYk (y) =

n X X

j=k I∈([j−1]) k−1

P [Xj ≤ y, ∀i ∈ I, Xi ≤ y, and ∀r ∈ [j − 1] \ I, Xr > y] .

112

(b) For r < s, Because Yr ≤ Ys , we have [Yr ≤ yr ] ∩ [Ys ≤ ys ] = [Ys ≤ ys ] ,

ys ≤ yr .

By (39), for yr < ys , n [

[Yr ≤ yr ] ∩ [Ys ≤ ys ] = =

!

[N (yr ) = j]

j=r n m [ [



n [

[N (ys ) = m]

m=s

!

[N (yr ) = j and N (ys ) = m],

m=s j=r

where the upper limit of the second union is changed from n to m because we must have N (yr ) ≤ N (ys ). Now, to have N (yr ) = j and N (ys ) = m for m > j is to put j of the Xi in (−∞, yr ], m − j of the Xi in (yr , ys ], and n − m of the Xi in (ys , ∞). (c) Alternatively, for Xmax , Xmin , we have FXmax ,Xmin (u, v) = P (Au ∩ Bvc ) = P (Au ) − P (Au ∩ Bv ) = P [∀i Xi ≤ u] − P [∀i v < Xi ≤ u] where the second term is 0 when u < v. So, FXmax ,Xmin (u, v) = FX1 ,...,Xn (u, . . . , u) when u < v. When v ≥ u, the second term can be found by (27) which gives X FXmax ,Xmin (u, v) = FX1 ,...,Xn (u, . . . , u) − (−1)|i:wi =v| FX1 ,...,Xn (w) =

X

w∈S

(−1)

|i:wi =v|+1

FX1 ,...,Xn (w).

w∈S\{(u,...,u)}

where S = {u, v}n is the set of all 2n vertices w of the “box” × (ai , bi ]. The joint i∈[n]

density is 0 for u < v. 12.20. For independent Xi ’s, (a) fY (y) =

n P Q

fXi (xi )

x∈Sy i=1

(b) Two forms of marginal cdf: FYk (y) =

n X X

j=k I∈([n]) j

=

n X X

j=k I∈([j−1]) k−1

Y i∈I

!

FXi (y) 

FXj (y)

Y

r∈[n]\I

Y i∈I

113



(1 − FXi (y))

!

FXi (y) 

Y

r∈[j−1]\I



(1 − FXr (y))

Q

(c) FXmax ,Xmin (u, v) =

k∈[n]

FXk (u) −

(

(d) The marginal cdf is FYk (y) =

n X X

j=k I∈([n]) j

(e) FXmin (v) = 1 − (f) FXmax (u) =

Q

Q i

0,Q

k∈[n]

Y i∈I

)

u≤v (FXk (u) − FXk (v)), v < u !

FXi (y) 

Y

r∈[n]\I



(1 − FXr (y)) .

(1 − FXi (v)).

FXi (u).

i

i.i.d. 12.21. Suppose Xi ∼ X with common density f and distribution function F . (a) The joint density is given by fY (y) = n!f (y1 )f (y2 ) . . . f (yn ),

y1 ≤ y2 ≤ · · · ≤ yn .

If we define y0 = −∞, yk+1=∞ , n0 = 0, nk+1 = n + 1, then for k ∈ [n] and 1 ≤ n1 < · · · < nk ≤ n, the joint density fYn1 ,Yn2 ,...,Ynk (yn1 , yn2 , . . . , ynk ) is given by n!

k Y

!

f (ynj )

j=1

k Y (F (ynj+1 ) − F (ynj ))nj+1 −nj −1

(nj+1 − nj − 1)!

j=1

.

In particular, for r < s, the joint density fYr ,Ys (yr , ys ) is given by n! f (yr )f (ys )F r−1 (yr )(F (ys ) − F (yr ))s−r−1 (1 − F (ys ))n−s (r − 1)!(s − r − 1)!(n − s)! [2, Theorem 5.4.6 p 230]. (b) The joint cdf FYr ,Ys (yr , ys ) is given by  ys ≤ yr ,  FYs (ys ), n P m P n! (F (yr ))j (F (ys ) − F (yr ))m−j (1 − F (ys ))n−m , yr < ys .  j!(m−j)!(n−m)! m=s j=r

n

(c) FXmax ,Xmin (u, v) = (F (u)) − (d) fXmax ,Xmin (u, v) =





0, u≤v (F (u) − F (v))n , v < u



.

0, u≤v n−2 n (n − 1) fX (u) fX (v) (F (u) − F (v)) , v 0, fR (x) = n(n − 1) (F (u) − F (u − x))n−2 f (u − x)f (u)du. R (ii) For x ≥ 0, FR (x) = n (F (u) − F (u − x))n−1 f (u)du.

Both pdf and cdf above are derived by first finding the distribution of the range conditioned on the value of the Xmax = u.

See also [5, Sec. 2.2] and [2, Sec. 5.4]. 12.22. Let X1 , X2 , . . . , Xn be a random sample from a discrete distribution with pmf pX (xi ) = pi , where Pix1 < x2 < . . . are the possible values of X in ascending order. Define P0 = 0 and Pi = k=1 pk , then P [Yj ≤ xi ] =

and P [Yj = xi ] =

n   X n k=j

k

n   X n k=j

k

Pik (1 − Pi )n−k

 k Pik (1 − Pi )n−k − Pi−1 (1 − Pi−1 )n−k . 115

Example 12.23. If U1 , U2 , . . . , Uk are independently uniformly distributed on the interval 0 to t0 , then they have joint pdf  1  , 0 ≤ ui ≤ t0 k tk0 fU1k u1 = 0, otherwise

The order statistics τ1 , τ2 , . . . , τk corresponding to U1 , U2 , . . . , Uk have joint pdf ( k!  , 0 ≤ t1 ≤ t2 ≤ · · · ≤ tk ≤ t0 tk0 fτ1k tk1 = 0, otherwise

Example 12.24 (n = 2). Suppose U = max(X, Y ) and V = min(X, Y ) where X and Y have joint cdf FXY .  FX,Y (u, u), u ≤ v, FU,V (u, v) = , FX,Y (v, u) + FX,Y (u, v) − FX,Y (v, v), u > v FU (u) = FX,Y (u, u), FV (v) = FX (v) − FY (v) − FX,Y (v, v).

[11, Ex 7.5, 7.6]. The joint density is fU,V (u, v) = fX,Y (u, v) + fX,Y (v, u),

v < u.

The marginal densities is given by ∂ ∂ FX,Y (x, y) + FX,Y (x, y) fU (u) = x=u, x=u, ∂x ∂y y=u

=

Zu

y=u

fX,Y (x, u)dx +

−∞

Zu

fX,Y (u, y) dy,

−∞

fV (v) = fX (v) + fY (v) − fU (v)

! ∂ ∂ = fX (v) + fY (v) − FX,Y (x, y) + FX,Y (x, y) x=v, x=v, ∂x ∂y y=v y=v  v  Z Zv = fX (v) + fY (v) −  fX,Y (x, u)dx + fX,Y (u, y) dy  . =

Z

v



fX,Y (v, y)dy +

−∞ Z ∞

−∞

fX,Y (x, v)dx

v

If, furthermore, X and Y are independent, then  FX (u) FY (u) , u≤v FU,V (u, v) = FX (v) FY (u) + FX (u) FY (v) − FX (v) FY (v) , u > v, FU (u) = FX (u) FY (u) , FV (v) = FX (v) + FY (v) − FX (v) FY (v) , fU (u) = fX (u) FY (u) + FX (u) fY (u) , fV (v) = fX (v) + fY (v) − fX (v) FY (v) − FX (v) fY (v) . 116

If, furthermore, X and Y are i.i.d., then
F_U(u) = F^2(u), \qquad F_V(v) = 2F(v) - F^2(v), \qquad f_U(u) = 2 f(u) F(u), \qquad f_V(v) = 2 f(v) (1 - F(v)).

12.25. Let the Xi be i.i.d. with density f and cdf F. The range R is defined as R = X_{\max} - X_{\min}. Then
F_R(x) = \begin{cases} 0, & x < 0, \\ n \int_{\mathbb{R}} (F(u) - F(u-x))^{n-1} f(u)\, du, & x \ge 0, \end{cases}
\qquad
f_R(x) = \begin{cases} 0, & x \le 0, \\ n(n-1) \int_{\mathbb{R}} (F(u) - F(u-x))^{n-2} f(u-x) f(u)\, du, & x > 0. \end{cases}

A sequence of random variables (Xn) converges to a random variable X almost surely (a.s.) when
P[\{\omega : \lim_{n \to \infty} |X_n(\omega) - X(\omega)| = 0\}] = 1.

13.3. Properties of convergence a.s.

(a) Uniqueness: if Xn → X a.s. and Xn → Y a.s., then X = Y a.s.

(b) If Xn → X a.s. and Yn → Y a.s., then
(i) Xn + Yn → X + Y a.s.
(ii) Xn Yn → XY a.s.

(c) If g is continuous and Xn → X a.s., then g(Xn) → g(X) a.s.

(d) Suppose that ∀ε > 0, \sum_{n=1}^{\infty} P[|Xn − X| > ε] < ∞. Then Xn → X a.s.

(e) Let A1, A2, ... be independent. Then 1_{A_n} → 0 a.s. if and only if \sum_n P(A_n) < ∞.

Definition 13.4 (Convergence in probability). The following statements are all equivalent conditions/notations for a sequence of random variables (Xn) to converge in probability to a random variable X.

(a) Xn → X in probability
(i) alternative notation: Xn →P X
(ii) p lim_{n→∞} Xn = X

(b) (Xn − X) → 0 in probability

(c) ∀ε > 0, lim_{n→∞} P[|Xn − X| < ε] = 1
(i) ∀ε > 0, lim_{n→∞} P({ω : |Xn(ω) − X(ω)| > ε}) = 0
(ii) ∀ε > 0 ∀δ > 0 ∃N_δ ∈ N such that ∀n ≥ N_δ, P[|Xn − X| > ε] < δ
(iii) ∀ε > 0, lim_{n→∞} P[|Xn − X| > ε] = 0
(iv) The strict inequality between |Xn − X| and ε can be replaced by the corresponding "non-strict" version.

13.5. Properties of convergence in probability

(a) Uniqueness: if Xn → X in probability and Xn → Y in probability, then X = Y a.s.

(b) Suppose Xn → X in probability, Yn → Y in probability, and an → a. Then
(i) Xn + Yn → X + Y in probability
(ii) an Xn → aX in probability
(iii) Xn Yn → XY in probability

(c) Suppose (Xn) are i.i.d. with distribution U[0, θ]. Let Zn = max{Xi : 1 ≤ i ≤ n}. Then Zn → θ in probability.

(d) If g is continuous and Xn → X in probability, then g(Xn) → g(X) in probability.
(i) Suppose that g : R^d → R is continuous. Then Xi,n → Xi in probability for every i implies g(X1,n, ..., Xd,n) → g(X1, ..., Xd) in probability.

(e) Let g be a continuous function at c. Then Xn → c in probability ⇒ g(Xn) → g(c) in probability.

(f) Fatou's lemma: 0 ≤ Xn → X in probability ⇒ lim inf_{n→∞} E Xn ≥ E X.

(g) Suppose Xn → X in probability and |Xn| ≤ Y with E Y < ∞. Then E Xn → E X.

(h) Let A1, A2, ... be independent. Then 1_{A_n} → 0 in probability iff P(A_n) → 0.

(i) Xn → 0 in probability iff ∃δ > 0 such that ∀t ∈ [−δ, δ] we have ϕ_{Xn}(t) → 1.

Definition 13.6 (Weak convergence for probability measures). Let Pn and P be probability measures on R^d (d ≥ 1). The sequence Pn converges weakly to P if the sequence of real numbers ∫ g dPn → ∫ g dP for every g which is real-valued, continuous, and bounded on R^d.

13.7. Let (Xn), X be R^d-valued random variables with distribution functions (Fn), F, distributions (µn), µ, and ch.f. (ϕn), ϕ respectively. The following are equivalent conditions for a sequence of random variables (Xn) to converge in distribution to a random variable X.

(a) (Xn) converges in distribution (or in law) to X
(i) Xn ⇒ X
i. Xn → X in law
ii. Xn → X in distribution
(ii) Fn ⇒ F
i. F_{Xn} ⇒ F_X
ii. Fn converges weakly to F
iii. lim_{n→∞} P^{Xn}(A) = P^X(A) for every A of the form A = (−∞, x] for which P^X{x} = 0
(iii) µn ⇒ µ
i. P^{Xn} converges weakly to P^X

(b) Skorohod's theorem: ∃ random variables Yn and Y on a common probability space (Ω, F, P) such that Yn has the same distribution as Xn, Y has the same distribution as X, and Yn → Y on (the whole) Ω.

(c) lim_{n→∞} Fn = F at all continuity points of F
(i) F_{Xn}(x) → F_X(x) ∀x such that P[X = x] = 0

(d) ∃ a (countable) set D dense in R such that Fn(x) → F(x) ∀x ∈ D

(e) Continuous mapping theorem: lim_{n→∞} E g(Xn) = E g(X) for all g which are real-valued, continuous, and bounded on R^d
(i) lim_{n→∞} E g(Xn) = E g(X) for all bounded real-valued functions g such that P[X ∈ D_g] = 0, where D_g is the set of points of discontinuity of g
(ii) lim_{n→∞} E g(Xn) = E g(X) for all bounded Lipschitz continuous functions g
(iii) lim_{n→∞} E g(Xn) = E g(X) for all bounded uniformly continuous functions g
(iv) lim_{n→∞} E g(Xn) = E g(X) for all complex-valued functions g whose real and imaginary parts are bounded and continuous

(f) Continuity theorem: ϕ_{Xn} → ϕ_X
(i) For nonnegative random variables: ∀s ≥ 0, L_{Xn}(s) → L_X(s), where L_X(s) = E e^{−sX}.

Note that there is no requirement that (Xn) and X be defined on the same probability space (Ω, A, P).

13.8. Continuity Theorem: Suppose lim_{n→∞} ϕn(t) exists ∀t; call this limit ϕ∞. Furthermore, suppose ϕ∞ is continuous at 0. Then there exists a probability distribution µ∞ such that µn ⇒ µ∞ and ϕ∞ is the characteristic function of µ∞.

13.9. Properties of convergence in distribution

(a) If Fn ⇒ F and Fn ⇒ G, then F = G.

(b) Suppose Xn ⇒ X.
(i) If P[X is a discontinuity point of g] = 0, then g(Xn) ⇒ g(X).
(ii) E g(Xn) → E g(X) for every bounded real-valued function g such that P[X ∈ D_g] = 0, where D_g is the set of points of discontinuity of g.
i. In particular, g(Xn) ⇒ g(X) for g continuous.
(iii) If Yn → 0 in probability, then Xn + Yn ⇒ X.
(iv) If Xn − Yn ⇒ 0, then Yn ⇒ X.

(c) If Xn ⇒ a and g is continuous at a, then g(Xn) ⇒ g(a).

(d) Suppose (µn) is a sequence of probability measures on R that are all point masses with µn({αn}) = 1. Then µn converges weakly to a limit µ iff αn → α; and in this case µ is a point mass at α.

(e) Scheffé's theorem:
(i) Suppose P^{Xn} and P^X have densities δn and δ w.r.t. the same measure µ. Then δn → δ µ-a.e. implies
i. ∀B ∈ B_R, P^{Xn}(B) → P^X(B); in particular, F_{Xn} → F_X and F_{Xn} ⇒ F_X.
ii. For bounded g, ∫ g(x) P^{Xn}(dx) → ∫ g(x) P^X(dx); equivalently, E[g(Xn)] → E[g(X)], where each expectation is defined with respect to the appropriate P.
(ii) Remarks:
i. For absolutely continuous random variables, µ is the Lebesgue measure and δ is the probability density function.
ii. For discrete random variables, µ is the counting measure and δ is the probability mass function.

(f) Normal r.v.
(i) Let Xn ∼ N(µn, σn²). Suppose µn → µ ∈ R and σn² → σ² ≥ 0. Then Xn ⇒ N(µ, σ²).
(ii) Suppose that the Xn are normal random variables and Xn ⇒ X. Then (1) the means and variances of the Xn converge to some limits m and σ², and (2) X is normal with mean m and variance σ².

(g) Xn ⇒ 0 if and only if ∃δ > 0 such that ∀t ∈ [−δ, δ] we have ϕ_{Xn}(t) → 1.

13.10 (Convergence in distribution of products of random variables). Let (X_{n,k}) be a triangular array of random variables in (0, c], where c < 1, and let X be a nonnegative random variable. Assume
\sum_k X_{n,k}^2 ⇒ 0 \quad \text{or} \quad \sup_k X_{n,k} ⇒ 0 \quad \text{as } n → ∞.
Then, as n → ∞,
\prod_k (1 − X_{n,k}) ⇒ e^{−X} \quad \text{if and only if} \quad \sum_k X_{n,k} ⇒ X
[12]. See also A.4.

13.11. Relationship between convergences. See Figure 23.

Figure 23: Relationship between convergences (a.s., in probability, in L^p, and in distribution).

(a) For a discrete probability space, Xn → X a.s. if and only if Xn → X in probability.

(b) Suppose Xn → X in probability, |Xn| ≤ Y for all n, and Y ∈ L^p. Then X ∈ L^p and Xn → X in L^p.

(c) Suppose Xn → X a.s., |Xn| ≤ Y a.s. for all n, and Y ∈ L^p. Then X ∈ L^p and Xn → X in L^p.

(d) Xn → X in probability ⇒ Xn ⇒ X.

(e) If Xn ⇒ X and ∃a ∈ R such that X = a a.s., then Xn → X in probability.
(i) Hence, when X = a a.s., Xn → X in probability and Xn ⇒ X are equivalent.

Example 13.12.

(a) Ω = [0, 1], P is uniform on [0, 1]. Let
X_n(ω) = \begin{cases} 0, & ω ∈ [0,\, 1 − \frac{1}{n^2}], \\ e^n, & ω ∈ (1 − \frac{1}{n^2},\, 1]. \end{cases}
(i) Xn does not converge in L^p.
(ii) Xn → 0 a.s.
(iii) Xn → 0 in probability.

(b) Let Ω = [0, 1] with the uniform probability distribution. Define X_n(ω) = ω + 1_{[\frac{j-1}{k}, \frac{j}{k}]}(ω), where k = \lceil \frac{1}{2}(\sqrt{1 + 8n} − 1) \rceil and j = n − \frac{1}{2} k(k − 1). The sequence of intervals [\frac{j-1}{k}, \frac{j}{k}] under the indicator function is shown in Figure 24. Let X(ω) = ω.
(i) The sequence of real numbers (X_n(ω)) does not converge for any ω.
(ii) Xn → X in L^p.
(iii) Xn → X in probability.

Figure 24: Diagram for example (13.12)(b): the indicator intervals sweep [0, 1]; then [0, 1/2] → [1/2, 1]; then [0, 1/3] → [1/3, 2/3] → [2/3, 1]; then [0, 1/4] → [1/4, 1/2] → [1/2, 3/4] → [3/4, 1]; and so on.

(c) X_n = \begin{cases} \frac{1}{n}, & \text{w.p. } 1 − \frac{1}{n}, \\ n, & \text{w.p. } \frac{1}{n}. \end{cases}
(i) Xn → 0 in probability.
(ii) Xn does not converge in L^p.
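Case (c) is easy to see numerically: the probability of a large value shrinks like 1/n, but that rare value is large enough to keep the mean near 1. The sketch below is a minimal simulation of this example; the tolerance ε = 0.05 and the sample size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Example 13.12(c): X_n = 1/n w.p. 1 - 1/n and n w.p. 1/n.
# P[|X_n| > eps] -> 0 (convergence in probability) while E[X_n] stays near 1 (no L^1 convergence).
eps, N = 0.05, 100_000
for n in [10, 100, 1_000, 10_000]:
    x = np.where(rng.random(N) < 1 / n, n, 1 / n)   # N samples of X_n
    print(f"n={n:6d}  P[|X_n|>eps]~{np.mean(np.abs(x) > eps):.4f}  E[X_n]~{x.mean():.3f}")
```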

13.1 Summation of random variables

Set S_n = \sum_{i=1}^{n} X_i and S_n^{(c)} = \sum_{i=1}^{n} X_i^{(c)}, where X_n^{(c)} = X_n 1[X_n \le c] is X_n truncated at c.

Two sequences (Xn) and (Yn) are tail equivalent if \sum_n P[X_n \ne Y_n] < ∞. Suppose (Xn) and (Yn) are tail equivalent. Then
(a) P(\liminf_{n→∞} [X_n = Y_n]) = 1;
(b) ∀ω ∈ \liminf_{n→∞} [X_n = Y_n], ∃N ∈ N such that ∀n ≥ N, X_n(ω) = Y_n(ω);
(c) with probability 1, the series \sum_n X_n converges iff the series \sum_n Y_n converges.

13.13 (Markov's theorem; Chebyshev's inequality). For finite-variance (Xi), if \frac{1}{n^2} \operatorname{Var} S_n → 0, then \frac{1}{n} S_n − \frac{1}{n} E S_n → 0 in probability.

• If the (Xi) are pairwise independent, then \frac{1}{n^2} \operatorname{Var} S_n = \frac{1}{n^2} \sum_{i=1}^{n} \operatorname{Var} X_i. See also (13.17).

13.2 Summation of independent random variables

13.14. For independent Xn, the probability that \sum_i X_i converges is either 0 or 1.

13.15 (Kolmogorov's SLLN). Consider a sequence (Xn) of independent random variables. If \sum_n \frac{\operatorname{Var} X_n}{n^2} < ∞, then \frac{1}{n} S_n − \frac{1}{n} E S_n → 0 a.s.

• In particular, for independent Xn, if \frac{1}{n} \sum_{i=1}^{n} E X_i → a or E X_n → a, then \frac{1}{n} S_n → a a.s.

13.16. Suppose that X1, X2, ... are independent and that the series X = X1 + X2 + ··· converges a.s. Then ϕ_X = ϕ_{X_1} ϕ_{X_2} ···.

13.17. For pairwise independent (Xi) with finite variances, if \frac{1}{n^2} \sum_{i=1}^{n} \operatorname{Var} X_i → 0, then \frac{1}{n} S_n − \frac{1}{n} E S_n → 0 in probability.

(a) Chebyshev's Theorem: for pairwise independent (Xi) with uniformly bounded variances, \frac{1}{n} S_n − \frac{1}{n} E S_n → 0 in probability.

13.3 Summation of i.i.d. random variables

Let (Xi) be i.i.d. random variables.

13.18 (Chebyshev's inequality).
P\left[ \left| \frac{1}{n} \sum_{i=1}^{n} X_i − E X_1 \right| \ge ε \right] \le \frac{1}{ε^2} \frac{\operatorname{Var}[X_1]}{n}.

• The Xi's don't have to be independent; they only need to be pairwise uncorrelated.

13.19 (WLLN). Weak Law of Large Numbers:

(a) L² weak law (finite variance): let (Xi) be uncorrelated (or pairwise independent) random variables.
(i) \operatorname{Var} S_n = \sum_{i=1}^{n} \operatorname{Var} X_i.
(ii) If \frac{1}{n^2} \sum_{i=1}^{n} \operatorname{Var} X_i → 0, then \frac{1}{n} S_n − \frac{1}{n} E S_n → 0 in probability.
(iii) If E X_i = µ and \operatorname{Var} X_i \le C < ∞, then \frac{1}{n} \sum_{i=1}^{n} X_i → µ in L².

(b) Let (Xi) be i.i.d. random variables with E X_i = µ and \operatorname{Var} X_i = σ² < ∞. Then
(i) P\left[ \left| \frac{1}{n} \sum_{i=1}^{n} X_i − E X_1 \right| \ge ε \right] \le \frac{1}{ε^2} \frac{\operatorname{Var}[X_1]}{n};
(ii) \frac{1}{n} \sum_{i=1}^{n} X_i → µ in L².
(The fact that σ² < ∞ implies µ < ∞.)

(c) \frac{1}{n} \sum_{i=1}^{n} X_i → µ in L² implies \frac{1}{n} \sum_{i=1}^{n} X_i → µ in probability, which in turn implies \frac{1}{n} \sum_{i=1}^{n} X_i ⇒ µ.

(d) If the Xn are i.i.d. random variables such that \lim_{t→∞} t\, P[|X_1| > t] = 0, then \frac{1}{n} S_n − E X_1^{(n)} → 0 in probability (here X_1^{(n)} denotes X_1 truncated at n).

(e) Khintchine's theorem: if the Xn are i.i.d. random variables and E|X_1| < ∞, then E X_1^{(n)} → E X_1 and \frac{1}{n} S_n → E X_1 in probability.
• No assumption about the finiteness of variance.

13.20 (SLLN).

(a) Kolmogorov's SLLN: consider a sequence (Xn) of independent random variables. If \sum_n \frac{\operatorname{Var} X_n}{n^2} < ∞, then \frac{1}{n} S_n − \frac{1}{n} E S_n → 0 a.s.
• In particular, for independent Xn, if \frac{1}{n} \sum_{i=1}^{n} E X_i → a or E X_n → a, then \frac{1}{n} S_n → a a.s.

(b) Khintchine's SLLN: if the Xi are i.i.d. with finite mean µ, then \frac{1}{n} \sum_{i=1}^{n} X_i → µ a.s.

(c) Consider a sequence (Xn) of i.i.d. random variables. Suppose E X_1^− < ∞ and E X_1^+ = ∞. Then \frac{1}{n} S_n → ∞ a.s.
• In particular, if the Xn ≥ 0 are i.i.d. random variables and E X_n = ∞, then \frac{1}{n} S_n → ∞ a.s.

13.21 (Relationship between the LLN and the convergence of relative frequency to probability). Consider i.i.d. Z_i ∼ Z. Let X_i = 1_A(Z_i). Then \frac{1}{n} S_n = \frac{1}{n} \sum_{i=1}^{n} 1_A(Z_i) = r_n(A), the relative frequency of an event A. Via the LLN and appropriate conditions, r_n(A) converges to E\left[ \frac{1}{n} \sum_{i=1}^{n} 1_A(Z_i) \right] = P[Z ∈ A].
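The relative-frequency statement in 13.21 is the easiest law of large numbers to watch happen. The sketch below is a minimal simulation, assuming Z ~ Exp(1) and A = {Z > 1}, so that P[Z ∈ A] = e^{-1} ≈ 0.3679; both the distribution and the event are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustration of 13.21: relative frequency r_n(A) -> P[Z in A].
Z = rng.exponential(1.0, size=100_000)
indicators = (Z > 1.0).astype(float)
r = np.cumsum(indicators) / np.arange(1, len(Z) + 1)    # r_n(A) for n = 1, 2, ...
for n in [10, 100, 1_000, 10_000, 100_000]:
    print(f"n={n:6d}  r_n(A)={r[n - 1]:.4f}   (P[Z in A]={np.exp(-1.0):.4f})")
```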

13.4 Central Limit Theorem (CLT)

Suppose that (X_k)_{k≥1} is a sequence of i.i.d. random variables with mean m and variance 0 < σ² < ∞. Let S_n = \sum_{k=1}^{n} X_k.

13.22 (Lindeberg–Lévy theorem).

(a) \frac{S_n − nm}{σ\sqrt{n}} = \frac{1}{\sqrt{n}} \sum_{k=1}^{n} \frac{X_k − m}{σ} ⇒ N(0, 1).

(b) \frac{S_n − nm}{\sqrt{n}} = \frac{1}{\sqrt{n}} \sum_{k=1}^{n} (X_k − m) ⇒ N(0, σ²).

To see this, let Z_k = \frac{X_k − m}{σ} (i.i.d. ∼ Z) and Y_n = \frac{1}{\sqrt{n}} \sum_{k=1}^{n} Z_k. Then E Z = 0, \operatorname{Var} Z = 1, and ϕ_{Y_n}(t) = \left( ϕ_Z(\frac{t}{\sqrt{n}}) \right)^n. By approximating e^x ≈ 1 + x + \frac{1}{2} x^2, we have ϕ_X(t) ≈ 1 + jt\, E X − \frac{1}{2} t^2 E[X^2] (see also (28)) and
ϕ_{Y_n}(t) = \left( 1 − \frac{t^2}{2n} \right)^n → e^{−t^2/2}.

• The case of Bernoulli(1/2) was derived by Abraham de Moivre around 1733. The case of Bernoulli(p) for 0 < p < 1 was considered by Pierre-Simon Laplace [11, p. 208].

13.23 (Approximation of densities and pmfs using the CLT). Approximate the distribution of S_n by N(nm, nσ²).

• F_{S_n}(s) ≈ Φ\left( \frac{s − nm}{σ\sqrt{n}} \right).

• f_{S_n}(s) ≈ \frac{1}{\sqrt{2π}\, σ\sqrt{n}} e^{−\frac{1}{2}\left( \frac{s − nm}{σ\sqrt{n}} \right)^2}.

• If the X_i are integer-valued, then
P[S_n = k] = P\left[ k − \tfrac{1}{2} < S_n \le k + \tfrac{1}{2} \right] ≈ \frac{1}{\sqrt{2π}\, σ\sqrt{n}} e^{−\frac{1}{2}\left( \frac{k − nm}{σ\sqrt{n}} \right)^2}
[11, eq (5.14), p. 213]. The approximation is best for s, k near nm [11, p. 211].

• The approximation n! ≈ \sqrt{2π}\, n^{n + \frac{1}{2}} e^{−n} can be derived by approximating the density of S_n when X_i ∼ E(1). We know that f_{S_n}(s) = \frac{s^{n−1} e^{−s}}{(n−1)!}. Approximating the density at s = n gives (n − 1)! ≈ \sqrt{2π}\, n^{n − \frac{1}{2}} e^{−n}; multiply through by n. [11, Ex. 5.18, p. 212]

• See also the normal approximation for the binomial in (16).
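Both the normal approximation and the Stirling-type by-product can be checked numerically. The sketch below is a minimal illustration, assuming X_i ~ Exp(1) (so m = 1, σ = 1), n = 30, and a few evaluation points z; all of these are example choices, not values from the text.

```python
import numpy as np
from math import erf, sqrt, factorial, pi, e

rng = np.random.default_rng(5)

# Illustration of 13.22/13.23 with X_i ~ Exp(1), so m = 1 and sigma = 1 (example choice).
n, N = 30, 200_000
S = rng.exponential(1.0, size=(N, n)).sum(axis=1)
Zn = (S - n) / np.sqrt(n)                                 # standardized sum, approximately N(0, 1)

Phi = lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0)))          # standard normal cdf
for z in (-1.0, 0.0, 1.0, 2.0):
    print(f"z={z:+.1f}  empirical={np.mean(Zn <= z):.4f}  Phi(z)={Phi(z):.4f}")

# Stirling-type approximation n! ~ sqrt(2*pi) n^(n+1/2) e^(-n) from the last bullet above.
print(f"{factorial(n):.3e}  vs  {sqrt(2 * pi) * n**(n + 0.5) * e**(-n):.3e}")
```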

14 Conditional Probability and Expectation

14.1 Conditional Probability

Definition 14.1. Suppose that, conditioned on X = x, Y has distribution Q. Then we write Y|X = x ∼ Q [26, p 40]. It might be clearer to write P^{Y|X=x} = Q.

14.2. Discrete random variables

(a) p_{X|Y}(x|y) = \frac{p_{X,Y}(x, y)}{p_Y(y)}.

(b) p_{X,Y}(x, y) = p_{X|Y}(x|y)\, p_Y(y) = p_{Y|X}(y|x)\, p_X(x).

(c) The law of total probability: P[Y ∈ C] = \sum_x P[Y ∈ C \mid X = x]\, P[X = x]. In particular, p_Y(y) = \sum_x p_{Y|X}(y|x)\, p_X(x).

(d) p_{X|Y}(x|y) = \frac{p_{Y|X}(y|x)\, p_X(x)}{\sum_{x'} p_{Y|X}(y|x')\, p_X(x')}.

(e) If Y = X + Z where X and Z are independent, then
• p_{Y|X}(y|x) = p_Z(y − x);
• p_Y(y) = \sum_x p_Z(y − x)\, p_X(x);
• p_{X|Y}(x|y) = \frac{p_Z(y − x)\, p_X(x)}{\sum_{x'} p_Z(y − x')\, p_X(x')}.

(f) The substitution law of conditional probability: P[g(X, Y) = z \mid X = x] = P[g(x, Y) = z \mid X = x].
• When X and Y are independent, we can "drop the conditioning".
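Item (e) is worth seeing with numbers: the denominator of the posterior is exactly the discrete convolution that gives p_Y. The sketch below is a minimal example, assuming made-up pmfs for X and Z and observation y = 2; none of these values come from the text.

```python
# Numerical example for 14.2(e): Y = X + Z with X, Z independent, integer-valued.
# pX and pZ are hypothetical pmfs chosen only to illustrate the formulas.
pX = {0: 0.5, 1: 0.3, 2: 0.2}
pZ = {0: 0.6, 1: 0.4}

# p_Y(y) = sum_x pZ(y - x) pX(x)   (discrete convolution)
pY = {}
for x, px in pX.items():
    for z, pz in pZ.items():
        pY[x + z] = pY.get(x + z, 0.0) + px * pz

# p_{X|Y}(x|y) = pZ(y - x) pX(x) / pY(y)   (Bayes' form of 14.2(d), specialized to Y = X + Z)
y = 2
post = {x: pZ.get(y - x, 0.0) * px / pY[y] for x, px in pX.items()}
print("pY:", pY)
print(f"pX|Y(.|{y}):", post, " sum =", round(sum(post.values()), 12))
```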

14.3. Absolutely continuous random variables

(a) f_{Y|X}(y|x) = \frac{f_{X,Y}(x, y)}{f_X(x)}.

(b) F_{Y|X}(y|x) = \int_{-\infty}^{y} f_{Y|X}(t|x)\, dt.

14.4. P[(X, Y) ∈ A] = E\big[ P[(X, Y) ∈ A \mid X] \big] = \int_A f_{X,Y}(x, y)\, d(x, y).

14.5. \frac{\partial}{\partial z} F_{Y|X}(g(z, x) \mid x) = \frac{\partial}{\partial z} \int_{-\infty}^{g(z,x)} f_{Y|X}(y \mid x)\, dy = f_{Y|X}(g(z, x) \mid x)\, \frac{\partial}{\partial z} g(z, x).

A.2. ln(x) ≤ x − 1 for x > 0. If we replace x by \frac{1}{x}, we have ln(x) ≥ 1 − \frac{1}{x}. This gives the fundamental inequality of information theory:
1 − \frac{1}{x} \le \ln(x) \le x − 1 \quad \text{for } x > 0,
with equality if and only if x = 1. Alternative forms are listed below.

(i) For x > −1, \frac{x}{1+x} \le \ln(1 + x) < x, with equality if and only if x = 0.

(ii) For x < 1, x \le −\ln(1 − x) \le \frac{x}{1−x}, with equality if and only if x = 0.

A.3. For |x| ≤ 0.5, we have
e^{x − x^2} \le 1 + x \le e^x. \qquad (44)
This is because
x − x^2 \le \ln(1 + x) \le x, \qquad (45)
which is semi-proved by the plot in Figure 25.

A.4. Consider a triangular array of real numbers (x_{n,k}). Suppose (i) \sum_{k=1}^{r_n} x_{n,k} → x and (ii) \sum_{k=1}^{r_n} x_{n,k}^2 → 0. Then
\prod_{k=1}^{r_n} (1 + x_{n,k}) → e^x.
Moreover, suppose the sum \sum_{k=1}^{r_n} |x_{n,k}| converges as n → ∞ (which automatically implies that condition (i) is true for some x). Then condition (ii) is equivalent to condition (iii), where condition (iii) is the requirement that \max_{k ∈ [r_n]} |x_{n,k}| → 0 as n → ∞.
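A quick numerical experiment makes A.4 concrete. The sketch below is a minimal check, assuming the made-up triangular array x_{n,k} = x/n + (−1)^k/n with r_n = n and x = 1.5 (conditions (i) and (ii) hold for this choice); the array and the value of x are illustrative only.

```python
import numpy as np

# Numerical check of A.4 with the hypothetical array x_{n,k} = x/n + (-1)^k / n, r_n = n:
# sum_k x_{n,k} -> x and sum_k x_{n,k}^2 -> 0, so prod_k (1 + x_{n,k}) should approach e^x.
x = 1.5
for n in [10, 100, 1_000, 10_000]:
    k = np.arange(1, n + 1)
    xnk = x / n + (-1.0) ** k / n
    print(f"n={n:6d}  sum={xnk.sum():+.4f}  prod={(1.0 + xnk).prod():.6f}  e^x={np.exp(x):.6f}")
```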

Figure 25: Bounds for ln(1 + x) when x is small (the plot compares ln(1 + x) with the lower bounds x − x² and x/(x + 1) and with the upper bound x).

Proof. When n is large enough, conditions (ii) and (iii) each imply that |x_{n,k}| ≤ 0.5. (For (ii), note that we can find n large enough such that x_{n,k}^2 ≤ \sum_k x_{n,k}^2 ≤ 0.5².) Hence, we can apply (A.3) and get

ek=1 Suppose

Prn

k=1

xn,k −

rn P

k=1

x2n,k



rn Y

k=1

rn P

(1 + xn,k ) ≤ ek=1

xn,k

.

(46)

|xn,k | → x0 . To show that (iii) implies (ii), let an = max |xk,n |. Then, k∈[rn ]

0≤

rn X k=1

x2n,k

≤ an

rn X k=1

|xn,k | → 0 × x0 = 0.

On the other hand, suppose we have (ii). Given any ε > 0, by (ii), ∃n0 such that ∀n ≥ n0 , rn rn P P x2n,k ≤ ε2 . Hence, for any k, x2n,k ≤ x2n,k ≤ ε2 and hence |xn,k | ≤ ε which implies k=1 an ≤

k=1

ε.

PrnNote that when the xk,n are non-negative, condition (i) already implies that the sum k=1 |xn,k | converges as n → ∞. Alternative versions of A.4 are as followed. rn P (a) Suppose (ii) x2n,k → 0 as n → ∞. Then, as n → ∞ we have k=1

rn Y

k=1

x

(1 + xn,k ) → e if and only if

rn X k=1

xn,k → x.

(47)

Proof. We already know from A.4 that the RHS of (47) implies the LHS. Also, condition (ii) allows the use of A.3 which implies rn Y

k=1

rn P

(1 + xn,k ) ≤ ek=1

xn,k

138

rn P

≤ ek=1

x2n,k

rn Y

k=1

(1 + xn,k ).

(46b)

(b) Suppose the xn,k are nonnegative and (iii) an → 0 as n → ∞. Then, as n → ∞ we have rn rn Y X −x (1 − xn,k ) → e if and only if xn,k → x. (48) k=1

k=1

Proof. We already know from A.4 that the RHS of (47) implies the LHS. Also, condition (iii) allows the use of A.3 which implies rn Y

k=1

(1 − xn,k ) ≤ e



rn P

rn P

xn,k

≤e

k=1

k=1

x2n,k

rn Y

k=1

(1 − xn,k ).

(46c)

Furthermore, by (45), we have rn X k=1

x2n,k ≤ an −

!

rn X

ln (1 − xn,k )

k=1

→ 0 × x = 0.

A.5. Let αi and βi be complex numbers with |αi | ≤ 1 and |βi | ≤ 1. Then, m m m Y X Y βi ≤ |αi − βi |. αi − i=1

i=1

i=1

In particular, |αm − β m | ≤ m |α − β|.

A.6. Suppose lim an = a. Then lim 1 − n→∞

n→∞

 an n n

Proof. Use (A.4) with rn = n, xn,k = − ann . Then, a2n n1 → a · 0 = 0.

= e−a [11, p 584]. Pn

Alternatively, from L’Hˆopital’s rule, lim 1 − n→∞

k=1

 a n n

xn,k = −an → −a and

Pn

k=1

x2n,k =

= e−a . (See also [22, Theorem 3.31,

p 64]) This gives a direct proof for the case when a > 0. For n large enough, note that an a both 1− and 1 − are ≤ 1 where we need a > 0 here. Applying (A.5), we get n n  n n 1 − an − 1 − a ≤ |an − a| → 0. n n   n n −1 b −1 For a < 0, we use the fact that, for bn → b > 0, (1) 1 + n = 1 + nb → e−b −1 −1 and (2) for n large enough, both 1 + nb and 1 + bnn are ≤ 1 and hence  −1 !n bn 1+ − n

 −1 !n b |bn − b|   → 0. 1+ ≤ n 1 + bnn 1 + nb 139

A.2

Summations

A.7. Basic formulas: n P (a) k = n(n+1) 2 k=0

(b)

n P

k2 =

k=0

(c)

n P

3

k =

k=0

n(n+1)(2n+1) 6



n P

k

k=0

2

= 61 (2n3 + 3n2 + n)

= 41 n2 (n + 1)2 = 14 (n4 + 2n3 + n2 )

A nicer formula is given by n X k=1

A.8. Let g (n) =

k (k + 1) · · · (k + d) =

n P

1 n (n + 1) · · · (n + d + 1) d+2

(49)

h (k) where h is a polynomial of degree d. Then, g is a polynomial

k=0

of degree d + 1; that is g (n) =

d+1 P

am x m .

m=1

• To find the coefficients am , evaluate g (n) for n = 1, 2, . . . , d + 1. Note that the case when n = 0 gives a0 = 0 and hence the sum starts with m = 1. • Alternative, first express h (k) in terms of summation of polynomials: h (k) =

d−1 X i=0

!

bi k (k + 1) · · · (k + i)

To do this, substitute k = 0, −1, −2, . . . , − (d − 1). • k 3 = k (k + 1) (k + 2) − 3k (k + 1) + k Then, to get g (n), use (49). A.9. Geometric Sums: ∞ P 1 for |ρ| < 1 (a) ρi = 1−ρ i=0

(b)

∞ P

ρi =

ρk 1−ρ

ρi =

ρa −ρb+1 1−ρ

iρi =

ρ (1−ρ)2

i=k

(c)

b P

i=a

(d)

∞ P

i=0

140

+ c.

(50)

(e)

b P

iρi =

ρb+1 (bρ−b−1)−ρa (aρ−a−ρ) (1−ρ)2

iρi =

kρk 1−ρ

i=a

(f)

∞ P

i=k

(g)

∞ P

i2 ρi =

i=0

+

ρk+1 (1−ρ)2

ρ+ρ2 (1−ρ)3

A.10. Double Sums:  n 2 n P n P P (a) ai = ai aj i=1

(b)

i=1 j=1

∞ P ∞ P

f (i, j) =

j=1 i=j

∞ P i P

i=1 j=1

A.11. Exponential Sums: ∞ k P λ =1+λ+ • eλ = k! k=0

f (i, j) =

P

(i,j)

λ2 2!

+

λ3 3!

1 [i ≥ j]f (i, j)

+ ... 3

2

• λeλ + eλ = 1 + 2λ + 3 λ2! + 4 λ3! + . . . =

∞ P

k=1

k−1

λ k (k−1)!

A.12. Suppose h is a polynomial of degree d, then ∞ X k=0

h(k)

λk = g(λ)eλ , k!

where g is another polynomial of the same degree. For example, ∞ X

k3

k=0

 λk = λ3 + 3λ2 + λ eλ . k!

(51)

This result can be obtained by several techniques. P λk (a) Start with eλ = ∞ k=0 k! . Then, we have    ∞ k X d d d λ 3λ k =λ λ λ e . k! dλ dλ dλ k=0 (b) We can expand k 3 = k (k − 1) (k − 2) + 3k (k − 1) + k.

(52)

similar to (50). Now note that ∞ X k=0

k (k − 1) · · · (k − (` − 1))

λk = λ` eλ . k!

Therefore, the coefficients of the terms in (52) directly becomes the coefficients in (51) 141
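Identity (51) is easy to confirm numerically by truncating the series. The sketch below is a minimal check, assuming the example value λ = 2.7 and a truncation at 200 terms (both illustrative choices).

```python
from math import exp, factorial

# Numerical check of (51): sum_k k^3 lambda^k / k! = (lambda^3 + 3 lambda^2 + lambda) e^lambda.
lam = 2.7   # example value
lhs = sum(k**3 * lam**k / factorial(k) for k in range(200))
rhs = (lam**3 + 3 * lam**2 + lam) * exp(lam)
print(lhs, rhs)   # the two values should agree to floating-point accuracy
```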

A.13. Zeta function ξ (s) is defined for any complex number s with Re {s} > 1 by the ∞ P 1 Dirichlet series: ξ (s) = . ns n=1

• For real-valued nonnegative x (a) ξ (x) converges for x > 1

(b) ξ (x) diverges for 0 < x ≤ 1 [11, Q2.48 p 105]. • ξ (1) = ∞ corresponds to harmonic series. A.14. Abel’s theorem: Let a = (ai : i ∈ N) be any sequence of real or complex numbers and let ∞ X Ga (z) = ai z i , i=0

be the power series with coefficients a. Suppose that the series lim Ga (z) =

z→1−

∞ X

P∞

i=0

ai converges. Then,

ai .

(53)

i=0

In the special case where all the coefficientsPai are nonnegative real numbers, then the above formula (53) holds also when the series ∞ i=0 ai does not converge. I.e. in that case both sides of the formula equal +∞.

A.3

Calculus

A.3.1

Derivatives

A.15. Basic Formulas (a)

d u a dx

(b)

d dx

= au ln a du dx

loga u =

loga e du , u dx

• In particular,

a 6= 0, 1 d dx

ln x = x1 .

(c) Derivatives of the products: Suppose f (x) = g (x) h (x), then f In fact,

(n)

n   X n (n−k) (x) = g (x) h(k) (x). k k=0

r X dn Y f (t) = i dtn i=1 n +···+n 1

r

Y dni n! f (t). ni i n !n ! · · · n ! dt 1 2 r =n i=1

142

r

Definition A.16 (Jacobian). In vector calculus, the Jacobian is shorthand for either the Jacobian matrix or its determinant, the Jacobian determinant. Let g be a function from a subset D of Rn to Rm . If g is differentiable at z ∈ D, then all partial derivatives exists at z and the Jacobian matrix of g at a point z ∈ D is  ∂g1  ∂g1 (z) · · · ∂x (z)   ∂x1 n ∂g ∂g  ..  . . .. ..  = dg (z) =  . (z) , . . . , (z) . ∂x1 ∂xn ∂gm ∂gm (z) · · · ∂xn (z) ∂x1 ∂(g1 ,...,gn ) Alternative notations for the Jacobian matrix are J, ∂(x [9, p 242], and Jg (x) where 1 ,...,xn ) the it is assumed that the Jacobian matrix is evaluated at z = x = (x1 , . . . , xn ).

• Linear approximation around z: g(x) ≈ dg(z)(x − z) + g(z) . | {z }

(54)

`(x)

The function ` is the linearization [24] of g at a point z. ∂gk • Let g : D → Rm with open D ⊂ Rn , y ∈ D. If ∀k ∀j partial derivatives ∂x exists j in a neighborhood of z and continuous at z, then g is differentiable at z. And dg is continuous at z.

• Let A be an n-dimensional Q “box” defined by the corners x and x+∆x. The “volume” of the image g(A) is ( i ∆xi ) |det dg(x)|. Hence, the magnitude of the Jacobian determinant gives the ratios (scaling factor) of n-dimensional volumes (contents). In other words, ∂(y1 , . . . , yn ) dx1 · · · dxn . dy1 · · · dyn = ∂(x1 , . . . , xn ) • d(g −1 (y)) is the Jacobian of the inverse transformation. • In MATLAB, use jacobian.

• Change of variable: Let g be a continuous differentiable map of the open set U onto V . Suppose that g is one-to-one and that det(dg(x)) 6= 0 for all x. Z Z h(g(x)) |det(dg(x))| dx = h(y)dy. U

V

A.17. When m = 1, we have a scalar function, and we can talk about the gradient vector. The gradient (or gradient vector field) of a scalar function f (x) with respect to a vector variable x = (x1 , . . . , xn ) is  ∂f  (z) ∂x1   .. T ∇x f (z) =   = (df (z)) . . ∂f (z) ∂xn 143

(a) The RHS of the linear approximation (54) characterizes the tangent hyperplane L(x) = dg(z)(x − z) + g(z) at the point z. The level surface that pass through the point z is given by {x : g(x) = g(z)} . Using the linear approximation (54), we see that around the point z, the point on the level surface must satisfy dg(z)(x − z) ≈ 0. Hence, the tangent plane at z for the level surface is given by {x : dg(z)(x − z) = 0} . Note also that the gradient vector ∇g(z) = (dg(z))T is perpendicular to the tangent plane through z: for any two point x(1) , x(2) on the tangent plane though z, dg(z)(x(2) − x(1) ) = 0. Hence, the gradient vector at z is perpendicular to the level surface at z. (b) When n = 2, given a point P = (x0 , y0 ) we have a tangent plane   ∂g ∂g L(x, y) = (P ) (x − x0 ) + (P ) (y − y0 ) + g (P ) ∂x ∂x through P . The level curve that pass through the point P is given by {(x, y) : g(x, y) = g(P )} . Approximating g by L, we find that the tangent line at P for the level curve is given by   ∂g ∂g (P ) (x − x0 ) + (P ) (y − y0 ) = 0 . (x, y) : ∂x ∂x This line is perpendicular to the gradient vector. (c) When n = 3, given a point P = (x0 , y0 , z0 ), we think about the level surface S = {(x, y, z) : f (x, y, z) = c} where c = f (P ). The tangent plane at the point P on the level surface S is given by   ∂g ∂g ∂g (x, y, z) : (P ) (x − x0 ) + (P ) (y − y0 ) + (P ) (z − z0 ) = 0 . ∂x ∂y ∂z This plane is perpendicular to the gradient vector. A.18. For gradient, if the argument is row vector, then,   ∂f1 ∂f2 ∂fm   ∂θ1 ∇θ f T (θ) =  ...

∂f1 ∂θn

144

∂θ1

.. .

∂f2 ∂θn

...

∂θ1

.. .

∂fm ∂θn

 .

Definition A.19. Given a scalar-valued function f , the Hessian matrix of second partial derivatives  2 2f ∂ f · · · ∂θ∂1 ∂θ ∂θ12 n  . . . .. . ∇2θ f (θ) = ∇θ (∇θ f (θ))T =  . .  . ∂2f ∂2 · · · ∂θ 2 ∂θn ∂θ1 n

It is symmetric for nice function.

matrix is the square 

 . 

 A.20. ∇x f T (x) = (df (x))T .

A.21. Let f, g : Ω → Rm . Ω ⊂ Rn . h (x) = hf (x) , g (x)i : Ω → R. Then dh (x) = (f (x))T dg (x) + (g (x))T df (x) . • For an n × n matrix A, let f (x) = hAx, xi. Then df (x) = (Ax)T I + xT A = xT AT + xT A. ◦ If A is symmetric, then df (x) = 2xT A. So,

∂ ∂xj

hAx, xi = 2 (Ax)j .

A.22. Chain rule: If f is differentiable at y and g is differentiable at z = f (y), then g ◦ f is differentiable at y and d (g ◦ f ) (y) = dg (z) df (y) p×n

p×m

m×n

(matrix multiplication). • In particular, d g (x (t) , y (t) , z (t)) = dt



     ∂ d ∂ d g (x, y, z) x (t) + g (x, y, z) y (t) ∂x dt ∂y dt    ∂ d + g (x, y, z) z (t) . ∂z dt (x,y,z)=(x(t),y(t),z(t))

A.23. Let f : D → Rm where D ⊂ Rn is open and connected (so arcwise-connected). Then, df (x) = 0 ∀x ∈ D ⇒ f is constant. A.24. If f is differentiable at y, then all    ∂f1 (y) ∇f1 (y) ∂x1    . . .. .. df (y) =  = ∂fm ∇fm (y) (y) ∂x1

partial and directional derivative exists at y  ∂f1 · · · ∂x (y)   n ∂f ∂f  .. ... (y) , . . . , (y) . = . ∂x1 ∂xn ∂fm · · · ∂xn (y)

A.25. Inversion (Mapping) Theorem: Let open Ω ⊂ Rn , f : Ω → Rn is C 1 , c ∈ Ω. df (c) is bijective, then ∃U open neighborhood of c such that n×n

145

(a) V = f (U ) is an open neighborhood of f (c). (b) f |U : U → V is bijective. (c) g = (f |U )−1 : V → U is C 1 . (d) ∀y ∈ V dg (y) = [df (g (y))]−1 .

 ∇ x xT = I 2 ∇x kxk = 2x   ∇x (Ax + b)T = AT  ∇ x aT x = a

d (x) = I  d kxk2 = 2xT

d (Ax +  b) = A d aT x = aT d f T (x) g (x) T



 ∇x f T (x) g (x)

= (dg (x))T f (x) + (df (x))T g (x)   = ∇x g T (x) f (x) + ∇x f T (x) g (x)

T

= f (x) dg (x) + g (x) df (x)

For symmetric Q,

For symmetric Q,  d f T (x) Qf (x)

 ∇x f T (x) Qf (x)

= 2 (df (x))T Qf (x)  = 2∇x f T (x) Qf (x)

= f T (x) Qdf (x) + f T (x) QT df (x)

= 2f T (x) Qdf (x)  d xT Qx = 2xT Q  d kf (x)k2 = 2f T (x) df (x)

A.3.2

 ∇x xT Qx = 2Qx.   ∇x kf (x)k2 = 2∇x f T (x) f (x)

Integration

A.26. Basic Formulas R u (a) au du = lna a , a > 0, a 6= 1.  1  1 R1 α R∞ α , α > −1 , α < −1 α+1 α+1 (b) t dt = and t dt = So, the integration of the ∞, α ≤ −1 ∞, α ≥ −1 0 1 R1 R∞ function 1t is the test case. In fact, 1t dt = 1t dt = ∞. 0

(c)

R

m

x ln xdx =



xm+1 m+1 1 ln2 2

ln x − x,

1 m+1



1

, m 6= −1 m = −1

A.27. Integration by Parts: Z

udv = uv −

146

Z

vdu

4 4 t t

2 1.5

3

t t

0.5

2 1 t t

− 0.5 −1

1

0 0

0

1

2

3

0

4

5

t

5

e ax n ( −1) n! nα n ax e dx = 26: If n is a positive integer, ∫ xFigure ∑ Plots of xt − k . a k = 0 a k ( n − k )! k





n = 1:

e ax ⎛ 1⎞ ⎜x− ⎟ a ⎝ a⎠

R e axwith 2 ⎞ ⎛ 2 2an integral (a) Basic idea: Start of the form f (x) g (x)dx. • n = 2: R ⎜x − x+ 2 ⎟ a a a ⎝ ⎠ Match this with an integral of the form udv by choosing dv to be part of the k n k t at n n n! n − k (f−1(x) integrand including and ( −1)possibly ) n! or ( −1) n! t n − k + n! e dx e atg (x). • ∫ x n e ax dx = t − = ∑ ∑ n +1 a k = 0 a k ( n − k )! a n +1 a k = 0 a k ( n − k )! ( −a ) 0 (b) In particular, repeated application of integration by parts gives ∞

Z

n − ax ∫ x e dx =



e − at a

n! e − at n n! j n−k t = n−1 ∑ ∑ an−k j!t k X a i j = 0(i) k = 0 a ( n − k )! n

f (x) g (x)dx = f (x) G1 (x) + t



∫x e



n ax

dx =

n!

(−1) f

n

(x) Gi+1 (x) + (−1)

i=1 , a < 0. (See also Gamma function)

Z

f (n) (x) Gn (x) dx

(55) ( −a ) R R di (i) where f (x) = ∞dxi f (x), G1 (x) = g (x)dx, and Gi+1 (x) = Gi (x)dx. Figure 27 n! derived = ∫ e − t t n dt .(55). can be used• to n +1

0

0

In MATLAB, considerf using g  x  in stead of factorial(n).  x  gamma(n+1) +vector allows input. Note also that gamma() 1 f    x G1  x 

• 1

∫x



Differentiate e dx is finite if and onlyfif 2 β x> −1 .

β −x

0

f f

To see this, note that Figure





and

x e

1

1

n

n1

Gn 1  x  Gn  x 

27: Integration by Parts

f

n

 x  Gn 1  x    f  n 1  x  GZ n 1  x  dx .

g2 (x)dx 2  = f1 (x) 2 3x 2 3x  x e dx   3 x  9 x  27  e

Z



n

Integrate

f  x  g  x dx  f  x  G1  x    f   x  G1  x  dx , and

To see this, note that   f  n Z  x  Gn  x  dx  

 x    x  

n 1

G2  x 

+

f (x) G 2 1 (x) − 3x x

+

e

f 0 (x) G1 (x) dx, x

1 3x 2x e x Z sin x e dx - 3   (n)  (n) 1 3− f (x) G (x) dx = f (x) G (x) 2 n+1 ex   sin x  cos xn e x    sin x  e x dx 9 + 1 1 3x   sin x  cos x  e x 0 e 2 27 147

n ax

dx 

x n e ax n n 1 ax   x e (Integration by parts). a a

f

sin x + cos x  sin x (n+1) +

e

ex ex

(x) Gn+1 (x) dx.

To see this, note that 

 f  x  g  x dx  f  x  G  x    f   x  G  x  dx , and



       f  x  G  x  dx  f  x  G  x    f  x  G  x  dx .

1

n

1 3x e - 3 1 3x e + 9 1 3x e 27

2 0

R

n 1

e3 x

+

2x

(c)

n 1

n

n

x2



1

n 1

2 2  1 dx   x 2  x   e3 x 9 27  3



x e



  sin x  e dx

2 3x

x

  sin x  cos x  e x    sin x  e x dx 

+

ex

cos x  sin x +

ex

sin x

ex

1  sin x  cos x  e x 2

Figure of Integration by Parts using Figure 27. x n e ax28:n Examples n 1 ax

n ax  x e dx 



a

x

xn eax 1 a



n a

a

xn eax dx =

(Integration by parts).

e

R

xn−1 eax

 ,   1 R t  dt 0  R    1 = f (x) g (x) − f 0 (x) g (x)dx. (d) f(x) g (x)dx 0    1  , 1

To see this, start with the product rule: (f (x)g(x))0 = f (x)g 0 (x) + f 0 (x)g(x). Then, 1 both  integrate ,sides.   1  

(e)

Rb a

t 1

dt    1 ,

  1

f (x) g 0 (x)dx = f (x) g (x)|ba −

Rb

f 0 (x) g (x)dx

1



1 a 1 1 So, the integration of the function is the test case. In fact,  dt   dt   . t t t 0 1

A.28. If n is a positive integer, Z n eax X (−1)k n! n−k x . xn eax dx = a k=0 ak (n − k)! 

(a) n = 1 :

eax a

x−

(b) n = 2 :

eax a

x2 − a2 x +

(c)

Rt

eat a

xn eax dx =

0

(d)

R∞

xn e−ax dx =

t

(e)

R∞ 0

R∞

n P

k=0

e−at a

2 a2



(−1)k n! n−k t ak (n−k)!

n P

k=0

n! , (−a)n+1

xn eax dx = • n! =

1 a



n! tn−k ak (n−k)!

a

(−1)n n! an+1

e−at a

=

= n P

eat a

j=0

n P

k=0

(−1)k n! n−k t ak (n−k)!

+

n! (−a)n+1

n! tj an−k j!

< 0. (See also Gamma function)

e−t tn dt.

0

• In MATLAB, consider using gamma(n+1) in stead of factorial(n). Note also that gamma() allows vector input. (f)

R1 0

xβ e−x dx is finite if and only if β > −1.

Note that

1 e

R1 0

xβ dx ≤

R1 0

xβ e−x dx ≤

R1

xβ dx.

0

148

(g) ∀β ∈ R,

R∞ 1

For β ≤ 0, For β > 0,

xβ e−x dx < ∞. R∞

1 R∞ 1

(h)

R∞ 0

xβ e−x dx ≤ xβ e−x dx ≤

R∞

1 R∞ 1

R∞

e−x dx
−1. 2

A.29. Integrations involving the Gaussian function e−x .   2 R x2 x2 d x (a) xe− 2 dx = −e− 2 by chain rule dx = x . 2 (b) Gaussian integral (or probability integral): Z∞

−∞

(x)2 1 √ e− 2 dx = 1. 2π

To see this, we first square the integral: 2  ∞  ∞   ∞ Z Z Z∞ Z∞ Z y2 x2 +y 2 x2 x2 − − −  e 2 dx =  e 2 dx  e 2 dy  = e− 2 dxdy. −∞

−∞

−∞

−∞ −∞

Then, compute the double integration in polar coordinates which has x2 + y 2 = r2 and dxdy = rdr. Hence,  ∞ 2  2π  Z Z∞ Z2π Z∞ Z 2 r r − r2 − r2   e− x2 dx = re dθdr = re dθdr −∞

• • • •

R∞

−∞

R∞

√ 1 e− 2πσ

2 − x2 σ

e

(x−m)2 2σ 2

dx =

e−(ax

2 +bx+c

−∞

R∞

−∞

R∞

=

0 ∞ Z 0

√ 1 xe− 2πσ



x2



e

 2

σ √ 2

2

dx =

) dx = p π e b a

(x−m)2 2σ 2

0

0

r r ∞ rr re− 2 2πdr = −2π e− 2 = 2π.

dx = 1.

−∞

−∞

R∞

0





√ 2π √σ2 = σ π

2 −4ac 4a

dx = m. 149

0



R∞

−∞

x2

esx √12π e− 2 dx =

R∞

−∞

2

(x−s) √1 e− 2 2π

s2

s2

e 2 dx = e 2

◦ Characteristic function (s = +jt): ϕX (t) =

−∞

◦ Fourier transform (s = −jω): 

R∞

x2

t2

ejtx √12π e− 2 dx = e− 2 .

 Z∞ ω2 x2 1 1 − x2 2 2 F √ e 2 = e−jωx √ e− 2 dx = e− 2 = e−2π f . 2π 2π −∞

x2

x2

(c) Using integration by parts, splitting x2 e− 2 into x × xe− 2 , we have Z∞

2

2

2 − x2

xe

dt = −xe

− x2

R∞

−∞

−∞

−∞



Z∞ ∞ √ √ x2 + e− 2 dt = 0 + 2π = 2π.

2

x2 e−x dx =



π 4

0

A.30 (Differential of integral). Leibniz’s Rule: Let g : R2 → R, a : R → R, and b(x) R b : R → R be C 1 . Then f (x) = g (x, y)dy is C 1 and a(x)

0

0

0

f (x) = b (x) g (x, b (x)) − a (x) g (x, a (x)) +

Zb(x)

∂g (x, y)dy. ∂x

(56)

a(x)

In particular, we have d dx

Zx

f (t)dt = f (x) ,

(57)

a

Zv(x) dv d f (t)dt = f (t)dt = f (v (x)) v 0 (x) , dx dv a a   u(x) Zv(x) Zv(x) Z d d  f (t)dt = f (t)dt − f (t)dt = f (v (x)) v 0 (x) − f (u (x)) u0 (x) . dx dx d dx

Zv(x)

a

u(x)

(58)

(59)

a

Note that (56) can be derived from (A.22) by considering f (x) = h(a(x), b(x), x) where Rb h (a, b, c) = g (c, y)dy. [11, p 318–319]. a

150

A.4

Gamma and Beta functions

A.31. Gamma function: R∞ (a) Γ (q) = xq−1 e−x dx. ;

q > 0.

0

(b) Γ (n) = (n − 1)! for n ∈ N. Γ (n + 1) = n! if n ∈ N ∪ {0}. (c) 0! = 1.  √ (d) Γ 21 = π

(e) Γ (x + 1) = xΓ (x) (Integration by parts).

(f)

• This relationship is used to define the gamma function for negative numbers. √  √  √  • From Γ 12 = π, we have Γ 32 = 2π and Γ 52 = 3 4 π .

Γ(q) αq

=

R∞

xq−1 e−αx dx, α > 0.

0

20

15

10

5

0

-5

-10

-15

-20 -6

-4

-2

0 x

2

4

6

Figure 29: Plot of gamma function Γ(x). (g) By limiting argument (as q → 0+ ), we have Γ (0) = ∞ which implies Γ(−n) = ∞, for n ∈ N,. A.32. The incomplete beta function is defined as Z x B(x; a, b) = ta−1 (1 − t)b−1 dt. 0

For x = 1, the incomplete beta function coincides with the (complete) beta function. 151

The regularized incomplete beta function (or regularized beta function for short) is defined in terms of the incomplete beta function and the (complete) beta function: Ix (a, b) =

B(x; a, b) . B(a, b)

• For integers m, k, Ix (m, k) =

m+k−1 X j=m

(m + k − 1)! xj (1 − x)m+k−1−j . j!(m + k − 1 − j)!

• I0 (a, b) = 0, I1 (a, b) = 1 • Ix (a, b) = 1 − I1−x (b, a)

152

References [1] Patrick Billingsley. Probability and Measure. John Wiley & Sons, New York, 1995. 6.30, 2b [2] George Casella and Roger L. Berger. Statistical Inference. Duxbury Press, 2001. 5.43, 7.9, 11, 9.13, 12.11, 12.4, 1, 6, 12.21 [3] Donald G. Childers. Probability And Random Processes Using MATLAB. McGrawHill, 1997. 2.3, 2.3 [4] F. N. David. Games, Gods and Gambling: A History of Probability and Statistical Ideas. Dover Publications, unabridged edition, February 1998. 3.8 [5] Herbert A. David and H. N. Nagaraja. Order Statistics. Wiley-Interscience, 2003. 12.4, 12.21 [6] Rick Durrett. Elementary Probability for Applications. Cambridge University Press, 1 edition, July 2009. 3.10, 1 [7] W. Feller. An Introduction to Probability Theory and Its Applications, volume 2. John Wiley & Sons, 1971. 5.37 [8] William Feller. An Introduction to Probability Theory and Its Applications, Volume 1. Wiley, 3 edition, 1968. 2.15 [9] Terrence L. Fine. Probability and Probabilistic Reasoning for Electrical Engineering. Prentice Hall, 2005. 4.1, 4.10, 4.14, 5.18, 5.24, 6.27, 10.1, 10.3, 12.10, 16.3, A.16 [10] B. V. Gnedenko. Theory of probability. Chelsea Pub. Co., New York, 4 edition, 1967. Translated from the Russian by B.D. Seckler. 6.30 [11] John A. Gubner. Probability and Random Processes for Electrical and Computer Engineers. Cambridge University Press, 2006. 2.2, 5.31, 5.34, 5.36, 4, 7.45, 8.9, 10.9, 11.1, 2, 3, 6, 12.24, 13.22, 13.23, 15.3, 3, A.6, A.13, A.30 [12] M.O. Jones and R.F. Serfozo. Poisson limits of sums of point processes and a particlesurvivor model. Ann. Appl. Probab, 17(1):265–283, 2007. 13.10 [13] Samuel Karlin and Howard E. Taylor. A First Course in Stochastic Processes. Academic Press, 1975. 5.6, 10.10, 11.1 [14] J. F. C. Kingman. Poisson Processes. 0198536933. 6.24, 6.25

Oxford University Press, 1993.

ISBN:

[15] A.N. Kolmogorov. The Foundations of Probability. 1933. 4.14 [16] B. P. Lathi. Modern Digital and Analog Communication Systems. Oxford University Press, 1998. 1.1, 1.5

153

[17] Nabendu Pal, Chun Jin, and Wooi K. Lim. Handbook of Exponential and Related Distributions for Engineers and Scientists. Chapman & Hall/CRC, 2005. 7.25 [18] Athanasios Papoulis. Probability, Random Variables and Stochastic Processes. McGraw-Hill Companies, 1991. 6.18, 12.5 [19] E. Parzen. Stochastic Processes. Holden Day, 1962. 3, 4, 12.25 [20] Sidney I. Resnick. Adventures in Stochastic Processes. Birkhuser Boston, 1992. 5 [21] Kenneth H. Rosen, editor. Handbook of Discrete and Combinatorial Mathematics. CRC, 1999. 1, 2.11, 4, 5, 6, 7, 8 [22] Walter Rudin. Principles of Mathematical Analysis. McGraw-Hill, 1976. A.6 [23] Gbor J. Szkely. Paradoxes in Probability Theory and Mathematical Statistics. 1986. 9.2 [24] George B. Thomas, Maurice D. Weir, Joel Hass, and Frank R. Giordano. Thomas’ Calculus. Addison Wesley, 2004. A.16 [25] Tung. Fundamental of Probability and Statistics for Reliability Analysis. 2005. 18, 21 [26] Larry Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2004. 5.16, 5.18, 5.44, 8.9, 3, 14.1 [27] John M. Wozencraft and Irwin Mark Jacobs. Principles of Communication Engineering. Waveland Press, June 1990. 1.2

154

Index Affine estimator, 131 Associativity, 5

Markov’s Inequality, 83 Minkowski’s Inequality, 85 Monte Hall’s Game, 29 Multinomial Coefficient, 17 Multinomial Counting, 17 Multinomial Theorem, 17

Bayes Theorem, 24 Binomial theorem, 14 Birthday Paradox, 28 Cardinality, 6 Cauchy-Bunyakovskii-Schwartz Inequality, 85 Central binomial coefficient, 15 Chevalier de Mere’s Scandal of Arithmetic, 26 Coefficient of Variation (CV), 78 Commutativity, 5 de Morgan laws, 5 Delta function, 18 Dice, 25 Die, 25 Dirac delta function, see Delta function Disjoint Sets, 5 Distributivity, 5

Order Statistics, 108 Outcome, 4 Partition, 6 Probability of coincidence birthday, 27 Random experiment, 4 Relative frequency, 5 Sample space, 4 Standard Deviation, 78 Total Probability Theorem, 24 uncorrelated not independent, 82, 100

Event Algebra, 29

Wiener filters, 131

Fano Factor, 78

Zeta function, 139 Zipf or zeta random variable, 64

gradient, 140 gradient vector field, 140 H¨older’s Inequality, 85 Hessian matrix, 142 Idempotence, 5 Inclusion-Exclusion Principle, 6 Integration by Parts, 143 Isserlis’s Theorem, 128 Jacobian, 105, 140 Jacobian formulas, 105 Jensen’s Inequality, 85 Leibniz’s Rule, 147 Linear MMSE estimator, 131 Lyapounov’s Inequality, 85

155