XX 20XX, Volume X, No. X (Serial No. X) Journal of Mathematics and System Science, ISSN 2159-5291, David Publishing Company, USA

A Maximum Likelihood Mixture Approach for Multivariate Hypothesis Testing in case of Incomplete Data

Loc Nguyen¹
1. Vietnam Institute of Mathematics

Received: June 30, 2013 / Accepted: June 30, 2013 / Not Published Yet.

Abstract: Multivariate hypothesis testing becomes more and more necessary as data changes from scalar, univariate format to multivariate format; financial and biological data, in particular, often consist of n-dimensional vectors. The likelihood ratio test is the standard method for testing the mean of a multivariate sample with known or unknown covariance matrix, but it cannot be applied to incomplete data, and data incompletion is common in practice for many reasons. Therefore, this research proposes a new approach that makes it possible to apply the likelihood ratio test to incomplete data. Instead of replacing missing values in the incomplete sample by estimated values, this approach classifies the incomplete sample into groups, and each group is represented by a potential or partial distribution. All partial distributions are unified into a mixture model, which is optimized via the expectation maximization (EM) algorithm. Finally, the likelihood ratio test is performed on the mixture model instead of on the incomplete sample. This research provides a thorough description of the proposed approach and the mathematical proofs that it requires. The comparison of the mixture model approach with the filling-missing-values approach is also discussed.

Key words: maximum likelihood, mixture model, multivariate hypothesis testing, incomplete data

1. Likelihood ratio test Suppose data sample is n-dimension vector space X, which contains m observation vectors X1, X2,…, Xm where Xi = {xi1, xi2,…, xin}. Note that Xi is identically distributed random variable and it can be called observation, data point, or sample point. We conventionalize that upper-case letter denotes variable and lower-case letter denotes value or instance of variable, for example, xi is the instance of variable Xi. We test on mean of normal distribution when variance known or unknown, so Xi (s) conforms N(μ, Σ) and the null hypothesis H0: μ = μ0 and H1: no constraint on μ. Let L0(μ0, Σ | X) and L1(μ, Σ | X) be the likelihoods for 1

Corresponding author: Loc Nguyen, postdoctoral, research field: computer science, statistics, and mathematics. E-mail: [email protected].

null hypothesis H0 and alternative hypothesis H1, respectively.

$$L_0(\mu_0, \Sigma \mid X) = \prod_{i=1}^{m} P(x_i \mid \Theta) = |2\pi\Sigma|^{-m/2}\, e^{-\frac{1}{2}\sum_{i=1}^{m}(x_i-\mu_0)^T \Sigma^{-1}(x_i-\mu_0)}$$

$$L_1(\mu, \Sigma \mid X) = \prod_{i=1}^{m} P(x_i \mid \Theta) = |2\pi\Sigma|^{-m/2}\, e^{-\frac{1}{2}\sum_{i=1}^{m}(x_i-\mu)^T \Sigma^{-1}(x_i-\mu)}$$

(where m is the number of observations and n is the number of dimensions of sample X)

We take the logarithm of the likelihood functions so as to convert repeated multiplication into repeated addition, so the log-likelihood functions of X are [1, pp. 185-186]:

$$\log L_0(\mu_0, \Sigma \mid X) = -\frac{m}{2}\log|2\pi\Sigma| - \frac{1}{2}\sum_{i=1}^{m}(x_i - \mu_0)^T \Sigma^{-1}(x_i - \mu_0) = -\frac{m}{2}\log|2\pi\Sigma| - \frac{m}{2}\mathrm{tr}(\Sigma^{-1}S) - \frac{m}{2}(\bar{x} - \mu_0)^T \Sigma^{-1}(\bar{x} - \mu_0)$$

$$\log L_1(\mu, \Sigma \mid X) = -\frac{m}{2}\log|2\pi\Sigma| - \frac{1}{2}\sum_{i=1}^{m}(x_i - \mu)^T \Sigma^{-1}(x_i - \mu) = -\frac{m}{2}\log|2\pi\Sigma| - \frac{m}{2}\mathrm{tr}(\Sigma^{-1}S) - \frac{m}{2}(\bar{x} - \mu)^T \Sigma^{-1}(\bar{x} - \mu)$$

where the trace operator tr(A) is the sum of the diagonal elements of matrix A, $\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i$ is the sample mean, and S is the sample covariance matrix. L0(μ0, Σ | X) and L1(μ, Σ | X) are maximal if and only if LogL0(μ0, Σ | X) and LogL1(μ, Σ | X) are maximal. It can be shown that LogL1(μ, Σ | X) is maximal at μ = x̄ and Σ = S, since these are the solutions of the equations formed by setting the first-order derivatives with respect to μ and Σ to zero:

$$\frac{\partial}{\partial\mu}\left(-\frac{m}{2}\log|2\pi\Sigma| - \frac{m}{2}\mathrm{tr}(\Sigma^{-1}S) - \frac{m}{2}(\bar{x} - \mu)^T \Sigma^{-1}(\bar{x} - \mu)\right) = \emptyset$$

$$\frac{\partial}{\partial\Sigma}\left(-\frac{m}{2}\log|2\pi\Sigma| - \frac{m}{2}\mathrm{tr}(\Sigma^{-1}S) - \frac{m}{2}(\bar{x} - \mu)^T \Sigma^{-1}(\bar{x} - \mu)\right) = [0]$$

$$\iff \mu = \bar{x},\quad \Sigma = S$$

Let L0* and L1* be the maximal likelihoods under the null hypothesis H0 and the alternative hypothesis H1. Substituting μ = x̄ and Σ = S into L0(μ0, Σ | X) and L1(μ, Σ | X), these maximal likelihoods are totally determined:

$$L_0^* = L_0(\mu_0, S \mid X) = |2\pi S|^{-m/2}\, e^{-\frac{1}{2}\sum_{i=1}^{m}(x_i - \mu_0)^T S^{-1}(x_i - \mu_0)}$$

$$L_1^* = L_1(\bar{x}, S \mid X) = |2\pi S|^{-m/2}\, e^{-\frac{1}{2}\sum_{i=1}^{m}(x_i - \bar{x})^T S^{-1}(x_i - \bar{x})}$$

The likelihood ratio R [1, pp. 184-192] is defined as the ratio of the maximum likelihood under the null hypothesis to the maximum likelihood under the alternative hypothesis:

$$R = \frac{L_0^*(X)}{L_1^*(X)}$$

It is proved that –2log(R) approximately follows a chi-square distribution χ² with n degrees of freedom when Σ is known; thus H0 is rejected in favor of H1 if –2log(R) > χ²α,n at significance level α. This introduction is a brief summary of the likelihood ratio test. In general, the likelihood ratio test is a very effective testing method because it reduces the complexity of n-dimensional data by transforming a multivariate testing criterion into a univariate one based on the scalar likelihood ratio.
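As a concrete illustration of this procedure on complete data, the sketch below computes –2log(R) for a multivariate normal sample and compares it with the chi-square threshold. It is only a minimal sketch under the assumptions of this section (the sample covariance S is plugged into both maximum likelihoods, as in the derivation above); the function and variable names are illustrative, not taken from any cited library.

```python
import numpy as np
from scipy.stats import chi2

def likelihood_ratio_test(X, mu0, alpha=0.05):
    """Test H0: mu = mu0 for an m x n sample X of multivariate normal data."""
    m, n = X.shape
    x_bar = X.mean(axis=0)                       # sample mean
    centered = X - x_bar
    S = centered.T @ centered / m                # sample covariance matrix
    S_inv = np.linalg.inv(S)

    # -2 log R = sum_i (x_i - mu0)^T S^-1 (x_i - mu0)
    #          - sum_i (x_i - x_bar)^T S^-1 (x_i - x_bar)
    dev0 = X - mu0
    stat = np.einsum('ij,jk,ik->', dev0, S_inv, dev0) \
         - np.einsum('ij,jk,ik->', centered, S_inv, centered)

    threshold = chi2.ppf(1 - alpha, df=n)        # chi-square critical value
    return stat, threshold, stat > threshold     # reject H0 if stat > threshold

# Example usage with a small synthetic sample
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
print(likelihood_ratio_test(X, mu0=np.zeros(3)))
```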

2. Mixture model test on incomplete data

Suppose the data sample is an n-dimensional vector space containing m observation vectors X1, X2,…, Xm, where Xi = (xi1, xi2,…, xin). Thus, X = {X1, X2,…, Xm} composes a matrix whose rows are the observations Xi. In case of incomplete data, X is a sparse matrix and an observation Xi is not always a complete vector, because it can lack some components xij. The likelihood ratio test cannot be applied to an incomplete sample such as X because it is impossible to calculate statistics such as the mean and variance on an incomplete sample. This research tries to overcome this drawback by discovering the potential probability distributions underlying the observations Xi regardless of data incompletion. Suppose we test the mean of a normal distribution with known or unknown variance, so each Xi conforms to N(μ, Σ), and the null hypothesis H0: μ = μ0 is tested against the alternative


hypothesis H1: no constraint on μ. Based on such potential distributions, the maximum likelihoods of these hypotheses are determined. Firstly, we estimate the number of potential probability distributions underlying the observations Xi. Observations are classified into k classes following the two conditions below.
1. Class c1 represents observations whose components xij are all complete. Note that class c1 may not exist.
2. Classes c2, c3,…, ck each represent observations lacking the same components xij. For example, suppose we have four observations X1 = {x11=1, x12=2, x13 (empty), x14 (empty)}, X2 = {x21=2, x22=3, x23 (empty), x24 (empty)}, X3 = {x31 (empty), x32 (empty), x33=1, x34=2} and X4 = {x41 (empty), x42 (empty), x43=2, x44=3}. Then X1 and X2 belong to the same class (class 1), and X3 and X4 belong to the same class (class 2) according to this condition. Note that an empty value is considered a missing value.
Two or more classes can overlap. For example, if X5 = {x51=3, x52 (empty), x53 (empty), x54 (empty)} then X5 belongs to a different class, class 3; thus class 1 and class 3 overlap. The number of potential probability distributions is initialized to k. Note that k should be much smaller than the sample size m. A potential probability distribution is also called a partial probability distribution. Let pj be the potential probability distribution corresponding to class j, and suppose pj conforms to a normal distribution N(μj, Σj) with mean μj and covariance Σj. The probabilistic mixture model [2, p. 3] is defined as below.

$$P(x_i \mid \Theta) = \sum_{j=1}^{k} \alpha_j\, p_j(x_i \mid \theta_j) \quad (1)$$

where pj is a potential distribution and xi is an instance of variable Xi. Let Θ = {α1, α2,…, αk, θ1, θ2,…, θk} and θj = {μj, Σj} be the probabilistic parameters of the sample and of potential distribution pj, respectively. Equation (1) indicates that the distribution of data point xi is constituted of potential (or partial) distributions pj. Each potential distribution pj is weighted by a weight αj such that $\sum_{j=1}^{k}\alpha_j = 1$. The weight αj is the probability of pj if pj is considered as a random variable. The weights αj are learned and updated from the sample, as discussed later. The likelihood function of X is:

$$L(\Theta \mid X) = \prod_{i=1}^{m} P(x_i \mid \Theta) = \prod_{i=1}^{m}\sum_{j=1}^{k}\alpha_j\, p_j(x_i \mid \theta_j)$$

We take the logarithm of the likelihood function so as to convert repeated multiplication into repeated addition, so the log-likelihood function of X is [2, p. 3]:

$$\log L(\Theta \mid X) = \sum_{i=1}^{m}\log\left(\sum_{j=1}^{k}\alpha_j\, p_j(x_i \mid \theta_j)\right)$$

Let Y1, Y2,…, Ym be variables indicating which potential distribution each data point Xi comes from. The value of each Yi ranges over {1, 2,…, k}. Concretely, if Yi = yi then data point Xi conforms to distribution $p_{y_i}$. When X = {X1, X2,…, Xm} is the incomplete (observed) data, Y = {Y1, Y2,…, Ym} is the hidden data that, together with X, lets us exploit X. The probability of Yi = yi is the prior probability of $p_{y_i}$, which indicates how strongly data point xi is weighted toward distribution $p_{y_i}$, so P(yi) = $\alpha_{y_i}$. Similarly, the conditional probability of data point xi given yi is the partial probability of xi under $p_{y_i}$, so P(xi | yi) = $p_{y_i}(x_i \mid \theta_{y_i})$. Letting (X, Y) be the complete data, the log-likelihood function is rewritten as [2, p. 3]:

$$\log L(\Theta \mid X, Y) = \sum_{i=1}^{m}\log\left(P(x_i \mid y_i)P(y_i)\right) = \sum_{i=1}^{m}\log\left(\alpha_{y_i}\, p_{y_i}(x_i \mid \theta_{y_i})\right) \quad (2)$$
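To make equation (1) concrete, the following sketch evaluates the mixture density P(xi | Θ) for one complete data point, assuming Gaussian partial distributions; the weights and parameters shown are illustrative placeholders rather than values taken from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative parameters Theta = {alpha_j, mu_j, Sigma_j} for k = 2 components
alphas = [0.5, 0.5]
mus    = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
sigmas = [np.eye(2), np.eye(2)]

def mixture_density(x, alphas, mus, sigmas):
    """P(x | Theta) = sum_j alpha_j * N(x; mu_j, Sigma_j), as in equation (1)."""
    return sum(a * multivariate_normal.pdf(x, mean=m, cov=S)
               for a, m, S in zip(alphas, mus, sigmas))

print(mixture_density(np.array([1.0, 1.0]), alphas, mus, sigmas))
```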


Our goal is to find the optimal parameter Θ* = {α1*, α2*,…, αk*, θ1*, θ2*,…, θk*} that maximizes the log-likelihood function in (2). The expectation maximization (EM) algorithm, an iterative process, is applied to find Θ*. Let Θt = {α1t, α2t,…, αkt, θ1t, θ2t,…, θkt} be the estimate at the tth iteration, where θjt = {μjt, Σjt} is the estimated parameter of partial probability pj at the tth iteration. In general, starting with an initial estimate Θ0, each iteration of the EM algorithm has two steps [3, p. 8]:
- E-step: computing the conditional expectation based on the previous estimate Θt.
- M-step: finding the current estimate Θt+1 = Θ* that maximizes that conditional expectation. Θt+1 is reserved for the next iteration.
The EM algorithm stops when it meets a terminating condition, for example when the difference between the previous estimate Θt and the current estimate Θt+1 is smaller than some pre-defined threshold ε, namely |Θt+1 – Θt| < ε. In the E-step, the conditional expectation Q(Θ, Θt) is determined as below [2, p. 4]:

$$Q(\Theta, \Theta^t) = \sum_{y\in\Psi} \log L(\Theta \mid X, y)\, P(y \mid X, \Theta^t) \quad (3)$$

Note that Q(Θ, Θt) is a function of the variable Θ, while the value Θt is the known estimate from the previous iteration. Q(Θ, Θt) sums over all possible instances of Y = {Y1, Y2,…, Ym}, and Ψ denotes the set of all combinational values over Y. Now we need to specify the probability of y given X and Θt, P(y | X, Θt), where y = {y1, y2,…, ym} is an instance of the hidden variables Y = {Y1, Y2,…, Ym}. Applying Bayes' rule, we have [2, p. 3]:

$$P(y_j \mid x_j, \Theta^t) = \frac{P(y_j \mid \Theta^t)\,P(x_j \mid y_j, \Theta^t)}{\sum_{h=1}^{k} P(y_j = h \mid \Theta^t)\,P(x_j \mid y_j = h, \Theta^t)} = \frac{\alpha_{y_j}^t\, p_{y_j}(x_j \mid \theta_{y_j}^t)}{\sum_{h=1}^{k} \alpha_h^t\, p_h(x_j \mid \theta_h^t)} \quad (4)$$

Assuming that y1, y2,…, ym are mutually independent, the probability of y given X is the following product [2, p. 4]:

$$P(y \mid X, \Theta^t) = \prod_{j=1}^{m} P(y_j \mid x_j, \Theta^t)$$

Consequently, the conditional expectation Q(Θ, Θt) is expanded as below [2, p. 4]:

$$Q(\Theta, \Theta^t) = \sum_{y\in\Psi}\sum_{i=1}^{m}\log\left(\alpha_{y_i}\, p_{y_i}(x_i \mid \theta_{y_i})\right)\prod_{j=1}^{m} P(y_j \mid x_j, \Theta^t)$$

$$= \sum_{y_1=1}^{k}\sum_{y_2=1}^{k}\cdots\sum_{y_m=1}^{k}\sum_{i=1}^{m}\log\left(\alpha_{y_i}\, p_{y_i}(x_i \mid \theta_{y_i})\right)\prod_{j=1}^{m} P(y_j \mid x_j, \Theta^t)$$

(because Ψ denotes all combinational values over Y)

$$= \sum_{y_1=1}^{k}\cdots\sum_{y_m=1}^{k}\sum_{i=1}^{m}\sum_{c=1}^{k}\delta_{c,y_i}\log\left(\alpha_c\, p_c(x_i \mid \theta_c)\right)\prod_{j=1}^{m} P(y_j \mid x_j, \Theta^t)$$

(where δc,yi = 1 if and only if yi = c, and c ∈ {1, 2,…, k} indexes the classes of the potential probabilities pc)

$$= \sum_{c=1}^{k}\sum_{i=1}^{m}\log\left(\alpha_c\, p_c(x_i \mid \theta_c)\right)\sum_{y_1=1}^{k}\cdots\sum_{y_m=1}^{k}\delta_{c,y_i}\prod_{j=1}^{m} P(y_j \mid x_j, \Theta^t)$$

$$= \sum_{c=1}^{k}\sum_{i=1}^{m}\log\left(\alpha_c\, p_c(x_i \mid \theta_c)\right) P(y_i = c \mid x_i, \Theta^t)\prod_{j=1,\, j\neq i}^{m}\sum_{y_j=1}^{k} P(y_j \mid x_j, \Theta^t)$$

$$= \sum_{c=1}^{k}\sum_{i=1}^{m}\log\left(\alpha_c\, p_c(x_i \mid \theta_c)\right) P(y_i = c \mid x_i) \qquad\left(\text{due to } \sum_{y_j=1}^{k} P(y_j \mid x_j, \Theta^t) = 1\right)$$

In general, we have:

$$Q(\Theta, \Theta^t) = \sum_{c=1}^{k}\sum_{i=1}^{m}\log\left(\alpha_c\, p_c(x_i \mid \theta_c)\right) P(y_i = c \mid x_i) \quad (5)$$

Note that P(yi = c | xi) is computed following equation (4). Because equation (5) has two groups of parameters, αc and θc = {μc, Σc}, the conditional expectation Q(Θ, Θt) is split into two parts as below:

$$Q(\Theta, \Theta^t) = \sum_{c=1}^{k}\sum_{i=1}^{m}\log(\alpha_c)\, P(y_i = c \mid x_i) + \sum_{c=1}^{k}\sum_{i=1}^{m}\log\left(p_c(x_i \mid \theta_c)\right) P(y_i = c \mid x_i) \quad (6)$$

Let

$$G(\alpha_c) = \sum_{i=1}^{m}\log(\alpha_c)\, P(y_i = c \mid x_i), \qquad H(\theta_c) = \sum_{i=1}^{m}\log\left(p_c(x_i \mid \theta_c)\right) P(y_i = c \mid x_i)$$

It is easy to recognize that both G(αc) and H(θc) are less than or equal to 0 because 0 ≤ αc, pc(xi|θc) ≤ 1. The expectation Q(Θ, Θt) is maximal when every G(αc) and H(θc) is maximal; in other words, when Q(Θ, Θt) is considered as a function of the variables αc and θc = {μc, Σc}, we look for extreme points αc* and θc* = {μc*, Σc*} that maximize G(αc) and H(μc, Σc). Applying a Lagrange function to G(αc) with the constraint $\sum_{c=1}^{k}\alpha_c = 1$, the extreme point αc* is the solution of the equation formed by setting to zero the first-order derivative of the sum of G(αc) and the Lagrange constraint:

$$\frac{\partial}{\partial\alpha_c}\left(G(\alpha_c) + \lambda\left(\sum_{c=1}^{k}\alpha_c - 1\right)\right) = 0 \qquad (\text{where } \lambda \text{ is the Lagrange multiplier})$$

$$\iff \sum_{i=1}^{m}\frac{P(y_i = c \mid x_i)}{\alpha_c} + \lambda = 0$$

Multiplying by αc gives equation (7), which determines λ:

$$\sum_{i=1}^{m} P(y_i = c \mid x_i) + \alpha_c\lambda = 0 \quad (7)$$

Summing (7) over the k classes {1, 2,…, k}, we have [2, p. 5]:

$$\sum_{c=1}^{k}\sum_{i=1}^{m} P(y_i = c \mid x_i) + \lambda\sum_{c=1}^{k}\alpha_c = 0 \iff m + \lambda = 0 \iff \lambda = -m$$

$$\left(\text{due to } \sum_{c=1}^{k} P(y_i = c \mid x_i) = 1 \text{ and } \sum_{c=1}^{k}\alpha_c = 1\right)$$

Substituting λ = –m into equation (7), the extreme point αc* is totally determined:

$$\alpha_c^* = \frac{1}{m}\sum_{i=1}^{m} P(y_i = c \mid x_i)$$

where P(yi = c | xi) is calculated following equation (4).
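The E-step quantity P(yi = c | xi) in equation (4) is just a normalized, weighted density evaluation. A minimal sketch, assuming Gaussian partial distributions and complete data points, might look as follows; the names are illustrative, not taken from any cited implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, alphas, mus, sigmas):
    """E-step: P(y_i = c | x_i) per equation (4) for every row of X.

    Returns an (m, k) matrix whose rows sum to 1.
    """
    m, k = len(X), len(alphas)
    R = np.empty((m, k))
    for c in range(k):
        R[:, c] = alphas[c] * multivariate_normal.pdf(X, mean=mus[c], cov=sigmas[c])
    return R / R.sum(axis=1, keepdims=True)   # normalize over the k classes
```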


Now we find the other extreme point θc* = {μc*, Σc*}. Supposing the potential distribution pc(xi | θc) is normal, it is expanded as below:

$$p_c(x_i \mid \theta_c) = |2\pi\Sigma_c|^{-\frac{1}{2}}\, e^{-\frac{1}{2}(x_i - \mu_c)^T \Sigma_c^{-1}(x_i - \mu_c)}$$

It follows that

$$H(\mu_c, \Sigma_c) = \sum_{i=1}^{m}\left(-\frac{n}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma_c| - \frac{1}{2}(x_i - \mu_c)^T \Sigma_c^{-1}(x_i - \mu_c)\right) P(y_i = c \mid x_i)$$

(where n is the dimension of sample space X). The first-order partial derivative of H(μc, Σc) with respect to μc is [4, p. 35]:

$$\frac{\partial H(\mu_c, \Sigma_c)}{\partial\mu_c} = \sum_{i=1}^{m}\Sigma_c^{-1}(x_i - \mu_c)\, P(y_i = c \mid x_i)$$

$$\left(\text{due to } \frac{\partial (x_i - \mu_c)^T\Sigma_c^{-1}(x_i - \mu_c)}{\partial\mu_c} = -2\Sigma_c^{-1}(x_i - \mu_c) \text{ when } \Sigma_c \text{ is symmetric}\right)$$

The optimal μc* maximizing H(μc, Σc) is the solution of the equation created by setting this partial derivative to zero. Note that Ø denotes the zero vector (0, 0,…, 0).

$$\frac{\partial H(\mu_c, \Sigma_c)}{\partial\mu_c} = \emptyset \iff \sum_{i=1}^{m}\Sigma_c^{-1}(x_i - \mu_c)\, P(y_i = c \mid x_i) = \emptyset \iff \sum_{i=1}^{m}(x_i - \mu_c)\, P(y_i = c \mid x_i) = \emptyset$$

$$\iff \mu_c^* = \frac{\sum_{i=1}^{m} x_i\, P(y_i = c \mid x_i)}{\sum_{i=1}^{m} P(y_i = c \mid x_i)}$$

The first-order partial derivative of H(μc, Σc) with respect to Σc is:

$$\frac{\partial H(\mu_c, \Sigma_c)}{\partial\Sigma_c} = \sum_{i=1}^{m}\left(-\frac{1}{2}\Sigma_c^{-1} + \frac{1}{2}\Sigma_c^{-1}(x_i - \mu_c)(x_i - \mu_c)^T\Sigma_c^{-1}\right) P(y_i = c \mid x_i)$$

due to

$$\frac{\partial\log|\Sigma_c|}{\partial\Sigma_c} = \Sigma_c^{-1}$$

and

$$\frac{\partial (x_i - \mu_c)^T\Sigma_c^{-1}(x_i - \mu_c)}{\partial\Sigma_c} = \frac{\partial\,\mathrm{tr}\!\left((x_i - \mu_c)(x_i - \mu_c)^T\Sigma_c^{-1}\right)}{\partial\Sigma_c} = -\Sigma_c^{-1}(x_i - \mu_c)(x_i - \mu_c)^T\Sigma_c^{-1}$$

because, as noted in [2, p. 5], (xi – μc)ᵀΣc⁻¹(xi – μc) = tr((xi – μc)(xi – μc)ᵀΣc⁻¹), where tr(A) is the trace operator, the sum of the diagonal elements of matrix A, tr(A) = Σi aii, and the last identity holds when Σc is symmetric and invertible [4, p. 45]. The optimal Σc* maximizing H(μc, Σc) is the solution of the equation created by setting the partial derivative of H(μc, Σc) with respect to Σc to zero. Note that [0] denotes the zero matrix.

$$\frac{\partial H(\mu_c, \Sigma_c)}{\partial\Sigma_c} = [0]$$

$$\iff \sum_{i=1}^{m}\left(-\frac{1}{2}\Sigma_c^{-1} + \frac{1}{2}\Sigma_c^{-1}(x_i - \mu_c)(x_i - \mu_c)^T\Sigma_c^{-1}\right) P(y_i = c \mid x_i) = [0]$$

$$\iff \sum_{i=1}^{m}\left(-\Sigma_c + (x_i - \mu_c)(x_i - \mu_c)^T\right) P(y_i = c \mid x_i) = [0]$$

$$\iff \sum_{i=1}^{m}(x_i - \mu_c)(x_i - \mu_c)^T P(y_i = c \mid x_i) - \Sigma_c\sum_{i=1}^{m} P(y_i = c \mid x_i) = [0]$$

$$\iff \Sigma_c^* = \frac{\sum_{i=1}^{m}(x_i - \mu_c)(x_i - \mu_c)^T P(y_i = c \mid x_i)}{\sum_{i=1}^{m} P(y_i = c \mid x_i)}$$

In general, the optimal parameters αc* and θc* = {μc*, Σc*} for each potential (partial) probability pc form the triple shown as equation (8). Note that P(yi = c | xi) is computed following equation (4).

$$\alpha_c^* = \frac{1}{m}\sum_{i=1}^{m} P(y_i = c \mid x_i), \qquad \mu_c^* = \frac{\sum_{i=1}^{m} x_i\, P(y_i = c \mid x_i)}{\sum_{i=1}^{m} P(y_i = c \mid x_i)}, \qquad \Sigma_c^* = \frac{\sum_{i=1}^{m}(x_i - \mu_c^*)(x_i - \mu_c^*)^T P(y_i = c \mid x_i)}{\sum_{i=1}^{m} P(y_i = c \mid x_i)} \quad (8)$$

Of course, we have k such groups (α1*, θ1*), (α2*, θ2*),…, (αk*, θk*) like (8), and so the global optimal parameter Θ* = {α1*, α2*,…, αk*, θ1*, θ2*,…, θk*} in the M-step of the EM algorithm is totally specified. The basic idea is to divide the global optimal parameter Θ* into k groups of partial parameters (αj*, θj*) and find each group of partial parameters separately.
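The M-step of equation (8) can be written compactly given the responsibility matrix from the E-step sketch above. Again, this is only a sketch under complete-data, Gaussian assumptions; the function names are illustrative.

```python
import numpy as np

def m_step(X, R):
    """M-step: recompute (alpha_c, mu_c, Sigma_c) per equation (8).

    X is an (m, n) data matrix and R an (m, k) responsibility matrix
    with R[i, c] = P(y_i = c | x_i).
    """
    m, n = X.shape
    k = R.shape[1]
    alphas = R.sum(axis=0) / m                     # alpha_c* = (1/m) sum_i P(y_i=c|x_i)
    mus, sigmas = [], []
    for c in range(k):
        w = R[:, c]
        mu_c = (w[:, None] * X).sum(axis=0) / w.sum()     # weighted mean
        D = X - mu_c
        sigma_c = (w[:, None] * D).T @ D / w.sum()        # weighted covariance
        mus.append(mu_c)
        sigmas.append(sigma_c)
    return alphas, mus, sigmas
```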

We now come back to the test on the mean of a normal distribution, where the null hypothesis H0: μ = μ0 is tested against the alternative hypothesis H1: no constraint on μ. The likelihood ratio R(X) is defined as the ratio of the maximum likelihood of the null hypothesis to the maximum likelihood of the alternative hypothesis:

$$R = \frac{L_0^*(X)}{L_1^*(X)}$$

where L0*(X) and L1*(X) are the maximum likelihoods of the null hypothesis and the alternative hypothesis, respectively. This research proposes a 4-step testing process based on the mixture model in order to test the mean of a normal distribution in case of incomplete data.

1. Specify k classes and k respective partial probabilities pc(xc | θc) with c = 1, 2,…, k and θc = {μc, Σc}. The weight αc of partial probability pc is initialized as the ratio of the number of observations Xi belonging to class c to m, the total number of observations:

$$\alpha_c = \frac{\text{the number of } X_i \text{ belonging to class } c}{m}$$

a. Class c1 represents observations whose components xij are all complete. Note that class c1 may not exist.
b. Classes c2, c3,…, ck each represent observations lacking the same components xij.

2. Specify the likelihood functions of the null hypothesis and the alternative hypothesis, L0(X) and L1(X).

a. $L_0(\Theta \mid X) = \prod_{i=1}^{m} P(x_i \mid \Theta) = \prod_{i=1}^{m}\sum_{c=1}^{k}\alpha_c\, p_c(x_i \mid \theta_c)$, where the parameter θc = {μc, Σc} is held constant and is assigned from the hypothesized mean μ0 and the known covariance matrix Σ (or the sample covariance matrix S).
b. $L_1(\Theta \mid X) = \prod_{i=1}^{m} P(x_i \mid \Theta) = \prod_{i=1}^{m}\sum_{c=1}^{k}\alpha_c\, p_c(x_i \mid \theta_c)$.

3. Apply the EM algorithm to find the optimal parameters αc* and θc* = {μc*, Σc*} for each potential (partial) probability pc.

a. Optimal parameter αc* with respect to the null hypothesis: $\alpha_c^* = \frac{1}{m}\sum_{i=1}^{m} P(y_i = c \mid x_i)$. Note that θc = {μc, Σc} is held constant.
b. Optimal parameters αc* and θc* = {μc*, Σc*} with respect to the alternative hypothesis:

$$\alpha_c^* = \frac{1}{m}\sum_{i=1}^{m} P(y_i = c \mid x_i), \qquad \mu_c^* = \frac{\sum_{i=1}^{m} x_i\, P(y_i = c \mid x_i)}{\sum_{i=1}^{m} P(y_i = c \mid x_i)}, \qquad \Sigma_c^* = \frac{\sum_{i=1}^{m}(x_i - \mu_c^*)(x_i - \mu_c^*)^T P(y_i = c \mid x_i)}{\sum_{i=1}^{m} P(y_i = c \mid x_i)}$$

4. Substitute αc* and θc* = {μc*, Σc*} into the likelihood functions L0(X) and L1(X) so that the likelihood ratio is totally determined. It is proved that –2log(R) approximately follows a chi-square distribution χ² with n degrees of freedom when the population covariance Σ is known; thus H0 is rejected in favor of H1 if –2log(R) > χ²α,n at significance level α.

Please pay attention to steps 1 and 3 because they are slightly complicated. Suppose the data sample is an n-dimensional vector space X containing m observation vectors X1, X2,…, Xm, where Xi = (xi1, xi2,…, xin); thus X = {X1, X2,…, Xm} composes a matrix whose rows are the observations Xi, and each Xi conforms to N(μ, Σ). If class c is composed of u non-empty components $a_{c_1}, a_{c_2},\dots, a_{c_u}$, where each index lies in {1,…, n}, then the mean μc contains only the c1th, c2th,…, cuth components, which correspond to the c1th, c2th,…, cuth columns of matrix X. Similarly, the covariance matrix Σc contains only the variances and covariances among the c1th, c2th,…, cuth components, which correspond to the c1th, c2th,…, cuth columns of matrix X. All arithmetical operations in step 3 are performed on these u non-empty components $a_{c_1}, a_{c_2},\dots, a_{c_u}$; this is the key trick of this research (a sketch of this sub-vector and sub-matrix extraction is given below). The mean μc in the likelihood function of the null hypothesis is held constant and is assigned the c1th, c2th,…, cuth components of the hypothesized mean μ0. The covariance matrix Σc in the likelihood function of the null hypothesis is held constant and is assigned the c1th, c2th,…, cuth rows and columns of the population covariance Σ, or of the sample covariance S in case Σ is unknown. It follows that it is not necessary to compute the optimal parameters μc* and Σc* with respect to the likelihood function of the null hypothesis L0(X).
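A minimal sketch of this bookkeeping for steps 1 and 3 is shown below: observations are grouped by their missingness pattern, and for each class the corresponding sub-vector of μ0 and sub-matrix of Σ are extracted so that all arithmetic runs only over the observed components. This is an illustration under stated assumptions (NaN marks a missing value); the helper names are hypothetical, not from any cited library.

```python
import numpy as np

def partition_by_missingness(X):
    """Step 1: group row indices of X by their pattern of observed components.

    Returns a dict mapping a tuple of observed column indices to the list
    of row indices sharing that pattern.
    """
    classes = {}
    for i, row in enumerate(X):
        observed = tuple(np.flatnonzero(~np.isnan(row)))
        classes.setdefault(observed, []).append(i)
    return classes

def null_parameters_for_class(observed, mu0, Sigma):
    """Step 3a helper: restrict mu0 and Sigma to the observed components of a class."""
    idx = np.array(observed)
    return mu0[idx], Sigma[np.ix_(idx, idx)]

# Example with the 4x4 incomplete sample used in the paper (NaN = missing)
X = np.array([[1, 2, np.nan, np.nan],
              [2, 3, np.nan, np.nan],
              [np.nan, np.nan, 1, 2],
              [np.nan, np.nan, 2, 3]], dtype=float)
classes = partition_by_missingness(X)
mu0, Sigma = np.zeros(4), np.eye(4)
for observed, rows in classes.items():
    mu_c, Sigma_c = null_parameters_for_class(observed, mu0, Sigma)
    alpha_c = len(rows) / len(X)                 # initial weight of class c
    print(observed, rows, alpha_c, mu_c, Sigma_c, sep="\n")
```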

For example, sample X has four observations X1 = {x11=1, x12=2, x13 (empty), x14 (empty)}, X2 = {x21=2, x22=3, x23 (empty), x24 (empty)}, X3 = {x31 (empty), x32 (empty), x33=1, x34=2} and X4 = {x41 (empty), x42 (empty), x43=2, x44=3}. These observations form the 4x4 matrix shown in Table 1.

  X1:  x11 = 1   x12 = 2   x13 = ?   x14 = ?
  X2:  x21 = 2   x22 = 3   x23 = ?   x24 = ?
  X3:  x31 = ?   x32 = ?   x33 = 1   x34 = 2
  X4:  x41 = ?   x42 = ?   x43 = 2   x44 = 3

Table 1. An example of incomplete sample X, where the question mark (?) denotes a missing value (or empty value).

Suppose we test the mean of a normal distribution, so each Xi conforms to N(μ, Σ), the null hypothesis is H0: μ = μ0 = (0, 0, 0, 0)T, and H1 places no constraint on μ. Suppose the variance is known:

$$\Sigma = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

Sample X is classified into two classes: class 1 containing X1, X2 and class 2 containing X3, X4. Let p1 and p2 be the partial probabilities attached to class 1 and class 2, respectively. Both p1 and p2 conform to normal distributions, p1 ~ N(μ1, Σ1) and p2 ~ N(μ2, Σ2). Let Y = {Y1, Y2, Y3, Y4} be the set of hidden variables where Yi indicates which class (1 or 2) Xi belongs to. Because X1, X2 belong to class 1 and X3, X4 belong to class 2, we have Y = {Y1 = 1, Y2 = 1, Y3 = 2, Y4 = 2} and, in this example, p1(x1) = 0.5, p1(x2) = 0.5, p1(x3) = 0, p1(x4) = 0, p2(x1) = 0, p2(x2) = 0, p2(x3) = 0.5 and p2(x4) = 0.5. Because class 1 lacks columns (components) 3 and 4 and relates only to columns 1 and 2, the mean μ1 receives the first two components of μ, and the covariance Σ1, containing the variances among columns 1 and 2, is a sub-matrix of Σ. Similarly, the mean μ2 receives the last two components of μ, and the covariance Σ2 is the sub-matrix of Σ containing the variances among columns 3 and 4. We have:

$$\mu_1^0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix},\quad \Sigma_1^0 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix},\quad \mu_2^0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix},\quad \Sigma_2^0 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$


Let α1 and α2 be the weights of the partial probabilities p1 and p2, respectively. Each αc is initialized as the ratio of the number of observations Xi belonging to class c to the sample size. We have:

$$\alpha_1^0 = \frac{2}{4} = 0.5, \qquad \alpha_2^0 = \frac{2}{4} = 0.5$$

At the first iteration of the EM algorithm, the probability of y given X, P(yi | xi), is calculated according to equation (4), for example:

$$P(y_1 = 1 \mid x_1) = \frac{\alpha_1^0\, p_1(x_1)}{\alpha_1^0\, p_1(x_1) + \alpha_2^0\, p_2(x_1)} = \frac{0.5 \times 0.5}{0.25} = 1$$

In the same way, we have P(yi = 1 | xi) = 1 and P(yi = 2 | xi) = 0 for i = 1, 2, while P(yi = 1 | xi) = 0 and P(yi = 2 | xi) = 1 for i = 3, 4. The parameters μ1, Σ1, μ2, Σ2, α1 and α2 are then re-calculated according to equation (8):

$$\alpha_1^* = \frac{1}{4}\sum_{i=1}^{4} P(y_i = 1 \mid x_i) = \frac{1}{4}\left(P(y_1 = 1 \mid x_1) + P(y_2 = 1 \mid x_2) + P(y_3 = 1 \mid x_3) + P(y_4 = 1 \mid x_4)\right) = 0.5$$

$$\alpha_2^* = \frac{1}{4}\sum_{i=1}^{4} P(y_i = 2 \mid x_i) = \frac{1}{4}\left(P(y_1 = 2 \mid x_1) + P(y_2 = 2 \mid x_2) + P(y_3 = 2 \mid x_3) + P(y_4 = 2 \mid x_4)\right) = 0.5$$

$$\mu_1^* = \frac{\sum_{i=1}^{2} x_i\, P(y_i = 1 \mid x_i)}{\sum_{i=1}^{2} P(y_i = 1 \mid x_i)} = \frac{\begin{bmatrix}1\\2\end{bmatrix}\cdot 1 + \begin{bmatrix}2\\3\end{bmatrix}\cdot 1}{1 + 1} = \begin{bmatrix}1.5\\2.5\end{bmatrix}$$

$$\mu_2^* = \frac{\sum_{i=3}^{4} x_i\, P(y_i = 2 \mid x_i)}{\sum_{i=3}^{4} P(y_i = 2 \mid x_i)} = \frac{\begin{bmatrix}1\\2\end{bmatrix}\cdot 1 + \begin{bmatrix}2\\3\end{bmatrix}\cdot 1}{1 + 1} = \begin{bmatrix}1.5\\2.5\end{bmatrix}$$

$$\Sigma_1^* = \frac{\sum_{i=1}^{2}\left(x_i - \mu_1^*\right)\left(x_i - \mu_1^*\right)^T P(y_i = 1 \mid x_i)}{\sum_{i=1}^{2} P(y_i = 1 \mid x_i)} = \begin{bmatrix}0.25 & 0.25\\ 0.25 & 0.25\end{bmatrix}$$

$$\Sigma_2^* = \frac{\sum_{i=3}^{4}\left(x_i - \mu_2^*\right)\left(x_i - \mu_2^*\right)^T P(y_i = 2 \mid x_i)}{\sum_{i=3}^{4} P(y_i = 2 \mid x_i)} = \begin{bmatrix}0.25 & 0.25\\ 0.25 & 0.25\end{bmatrix}$$

where μ1*, Σ1*, μ2*, Σ2*, α1* and α2* are the optimal parameters that maximize the log-likelihood function. Because the deviation between α1* and α1⁰ is zero and the deviation between α2* and α2⁰ is zero, the algorithm stops at the first iteration. Let R(X) be the likelihood ratio:

$$R = \frac{L_0^*(X)}{L_1^*(X)}$$

where L0*(X) and L1*(X) are the maximum likelihoods of the null hypothesis and the alternative hypothesis, respectively. Equivalently,

$$-2\log R = 2\left(\log L_1^* - \log L_0^*\right) = 2\left(\sum_{i=1}^{m}\log\left(\alpha_{y_i}^*\, p_{y_i}(x_i \mid \theta_{y_i})\right) - \sum_{i=1}^{m}\log\left(\alpha_{y_i}^0\, p_{y_i}(x_i \mid \theta_{y_i})\right)\right) = 0$$

(due to $\alpha_{y_i}^* = \alpha_{y_i}^0 = 0.5$ for all yi). Thus, the hypothesis H0: μ0 = (0, 0, 0, 0)T cannot be rejected in favor of H1 (no constraint on μ) at significance level 0.05, because –2logR = 0 < 9.49 = χ²0.05,4.
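The class-wise numbers above are easy to verify numerically; the short sketch below recomputes μ1* and Σ1* for class 1 (observations X1 and X2 restricted to their two observed components), with unit responsibilities as in the example.

```python
import numpy as np

# Class 1 sub-sample: observed components (columns 1 and 2) of X1 and X2
X1c = np.array([[1.0, 2.0],
                [2.0, 3.0]])
w = np.array([1.0, 1.0])            # P(y_i = 1 | x_i) from the first E-step

mu1 = (w[:, None] * X1c).sum(axis=0) / w.sum()
D = X1c - mu1
Sigma1 = (w[:, None] * D).T @ D / w.sum()

print(mu1)      # [1.5 2.5]
print(Sigma1)   # [[0.25 0.25], [0.25 0.25]]
```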

3. Mixture model vs. filling missing values


The filling-missing-values method tries to estimate missing values (or empty values) so as to transform incomplete data into complete data; after that, hypothesis testing is done on the complete data as usual. We discuss some methods of filling missing values and compare them with the mixture model. Suppose the data sample is an n-dimensional vector space containing m observation vectors X1, X2,…, Xm, where Xi = (xi1, xi2,…, xin); thus X = {X1, X2,…, Xm} composes a matrix whose rows are the observations Xi. In case of incomplete data, X is a sparse matrix and an observation Xi is not always a complete vector, because it can lack some components xij. Table 2 is an example of sample X.

        v=1       v=2       v=3       v=4
  u=1   x11 = 1   x12 = 2   x13 = ?   x14 = ?
  u=2   x21 = 2   x22 = 3   x23 = ?   x24 = ?
  u=3   x31 = ?   x32 = ?   x33 = 1   x34 = 2
  u=4   x41 = ?   x42 = ?   x43 = 2   x44 = 3

Table 2. Incomplete sample where the question mark (?) denotes a missing value (or empty value), u denotes the row index and v denotes the column index.

In the easiest way, if each column is considered as a random variable, a missing value is estimated as the mean of its column vector. This is called the average method:

$$x_{ij} = \frac{1}{K}\sum_{k} x_{kj}$$

(where K is the number of non-empty values in column j). According to the average method, we have x31 = x41 = (1+2)/2 = 1.5, x32 = x42 = (2+3)/2 = 2.5, x13 = x23 = (1+2)/2 = 1.5, and x14 = x24 = (2+3)/2 = 2.5. The strong point of the average method is that the mean is easy to calculate and its cost is low, but it rests on the assumption that each column is a random variable, which is not exact because the null hypothesis requires that only the Xi, which are row vectors, be random variables. On the contrary, the mixture model approach adheres strictly to the requirement of the null hypothesis.
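A sketch of the average method on Table 2 is given below; np.nanmean takes column means over the non-empty entries. This is just an illustration of the method described above.

```python
import numpy as np

X = np.array([[1, 2, np.nan, np.nan],
              [2, 3, np.nan, np.nan],
              [np.nan, np.nan, 1, 2],
              [np.nan, np.nan, 2, 3]], dtype=float)

col_means = np.nanmean(X, axis=0)        # column means over non-empty values
X_filled = np.where(np.isnan(X), col_means, X)
print(X_filled)   # missing entries become 1.5 or 2.5, as in the text
```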

Another way to predict empty values is to use Bayes' rule as the estimation tool. It is better than the average method because it takes advantage of the probability distribution underlying the two-dimensional data. Let z = xuv be the random variable that depends on the row index u and the column index v, which, in turn, are random variables. According to Bayes' rule, the probability of z given u and v is:

$$P(z = z_0 \mid u, v) = \frac{P(u, v \mid z_0)P(z_0)}{\sum_{z} P(u, v \mid z)P(z)}$$

Assuming that u and v are mutually independent given z, P(z | u, v) is rewritten as:

$$P(z = z_0 \mid u, v) = \frac{P(u \mid z_0)P(v \mid z_0)P(z_0)}{\sum_{z} P(u \mid z)P(v \mid z)P(z)}$$

Let θ = P(z) = {θ1, θ2,…, θk} be the parameter of P(z | u, v), where θi corresponds to the ith value of z. We have:

$$P(z = z_0 \mid u, v) = \frac{P(u \mid z_0)P(v \mid z_0)\theta_0}{\sum_{z} P(u \mid z)P(v \mid z)\theta_z}$$

The goal of the Bayesian estimation method is to determine P(z | u, v) by finding the optimal parameter θ. The method is an iterative process whose iterations include two steps: calculating P(z | u, v) based on θ and re-calculating θ. Let t and θt denote the tth iteration and the estimate of the parameter θ at the tth iteration; θ0 can be initialized to 0.5. The two steps are described below.

1. The current probability P(z | u, v) is calculated based on the previous estimate of the parameter, θt-1:

$$P(z = z_0 \mid u, v) = \frac{P(u \mid z_0)P(v \mid z_0)\theta_0^{t-1}}{\sum_{z} P(u \mid z)P(v \mid z)\theta_z^{t-1}}$$

2. The current P(z | u, v) is assigned to the current estimate of the parameter θt:

$$\theta^t = P(z \mid u, v)$$

If the deviation |θt – θt-1| is less than a threshold ε, the algorithm stops; otherwise it goes back to step 1.

With the example in Table 2, suppose the ranges of u, v and z are {1, 2, 3, 4}, {1, 2, 3, 4} and {1, 2, 3}; the conditional probabilities P(u | z) and P(v | z) are calculated and shown in Table 3.

        z=1     z=2     z=3
  u=1   0.5     0.25    0
  u=2   0       0.25    0.5
  u=3   0.5     0.25    0
  u=4   0       0.25    0.5
  v=1   0.5     0.25    0
  v=2   0       0.25    0.5
  v=3   0.5     0.25    0
  v=4   0       0.25    0.5

Table 3. Conditional probabilities P(u | z) and P(v | z).

Suppose the parameter is initialized to 0.5, so θ0 = P0(z | u, v) = 0.5; that is, the prior probability of each value of z is 0.5. It is then easy to calculate the posterior probabilities P(z | u, v), for example:

$$P(z = 1 \mid u = 1, v = 1) = \frac{P(u=1 \mid z=1)P(v=1 \mid z=1)\theta_1^0}{P(u=1 \mid z=1)P(v=1 \mid z=1)\theta_1^0 + P(u=1 \mid z=2)P(v=1 \mid z=2)\theta_2^0 + P(u=1 \mid z=3)P(v=1 \mid z=3)\theta_3^0}$$

$$= \frac{0.5 \times 0.5 \times 0.5}{0.5 \times 0.5 \times 0.5 + 0.25 \times 0.25 \times 0.5 + 0 \times 0 \times 0.5} = 0.8$$

In the same way, the probabilities P(z | u, v) at the first iteration, over all values of u, v and z, are calculated and shown in Table 4 (a code sketch reproducing this computation is given after Table 4).

              z=1     z=2     z=3
  u=1, v=1    0.8     0.2     0
  u=1, v=2    0       1       0
  u=1, v=3    0.8     0.2     0
  u=1, v=4    0       1       0
  u=2, v=1    0       1       0
  u=2, v=2    0       0.2     0.8
  u=2, v=3    0       1       0
  u=2, v=4    0       0.2     0.8
  u=3, v=1    0.8     0.2     0
  u=3, v=2    0       1       0
  u=3, v=3    0.8     0.2     0
  u=3, v=4    0       1       0
  u=4, v=1    0       1       0
  u=4, v=2    0       0.2     0.8
  u=4, v=3    0       1       0
  u=4, v=4    0       0.2     0.8

Table 4. Posterior probabilities of z, P(z | u, v).
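The sketch below reproduces the posterior computation of Table 4 for the first iteration, using the conditional probabilities of Table 3 and the uniform initial parameter θ0 = 0.5; the variable names are illustrative.

```python
import numpy as np

# P(u|z) and P(v|z) from Table 3: rows are u (or v) = 1..4, columns are z = 1..3
P_u_given_z = np.array([[0.5, 0.25, 0.0],
                        [0.0, 0.25, 0.5],
                        [0.5, 0.25, 0.0],
                        [0.0, 0.25, 0.5]])
P_v_given_z = P_u_given_z.copy()          # identical by symmetry of this example
theta = np.full(3, 0.5)                   # initial parameter theta^0

def posterior(u, v):
    """P(z | u, v) for 1-based indices u, v, per the Bayesian update above."""
    numer = P_u_given_z[u - 1] * P_v_given_z[v - 1] * theta
    return numer / numer.sum()

print(posterior(1, 1))   # [0.8 0.2 0. ]
print(posterior(3, 1))   # [0.8 0.2 0. ]  -> argmax z=1, so x31 is estimated as 1
```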


At the next iteration, the posterior probabilities of z, P(z | u, v), are computed in the same way except that the parameter θ is assigned the values in Table 4; in other words, the parameter for the next iteration is assigned the current posterior probabilities P(z | u, v). Now we estimate the missing values x31, x41, x32, x42, x13, x23, x14 and x24. The posterior probabilities for x31 = 1, x31 = 2 and x31 = 3 are P(z = 1 | u = 3, v = 1) = 0.8, P(z = 2 | u = 3, v = 1) = 0.2 and P(z = 3 | u = 3, v = 1) = 0, respectively. Because P(z = 1 | u = 3, v = 1) = 0.8 is maximal, x31 receives the value 1, x31 = 1. In the same way, we have x41 = 2, x32 = 2, x42 = 3, x13 = 1, x23 = 2, x14 = 2 and x24 = 3. These estimates are much better than the ones resulting from the average method, even though the algorithm has run only one iteration. Table 5 shows the complete sample whose missing values have been estimated.

        v=1       v=2       v=3       v=4
  u=1   x11 = 1   x12 = 2   x13 = 1   x14 = 2
  u=2   x21 = 2   x22 = 3   x23 = 2   x24 = 3
  u=3   x31 = 1   x32 = 2   x33 = 1   x34 = 2
  u=4   x41 = 2   x42 = 3   x43 = 2   x44 = 3

Table 5. Complete sample whose missing values are estimated by the Bayesian method.

However, the drawback of the Bayesian estimation method in hypothesis testing is that it imposes the three assumptions listed below.
- Sample scalar values xij are considered random values, while in the null hypothesis the vectors Xi are the random variables. A scalar distribution is different from a vector distribution with regard to the modality of the data; when the modality of the data is transformed, the meaning of the data changes unpredictably.
- The row index u and the column index v are also considered random variables, but this assumption does not exist in the null hypothesis.
- The row variable u and the column variable v are assumed to be mutually independent given the sample value xij.
In general, the mixture model approach is better than filling-missing-values approaches for hypothesis testing because it does not impose any


additional assumption on the sample space beyond the assumptions that the null hypothesis itself makes, such as the normality of the sample data. In other words, it conserves the attributes of the sample data.

4. Conclusions

The basic idea of this research is to analyze the global distribution as a set of partial (or potential) distributions. Because it is impossible to apply the global distribution to an incomplete sample, the research applies each partial distribution to a sub-sample of the incomplete sample, where a sub-sample is defined as a piece of the incomplete sample chosen so that it contains complete observation vectors. The partial distributions are unified into a mixture model, and the expectation maximization (EM) algorithm is used to find the optimal parameters of the mixture model, namely the parameters that maximize the likelihood function. The ratio of the maximum likelihood of the null hypothesis to the maximum likelihood of the alternative hypothesis is totally determined from these optimal parameters. Consequently, this ratio is used to test the null hypothesis against the alternative hypothesis in the usual way, by taking advantage of the chi-square distribution. The strength of this research is that there is no requirement to fill the missing values in the incomplete sample. When missing values are replaced by estimated values, the inherent attributes of the sample data are changed or disturbed, even if such estimated values are considered the best-predicted values. In other words, the proposed method maintains the inherent attributes of the sample data. Moreover, the higher the density of missing values is, the more effective this approach is. Especially, if the data incompletion occurs frequently or conforms to a particular period or rule in the process of collecting samples, then the combination of partial distributions and the likelihood ratio test delivers the best results. However, if missing values occur randomly, like salt-and-pepper noise, this approach is not a good choice for testing, because salt-and-pepper noise has no inherent features, and so the partial distributions cannot be specified and estimated precisely.

References
[1] W. Härdle and L. Simar, Applied Multivariate Statistical Analysis, Berlin: Research Data Center, School of Business and Economics, Humboldt University, 2013, p. 486.
[2] J. A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models," University of Washington, Berkeley, 1998.
[3] S. Borman, "The Expectation Maximization Algorithm - A Short Tutorial," Sean Borman's Homepage, 2009.
[4] L. Nguyen, Matrix Analysis and Calculus, 1st ed., C. Evans, Ed., Hanoi: Lambert Academic Publishing, 2015, p. 72.
