IEEE TRANSACTIONS ON NANOBIOSCIENCE, VOL. 16, NO. 1, JANUARY 2017


An Iterative Locally Auto-Weighted Least Squares Method for Microarray Missing Value Estimation

Zeng Yu, Tianrui Li*, Senior Member, IEEE, Shi-Jinn Horng, Member, IEEE, Yi Pan, Senior Member, IEEE, Hongjun Wang, and Yunge Jing

Abstract—Microarray data often contain missing values which significantly affect subsequent analysis. Existing LLSimpute-based imputation methods for dealing with missing data have been shown to be generally efficient. However, none of the LLSimpute-based methods considers the different importance of the different neighbors of the target gene in the missing value estimation process; they treat all neighbors equally. In this paper, a locally auto-weighted least squares imputation (LAW-LSimpute) method is proposed for missing value estimation, which can automatically weight the neighboring genes based on their importance. An accelerating strategy is then added to the LAW-LSimpute method in order to improve its convergence. Furthermore, an iterative missing value estimation framework of LAW-LSimpute (ILAW-LSimpute) is designed. Experimental results show that the ILAW-LSimpute method is able to reduce the estimation error.

Index Terms—Auto-weighted local least squares, iterative estimation, microarray data analysis, missing value estimation.

I. INTRODUCTION

With the development of biotechnology, DNA microarray technology [1] has become an increasingly powerful analysis tool for revealing the mysteries of organisms at the genetic level. Datasets from this technology are widely used in biological and medical studies, e.g., prediction of unknown gene function [2], [3], analysis of gene regulatory networks [4], discovery of pathogenic factors relevant to major diseases [5], and identification of drug efficacy [6], [7]. However, due to a variety of reasons such as hybridization failures and experimental errors, microarray data often contain many missing values [8]. This has a negative impact on optimal extraction of information. Furthermore, the accuracy of downstream analyses such as clustering [9], classification [10], detection of differential expression [11] and partitioning of data [12] is reduced by incomplete data. Therefore, recovery of missing values is necessary. A straightforward approach is to repeat the whole experiment, but this is not always practical because of cost or limited samples. As a result, computation-based estimation of the missing values becomes necessary and crucial.

The simplest ways to handle missing data are to ignore them or to replace missing values by zeros or row averages. These approaches usually perform poorly and lead to high estimation errors because they do not consider the correlation structure of the data [13]. Accordingly, more complex and better performing algorithms have been introduced to exploit such structure, including singular value decomposition imputation (SVDimpute) [14], k-nearest neighbors imputation (KNNimpute) [13], Bayesian principal component analysis (BPCA) [15], least squares imputation (LSimpute) [16], robust least squares estimation with principal components (RLSP) [17] and projection onto convex sets (POCS) [18].

The KNNimpute method is one of the earliest and fastest missing value estimation methods. It first uses a similarity measure (e.g., Euclidean distance or Pearson correlation) to preselect a set of genes nearest to the target gene that contains missing values, and then fills the missing values with a weighted average of the preselected genes. KNNimpute has been found to be more robust and accurate than the SVDimpute method [13]. BPCA utilizes Bayesian estimation to identify the parameters that approximate missing values by a linear combination of the first few principal axis vectors of the microarray data. This method performs well when the missing positions occur independently or a global structure exists in the microarray data; it has also been shown to exhibit better estimation ability than KNNimpute and SVDimpute when the number of samples is large [15]. LSimpute is a regression-based method that performs well when the data have a strong local correlation structure; studies have shown that it outperforms KNNimpute [16]. POCS combines the advantages of KNNimpute and SVDimpute and performs better than both of them [18].
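To make the baseline concrete, here is a minimal sketch of KNN-style imputation as described above (our own illustration, not code from [13]; the helper name and the tie-breaking constant are assumptions):

```python
import numpy as np

def knn_impute_row(data, target, k=10):
    """Minimal KNN-style imputation sketch: fill the missing entries of
    `target` with a weighted average of the k most similar complete rows,
    the weights being reciprocal Euclidean distances (as in KNNimpute)."""
    miss = np.isnan(target)
    obs = ~miss
    # Distances on the observed coordinates only, against complete rows.
    complete = data[~np.isnan(data).any(axis=1)]
    d = np.sqrt(((complete[:, obs] - target[obs]) ** 2).sum(axis=1))
    nn = np.argsort(d)[:k]
    w = 1.0 / (d[nn] + 1e-12)               # reciprocal-distance weights
    filled = target.copy()
    filled[miss] = (w @ complete[nn][:, miss]) / w.sum()
    return filled
```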


To further improve the performance, a new estimation strategy, the local least squares imputation (LLSimpute) method [19], has been developed recently. It is one of the most promising approaches for missing value estimation in the analysis of gene expression profile data. It utilizes least squares optimization to find the missing values by regressing the gene with missing values against the closest k genes. It is also noteworthy that an automatic parameter selection algorithm is designed by adopting an automatic k-value estimator. Studies have shown that LLSimpute usually provides better performance than the KNNimpute and BPCA methods [19].

Since the appearance of the LLSimpute algorithm, various imputation methods based on LLSimpute have been proposed; all of them utilize the least squares principle to find the missing values. To use previously estimated values, Cai et al. [20] developed an iterated local least squares imputation (ILLSimpute) method. This method adopts an iterative approach to improve the accuracy of missing value estimation, but it may cause high estimation errors when the number of iterations is high. Zhang et al. [21] proposed a sequential local least squares imputation (SLLSimpute) method. It is a sequential approach which starts imputation from the genes with the minimum missing rate and then partially utilizes these estimated values to impute the genes with higher missing rates. After that, Ching et al. [22] introduced a weighted local least squares imputation (WLLSimpute) method, in which a new objective function is established by combining the row average (RA) method and the LLSimpute method with a weighting factor that ranges from 0 to 1. In WLLSimpute, missing values are obtained by minimizing the new objective function, and the weighting factor is identified by training on the known data. Most recently, Cheng et al. [23] presented an iterative bicluster-based least squares (Bi-ILSimpute) method. It first employs biclustering to preselect the most similar genes in a subset of conditions, and then estimates the missing values with the preselected genes under the framework of local least squares imputation. An iterative framework for missing value estimation is also developed to further improve the estimation accuracy.

Generally, regression analysis is a necessary step in the LLSimpute-based methods, and the regression results are used to estimate the missing values. However, none of the LLSimpute-based methods considers the different importance of the different neighbors of the target gene in the regression analysis; all neighbors are treated equally. In reality, different neighbors of the target gene can have varying importance. This idea has been emphasized by some well-known missing value estimation methods such as KNNimpute [13] and the local weighted linear approximation imputation (LWLAimpute) method [24]. KNNimpute fills missing values with a weighted average of the neighboring genes, where the weight is the reciprocal of the Euclidean distance between the target gene and its neighbor: the smaller the distance, the higher the weight and the greater the apparent importance. LWLAimpute linearly approximates the missing part of the target gene, employing a monotonically decreasing weighting function to tailor the candidate genes by emphasizing nearby genes in the approximation. With this weighting function, genes closer to the target gene become more important. It has been pointed out that KNNimpute is a special case of LWLAimpute [24].

In this paper, we consider that different neighboring genes of the target gene make different contributions to the identification of the regression coefficients. The contribution of a neighboring gene is represented as a weight that can be treated as its degree of contribution to the regression analysis. Thus, we propose a locally auto-weighted least squares imputation (LAW-LSimpute) method for estimating missing values, which automatically weights the neighboring genes based on their importance in the regression analysis. For each target gene with missing values, LAW-LSimpute starts by preselecting the neighboring genes, and then uses the local least squares method to find the missing values by implementing a locally auto-weighted regression on these neighbors. In LAW-LSimpute, we add a new step to the basic LLSimpute method to update the neighboring gene weights based on the previous regression, and we present a new weight calculation formula that minimizes the square of the Frobenius norm of the residual matrix. In order to improve the convergence, an accelerating strategy is added to the LAW-LSimpute method. In addition, an iterative missing value estimation framework of LAW-LSimpute (ILAW-LSimpute) is designed in this paper. The performance of ILAW-LSimpute is empirically compared with that of five other popular imputation methods; the results show that the new algorithm is able to reduce the estimation error. The main characteristics of the ILAW-LSimpute algorithm are as follows:

• Flexibility and generalization. Our new model is established by multiplying the objective function of the classic LLSimpute algorithm by a diagonal weight matrix. Compared with the classic LLSimpute algorithm, the new model therefore has two advantages. First, it has the capability to represent many cases of the weighted LLSimpute algorithm, that is, the new model is flexible. Second, as will be discussed in Section III, the classic LLSimpute algorithm is a special case of our model, i.e., the new model is a generalization of the classic LLSimpute algorithm.

• Accuracy and reasonability. Most traditional algorithms for solving the weighted least squares regression problem set the weight formula empirically. In this paper, the weight formula is analytic rather than empirical: we derive the weight calculation formula by employing the least squares principle and the Lagrange method to iteratively solve a constrained optimization problem. Hence, in our proposed algorithm, the weight calculation formula for the weighted least squares regression problem is more accurate and reasonable.

The rest of this paper is organized as follows. Section II introduces the basic LLSimpute algorithm. Section III presents the ILAW-LSimpute algorithm.

Experiments on eight real datasets are shown in Section IV. Conclusions together with some further studies are summarized in the last section.

II. REVIEW OF LOCAL LEAST SQUARES IMPUTATION

In this section, we briefly describe the classic LLSimpute method proposed by Kim et al. [19]. We use $G \in \mathbb{R}^{m \times n}$ to denote the gene expression matrix with m genes and n experiments, and assume $m \gg n$. In the matrix G, the row vector $g_l^T \in \mathbb{R}^{1 \times n}$, l = 1, 2, ..., m, denotes the expression level of the l-th gene over the n experiments. Let $g_t^T \in \mathbb{R}^{1 \times n}$ be the target gene with p missing values. We assume without loss of generality that all the missing values occur in the first p positions of the target gene, i.e., $g_t^T = (u^T \; v^T)$, where $u^T$ is a $1 \times p$ vector representing the p missing values in the target gene and $v^T$ is a $1 \times (n-p)$ vector representing the non-missing values.

For each target gene that has missing values, the LLSimpute method first preselects the k most similar genes as its neighbors, based on either the L2-norm or Pearson correlation coefficients, and then obtains the missing values by implementing a regression of the target gene against its neighbors. To estimate the missing part $u^T$ of the target gene, the k nearest neighbor genes $g_{S_i}^T \in \mathbb{R}^{1 \times n}$, i = 1, 2, ..., k, are found and rearranged as follows:

$$\begin{pmatrix} g_{S_1}^T \\ \vdots \\ g_{S_k}^T \end{pmatrix} = (B \;\; A) \quad (1)$$

where B is a $k \times p$ matrix consisting of the first p positions of the k nearest neighbor genes and A is a $k \times (n-p)$ matrix consisting of the remaining $n-p$ positions. After the matrices A and B are formed, the least squares problem for missing value estimation can be constructed as

$$\min_{Y} \|AY - B\|_F \quad (2)$$

The closed-form solution of (2) is given by $\hat Y = A^{\dagger} B$, where $A^{\dagger}$ is the pseudoinverse of A. Therefore, the missing part $u^T$ of the target gene can be estimated as a linear combination of the values of the non-missing part $v^T$, that is,

$$u^T = v^T \hat Y = v^T A^{\dagger} B \quad (3)$$

It should be pointed out that a heuristic algorithm has been adopted to identify the only parameter k in LLSimpute, so LLSimpute can be viewed as a non-parametric missing value estimation method like BPCA.
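To make the review concrete, the LLSimpute core of (1)-(3) can be sketched in a few lines of Python (a minimal sketch, not the authors' implementation; it assumes the k neighbors are complete rows):

```python
import numpy as np

def lls_impute_row(target, neighbors):
    """LLSimpute core per equations (1)-(3): regress the target's observed
    part on its k nearest neighbors and transfer the fit to the missing part.
    `target` is a length-n vector with NaNs; `neighbors` is a complete (k, n)
    array holding the k most similar genes."""
    miss = np.isnan(target)             # positions of u^T
    obs = ~miss                         # positions of v^T
    B = neighbors[:, miss]              # k x p
    A = neighbors[:, obs]               # k x q
    Y_hat = np.linalg.pinv(A) @ B       # least squares solution of (2)
    filled = target.copy()
    filled[miss] = target[obs] @ Y_hat  # u^T = v^T A^† B, equation (3)
    return filled
```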


III. ILAW-LSIMPUTE ALGORITHM FOR MISSING VALUE ESTIMATION

It is well known that all of the LLSimpute-based methods utilize the least squares framework to fill missing values. These least squares formulations, similar to the problem described by (2), are in fact classical regression analysis problems. As discussed previously, this regression analysis does not consider the different importance of the different neighbors of the target gene: all neighbors are treated equally. Therefore, in this paper, we propose the LAW-LSimpute method for missing value estimation. It can automatically distinguish the importance of the different neighbors of the target gene by employing a weighting function which emphasizes nearby genes in the regression analysis. To further increase the accuracy of missing value estimation, an iterative framework of LAW-LSimpute (ILAW-LSimpute) is also developed. The proposed algorithm is based on the following key assumption:

Assumption: The k closest genes of the target gene, lying at different distances, can have different importance in the regression analysis.

A. Mathematical Formulation of the LAW-LSimpute Method

For each target gene $g_t^T$, we generally assume that there are p missing values occurring in the first p positions and $q = n - p$ non-missing values. Let $u^T$ be the missing part of the target gene and $v^T$ the non-missing part, so that $g_t^T = (u^T \; v^T)$. Similar to the LLSimpute method, the LAW-LSimpute method first preselects the k closest genes of the target gene based on Euclidean distance and constructs the matrices A and B from the preselected k genes, and then utilizes the least squares principle to find the missing values by implementing a locally auto-weighted regression on the k closest genes. Based on the constructed matrices A and B, LAW-LSimpute seeks the weight matrix W(k) and coefficient matrix Y that minimize the objective function P:

$$\min_{W(k),\,Y} P(W(k), Y) = \|W(k)(AY - B)\|_F^2 \quad (4)$$

subject to

$$\sum_{i=1}^{k} w_i(k) = 1 \quad (5)$$

where Y is a $q \times p$ matrix containing the regression coefficients, W(k) is a $k \times k$ diagonal weight matrix representing the different importance of the neighboring genes of the target gene, and $w_i(k)$ is the i-th diagonal element of W(k). In practice, it is difficult to solve the problem P directly. However, if the weight matrix $\hat W(k)$ is fixed, the least squares solution of problem P is given by

$$\hat Y = [\hat W(k) A]^{\dagger} \hat W(k) B \quad (6)$$

where $[\hat W(k) A]^{\dagger}$ is the pseudoinverse of $\hat W(k) A$. Thus, the missing part $u^T$ of the target gene can be approximated as

$$u^T = v^T \hat Y = v^T [\hat W(k) A]^{\dagger} \hat W(k) B \quad (7)$$

The normalized root mean squared error (NRMSE) used in [22] and [23] is employed to evaluate the performance of the proposed algorithm:

$$\mathrm{NRMSE} = \sqrt{ \frac{ \frac{1}{|S|} \sum_{(i,j) \in S} (g_{ij} - \hat g_{ij})^2 }{ \frac{1}{|S|-1} \sum_{(i,j) \in S} (g_{ij} - \bar g)^2 } } \quad (8)$$

where $g_{ij}$ is the true value, $\hat g_{ij}$ is the estimated value, S is the set of missing entries, |S| is the cardinality of the set S, and $\bar g$ is the average of all missing entries.
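As an illustration of how (6)-(8) fit together, here is a hedged sketch with a fixed weight vector (the alternating update of the weights is derived in Section III-B; the function names are ours):

```python
import numpy as np

def law_ls_estimate(v, A, B, w, beta=2.0):
    """One weighted regression step per (6)-(7): given fixed weights w
    (summing to 1), form W(k) = diag(w_i^beta) and estimate the missing
    part as u^T = v^T [W A]^† W B."""
    W = np.diag(w ** beta)
    Y_hat = np.linalg.pinv(W @ A) @ (W @ B)
    return v @ Y_hat

def nrmse(true_vals, est_vals):
    """NRMSE per (8), computed over the set S of missing entries."""
    true_vals, est_vals = np.asarray(true_vals), np.asarray(est_vals)
    num = np.mean((true_vals - est_vals) ** 2)
    den = np.var(true_vals, ddof=1)  # (1/(|S|-1)) sum (g - g_bar)^2
    return np.sqrt(num / den)
```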


B. LAW-LSimpute Method

In order to solve the problem P, a new iterative solution method similar to the one used in [25] and [26] is presented in this section. Denote

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1q} \\ a_{21} & a_{22} & \cdots & a_{2q} \\ \vdots & \vdots & \ddots & \vdots \\ a_{k1} & a_{k2} & \cdots & a_{kq} \end{pmatrix}, \quad B = \begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1p} \\ b_{21} & b_{22} & \cdots & b_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ b_{k1} & b_{k2} & \cdots & b_{kp} \end{pmatrix}, \quad Y = \begin{pmatrix} y_{11} & y_{12} & \cdots & y_{1p} \\ y_{21} & y_{22} & \cdots & y_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ y_{q1} & y_{q2} & \cdots & y_{qp} \end{pmatrix},$$

and

$$W(k) = \operatorname{diag}\big( w_1^{\beta}(k),\, w_2^{\beta}(k),\, \ldots,\, w_k^{\beta}(k) \big)$$

Therefore, the objective function P can be simplified as follows:

$$\min_{W(k),\,Y} P(W(k), Y) = \sum_{i=1}^{k} w_i^{2\beta}(k) \sum_{j=1}^{p} \Big( \sum_{l=1}^{q} a_{il} y_{lj} - b_{ij} \Big)^2 \quad (9)$$

where β, a smoothing parameter for each element of the weight matrix W(k), makes the weights change slowly. Hence, the problem P can equivalently be rewritten in matrix form as

$$P(W(k), Y) = \|W(k)(AY - B)\|_F^2 = \Big\| \operatorname{diag}\big(w_1^{\beta}(k), \ldots, w_k^{\beta}(k)\big)\,(AY - B) \Big\|_F^2 \quad (10)$$

The above optimization problem P can be solved by iteratively solving the following two minimization problems:

1. Problem P1: fix $W(k) = \hat W(k)$ and solve the reduced problem $P_1(\hat W(k), Y)$;
2. Problem P2: fix $Y = \hat Y$ and solve the reduced problem $P_2(W(k), \hat Y)$.

In fact, problem P1 is a classical regression problem: for given $\hat W(k)$, the least squares solution of problem P1 is given by (6). The solution of problem P2 is given in Theorem 1.

Theorem 1: Let $Y = \hat Y$ be fixed.

1. When β > 1/2 or β < 0, $P_2(W(k), \hat Y)$ is minimized iff

$$\hat w_i(k) = \begin{cases} 0 & \text{if } E_i = 0 \\[4pt] \dfrac{1}{\sum_{t=1,\,E_t \neq 0}^{k} \left( E_i / E_t \right)^{\frac{1}{2\beta - 1}}} & \text{if } E_i \neq 0 \end{cases} \quad (11)$$

where

$$E_i = \sum_{j=1}^{p} \Big( \sum_{l=1}^{q} a_{il} \hat y_{lj} - b_{ij} \Big)^2 \quad (12)$$

2. When β = 1/2, $P_2(W(k), \hat Y)$ is minimized iff

$$\hat w_{i'}(k) = 1 \quad \text{and} \quad \hat w_i(k) = 0 \ \text{for all } i \neq i' \quad (13)$$

where $i' = \arg\min_i \{E_i\}$, i = 1, 2, ..., k.

Proof: 1. We rewrite problem P2 as

$$P_2(W(k), \hat Y) = \sum_{i=1}^{k} w_i^{2\beta}(k) \sum_{j=1}^{p} \Big( \sum_{l=1}^{q} a_{il} \hat y_{lj} - b_{ij} \Big)^2 = \sum_{i=1}^{k} w_i^{2\beta}(k) E_i \quad (14)$$

subject to $\sum_{i=1}^{k} w_i(k) = 1$ and $0 \leq w_i(k) \leq 1$, where $E_1, E_2, \ldots, E_k$ are k constants when $\hat Y$ is fixed. If $E_i = 0$, the i-th neighbor makes no contribution to the objective function of the regression analysis, and we assign $\hat w_i(k) = 0$ to that neighbor. If $E_i \neq 0$, the Lagrange multiplier method is adopted to solve problem P2 with the constraint condition $\sum_{i=1}^{k} w_i(k) = 1$. Let λ be the multiplier and

$$\Phi(W(k), \lambda) = \sum_{i=1}^{k} w_i^{2\beta}(k) E_i + \lambda \Big( \sum_{i=1}^{k} w_i(k) - 1 \Big) \quad (15)$$

If $(\hat W(k), \hat\lambda)$ is to minimize $\Phi(W(k), \lambda)$, its gradient in both sets of variables must vanish. Thus,

$$\frac{\partial \Phi(\hat W(k), \hat\lambda)}{\partial \hat w_i(k)} = 2\beta\, \hat w_i^{2\beta - 1}(k)\, E_i + \hat\lambda = 0 \quad (16)$$

$$\frac{\partial \Phi(\hat W(k), \hat\lambda)}{\partial \hat\lambda} = \sum_{i=1}^{k} \hat w_i(k) - 1 = 0 \quad (17)$$

It follows from (16) that

$$\hat w_i(k) = \left( \frac{-\hat\lambda}{2\beta E_i} \right)^{\frac{1}{2\beta - 1}} \quad \text{for all } 1 \leq i \leq k \quad (18)$$

Substituting (18) into (17), we obtain

$$\sum_{t=1,\,E_t \neq 0}^{k} \left( \frac{-\hat\lambda}{2\beta E_t} \right)^{\frac{1}{2\beta - 1}} = 1 \quad (19)$$

From (19), we derive

$$\left( -\hat\lambda \right)^{\frac{1}{2\beta - 1}} = 1 \Bigg/ \sum_{t=1,\,E_t \neq 0}^{k} \left( \frac{1}{2\beta E_t} \right)^{\frac{1}{2\beta - 1}} \quad (20)$$

Hence, it follows from (18) and (20) that

$$\hat w_i(k) = \frac{1}{\sum_{t=1,\,E_t \neq 0}^{k} \left( E_i / E_t \right)^{\frac{1}{2\beta - 1}}} \quad (21)$$

2. It is clear that, when $w_i(k) = 1$, the corresponding objective function value is equal to $E_i$ (note that for β = 1/2, $P_2(W(k), \hat Y) = \sum_{i=1}^{k} w_i(k) E_i$). Let $(w_1(k), w_2(k), \ldots, w_k(k))$ be a feasible solution of problem P2. Since $\sum_{i=1}^{k} w_i(k) = 1$ and the feasible solution space ($\sum_{i=1}^{k} w_i(k) = 1$ and $0 \leq w_i(k) \leq 1$ for all $1 \leq i \leq k$) is convex, we have

$$E_{i'} \leq \sum_{i=1}^{k} w_i(k) E_i \quad (22)$$

Therefore, we set $\hat w_{i'}(k) = 1$ and $\hat w_i(k) = 0$ for all $i \neq i'$, and the optimal solution is determined. ∎

Given a target gene, the principle for neighbor gene weighting is to assign a larger weight to a neighbor gene that has a smaller residual sum of squares and a smaller weight to a neighbor gene that has a larger residual sum of squares. The residual sum of squares is given by (12) in Theorem 1 and the weight $\hat w_i(k)$ is calculated by (11). However, the effective weight $\hat w_i^{\beta}(k)$ also depends on the value of β. According to the above principle, we can discuss which values to choose for β: a) when 0 < β < 1/2, the larger $E_i$, the larger $\hat w_i(k)$, and hence the larger $\hat w_i^{\beta}(k)$; this is against the neighbor weighting principle, so we cannot choose 0 < β < 1/2; b) when β > 1/2, the larger $E_i$, the smaller $\hat w_i(k)$, and the smaller $\hat w_i^{\beta}(k)$; the effect of a gene $g_{S_i}^T$ with large $E_i$ is reduced; c) when β < 0, the larger $E_i$, the larger $\hat w_i(k)$; however, $\hat w_i^{\beta}(k)$ becomes smaller and gives the neighbor less weight because of the negative β. Hence, we can choose β > 1/2 or β < 0.


The LAW-LSimpute method proceeds as follows.

Algorithm 1 (LAW-LSimpute algorithm):

Step 0. For each target gene, select the k closest genes from the original gene expression matrix $G \in \mathbb{R}^{m \times n}$ and construct the matrices A and B. Let ε > 0 be the stopping tolerance and set the maximum number of iterations MaxItNum.

Step 1. Randomly choose an initial weight matrix $W^0(k)$. Determine $Y^0$ such that $P(W^0(k), Y^0)$ is minimized and compute $ErrorF = \|AY^0 - B\|_F$. Set $MinErrorF = ErrorF$, $Y_{min} = Y^0$ and t = 0.

Step 2. Let $\hat W(k) = W^t(k)$, obtain $Y^{t+1}$ such that $P(\hat W(k), Y)$ is minimized, and compute $ErrorF = \|AY^{t+1} - B\|_F$. If $ErrorF < MinErrorF$, set $MinErrorF = ErrorF$ and $Y_{min} = Y^{t+1}$. If $|P(\hat W(k), Y^{t+1}) - P(\hat W(k), Y^t)| < ε$ or t = MaxItNum, go to Step 4; otherwise, go to Step 3.

Step 3. Let $\hat Y = Y^{t+1}$ and obtain $W^{t+1}(k)$ such that $P(W(k), \hat Y)$ is minimized. If $|P(W^{t+1}(k), \hat Y) - P(W^t(k), \hat Y)| < ε$, go to Step 4; otherwise, set t = t + 1 and go to Step 2.

Step 4. If t = MaxItNum, go to Step 5; otherwise, estimate the missing values by $u^T = v^T Y^{t+1}$ and go to Step 6.

Step 5. If $|P(\hat W(k), Y^{t+1}) - P(\hat W(k), Y^t)| < ε$, estimate the missing values by $u^T = v^T Y^{t+1}$ and go to Step 6; otherwise, estimate the missing values by $u^T = v^T Y_{min}$ and go to Step 6.

Step 6. If all the missing values have been estimated, output the NRMSE and stop; otherwise, go to Step 0.

Remark 1: The proposed method can be regarded as an extension of the LLSimpute method. First, if β is set to 0, the objective function (4) is equivalent to (2), regardless of $\hat w_i(k)$; that is, the classic LLSimpute algorithm is a special case of our proposed method. Second, if all the elements of the weight matrix W(k) are equal, the regression results of the proposed method and the LLSimpute method are the same. Therefore, the proposed algorithm can be regarded as an extension of the LLSimpute method, and it may inherit the advantages of LLSimpute.

Remark 2: In LAW-LSimpute, an accelerating principle is added to improve convergence. The principle is that if the objective function (4) reaches its minimum, we fill the missing values by $u^T = v^T Y^{t+1}$; otherwise, we find the missing values by searching, among the candidate solutions of (4), for the one that minimizes the objective function (2). This principle guarantees that the proposed algorithm converges even if the objective function (4) does not reach its minimum, and in the worst case it performs as well as the LLSimpute method.
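The alternating scheme of Algorithm 1 can be sketched as follows (a simplified sketch under the assumption β > 1/2; it omits the ErrorF bookkeeping of the accelerating strategy and assumes complete neighbor rows):

```python
import numpy as np

def law_ls_impute_row(target, neighbors, beta=2.0, eps=1e-6, max_iter=50):
    """Alternating minimization of P(W(k), Y) in (4)-(5), assuming beta > 1/2.
    P1: fix W, solve Y by weighted least squares, equation (6).
    P2: fix Y, update the weights by Theorem 1, equations (11)-(12)."""
    miss = np.isnan(target)
    A = neighbors[:, ~miss]                 # k x q
    B = neighbors[:, miss]                  # k x p
    k = neighbors.shape[0]
    w = np.full(k, 1.0 / k)                 # initial weights, summing to 1
    prev = np.inf
    for _ in range(max_iter):
        W = np.diag(w ** beta)
        Y = np.linalg.pinv(W @ A) @ (W @ B)     # problem P1, equation (6)
        E = ((A @ Y - B) ** 2).sum(axis=1)      # residuals E_i, equation (12)
        nz = E > 0
        w = np.zeros(k)                         # w_i = 0 where E_i = 0
        ratios = (E[nz, None] / E[None, nz]) ** (1.0 / (2 * beta - 1))
        w[nz] = 1.0 / ratios.sum(axis=1)        # equation (11)
        obj = float((w ** (2 * beta) * E).sum())  # objective per (9)/(14)
        if abs(prev - obj) < eps:
            break
        prev = obj
    filled = target.copy()
    filled[miss] = target[~miss] @ Y            # u^T = v^T Y, as in (7)
    return filled
```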

C. An Iterative Framework of the LAW-LSimpute Method: ILAW-LSimpute

In order to use previously estimated values, an iterative version of the LAW-LSimpute method, ILAW-LSimpute, is designed in this section. This iterative framework uses a statistical criterion to update the estimates: we employ the concept of uncertainty, also used in [23] and [27]. For a target gene that contains p missing values, the j-th (j = 1, 2, ..., p) missing value is replaced in the current iteration only if its uncertainty decreases. The uncertainty θ is calculated as

$$\theta = t_{\alpha/2,\,u-r} \sqrt{\left( v^T (A^T A)^{-1} v + 1 \right) \hat\sigma^2} \quad (23)$$

where $t_{\alpha/2,\,u-r}$ is the value of the t-distribution at significance level α with $u - r$ degrees of freedom, and u and r are the numbers of rows and columns of matrix A, respectively. $\hat\sigma^2$ is the unbiased estimator of the noise variance in the regression model, i.e.,

$$\hat\sigma^2 = (B_j - A\hat Y_j)^T (B_j - A\hat Y_j) / (u - r) \quad (24)$$

where $B_j$ is the j-th column of matrix B and $\hat Y_j$ is the j-th column of matrix $\hat Y$. From (23), we can see that a small θ implies a small deviation between the estimated value and its actual value. Moreover, by the replacement condition, the uncertainty values form a non-increasing sequence. Hence, one can conclude that the iterative process terminates and the estimated values converge. In practice, the iterative process is terminated when the average change in the estimates is insignificant. To further control the number of iterations, we set a maximum number of iterations in the implementation.

Algorithm 2 (ILAW-LSimpute algorithm):

Step 0. Let MaxItNum be the maximum number of iterations and set t = 0.

Step 1. For each target gene that contains p missing values, use the LAW-LSimpute algorithm to obtain the coefficient matrix $\hat Y$. For the j-th (j = 1, 2, ..., p) missing value, compute the uncertainty θ. If the uncertainty has decreased, replace the j-th missing value.

Step 2. If all the genes have been estimated in the current iteration, go to Step 3; otherwise, go to Step 1.

Step 3. If the average change in the estimates is insignificant or t = MaxItNum, output the NRMSE and stop; otherwise, set t = t + 1 and go to Step 1.

From the above, Algorithm 2 consists of two main parts: the optimization of the coefficient matrix $\hat Y$ as shown in Algorithm 1, and the computation of the uncertainty θ. Based on (9), the computational complexity of Algorithm 1 is $O(t_1 m k p q^2)$, where $t_1$ is the number of iterations and m is the number of genes that have missing values. From (23), the computational complexity of the uncertainty θ is $O(k^2 q + k^3)$. Therefore, the complexity of the ILAW-LSimpute method is $O(t_1 t_2 m k p q^2 + t_2 (k^2 q + k^3))$, where $t_2$ is the number of iterations in Algorithm 2. Let $n = \max\{k, p, q\}$ and $t = t_1 t_2 + t_2$. Then $O(t_1 t_2 m k p q^2 + t_2(k^2 q + k^3)) \leq O(t m n^4)$, i.e., the complexity of the ILAW-LSimpute method is at most $O(t m n^4)$.
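A sketch of the uncertainty test (23)-(24) used in Step 1 of Algorithm 2 (our own helper; we assume SciPy's t-distribution for the quantile and an invertible $A^T A$):

```python
import numpy as np
from scipy import stats

def uncertainty(A, B_j, Y_j, v, alpha=0.05):
    """Uncertainty theta of one estimated value per (23)-(24).
    A is the u x r neighbor matrix, B_j and Y_j are the j-th columns of
    B and Y_hat, and v is the observed part of the target gene."""
    u, r = A.shape
    resid = B_j - A @ Y_j
    sigma2 = float(resid @ resid) / (u - r)          # equation (24)
    t_val = stats.t.ppf(1 - alpha / 2, df=u - r)     # t_{alpha/2, u-r}
    return t_val * np.sqrt((v @ np.linalg.inv(A.T @ A) @ v + 1) * sigma2)

# ILAW-LSimpute replaces the j-th estimate only if this theta has
# decreased relative to the previous iteration (Algorithm 2, Step 1).
```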

IV. EXPERIMENTAL RESULTS

A. Data Sets

To evaluate the performance of our new method, eight different data sets are employed in our experiments. The first two come from a study on the identification of cell-cycle-regulated genes in the yeast Saccharomyces cerevisiae [28].

We select the alpha-factor samples to construct the first dataset, called SP.Alpha, and combine the cdc15 and cdc28 samples to generate the second dataset, called SP.Cdc. The third data set, Infection [29], is obtained from a gene-expression program reflecting the innate immune response of intestinal epithelial cells to infection by Listeria monocytogenes. To ensure the effectiveness of all the LLSimpute-based methods used in our experiments, we remove seven columns that have many zero values from the Infection data set. The fourth data set, Ronen [33], includes two time series in yeast from a study of the response to environmental changes. The fifth data set, Yo.Calcineurin [30], is from the genome-wide analysis of gene expression regulated by the calcineurin/Crz1p signaling pathway in Saccharomyces cerevisiae. The sixth data set, Lymphoma [31], comes from the study of distinct types of diffuse large B-cell lymphoma identified by gene expression profiling; we selected the data of Fig. 1 in [31] as the Lymphoma data set. The seventh data set, Gasch [32], is obtained from the study of genomic expression programs in the responses of yeast cells to environmental changes. The last data set, Ogawa [34], is a non-time-series genomic DNA microarray data set from yeast.

Each data set is preprocessed for the evaluation of the imputation methods, as sketched below. We remove the columns that have more than r% missing values (r is a user-specified number) and then select the gene rows with no missing values to construct the “complete” matrices. To simulate the real situation, different missing rates are randomly generated in the “complete” matrices: 1%, 5%, 10%, 15%, 20%, 25% and 30%, respectively. A description of the eight data sets is given in Table I.

Note that the dataset SP.Alpha was used to evaluate the performance of LLSimpute [19], ILLSimpute [20], Bi-ILSimpute [23] and recursive mutual imputation (RMI) [36]. SP.Cdc was used to assess Bi-ILSimpute and LWLAimpute [24]. Infection, Yo.Calcineurin, Ronen and Ogawa were used to test bicluster-based BPCA (Bi-BPCA) [35]. Lymphoma and Gasch were used to evaluate the performance of LWLAimpute. Ronen was also used to test Bi-ILSimpute and RMI.
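The preprocessing and masking procedure described above can be sketched as follows (our own reading of the procedure; the names and the handling of the r threshold are assumptions):

```python
import numpy as np

def make_complete_and_mask(data, r_pct=20.0, missing_rate=0.05, seed=0):
    """Build a 'complete' matrix as described above, then hide a given
    fraction of its entries to simulate missing values."""
    # Drop columns with more than r_pct % missing values.
    col_miss = np.isnan(data).mean(axis=0) * 100
    data = data[:, col_miss <= r_pct]
    # Keep only rows with no missing value ("complete" matrix).
    complete = data[~np.isnan(data).any(axis=1)]
    # Randomly mask entries at the requested missing rate.
    rng = np.random.default_rng(seed)
    mask = rng.random(complete.shape) < missing_rate
    masked = complete.copy()
    masked[mask] = np.nan
    return complete, masked, mask
```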

B. Performance Comparison of ILAW-LSimpute With Other Methods

The proposed algorithm is compared with five popular imputation methods: the LLSimpute method [19], the ILLSimpute method [20], the Bi-ILSimpute method [23], the Bi-BPCA method [35] and the RMI method [36]. Because the ILLSimpute method is an unstable imputation algorithm, the best result over its iterations is selected for comparison; in practice, ILLSimpute is run for the first few iterations only, so as to weigh the computational cost against the performance. The accuracy of the estimation results is evaluated by the average of the NRMSE defined by (8); to obtain the average NRMSE values, all the experiments are conducted five times. For the dataset Ronen, we report the average NRMSE values and their standard deviations in Table II, and the average computation times and their standard deviations in Table III.


Fig. 1. Average NRMSE of the LLSimpute, the ILLSimpute, the Bi-ILSimpute, the Bi-BPCA, the RMI, the LAW-LSimpute and the ILAW-LSimpute at different missing rates for eight microarray datasets (a) SP.Alpha; (b) SP.Cdc; (c) Infection; (d) Ronen; (e) Yo.Calcineurin; (f) Lymphoma; (g) Gasch; (h) Ogawa.

As can be seen from Tables II and III, the NRMSE of ILAW-LSimpute is lower than that of LLSimpute, ILLSimpute, Bi-ILSimpute, Bi-BPCA and RMI at missing rates from 1% to 25%, but its computation time is the second largest. This indicates that the ILAW-LSimpute method performs well on missing value estimation at the cost of more computation time.


TABLE I: Testing Datasets Used in Our Experiments

TABLE II: Comparison of the NRMSE on Dataset Ronen

TABLE III: Computation Time (Seconds) of Different Methods on Dataset Ronen

Fig. 1 gives the experimental results on the eight real microarray datasets at different missing rates for the LAW-LSimpute and ILAW-LSimpute methods in comparison with the other five methods. Overall, the ILAW-LSimpute method has the best performance in nearly all datasets and at all missing rates. As shown in Fig. 1, compared to the other five methods, the ILAW-LSimpute method reduces the NRMSE by at least 50.1%, 42.0%, 44.1%, 19.4%, 51.5%, 9.6%, 36.2% and 7.6% at a 1% missing rate, and by at least 30.4%, 22.9%, 19.7%, 4.7%, 31.9%, 6.2%, 12.9% and 2.7% at a 5% missing rate, for the datasets SP.Alpha, SP.Cdc, Infection, Ronen, Yo.Calcineurin, Lymphoma, Gasch and Ogawa, respectively. The good performance of the proposed algorithm can be attributed to the locally auto-weighted strategy, which emphasizes nearby genes in the regression analysis. As the missing rate increases, the superiority of the proposed method becomes less significant, but the method remains effective. One possible reason is that the true neighboring genes can be found at a lower missing rate, whereas at higher missing rates the selected neighboring genes contain more and more noise from the preselection process and hence may not be the true neighbors. Nonetheless, from Fig. 1 we can conclude that the ILAW-LSimpute method has the best performance on all datasets when the missing rate is low (1-20%) and is equivalent to or better than the other five popular methods when the missing rate is high (25-30%). As imputation is not recommended by Draghici et al. [8] when the missing rate is high (e.g., more than 25%), we only present comparisons for rates up to 30%.

Fig. 2. Comparison of the NRMSE of various methods and the effect of the number of neighboring genes for estimating missing values on two datasets with 5% and 15% missing rates: (a) SP.Alpha with 5% missing values; (b) SP.Alpha with 15% missing values; (c) Infection with 5% missing values; (d) Infection with 15% missing values.

C. Robustness of ILAW-LSimpute With Respect to the Auto-Weighted Strategy

The locally auto-weighted strategy plays an important part in LAW-LSimpute and ILAW-LSimpute: it automatically weights the neighboring genes based on their importance. Consequently, the study of this strategy has theoretical and practical significance for the design of efficient imputation methods. In Section III, we discussed how to automatically calculate the weights based on optimization theory and designed the new algorithm; in Section IV, we showed the effectiveness of this strategy through numerical experiments. However, one more question needs to be answered: when the number of neighboring genes increases, does the auto-weighted strategy still work efficiently?

Fig. 2 gives the NRMSE curves of five algorithms (LLSimpute, ILLSimpute, Bi-ILSimpute, LAW-LSimpute and ILAW-LSimpute) on the SP.Alpha and Infection datasets under 5% and 15% missing rates. Overall, the ILAW-LSimpute method shows better performance for estimating missing values on both datasets as the number of neighboring genes increases. The NRMSE values of the ILAW-LSimpute and Bi-ILSimpute methods are 0.306 and 0.449 on SP.Alpha at a 5% missing rate, 0.563 and 0.602 on SP.Alpha at a 15% missing rate, 0.459 and 0.578 on Infection at a 5% missing rate, and 0.571 and 0.606 on Infection at a 15% missing rate, respectively. These good results may be attributed to the auto-weighted strategy.

We now answer the above question empirically. In Fig. 2, as the number of neighboring genes increases, the NRMSE of the LAW-LSimpute and ILAW-LSimpute methods first drops, then rises, then drops again and stabilizes. When the number of neighboring genes exceeds a certain value, the NRMSE of both methods begins to stabilize. This indicates that the auto-weighted strategy still works efficiently as the number of neighboring genes increases, and it implies that the LAW-LSimpute and ILAW-LSimpute methods are robust with respect to the auto-weighted strategy to some extent.

Fig. 3. Average NRMSE against iterations for the ILLSimpute, Bi-ILSimpute and ILAW-LSimpute methods applied to the real microarray datasets at two missing rates: (a) Lymphoma with 10% missing values; (b) Lymphoma with 20% missing values; (c) Gasch with 10% missing values; (d) Gasch with 20% missing values.

D. Convergence Analysis

Convergence is a key consideration for an iterative algorithm, so we study the convergence of the three iterative algorithms implemented in this paper. For an iterative imputation method, convergence means that the estimates remain unchanged at certain values or that the estimation errors reach their lowest level in the statistical sense. Since a mathematical proof is difficult to give for our algorithm, we show its convergence empirically in this section.

Fig. 3 presents the average NRMSE for the Lymphoma and Gasch datasets at 10% and 20% missing rates over iterations 1 to 18. The results show that the Bi-ILSimpute and ILAW-LSimpute algorithms converge on the Lymphoma and Gasch datasets, because the estimation errors of these two algorithms reach their lowest level in the statistical sense after a certain number of iterations. However, the NRMSE of the ILLSimpute algorithm does not settle below a stable level even after 18 iterations. A likely explanation is that Bi-ILSimpute and ILAW-LSimpute use the concept of uncertainty to update the estimates, while ILLSimpute lacks such a control on convergence.

E. Parameter Analysis

In order to determine the value of the smoothing parameter β, we investigate its impact on the Yo.Calcineurin dataset at different missing rates; for the other datasets, β can be selected in the same way. We set a high value for the number of neighboring genes and vary the value of β, testing the estimation performance with integer values of β ranging from -10 to 10, excluding 0. The aim is to see how the value of β affects the estimation results. Fig. 4 shows the results of the ILAW-LSimpute algorithm on the Yo.Calcineurin data set with these 20 values of β: Fig. 4(a)-(g) give the results at different missing rates, and Fig. 4(h) summarizes the results over all missing rates.

Fig. 4. Average NRMSE of the proposed algorithm applied to the Yo.Calcineurin data set at different missing rates using various values of the smoothing parameter β.

We now show how to determine the value of β for the Yo.Calcineurin data set. The weight $w_i(k)$, i = 1, 2, ..., k, ranges from 0 to 1. When β is negative, the lower the value of β, the higher the value of $w_i^{\beta}(k)$. To guarantee the efficiency of the proposed algorithm, we do not suggest selecting a very low value of β. For example, when $w_i(k) = 0.2$ and β = -10, the value of $w_i^{\beta}(k)$ is about 9.76 × 10^6, which is a very large number.
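This magnitude claim is easy to verify numerically (a throwaway check):

```python
# For w_i(k) = 0.2, the effective weight w_i(k)**beta explodes as beta
# goes more negative: 0.2**-10 is about 9.77e6.
for beta in (-2, -5, -10):
    print(beta, 0.2 ** beta)   # -2 -> 25.0, -5 -> 3125.0, -10 -> ~9.77e6
```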

Consequently, when the best result is achieved at a value of β lower than -10, we recommend setting β to -10. Fig. 4(h) illustrates that the smoothing parameter β has little influence at low and mid missing rates (1-15%). It also shows that the overall best results occur at β ∈ {-7, -6, -5}. Hence, we set β to -6 for the Yo.Calcineurin data set in our experiments.

V. CONCLUSIONS

In this paper, we have presented LAW-LSimpute, a new LLSimpute-based algorithm that calculates the neighboring gene weights automatically. For each target gene with missing values, the LAW-LSimpute algorithm adaptively decides the contributions of the neighboring genes through a weighting function that emphasizes important neighboring genes for missing value estimation and plays down the neighboring genes that may contribute noise to the estimation process. We have shown that the proposed algorithm can be regarded as an extension of the LLSimpute method. For efficiency, we have designed an accelerating strategy to improve the convergence of the proposed algorithm.

To make use of previously estimated values, we have also developed an iterative framework of LAW-LSimpute (ILAW-LSimpute). Eight real data sets have been used to verify the effectiveness of the proposed algorithm. Experimental results show that the ILAW-LSimpute algorithm outperforms the LLSimpute, ILLSimpute, Bi-ILSimpute, Bi-BPCA and RMI algorithms at nearly all missing rates. As the proposed algorithm becomes less superior when the missing rate is higher, developing a more efficient algorithm for that regime would be an interesting direction for further study.

ACKNOWLEDGMENTS

This work is supported by the National Science Foundation of China (Nos. 61573292, 61572406), NSAF (No. U1230117) and the Scientific Research Foundation of Sichuan Provincial Education Department (No. 13ZB0210).

REFERENCES

[1] D. J. Lockhart and E. A. Winzeler, “Genomics, gene expression and DNA arrays,” Nature, vol. 405, no. 6788, pp. 827–836, 2000.
[2] Y. Saeys, I. Inza, and P. Larrañaga, “A review of feature selection techniques in bioinformatics,” Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.

[3] O. G. Troyanskaya, K. Dolinski, A. B. Owen, R. B. Altman, and D. Botstein, “A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae),” Proc. Nat. Acad. Sci. USA, vol. 100, no. 14, pp. 8348–8353, 2003.
[4] M. Moškon and M. Mraz, “Systematic approach to computational design of gene regulatory networks with information processing capabilities,” IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 11, no. 2, pp. 431–440, Mar./Apr. 2014.
[5] K. Leung et al., “Data mining on DNA sequences of hepatitis B virus,” IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 8, no. 2, pp. 428–440, Mar. 2011.
[6] D. Pugazhenthi and S. P. Rajagopalan, “Machine learning technique approaches in drug discovery, design and development,” Inf. Technol. J., vol. 6, no. 5, pp. 718–724, 2007.
[7] R. Januchowski, P. Zawierucha, M. Andrzejewska, M. Rucinski, and M. Zabel, “Microarray-based detection and expression analysis of ABC and SLC transporters in drug-resistant ovarian cancer cell lines,” Biomed. Pharmacotherapy, vol. 67, no. 3, pp. 240–245, 2013.
[8] S. Draghici, P. Khatri, A. C. Eklund, and Z. Szallasi, “Reliability and reproducibility issues in DNA microarray measurements,” Trends Genetics, vol. 22, no. 2, pp. 101–109, 2006.
[9] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein, “Cluster analysis and display of genome-wide expression patterns,” Proc. Nat. Acad. Sci. USA, vol. 95, no. 25, pp. 14863–14868, 1998.
[10] Y. Sun, U. Braga-Neto, and E. R. Dougherty, “Impact of missing value imputation on classification for DNA microarray gene expression data—A model-based study,” EURASIP J. Bioinform. Syst. Biol., vol. 2009, pp. 1–17, Nov. 2009.
[11] R. Jörnsten, H. Y. Wang, W. J. Welsh, and M. Ouyang, “DNA microarray data imputation and significance analysis of differential expression,” Bioinformatics, vol. 21, no. 22, pp. 4155–4161, 2005.
[12] L. J. Heyer, S. Kruglyak, and S. Yooseph, “Exploring expression data: Identification and analysis of coexpressed genes,” Genome Res., vol. 9, no. 11, pp. 1106–1115, 1999.
[13] O. Troyanskaya et al., “Missing value estimation methods for DNA microarrays,” Bioinformatics, vol. 17, no. 6, pp. 520–525, 2001.
[14] T. Hastie et al., “Imputing missing data for gene expression arrays,” Division of Biostatistics, Stanford Univ., Stanford, CA, USA, Tech. Rep., 1999.
[15] S. Oba, M.-A. Sato, I. Takemasa, M. Monden, K.-I. Matsubara, and S. Ishii, “A Bayesian missing value estimation method for gene expression profile data,” Bioinformatics, vol. 19, no. 16, pp. 2088–2096, 2003.
[16] T. H. Bø, B. Dysvik, and I. Jonassen, “LSimpute: Accurate estimation of missing values in microarray data with least squares methods,” Nucleic Acids Res., vol. 32, no. 3, p. e34, 2004.
[17] D. Yoon, E. K. Lee, and T. Park, “Robust imputation method for missing values in microarray data,” BMC Bioinform., vol. 8, p. S6, May 2007.
[18] X. C. Gan, A. W. C. Liew, and H. Yan, “Microarray missing data imputation based on a set theoretic framework and biological knowledge,” Nucleic Acids Res., vol. 34, no. 5, pp. 1608–1619, 2006.
[19] H. Kim, G. H. Golub, and H. Park, “Missing value estimation for DNA microarray gene expression data: Local least squares imputation,” Bioinformatics, vol. 21, no. 2, pp. 187–198, 2005.
[20] Z. Cai, M. Heydari, and G. Lin, “Iterated local least squares microarray missing value imputation,” J. Bioinform. Comput. Biol., vol. 4, no. 5, pp. 935–957, 2006.
[21] X. B. Zhang, X. F. Song, H. N. Wang, and H. P. Zhang, “Sequential local least squares imputation estimating missing value of microarray data,” Comput. Biol. Med., vol. 38, no. 10, pp. 1112–1120, 2008.
[22] W. K. Ching et al., “A weighted local least squares imputation method for missing value estimation in microarray gene expression data,” Int. J. Data Mining Bioinform., vol. 4, no. 3, pp. 331–347, 2010.
[23] K. O. Cheng, N. F. Law, and W. C. Siu, “Iterative bicluster-based least square framework for estimation of missing values in microarray gene expression data,” Pattern Recognit., vol. 45, no. 4, pp. 1281–1289, 2012.
[24] C. C. Liu, D. Q. Dai, and H. Yan, “The theoretic framework of local weighted approximation for microarray missing value estimation,” Pattern Recognit., vol. 43, no. 8, pp. 2993–3002, 2010.
[25] S. Z. Selim and M. A. Ismail, “K-means-type algorithms: A generalized convergence theorem and characterization of local optimality,” IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-6, no. 1, pp. 81–87, Jan. 1984.
[26] J. Z. Huang, M. K. Ng, H. Rong, and Z. Li, “Automated variable weighting in k-means type clustering,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 5, pp. 657–668, May 2005.
[27] B. L. Bowerman and R. O’Connell, Linear Statistical Models: An Applied Approach, 2nd ed. Duxbury, MA, USA, 1990.

[28] P. T. Spellman et al., “Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization,” Molecular Biol. Cell, vol. 9, no. 12, pp. 3273–3297, 1998.
[29] D. N. Baldwin, V. Vanchinathan, P. O. Brown, and J. A. Theriot, “A gene-expression program reflecting the innate immune response of cultured intestinal epithelial cells to infection by Listeria monocytogenes,” Genome Biol., vol. 4, no. 1, p. R2, 2003.
[30] H. Yoshimoto et al., “Genome-wide analysis of gene expression regulated by the calcineurin/Crz1p signaling pathway in Saccharomyces cerevisiae,” J. Biol. Chem., vol. 277, no. 34, pp. 31079–31088, 2002.
[31] A. A. Alizadeh et al., “Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling,” Nature, vol. 403, no. 6769, pp. 503–511, 2000.
[32] A. P. Gasch et al., “Genomic expression programs in the response of yeast cells to environmental changes,” Molecular Biol. Cell, vol. 11, no. 12, pp. 4241–4257, 2000.
[33] M. Ronen and D. Botstein, “Transcriptional response of steady-state yeast cultures to transient perturbations in carbon source,” Proc. Nat. Acad. Sci. USA, vol. 103, no. 2, pp. 389–394, 2006.
[34] N. Ogawa, J. DeRisi, and P. O. Brown, “New components of a system for phosphate accumulation and polyphosphate metabolism in Saccharomyces cerevisiae revealed by genomic expression analysis,” Molecular Biol. Cell, vol. 11, no. 12, pp. 4309–4321, 2000.
[35] F. Meng, C. Cai, and H. Yan, “A bicluster-based Bayesian principal component analysis method for microarray missing value estimation,” IEEE J. Biomed. Health Inform., vol. 18, no. 3, pp. 863–871, May 2014.
[36] H. Li, C. Zhao, F. Shao, G.-Z. Li, and W. Xiao, “A hybrid imputation approach for microarray missing value estimation,” BMC Genomics, vol. 16, p. S1, Aug. 2015.

Zeng Yu received B.S. and M.S. degrees from the Department of Mathematics, School of Sciences, China University of Mining and Technology, in 2008 and 2011, respectively. He is currently working toward the Ph.D. degree in the School of Information Science and Technology, Southwest Jiaotong University, China. He is also a visiting Ph.D. student at the Department of Computer Science, Georgia State University, Atlanta, GA, USA. His current research interests include data mining, bioinformatics, deep learning, and cloud computing.

Tianrui Li (SM’10) received the B.S., M.S., and Ph.D. degrees from Southwest Jiaotong University, Chengdu, China, in 1992, 1995, and 2002, respectively. He was a Postdoctoral Researcher with SCK•CEN, Belgium, from 2005 to 2006, and a Visiting Professor with Hasselt University, Belgium, in 2008, the University of Technology, Sydney, Australia, in 2009, and the University of Regina, Canada, in 2014. He is currently a Professor and the Director of the Key Laboratory of Cloud Computing and Intelligent Techniques, Southwest Jiaotong University. He has authored or coauthored more than 120 research papers in refereed journals and conferences. His research interests include big data, cloud computing, data mining, granular computing, and rough sets.

Shi-Jinn Horng received the B.S. degree in electronics engineering from National Taiwan Institute of Technology, the M.S. degree in information engineering from National Central University, and the Ph.D. degree in computer science from National Tsing Hua University, in 1980, 1984, and 1989, respectively. He is currently the chair and a distinguished professor in the Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology. He has published more than 200 research papers and received many awards, notably the Distinguished Research Award from the National Science Council in Taiwan in 2004. His research interests include VLSI design, biometric recognition, image processing, and information security.


Yi Pan received his B.Eng. and M.Eng. degrees in computer engineering from Tsinghua University, China, in 1982 and 1984, respectively, and his Ph.D. degree in computer science from the University of Pittsburgh, Pittsburgh, PA, USA, in 1991. He is a Regents’ Professor of Computer Science and an Interim Associate Dean and Chair of Biology at Georgia State University, Atlanta, GA, USA. He joined Georgia State University in 2000, was promoted to full professor in 2004, named a Distinguished University Professor in 2013 and designated a Regents’ Professor (the highest recognition given to a faculty member by the University System of Georgia) in 2015. He also served as the Chair of the Computer Science Department from 2005 to 2013. His profile has been featured as a distinguished alumnus in both the Tsinghua Alumni Newsletter and the University of Pittsburgh CS Alumni Newsletter. His research interests include parallel and cloud computing, wireless networks, and bioinformatics. He has published more than 330 papers, including over 180 SCI journal papers and 60 IEEE/ACM Transactions papers. In addition, he has edited or authored 40 books. His work has been cited more than 8,300 times on Google Scholar. Dr. Pan has served as an editor-in-chief or editorial board member for 15 journals, including 7 IEEE Transactions. He is the recipient of many awards, including an IEEE Transactions Best Paper Award, 4 other international conference or journal Best Paper Awards, 4 IBM Faculty Awards, 2 JSPS Senior Invitation Fellowships, the IEEE BIBE Outstanding Achievement Award, an NSF Research Opportunity Award, and an AFOSR Summer Faculty Research Fellowship. He has organized many international conferences and delivered keynote speeches at over 50 international conferences around the world.


Hongjun Wang received his PhD degree in computer science from Sichuan University, China, in 2009. He is currently associate professor of the Key Lab of Cloud Computing and Intelligent Techniques in Southwest Jiaotong University, China. His research interests are machine learning, data mining, and ensemble learning. He published over 30 research papers in journals and conferences and he is a member of ACM and CCF. He has been a reviewer for several academic journals.

Yunge Jing received the M.S. degree from the College of Computer Science, Shanxi Normal University, Xian, China, in 2007. He was an Associate Professor in the Public Department of Computer Teaching, Yuncheng University, China. He is currently working toward the Ph.D. degree in the School of Information Science and Technology, Southwest Jiaotong University, China. His research interests include data mining, cloud computing, granular computing, and rough sets.
