
Coupled Matrix Factorization with Sparse Factors to Identify Potential Biomarkers in Metabolomics

Evrim Acar, Department of Food Science, Faculty of Science, University of Copenhagen, Copenhagen, Denmark
Gozde Gurdeniz, Department of Human Nutrition, Faculty of Science, University of Copenhagen, Copenhagen, Denmark
Morten A. Rasmussen, Department of Food Science, Faculty of Science, University of Copenhagen, Copenhagen, Denmark
Daniela Rago, Department of Human Nutrition, Faculty of Science, University of Copenhagen, Copenhagen, Denmark
Lars O. Dragsted, Department of Human Nutrition, Faculty of Science, University of Copenhagen, Copenhagen, Denmark
Rasmus Bro, Department of Food Science, Faculty of Science, University of Copenhagen, Copenhagen, Denmark

ABSTRACT

Metabolomics focuses on the detection of chemical substances in biological fluids such as urine and blood using a number of analytical techniques including Nuclear Magnetic Resonance (NMR) spectroscopy and Liquid Chromatography-Mass Spectrometry (LC-MS). Among the major challenges in the analysis of metabolomics data are (i) joint analysis of data from multiple platforms, and (ii) capturing easily interpretable underlying patterns, which could be further utilized for biomarker discovery. In order to address these challenges, the authors formulate joint analysis of data from multiple platforms as a coupled matrix factorization problem with sparsity penalties on the factor matrices. They develop an all-at-once optimization algorithm, called CMF-SPOPT (Coupled Matrix Factorization with SParse OPTimization), which is a gradient-based optimization approach solving for all factor matrices simultaneously. Using numerical experiments on simulated data, the authors demonstrate that CMF-SPOPT can capture the underlying sparse patterns in data. Furthermore, on a real data set of blood samples collected from a group of rats, the authors use the proposed approach to jointly analyze metabolomics data sets and identify potential biomarkers for apple intake. Advantages and limitations of the proposed approach are also discussed using illustrative examples on metabolomics data sets.

Keywords: Coupled Matrix Factorization, Gradient-Based Optimization, Metabolomics, Missing Data, Sparsity

DOI: 10.4018/jkdb.2012070102

INTRODUCTION

With the ability to collect massive amounts of data as a result of technological advances, we are commonly faced with data sets from multiple sources. For instance, metabolomics studies focus on the detection of a wide range of chemical substances in biological fluids such as urine and plasma using a number of analytical techniques including Liquid Chromatography-Mass Spectrometry (LC-MS) and Nuclear Magnetic Resonance (NMR) spectroscopy. NMR, for example, is a highly reproducible technique and powerful in terms of quantification. LC-MS, on the other hand, allows the detection of many more chemical substances in biological fluids, but with lower reproducibility. These techniques often generate data sets that are complementary to each other (Richards et al., 2010). Data from these complementary methods, when analyzed together, may enable us to capture a larger proportion of the complete metabolome belonging to a specific biological system. However, there is currently a significant gap between data collection and knowledge extraction: while we can collect a vast amount of relational data from multiple sources, we still cannot analyze these data sets in a way that shows the overall picture of a specific problem of interest, e.g., exposure to a specific diet. To address this challenge, data fusion methods have been developed in various fields focusing on specific problems of interest, e.g., missing link prediction in recommender systems (Ma et al., 2008) and clustering/community detection in social network analysis (Banerjee et al., 2007; Lin et al., 2009). Data fusion has also been studied in metabolomics, mostly with the goal of capturing the underlying patterns in data (Smilde et al., 2003) and using the extracted patterns for prediction of a specific condition (Doeswijk et al., 2011); see Richards et al. (2010) for a comprehensive review on data fusion in omics. Matrix factorizations are common tools in data fusion studies across different fields. An effective way of jointly analyzing data from multiple sources is to represent data from different sources as a collection of matrices. Subsequently, this collection of matrices can be jointly analyzed using collective matrix factorization methods (Long et al., 2006; Singh & Gordon, 2008). Nevertheless, the applicability of available data fusion techniques is limited when the goal is to identify a limited number of variables, e.g., a few metabolites as potential biomarkers. Matrix factorization methods, without specific constraints on the factors, would reveal dense patterns, which are difficult to interpret. Therefore, motivated by applications in metabolomics, in this paper we formulate data fusion as a coupled matrix factorization model with penalties that enforce sparsity on the factors in order to capture sparse patterns. Our contributions in this paper can be summarized as follows:

• Formulating a coupled matrix factorization model with penalties to impose sparsity on factor matrices;
• Developing a gradient-based optimization algorithm for solving the smooth approximation of the coupled matrix factorization problem with sparsity penalties, which we call CMF-SPOPT (Coupled Matrix Factorization with SParse OPTimization);
• Demonstrating the effectiveness of CMF-SPOPT in terms of capturing the underlying sparse patterns in data using simulations;
• Assessing the sensitivity of the proposed approach to different penalty parameters;
• Identifying potential apple biomarkers based on joint analysis of metabolomics data sets collected on blood samples of a group of rats.

This is an extended version of our previous study (Acar et al., 2012), where we imposed the same level of sparsity on coupled data sets. In this paper, we also demonstrate that the proposed approach extends to different levels of sparsity in coupled data sets and can accurately capture the underlying sparse factors using different sparsity penalties for different data sets. Furthermore, through illustrative examples on real metabolomics data sets, we demonstrate the strengths and weaknesses of CMF-SPOPT. The rest of the paper is organized as follows. In the next section, we introduce our coupled matrix factorization model with penalties to impose sparsity and a gradient-based optimization algorithm for fitting the model. The Experiments and Results section demonstrates the performance of the proposed approach on both simulated and real data. In the Related Work section, we survey related studies, and, finally, we conclude with future research directions in the Conclusion.

METHODS

In this section, we first introduce our model for coupled matrix factorization (CMF) with penalty terms to enforce sparsity on the factor matrices and discuss the extension of the model to coupled analysis of incomplete data. We then present our algorithmic framework called CMF-SPOPT (Coupled Matrix Factorization with SParse OPTimization), which fits the proposed model using a gradient-based optimization method.

Model

We consider joint analysis of multiple matrices with one mode in common using coupled matrix factorization to capture the underlying sparse factors. We first discuss the formulation of coupled matrix factorization, which has previously been studied in various data fusion studies (Ma et al., 2008; Singh & Gordon, 2008; Van Deun et al., 2009). Without loss of generality, suppose matrices X ∈ ℝ^{I×J} and Y ∈ ℝ^{I×K} have the first mode in common. The objective function for their joint factorization can be formulated as:

$$ f(A, B, C) = \|X - AB^{T}\|^{2} + \|Y - AC^{T}\|^{2} \qquad (1) $$

where ||·|| denotes the Frobenius norm for matrices and the 2-norm for vectors. The goal is to find the matrices A ∈ ℝ^{I×R}, B ∈ ℝ^{J×R} and C ∈ ℝ^{K×R} that minimize (1). Note that A, i.e., the factor matrix extracted from the shared mode, is common in the factorization of both X and Y. In this paper, we extend the formulation in (1) by adding penalty terms in order to impose sparsity on factor matrices B and C, and reformulate the objective function as:

$$ f(A, B, C) = \|X - AB^{T}\|^{2} + \|Y - AC^{T}\|^{2} + \lambda_{b} \sum_{r=1}^{R} \|b_{r}\|_{1} + \lambda_{c} \sum_{r=1}^{R} \|c_{r}\|_{1} + \alpha \sum_{r=1}^{R} \|a_{r}\|^{2} \qquad (2) $$

where a_r, b_r and c_r correspond to the rth column of A, B and C, respectively. ||x||_1 denotes the 1-norm of a vector and is defined as $\sum_{i} |x_{i}|$. λ_b, λ_c and α are penalty parameters with λ_b, λ_c, α ≥ 0. This formulation is motivated by metabolomics applications, where we often have different types of measurements on the same samples. For instance, X may correspond to a samples-by-features matrix constructed using LC-MS measurements, while Y may be a samples-by-chemical shifts matrix constructed using NMR measurements. In most metabolomics applications, we need the underlying sparse patterns in the variables dimensions, e.g., metabolites, in order to relate diseases or dietary interventions with a small set of variables. Therefore, we impose sparsity only in the variables modes by adding the 1-norm penalty, which has been shown to be an effective way of enforcing sparsity (Tibshirani, 1996). The 2-norm penalty on the factors in the samples mode, i.e., the last term in (2), is added to handle the scaling ambiguity. Since there is a scaling ambiguity in the matrix factorization given above, i.e., $\hat{X} = (\eta A)\left(\tfrac{1}{\eta} B^{T}\right) = AB^{T}$, the sparsity penalty would not have the desired effect without penalizing the norm of the factors in the samples mode.
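To make the scaling-ambiguity argument concrete, here is a minimal numerical sketch (NumPy; all variable names are illustrative, not the authors' code) showing that rescaling A and B leaves the fit unchanged while the 1-norm penalty on B can be made arbitrarily small, which is why the 2-norm penalty on A is needed:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 2))
B = rng.standard_normal((30, 2))
X = A @ B.T

for eta in (1.0, 10.0, 100.0):
    A_s, B_s = eta * A, B / eta            # X = (eta A)(B/eta)^T is unchanged
    fit = np.linalg.norm(X - A_s @ B_s.T)  # always ~0: the data term cannot see eta
    l1_B = np.abs(B_s).sum()               # 1-norm penalty shrinks as eta grows
    print(f"eta={eta:6.1f}  fit={fit:.2e}  ||B||_1={l1_B:.3f}")
```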

Smooth Approximation: In order to minimize the objective function in (2), we need to deal with a non-differentiable optimization problem due to the 1-norm terms. However, by replacing the 1-norm terms with differentiable approximations, it can be converted into a differentiable problem. Here, we approximate the 1-norm terms using the "epsL1" function (Lee et al., 2006) and rewrite (2) as:

$$ f(A, B, C) = \|X - AB^{T}\|^{2} + \|Y - AC^{T}\|^{2} + \lambda_{b} \sum_{r=1}^{R}\sum_{j=1}^{J} \sqrt{b_{jr}^{2} + \epsilon} + \lambda_{c} \sum_{r=1}^{R}\sum_{k=1}^{K} \sqrt{c_{kr}^{2} + \epsilon} + \alpha \sum_{r=1}^{R} \|a_{r}\|^{2} \qquad (3) $$

where b_jr denotes the entry in the jth row, rth column of B. Note that, for sufficiently small ε > 0, $\sqrt{x_{i}^{2} + \epsilon} \approx |x_{i}|$.

Missing Data: In the presence of missing data, we can still jointly factorize matrices and extract sparse patterns by fitting the coupled model only to the known data entries. Suppose X has missing entries and let W ∈ ℝ^{I×J} indicate the missing entries of X such that:

$$ w_{ij} = \begin{cases} 1 & \text{if } x_{ij} \text{ is known,} \\ 0 & \text{if } x_{ij} \text{ is missing,} \end{cases} $$

for all i ∈ {1, ..., I} and j ∈ {1, ..., J}. To jointly analyze matrix Y and the incomplete matrix X, we can then modify the objective function in (3) as:

$$ f_{W}(A, B, C) = \|W \ast (X - AB^{T})\|^{2} + \|Y - AC^{T}\|^{2} + \lambda_{b} \sum_{r=1}^{R}\sum_{j=1}^{J} \sqrt{b_{jr}^{2} + \epsilon} + \lambda_{c} \sum_{r=1}^{R}\sum_{k=1}^{K} \sqrt{c_{kr}^{2} + \epsilon} + \alpha \sum_{r=1}^{R} \|a_{r}\|^{2} \qquad (4) $$

where ∗ denotes the Hadamard (element-wise) product. The formulations in (3) and (4) easily generalize to joint factorization of more than two matrices, each with underlying sparse factors in the variables mode. In our objectives, we give equal weights to the factorization of each data matrix, and in the experiments, we divide each data set by its Frobenius norm so that the model does not favor one part of the objective. However, determining the right weighting scheme remains an open research question.

Algorithm

With the smooth approximation, we have obtained differentiable objective functions in (3) and (4), which can be solved using any first-order optimization algorithm (Nocedal & Wright, 2006). In order to use a first-order optimization method, we only need to derive the gradient. The gradient of f_W in (4), which is a vector of size P = R(I + J + K), can be formed by vectorizing the partial derivatives with respect to each factor matrix and concatenating them all, i.e.:

$$ \nabla f_{W} = \begin{bmatrix} \mathrm{vec}\!\left(\dfrac{\partial f_{W}}{\partial A}\right) \\[4pt] \mathrm{vec}\!\left(\dfrac{\partial f_{W}}{\partial B}\right) \\[4pt] \mathrm{vec}\!\left(\dfrac{\partial f_{W}}{\partial C}\right) \end{bmatrix} $$


Let Z = AB^T. Assuming each term of f_W in (4) is multiplied by 0.5 for ease of computation, the partial derivatives of f_W with respect to the factor matrices A, B and C can be computed as:

$$ \frac{\partial f_{W}}{\partial A} = (W \ast Z - W \ast X)B - YC + AC^{T}C + \alpha A $$

$$ \frac{\partial f_{W}}{\partial B} = (W \ast Z - W \ast X)^{T}A + \frac{\lambda_{b}}{2}\, B \,/\, (B \ast B + \epsilon)^{1/2} $$

$$ \frac{\partial f_{W}}{\partial C} = -Y^{T}A + CA^{T}A + \frac{\lambda_{c}}{2}\, C \,/\, (C \ast C + \epsilon)^{1/2} $$

where the operator / denotes element-wise division and the square root is applied element-wise. Once the gradient is computed, we then use the Nonlinear Conjugate Gradient (NCG) method with Hestenes-Stiefel updates (Nocedal & Wright, 2006) and the Moré-Thuente line search as implemented in the Poblano Toolbox (Dunlavy et al., 2010). Traditional approaches for coupled matrix factorizations are based on alternating algorithms (Singh & Gordon, 2008; Van Deun et al., 2009), where the optimization problem is solved for one factor matrix at a time by fixing the other factor matrices. While alternating algorithms are widely used, direct nonlinear optimization methods solving for all factor matrices simultaneously have better convergence properties within the context of matrix factorizations with missing entries (Buchanan & Fitzgibbon, 2005) and have been shown to be more accurate in the case of tensor factorizations (Acar et al., 2011a). Therefore, we use a gradient-based optimization algorithm to solve the non-convex optimization problem, i.e., the minimization of the objective function in (4). Neither alternating nor all-at-once approaches can guarantee reaching the global optimum. The computational cost per iteration is the same for both alternating and gradient-based approaches (see Buchanan & Fitzgibbon, 2005; Acar et al., 2011a for in-depth comparisons of alternating and all-at-once approaches).
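The partial derivatives above translate directly into code. The following sketch (NumPy; hypothetical names; corresponding to the 0.5-scaled objective as in the text) is illustrative rather than the authors' MATLAB implementation:

```python
import numpy as np

def gradients(X, Y, W, A, B, C, lambda_b, lambda_c, alpha, eps=1e-8):
    """Partial derivatives of the 0.5-scaled objective f_W in (4)."""
    Z = A @ B.T
    masked_resid = W * Z - W * X                         # W * (A B^T - X)
    dA = masked_resid @ B - Y @ C + A @ (C.T @ C) + alpha * A
    dB = masked_resid.T @ A + 0.5 * lambda_b * B / np.sqrt(B * B + eps)
    dC = -Y.T @ A + C @ (A.T @ A) + 0.5 * lambda_c * C / np.sqrt(C * C + eps)
    return dA, dB, dC

# These pieces can be plugged into any first-order solver (an assumption about
# usage, not the authors' set-up), e.g. scipy.optimize.minimize with
# method="CG", by flattening (A, B, C) into one parameter vector and returning
# grad = np.concatenate([dA.ravel(), dB.ravel(), dC.ravel()]).
```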

EXPERIMENTS AND RESULTS

In this section, the performance of the proposed approach in terms of capturing the underlying sparse patterns in coupled data sets is demonstrated using both simulated and real data.

Simulated Data

The goal of the simulations is two-fold: (i) to demonstrate that the underlying sparse factors used to generate coupled data sets can be accurately captured using the proposed model/algorithm, and (ii) to study the sensitivity of the proposed approach to different parameter values.

Experimental Set-Up

We generate coupled matrices X ∈ ℝ^{I×J} and Y ∈ ℝ^{I×K} as X = AB^T and Y = AC^T, where A ∈ ℝ^{I×R} has entries randomly drawn from the standard normal distribution; matrices B ∈ ℝ^{J×R} and C ∈ ℝ^{K×R} similarly have entries randomly drawn from the standard normal, but S% of the entries in each column of B and C is set to zero to obtain sparse factors. Columns of A, B and C are normalized to unit norm. We then add noise to X and Y to form coupled noisy matrices, i.e.,

$$ X_{\text{noisy}} = X + \eta \frac{\|X\|}{\|N_{1}\|} N_{1} \quad \text{and} \quad Y_{\text{noisy}} = Y + \eta \frac{\|Y\|}{\|N_{2}\|} N_{2}, $$

where the entries of N_1 ∈ ℝ^{I×J} and N_2 ∈ ℝ^{I×K} are randomly drawn from the standard normal. In order to assess the performance of CMF-SPOPT in terms of capturing the underlying sparse patterns, we generate data sets with:

1. sparsity levels S = {(30, 90), (30, 30), (50, 50), (90, 90)}, where each pair gives the sparsity level used for the columns of matrices B and C, respectively;
2. noise levels η = 0.1, 0.5; and
3. sizes (I, J, K) ∈ {(20, 30, 3000), (20, 30, 40)}.

We use R = 2 as the number of components. Once the coupled matrices are generated, CMF-SPOPT is used to capture Â ∈ ℝ^{I×R_ext}, B̂ ∈ ℝ^{J×R_ext} and Ĉ ∈ ℝ^{K×R_ext} for different values of the penalty parameters: λ_b, λ_c ∈ {10^-3, 10^-2, 10^-1} and α ∈ {10^-3, 10^-2, 10^-1, 1, 5}. R_ext indicates the number of extracted components. We compare the extracted matrices B̂ and Ĉ with the original sparse matrices B and C used to generate the coupled data, in terms of sparsity patterns. For instance, the first column of B, b_1, is compared with the matching column of B̂, e.g., b̂_1. If a nonzero in b_1 corresponds to a nonzero in b̂_1, then it is a true positive; if a zero in b_1 corresponds to a nonzero in b̂_1, it is a false positive. Note that there is a permutation ambiguity in the factorization; therefore, we look for the best permutation to match the columns. As stopping conditions, CMF-SPOPT uses the relative change in function value (set to 10^-10) and the 2-norm of the gradient divided by the number of entries in the gradient (set to 10^-10). For initialization, we use multiple random starts and choose the run with the minimum function value.
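A minimal sketch of the data generation procedure described in this set-up, assuming NumPy and illustrative parameter names, is:

```python
import numpy as np

def generate_coupled_data(I=20, J=30, K=3000, R=2, sparsity=(50, 50), eta=0.5, seed=0):
    """Generate coupled noisy matrices X (I x J) and Y (I x K) sharing factor A."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((I, R))
    B = rng.standard_normal((J, R))
    C = rng.standard_normal((K, R))
    # Set S% of the entries in each column of B and C to zero.
    for M, s in ((B, sparsity[0]), (C, sparsity[1])):
        for r in range(M.shape[1]):
            zero_idx = rng.choice(M.shape[0], size=int(M.shape[0] * s / 100), replace=False)
            M[zero_idx, r] = 0.0
    # Normalize columns of all factor matrices to unit norm.
    for M in (A, B, C):
        M /= np.linalg.norm(M, axis=0, keepdims=True)
    X, Y = A @ B.T, A @ C.T
    # Add noise so that eta controls the noise-to-signal ratio.
    N1, N2 = rng.standard_normal(X.shape), rng.standard_normal(Y.shape)
    X_noisy = X + eta * np.linalg.norm(X) / np.linalg.norm(N1) * N1
    Y_noisy = Y + eta * np.linalg.norm(Y) / np.linalg.norm(N2) * N2
    return X_noisy, Y_noisy, A, B, C
```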

Results

CMF-SPOPT can capture the underlying sparse patterns in coupled data sets accurately. Figure 1 shows the performance of CMF-SPOPT for coupled data sets X of size 20 × 30 and Y of size 20 × 3000. The sparsity level is S = (50, 50); the noise level is η = 0.5, and the number of extracted components is R_ext = 2. Figure 1(a) shows the recovery performance for the first column of matrix B in terms of true-positive rates (TPR) (top row) and false-positive rates (FPR) (second row) for different values of the parameters λ_b, λ_c and α. The best performance, i.e., exact recovery of the underlying sparsity patterns, corresponds to TPR = 1 and FPR = 0. The sizes of the markers in the plots depend on the values obtained for each triplet (λ_b, λ_c, α). For instance, (λ_b, λ_c, α) = (0.1, 0.01, 0.1) gives TPR = 0.99 and FPR = 0.01. Therefore, the point corresponding to (λ_b, λ_c, α) = (0.1, 0.01, 0.1) in the first row of Figure 1(a) is large, while the point corresponding to the same triplet in the second row showing FPR is quite small. Similarly, the recovery performance for the first column of C is given in Figure 1(b). We observe that, for (λ_b, λ_c, α) = (0.1, 0.01, 0.1), CMF-SPOPT recovers the first component of C with high TPR, i.e., 0.99, and low FPR, i.e., 0.01, as well. The triplet (λ_b, λ_c, α) = (0.1, 0.01, 0.1) returns the best performance for both B and C by recovering the underlying sparse patterns in both data sets almost perfectly. Figure 2 takes a closer look at the performance of CMF-SPOPT by showing TPR and FPR for a specific α value, i.e., α = 0.1. Figure 2 clearly shows that the underlying sparse patterns in both B and C can be captured with high TPR and low FPR for (λ_b, λ_c) = (0.1, 0.01). The reason for the different sparsity penalty values is the difference in the number of variables, i.e., while each column of B is of length J = 30, each column of C is of length K = 3000. Note that both data sets are generated using the same sparsity level, i.e., S = (50, 50). We have only reported the results for the first component of B and C; results for the second component are similar and omitted here. Also note that matrix factorizations have rotational ambiguity; therefore, they cannot, in general, capture the factor matrices uniquely. For certain combinations of penalty values, though, the factor matrices are uniquely captured by CMF-SPOPT, i.e., unique up to scaling and permutation. The reported TPR and FPR values correspond to those cases, where we can uniquely capture the factor matrices up to scaling and permutation.
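A sketch of how the TPR/FPR evaluation with permutation matching could be implemented is given below; the matching criterion (largest summed absolute inner product between columns) and all names are illustrative assumptions, not taken from the paper:

```python
import numpy as np
from itertools import permutations

def tpr_fpr(b_true, b_hat, tol=1e-6):
    """True/false-positive rates of the recovered sparsity pattern of one column."""
    truth = np.abs(b_true) > tol
    est = np.abs(b_hat) > tol
    tpr = (truth & est).sum() / max(truth.sum(), 1)
    fpr = (~truth & est).sum() / max((~truth).sum(), 1)
    return tpr, fpr

def best_permutation_match(B_true, B_hat):
    """Reorder the columns of B_hat to best match B_true (small R only)."""
    R = B_true.shape[1]
    best, best_score = None, -np.inf
    for perm in permutations(range(R)):
        score = sum(abs(np.dot(B_true[:, r], B_hat[:, p])) for r, p in enumerate(perm))
        if score > best_score:
            best, best_score = perm, score
    return B_hat[:, list(best)]
```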


Figure 1. Performance of CMF-SPOPT for data set sizes, (I , J , K ) = (20, 30, 3000) , for different values of penalty parameters, λb , λc and α , at noise level η = 0.5 . (a) First column of B, i.e., b1. (b) First column of C, i.e., c1.

Figure 2. Performance of CMF-SPOPT for α = 0.1 and different values of λb , λc , for data set sizes (I , J , K ) = (20, 30, 3000) at noise level η = 0.5 . (a) First column of B, i.e., b1. (b) First column of C, i.e., c1.

Noise Level

CMF-SPOPT performs well in terms of capturing the underlying sparse patterns for certain parameter values, as we have seen in Figure 1. As the noise level decreases, the problem becomes easier and the underlying sparse factors can be captured for many different values of the triplets. For instance, Figure 3 demonstrates the recovery performance for a lower noise level, i.e., η = 0.1, and we observe that, compared to Figure 1, it is possible to recover the underlying sparse patterns in B and C for many more (λ_b, λ_c, α) triplets. Here, we again set (I, J, K) = (20, 30, 3000), S = (50, 50), and R_ext = 2.

Data Set Sizes

Depending on the data set sizes, the best performing penalty parameters change. Figure 4 shows the performance of CMF-SPOPT for coupled data sets X of size 20 × 30 and Y of size 20 × 40 for α = 0.1 and different values of λ_b, λ_c.


Figure 3. Performance of CMF-SPOPT for data set sizes (I, J, K) = (20, 30, 3000), for different values of penalty parameters λ_b, λ_c and α, at noise level η = 0.1. (a) First column of B, i.e., b_1. (b) First column of C, i.e., c_1.

The results are shown for this specific α value since it achieves the best performance. We observe that for a small number of dimensions in the variables mode, i.e., J = 30 and K = 40, λ_b = λ_c = 0.1 can accurately capture the sparse factors in B and C. On the other hand, as the number of variables increases, e.g., K = 3000 in Figure 2, lower sparsity penalties become effective. Note that unlike in Figure 2, where we considered coupled data sets with different dimensionalities, i.e., J = 30 and K = 3000, here we consider cases where J and K are of the same order of magnitude. Therefore, the best performance is achieved for the same penalty values, i.e., λ_b = λ_c. Here, the sparsity level is S = (50, 50); the noise level is η = 0.5, and the number of extracted components is R_ext = 2.

Sparsity Level

CMF-SPOPT can capture the underlying sparse patterns accurately for varying levels of sparsity; in particular, the recovery is perfect for higher sparsity. The top and bottom rows of Figure 5(a) show the performance of CMF-SPOPT in terms of capturing the sparsity pattern of the first column of B and C, respectively. Here, ~30% of the entries in each column of B are 0 while ~90% of the entries in each column of C are 0. We observe that the underlying patterns can be captured accurately for both data sets with low FPR values and high TPR values. Recovery is not perfect for the low sparsity level, though. Figure 5(b) and Figure 5(c) demonstrate the performance of CMF-SPOPT for different sparsity levels in more detail. While the sparse patterns are almost accurately captured for low sparsity levels, they are recovered perfectly for high sparsity levels. For all sparsity levels, the best performance is achieved for α = 0.1, λ_b = λ_c = 0.1. Here, we set (I, J, K) = (20, 30, 40), η = 0.5 and R_ext = 2. If we explore the performance of CMF-SPOPT for a wider range of parameter values, we also observe that lower λ_b, λ_c values are effective for the low sparsity level while higher λ_b, λ_c values achieve the best performance for higher sparsity levels.


Figure 4. Performance of CMF-SPOPT for α = 0.1 and different values of λb , λc , for data set sizes (I , J , K ) = (20, 30, 40) . (a) First column of B, i.e., b1. (b) First column of C, i.e., c1.

Number of Components

Finally, we show that CMF-SPOPT is robust to the selection of the component number. We generate data using R = 2 but fit the model using R_ext ∈ {2, 3, 4}. We set η = 0.5, S = (50, 50), λ_b = λ_c = 0.1, α = 0.1, and (I, J, K) = (20, 30, 40). Table 1 shows the weight of each coupled component, calculated as follows: we can rewrite X = AB^T and Y = AC^T as

$$ X = \sum_{r=1}^{R} \beta_{r}\, a_{r} b_{r}^{T} \quad \text{and} \quad Y = \sum_{r=1}^{R} \gamma_{r}\, a_{r} c_{r}^{T}, $$

where β_r and γ_r are the weights of component r in X and Y, respectively, and ||a_r|| = ||b_r|| = ||c_r|| = 1, for r = 1, 2, ..., R. We define the weight of a coupled component r as σ_r = β_r + γ_r. Similarly, when Â, B̂ and Ĉ are extracted using CMF-SPOPT, the columns are normalized and σ̂_r is computed. Table 1 shows that when there are two common components, i.e., R = 2, and the data sets are overfactored using R_ext = 3, 4, the weights of the extra components are 0. Besides, the sparsity patterns of the common components are still accurately captured, as indicated by high TPR and low FPR. In summary, the simulation studies demonstrate that CMF-SPOPT is quite effective in terms of capturing the underlying sparse patterns in coupled data; however, we also observe that the method is sensitive to the penalty parameter values.
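As an illustration, the coupled component weights could be computed from unnormalized factor estimates as sketched below (NumPy; the assumption that each β_r and γ_r is the product of the corresponding column norms is ours, not stated explicitly in the paper):

```python
import numpy as np

def coupled_component_weights(A_hat, B_hat, C_hat):
    """Return sigma_r = beta_r + gamma_r for each component.

    beta_r and gamma_r are taken as the norms absorbed from the (unnormalized)
    columns of the factor matrices of X and Y, respectively; names illustrative."""
    norm_a = np.linalg.norm(A_hat, axis=0)
    norm_b = np.linalg.norm(B_hat, axis=0)
    norm_c = np.linalg.norm(C_hat, axis=0)
    beta = norm_a * norm_b    # weight of component r in X = sum_r beta_r a_r b_r^T
    gamma = norm_a * norm_c   # weight of component r in Y = sum_r gamma_r a_r c_r^T
    return beta + gamma
```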

Metabolomics Data Analysis

Next, we use CMF-SPOPT to jointly analyze metabolomics data measured using different analytical techniques and identify potential markers for apple intake.

Data: The data consists of blood samples collected from a group of rats and was part of a study on the effect of apple feeding on colon carcinogenesis (Poulsen et al., 2011). Here, we use the samples from forty-six male Fisher 344 rats (5-8 weeks old) obtained from Charles River (Sulzfeld, Germany). After one week of adaptation on a purified diet, the animals were randomized to two experimental groups: fed either the same purified diet (group 1: Apple 0) or the purified diet with 10 g of raw whole apple added (group 2: Apple 10) for 13 weeks. During weeks 3-6 of the study, approximately half of the rats from groups 1 and 2 were given the artificial colon carcinogen DMH (1,2-dimethylhydrazine dihydrochloride). At the end of the study, rats were sacrificed after an overnight fasting. Animal experiments were carried out under the supervision of the Danish National Agency for Protection of Experimental Animals.

Figure 5. Performance of CMF-SPOPT for different levels of sparsity for α = 0.1 and different values of λ_b, λ_c, for data set sizes (I, J, K) = (20, 30, 40). (a) S = (30, 90) (b) S = (30, 30) (c) S = (90, 90).

The rat plasma samples were analyzed by an untargeted approach using liquid chromatography/quadrupole-time-of-flight (LC-QTOF) mass spectrometry (Gurdeniz et al., 2012) and NMR (Kristensen et al., 2010). In LC-MS analysis, raw data is converted into a feature set, where each feature is denoted by the mass-over-charge (m/z) ratio and a retention time (see Gurdeniz et al. (2012) for details). In NMR analysis, the spectra were preprocessed (see Kristensen et al. (2010) for details) and then converted into a set of peaks using an in-house automated peak detection algorithm. We also have a third data set containing total cholesterol (chol), low density cholesterol (LDL), very low density cholesterol (VLDL) and high density cholesterol (HDL) lipoproteins (computed based on the NMR data (Kristensen et al., 2010)); triacylglycerol (TG) concentrations (measured using the rat plasma samples); and ACF (aberrant crypt foci), i.e., a colon cancer biomarker (measured from the colorectum). In summary, our data can be represented using the following three matrices, as illustrated in Figure 6:

1. X ∈ ℝ^{I×J} of type samples by features, corresponding to the LC-MS data, where I = 46 and J = 1086;
2. Y ∈ ℝ^{I×K} of type samples by chemical shifts, corresponding to the NMR measurements, where K = 115;
3. Z ∈ ℝ^{I×M} of type samples by quality variables, corresponding to the quality measurements, where M = 9.

Matrix Z has missing entries. The pattern of missing entries is shown in Figure 6 and can easily be handled using the missing data handling approach we discussed in the Methods section.

Model: Based on the formulation in (4), we jointly analyze X, Y and Z by minimizing the following objective:

$$ f_{W}(A, B, C, D) = \|X - AB^{T}\|^{2} + \|Y - AC^{T}\|^{2} + \|W \ast (Z - AD^{T})\|^{2} + \lambda_{b} \sum_{r=1}^{R}\sum_{j=1}^{J}\sqrt{b_{jr}^{2} + \epsilon} + \lambda_{c} \sum_{r=1}^{R}\sum_{k=1}^{K}\sqrt{c_{kr}^{2} + \epsilon} + \lambda_{d} \sum_{r=1}^{R}\sum_{m=1}^{M}\sqrt{d_{mr}^{2} + \epsilon} + \alpha \sum_{r=1}^{R}\|a_{r}\|^{2} $$

and extract the factor matrices A ∈ ℝ^{I×R}, B ∈ ℝ^{J×R}, C ∈ ℝ^{K×R} and D ∈ ℝ^{M×R} corresponding to the samples, features, chemical shifts and quality variables, respectively. We use λ_b = λ_c = λ_d = 0.01 and α = 0.1 as penalty parameters based on the performance of these parameters on simulated data sets with sizes similar to the LC-MS and NMR data sets.

Results: Before discussing the sparse patterns captured using CMF-SPOPT, we first illustrate the factors extracted using the Singular Value Decomposition (SVD) (Golub & Van Loan, 1996) of matrix X. The SVD decomposes X as X = UΣV^T, where U and V are orthogonal matrices corresponding to the left and right singular vectors, respectively, and Σ is a diagonal matrix with the singular values on the diagonal. Figure 7(a) shows the scatter plot of u_1 and u_7, demonstrating that the two apple groups can be almost separated using the seventh left singular vector. The goal in metabolomics studies is often to understand the reason for the separation, in other words, the metabolites responsible for the separation. Therefore, we plot the seventh right singular vector in Figure 7(b) to identify the significant features. However, capturing the significant features is difficult since this vector is dense. Similarly, modeling the uncentered LC-MS data using Nonnegative Matrix Factorization (NMF) (Lee & Seung, 2001) also reveals dense patterns (results not shown here). Those dense patterns are not even unique; therefore, the interpretation of NMF components extracted from LC-MS data would be even more difficult.
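As a concrete illustration of this SVD-based inspection, a short sketch is given below (NumPy; X is assumed to be the samples-by-features LC-MS matrix; the singular-vector indices and the number of top features are illustrative choices, not from the paper):

```python
import numpy as np

def inspect_svd_separation(X, comp_a=0, comp_b=6, top=20):
    """Inspect which left singular vectors separate the sample groups and
    which features carry the largest loadings (illustrative sketch only)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U[:, [comp_a, comp_b]]   # e.g., u1 vs. u7 as in Figure 7(a)
    loadings = Vt[comp_b, :]          # dense right singular vector, as in Figure 7(b)
    top_features = np.argsort(np.abs(loadings))[::-1][:top]
    return scores, loadings, top_features
```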

Figure 6. Three coupled matrices, X, Y and Z corresponding to LC-MS, NMR and quality measurements, respectively


In Figure 8, we illustrate the performance of CMF-SPOPT in terms of apple group separation by coupled analysis of X, Y and Z. The scatter plot of a_1 vs. a_3 in Figure 8(a) shows that the first component can almost separate the two groups. In Figure 8, we can also see the sparse patterns, i.e., b_1, c_1 and d_1, mainly responsible for this separation. Unlike Figure 7(b), we can clearly identify the significant features in Figure 8(b). Through coupled analysis, we also obtain the sparse patterns relevant to the apple groups in each data set. The results illustrated in Figure 8 are based on a 5-component CMF-SPOPT model, i.e., R = 5. If we decrease R, none of the components can separate the apple groups. For R = 5 and R = 6, we get almost exactly the same component for apple separation. As R increases further, we lose the component responsible for the separation. In real data, unlike in the simulation studies, there are both common and uncommon components in the coupled data sets; therefore, the robustness of CMF-SPOPT to overfactoring (shown in Table 1) is not enough to deal with the problem of determining R. In this study, since we are interested in apple markers, we use a component number that captures the apple group separation. In order to show that the sparse pattern in Figure 8(b) is meaningful, in Figure 9 we illustrate the separation achieved by the SVD of a small matrix of size I × L, constructed using only the features identified in b_1, where L = 14. We observe that using only 14 out of 1086 features, we can still separate the apple groups; therefore, these features are potential candidates for markers of apple intake. We further study the sparse patterns captured by CMF-SPOPT from a biological perspective. The metabolites identified in the sparse patterns are shown in Figure 8(b) and Figure 8(c). Some of these have been verified by chemical standards, while some are tentative identifications that remain to be further explored. Based on the identifications, we find that the patterns in Figure 8 are related to apple-induced changes in the endogenous metabolism. These changes include an increase in circulating branched-chain and aromatic amino acids, an increase in circulating glycerol- and choline-containing lipids, a decrease in corticosteroids and possibly in androgens, and a decrease in lactate, hypoxanthine and free fatty acids. Several elements of this pattern indicate that the transition from the postprandial to the fasting state was delayed in apple-fed rats, with a slower increase in lactate and free fatty acids and a slower loss of amino acids and lipids from the blood. Moreover, apple feeding seems to suppress the increase in corticosteroids, and possibly also of androgens, with fasting, in support of an effect on genes involved in steroid metabolism that we have observed in these rats.

Figure 7. Separation of apple groups using SVD of X. (a) Scatter plot of u1 vs. u7. (b) The right singular vector v7


Figure 8. Separation of apple groups using CMF-SPOPT. (a) Scatter plot of a_1 vs. a_3. (b) Sparse pattern b_1 extracted from LC-MS. (c) Sparse pattern c_1 extracted from NMR. (d) Sparse pattern d_1 extracted from quality measurements.


Table 1. Performance of CMF-SPOPT for R_ext ∈ {2, 3, 4} when R = 2

| R_ext | Component weights (σ̂_r) | Matrix B, Comp. 1 (TPR / FPR) | Matrix B, Comp. 2 (TPR / FPR) | Matrix C, Comp. 1 (TPR / FPR) | Matrix C, Comp. 2 (TPR / FPR) |
|-------|--------------------------|-------------------------------|-------------------------------|-------------------------------|-------------------------------|
| 2     | 0.96, 0.95               | 0.98 / 0.05                   | 0.99 / 0.01                   | 0.98 / 0.03                   | 0.99 / 0.01                   |
| 3     | 0.96, 0.95, 0.00         | 0.98 / 0.05                   | 0.99 / 0.01                   | 0.98 / 0.03                   | 0.99 / 0.01                   |
| 4     | 0.96, 0.95, 0.00, 0.00   | 0.98 / 0.06                   | 0.99 / 0.01                   | 0.98 / 0.03                   | 0.99 / 0.01                   |

Figure 9. Separation of apple groups using the SVD of the reduced matrix constructed from the 14 features identified in b_1. (a) Scatter plot of u1 vs. u2. (b) The right singular vector v1.

Using CMF-SPOPT, we were able to extract meaningful sparse patterns from LC-MS and NMR that complement each other and describe apple-induced changes in the metabolism.

Strengths and Weaknesses: Next, we discuss the advantages and potential limitations of the CMF-SPOPT approach using illustrative examples on metabolomics data sets. We have so far focused on the underlying factor modeling apple-induced changes. Another interesting phenomenon in our data set is the grouping of samples according to DMH, i.e., colon carcinogen exposure (Poulsen et al., 2011). Some of the samples were induced with DMH while others were controls. When the metabolomics data sets are jointly factorized using CMF-SPOPT, one component is mainly responsible for the separation of the apple groups, as we have illustrated in Figure 8. Another component of the same model, i.e., the fifth component, captures the underlying structure that can separate the DMH-induced (i.e., DMH (+)) samples from the control group (i.e., DMH (-)).


Figure 10(a) shows that the groups are clearly separable. That is primarily due to ACF, a strong biomarker for colon cancer (Corpet & Taché, 2002), which shows up as an important feature in the pattern extracted from the quality measurements in Figure 10(d). Here, we use this underlying factor to explore CMF-SPOPT further. Since ACF is a strong feature responsible for the group separation, an effective model capturing sparse patterns should be able to pick it out among many variables. Therefore, we slightly modify our data sets by adding ACF as an additional variable to the LC-MS data; in other words, matrix X now has 1087 columns and the last column corresponds to ACF (Figure 11). Matrices Y and Z remain unchanged. When these matrices are jointly factorized using CMF-SPOPT, one of the components again captures the structure due to DMH induction (Figure 12). We observe that the sparse pattern extracted from LC-MS identifies the last feature, i.e., the concatenated ACF variable, as an important feature out of 1087 features. The sparse patterns extracted from NMR (Figure 12(c)) and the quality measurements (Figure 12(d)) are almost the same as in Figure 10(c) and Figure 10(d) (except for a sign change). This example illustrates that CMF-SPOPT can successfully capture a few relevant features out of many features. We leave the identification of the features showing up in the sparse patterns extracted from NMR and LC-MS as future work. However, in order to understand CMF-SPOPT better, we discuss next why the single feature showing up in Figure 10(b) disappears in Figure 12(b). As mentioned previously, we can rewrite X = AB^T, Y = AC^T and Z = AD^T as

$$ X = \sum_{r=1}^{R} \beta_{r}\, a_{r} b_{r}^{T}, \quad Y = \sum_{r=1}^{R} \gamma_{r}\, a_{r} c_{r}^{T} \quad \text{and} \quad Z = \sum_{r=1}^{R} \theta_{r}\, a_{r} d_{r}^{T}, $$

where β_r, γ_r and θ_r are the weights of component r in X, Y and Z, respectively, and ||a_r|| = ||b_r|| = ||c_r|| = ||d_r|| = 1, for r = 1, ..., R. We have so far plotted the normalized columns of the factor matrices in all figures in this paper and omitted the weights. However, in order to see the complete picture, we should also look into the weights of the components. Table 2 shows the weights of the components in each data set. Note that component 1 (Figure 8) mainly models the apple-induced changes while component 5 (Figure 10) models the DMH induction. We observe that the component modeling DMH induction has very little contribution from LC-MS, i.e., β_5 = 0.0003. If β_5 were 0, we would see a dense vector in Figure 10(b). This very small weight indicates that the sparse pattern in Figure 10(b) is not reliable. When we look at Table 3 for the weights of the components in the model of the modified data sets, the component responsible for the separation of the DMH (-) and DMH (+) groups shows up as the first component. With the addition of ACF to the LC-MS data, there is more contribution from LC-MS, i.e., β_1 = 0.0141, indicating a more reliable component. Also note that we consistently see small weights in LC-MS. The weights indicate not only the contribution of each data set to a common component but also how much that component explains in each data set. The LC-MS data has more than 1000 variables, and sparse patterns of tens of variables will only explain very little variation in the data. Even though the weights are small, as we have seen in Figure 8 and Figure 10, the components are still meaningful. In summary, CMF-SPOPT is quite effective in terms of picking even a single relevant feature among many irrelevant ones. However, we should thoroughly check the component weights before interpreting sparse patterns biologically.


Figure 10. Separation of DMH (-) and DMH (+) groups using CMF-SPOPT. (a) a_5. (b) Sparse pattern b_5 extracted from LC-MS. (c) Sparse pattern c_5 extracted from NMR. (d) Sparse pattern d_5 extracted from quality measurements.


Figure 11. Last column of Z is concatenated to matrix X while other matrices remain unchanged

The weights can also be used to choose appropriate sparsity penalty parameters by making sure that a certain combination of sparsity penalty parameters does not extract components with weights equal to 0 for some data sets in the coupled analysis. That being said, parameter selection still remains a major challenge.

RELATED WORK

Simultaneous analysis of multiple matrices dates back to one of the earliest models aiming to capture the common variation in data sets, i.e., Canonical Correlation Analysis (CCA) (Hotelling, 1936). CCA looks for the patterns in each data set that correlate well and is, in that sense, different from coupled matrix factorization. This difference has been illustrated in a recent metabolomics study (Van den Berg et al., 2009). More in line with the formulation in (1), Levin (1966) studied simultaneous factorization of Gramian matrices. Similarly, in signal processing, joint diagonalization of symmetric and Hermitian matrices has been a topic of interest (Yeredor, 2002). Furthermore, principal component analysis of multiple matrices has been widely studied in chemometrics using various models, some with clear objective functions while others are based on heuristic multi-level approaches (Smilde et al., 2003). Badea (2008) extended the formulation in (1) to simultaneous nonnegative matrix factorizations by extracting nonnegative factor matrices. Another line of work related to simultaneous matrix factorization is the Generalized SVD and its extension to multiple matrices (Ponnapalli et al., 2011). With the increasing interest in the analysis of multi-relational data, Singh and Gordon (2008) and Long et al. (2006) studied collective matrix factorization for the joint factorization of matrices. We can also consider tensor factorizations as simultaneous factorizations of multiple matrices (see Acar and Yener (2009) for a survey of various tensor models). While coupled matrix factorization has been widely studied in many disciplines, a recent study by Van Deun et al. (2011) is, to the best of our knowledge, the only study that enforces sparsity on the factors within the coupled matrix factorization framework. That paper considers various penalty schemes such as the lasso, elastic net and group lasso, and it is the most closely related to what we propose, or more specifically to (2). The main differences are that (i) we do not enforce orthogonality constraints on factor matrix A, as in Van Deun et al. (2011), (ii) while alternating least squares is used in Van Deun et al. (2011), we use an all-at-once optimization approach solving a smooth approximation of the objective in (2), and (iii) we extend our formulation to the joint analysis of incomplete data, as in (4).

CONCLUSION

While we can collect huge amounts of data using different platforms in metabolomics, we are still lacking the data mining tools for the fusion and analysis of these data sets. In this paper, we have formulated data fusion as a coupled matrix factorization model with penalties to enforce sparsity, with the goal of capturing the underlying sparse patterns in coupled data sets. We have also discussed the extension of the proposed model to coupled analysis of incomplete data.


Figure 12. The component separating the DMH (-) and DMH (+) groups using CMF-SPOPT on the modified data sets in Figure 11. (a) a_1. (b) Sparse pattern b_1 extracted from LC-MS. (c) Sparse pattern c_1 extracted from NMR. (d) Sparse pattern d_1 extracted from quality measurements.


Table 2. Weights of the components in each data set for a 5-component model of the original data in Figure 6

| Component weight | 1 (Apple) | 2      | 3      | 4      | 5 (DMH) |
|------------------|-----------|--------|--------|--------|---------|
| β_r (LC-MS)      | 0.0116    | 0.0034 | 0.0596 | 0.0676 | 0.0003  |
| γ_r (NMR)        | 0.1527    | 0.2228 | 0.3750 | 0.2141 | 0.0388  |
| θ_r (QM)         | 0.3142    | 0.6402 | 0.0236 | 0.7259 | 0.3472  |

Table 3. Weights of the components in each data set for a 5-component model of the modified data in Figure 11

| Component weight | 1 (DMH) | 2      | 3      | 4      | 5 (Apple) |
|------------------|---------|--------|--------|--------|-----------|
| β_r (LC-MS)      | 0.0141  | 0.0595 | 0.0675 | 0.0034 | 0.0116    |
| γ_r (NMR)        | 0.0393  | 0.3750 | 0.2143 | 0.2230 | 0.1522    |
| θ_r (QM)         | 0.3474  | 0.0235 | 0.7257 | 0.6331 | 0.3157    |

In order to fit the model to coupled data sets, we have developed a gradient-based optimization algorithm solving for all factor matrices simultaneously. Using numerical experiments on simulated data, the effectiveness of the proposed approach in terms of capturing the underlying sparse patterns has been demonstrated. We have also illustrated the usefulness of the proposed method in a metabolomics application, where potential markers for apple intake are identified through coupled analysis of LC-MS and NMR data. In our experiments, we have shown that different penalty parameters can be used and that they are quite effective in terms of capturing the underlying sparse patterns accurately. However, due to the formulation of the model, the penalty parameters are related and cannot be controlled independently, which is the main limitation of the current CMF-SPOPT formulation. We plan to consider alternative formulations that will enable us to control the parameters independently. Another direction of future research is the extension of our approach to the coupled factorization of heterogeneous data, i.e., matrices and higher-order tensors (Acar et al., 2011b), since in many metabolomics applications we encounter such heterogeneous data sets.

ACKNOWLEDGMENT

This work is funded by the Danish Council for Independent Research - Technology and Production Sciences and the Sapere Aude Program under projects 11-116328 and 11-120947. We also thank Nikos Sidiropoulos for helpful discussions. The experiment with apple-fed rats and the sample profiling were performed as a part of the ISAFRUIT project. The ISAFRUIT project is funded by the European Commission under the Thematic Priority 5 - Food Quality and Safety of the 6th Framework Programme of RTD (Contract no. FP6-FOOD 016279). The views and opinions expressed in this publication are purely those of the writers and may not in any circumstances be regarded as stating an official position of the European Commission.

REFERENCES

Acar, E., Dunlavy, D. M., & Kolda, T. G. (2011a). A scalable optimization approach for fitting canonical tensor decompositions. Journal of Chemometrics, 25(2), 67-86. doi:10.1002/cem.1335

Acar, E., Gurdeniz, G., Rasmussen, M. A., Rago, D., Dragsted, L. O., & Bro, R. (2012). Coupled matrix factorization with sparse factors to identify potential biomarkers in metabolomics. In Proceedings of the ICDM Workshop on Biological Data Mining and its Applications in Healthcare (pp. 1-8).

Acar, E., Kolda, T. G., & Dunlavy, D. M. (2011b). All-at-once optimization for coupled matrix and tensor factorizations. In KDD Workshop on Mining and Learning with Graphs.

Acar, E., & Yener, B. (2009). Unsupervised multiway data analysis: A literature survey. IEEE Transactions on Knowledge and Data Engineering, 21(1), 6-20. doi:10.1109/TKDE.2008.112

Badea, L. (2008). Extracting gene expression profiles common to colon and pancreatic adenocarcinoma using simultaneous nonnegative matrix factorization. Pacific Symposium on Biocomputing, 13, 279-290.

Banerjee, A., Basu, S., & Merugu, S. (2007). Multiway clustering on relation graphs. In Proceedings of the SIAM International Conference on Data Mining (SDM) (pp. 145-156).

Buchanan, A. M., & Fitzgibbon, A. W. (2005). Damped Newton algorithms for matrix factorization with missing data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 316-322).

Corpet, D. E., & Taché, S. (2002). Most effective colon cancer chemopreventive agents in rats: A systematic review of aberrant crypt foci and tumor data, ranked by potency. Nutrition and Cancer, 43(1), 1-21. doi:10.1207/S15327914NC431_1 PMID:12467130

Doeswijk, T. G., Smilde, A. K., Hageman, J. A., Westerhuis, J. A., & van Eeuwijk, F. A. (2011). On the increase of predictive performance with high-level data fusion. Analytica Chimica Acta, 705(1-2), 41-47. doi:10.1016/j.aca.2011.03.025 PMID:21962346

Dunlavy, D. M., Kolda, T. G., & Acar, E. (2010). Poblano v1.0: A Matlab toolbox for gradient-based optimization (Tech. Rep. No. SAND2010-1422). Sandia National Laboratories.

Golub, G. H., & Van Loan, C. F. (1996). Matrix computations. Baltimore, MD: Johns Hopkins University Press.

Gurdeniz, G., Kristensen, M., Skov, T., & Dragsted, L. O. (2012). The effect of LC-MS data preprocessing methods on the selection of plasma biomarkers in fed vs. fasted rats. Metabolites, 2(1), 77-99. doi:10.3390/metabo2010077

Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28(3-4), 321-377. doi:10.1093/biomet/28.3-4.321

Kristensen, M., Savorani, F., Ravn-Haren, G., Poulsen, M., Markowski, J., Larsen, F. H., et al. (2010). NMR and interval PLS as reliable methods for determination of cholesterol in rodent lipoprotein fractions. Metabolomics, 6(1), 129-136. doi:10.1007/s11306-009-0181-3

Lee, D. D., & Seung, H. S. (2001). Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems, 13, 556-562.

Lee, S., Lee, H., Abbeel, P., & Ng, A. Y. (2006). Efficient L1 regularized logistic regression. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) (pp. 401-408).

Levin, J. (1966). Simultaneous factor analysis of several Gramian matrices. Psychometrika, 31(3), 413-419. doi:10.1007/BF02289472 PMID:5221135

Lin, Y. R., Sun, J., Castro, P., Konuru, R., Sundaram, H., & Kelliher, A. (2009). Metafac: Community discovery via relational hypergraph factorization. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) (pp. 527-536).

Long, B., Zhang, Z., Wu, X., & Yu, P. S. (2006). Spectral clustering for multi-type relational data. In Proceedings of the International Conference on Machine Learning (ICML) (pp. 585-592).

Ma, H., Yang, H., Lyu, M. R., & King, I. (2008). SoRec: Social recommendation using probabilistic matrix factorization. In Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM) (pp. 931-940).

Nocedal, J., & Wright, S. J. (2006). Numerical optimization. New York, NY: Springer.

Ponnapalli, S. P., Saunders, M. A., Van Loan, C. F., & Alter, O. (2011). A higher-order generalized singular value decomposition for comparison of global mRNA expression from multiple organisms. PLoS ONE, 6, e28072. doi:10.1371/journal.pone.0028072 PMID:22216090

Poulsen, M., Mortensen, A., Binderup, M. L., Langkilde, S., Markowski, J., & Dragsted, L. O. (2011). The effect of apple feeding on markers of colon carcinogenesis. Nutrition and Cancer, 63(3), 402-409. doi:10.1080/01635581.2011.535961 PMID:21432724

Richards, S. E., Dumas, M. E., Fonville, J. M., Ebbels, T., Holmes, E., & Nicholson, J. K. (2010). Intra- and inter-omic fusion of metabolic profiling data in a systems biology framework. Chemometrics and Intelligent Laboratory Systems, 104(1), 121-131. doi:10.1016/j.chemolab.2010.07.006

Singh, A. P., & Gordon, G. J. (2008). Relational learning via collective matrix factorization. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) (pp. 650-658).

Smilde, A. K., Westerhuis, J. A., & de Jong, S. (2003). A framework for sequential multiblock component methods. Journal of Chemometrics, 17(6), 323-337. doi:10.1002/cem.811

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1), 267-288.

Van den Berg, R. A., Rubingh, C. M., Westerhuis, J. A., van der Werf, M. J., & Smilde, A. K. (2009). Metabolomics data exploration guided by prior knowledge. Analytica Chimica Acta, 651(2), 173-181. doi:10.1016/j.aca.2009.08.029 PMID:19782808

Van Deun, K., Smilde, A. K., van der Werf, M. J., Kiers, H. A. L., & van Mechelen, I. (2009). A structured overview of simultaneous component based data integration. BMC Bioinformatics, 10, 246. doi:10.1186/1471-2105-10-246 PMID:19671149

Van Deun, K., Wilderjans, T. F., van den Berg, R. A., Antoniadis, A., & van Mechelen, I. (2011). A flexible framework for sparse simultaneous component based data integration. BMC Bioinformatics, 12, 448. doi:10.1186/1471-2105-12-448 PMID:22085701

Yeredor, A. (2002). Non-orthogonal joint diagonalization in the least-squares sense with application in blind source separation. IEEE Transactions on Signal Processing, 50(7), 1545-1553. doi:10.1109/TSP.2002.1011195

Evrim Acar is an Assistant Professor at the Faculty of Science at the University of Copenhagen. Her research focuses on data mining and mathematical modeling; in particular, tensor decompositions and their applications in biomedicine and social network analysis. Prior to joining University of Copenhagen, she was a senior researcher at the National Research Institute of Electronics and Cryptology in Turkey and a postdoctoral researcher at Sandia National Laboratories in Livermore, California. She received her BS in Computer Engineering from Bogazici University (Istanbul, Turkey) in 2003 and her MS and PhD in Computer Science from Rensselaer Polytechnic Institute (Troy, NY) in 2006 and 2008, respectively.


Gozde Gurdeniz is a Research Assistant and PhD candidate at the Faculty of Science at the University of Copenhagen. Her research focuses on nutritional metabolomics using liquid chromatography mass spectrometry (LC-MS) and nuclear magnetic resonance (NMR) spectroscopy, particularly on data preprocessing and analysis in LC-MS based metabolomics studies. She received her BS in Chemical Engineering and MS in Food Engineering from Izmir Institute of Technology (Izmir, Turkey) in 2005 and 2008, respectively.

Morten Arendt Rasmussen is a Post Doc at the Faculty of Science at the University of Copenhagen. His research focuses on data mining, mathematical and statistical modeling, and especially the application of those to data from clinical cohorts. MA Rasmussen received his MSc in Food Technology from the Faculty of Life Sciences at the University of Copenhagen in 2007 and his PhD from the Faculty of Science at the University of Copenhagen in 2012.

Daniela Rago is a PhD student at the Faculty of Science at the University of Copenhagen. Her research is focused on the identification of potential exposure and/or effect markers related to apple and processed apple products intake and their biological effects by untargeted LC-MS based metabolomics. She received her MSc in analytical chemistry from Sapienza University (Rome, Italy) in 2009.

Lars Ove Dragsted is a Professor in Biomedicine and Nutrigenomics at the Department of Nutrition, Exercise and Sport at the University of Copenhagen and a leader of the laboratory for Bioactive Food Components and Health. He has a research focus on dietary intervention and nutritional metabolomics, especially LC-MS profiling techniques for the assessment of the actions of diets, specific foods and bioactive components on health. He has previously been a professor of Toxicology at the Danish Food Institute and a leader of the Department of Biochemical Toxicology and the Laboratory of Metabolism and Genetic Toxicology. Before this he was a Research Assistant and PhD student at the Laboratory of Molecular Carcinogenesis at the Fibiger Institute under the Danish Cancer Society, and prior to this worked at the Danish Occupational Health Institute as a chemist. He earned his MSc degree in Biochemistry in 1980 and took his PhD in 1994 in Biochemical Toxicology.

Rasmus Bro is a Professor at the Faculty of Science at the University of Copenhagen. His research focuses on the use of mathematics and statistics in chemistry, so-called chemometrics. He has worked with developing tensor decompositions and calibration models and multivariate calibration in general. His work is mainly in food and pharmaceutical research and more recently in environmental analysis and clinical diagnostics. R. Bro received his MSc in chemical engineering from the Technical University of Denmark in 1994 and his PhD from the University of Amsterdam, The Netherlands in 1998.
