Multi-Class Non-Negative Matrix Factorization for Comprehensive Feature Pattern Discovery

Yifeng Li, Member, IEEE, Youlian Pan, and Ziying Liu

Abstract—In this big data era, interpretable machine learning models are in strong demand for the comprehensive analytics of large-scale multi-class data. Characterizing all features in such data is a key but challenging step towards understanding their complexity. However, existing feature selection methods do not meet this need. In this paper, to address this problem, we propose a Bayesian multi-class non-negative matrix factorization (MC-NMF) model with structured sparsity that is able to discover ubiquitous and class-specific features. Variational update rules were derived for efficient decomposition. In order to relieve the need for model selection and to stably describe feature patterns, we further propose MC-NMF with stability selection (SS-MC-NMF), an ensemble method that collectively detects feature patterns from many runs of MC-NMF using different hyperparameter values and training subsets. We assessed our models on both simulated count data and multi-tumor RNA-seq data. The experiments revealed that our models were able to recover predefined feature patterns from the simulated data and to identify biologically meaningful patterns from the pan-cancer data.

Index Terms—multi-class non-negative matrix factorization, stability selection, feature pattern discovery, big data, cancer.

I. INTRODUCTION

We have entered the era of big data, where massive homogeneous and heterogeneous data are generated in various fields. For example, in genomic and clinical studies, high-throughput sequencing techniques have been invented for genome-wide capture of the activities of complex biological systems [1]. When an object or phenomenon is observed from various perspectives, the generated data are multi-view [2], a generalization of the traditional definition. We have recently categorized multi-view data into four types [2]: (1) the same set of features observed under different conditions or classes (multi-view of features, also known as multi-class data), (2) the same set of samples described by multiple sets of homogeneous features (multi-view of samples, the traditional definition of multi-view data, which can be called multi-feature-source data), (3) the same set of samples and the same set of features under different conditions (which can thus be represented by a tensor), and (4) different sets of samples and different sets of features (used to build multi-relational data). On one hand, we are thrilled by the richness of multi-view data, which allows us to comprehensively understand an object or system from different angles or levels. On the other hand, we are challenged by the question: how can we effectively integrate multi-view data in our analytics?

Correspondence should be addressed to Y. Li. Y. Li, Y. Pan and Z. Liu are with the Digital Technologies Research Centre, National Research Council Canada, Ottawa, Ontario, K1A 0R6, Canada. E-mail: {yifeng.li, youlian.pan, ziying.liu}@nrc-cnrc.gc.ca. Manuscript received January 06, 2017; revised June 07, 2017.

As one of the potential techniques to address this challenge, single-view matrix factorizations [3] have been extended to multi-view versions that have started to show successes in integrative data analyses. Generally, a single-view matrix factorization method decomposes a matrix X ∈ R^{M×N} into two low-rank matrices W ∈ R^{M×K} and H ∈ R^{K×N}, such that X = WH + E, where W and H are called the basis and coefficient matrices, respectively, because the n-th column of X can be approximated by a linear combination of the columns of W: x_n ≈ Σ_{k=1}^{K} h_{k,n} w_k. Sparse matrix factorization models have been developed to increase the interpretability of results. These models include sparse principal component analysis (PCA) [4], sparse non-negative matrix factorization (NMF) [5], and sparse representations [6]. Matrix factorization models with structured sparsity have been further investigated for meaningful interpretation or model selection. For instance, automatic relevance determination (ARD) has been applied to Bayesian PCA [7] and NMF [8] for determining the correct number of basis vectors. Bayesian versions of matrix factorizations have earned popularity due to their flexibility in imposing structured sparsity on factorization results. However, their parameter estimation is often time-consuming. Variational approximation algorithms have been derived to efficiently learn Bayesian NMF [9] and Bayesian factor analysis [10], which paves the way for developing and optimizing large-scale matrix factorization models. Multi-view matrix factorization models are more complicated than single-view matrix factorizations due to the heterogeneity and scale of multi-view data. In general, multi-view matrix factorizations aim to disentangle shared factors (or components) among views from local factors specific to individual views. As concisely reviewed below, a range of multi-view matrix factorization models have been published for various applications. Designed for type-2 multi-view data, two-directional orthogonal projections to latent structures (O2PLS) considers view-specific variations and covariation for two-view data [11]. It has been generalized in OnPLS for n (>2) views using a global one-versus-one strategy [12]. As an extension of inter-battery factor analysis (IBFA) [13] and multiple battery factor analysis (MBFA) [14], Bayesian group factor analysis (GFA) was proposed to find latent factors shared by all views and view(s)-specific latent factors from type-2 multi-view data [15]. Group component analysis (GCA) (applicable to type-1 multi-view data) [16] and integrative NMF (applicable to type-2 multi-view data) [17] impose structured sparsity on the coefficient matrix. Variational Bayesian group sparse non-negative matrix factorization (VBGS-NMF) produces class-specific (but ignores
ubiquitous) factors for supervised dictionary learning on type-1 multi-view data [18]. Collective matrix factorization (CMF) [19] extends tri-factorization to multi-relational data in association studies. A detailed discussion of recent multi-view matrix factorization models can be found in [20] and [2]. The aforementioned multi-view matrix factorizations can be applied as feature extraction methods that help understand data in the latent feature space. Identifying ubiquitous and class-specific input features is even more crucial for enhancing learning performance and understanding a system in the original feature space. However, multi-view matrix factorizations for this purpose on multi-class data (i.e. type-1 multi-view data) have not been well studied yet. To fill this gap, the focus of this paper is to devise a multi-view matrix factorization model for multi-class data such that input features with common activities across all classes and input features specific to one or a subset of classes can be detected with straightforward interpretability.

As an important and long-standing topic in machine learning, feature selection aims to find informative measurements in supervised and unsupervised learning [21]. Unfortunately, selecting an optimal subset of features with respect to an objective is NP-hard [22]. Thus, various approximation algorithms, such as LASSO-like sparse linear models, have been devised for regression and two-class classification problems [23]. One-versus-one or one-versus-rest strategies are usually applied to extend two-class feature selection methods to multi-class cases, such as the work in [24]. These strategies have been viewed as "unnatural", as their time complexity depends on the number of classes. Instead, deep feature selection [25] and random forest [26] are more "natural" for multi-class data, because selecting a feature subset is independent of the number of classes; thus, both methods represent the state of the art for multi-class feature selection. The characterization of all features is key to understanding the data and the problem. However, these existing feature selection methods only output a (short) list of feature candidates, rather than characterizing all features. Moreover, inconsistent selection of features, namely that a small perturbation of model parameters or training data leads to selecting very different feature subsets, raises concerns [27]. Stability selection [28], an implementation of the crowd-wisdom philosophy [29], has recently been recognized as a generic consistent feature selection strategy. In stability selection, a feature selection model, e.g. LASSO, is rerun multiple times on randomly sampled training subsets, and empirical probabilities (e.g. the frequency with which a feature is selected among the multiple runs) are then computed as feature importance scores. Random forest essentially applies the stability selection strategy as well, since a collection of randomized decision trees is constructed to measure feature importance scores [26]. Generally, machine learning models based on the wisdom of crowds are robust to changes in parameters and training data, and thus relieve the need for model selection. For instance, the dropout method [30] in deep learning combines multiple neural networks to avoid tweaking network structures. To push forward the frontier of feature selection research in the age of big data analytics, in this study we address the critical puzzle of comprehensive and stable characterization
of the patterns of all features in large-scale multi-class data. Our solution is inspired by recent advances in single-view non-negative matrix factorization optimized by variational approximation, and in multi-view matrix factorization techniques for extracting shared and view-specific latent factors from type-2 multi-view data. Our work makes two major contributions. (1) We propose a Bayesian multi-class non-negative matrix factorization (MC-NMF) model for type-1 multi-view data, in order to identify all feature patterns across multiple classes of non-negative data. We focus on non-negative data because they are common in this big data era, such as high-throughput sequencing data in genomic research and count data in text and image analytics. Using mean-field variational inference theory, an efficient algorithm is derived to learn the structured basis matrix and coefficient matrix. (2) An MC-NMF model powered by stability selection (SS-MC-NMF) is devised to avoid the need for hyperparameter selection and, at the same time, to consistently discover feature patterns. To the best of our knowledge, our models are the first successful work that enables a comprehensive delineation of feature patterns from massive multi-class data. Furthermore, the implementations of these models are publicly accessible from https://github.com/yifeng-li/mvmf.

Notations used hereafter:
1) A scalar variable is denoted by an italic lower-case symbol, e.g. x.
2) A column vector is written as an italic bold-face lower-case symbol, e.g. x.
3) A matrix is represented by an italic bold-face upper-case symbol, e.g. X.
4) A Poisson distribution is formulated as P(x|λ) ≡ (λ^x / x!) e^{−λ}, where λ equals both the mean and the variance.
5) A univariate exponential distribution is written as E(x|λ) ≡ λ e^{−λx}, where x ≥ 0 and λ denotes the rate. The entropy of this distribution is H(x) = −log λ + 1, which is used to compute the variational lower bound in Section II-D.
6) A Gamma distribution with shape parameter a and rate parameter b is formulated as G(x|a, b) ≡ (b^a / Γ(a)) x^{a−1} e^{−bx}. Its entropy is H(x) = (1 − a)ϕ(a) − log b + a + log Γ(a), where ϕ(a) is the digamma function defined by (d/da) log Γ(a).
7) A K-dimensional multinomial distribution is written as M(x_1, ···, x_K | N, p_1, ···, p_K) ≡ (Γ(N+1) / (Γ(x_1+1) ··· Γ(x_K+1))) ∏_{k=1}^{K} p_k^{x_k}, where Σ_{k=1}^{K} x_k = N and Σ_{k=1}^{K} p_k = 1. Its entropy is calculated as H(x) = −log Γ(N+1) + Σ_{k=1}^{K} (E[log Γ(x_k+1)] − N p_k log p_k).

In the rest of this article, the MC-NMF model and its variational optimization algorithm are presented in Section II. The derivations of its variational update rules and the lower bound of its log-likelihood are detailed in Supplemental File 1. The MC-NMF based feature characterization algorithm is proposed in Section III. The SS-MC-NMF feature characterization method is described in Section IV. Both MC-NMF and SS-MC-NMF models are assessed on a simulated dataset and a pan-cancer dataset in Section V. Finally, discussions and conclusions are given in Section VI.
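For readers who want to sanity-check the closed-form entropies above numerically, the short sketch below (our own illustration, not part of the released mvmf code) compares them against SciPy for the exponential and Gamma cases; the parameter values are arbitrary.

import numpy as np
from scipy.stats import expon, gamma
from scipy.special import digamma, gammaln

lam = 2.5                # exponential rate (arbitrary)
a, b = 3.0, 4.0          # Gamma shape and rate (arbitrary)

# Exponential entropy: H = -log(lam) + 1
h_exp = -np.log(lam) + 1.0
assert np.isclose(h_exp, expon(scale=1.0 / lam).entropy())

# Gamma entropy: H = (1 - a) * digamma(a) - log(b) + a + log Gamma(a)
h_gam = (1.0 - a) * digamma(a) - np.log(b) + a + gammaln(a)
assert np.isclose(h_gam, gamma(a, scale=1.0 / b).entropy())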


II. VARIATIONAL MULTI-CLASS NMF

A. Aim

Let X = [X^{(1)}, ···, X^{(V)}] ∈ R_+^{M×N} be a multi-class non-negative data matrix with X^{(v)} ∈ R_+^{M×N_v} representing the v-th class, where M is the number of features, N_v the number of samples in the v-th class, and N = Σ_{v=1}^{V} N_v the total number of samples. Each column of X accommodates a sample, and each row of X corresponds to the observed values of a feature. We define y = [y^{(1)}; ···; y^{(V)}] ∈ {1, ···, V}^N as a vector of N class labels corresponding to the samples in X, with y^{(v)} representing the N_v class labels of the samples X^{(v)} in the v-th class. To discover the underlying feature patterns, we aim to design an NMF model that factorizes the non-negative matrix X into structured sparse non-negative matrices W and H, such that each sample in X is approximated by a linear combination of ubiquitous and class-specific factors, and the activity of each feature in individual classes is characterized. Here, W ∈ R_+^{M×K} is called the basis matrix with K factors, which are composed of ubiquitous factors shared by all classes and class-specific factors. Let vector z ∈ {0, 1, ···, V}^K define the membership of each factor in a class. For example, z_1 = 0, z_2 = 0 and z_3 = 0 indicate that the first, second and third factors (columns of W) are ubiquitous factors; z_4 = 1 and z_5 = 1 indicate that the fourth and fifth factors are class-one-specific factors, and so on. Class-wise sparsity is desired on W, so that the activity of a feature in a class can be characterized. H ∈ R_+^{K×N} is the coefficient matrix, which also possesses group-wise sparsity, so that each column of H contains coefficients corresponding to the shared factors and class-specific factors in W. In this structure, suppose the n-th sample belongs to the v-th class, that is y_n = v; then x_n ≈ Σ_{k=1}^{K} δ(z_k − 0) h_{k,n} w_k + Σ_{k=1}^{K} δ(z_k − v) h_{k,n} w_k, where the first term is a non-negative linear combination of ubiquitous factors, the second term is a non-negative linear combination of factors specific to class v, and δ(t) is the Kronecker delta function that equals 1 if t = 0 and 0 otherwise. This desired NMF is illustrated in Figure 1.
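To make the role of z concrete, the following toy sketch (our own illustration; the function name is hypothetical) reconstructs a single sample of class v using only the ubiquitous and class-v columns of W, which is exactly the masked sum above.

import numpy as np

# z as in Figure 1: three ubiquitous factors (0), then factors of classes 1-3
z = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3])

def reconstruct_sample(W, h, v, z):
    """Approximate one sample of class v from its coefficient vector h."""
    mask = (z == 0) | (z == v)      # ubiquitous + class-v factors only
    return W[:, mask] @ h[mask]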


Fig. 1: A schematic example of multi-class NMF. In X, samples (columns) from different classes are marked in different colors. In W, shared factors across all classes are colored in purple; factors in orange, green and blue are specific to classes one, two and three, respectively. White means zero. In H, the purple blocks are coefficients corresponding to the common factors, while the orange, green and blue blocks correspond to the class-specific factors.

B. Model

To realize the aim depicted above, we propose the Bayesian multi-class non-negative matrix factorization (MC-NMF) model, which is formulated as follows:

$$p(X|S) = \prod_{m=1}^{M}\prod_{n=1}^{N} \delta\!\left(x_{m,n} - \sum_{k=1}^{K} s_{m,n,k}\right), \qquad (1)$$

$$p(S|W,H) = \prod_{m=1}^{M}\prod_{n=1}^{N}\prod_{k=1}^{K} \mathcal{P}(s_{m,n,k} \mid w_{m,k} h_{k,n}), \qquad (2)$$

$$p(W|\Lambda^{(W)}, z) = \prod_{m=1}^{M}\prod_{k=1}^{K} \mathcal{E}\!\left(w_{m,k} \,\Big|\, \sum_{u=0}^{V} \delta(z_k - u)\,\lambda^{(W)}_{m,u}\right), \qquad (3)$$

$$p(H|\Lambda^{(H)}, z, y) = \prod_{k=1}^{K}\prod_{n=1}^{N} \mathcal{E}\!\left(h_{k,n} \,\Big|\, \sum_{u=0}^{V}\sum_{v=1}^{V} \delta(z_k - u)\,\delta(y_n - v)\,\lambda^{(H)}_{u,v}\right), \qquad (4)$$

$$p(\Lambda^{(W)}|a_0, b_0) = \prod_{m=1}^{M}\prod_{u=0}^{V} \mathcal{G}(\lambda^{(W)}_{m,u} \mid a_0, b_0), \qquad (5)$$

$$p(\Lambda^{(H)}|A, B) = \prod_{u=0}^{V}\prod_{v=1}^{V} \mathcal{G}(\lambda^{(H)}_{u,v} \mid a_{u,v}, b_{u,v}). \qquad (6)$$

This model is explained as follows. In Equations (1) and (2), S ∈ R_+^{M×N×K} is defined as a latent three-way tensor variable, so that the observed data X equal the sum of S along the third axis, that is x_{m,n} = Σ_{k=1}^{K} s_{m,n,k}. The idea of using a three-way auxiliary variable was first introduced in a Bayesian NMF [9] for the convenience of deriving the variational lower bound of the log-likelihood. In Equation (2), the (m, n, k)-th element of S follows the Poisson distribution P(s_{m,n,k} | w_{m,k} h_{k,n}). Equation (1) takes advantage of one important property of the Poisson distribution: the sum of Poisson-distributed independent random variables (Σ_{k=1}^{K} s_{m,n,k}) also follows a Poisson distribution, with mean equal to the sum of the means of these random variables (Σ_{k=1}^{K} w_{m,k} h_{k,n}). In Equation (1), δ(t) is a Dirac delta function whose value equals +∞ if t = 0 and 0 otherwise. Thus, Equations (1) and (2) imply x_{m,n} ∼ P(x_{m,n} | Σ_{k=1}^{K} w_{m,k} h_{k,n}). The motivation for assuming a Poisson distribution is that many modern data are naturally non-negative and can be represented by counts or frequencies. For example, in bioinformatics, omics data captured by next-generation sequencing platforms can be transformed to read counts per genomic region (e.g. a gene). Using vector z, the basis factors in W are labelled as ubiquitous factors (located in the leading columns) and class-specific factors. For instance, in Figure 1, z = [0, 0, 0, 1, 1, 2, 2, 2, 3, 3]^T. Equation (3) implements an ARD mechanism [7] that enforces class-wise sparsity on each row of W, where elements belonging to the same class are exponentially distributed with a shared parameter λ^{(W)}_{m,u}, which follows a Gamma distribution with hyperparameters a_0 (shape) and b_0 (rate) as in Equation (5). In Equations (3) and (4), δ(t) is a Kronecker delta function. Each element in H has a row class label as indicated in vector z and a column class label as indicated in vector y. Thus, block-wise sparsity is enforced in Equation (4) by letting all elements of the same row class and the same column class follow the same exponential distribution, whose parameter is Gamma-distributed as specified in Equation (6). By manipulating the hyperparameters A ∈ R_+^{(V+1)×V} and B ∈ R_+^{(V+1)×V}, we should be able to obtain a sparse H with ubiquitous non-zero blocks at the top and class-specific non-zero blocks in the
diagonal as illustrated in Figure 1. To realize this in practice for 3-class data, A and B should be organized respectively as

$$A = \begin{bmatrix} a_{\mathrm{large}} & a_{\mathrm{large}} & a_{\mathrm{large}} \\ a_{\mathrm{large}} & a_{\mathrm{small}} & a_{\mathrm{small}} \\ a_{\mathrm{small}} & a_{\mathrm{large}} & a_{\mathrm{small}} \\ a_{\mathrm{small}} & a_{\mathrm{small}} & a_{\mathrm{large}} \end{bmatrix}, \qquad B = \begin{bmatrix} b_{\mathrm{large}} & b_{\mathrm{large}} & b_{\mathrm{large}} \\ b_{\mathrm{large}} & b_{\mathrm{small}} & b_{\mathrm{small}} \\ b_{\mathrm{small}} & b_{\mathrm{large}} & b_{\mathrm{small}} \\ b_{\mathrm{small}} & b_{\mathrm{small}} & b_{\mathrm{large}} \end{bmatrix}, \qquad (7)$$

where (a_large, b_large) controls the non-zero blocks in H and (a_small, b_small) is responsible for the zero blocks.
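As an aside, a minimal sketch (ours, not necessarily how the released mvmf code does it) of assembling A and B with this block structure for an arbitrary number of classes; the numeric values are illustrative placeholders that loosely follow the settings reported later in Section IV.

import numpy as np

def build_hyperparams(V, a_large, a_small, b_large, b_small):
    """(V+1) x V shape/rate matrices: row 0 is the ubiquitous block,
    row u (u >= 1) is the block specific to class u."""
    A = np.full((V + 1, V), a_small)
    B = np.full((V + 1, V), b_small)
    A[0, :], B[0, :] = a_large, b_large          # ubiquitous factors
    for u in range(1, V + 1):                    # class-u-specific factors
        A[u, u - 1], B[u, u - 1] = a_large, b_large
    return A, B

A, B = build_hyperparams(V=3, a_large=10.0, a_small=100.0,
                         b_large=10.0, b_small=1e-30)   # placeholder values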


Given the model defined above, the joint distribution of all visible and latent variables can thus be factorized as

$$p(X, S, W, H, \Lambda^{(W)}, \Lambda^{(H)} \mid z, y, a_0, b_0, A, B) = p(X|S)\, p(S|W,H)\, p(W|\Lambda^{(W)}, z)\, p(H|\Lambda^{(H)}, z, y)\, p(\Lambda^{(W)}|a_0, b_0)\, p(\Lambda^{(H)}|A, B). \qquad (8)$$

We denote the latent variables by θ = {S, W, H, Λ^{(W)}, Λ^{(H)}}. In theory, the derivation of the posterior distribution p(θ | X, z, y, a_0, b_0, A, B) is intractable, which prevents us from inferring the (expected) values of θ. Sampling-based algorithms, such as Markov chain Monte Carlo (MCMC) sampling and Gibbs sampling [31], are computationally unaffordable for large-scale problems. Thus, inspired by [9], we resort to a mean-field variational algorithm that approximates the posterior distribution by factorizable distributions. For a gentle introduction to variational approximation inference theory, readers are encouraged to read [32] and [33]. The derivations of the variational update rules in our MC-NMF and their matrix notations are available in Sections S.2 and S.3 of Supplemental File 1. The overall learning algorithm and the corresponding variational lower bound of the log-likelihood are presented below in Sections II-C and II-D, respectively.

C. Optimization Algorithm

The complete variational algorithm is described in Algorithm 1 (see Supplemental File 1 for the derivations of the variational update rules used in this algorithm). It takes a multi-class matrix as input, evolves via alternating updates, and finally returns the expected basis matrix and coefficient matrix (called sufficient statistics in mean-field methods), as well as a feature activity matrix. The number of basis vectors for each class needs to be pre-specified. In general, selecting the optimal number of latent factors in a generative model often requires heuristic strategies. For example, in PCA it can be determined by counting the number of eigenvalues above a given cutoff. In this paper, the simplest method is to uniformly use a roughly estimated number for all classes. Optionally, an ARD-based method can be adopted as a more sophisticated approach [8]: for each class, latent factors corresponding to (nearly) zero columns in the basis matrix or (nearly) zero rows in the coefficient matrix are abandoned. Before running the alternating updates in a loop, the sufficient statistics are randomly initialized. In one iteration of the loop, all sufficient statistics are updated sequentially. If the algorithm reaches the maximal number of allowed iterations, it terminates. The variational lower bound of the log-likelihood (see Section II-D) can optionally be computed to monitor the convergence of the algorithm. When it is used, if the local mean change rate
of the lower bound is less than a tiny positive number, the algorithm should terminate. Given a sequence of variational lower bounds in vector l ∈ R_-^i, which contains the i negative numbers computed in previous and current iterations, the differences of these lower bounds can be represented by d, where d_i = l_i − l_{i−1}. The local mean change rate at the i-th iteration can thus be calculated as r_i = (d_i + d_{i−1} + ··· + d_{i−w+1}) / w, where w is the width of the averaging window, say 5. At the end, the final expected basis matrix E^(W) and expected coefficient matrix E^(H), as well as the feature activity matrix F̄, are returned. F̄ is a non-negative matrix of size M × (V + 1), where element f̄_{m,1} reflects the ubiquitously active level of the m-th feature, element f̄_{m,2} implies the active level of the m-th feature specific to the first class, and so on. Thus, F̄ is a key matrix for understanding the activity of features across classes, and it is taken as input by Algorithm 2.

D. Variational Lower Bound

According to mean-field variational inference theory, we derive the variational lower bound of the log-likelihood to monitor the convergence of our variational algorithm (Algorithm 1). According to the concept of the variational lower bound [33], the log-likelihood can be decomposed as

$$\log p(X) = l(q) + \mathrm{KL}\big(q(\theta) \,\|\, p(\theta|X)\big), \qquad (9)$$

where q(θ) is the approximate distribution to the posterior and is defined in Section S.3 of Supplemental File 1, the first term is the variational lower bound, and the second term is the Kullback-Leibler (KL) divergence. From KL(q(θ) || p(θ|X)) ≥ 0 [34], we know log p(X) ≥ l(q). Thus, l(q) serves as a lower bound of the log-likelihood. It can be computed as the combination of the expected log-joint distribution with respect to q(θ) [where the joint distribution is defined in Equation (8)] and the functional entropy of q(θ):

$$\begin{aligned} l(q) &= \int q(S, W, H, \Lambda^{(W)}, \Lambda^{(H)}) \log \frac{p(X, S, W, H, \Lambda^{(W)}, \Lambda^{(H)})}{q(S, W, H, \Lambda^{(W)}, \Lambda^{(H)})} \, d(S, W, H, \Lambda^{(W)}, \Lambda^{(H)}) \\ &= E_q[\log p(X, S, W, H, \Lambda^{(W)}, \Lambda^{(H)})] - E_q[\log q(S, W, H, \Lambda^{(W)}, \Lambda^{(H)})] \\ &= E_q[\log p(X|S)] + E_q[\log p(S|W,H)] + E_q[\log p(W|\Lambda^{(W)}, z)] + E_q[\log p(H|\Lambda^{(H)}, z, y)] \\ &\quad + E_q[\log p(\Lambda^{(W)}|a_0, b_0)] + E_q[\log p(\Lambda^{(H)}|A, B)] \\ &\quad + H(S) + H(W) + H(H) + H(\Lambda^{(W)}) + H(\Lambda^{(H)}), \end{aligned} \qquad (10)$$

where H(x) = −∫ p(x) log p(x) dx = E[−log p(x)] defines the functional entropy of p(x). The functional entropies of the related distributions have been given at the end of Section I. The exact forms of the above terms are derived in Section S.4 of Supplemental File 1. Substituting these exact forms into Equation (10), we calculate the variational lower bound using Equation (11), where the function Σ(·) calculates the sum of matrix elements. For discretely distributed data, the variational lower bound at an iteration remains negative [33]. In a converging algorithm, it should increase as the algorithm evolves. When the algorithm converges, the lower bound should reach a steady state.
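As a concrete illustration of the stopping rule in Section II-C (our own sketch; the helper name is hypothetical), the local mean change rate can be computed from the history of lower bounds as follows.

import numpy as np

def local_mean_change_rate(lower_bounds, w=5):
    """lower_bounds: the l(q) values collected so far (one per iteration)."""
    if len(lower_bounds) < w + 1:
        return np.inf                 # not enough history yet
    d = np.diff(lower_bounds)         # d_i = l_i - l_{i-1}
    return np.mean(d[-w:])            # average change over the last w iterations

# inside the training loop:
# if local_mean_change_rate(lb_history) < 1e-4:
#     break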


Algorithm 1 Variational Algorithm for MC-NMF
Input: X: training examples; y: class labels; z: class labels of basis vectors; a_0, b_0, A, B: hyperparameters.
Output: E^(W): expected basis matrix; E^(H): expected coefficient matrix; F̄: feature activity matrix.
  K = length(z)                                  {number of factors}
  M = number_of_rows(X); N = number_of_columns(X); V = number_of_classes(y)
  Y = matricize(y)                               {Y ∈ {0,1}^{N×V}, y_{n,v} = 1 if y_n = v}
  Z = matricize(z)                               {Z ∈ {0,1}^{K×(V+1)}, z_{k,u} = 1 if z_k = u}
  {initialization}
  E^(Λ^(W)) = G(a_0, b_0);  E^(Λ^(H)) = G(A, B)
  Â^(W) = 1_{M×K};  B̂^(W) = E^(Λ^(W)) Z^T;  E^(W) = G(Â^(W), B̂^(W))
  Â^(H) = 1_{K×N};  B̂^(H) = Z E^(Λ^(H)) Y^T;  E^(H) = G(Â^(H), B̂^(H))
  L^(W) = E^(W);  L^(H) = E^(H)
  {iteratively update the model parameters; A ∗ B and A / B denote element-wise multiplication and division, respectively}
  num_iter = 1
  while num_iter ≤ max_iter do
    Σ^(W) = L^(W) ∗ [ (X / (L^(W) L^(H))) (L^(H))^T ]
    Σ^(H) = [ (L^(W))^T (X / (L^(W) L^(H))) ] ∗ L^(H)
    Â^(W) = 1 + Σ^(W);  B̂^(W) = 1_{M×N} (E^(H))^T + E^(Λ^(W)) Z^T;  E^(W) = Â^(W) / B̂^(W)
    Â^(H) = 1 + Σ^(H);  B̂^(H) = (E^(W))^T 1_{M×N} + Z E^(Λ^(H)) Y^T;  E^(H) = Â^(H) / B̂^(H)
    L^(W) = e^{ϕ(Â^(W))} / B̂^(W);  L^(H) = e^{ϕ(Â^(H))} / B̂^(H)
    Â^(Λ^(W)) = a_0 + 1_{M×K} Z;  B̂^(Λ^(W)) = b_0 + E^(W) Z;  E^(Λ^(W)) = Â^(Λ^(W)) / B̂^(Λ^(W))
    Â^(Λ^(H)) = A + Z^T 1_{K×N} Y;  B̂^(Λ^(H)) = B + Z^T E^(H) Y;  E^(Λ^(H)) = Â^(Λ^(H)) / B̂^(Λ^(H))
    num_iter = num_iter + 1
    {compute variational lower bound}
    compute l(q) as in Equation (11)
    if num_iter > max_iter or r_i < 1e−4 then
      break
    end if
  end while
  z̄ = take_column_sum(Z)   # the first element of z̄ is the number of basis vectors shared by all classes; the u-th (2 ≤ u ≤ V+1) element of z̄ indicates the number of basis vectors private to the (u−1)-th class
  Z̄ = [z̄; ···; z̄]          # Z̄ contains K copies of z̄ in its rows
  obtain the feature activity matrix: F̄ = E^(W) (Z / Z̄)
  return E^(W), E^(H), and F̄


Since the computation of the lower bound is not economical for large-scale data, it might be opted out of the optimization algorithm. The lower bound is computed as

$$\begin{aligned} l(q) = {} & \textstyle\sum \big[ -\log\Gamma(X+1) + X \ast \log(L^{(W)} L^{(H)}) - E^{(W)} E^{(H)} \big] \\ & + \textstyle\sum \big[ L^{(\Lambda^{(W)})} Z^T - (E^{(\Lambda^{(W)})} Z^T) \ast E^{(W)} \big] \\ & + \textstyle\sum \big[ Z L^{(\Lambda^{(H)})} Y^T - (Z E^{(\Lambda^{(H)})} Y^T) \ast E^{(H)} \big] \\ & + \textstyle\sum \big[ a_0 \log b_0 - \log\Gamma(a_0) + (a_0 - 1) L^{(\Lambda^{(W)})} - b_0 E^{(\Lambda^{(W)})} \big] \\ & + \textstyle\sum \big[ A \ast \log B - \log\Gamma(A) + (A - 1) \ast L^{(\Lambda^{(H)})} - B \ast E^{(\Lambda^{(H)})} \big] \\ & + \textstyle\sum \big[ \log\Gamma(\hat{A}^{(W)}) - (\hat{A}^{(W)} - 1)\,\varphi(\hat{A}^{(W)}) - \log \hat{B}^{(W)} + \hat{A}^{(W)} \big] \\ & + \textstyle\sum \big[ \log\Gamma(\hat{A}^{(H)}) - (\hat{A}^{(H)} - 1)\,\varphi(\hat{A}^{(H)}) - \log \hat{B}^{(H)} + \hat{A}^{(H)} \big] \\ & + \textstyle\sum \big[ \log\Gamma(\hat{A}^{(\Lambda^{(W)})}) - (\hat{A}^{(\Lambda^{(W)})} - 1)\,\varphi(\hat{A}^{(\Lambda^{(W)})}) - \log \hat{B}^{(\Lambda^{(W)})} + \hat{A}^{(\Lambda^{(W)})} \big] \\ & + \textstyle\sum \big[ \log\Gamma(\hat{A}^{(\Lambda^{(H)})}) - (\hat{A}^{(\Lambda^{(H)})} - 1)\,\varphi(\hat{A}^{(\Lambda^{(H)})}) - \log \hat{B}^{(\Lambda^{(H)})} + \hat{A}^{(\Lambda^{(H)})} \big]. \end{aligned} \qquad (11)$$

III. MC-NMF FOR FEATURE PATTERN DISCOVERY

Like other NMFs, MC-NMF is capable of extracting new features in a supervised fashion. Uniquely, MC-NMF can be applied to discover shared input features and class(es)-specific input features, which are useful for building predictive models and interpreting massive multi-class data. In this paper, we focus on the computational discovery of such feature patterns. Based on MC-NMF, we propose a feature pattern discovery method, shown in Algorithm 2. In this method, the feature activity indicator matrix F is a binary matrix of size M × (V + 1) obtained by applying a threshold [by default 0.1 × mean(F̄)] to the feature activity matrix F̄ produced by Algorithm 1. If f̄_{m,u} from F̄ is greater than this threshold, the m-th feature is defined to be active in the (u−1)-th class (when u = 1, it means the m-th feature is ubiquitously active among all classes); otherwise, this feature is defined to be inactive in the (u−1)-th class. In this way, each row of F forms a binary code string that describes the activity of the corresponding feature. For example, in a dataset with V = 3 classes, the binary code "0000" means the corresponding feature is totally silent; "1000" indicates that the values of this feature in all classes can be completely explained by the ubiquitous factors; "0100" unveils that this feature is only active in the first class; "1100" implies that the ubiquitous factors can totally explain the expression of this feature in classes 2 and 3, but can only partially explain the activity of this feature in class 1; "1110" suggests that the ubiquitous factors can totally explain the variance of this feature in class 3 but only partially in classes 1 and 2; and "1111" points out that the shared factors only partially explain the activities of this feature in all classes and each class still needs additional specific explanations. Thus, each binary code represents a feature pattern. Features with the same binary pattern code show the same behaviour and thus can be grouped together. Therefore, these feature patterns are very informative for understanding multi-class data. We score, in various ways, features with the same binary pattern.
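A small sketch (ours, not the released implementation) of how the binary pattern codes above can be derived from the feature activity matrix F̄ returned by Algorithm 1; the default threshold follows the 0.1 × mean(F̄) rule.

import numpy as np

def feature_patterns(F_bar, threshold=None):
    """F_bar: (M, V+1) feature activity matrix; returns one code string per feature."""
    if threshold is None:
        threshold = 0.1 * F_bar.mean()
    F = (F_bar >= threshold).astype(int)        # feature activity indicator matrix
    return ["".join(map(str, row)) for row in F]

# toy example with V = 3: row 1 is class-1-specific, row 2 matches pattern "1100"
F_bar = np.array([[0.0, 0.9, 0.0, 0.0],
                  [0.8, 0.7, 0.0, 0.0]])
print(feature_patterns(F_bar))                  # ['0100', '1100']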


Algorithm 2 MC-NMF Based Feature Pattern Discovery
Input: F̄: feature activity or empirical probability matrix; g: input feature list; threshold_F̄: a positive threshold on the feature activity matrix F̄ above which the feature activity indicator matrix F is defined; rank_method: feature ranking method; threshold_S: the minimal threshold to cut the Wilcoxon-rank-sum-test scores; max_c: the maximally allowed number of selected features in each pattern.
Output: F: feature activity indicator matrix; s^(mbv), s^(mfv), s^(wrst): feature scores; ind^(mbv), ind^(mfv), ind^(wrst): indices to reorder features into patterns; g_selected: selected features for building predictive models.
  get the feature activity indicator matrix: F = F̄ ≥ threshold_F̄
  {compute feature scores}
  for m = 1 to M do
    # mean-basis-value score
    f_s = Σ F_{m,:}                   # number of active classes
    f_m = Σ F̄_{m,:}                   # sum of all active basis values
    s_m^(mbv) = f_m / f_s             # mean of all active basis values
    # mean-feature-value score
    s_m^(mfv) = mean(X_{m,active})    # decided by F_{m,:}; active and inactive are the indices of samples in the active and inactive classes, respectively
    # Wilcoxon-rank-sum-test score
    pval = Wilcoxon_rank_sum_test(X_{m,active}, X_{m,inactive})
    s_m^(wrst) = −log10(pval)
  end for
  {rank features}
  c, c_unique = binary_codes(F)       # convert each row of F to a binary code string, e.g. [1, 0, 0, 1] → "1001"; c contains the binary codes of all features, c_unique contains all unique binary codes
  ind^(mbv) = rank_features(c, s^(mbv))
  ind^(mfv) = rank_features(c, s^(mfv))
  ind^(wrst) = rank_features(c, s^(wrst))
  {feature selection for classification}
  g_selected = {}                     # store selected features
  for c in c_unique do
    g_c = the top max_c features in pattern c with Wilcoxon-rank-sum-test score greater than threshold_S
    g_selected = g_selected + g_c
  end for
  return F, s^(mbv), s^(mfv), s^(wrst), ind^(mbv), ind^(mfv), ind^(wrst), g_selected

In Algorithm 2, we separately rank features based on mean basis values, mean feature values, and negative log p-values produced by the Wilcoxon rank-sum test. For a feature, the mean basis value is computed as the averaged feature activity over all active classes in the corresponding row of F̄. The mean feature value is calculated by averaging all active feature values in the corresponding row of X. The Wilcoxon rank-sum test for the m-th feature is carried out on the active feature values versus the inactive feature values. Eventually, the features can be ranked by their patterns first and then sorted, within each pattern, using one of the aforementioned scoring methods.
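A brief sketch (ours; the helper name is hypothetical, and SciPy's ranksums is used as one possible implementation of the test) of the Wilcoxon-rank-sum-test score for a single feature.

import numpy as np
from scipy.stats import ranksums

def wrst_score(x_active, x_inactive):
    """x_active / x_inactive: a feature's values in its active and inactive classes."""
    stat, pval = ranksums(x_active, x_inactive)
    return -np.log10(pval)

# toy usage: a feature expressed much more strongly in its active classes
rng = np.random.default_rng(0)
print(wrst_score(rng.poisson(50, 100), rng.poisson(5, 100)))   # large score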


Algorithm 3 SS-MC-NMF Based Feature Pattern Discovery
Input: X: training examples; y: class labels; z: class labels of basis vectors; g: input feature list; threshold_F̃̄: a positive threshold on the individual feature activity matrices F̃̄ obtained by running MC-NMF over sampled training subsets; α: a positive threshold on the final empirical probability matrix F̄; rank_method: feature ranking method; threshold_S: the minimal threshold to cut the Wilcoxon-rank-sum-test scores; max_c: the maximally allowed number of selected features in each pattern.
Output: F: final feature activity indicator matrix; s^(mbv), s^(mfv), s^(wrst): feature scores; ind^(mbv), ind^(mfv), ind^(wrst): indices to reorder features into patterns; g_selected: selected features for building predictive models.
  {SS-MC-NMF}
  F̄ = 0_{M×(V+1)}
  for i = 1 to max_run do
    sample a_0 from set_a0 and a_large from set_alarge
    sample (50%) training examples {X̃, ỹ} from {X, y}
    run Algorithm 1 on {X̃, ỹ} to obtain the corresponding feature activity matrix F̃̄
    F̃ = F̃̄ ≥ threshold_F̃̄               # the corresponding feature activity indicator matrix
    update F̄: F̄ = F̄ + F̃
  end for
  obtain the final empirical probability matrix: F̄ = F̄ / max_run
  run Algorithm 2 based on F̄            # use α as the threshold
  return F, s^(mbv), s^(mfv), s^(wrst), ind^(mbv), ind^(mfv), ind^(wrst), g_selected
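For concreteness, a compact Python rendering of this loop (our own sketch, not the released mvmf code; run_mc_nmf stands in for Algorithm 1 and is assumed to return the feature activity matrix of the sampled subset):

import numpy as np

def ss_mc_nmf(X, y, z, a0_set, alarge_set, run_mc_nmf,
              max_run=200, threshold=0.1, subsample=0.5, seed=None):
    rng = np.random.default_rng(seed)
    M, N = X.shape
    V = len(np.unique(y))
    F_bar = np.zeros((M, V + 1))
    for _ in range(max_run):
        a0 = rng.choice(a0_set)                       # sample hyperparameters
        a_large = rng.choice(alarge_set)
        idx = rng.choice(N, size=int(subsample * N), replace=False)
        F_run = run_mc_nmf(X[:, idx], y[idx], z, a0, a_large)
        F_bar += (F_run >= threshold)                 # indicator matrix of this run
    return F_bar / max_run                            # empirical probability matrix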

The top ranked features in all feature patterns (or only in class-specific patterns) can be selected for building predictive models. In the above example, the patterns specific to the first class include "0100", "1100", "0011" and "1011". In a nutshell, Algorithm 2 takes a feature activity matrix (generated by Algorithm 1) as input and returns the feature patterns, the feature scores, the ranked order of features based on both feature patterns and feature scores, and the selected features for building predictive models. Unlike existing feature selection methods, which are often used as black boxes and only provide a small subset of discriminative features, this algorithm characterizes the activities of all features across all classes in a single run using easily interpretable binary codes.

IV. MC-NMF WITH STABILITY SELECTION FOR FEATURE PATTERN DISCOVERY

Algorithm 2 for discovering feature patterns poses two challenges: (1) the performance of MC-NMF is affected by the four hyperparameters {a_0, b_0, A, B}, different settings of which may lead to varied factorizations and thus feature activity indicator matrices; (2) different random initialization points of its optimization, or small changes in the training set, may result in inconsistent factorizations and hence feature characterizations. Sensitivity to parametrization and data perturbation is a concern in many feature selection methods [35].


In this paper, we further propose to apply the stability selection strategy [28] to MC-NMF in order to overcome these two issues and report robust feature patterns. We name this method SS-MC-NMF, and it is depicted in Algorithm 3. In this method, MC-NMF is first rerun multiple (e.g. 200) times. In each run, the hyperparameters are randomly sampled from a predefined list, a subset (we used 50%) of training examples is randomly sampled from the entire training set, and the model parameters are randomly initialized. In practice, we fix b_0 = a_0 mean(X), b_large = a_large, a_small = 100, and b_small = 10^{-30}; thus, only a_0 and a_large are actually sampled. After the predefined number of runs of MC-NMF, the average of the feature activity indicator matrices of all runs is obtained as the final empirical probability matrix, denoted by F̄. A feature that repeatedly shows up in a class in most runs is likely to be truly active in this class. Finally, taking the empirical probability matrix F̄ as input, Algorithm 2 is called to apply the threshold α ∈ (0, 1] to the empirical probability matrix in order to determine the activity pattern of each feature, and to rank the features based on their patterns and scores. Therefore, SS-MC-NMF can further enhance the performance of MC-NMF-based feature pattern discovery through the wisdom of the crowd towards consistent characterization.

V. EXPERIMENTS

We executed our models on a simulated count dataset and a real RNA-seq dataset in order to assess the convergence of the variational approximation algorithm, the ability of MC-NMF and SS-MC-NMF to recover feature patterns, and the potential of SS-MC-NMF for building a predictive model. For each dataset, the data preprocessing, experimental procedure, and findings are described below. Due to the lack of comparable benchmark methods for feature pattern discovery from multi-class data (methods such as GFA for type-2 multi-view data cannot simply be applied to type-1 multi-view data via matrix transpose for feature pattern discovery; see [20] for a discussion), we instead carried out enrichment analysis on the features with certain patterns discovered from the RNA-seq data.

A. Recovery of Predefined Patterns in Simulated Data

We first investigated whether our models can recover predefined structured patterns from simulated data. A generative model X = WH was designed to generate a dataset with M = 160 features and N = 3,000 samples from three classes. The basic idea of this simulation was to predefine W and H and then generate X, so that the ground truth of the feature patterns was known and could be compared with the feature patterns discovered from X using MC-NMF or SS-MC-NMF. In addition to three ubiquitous basis factors, each class was allowed three specific basis factors. Thus, K = 3 + 3 × 3 = 12. Since there were three classes, the total number of possible patterns was P = 2^4 = 16, where a pattern is defined by a binary code as above, e.g. "0000", "1000", "0100", ···, "1111", where 1 denotes active and 0 inactive. The first digit in a binary code indicates the ubiquitous activity across all classes.


We stored all patterns in matrix R ∈ {0, 1}^{P×(V+1)} and let each pattern have 10 features, obtaining M = 160 features in total. To encode this, we defined matrix Q ∈ {0, 1}^{M×P} with q_{m,p} = 1 denoting that the m-th feature belongs to the p-th pattern, and matrix T = QR ∈ {0, 1}^{M×(V+1)} with t_{m,u} = 1 indicating that the m-th feature is active in the (u−1)-th class (u = 1 for ubiquitous activity). We let each class have multiple factors, encoded by matrix Z ∈ {0, 1}^{K×(V+1)}, where z_{k,u} = 1 denotes that the k-th factor belongs to the (u−1)-th class. Thus, S^(W) = T Z^T ∈ {0, 1}^{M×K} represents the structured non-zero blocks in W. Matrix Λ^(W) ∈ R_+^{M×(V+1)} was sampled from the Gamma distribution G(a^(W), b^(W)), where we set the parameters a^(W) = 10 and b^(W) = 10^5. If t_{m,u} = 1 and z_{k,u} = 1, we sampled W[m, k] ∼ E(λ^(W)_{m,u}); otherwise W[m, k] = 0. The coefficient matrix H was generated similarly. We defined matrix C ∈ {0, 1}^{(V+1)×N}, where c_{u,n} = 1 indicates that the n-th sample belongs to class (u − 1). Thus, matrix S^(H) = ZC ∈ {0, 1}^{K×N} represents the structured non-zero blocks in H. Matrix Λ^(H) ∈ R_+^{(V+1)×V} was sampled from G(a^(H), b^(H)), where a^(H) = 10 and b^(H) = 1. Finally, we sampled H[k, n] ∼ E(λ_{u,v}) if z_{k,u} = 1 and c_{u,n} = 1; otherwise H[k, n] = 0. Once W and H were obtained, the simulated data were generated using X = WH.

After running MC-NMF, we compared the feature activity indicator matrix F (defined in Algorithm 2) with matrix T (defined above) to see how well the model recovers the predefined feature patterns. To quantify this, we defined two feature-recovery accuracy measurements. The first, called row-wise-match (RWM) accuracy, is defined as a_c = M_r / M, where M_r is the number of consistent rows in F and T. The second, called element-wise-match (EWM) accuracy, is defined as a_e = M_e / (M × K), where M_e is the number of agreed elements in matrices F and T.

We carried out a model search with z = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]^T and both a_0 and a_large drawn from {10^3, 10^2, 50, 10, 5, 1, 0.8, 0.6, 0.4, 0.2, 0.1, 0.05, 10^{-2}, 5 × 10^{-3}, 10^{-3}}. Hyperparameters b_0 and b_large were tied as b_0 = a_0 mean(X) and b_large = a_large. The corresponding KL divergence, RWM and EWM accuracies are displayed in Figure 2a, from which we have the following findings. First, MC-NMF is able to recover all real patterns hidden in these data given a proper hyperparameter setting. Second, the EWM accuracy is higher than the RWM accuracy, which is reasonable, because a mismatched row may still contain matched elements. Third, a good model recovering all the patterns does not necessarily achieve the lowest KL divergence. One can observe that the KL divergence is determined by the hyperparameters and fluctuates as the feature-recovery accuracies decrease. Thus, the KL divergence cannot be employed as a metric to select hyperparameters. Figure 2b shows the evolution of the variational lower bound and the corresponding local mean change rate (the smoothed change rate of the variational lower bound) as the algorithm runs. It confirms that the variational algorithm converges quickly as the algorithm iterates. After 200 iterations, the algorithm reaches a steady state. The heat map of the decomposition result corresponding to Figure 2b is visualized in Figure 2c, which indicates that MC-NMF is able to induce structured sparsity as desired.
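A tiny sketch (ours) of the two recovery accuracies just defined, computed from the binary indicator matrix F and the ground-truth matrix T:

import numpy as np

def recovery_accuracies(F, T):
    """F, T: binary matrices of identical shape (features x classes-plus-ubiquitous)."""
    rwm = np.mean(np.all(F == T, axis=1))   # fraction of fully matched feature rows
    ewm = np.mean(F == T)                   # fraction of matched entries
    return rwm, ewm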

Fig. 2: MC-NMF on simulated data (values were scaled to [0, 1] for visualization purposes only). (a) KL divergences, EWM and RWM accuracies corresponding to a variety of hyperparameter settings. (b) Variational lower bound of the log-likelihood on simulated data with a_0 = 0.1 and a_large = 10, together with the local mean change rate, plotted against the iteration number. (c) Heat map of the MC-NMF factorization result. (d) Heat map of the actual basis matrix.

The actual basis matrix is displayed in Figure 2d for comparison with the predicted basis matrix in Figure 2c. Ignoring the order of basis vectors within a class, we see that MC-NMF is able to recover the predefined patterns. For example, the first predicted ubiquitous basis factor clearly corresponds to the second actual ubiquitous basis vector; for the first class, the first predicted basis vector is nearly identical to the third actual basis vector. This experiment shows that MC-NMF can effectively identify feature patterns from multi-class data and provides a straightforward way to visualize and understand the data.

In order to bypass the model selection challenge, improve the stability of discovery, and preserve the capability of MC-NMF, we resorted to SS-MC-NMF on the simulated data. In our experiment using SS-MC-NMF, a_0 and a_large were randomly sampled from the set {10^3, 10^2, 50, 10, 5, 1, 0.8, 0.6, 0.4, 0.2, 0.1, 0.05, 10^{-2}, 5 × 10^{-3}, 10^{-3}}, while keeping b_0 and b_large tied as in the MC-NMF experiment described above. As the performance of SS-MC-NMF is affected by the number of samplings (denoted by max_run in Algorithm 3) and by the threshold (denoted by threshold_F̄ in Algorithm 2) used to binarize the empirical probability matrix F̄ into F, we monitored the RWM and EWM accuracies as these two factors changed, as shown in Figures 3a and 3b, from which we can conclude that SS-MC-NMF is robust to variations in these factors, particularly when the threshold sits in the range [0.65, 0.90] and the number of samplings exceeds 100. Figure 3c visualizes the empirical probability matrix, which is the mean of the feature activity indicator matrices over 200 runs of MC-NMF on randomly sampled data. When a threshold of 0.9 is applied to this empirical probability matrix, the predefined feature patterns can be completely recovered by SS-MC-NMF. This experiment shows that the strategy of collective wisdom in SS-MC-NMF indeed avoids searching for a suitable single point in the latent variable and hyperparameter space, and stably and accurately discovers feature patterns in a probabilistic format.

Fig. 3: SS-MC-NMF on simulated data. (a) Change of RWM accuracy with the number of samplings and the threshold on the empirical probability matrix. (b) Change of EWM accuracy with the number of samplings and the threshold on the empirical probability matrix. (c) Empirical probability matrix F̄ produced by SS-MC-NMF.

B. Meta-Analysis of Tumor RNA-Seq Data

We further assessed the performance of MC-NMF and SS-MC-NMF on a large real-life RNA-seq dataset obtained from The Cancer Genome Atlas (TCGA, https://tcga-data.nci.nih.gov), which captures the activities of 20,531 genes in multiple tumor types. As a next-generation sequencing technique, RNA-seq enables accurate quantification of gene transcripts. It involves library preparation to capture mRNAs from cellular material, high-throughput sequencing to generate tens of millions of short sequence fragments (reads), and alignment of these short reads to a reference genome. After this procedure, the generated RNA-seq data can be represented by a gene × sample matrix in which each element is the normalized count of reads aligned to a gene in a sample. Interested readers are referred to [36] for further background on RNA sequencing. In this study, our models are applied to pan-cancer analysis, aiming to identify and understand transcriptomic similarities and differences among various cancers. The dataset contains 5,840 gene expression samples (in normalized read-count-per-gene format) from 11 cancer types and one normal class; thus, it describes the expression landscape of 20,531 genes across 12 classes. The detailed statistics of the data are given in Table I. In our experiments, two thirds of the samples were used for training our models, and the rest were kept intact for testing.

TABLE I: Number of Samples in the Pan-Cancer Data.

Cancer Type                                     Tumor   Normal
breast invasive carcinoma (brca)                 1104      114
colon adenocarcinoma (coad)                       288       41
glioblastoma multiforme (gbm)                     169        5
head and neck squamous cell carcinoma (hnsc)      522       44
kidney renal clear cell carcinoma (kirc)          534       72
brain lower grade glioma (lgg)                    534        0
liver hepatocellular carcinoma (lihc)             374       50
lung adenocarcinoma (luad)                        517       59
lung squamous cell carcinoma (lusc)               503       51
ovarian serous cystadenocarcinoma (ov)            309        0
prostate adenocarcinoma (prad)                    498       52
Total                                            5352      488

We ran MC-NMF to investigate its capability of discovering novel feature patterns and its potential for building predictive models. Figure 4a visualizes the decomposition result of MC-NMF when we set a_0 = 0.5 and a_large = 10, kept b_0 and b_large tied as in Section V-A, and allowed each class to have five specific basis vectors in addition to five ubiquitous basis vectors. In this experiment, we did not optimize the hyperparameter setting, because we resorted to SS-MC-NMF in the next experiment.

Fig. 4: MC-NMF on pan-cancer RNA-seq data. (a) Heat map of the decomposition of the training data; values were scaled to [0, 1] for visualization purposes only. (b) Test accuracy and (c) training time: the mean classification accuracy and computing time of MC-NMF (a_0 = 0.5, a_large = 10) compared with MLP and random forest, both of which took all original features as input; the variation over 10 runs of MC-NMF using this parameter setting on the same data is also given in these two plots. (d) MC-NMF's means and variations of pair-wise EWM and RWM accuracies over 10 runs for each of 9 parameter settings on the same data.

Instead, we demonstrate an example of MC-NMF with an acceptable hyperparameter setting that visually shows good patterns. Figure 4a shows that MC-NMF is able to identify shared factors and class-specific factors. Using Algorithm 2, the features associated with each pattern were sorted and the top three features of each pattern were selected for testing their discriminative power in building predictive models. In total, 1,692 features were selected. Then, a multi-layer perceptron (MLP) [37], a classic supervised deep learning [38] technique, was used to learn on the dimensionality-reduced training dataset and to predict the dimensionality-reduced test samples. Figures 4b and 4c compare its mean classification accuracy over 10 runs on the same dataset (using the same hyperparameter setting but different random initializations) and its computing time (including factorization time, feature selection time, training time, and test time) with those of MLP and random forest taking all original features as input. Both classifiers are well recognized and widely applied to classification problems [39], [40]. We found that the MLP classifier preceded by MC-NMF feature selection achieved accuracy similar to that of MLP and random forest using all features, and consumed intermediate computing time. Figure 4d shows MC-NMF's means and variations of pair-wise EWM and RWM accuracies over 10 runs for each of 9 hyperparameter settings on the same data. We measured the variation of feature patterns by comparing all possible pairs of the feature activity indicator matrices among the 10 runs using the same hyperparameter setting. The prediction of feature activity in each class, as measured by pair-wise EWM accuracies, disagreed slightly among multiple runs using the same hyperparameter setting on the same data. However, this small change in predicted feature activities could cause larger disagreement in feature patterns, as measured by pair-wise RWM accuracies. On the one hand, this experiment shows the potential of MC-NMF for discovering informative feature patterns from real large-scale gene expression data. On the other hand, the variations due to different parameterizations and initializations convince us of the necessity of applying the stability selection strategy to MC-NMF.
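The classification step can be sketched as follows (our own illustration using scikit-learn; the paper does not specify its MLP or random forest implementations, and the hyperparameters below are arbitrary).

from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def evaluate(X_train, y_train, X_test, y_test, selected):
    """X_* are gene-by-sample count matrices; `selected` indexes the chosen genes."""
    # MLP on the MC-NMF-selected genes (transpose to samples x genes)
    mlp = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500)
    mlp.fit(X_train[selected, :].T, y_train)
    acc_mlp = accuracy_score(y_test, mlp.predict(X_test[selected, :].T))
    # random forest on all genes, as the reference
    rf = RandomForestClassifier(n_estimators=500)
    rf.fit(X_train.T, y_train)
    acc_rf = accuracy_score(y_test, rf.predict(X_test.T))
    return acc_mlp, acc_rf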


We investigated SS-MC-NMF to see its performance of discovering feature patterns and potential of building a predictive model. Figure 5a shows the empirical probability matrix that indicates feature activity patterns in different classes. Figure 5b visualizes the corresponding heat maps of the training data, from which one can clearly observe structured patterns. Unlike traditional sparse-learning based and random-forest based feature selection methods that return a sorted list of features, one unique advantage of our SS-MC-NMF model is that informative feature patterns can be stably detected from multi-class data and globally visualized, which facilitates not only classification, but also data interpretation. We particularly investigated the discriminative power of class-specific features via selecting the top features of each individual-class-specific features for training a MLP model. As shown in Figure 5c, we compared the accuracies with a random forest classifier that greedily adds top features ranked according to the random forest importance scores of all 20,531 genes. Simultaneously as a “natural” feature selection and top classification method, the random-forest based feature ranking and classification model have shown greatest accuracy in practice [25], thus can practically serve as an upper-bound reference in this experiment. (Current implementation of the deep feature selection model proposed in [25] has not been scaled to high dimensional data, thus could not be compared here.) Figure 5c shows that the difference between the test accuracies of the SS-MC-NMF method and random forest method is less than one percent. Since only individual-class-specific features were considered in this MLP classifier, we believe the classification performance of the MLP or other multi-class classifiers can be further improved by using an effective strategy which is yet to be investigated to optimize the use of features of all discriminative patterns. C. Gene Ontology Enrichment Analysis We further performed gene-ontology (GO) enrichment analysis, using GOAL [41] to validate the feature patterns discovered by SS-MC-NMF from the multi-tumor RNA-seq data. Collectively, we identified 2,193 patterns encoded by binary codes of length 13 corresponding to the 11 cancer types listed in Table I plus a sample class and the ubiquitous (ubi) activity in the order of [ubi, brca, coad, gbm, hnsc, kirc, lgg, lihc, luad, lusc, normal, ov, prad]. The ubiquitous factors represent significant gene expression in all 11 cancer types and also in normal tissue samples. For the purpose of this paper, we selected the patterns representing a cluster of genes that are specific to certain cancer types. For example, there are 66 genes in the feature pattern specific to breast invasive carcinoma (brca, 0100000000000). Catecholamine biosynthetic process (GO:0042423, p=0.0012) is the top enriched GO term in the pattern (Table SI in Supplemental File 1). Catecholamine is known to regulate angiogenesis and promote tumor growth [42], [43]. Positive regulation of epithelial cell proliferation (GO:0050679, p=0.011) is also enriched in the pattern. The great majority of human breast cancers arise from epithelial cells [44]. Activity of epithelial cell proliferation is thus a key indicator in breast cancer development and such activity


C. Gene Ontology Enrichment Analysis

We further performed gene ontology (GO) enrichment analysis using GOAL [41] to validate the feature patterns discovered by SS-MC-NMF from the multi-tumor RNA-seq data. Collectively, we identified 2,193 patterns encoded by binary codes of length 13, corresponding to the 11 cancer types listed in Table I plus a normal tissue sample class and the ubiquitous (ubi) activity, in the order [ubi, brca, coad, gbm, hnsc, kirc, lgg, lihc, luad, lusc, normal, ov, prad]. The ubiquitous factors represent significant gene expression in all 11 cancer types and also in normal tissue samples. For the purpose of this paper, we selected the patterns representing clusters of genes that are specific to certain cancer types.

For example, there are 66 genes in the feature pattern specific to breast invasive carcinoma (brca, 0100000000000). Catecholamine biosynthetic process (GO:0042423, p=0.0012) is the top enriched GO term in this pattern (Table SI in Supplemental File 1). Catecholamine is known to regulate angiogenesis and promote tumor growth [42], [43]. Positive regulation of epithelial cell proliferation (GO:0050679, p=0.011) is also enriched in this pattern. The great majority of human breast cancers arise from epithelial cells [44]. The activity of epithelial cell proliferation is thus a key indicator of breast cancer development, and such activity in normal mammary tissue has been used to predict breast cancer risk in premenopausal women [45], [46]. Another significantly enriched GO term is positive regulation of MAPK cascade (GO:0043410, p=0.013). MAPK signalling pathways are known to play important roles in many cancers, including breast and lung cancers [47], [48].

The lung squamous cell carcinoma (lusc) specific pattern (0000000001000) contains 23 genes. The enriched GO terms include regulation of T cell apoptotic process (GO:0070232, p=0.00061), negative regulation of tumor necrosis factor superfamily cytokine production (GO:1903556, p=0.0011), regulation of lymphocyte apoptotic process (GO:0070228, p=0.0018), ganglion mother cell fate determination (GO:0007402, p=0.0022), negative regulation of Wnt signalling pathway involved in dorsal or ventral axis specification (GO:2000054, p=0.0022), positive regulation of leukocyte cell-cell adhesion (GO:1903039, p=0.025), and epidermis development (GO:0008544, p=0.033). These biological processes are known to be highly associated with squamous cell carcinoma [49].

The lung adenocarcinoma (luad) specific pattern (0000000010000) contains 27 genes. They are involved in regulation of macrophage differentiation (GO:0045649, p=2.7E-04), monocyte chemotaxis (GO:0002548, p=1.1E-03), negative regulation of myeloid leukocyte differentiation (GO:0002762, p=1.4E-03), inflammatory response (GO:0006954, p=2.2E-03), cytokine receptor binding (GO:0005126, p=4.5E-03), and leukocyte chemotaxis (GO:0030595, p=9.0E-03). Macrophages are known to promote cancer initiation and malignant progression [50], and tumor-associated macrophages (TAMs) are also linked to prognosis in human lung cancer [51]. The rest of the GO terms discovered here are also more or less associated with lung cancer development.

Similarly, the enriched GO terms associated with the brain cancer glioblastoma multiforme (gbm) specific and lower grade glioma (lgg) specific patterns (0001000000000 and 0000001000000) are listed in Table SI. It is interesting to observe that the Oncostatin M gene (OSM) is associated with many enriched GO terms. Up-regulation of OSM in glioblastoma cells regulates VEGF promoter activity, potentially promoting angiogenesis [52]. OSM is also known to be involved in the tyrosine phosphorylation of STAT proteins. In gbm, STAT-3 was shown to be constitutively active, as assessed by tyrosine phosphorylation status [52]. Lower grade glioma is a type of cancer that develops in the glial cells of the brain, which support the brain's nerve cells and keep them healthy. The enriched GO terms in this pattern, such as GABA-A receptor activity (GO:0004890, p=3.57E-09), are consistent with the fact that neurotransmission in lgg patients is highly impaired. The first evidence of functional GABA receptors on grade II and III glioma cells was provided in [53]. Binding sites for GABA and benzodiazepines were identified on membranes of glioma cells [54], [55]. While one study reported the presence of GABA binding sites for all types of gliomas [54], another reported that GABA and benzodiazepine binding sites appear only in low-grade gliomas [55].

These GO term enrichment results indicate that the SS-MC-NMF method proposed in this paper is capable of mining informative multi-class feature patterns that are highly relevant to biological functions. We selected the above feature patterns as examples from common types of cancers to demonstrate the usefulness of the SS-MC-NMF method. Other patterns, including those consolidating many cancer types, such as 0111111111011, can be studied in the same way as described above. Nevertheless, an in-depth study of the involvement of the many individual patterns in various biological processes is beyond the scope of this paper and will be reported separately. All gene patterns discovered by SS-MC-NMF are provided in Supplemental File 2.
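The pattern codes quoted in this section (for example, brca = 0100000000000) are simply the rows of the thresholded feature-activity matrix read as 13-character strings in the stated class order. A minimal sketch of how genes could be grouped by such codes and how a class-specific code can be formed is given below; the input format and variable names are hypothetical, not the exact data structures of our implementation.

```python
import numpy as np
from collections import defaultdict

CLASS_ORDER = ["ubi", "brca", "coad", "gbm", "hnsc", "kirc", "lgg",
               "lihc", "luad", "lusc", "normal", "ov", "prad"]

def group_genes_by_pattern(gene_names, activity):
    """Map each length-13 binary pattern code to the list of genes
    carrying that code (activity: genes x 13 binary matrix)."""
    patterns = defaultdict(list)
    for gene, row in zip(gene_names, np.asarray(activity, dtype=int)):
        patterns["".join(map(str, row))].append(gene)
    return patterns

def class_specific_code(class_name):
    """Binary code with a single 1 at the position of class_name."""
    code = ["0"] * len(CLASS_ORDER)
    code[CLASS_ORDER.index(class_name)] = "1"
    return "".join(code)

# e.g. genes specific to breast invasive carcinoma (hypothetical inputs):
# patterns = group_genes_by_pattern(gene_names, activity)
# brca_genes = patterns.get(class_specific_code("brca"), [])
```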



[Figure 5c line plot: test accuracy (%) versus number of selected features, with curves for SS-MC-NMF, SS-MC-NMF (fitted), Random Forest, and Random Forest (fitted).]

(a) Feature activity matrix.

(b) Training data.

(c) Accuracy versus number of selected features.

Fig. 5: SS-MC-NMF on pan-cancer RNA-seq data. According to the feature patterns in (a), the features in (b) were reordered. In (c), only features in class-specific patterns were considered for training the MLP, while all features were considered by the random-forest-based feature selection and classification. Although SS-MC-NMF followed by MLP could not exceed the ceiling set by random forest in this case, SS-MC-NMF discovered potentially biologically meaningful feature patterns, as unveiled by the gene ontology enrichment analysis in Section V-C, whereas random forest merely returned a list of features ranked by their computed importance.

VI. DISCUSSION AND CONCLUSION

Characterization and understanding of every feature in large-scale multi-class data is a challenging puzzle in the era of big data. In this paper, we propose a multi-class non-negative matrix factorization model for feature pattern discovery from multi-class data. To overcome the model selection dilemma and robustly discover feature patterns, we propose to apply the stability selection strategy to MC-NMF, namely SS-MC-NMF. The results on simulated data show that MC-NMF converges quickly, and SS-MC-NMF is able to recover all predefined feature patterns in the data. We also applied both models to the meta-analysis of RNA-seq tumor data.

Our results corroborate that ubiquitous and tumor-specific genes can be identified from the large-scale cancer data. The discriminative feature patterns also show promising potential for building predictive models. The Python implementation of our models, the simulation method, the data, and the experimental scripts are accessible at https://github.com/yifeng-li/mvmf.

Our models will promote a new generation of feature selection approaches in the big data era. Unlike traditional feature selection methods that return no more information than a subset or sorted list of features, our models are able to characterize the activity of individual features into distinct patterns. Visualized as heat maps, these patterns help us much better understand a massive multi-class dataset and provide clear clues for deeper insights. Therefore, as a new concept, feature pattern discovery is more useful than feature selection in modern data analytics.

Our models can potentially be applied to discover patterns from RNA-seq data of complex diseases, such as diabetes, multiple sclerosis, and autism spectrum disorder. However, as discussed in [2], weak signal and limited sample size are two challenges in current genomic studies of complex diseases. We tested MC-NMF on a diabetes RNA-seq dataset sampled from 77 deceased donors (51 with normal glucose tolerance, 15 with impaired glucose tolerance, and 11 with


type-2 diabetes) [56], the largest diabetes RNA-seq dataset we have found so far. We ran MC-NMF 10 times on 90% of this dataset using hyperparameters a0 = 0.1 and alarge = 0.5 (we tried other parameter settings as well), and obtained an average sensitivity of 0.79 with a large standard deviation (0.16) due to the insufficient number of type-2 diabetes samples. The corresponding factorization result for one of the runs is displayed as a heat map in Figure S1 of Supplemental File 1. Even though some patterns can be identified, the currently limited data availability is insufficient to capture the complexity of diabetes. Given a dataset of a size comparable to the cancer data used in this study, we are confident that SS-MC-NMF would successfully discover meaningful features to support medical investigations.

In addition to the feature pattern discovery addressed above, MC-NMF is potentially applicable to other data analytics. Similar to other matrix factorizations, MC-NMF can be applied for supervised feature extraction. In the training step, a training dataset can be decomposed by MC-NMF to learn the basis matrix and the coefficient matrix. In the test (or prediction) step, the coefficient matrix of a new dataset can be obtained by solving the structured Bayesian regression problem, which can be formulated by fixing the basis matrix W in our MC-NMF model. The coefficient matrices obtained in the training and test steps represent dimensionality-reduced data, that is, the representations of the samples in the new feature space spanned by the basis vectors. MC-NMF can even be directly employed as a classifier. A preliminary idea is to measure the similarity of the class-specific coefficients of a test sample with those of the training samples. A more established idea is to compute the class-specific regression residuals, as done in sparse representation classification [57] (see the sketch at the end of this section).

We modelled the non-negative count data using the Poisson distribution, which is suitable for the read-count per gene (or genomic region) data captured by high-throughput sequencing techniques. The negative binomial distribution [58] has been used to account for variations from biological replicates in differential expression analysis [59]. Thus, it might replace the Poisson distribution in our MC-NMF model for modelling dispersion in similar situations. Since many other data, as well as log-transformed count data, follow Gaussian (or half-normal) distributions, it is necessary to implement a multi-class model for such data. Our models can even be generalized by using more general distributions, such as the Tweedie family, whose special cases include the Poisson, Gaussian, and Gamma distributions.

This paper addresses the challenge of discovering distinct feature patterns from multi-class data (the type-1 multi-view data). Non-negative multi-view matrix factorization models integrating heterogeneous data of the same set of samples (the type-2 multi-view data), such as RNA-seq gene expression data, bisulfite-seq DNA-methylation data, and whole exome sequencing data of the same set of patients, will be investigated for the robust detection of inter-view interactions. From a machine learning perspective, the aim is to identify shared factors and view-specific factors involved in potential processes, which can be used for joint feature selection, feature extraction, and clustering. We believe our current and future multi-view models will play an important role in the integrative analysis of high-throughput sequencing data in genomic research and heterogeneous data in other fields.
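As a concrete illustration of the residual-based classification idea mentioned above, the following sketch uses non-negative least squares as a stand-in for the structured regression step: a test sample is coded against the learned basis vectors of each class and assigned to the class with the smallest reconstruction residual. This is a simplified, Gaussian-noise version of the prediction step rather than the exact Poisson-based MC-NMF procedure, and the variable names and toy data are hypothetical.

```python
import numpy as np
from scipy.optimize import nnls

def residual_classifier(x, class_bases):
    """Assign x to the class whose non-negative basis vectors reconstruct
    it with the smallest residual, in the spirit of sparse representation
    classification [57]."""
    residuals = []
    for W_c in class_bases:            # W_c: features x factors of one class
        h, _ = nnls(W_c, x)            # non-negative coding of the test sample
        residuals.append(np.linalg.norm(x - W_c @ h))
    return int(np.argmin(residuals))

# Illustrative usage with random bases for 3 classes and one test sample.
rng = np.random.default_rng(0)
class_bases = [np.abs(rng.normal(size=(50, 5))) for _ in range(3)]
x = class_bases[1] @ np.abs(rng.normal(size=5))   # sample generated from class 1
print(residual_classifier(x, class_bases))        # typically prints 1
```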


ACKNOWLEDGMENT

This work is supported by the National Research Council Canada.

REFERENCES

[1] D. Johnson, A. Mortazavi, R. Myers, and B. Wold, “Genome-wide mapping of in vivo protein-DNA interactions,” Science, vol. 316, no. 5830, pp. 447–455, 2007.
[2] Y. Li, F. Wu, and A. Ngom, “A review on machine learning principles for multi-view biological data integration,” Briefings in Bioinformatics, vol. 19, no. 2, pp. 325–340, 2018.
[3] G. Golub and C. Van Loan, Matrix Computations, 3rd ed. Baltimore, MD: Johns Hopkins University Press, 1996.
[4] H. Zou, T. Hastie, and R. Tibshirani, “Sparse principal component analysis,” Journal of Computational and Graphical Statistics, vol. 15, no. 2, pp. 265–286, 2006.
[5] Y. Li and A. Ngom, “Versatile sparse matrix factorization: Theory and applications,” Neurocomputing, vol. 145, pp. 23–29, 2014.
[6] ——, “Sparse representation approaches for the classification of high-dimensional biological data,” BMC Systems Biology, vol. 7, no. Suppl 4, p. S6, 2013.
[7] C. Bishop, “Bayesian PCA,” in Advances in Neural Information Processing Systems (NIPS), 1999, pp. 382–388.
[8] V. Tan and C. Fevotte, “Automatic relevance determination in nonnegative matrix factorization with the β-divergence,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 7, pp. 1592–1605, 2013.
[9] A. Cemgil, “Bayesian inference for nonnegative matrix factorization models,” Computational Intelligence and Neuroscience, vol. 2009, Article ID 785152, 2009.
[10] J. Luttinen and A. Ilin, “Transformations in variational Bayesian factor analysis to speed up learning,” Neurocomputing, vol. 73, pp. 1093–1102, 2010.
[11] J. Trygg, “O2-PLS for qualitative and quantitative analysis in multivariate calibration,” Journal of Chemometrics, vol. 16, pp. 283–293, 2002.
[12] T. Lofstedt and J. Trygg, “OnPLS - A novel multiblock method for the modelling of predictive and orthogonal variation,” Journal of Chemometrics, vol. 25, pp. 441–455, 2011.
[13] L. Tucker, “An inter-battery method of factor analysis,” Psychometrika, vol. 23, no. 2, pp. 111–136, 1958.
[14] M. Browne, “Factor analysis of multiple batteries by maximum likelihood,” British Journal of Mathematical and Statistical Psychology, vol. 33, no. 2, pp. 184–199, 1980.
[15] A. Klami, S. Virtanen, E. Leppaaho, and S. Kaski, “Group factor analysis,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 9, pp. 2136–2147, 2015.
[16] G. Zhou, A. Cichocki, Y. Zhang, and D. Mandic, “Group component analysis for multiblock data: Common and individual feature extraction,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 11, pp. 2425–2439, 2016.
[17] Z. Yang and G. Michailidis, “A non-negative matrix factorization method for detecting modules in heterogeneous omics multi-modal data,” Bioinformatics, vol. 32, no. 1, pp. 1–8, 2016.
[18] I. Ivek, “Supervised dictionary learning by a variational Bayesian group sparse nonnegative matrix factorization,” arXiv preprint arXiv:1405.6914, 2014.
[19] M. Zitnik and B. Zupan, “Data fusion by matrix factorization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 1, pp. 41–53, 2015.
[20] Y. Li, “Advances in multi-view matrix factorizations,” in International Joint Conference on Neural Networks (IJCNN), July 2016, pp. 3793–3800.
[21] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.
[22] E. Amaldi and V. Kann, “On the approximation of minimizing nonzero variables or unsatisfied relations in linear systems,” Theoretical Computer Science, vol. 209, pp. 237–260, 1998.
[23] R. Tibshirani, “Regression shrinkage and selection via the Lasso,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 58, no. 1, pp. 267–288, 1996.


[24] S. Student and K. Fujarewicz, “Stable feature selection and classification algorithms for multiclass microarray data,” Biology Direct, vol. 27, p. 33, 2012.
[25] Y. Li, C. Chen, and W. Wasserman, “Deep feature selection: Theory and application to identify enhancers and promoters,” Journal of Computational Biology, vol. 23, no. 5, pp. 322–336, 2016.
[26] L. Breiman, “Random forests,” Machine Learning, vol. 45, pp. 5–32, 2001.
[27] X. Zheng and W.-Y. Loh, “Consistent variable selection in linear models,” Journal of the American Statistical Association, vol. 90, no. 429, pp. 151–156, 1995.
[28] N. Meinshausen and P. Buhlmann, “Stability selection,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 72, no. 4, pp. 417–473, 2010.
[29] D. Marbach, J. Costello, R. Kuffner, N. Vega, R. Prill, D. Camacho et al., “Wisdom of crowds for robust gene network inference,” Nature Methods, vol. 9, no. 8, pp. 796–804, 2012.
[30] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
[31] K. Murphy, Machine Learning: A Probabilistic Perspective. Cambridge, MA: MIT Press, 2012.
[32] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul, “An introduction to variational methods for graphical models,” in Learning in Graphical Models, ser. Adaptive Computation and Machine Learning series, M. Jordan, Ed. Cambridge, MA: MIT Press, 1998, ch. 5, pp. 105–162.
[33] C. M. Bishop, Pattern Recognition and Machine Learning. New York: Springer, 2006.
[34] T. Cover and J. Thomas, Elements of Information Theory, 2nd ed. Hoboken, NJ: Wiley, 2006.
[35] Z. He and W. Yu, “Stable feature selection for biomarker discovery,” Computational Biology and Chemistry, vol. 34, no. 4, pp. 215–225, 2010.
[36] Z. Wang, M. Gerstein, and M. Snyder, “RNA-Seq: A revolutionary tool for transcriptomics,” Nature Reviews Genetics, vol. 10, pp. 57–63, 2009.
[37] T. Mitchell, Machine Learning. Ohio: McGraw Hill, 1997.
[38] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, 2015.
[39] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. [Online]. Available: http://www.deeplearningbook.org
[40] J. Friedman, R. Tibshirani, and T. Hastie, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, 2009.
[41] A. Tchagang, A. Gawronski, H. Berube, S. Phan, F. Famili, and Y. Pan, “GOAL: A software tool for assessing biological significance of genes groups,” BMC Bioinformatics, vol. 11, p. 229, 2010.
[42] D. Chakroborty, C. Sarkar, B. Basu, P. Dasgupta, and S. Basu, “Catecholamines regulate tumor angiogenesis,” Cancer Research, vol. 69, pp. 3727–3730, 2009.
[43] E. Yang, “Role for catecholamines in tumor progression - Possible use for β-blockers in the treatment of cancer,” Cancer Biology & Therapy, vol. 10, pp. 30–32, 2010.
[44] B. Elenbaas, L. Spirio, F. Koerner, M. Fleming, D. Zimonjic, J. Donaher et al., “Human breast cancer cells generated by oncogenic transformation of primary mammary epithelial cells,” Genes & Development, vol. 15, pp. 50–65, 2001.
[45] A. Sadlonova, Z. Novak, M. Johnson, D. Bowe, S. Gault, G. Page, J. Thottassery, D. Welch, and A. Frost, “Breast fibroblasts modulate epithelial cell proliferation in three-dimensional in vitro co-culture,” Breast Cancer Research, vol. 7, p. R46, 2004.
[46] S. Huh, H. Oh, M. Peterson, V. Almendro, R. Hu, M. Bowden et al., “The proliferative activity of mammary epithelial cells in normal tissue predicts breast cancer risk in premenopausal women,” Cancer Research, vol. 76, pp. 1926–1934, 2016.
[47] R. Santen, R. Song, R. McPherson, R. Kumar, L. Adam, M. Jeng et al., “The role of mitogen-activated protein (MAP) kinase in breast cancer,” The Journal of Steroid Biochemistry and Molecular Biology, vol. 88, pp. 239–256, 2002.
[48] A. Dhillon, S. Hagan, O. Rath, and W. Kolch, “MAP kinase signalling pathways in cancer,” Oncogene, vol. 26, pp. 3279–3290, 2007.
[49] I. Kretschmer, T. Freudenberger, S. Twarock, Y. Yamaguchi, M. Grandoch, and J. Fischer, “Esophageal squamous cell carcinoma cells modulate chemokine expression and hyaluronan synthesis in fibroblasts,” Journal of Biological Chemistry, vol. 291, pp. 4091–4106, 2016.
[50] B. Qian and J. Pollard, “Macrophage diversity enhances tumor progression and metastasis,” Cell, vol. 141, pp. 39–51, 2010.


[51] J. Quatromoni and E. Eruslanov, “Tumor-associated macrophages: Function, phenotype, and link to prognosis in human lung cancer,” American Journal of Translational Research, vol. 4, pp. 376–389, 2012.
[52] E. Brantley, L. Nabors, G. Gillespie, Y.-H. Choi, C. Palmer, K. Harrison et al., “Loss of PIAS3 expression in glioblastoma multiforme tumors: Implications for STAT-3 activation and gene expression,” Clinical Cancer Research, vol. 14, pp. 4694–4704, 2008.
[53] C. Labrakakis, S. Patt, J. Hartmann, and H. Kettenmann, “Functional GABA(A) receptors on human glioma cells,” European Journal of Neuroscience, vol. 10, no. 1, pp. 231–238, 1998.
[54] L. Frattola, C. Ferrarese, N. Canal, S. Gaini, R. Galluso, R. Piolti et al., “Characterization of the gamma-aminobutyric acid receptor system in human brain gliomas,” Cancer Research, vol. 45, pp. 4495–4498, 1985.
[55] A. Jussofie, V. Reinhardt, and R. Kalff, “GABA binding sites: Their density, their affinity to muscimol and their behaviour against neuroactive steroids in human gliomas of different degrees of malignancy,” Journal of Neural Transmission: General Section, vol. 96, pp. 233–341, 1994.
[56] J. Fadista, P. Vikman, E. Laakso, I. Mollet, J. Esguerra, J. Taneera et al., “Global genomic and transcriptomic analysis of human pancreatic islets reveals novel genes influencing glucose metabolism,” PNAS, vol. 111, no. 38, pp. 13924–13929, 2014.
[57] J. Wright, A. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, 2009.
[58] A. Cameron, Regression Analysis of Count Data. Cambridge University Press, 1998.
[59] C. Soneson and M. Delorenzi, “A comparison of methods for differential expression analysis of RNA-seq data,” BMC Bioinformatics, vol. 14, p. 91, 2013.

Yifeng Li Yifeng Li is a Research Officer in the Digital Technologies Research Centre, National Research Council Canada (NRC). Prior to joining NRC, he was a postdoctoral fellow at the Wasserman Laboratory of the Centre for Molecular Medicine and Therapeutics, University of British Columbia, Canada. He obtained his Ph.D. in Computer Science from the University of Windsor, Canada, in 2013. His doctoral dissertation was recognized by a Gold Medal from the Governor General of Canada. His research interests include neural networks, machine learning, data integration, optimization, and bioinformatics.

Youlian Pan Youlian Pan holds a Ph.D. in Biology and a Master of Computer Science, both from Dalhousie University, Canada. He is currently a Senior Research Officer in Data Mining at the NRC. His research interests are in data mining and machine learning with biological applications, specifically high-throughput sequencing data, gene expression profiling, and systems biology, with the objective of discovering and developing biomarkers related to human diseases and crop stress tolerance in adverse environments.

Ziying Liu Ziying Liu received a Master's degree in Medical Sciences from Katholieke Universiteit Leuven, Belgium, in 1994, and a Master's degree in System Science from the University of Ottawa in 2006. She is now working as a Research Council Officer in the Scientific Data Mining Team at the Digital Technologies Research Centre, NRC. She is interested in data mining, machine learning, and knowledge discovery from RNA-seq data in both human and plant applications.
