Probabilistic Principal Surfaces for Yeast Gene Microarray Data Mining

Antonino Staiano, Lara De Vinco, Angelo Ciaramella, Giancarlo Raiconi, Roberto Tagliaferri
Dipartimento di Matematica ed Informatica, Università di Salerno, Via Ponte don Melillo, 84084 Fisciano (Sa), Italy
{astaiano, ciaram, gianni, robtag}@unisa.it

Roberto Amato, Giuseppe Longo, Ciro Donalek, Gennaro Miele
Dipartimento di Scienze Fisiche, Università Federico II di Napoli and INFN Napoli Unit, Polo delle Scienze e della Tecnologia, via Cintia 6, 80136 Napoli, Italy
{longo, donalek, miele}@na.infn.it

Diego Di Bernardo
Telethon Institute for Genetics and Medicine, Via Pietro Castellino 111, I-80131 Napoli, Italy
[email protected]
Abstract

Recent technological advances are producing huge data sets in almost all fields of scientific research, from astronomy to genetics. Although each research field often requires ad-hoc, fine-tuned procedures to properly exploit all the information inherently present in the data, there is an urgent need for a new generation of general computational theories and tools capable of boosting most human activities of data analysis. Here we propose Probabilistic Principal Surfaces (PPS) as an effective high-D data visualization and clustering tool for data mining applications, emphasizing its flexibility and generality of use in data-rich fields. In order to better illustrate the potentialities of the method, we also provide a real-world case study by discussing the use of PPS for the analysis of yeast gene expression levels from microarray chips.
1. Introduction
Across a wide variety of fields, data is being collected and accumulated at a dramatic pace. There is an urgent need for a new generation of computational theories and tools to assist humans in extracting useful information (knowledge) from rapidly growing volumes of data.
These theories and tools are usually labeled as Knowledge Discovery in Databases (KDD). At an abstract level, the KDD field is concerned with the development of methods and techniques aimed at extracting meaning out of data. The full and effective scientific exploitation of these massive data sets calls for the implementation of automatic tools capable of performing a large fraction of the data mining and data analysis work, and poses considerable technical problems and even more challenging methodological ones. Traditional data analysis methods, in fact, are inadequate to cope with such exponential growth in data volume and, especially, in data complexity (tens or hundreds of dimensions in the parameter space) [10]. For example, in genetics, gene-expression microarrays, commonly called "genechips", make it possible to simultaneously measure the expression of each of the thousands of genes that are present (transcribed) in a cell or tissue. One can use these comprehensive snapshots of biological activity to infer regulatory pathways in cells, identify novel targets for drug design, and improve diagnosis, prognosis, and treatment planning for patients suffering from disease. However, the amount of data this new technology produces is well beyond what can be analyzed manually. Hence, the need for automated analysis of microarray data offers an opportunity for machine learning to have a significant impact on biology and medicine (see [12, 13]). Among the data mining methodologies,
visualization plays a key role in developing good models for data, especially when the quantity of data is large. In this context, PPS represent a very powerful tool for characterizing and visualizing high-D data. PPS [3, 4] are a nonlinear extension of principal components, in that each node on the PPS is the average of all data points that project near/onto it. From a theoretical point of view, the PPS is a generalization of the Generative Topographic Mapping (GTM) [2], which, in turn, can be seen as a parametric alternative to the well known Self Organizing Maps (SOM) [7]. Some advantages of PPS over less sophisticated methods include the parametric and flexible formulation for any geometry/topology in any dimension, and the guaranteed convergence (indeed, the PPS training is accomplished through the Expectation-Maximization algorithm). A PPS is governed by its latent topology and, owing to the flexibility of the PPS, a variety of PPS topologies can be created, among which also that of a 3D sphere. The sphere is finite and unbounded, with all nodes distributed at the edge, making it ideal for emulating the sparseness and peripheral property of high-D data. Furthermore, the sphere topology can be easily comprehended by humans and thereby used for visualizing high-D data. We shall go into detail over all these issues in the next sections, which are organized as follows: in section 2 latent variable models are presented, focusing on GTM and PPS, while section 3 shows the visualization possibilities offered by PPS, illustrating a number of visualization methods proposed and implemented by us. Finally, in section 4 a real case study concerning the analysis of yeast gene expression levels from microarray data is presented, and in section 5 some preliminary conclusions are drawn.
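As a concrete illustration of the spherical latent topology just mentioned, the following Python fragment (our own minimal sketch, not code from any PPS implementation) places M latent nodes approximately uniformly on the unit sphere using a Fibonacci-spiral heuristic; the node count and the heuristic itself are illustrative assumptions.

```python
import numpy as np

def spherical_latent_nodes(M=98):
    """Return an (M, 3) array of nodes spread roughly uniformly on the unit sphere."""
    i = np.arange(M)
    golden = (1 + 5 ** 0.5) / 2                  # golden ratio
    z = 1 - 2 * (i + 0.5) / M                    # evenly spaced heights in (-1, 1)
    theta = 2 * np.pi * i / golden               # azimuthal angle of each node
    r = np.sqrt(1 - z ** 2)                      # radius of the latitude circle at height z
    return np.stack([r * np.cos(theta), r * np.sin(theta), z], axis=1)

nodes = spherical_latent_nodes()
print(nodes.shape)                               # (98, 3); every row has unit norm
```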
2. Latent Variable Models
A latent variable model builds the distribution p(t) of a variable t = (t_1, ..., t_D) lying in an input data space in terms of a smaller number of latent variables x = (x_1, ..., x_Q) lying in the so-called "latent space", where Q < D. The mapping between the latent and the input spaces is given by

t = y(x; w) + u,    (1)
where y(x; w) is a function of the latent variable x with parameters w, and u is an x-independent noise process. Geometrically, the function y(x; w) defines a manifold in data space given by the image of the latent space. By assuming that the components of u are uncorrelated and by defining a distribution p(x) in the latent space, we can define the joint distribution as

p(t, x) = p(x) p(t|x) = p(x) \prod_{d=1}^{D} p(t_d|x),    (2)
where p(t|x) is the conditional distribution of the data variables given the latent variables. The type of the mapping y(x; w) determines the specific latent variable model. The desired model for the distribution p(t) of the data is obtained by marginalizing over the latent variables,

p(t) = \int p(t|x) p(x) dx.    (3)

This integration will, in general, be analytically intractable except for specific forms of the distributions p(t|x) and p(x).
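To make the marginalization of Eq. (3) concrete, here is a minimal numerical sketch (our own toy example, not part of the model described above): a one-dimensional latent variable with a standard normal prior, a linear mapping, and isotropic Gaussian noise, with the integral approximated by Monte Carlo averaging. All parameter values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, beta = 2, 4.0                         # data dimensionality, inverse noise variance
w = np.array([1.0, -0.5])                # parameters of the toy mapping y(x; w) = w * x

def p_t_given_x(t, x):
    """Factorized conditional of Eq. (2): prod_d N(t_d | y_d(x; w), 1/beta)."""
    y = w * x                            # image of the latent point x on the 1-D manifold
    return (beta / (2 * np.pi)) ** (D / 2) * np.exp(-0.5 * beta * np.sum((t - y) ** 2))

def p_t(t, n_samples=20000):
    """Monte Carlo estimate of Eq. (3) with p(x) = N(0, 1)."""
    xs = rng.standard_normal(n_samples)
    return np.mean([p_t_given_x(t, x) for x in xs])

print(p_t(np.array([0.8, -0.4])))        # density at a point near the manifold
```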
2.1. Generative Topographic Mapping

The GTM defines a non-linear, parametric mapping y(x; W) from a Q-dimensional latent space (x ∈ R^Q) to a D-dimensional data space (t ∈ R^D), where normally Q < D. The mapping y(x; W) (defined to be continuous and differentiable) maps every point in the latent space to a point in the data space. Since the latent space is Q-dimensional, these points will be confined to a Q-dimensional manifold non-linearly embedded in the D-dimensional data space. If we define a probability distribution p(x) over the latent space, this will induce a corresponding probability distribution in the data space which will be zero outside of the embedded manifold. However, this constraint is unrealistic, since the points do not lie exactly on a Q-dimensional manifold; therefore, a noise model for the variable t is added:

p(t|x, W, β) = \left(\frac{β}{2π}\right)^{D/2} \exp\left\{-\frac{β}{2} \sum_{d=1}^{D} (t_d - y_d(x, W))^2\right\},    (4)

where t is a point in the data space and β^{-1} denotes the noise variance. By marginalizing over the latent variables, the probability distribution in the data space, expressed as a function of the parameters β and W, is obtained:

p(t|W, β) = \int p(t|x, W, β) p(x) dx.    (5)

By choosing p(x) as a set of M equally weighted delta functions on a regular grid,
p(x) = \frac{1}{M} \sum_{m=1}^{M} δ(x - x_m),    (6)
we render Eq. (5) analytically tractable, since the integral turns into a sum:

p(t|W, β) = \frac{1}{M} \sum_{m=1}^{M} p(t|x_m, W, β).    (7)
[Figure: (a) GTM — the mapping y(x; w) from the latent space to the data space.]
The mapping is taken to be a generalized linear model,

y(x; W) = W φ(x),    (8)

where the elements of φ(x) consist of L fixed basis functions {φ_l(x)}_{l=1}^{L}, and W is a D × L matrix.
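A minimal sketch of the generalized linear mapping of Eq. (8) follows; the Gaussian form of the basis functions, their centers, and the random weight matrix are illustrative assumptions, not values prescribed by the GTM.

```python
import numpy as np

def phi(x, basis_centers, sigma=1.0):
    """L fixed Gaussian basis functions phi_l evaluated at a latent point x."""
    return np.exp(-np.sum((basis_centers - x) ** 2, axis=1) / (2 * sigma ** 2))

Q, D, L = 2, 3, 4                                    # latent dim, data dim, number of basis functions
basis_centers = np.random.default_rng(1).uniform(-1, 1, size=(L, Q))
W = np.random.default_rng(2).normal(size=(D, L))     # the D x L weight matrix of Eq. (8)

x = np.array([0.2, -0.5])                            # a point in the latent space
y = W @ phi(x, basis_centers)                        # y(x; W) = W * phi(x), its image in data space
print(y.shape)                                       # (3,): a point in R^D
```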
2.2. Probabilistic Principal Surfaces

The PPS generalizes the GTM model by building a unified model: it shares the same formulation as the GTM, except for an oriented covariance structure of the Gaussian mixture in R^D. This means that data points projecting near a principal surface node (i.e., a Gaussian center of the mixture) have a higher influence on that node than points projecting far away from it. This is illustrated in Figure 2. Therefore, each node y(x; w), x ∈ {x_m}_{m=1}^{M}, has covariance

\frac{α}{β} \sum_{q=1}^{Q} e_q(x) e_q^T(x) + \frac{D - αQ}{β(D - Q)} \sum_{d=Q+1}^{D} e_d(x) e_d^T(x),    0 < α < D/Q,    (9)

where the e_d(x), d = Q + 1, ..., D, are orthonormal vectors perpendicular (⊥) to the manifold at y(x; w), and the e_q(x), q = 1, ..., Q, are orthonormal vectors parallel to it.
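To illustrate the oriented covariance of Eq. (9), the sketch below (our own illustration; the orthonormal basis at a node is a random placeholder rather than one computed from an actual manifold) assembles the covariance from Q vectors parallel to the manifold and D − Q vectors perpendicular to it. Note that, by construction, the trace equals D/β for any admissible α, and α = 1 recovers the spherical covariance (1/β)I of the GTM.

```python
import numpy as np

def pps_covariance(E, Q, alpha, beta):
    """Oriented covariance of Eq. (9). E is a D x D orthonormal basis at a node:
    columns 0..Q-1 span the parallel (tangent) directions, columns Q..D-1 the
    perpendicular ones."""
    D = E.shape[0]
    parallel = E[:, :Q] @ E[:, :Q].T                  # sum_q e_q(x) e_q(x)^T
    perpendicular = E[:, Q:] @ E[:, Q:].T             # sum_d e_d(x) e_d(x)^T
    return (alpha / beta) * parallel + (D - alpha * Q) / (beta * (D - Q)) * perpendicular

D, Q = 3, 2
E, _ = np.linalg.qr(np.random.default_rng(3).normal(size=(D, D)))   # placeholder orthonormal basis
Sigma = pps_covariance(E, Q, alpha=0.8, beta=5.0)
print(np.trace(Sigma))                                # = D / beta = 0.6, independent of alpha
```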