Visualization, Clustering and Classification of Multidimensional Astronomical Data

Antonino Staiano∗, Angelo Ciaramella∗, Lara De Vinco‡, Ciro Donalek†, Giuseppe Longo†, Giancarlo Raiconi∗, Roberto Tagliaferri∗, Roberto Amato†, Carmine Del Mondo†, Giuseppe Mangano†, Gennaro Miele†

∗ Dipartimento di Matematica ed Informatica, Università di Salerno, Fisciano (Sa), Italy. Email: {astaiano, rtagliaferri, ciaram, gianni}@unisa.it
† Dipartimento di Scienze Fisiche, Università Federico II di Napoli, Italy. Email: {longo, donalek}@na.infn.it
‡ INFOTEL S.r.l., Via Strauss, 45 - 84091 Battipaglia (Sa), Italy. Email: [email protected]

Abstract— Owing to recent technological advances, data mining in massive data sets has become a crucial research field for many, if not all, areas of research: from astronomy to high energy physics, genetics, etc. In this paper we discuss an implementation of Probabilistic Principal Surfaces (PPS) which was developed within the framework of the AstroNeural collaboration. PPS are a nonlinear latent variable model which may be regarded as a complete mathematical framework for accomplishing some fundamental data mining activities, such as visualization, clustering and classification of high dimensional data. The effectiveness of the proposed model is illustrated on a complex astronomical data set.





I. INTRODUCTION

The explosive growth in the quantity, quality and accessibility of data which is currently experienced in all fields of science and human endeavor has triggered the search for a new generation of computational theories and tools capable of assisting humans in extracting useful information (knowledge) from the available and planned massive data sets. This revolution has two main aspects. On the one hand, in astronomy (as well as in high energy physics, genetics, social sciences, and many other fields) traditional interactive data analysis and data visualization methods have proved inadequate to cope with data sets which are characterized by huge volumes and/or complexity (tens or hundreds of parameters or features per record, cf. [1] and references therein). On the other hand, the simultaneous analysis of hundreds of parameters can unveil previously unknown patterns which may lead to a deeper understanding of the underlying phenomena and trends. The field of Data Mining is therefore becoming of paramount importance not only in its traditional arena but also as an auxiliary tool for almost all fields of research. In this paper we discuss how three common tasks in data analysis (data visualization, clustering and data classification) may be performed using spherical Probabilistic Principal Surfaces (PPS) as a common framework.

• Visualization: it is a crucial step in the process of data analysis, enabling an understanding of the relations that exist within the data by displaying them in such a way that the discovered patterns are emphasized.

• Clustering: it is perhaps the most important and widely used method of unsupervised learning. It may be summarized as the problem of identifying groupings of similar points that are relatively 'isolated' from each other or, in other words, of partitioning the data into dissimilar groups of similar items.

• Classification: it is concerned with assigning a given pattern to one of a number of possible classes, which depend on the problem at hand. Such classes may be the result of labeling the groupings produced by a clustering procedure.

PPS [6], [7] (discussed in Section II) are a nonlinear extension of principal components, in that each node on the PPS is the average of all data points that project near/onto it. From a theoretical standpoint, the PPS is a generalization of the Generative Topographic Mapping (GTM) [2], [3], which can be seen as a parametric alternative to Self Organizing Maps (SOM) [10]. The advantages of PPS include its parametric and flexible formulation for any geometry/topology in any dimension and its guaranteed convergence (PPS training is accomplished through the Expectation-Maximization algorithm). A PPS is governed by its latent topology and, owing to this flexibility, a variety of PPS topologies can be created, one of which is the 3D sphere. The sphere is finite and unbounded, with all nodes distributed at the edge; it is therefore ideal for emulating the sparseness and peripheral property of high-D data. Furthermore, the sphere topology can be easily comprehended by humans and can thereby be of great help in visualizing high-D data (Section III-A). Since PPS generates a probability density function of the input data, in the form of a mixture of Gaussians, it can be used both for clustering (Section III-B) and classification (Section III-C) purposes. To illustrate the power and the effectiveness of the model, we shall discuss a case study in the field of astronomy using a real and complex data set (Section IV). All results discussed here were obtained in the framework of the AstroNeural collaboration: a joint project between the Department of Mathematics and Informatics of the University of Salerno and the Department of Physical Sciences of the University Federico II in Napoli.

The main goal of the collaboration is to implement a user-friendly data mining tool capable of dealing with heterogeneous, high-dimensional data sets.

[Figure 1 panel labels: (a) manifold in latent space $\mathbb{R}^3$; (b) manifold in feature space $\mathbb{R}^D$; (c) $t$ projected onto the manifold in latent space $\mathbb{R}^3$. Marked points: $x$, $t$, $y(x)$, $E[x|t]$.]

Fig. 1. (a) The spherical manifold in $\mathbb{R}^3$ latent space. (b) The spherical manifold in $\mathbb{R}^3$ data space. (c) Projection of data points $t$ onto the latent spherical manifold.

II. PPS: THEORETICAL DESCRIPTION

PPS defines a non-linear, parametric mapping $y(x; W)$ from a $Q$-dimensional latent space ($x \in \mathbb{R}^Q$) to a $D$-dimensional data space ($t \in \mathbb{R}^D$), where normally $Q < D$. The mapping $y(x; W)$ (defined to be continuous and differentiable) maps every point in the latent space to a point in the data space. Since the latent space is $Q$-dimensional, these points will be confined to a $Q$-dimensional manifold non-linearly embedded in the $D$-dimensional data space. PPS builds a constrained mixture of Gaussians (where the priors are all fixed to $1/M$)

$$p(t \mid W, \Sigma_m) = \frac{1}{M} \sum_{m=1}^{M} p(t \mid x_m, W, \Sigma_m), \tag{1}$$

and each component has the form:

$$p(t \mid x_m, W, \Sigma_m) = \frac{|\Sigma_m|^{-1/2}}{(2\pi)^{D/2}} \exp\left\{-\frac{1}{2}\,(y(x_m; W) - t)\,\Sigma_m^{-1}\,(y(x_m; W) - t)^T\right\}, \tag{2}$$

where $t$ is a point in the data space and $\Sigma_m$ denotes the noise covariance of the $m$-th component ($\Sigma_m^{-1}$ being its inverse). The covariance is defined as

$$\Sigma_m = \frac{\alpha}{\beta} \sum_{q=1}^{Q} e_q(x)\,e_q^T(x) + \frac{D - \alpha Q}{\beta (D - Q)} \sum_{d=Q+1}^{D} e_d(x)\,e_d^T(x), \tag{3}$$

D 0

III. APPLICATION OF PPS TO DATA MINING

A. Visualization

After a PPS model is fitted to the data, several visualization possibilities are available for analyzing the data points.

1) Data point projections onto the latent sphere: The data are projected into the latent space as points on a sphere (Figure 1). The latent manifold coordinates $\hat{x}_n$ of each data point $t_n$ are computed as

$$\hat{x}_n \equiv \langle x \mid t_n \rangle = \int x\, p(x \mid t_n)\, dx = \sum_{m=1}^{M} r_{mn}\, x_m$$
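As a minimal sketch (not the authors' implementation), the Python/NumPy code below evaluates the constrained mixture of Section II and the latent projection of Section III-A for a single data point. Several ingredients are assumptions made only for illustration: the six toy nodes on the unit sphere, the identity map `Y = X` standing in for a trained $y(x_m; W)$, the helper names, and the values `alpha = 0.5` and `beta = 10`. The code also assumes that the vectors $e_q$ span the directions tangential to the manifold and the vectors $e_d$ the directions orthogonal to it, and that $r_{mn}$ is the posterior responsibility of node $m$ for point $t_n$; neither is defined in the text above.

```python
import numpy as np

def oriented_covariance(E, Q, alpha, beta):
    """Eq. (3). E is a (D, D) orthonormal basis; its first Q columns play
    the role of the e_q vectors, the remaining D - Q columns the e_d vectors
    (assumed here to be tangential and orthogonal directions respectively)."""
    D = E.shape[0]
    tangential = E[:, :Q] @ E[:, :Q].T
    orthogonal = E[:, Q:] @ E[:, Q:].T
    return (alpha / beta) * tangential \
        + (D - alpha * Q) / (beta * (D - Q)) * orthogonal

def component_density(t, y_m, Sigma_m):
    """Eq. (2): Gaussian density of point t for the node centred at y_m."""
    D = t.shape[0]
    diff = y_m - t
    quad = diff @ np.linalg.solve(Sigma_m, diff)
    norm = (2.0 * np.pi) ** (-D / 2.0) * np.linalg.det(Sigma_m) ** -0.5
    return norm * np.exp(-0.5 * quad)

def project_to_latent(t, X, Y, Sigmas):
    """Mixture density of Eq. (1) and the latent projection sum_m r_m x_m.
    X: (M, 3) latent nodes; Y: (M, D) node images in data space."""
    dens = np.array([component_density(t, y, S) for y, S in zip(Y, Sigmas)])
    p_t = dens.mean()          # equal priors 1/M
    r = dens / dens.sum()      # responsibilities (assumed posterior form)
    return p_t, r, r @ X

def basis_at(x):
    """Hypothetical helper: orthonormal basis of R^3 whose last column is
    parallel to x (orthogonal to the sphere), the first two tangential."""
    q, _ = np.linalg.qr(np.column_stack([x, np.eye(3)]))
    return q[:, [1, 2, 0]]

# Toy setup: six latent nodes at +/- the coordinate axes of the unit sphere.
X = np.vstack([np.eye(3), -np.eye(3)])
Y = X.copy()                   # identity map in place of a trained y(x; W)
D, Q, alpha, beta = 3, 2, 0.5, 10.0
Sigmas = [oriented_covariance(basis_at(x), Q, alpha, beta) for x in X]

t = np.array([0.9, 0.1, 0.0])  # a data point close to the first node
p_t, r, x_hat = project_to_latent(t, X, Y, Sigmas)
print("density:", p_t)
print("responsibilities:", np.round(r, 4))
print("latent coordinates:", np.round(x_hat, 4))
```

Two properties are worth noting. The trace of the covariance in Eq. (3) equals $D/\beta$ for any $\alpha$, since $(\alpha/\beta)Q + (D - \alpha Q)/\beta = D/\beta$, so $\alpha$ only redistributes variance between the two groups of directions. Also, the projected point is a convex combination of the latent nodes and therefore lies inside the unit ball; whether and how it is renormalized onto the sphere is not stated in this excerpt.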