Visualization, Clustering and Classification of ... - CiteSeerX

Visualization, Clustering and Classification of Multidimensional Astronomical Data

Antonino Staiano∗, Angelo Ciaramella∗, Lara De Vinco‡, Ciro Donalek†, Giuseppe Longo†, Giancarlo Raiconi∗, Roberto Tagliaferri∗, Roberto Amato†, Carmine Del Mondo†, Giuseppe Mangano†, Gennaro Miele†

∗ Dipartimento di Matematica ed Informatica, Università di Salerno, Fisciano (Sa), Italy. Email: {astaiano, rtagliaferri, ciaram, gianni}@unisa.it
† Dipartimento di Scienze Fisiche, Università Federico II di Napoli, Italy. Email: {longo, donalek}@na.infn.it
‡ INFOTEL S.r.l., Via Strauss, 45 - 84091 Battipaglia (Sa), Italy. Email: [email protected]

Abstract: Owing to recent technological advances, data mining in massive data sets has become a crucial research field for many, if not all, areas of research, from astronomy to high energy physics to genetics. In this paper we discuss an implementation of Probabilistic Principal Surfaces (PPS) that was developed within the framework of the AstroNeural collaboration. PPS are a nonlinear latent variable model that may be regarded as a complete mathematical framework for accomplishing some fundamental data mining activities, such as visualization, clustering and classification of high-dimensional data. The effectiveness of the proposed model is illustrated on a complex astronomical data set.

I. INTRODUCTION

The explosive growth in the quantity, quality and accessibility of data currently experienced in all fields of science and human endeavor has triggered the search for a new generation of computational theories and tools capable of assisting humans in extracting useful information (knowledge) from the available and planned massive data sets. This revolution has two main aspects. On the one hand, in astronomy (as well as in high energy physics, genetics, social sciences and many other fields), traditional interactive data analysis and data visualization methods have proved inadequate to cope with data sets characterized by huge volumes and/or complexity (tens or hundreds of parameters or features per record, cf. [1] and references therein). On the other hand, the simultaneous analysis of hundreds of parameters unveils previously unknown patterns, which may lead to a deeper understanding of the underlying phenomena and trends. The field of Data Mining is therefore becoming of paramount importance not only in its traditional arena but also as an auxiliary tool for almost all fields of research. In this paper we discuss how three common tasks in data analysis (data visualization, clustering and data classification) may be performed using Spherical Probabilistic Principal Surfaces (PPS) as a common framework.

• Visualization: it is a crucial step in the process of data analysis, enabling an understanding of the relations that exist within the data by displaying them in such a way that the discovered patterns are emphasized.

• Clustering: it is perhaps the most important and widely used method of unsupervised learning. It may be summarized as the problem of identifying groupings of similar points that are relatively 'isolated' from each other or, in other words, of partitioning the data into dissimilar groups of similar items.

• Classification: it is concerned with assigning a given pattern to one of a number of possible classes, which depend on the problem at hand. Such classes may be the result of labeling the groupings produced by a clustering procedure.

PPS [6], [7] (discussed in Section II) are a nonlinear extension of principal components, in that each node on the PPS is the average of all data points that project near/onto it. From a theoretical standpoint, the PPS is a generalization of the Generative Topographic Mapping (GTM) [2], [3], which can be seen as a parametric alternative to Self Organizing Maps (SOM) [10]. Some advantages of PPS include its parametric and flexible formulation for any geometry/topology in any dimension, and its guaranteed convergence (PPS training is carried out through the Expectation-Maximization algorithm). A PPS is governed by its latent topology and, owing to this flexibility, a variety of PPS topologies can be created, one of which is the 3D sphere. The sphere is finite and unbounded, with all nodes distributed at the edge; it is therefore ideal for emulating the sparseness and peripheral property of high-D data. Furthermore, the sphere topology can be easily comprehended by humans and can thereby be of great help in visualizing high-D data (Section III-A). Since PPS generates a probability density function of the input data, in the form of a mixture of Gaussians, it can be used for both clustering (Section III-B) and classification (Section III-C) purposes. To illustrate the power and the effectiveness of the model, we discuss a case study in the field of astronomy using a real and complex data set (Section IV). All results discussed here were obtained in the framework of the AstroNeural collaboration: a joint project between the Department of Mathematics and Informatics of the University of Salerno and the Department of Physical Sciences of the University Federico II in Napoli. The main goal of the collaboration is to implement a user-friendly data mining tool capable of dealing with heterogeneous, high-dimensional data sets.
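To make the spherical latent topology mentioned in the Introduction more concrete, the following minimal sketch places M latent nodes on the surface of the unit sphere in R^3. The placement rule (a Fibonacci lattice) and the function name `sphere_nodes` are assumptions made purely for illustration; the text does not state how the latent nodes are actually arranged.

```python
import numpy as np

def sphere_nodes(M):
    """Return an (M, 3) array of roughly evenly spread points on the unit sphere.

    Assumption: a Fibonacci lattice is used for illustration only; it is not
    taken from the paper.
    """
    i = np.arange(M) + 0.5
    z = 1.0 - 2.0 * i / M                        # heights evenly spaced in (-1, 1)
    phi = np.pi * (1.0 + np.sqrt(5.0)) * i       # longitude advances by a golden-ratio step
    r = np.sqrt(1.0 - z ** 2)                    # radius of the circle at height z
    return np.column_stack((r * np.cos(phi), r * np.sin(phi), z))

# Hypothetical usage: M = 100 latent nodes x_m, all lying on the sphere surface.
latent_nodes = sphere_nodes(100)
```

Every node has unit norm, which mirrors the statement that all nodes of the spherical manifold are distributed at the edge.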

[Figure 1 panels: (a) manifold in the latent space R^3; (b) manifold in the feature space R^D; (c) t projected onto the manifold in the latent space R^3. Labels in the panels: x, t, y(x), E[x|t].]

Fig. 1. (a) The spherical manifold in $R^3$ latent space. (b) The spherical manifold in $R^3$ data space. (c) Projection of data points t onto the latent spherical manifold.

II. PPS: THEORETICAL DESCRIPTION

PPS defines a nonlinear, parametric mapping $y(x; W)$ from a Q-dimensional latent space ($x \in R^Q$) to a D-dimensional data space ($t \in R^D$), where normally $Q < D$. The mapping $y(x; W)$, defined to be continuous and differentiable, maps every point in the latent space to a point in the data space. Since the latent space is Q-dimensional, these points are confined to a Q-dimensional manifold nonlinearly embedded in the D-dimensional data space. PPS builds a constrained mixture of Gaussians (where the priors are all fixed to $1/M$)

$$p(t \mid W, \Sigma_m) = \frac{1}{M} \sum_{m=1}^{M} p(t \mid x_m, W, \Sigma_m), \qquad (1)$$

and each component has the form

$$p(t \mid x_m, W, \Sigma_m) = (2\pi)^{-D/2} \, |\Sigma_m|^{-1/2} \exp\left\{ -\frac{1}{2} \big(y(x_m; W) - t\big) \, \Sigma_m^{-1} \, \big(y(x_m; W) - t\big)^T \right\}, \qquad (2)$$

where t is a point in the data space and $\Sigma_m$ denotes the noise covariance. The covariance is defined as

$$\Sigma_m = \frac{\alpha}{\beta} \sum_{q=1}^{Q} e_q(x) e_q^T(x) + \frac{D - \alpha Q}{\beta (D - Q)} \sum_{d=Q+1}^{D} e_d(x) e_d^T(x), \qquad 0 < \alpha < D/Q. \qquad (3)$$

III. APPLICATION OF PPS TO DATA MINING

A. Visualization

After a PPS model is fitted to the data, several visualization possibilities are available for analyzing the data points.

1) Data point projections onto the latent sphere: The data are projected into the latent space as points on a sphere (Figure 1). The latent manifold coordinates $\hat{x}_n$ of each data point $t_n$ are computed as

$$\hat{x}_n \equiv \langle x \mid t_n \rangle = \int x \, p(x \mid t_n) \, dx = \sum_{m=1}^{M} r_{mn} x_m$$
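As a rough numerical illustration of Eqs. (1)-(3) and of the latent projection of Section III-A, the NumPy sketch below builds the oriented covariance, evaluates the equal-prior Gaussian mixture, and computes the projected latent coordinates of a data point. All function names are hypothetical. Two assumptions are made that this excerpt does not spell out: the tangent vectors $e_1, \dots, e_Q$ are supplied as orthonormal columns of a matrix, and the weights $r_{mn}$ are taken to be the posterior probabilities of the mixture components given $t_n$, as is usual in GTM-style models.

```python
import numpy as np

def oriented_covariance(tangent, alpha, beta):
    """Eq. (3). `tangent` is a (D, Q) matrix whose orthonormal columns are e_1..e_Q.

    The sum over the remaining D - Q orthonormal directions equals I - P_tan,
    so the orthogonal vectors e_{Q+1}..e_D never need to be built explicitly.
    The result is positive definite when 0 < alpha < D / Q.
    """
    D, Q = tangent.shape
    P_tan = tangent @ tangent.T                  # sum_q e_q e_q^T
    P_orth = np.eye(D) - P_tan                   # sum_d e_d e_d^T
    return (alpha / beta) * P_tan + (D - alpha * Q) / (beta * (D - Q)) * P_orth

def component_density(t, y_m, cov):
    """Eq. (2): Gaussian density of t centred at y_m = y(x_m; W)."""
    D = t.shape[0]
    diff = y_m - t
    quad = diff @ np.linalg.solve(cov, diff)     # (y - t) Sigma^{-1} (y - t)^T
    return (2.0 * np.pi) ** (-D / 2.0) * np.linalg.det(cov) ** -0.5 * np.exp(-0.5 * quad)

def mixture_density(t, centres, covs):
    """Eq. (1): mixture of M components with equal priors 1/M."""
    return np.mean([component_density(t, y, S) for y, S in zip(centres, covs)])

def latent_projection(t, latent_nodes, centres, covs):
    """x_hat = sum_m r_m x_m, with r_m ASSUMED to be the posterior of component m."""
    p = np.array([component_density(t, y, S) for y, S in zip(centres, covs)])
    r = p / p.sum()                              # equal priors cancel in the ratio
    return r @ latent_nodes

# Hypothetical toy setup: D = 3, Q = 2, M = 2.
E = np.eye(3)[:, :2]                             # tangent vectors e_1, e_2
covs = [oriented_covariance(E, alpha=0.5, beta=2.0)] * 2
latent_nodes = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
centres = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])
x_hat = latent_projection(np.zeros(3), latent_nodes, centres, covs)
```

Setting `alpha = 1` makes both coefficients in Eq. (3) equal to $1/\beta$, so the covariance reduces to the spherical case $I/\beta$; for any admissible `alpha` the trace stays at $D/\beta$, so `alpha` only redistributes the noise between the tangential and orthogonal directions.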
