Correlation Visualization of High Dimensional Data using Topographic Maps

Ignacio Díaz Blanco, Abel A. Cuadrado Vega, and Alberto B. Diez González

Área de Ingeniería de Sistemas y Automática, Universidad de Oviedo
Campus de Viesques s/n, 33204, Gijón, Spain
{idiaz,cuadrado,alberto}@isa.uniovi.es
Abstract. Correlation analysis has always been a key technique for understanding data. However, traditional methods apply to the whole data set, providing only global information on correlations. Correlations often have a local nature, and two variables can be directly and inversely correlated at different points of the same data set. This situation arises typically in nonlinear processes. In this paper we propose a method to visualize the distribution of local correlations across the whole data set using dimension reduction mappings. The ideas are illustrated with an artificial data example.
1 Introduction
Visualization and dimension reduction techniques have received considerable attention in recent years for the analysis of large sets of multidimensional data [1–3], and particularly for the supervision and condition monitoring of complex industrial processes [4–6]. These techniques make it possible to discover unknown features and relationships of high dimensional data in a visual manner, by means of a mapping from a data space D (also called input space) onto a low dimensional visualization space V, where complex relationships among the input variables can be easily represented and visualized while preserving the information significant to a given problem.

Another very useful technique when dealing with high dimensional data is correlation analysis, which is concerned with finding how the components x_1, ..., x_p of the sample data vectors {x_i}, i = 1, ..., n, are mutually related. The standard way to cope with this problem is the analysis of second-order statistics such as the correlation matrix R, whose coefficients r_ij ∈ [−1, 1] describe how variables x_i and x_j are related. These coefficients are the result of a normalized inner product (the cosine) between the vectors formed by the values of x_i and x_j over the whole data set; in consequence, they provide correlation information of a global nature. However, in many cases the data variables can be correlated in different ways in different regions of the data space. This is the case, for instance, of multimodal or nonlinear processes, which behave locally in different ways depending on the working point. Thus, a local description of correlation is needed.
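As an illustrative sketch (not from the paper), consider a parabola-shaped two-variable data set in which x and y are positively correlated on one branch and negatively on the other; the single global coefficient then averages close to zero. All names and values below are invented for the example.

```python
import numpy as np

# Parabola-shaped data: correlation sign flips between the two branches,
# so the global coefficient r_xy averages the two behaviours away.
rng = np.random.default_rng(0)
t = rng.uniform(-1.0, 1.0, 500)
data = np.column_stack([t, t**2 + 0.05 * rng.standard_normal(500)])

# Global correlation matrix, coefficients r_ij in [-1, 1]
R = np.corrcoef(data, rowvar=False)
```

Here `R[0, 1]` comes out near zero even though each half of the data is strongly (and oppositely) correlated, which is exactly the limitation that motivates a local description.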
In this paper, we suggest a method that combines correlation analysis with the power of dimension reduction visualization methods, such as the Self-Organizing Map (SOM) [7] or the Generative Topographic Mapping (GTM) [8], making it possible to visualize local correlations for each pair of variables x_i, x_j through so-called correlation maps defined in the visualization space. The paper is organized as follows. In section 2 the ideas of local covariance and local correlation are introduced, and a method to display the information provided by local second-order statistics in the visualization space is proposed. In section 3 the proposed ideas are illustrated through a simple example. Finally, in section 4 some concluding remarks and future research lines are outlined.
2 Correlation Maps

2.1 Local Covariance Matrix
Let ψ(y) : R² → Rⁿ be a continuous mapping that takes a point y of the visualization space V ⊂ R² and returns a point ψ(y) on the manifold that approximates the distribution of the input data points x_i in the data space D ⊂ Rⁿ. Let us define the following neighborhood function

    w_i(y) = exp(−‖x_i − ψ(y)‖² / 2σ²),

which describes the degree of locality, or proximity, of sample x_i with respect to ψ(y) in the data space D. We define the local mean vector m(y) and the local covariance matrix C(y) associated with a point y in the visualization space V as

    m(y) = Σ_i x_i w_i(y) / Σ_i w_i(y)                                    (1)

    C(y) = (c_ij) = Σ_i [x_i − m(y)][x_i − m(y)]ᵀ w_i(y) / Σ_i w_i(y)     (2)
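A minimal NumPy sketch of equations (1) and (2); the function name and the default width are illustrative, not part of the paper.

```python
import numpy as np

def local_moments(X, psi_y, sigma=1.0):
    """Local mean m(y) and local covariance C(y) around psi(y), eqs. (1)-(2).

    X      : (n, p) data matrix, rows are the samples x_i.
    psi_y  : (p,) image psi(y) of a visualization-space point y.
    sigma  : width of the Gaussian neighborhood function w_i(y).
    """
    d2 = np.sum((X - psi_y) ** 2, axis=1)         # ||x_i - psi(y)||^2
    w = np.exp(-0.5 * d2 / sigma**2)              # weights w_i(y)
    m = (X * w[:, None]).sum(axis=0) / w.sum()    # eq. (1)
    Xc = X - m
    C = (Xc.T * w) @ Xc / w.sum()                 # eq. (2)
    return m, C
```

For a very large σ the weights become uniform and C(y) reduces to the ordinary (global) sample covariance, which matches the global/local tradeoff discussed below.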
Taken independently, the n × n components c_ij(y) of the covariance matrix C(y) can be regarded as local covariance values that describe the local dependency between variables x_i and x_j. Expressions (1) and (2) are local versions of the sample first- and second-order moments of the input data distribution around ψ(y), the image of point y in the visualization space. The width factor σ is a design parameter related to the degree of locality to be taken into account, allowing a tradeoff between global and local correlations to be established. The local covariance C(y) described in (2) defines in V a field of covariance matrices from D, each of which provides a local description of the second-order statistical features of the data in D lying in the vicinity of ψ(y).

2.2 Local Correlation Matrix
The previously defined covariance matrix provides insight into this local description of second-order statistics. However, when looking for correlations, correlation coefficients are preferred, as they provide a description normalized to the interval [−1, +1]. The local correlation matrix around y can be defined as R(y) = (r_ij), where

    r_ij = c_ij / √(c_ii c_jj)                                            (3)
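Equation (3) is a simple elementwise normalization of the local covariance; a sketch (the `eps` guard against zero local variances is an added implementation detail, not from the paper):

```python
import numpy as np

def local_correlation(C, eps=1e-12):
    """Normalize a local covariance matrix C(y) into R(y), eq. (3)."""
    s = np.sqrt(np.clip(np.diag(C), eps, None))  # sqrt(c_ii)
    return C / np.outer(s, s)                    # r_ij = c_ij / sqrt(c_ii c_jj)
```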
The local correlation matrix R(y) has n × n components r_ij(y), which represent the local correlation coefficient between variables x_i and x_j and always lie in the interval [−1, +1], where +1 denotes full direct correlation, 0 denotes no correlation, and −1 denotes full inverse correlation.

2.3 Visualization of Second Order Statistical Features
Both the covariance matrix C(y) and the correlation matrix R(y) are defined for each point y of V. In addition, the powerful geometrical and statistical interpretations underlying both matrices can be represented in V using scalar quantities. For instance, each component c_ij(y) or r_ij(y) defines a scalar quantity that can be represented in the same way as SOM planes, using a color code for each pixel y. Likewise, the principal values λ_i(y) of the covariance matrix, or the components of its principal vectors u_i(y), can be represented as SOM planes. This representation provides a unified visualization of the underlying correlations and, in general, of second-order statistical properties. Moreover, it is consistent with other SOM representations, such as SOM planes or the u-matrix, providing insight into the pattern of correlation dependencies among variables and revealing the most important features describing the behavior of the underlying process in each data region.
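As a sketch of the kind of scalar fields that can be drawn as planes, the following helper derives a few per-unit scalars from one local covariance matrix C(y). The function and field names are illustrative, and C(y) is assumed positive definite.

```python
import numpy as np

def covariance_planes(C):
    """Illustrative scalar quantities of one C(y) for SOM-style planes."""
    lam, U = np.linalg.eigh(C)   # principal values (ascending) / vectors
    return {
        "lambda_max": lam[-1],                    # largest principal value
        "anisotropy": 1.0 - lam[0] / lam[-1],     # 0: spherical, near 1: low rank
        "u1_x": U[0, -1],                         # x-component of leading vector
    }
```

Evaluating such a helper at every map unit y yields one scalar per unit, i.e. exactly the data needed for a color-coded plane.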
3 Application to Artificial Data
All these ideas are illustrated in figures 1, 2 and 3. A simple 2D data set was used to train both a 1D-SOM and a 2D-SOM. Local covariances were obtained for the 2D-SOM using (1) and (2) and then plotted in both the data space D and the visualization space V. Local correlations were also obtained using (3) to build the correlation maps of r_xx, r_xy, r_yx, r_yy shown in figure 2. A set of points with negative local correlations (corresponding to the right part of the "arc" in the data) can be discovered by looking at the upper left corner of the r_xy plane. Similarly, moderately high correlations appear in the upper right corner of the map, revealing the positive local correlations in the left part of the "arc" in the data space. The graphical information provided by the correlation maps in figure 2 is consistent with that shown in the SOM planes in figure 3, because both are descriptions in the same visualization space V. Finally, as expected, the planes r_xx and r_yy are identically 1, and r_xy = r_yx due to the symmetry of correlation matrices.
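A minimal numerical sketch of how such an r_xy map could be built on arc-shaped data. The regular 10×10 grid stands in for a trained SOM codebook (in the paper a real SOM provides ψ(y)); all names, sizes and parameter values are illustrative.

```python
import numpy as np

# Arc-shaped 2D data set (a noisy parabola).
rng = np.random.default_rng(1)
t = np.linspace(-2.0, 2.0, 400)
X = np.column_stack([t, t**2]) + 0.1 * rng.standard_normal((400, 2))

# Stand-in for a trained SOM codebook: a regular 10x10 grid of units
# W[k, l], each playing the role of psi(y) for one map cell.
gx, gy = np.meshgrid(np.linspace(-2, 2, 10), np.linspace(0, 4, 10))
W = np.stack([gx, gy], axis=-1)                  # shape (10, 10, 2)

sigma = 0.5
rxy_map = np.zeros(W.shape[:2])
for k in range(W.shape[0]):
    for l in range(W.shape[1]):
        d2 = np.sum((X - W[k, l]) ** 2, axis=1)
        w = np.exp(-0.5 * d2 / sigma**2)                 # weights w_i(y)
        m = (X * w[:, None]).sum(0) / w.sum()            # eq. (1)
        Xc = X - m
        C = (Xc.T * w) @ Xc / w.sum()                    # eq. (2)
        rxy_map[k, l] = C[0, 1] / np.sqrt(C[0, 0] * C[1, 1])  # eq. (3)
```

The resulting `rxy_map` can then be rendered like any SOM component plane (e.g. with matplotlib's `imshow`), one color per unit: units near the descending branch of the arc get negative values and units near the ascending branch get positive ones.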
[Figure 1: four panels — "1D-SOM in Data Space" and "2D-SOM in Data Space" (top), with the corresponding "Visualization Space" plots (bottom).]
Fig. 1. Local covariances in D (top) and in V (bottom) obtained for both a 1D-SOM (left) and a 2D-SOM (right). In the thick areas (low correlations) the covariances are nearly spherical, while in the thin areas (high correlations) they become low-rank and oriented, revealing in V the nature of the local correlations.
4 Concluding Remarks
We have proposed a method for the visualization of local second-order statistical properties using dimension reduction mappings such as, but not restricted to, the SOM. The proposed idea has strong connections with local model approaches such as [9], where local linear PCA projections are proposed to capture the nonlinear structure of data. We showed through an artificial data example how local second-order statistical properties can be revealed by means of correlation maps, which, in addition, are consistent with other standard representations in the visualization space, such as the component planes or the distance matrix. This provides an alternative to the standard SOM-based methods for high dimensional data visualization (u-matrix, SOM planes, response surfaces or SOM plane rearrangement [10], as well as SOM clustering methods [11]), combining classical correlation analysis (the correlation matrix) with the power of SOM data visualization.

As a matter of further study, the idea of local moments is not restricted to correlation analysis, or even to second-order moments. The eigenvalues λ_i(y) or the components of the eigenvectors u_i(y) of the local covariance matrix can lead to meaningful maps, derived in a straightforward manner from the ideas described here. In a similar way, higher-order statistics (cumulants) can be obtained in a local fashion, opening new research lines in data visualization.

[Figure 2: four correlation-map panels, r_xx, r_xy, r_yx, r_yy, over the visualization space.]

Fig. 2. Correlation maps for the 2D-SOM show a region in V (upper left) related to highly negative local correlations and another region (upper right) revealing positive local correlations.
[Figure 3: component planes for variables x and y, and the "Interneuron Distance Matrix" panel.]
Fig. 3. SOM planes of variables x and y and distance matrix.
The ideas proposed in this paper are currently being tested in the steel industry to investigate the effects of several dozen process variables on several quality factors of the coils processed in a tandem mill, with encouraging results.
References

1. Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, December 2000.
2. Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, December 2000.
3. Jianchang Mao and Anil K. Jain. Artificial neural networks for feature extraction and multivariate data projection. IEEE Transactions on Neural Networks, 6(2):296–316, March 1995.
4. David J. H. Wilson and George W. Irwin. RBF principal manifolds for process monitoring. IEEE Transactions on Neural Networks, 10(6):1424–1434, November 1999.
5. Teuvo Kohonen, Erkki Oja, Olli Simula, Ari Visa, and Jari Kangas. Engineering applications of the self-organizing map. Proceedings of the IEEE, 84(10):1358–1384, October 1996.
6. Esa Alhoniemi, Jaakko Hollmén, Olli Simula, and Juha Vesanto. Process monitoring and modeling using the self-organizing map. Integrated Computer Aided Engineering, 6(1):3–14, 1999.
7. Teuvo Kohonen. Self-Organizing Maps. Springer-Verlag, 1995.
8. Christopher M. Bishop, Markus Svensén, and Christopher K. I. Williams. GTM: The generative topographic mapping. Neural Computation, 10(1):215–234, 1998.
9. M. Tipping and C. Bishop. Mixtures of probabilistic principal component analyzers. Neural Computation, 11(2):443–482, 1999.
10. Juha Vesanto. SOM-based data visualization methods. Intelligent Data Analysis, 3(2):111–126, 1999.
11. Juha Vesanto and Esa Alhoniemi. Clustering of the self-organizing map. IEEE Transactions on Neural Networks, 11(3):586–600, May 2000.