This version is not the camera-ready version. Please refer to the final published version for accuracy and completeness.

ClassiMap: a new dimension reduction technique for exploratory data analysis of labeled data

Sylvain Lespinats (1), Michaël Aupetit (2), Anke Meyer-Baese (3)

(1) CEA/INES, Laboratory for Solar Systems (L2S), BP 332, 50 avenue du Lac Léman, F-73377 Le-Bourget-du-lac, France (corresponding author; e-mail: [email protected]).
(2) Qatar Computing Research Institute, 10th floor, Tornado Tower, PO Box 5825, Doha, Qatar.
(3) Department of Electrical and Computer Engineering, Florida State University, Tallahassee, FL 32310-6046, USA.

Abstract—Multidimensional scaling techniques are unsupervised Dimension Reduction (DR) techniques which use multidimensional data pairwise similarities to represent the data in a plane, enabling their visual exploratory analysis. Considering labeled data, DR techniques face two objectives with potentially different priorities: one is to account for the data points' similarities, the other for the data classes' structures. Unsupervised DR techniques attempt to preserve the original data similarities, but they do not consider the class labels, hence they can map originally separated classes as overlapping ones. Conversely, the state-of-the-art so-called supervised DR techniques naturally handle labeled data, but they do so in a predictive modeling framework where they attempt to separate the classes in order to improve a classification accuracy measure in the low-dimension space, hence they can map even originally overlapping classes as separated. We propose ClassiMap, a DR technique which optimizes a new objective function enabling exploratory data analysis of labeled data. Mapping distortions, known as tears and false neighborhoods, cannot be avoided in general due to the reduction of the data dimension. ClassiMap intends primarily to preserve the data similarities, but it tends to distribute the unavoidable tears preferentially among different-label data and the unavoidable false neighbors among same-label data. Standard quality measures used to evaluate unsupervised mappings cannot tell about the preservation of within-class or between-class structures, while the classification accuracy used to evaluate supervised mappings is only relevant to the framework of predictive modeling. We propose two measures better suited to the evaluation of DR of labeled data in an exploratory data analysis framework. We use these two label-aware indices and four other standard unsupervised indices to compare ClassiMap to other state-of-the-art supervised and unsupervised DR techniques on synthetic and real datasets. ClassiMap appears to provide a better tradeoff between pairwise similarity and class structure preservation according to these new measures.

Keywords—multidimensional scaling, exploratory data analysis, labeled data, mapping evaluation, dimensionality reduction, distance preservation.

1. INTRODUCTION

1.1. Context

Multidimensional scaling techniques are Dimension Reduction (DR) techniques used for the visual analysis of multivariate data [12, 40, 48]. The original multivariate data are usually mapped with a DR technique as points in a 2-dimensional metric space (the map) while preserving the original similarities between data points as well as possible. When applying nonlinear DR techniques, among which are Isomap [67], NeRV [64] and the vast majority of modern DR techniques [21, 40, 48, 78], it is important to notice that the axes of the map have no meaning at all. We will only consider these meaningless-axes maps in the sequel; therefore the only remaining piece of information related to the original multidimensional data is fully contained in the points' pairwise proximities: in such a scatterplot, nearby points are supposed to represent similar data, and far away points dissimilar ones. The analysis of multidimensional data is then to be done through their 2-dimensional graphical representation as a scatterplot. However, these mappings necessarily come with distortions due to the reduction of dimension, so the visual analysis is not trustworthy in general. The present paper focuses on the mapping of labeled data, i.e. data with an assigned class label. Unsupervised DR techniques are exploratory data analysis techniques which ignore the class labels of the data. Dealing with labeled data, supervised mapping techniques naturally come to mind, as they are designed to take into account both the original data similarities and the data class labels. They have been used in a wide range of applications including handwritten digit classification [35], prediction of membrane protein types [75], face recognition [6, 7, 83], handshape recognition [27], gene expression data analysis [56], and bankruptcy analysis [57]. However, these supervised DR techniques essentially focus on the classification accuracy of the resulting maps: they can be viewed as techniques for predictive modeling of the classes under the main constraint of a reduced dimension.

As we shall see, the predictive modeling objective is not compatible with the objective of exploratory analysis of labeled data. Our objective is to develop a new DR technique enabling visual exploratory data analysis of labeled data, which preserves the class structure together with the similarities.

1.2. Paper contributions and outline

We propose ClassiMap, a new mapping technique for visual Exploratory Data Analysis of Labeled Data (EDALD). Our contribution is fourfold:

- In section 2, we propose a new framework to understand which kind of information can be inferred about multidimensional labeled data based on the visual analysis of their graphical depiction as a scatterplot. For this sake, we adapt the concepts of Class-Cluster Correlation Assumption (CCCA) and Class-Cluster Perceptual Correlation (CCPC) defined in [5] to the present context of EDALD through their low-dimensional representation. We specifically take into account the mapping distortions and the types of inferences that one can derive from labeled data regarding the correctness of the labeling, the relevance of the space or that of the similarity metric.
- In section 3, from the DR state of the art and the above CCCA-CCPC framework, we explicitly define the goals of EDALD and the specific mapping distortions, namely the between-class false neighbors and the within-class tears, to be penalized in order to fulfill these goals.
- In section 4, we define an original cost function which effectively penalizes these specific distortions, and ClassiMap, a new DR technique for labeled data which minimizes this cost function.
- Finally, in section 5, we propose the Class Over-Separation Index (COSI), a new index to quantify how much maps unduly separate originally overlapping classes, in addition to the Class OverLapping Index (COLI), a new index in this context, to quantify how much maps avoid unfounded class overlapping. These two indices, together with four other standard unsupervised indices, make possible a fair and objective comparison of ClassiMap with other supervised and unsupervised mapping techniques in the perspective of EDALD.

In section 6, experiments on toys and real data are proposed which demonstrate the benefits and limits of ClassiMap. We conclude in section 7 opening with perspective. 2. EDA OF MULTIDIMENSIONAL LABELED DATA THROUGH THEIR SCATTERPLOT REPRESENTATION 2.1 Exploratory Data Analysis vs predictive modeling Exploratory data analysis and predictive modeling are two distinct data processing frameworks. Exploratory Data Analysis (EDA) [69] aims at extracting descriptive information about the data at hand that can be used by humans for knowledge discovery: correlation, clustering, outlier detection, intrinsic dimension estimation, Principal Component Analysis (see [81] for a general overview) or more recently Topological Data Analysis [13, 23] are part of this field. Predictive modeling aims at designing a model of the data enabling to predict missing characteristics of unseen data drawn from the same population (see [10] for a general overview). Supervised classification techniques like K-Nearest-Neighbors or Support Vector Machines and regression techniques like Logistic Regression are part of this field. DR techniques are typically used in the EDA framework as they provide a graphical representation of the data enabling direct visual data exploration by the users. In this framework, several indexes are often used to evaluate the overall quality of the maps (please refer to section 5), although an ultimate evaluation for a map would certainly be the quantity and quality of information about the original data the users could infer from it. Naturally, such subjective evaluation depends on the user’s goals. Conversely, in the predictive modeling framework, the main objective is class prediction under DR constraints, so crossvalidation or regularization techniques could and should be used to select the best predictive model of the classes. In this work we propose a new DR technique for labeled data to be used in the EDA framework. We shall further describe the EDA with DR techniques to make clear our objective. We first borrow the concepts of Class-Cluster Correlation Assumption and of Class-Cluster Perceptual Correlation to a previous work proposed by one of the authors [5]. We link these concepts to the EDA inferential tasks to create a new framework of EDA of Labeled Data (EDALD) through their low dimensional representation, and detail the possible impact of the mapping distortions onto the correctness of these inferences.


2.2 The Class-Cluster Correlation Assumption (CCCA)

A clustering process is a way to determine groups of similar data that can be further identified as classes by assigning them a specific label. These labels are usually assigned by a human expert or by a machine designed to do so. It is also a fundamental clustering assumption that data assigned to the same cluster should be more similar to each other than data assigned to different clusters [15]. Hence when data come with labels, we believe that the human expert who (or the machine which) classified the data has relied on some data similarity measure to group the data first and then to assign them to classes with specific labels. Therefore we assume that there must exist some latent space in which each data class does not overlap the others and forms a well-separated single cluster. We call this assumption the Class-Cluster Correlation Assumption (CCCA) in the sequel (the CCCA is called the cluster assumption in the semi-supervised learning literature [15]).

2.3 The CCCA and EDA

If data come as points in a multidimensional space, then the relations between the descriptive variables can be studied (correlation analysis) and the probability density in this space can be estimated; when pairwise similarities are available, cluster analysis, outlier detection, intrinsic dimension estimation or topological analysis can also be conducted. Determining whether the CCCA is valid or not in the multidimensional data space is related to the data clustering structure, thus it only depends on having a pairwise data similarity matrix. When data come with labels, testing the validity of the CCCA is an EDA task. The CCCA is not true for a class in the given data space if this class is shattered into different clusters (false within-CCCA) or if it lies in the same cluster as other classes, i.e. if there exists a mixed-class cluster (false between-CCCA). Both cases can happen together, for instance with a class being shattered into two mixed-class clusters. The CCCA is valid when the within-CCCA and between-CCCA are both valid for each class. Depending on what information we trust and so what information we question, there can be four different combinations of patterns and assumptions regarding whether a class is mixed, clustered or shattered and whether the labels or the similarity measure is trusted:
- Mixed-class and correct labels: if we assume that the labels are correct, then the presence of a mixed-class cluster in the data space is a hint that some unknown variables, originally used to distinguish the data and to give them distinct labels, are missing from the available data space or not used to compute the available similarity measure. More discriminant variables should be searched for further classification or understanding. For instance, we can assume that the distribution of the maximum speed of cars and motorcycles available on the market is roughly similar, while the number of wheels is a very discriminant characteristic to distinguish cars from motorcycles. Therefore, if maximum speed is the only observed variable and the car and motorcycle labels are trusted, then both classes, having roughly the same maximum speed distribution, are likely to overlap in some clusters along the maximum speed variable, indicating that discriminant variables are missing (here the unobserved number of wheels).
- Mixed-class and correct space or similarity measure: if we assume that the similarity measure or the variables of the data space are the discriminant ones, then the existence of a mixed-class cluster is a hint that some of the labels are erroneous. These labels should be corrected for further classification or understanding. For instance, if having four wheels is assumed to be a discriminant characteristic of cars, then observing a four-wheel vehicle labeled as "motorcycle" is likely to be a labeling error rather than the effect of irrelevant or missing variables, indicating that this label should be examined in detail.
- Shattered-class and correct labels: if we assume that the labels are correct, then the presence of a class shattered into several clusters in the data space is a hint that some irrelevant variables, independent of the label, split the class and are observed in the available data space or used to compute the available similarity measure. Such variables should be removed to reduce the data space dimension, in order to prevent the curse of dimensionality and get a condensed representation. For instance, the maximum speed variable can shatter cars into clusters of low-, middle- and high-speed vehicles, indicating that this variable might be irrelevant to get a condensed representation space for cars and so should be ignored.
- Shattered-class and correct space or similarity measure: if we assume that the similarity measure or the variables of the data space are the discriminant ones, then the existence of a class shattered into several clusters in the data space is a hint that this class should be broken down into as many subclasses as there are clusters, or that some observations are missing between the class components that would make them a single cluster. Class subdivision should be considered, or the empty space between the available clusters should be better sampled. For instance, considering the weight of a vehicle, if the observed sample only contains small compact cars and big SUVs, then two clusters appear because medium-sized cars are not in the sample; further investigation could be conducted to see why there are no medium-sized cars in the sample, or one could decide to define subclasses based on the cars' weight.


In standard predictive modeling, labels are assumed to be correct and the best space or similarity measure is searched for to separate the classes. In EDALD, one can assume the CCCA to be valid and detect anomalies regarding the data space, the labels, the similarity measure or the sampling when this assumption is violated. All these cases are summarized in Table 1.

2.4 The Class-Cluster Perceptual Correlation (CCPC)

The above description concerns the test of the CCCA in the original multidimensional space. Of course, a user cannot perform this analysis without automatic clustering or analysis techniques [23]. However, using DR techniques, it might be possible for the user to test the validity of the CCCA through visual analysis. The user expects the map to reliably represent the original data's pairwise similarities as points' pairwise proximities, because this is what a DR technique is usually designed to optimize. Therefore the user will attempt to infer properties of the multidimensional labeled data from the visual patterns related to clusters and classes she will detect in the scatterplot. Thus we shall understand what kind of visual patterns the visual system of the user might detect, and the related properties she is likely to infer about the multidimensional data. Due to well-known visual perception mechanisms [76], the Gestalt law of proximity leads the user to mentally group nearby points together (visual clustering) pre-attentively, i.e. almost instantly and with almost no cognitive effort, and so to group together the data these points represent. Conversely, if a point appears isolated from the other points (visual outlier detection), then she is likely to think at first sight that the same holds true for the data this point represents. When data come with class labels, they can be graphically represented for instance by coloring the points according to the labels. The class label being a categorical variable, it can obviously be mapped to the shape of the points instead of their color, or to another relevant graphical variable [76]; in the sequel we consider that color is used, with no loss of generality. In this case the Gestalt law of similarity comes into play: points given the same color are likely to be grouped together pre-attentively, and so are the data they represent. As a consequence, scatterplots with colored points give rise to a potential conflict between the perceptual grouping formed by the colors (classes) and the one formed by the proximities (clusters). A set of controlled experiments with colored squares arranged in a grid has been proposed in [28]. It showed that color and space feature axes strongly interfere in visual tasks like detecting class outliers or counting the number of classes. The best situation to achieve these tasks arises when color and spatial features are correlated.
Another study [63] proposed a taxonomy of visual cluster separation factors in scatterplots to explain the failures of state-of-the-art separation measures. Here again, the more strongly the colors and positions of the points are correlated, the more easily the cluster separation is identified. So this work suggests that the correlation between proximity and color perceptual groupings is of primary importance for visual cluster separation tasks too. Furthermore, a low Class-Cluster Perceptual Correlation (CCPC) resulting from a conflict between proximity and similarity grouping is more likely to draw the attention of the user, who wonders why colors and proximities are not correlated. Indeed, a conflict between perceptual proximity and similarity seems to be an atypical circumstance in nature, as experimentally shown in [41], where artificial neurons receiving retina signals self-organize through the stream of natural images so that nearby neurons are excited by similar signals. So a low CCPC is more likely to puzzle the user and to result in a higher cognitive load than a high CCPC. It must be noted that the CCPC can be low either when several classes overlap in a single cluster (low between-CCPC), or when a single class is shattered into several clusters (low within-CCPC). The CCPC is high when both the between-CCPC and within-CCPC are high (see Table 1). Therefore the CCPC acts as a visual pattern detector especially sensitive to the violation of the CCCA in the visual space, so it gives a practical way for a human to possibly check the validity of the CCCA in the original data space too. But this can happen only when the data space geometry matches the one of the display without distortions, i.e. when the points' pairwise proximities on the screen closely match the data pairwise similarities in the data space, so essentially when the data space is already 2-dimensional. However, in general the data space is more than 2-dimensional, so the map obtained through DR is a distorted representation of the multidimensional data. We shall understand how these distortions impact the checking of the CCCA using the CCPC, and shall derive a DR technique which is the least detrimental to this visual analytic processing of labeled data.


2.5 Mapping distortions

Due to the lossy compression which occurs when reducing the data dimension, maps come with mapping distortions. This is likely to lead the user to visually infer wrong conclusions about the original data based on their distorted map. Namely, data that are grouped or isolated in terms of their similarities in the multidimensional data space are not necessarily mapped as grouped or isolated points respectively in terms of their proximities in the 2-dimensional space. Although a further distinction could be made between geometrical and topological mapping distortions, at first glance they are based on some neighborhood preservation [40] and may be of two different types [3, 45]:
- a tear occurs when data i and j are originally neighbors of each other (high similarity) while their projections as points i and j on the map are not neighbors (low proximity) (Figure 1, map A);
- a false neighborhood occurs when points i and j are neighbors on the map (high proximity) while the original data i and j are not neighbors (low similarity) (Figure 1, map B).

Many papers in the Machine Learning and Visualization communities have proposed techniques to evaluate these mapping distortions quantitatively or qualitatively [22, 36, 37, 42], in order to select the best DR technique and to limit wrong pattern inferences. For instance, CheckViz [45] displays data as a scatterplot with an additional background color coding to show the local distortions. In [50], edge bundles are used to display tears while background colors encode false neighbors. And in [3], background color is used to display the actual similarity to a reference data point selected by the user. We analyze all the possible distortions which could occur when mapping labeled data (see Figure 2). Given a focus data point lying in the multidimensional space and its projection as a focus point in the low dimensional space, both (obviously) assigned to the same label (blue star), six different cases can happen for any other data point, depending on its position inside or outside the neighborhood of the focus data point in the original space and of the focus point on the map, and on its label compared to the one of the focus: cases WFN and BFN correspond to false neighborhoods within and between the classes respectively, cases WT and BT correspond to tears within and between the classes respectively, and cases WC and BC correspond to correct within-class or between-class mappings respectively. Dealing with labeled data in an EDA framework, the above analysis showed that it is important for a DR technique to take the class structure into account so that the CCCA can be checked properly: points lying in a mixed-class area should be mapped in the same mixed-class area, and points lying in a pure-class area should be mapped in the same pure-class area. Obviously, if the pairwise similarities are perfectly preserved by the DR technique, so are the pure-class and mixed-class structures. But because distortions of the similarities are unavoidable, the question becomes how far the class structures can be preserved.
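To make this taxonomy concrete, the following sketch labels every data pair with one of these six cases, given the original coordinates, the mapped coordinates, the class labels and fixed neighborhood radii. It is a minimal illustration under our own assumptions: the paper leaves the neighborhood definition open, and the function and variable names are ours, not the authors'.

```python
import numpy as np

def distortion_types(X, X_map, labels, radius, radius_map):
    """Label each pair (i, j) with one of the six cases of Figure 2:
    WT/BT   within/between-class tear (neighbors in the original space only),
    WFN/BFN within/between-class false neighborhood (neighbors on the map only),
    WC/BC   correct within/between-class mapping (agreement in both spaces).
    Neighborhoods are crude fixed-radius balls, chosen here for simplicity."""
    X, X_map, labels = np.asarray(X), np.asarray(X_map), np.asarray(labels)
    cases = {}
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            near_orig = np.linalg.norm(X[i] - X[j]) <= radius
            near_map = np.linalg.norm(X_map[i] - X_map[j]) <= radius_map
            same = labels[i] == labels[j]
            if near_orig and not near_map:      # a tear
                cases[(i, j)] = "WT" if same else "BT"
            elif near_map and not near_orig:    # a false neighborhood
                cases[(i, j)] = "WFN" if same else "BFN"
            else:                               # a correct mapping
                cases[(i, j)] = "WC" if same else "BC"
    return cases
```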
2.6 The impact of the distortions when inferring the CCCA from the CCPC

The user tends to rely on the CCPC to infer the validity of the CCCA, but this inference is not straightforward because of the mapping distortions induced by DR techniques. In order to analyze the impact of mapping distortions on the EDA of multidimensional labeled data through their 2-dimensional representation, we have to consider all the possible combinations for a class to lie in a single cluster, to be shattered into several clusters or to be overlapped by other classes, both actually in the original data space and perceptually in the map, in an independent way. Examining these combinations results in 12 typical cases depicted in Table 2, which depend on whether the CCCA is valid, false-between or false-within, and whether the CCPC is high, low-between or low-within. Cases I to VIII refer to mappings where a correct inference can be obtained about the multidimensional data based on their 2-dimensional visualization. Cases IV and V show that tears (IV) or false neighborhoods (V) at the interface between two classes are not detrimental to the visual inference regarding the CCCA. Cases VII and VIII show that tearing or collapsing a mixed-class area does not change the inference that either the labels are erroneous or some discriminant variables are missing. Cases IX, X, XI and XII lead to possibly wrong inferences or overlooked information regarding the CCCA. Cases IX and X refer to gathering or splitting a single class, while cases XI and XII refer to mixing or un-mixing two classes.


The above analysis is obviously informal: the way clusters are defined, the way the class-cluster correlation is measured in the multidimensional space, and the way the human visual system detects clusters and effectively measures the class-cluster perceptual correlation in the map should all be specified and formalized in a practical instantiation of this analysis framework. However, the merit of this CCCA-CCPC framework is to explain the rationale hidden behind a particular class structure, and thus the importance of its knowledge for EDALD. We will use Table 2 in section 3 to derive our ClassiMap approach, but first we shall examine a toy example to clarify our goal regarding the limits of the state-of-the-art DR techniques, seeing how they generate different kinds of between-class and within-class distortions.

2.7. A toy example

Facing labeled data, it is natural to consider supervised DR techniques as better able to deal with EDALD, because unsupervised DR techniques ignore class labels while supervised ones take these labels into account. However, a close study of the supervised DR techniques (section 3.2) will show that they are designed to match a predictive modeling objective which is not satisfying for EDALD. In order to illustrate this issue, we propose a toy experiment that illustrates the limits of both unsupervised and supervised standard DR techniques for EDALD and presents the type of maps we claim to be desirable. This example allows us to clarify our goal; the technique to compute such a map will be detailed in section 4. We considered a two-class dataset whose data are drawn uniformly from the faces of a 3-dimensional cube, as shown in Figure 3. Two faces of the cube are blue, two others are red, and the last two are a balanced mix of the red and blue classes. From a perceptual point of view and also from a statistical point of view, the labels in the mixed regions can be inferred from the sample to come from nearly the same underlying distribution with some fixed mixing proportion of the red and blue classes (here the underlying distribution is a regular grid with equal probability for each point to be assigned to the red or blue class). This region is likely to be perceived by a user or modeled by a machine as a single mixed-blue-red meta-class. Ignoring the exact proportion in the mixed areas, and focusing on a two-class dataset, we can at first glance define three distinct meta-classes: pure-blue, pure-red and mixed-blue-red, and recover the typical setting described in the previous section: in terms of within-class structure, each class is made of one connected component (valid within-CCCA), and in terms of between-class structure each class partly overlaps the other one (partly valid between-CCCA). We would like the mapping to preserve this information in the scatterplot, so as to have a high within-CCPC (blue and red classes lying in one component each) and a low between-CCPC (partial overlap of the blue and red classes). Let us see how the 3-dimensional data from the cube's faces are projected onto a plane using different supervised and unsupervised DR techniques. Here we focus on the qualitative outcomes of this experiment. We must emphasize that there exists no perfect 2-dimensional representation of the initial cube in the plane which would preserve all the original pairwise data similarities, i.e. mapping distortions cannot be avoided, because it is not possible to preserve the sphere-like topology of the original cubic manifold by embedding it in the plane.
Still, the trivial topological structure of the classes can be preserved, at least as in the pattern of the manually unfolded cube shown at the top of the figure: in this pattern, the CCPC lets us see a single component for the blue class, a single component for the red class and a partial overlap of the blue and red classes, as expected for the cube data. The knowledge of the original data and the original class structure allows us to see that, among all these mappings, only ClassiMap performs well in preserving both the high pairwise similarities and the class structure: the number of connected components of each class is still the same (high within-CCPC), and the mixed-blue-red meta-class is also preserved (low between-CCPC in this area). We will propose in section 5 two specific quantitative measures to quantify how much this class-structure information is preserved. CCA and LMDS provide maps similar to ClassiMap's in terms of pairwise similarity preservation but, not taking care of the labels, they are more destructive of the class structure: they can shatter classes into several components. This clearly shows that even good mappings in terms of pairwise similarity preservation can be more or less efficient in class structure preservation. Other unsupervised techniques like PCA, Isomap, LLE or DD-HDS are even more prone to destroy the original class structure, merging distinct classes more than they actually overlap, as they ignore the label information. Standard supervised mapping techniques are usually proposed as better solutions than unsupervised mappings to deal with the mapping of labeled data. They take the data labels into account, attempting to separate on the map the points with different classes in order to improve the class prediction of unseen data. However, they usually put too much emphasis on class separation, leading to over-separation: they separate even originally overlapping classes, largely ignoring the data pairwise similarities. This can be seen in Figure 3, where the supervised mappings put a larger emphasis on label separation than on the preservation of the original pairwise similarities. Ignoring the labels, and due to the symmetrical distribution of the cube dataset, many mappings of the cube are possible with the same DR technique reaching the very same final amount of distortions regarding the original and mapped pairwise similarities, as seen for ClassiMap, CCA and LMDS here.

But considering the class labels, it becomes possible to take them into account to favor some of these otherwise equal solutions. This is how ClassiMap operates: if distortions must occur to comply with the pairwise similarity preservation objective, these distortions are driven toward separating points with different labels or gathering points with the same label. Importantly, if no distortion is necessary to get a perfect map in terms of pairwise similarity preservation, then ClassiMap will not enforce separation between different-label points, nor will it enforce gathering of same-label points. Class structure preservation is thus an incidental objective of ClassiMap compared to pairwise similarity preservation: it occurs when it can, depending on the amount of distortion necessary to preserve the pairwise similarities while reducing the dimension. We shall now detail the state of the art and explain how it and the CCCA-CCPC analytic framework have been used to design ClassiMap and to give it the properties it exhibited in the above toy example.

3. STATE OF THE ART AND GOALS

3.1. Unsupervised DR techniques

Since Torgerson's classical Multidimensional Scaling (classical MDS) [68], many DR techniques have been based on the distance preservation paradigm. Interested readers may refer to [21, 40, 48, 78] for a deeper state of the art on unsupervised DR techniques. We focus here on some of the most typical non-linear ones, which have been later extended to the supervised case. Isomap [67] approximates the geodesic distances along the underlying data structure with path lengths measured on the data k-nearest-neighbor graph, and maps these distances into a low dimensional space with Torgerson's classical MDS. The Locally Linear Embedding (LLE) [60], Laplacian Eigenmaps [9] and related techniques reconstruct each data point based on a linear weighting of its k nearest neighbors and project the points on the map based on this weighting scheme (please note that an improvement of related techniques has been proposed [30]). Other techniques such as NeRV [74], Sammon's mapping [61], Curvilinear Component Analysis (CCA) [16], Local MultiDimensional Scaling (Local MDS) [73] and Data-Driven High Dimensional Scaling (DD-HDS) [42] are cost-function-driven algorithms. For these latter DR techniques m, the cost function is called the global stress and is denoted E_m in the following. It takes the general form of a sum over all data pairs (i,j) (eq. 2) of local stress functions S_m(i,j) (eq. 1):

S_m(i,j) = |d_ij − d*_ij|^p · F_m    (1)

E_m = Σ_{i,j} S_m(i,j)    (2)

where d_ij = ||x_i − x_j|| and d*_ij = ||x*_i − x*_j|| denote the Euclidean distances between data i and j in the original space and on the map respectively, p belongs to [1,+∞[ (typically p = 2), and F_m is a decreasing function on IR+ which equals 0 or tends towards 0 for large input values. F_m depends on the distances d_ij or d*_ij according to the technique m, and is designed to emphasize short distances depending on a bandwidth parameter λ. Many functions have been proposed for F, including the Heaviside step function: F(u) = 1 if u ≤ λ and F(u) = 0 otherwise. As we saw in the toy example of section 2.7, none of these unsupervised DR techniques is able to preserve the class structure in general because they ignore the class labels, and we saw in Table 2 that the same proximity pattern involving pure-class or mixed-class regions (patterns II and VI for instance) can drive different inferences regarding the class structure. Therefore, as a Desired Property (DP) of a DR technique designed for EDALD, the data labels shall be considered (DP1), so that originally pure-class or mixed-class regions are mapped as pure-class or mixed-class regions respectively.

3.2. Supervised DR techniques

Supervised mapping techniques take the class labels of the data into account while mapping the data. A classical supervised mapping technique is the Linear Discriminant Analysis (LDA) [20, 22]. LDA aims at finding a linear subspace onto which the linear projection of the data points best separates the classes, in the sense that it maximizes the ratio of the between-class to the within-class variance. LDA is based on a Euclidean space hypothesis, but an extension of LDA based on a kernel function, called the Kernel Fisher Discriminant Analysis (KFD) [51], has been proposed to deal with non-linear structures. Another non-linear version of LDA has been recently proposed [19]. The Supervised GP-LVM [24] introduces the classes as an additional latent variable while performing a classical GP-LVM mapping.

The Supervised Isomap with Explicit mapping (SE-Isomap) [47] and the Supervised Curvilinear Component Analysis [38, 39] are supervised extensions of Isomap and CCA respectively, which map each class separately and attempt to organize the resulting maps afterward. The map assembling step is challenging when several classes overlap or a class is disconnected. Witten's Supervised MDS [79] uses a cost function that results from a trade-off between distance-preserving and class-preserving criteria. Semi-supervised versions of LLE (called SS-LLE), Isomap (SS-Isomap) [80] and Local Tangent Space Alignment [82, 84] (SS-LTSA) have been proposed. These techniques suppose that the mapping of a labeled subset of the data has already been achieved in some supervised way; then, they project the whole dataset one more time so as to map the labeled subset as close as possible to its previous position. Another approach to deal with labeled data has been used extensively: first modify the original distances in order to favor class separation, and then map the data by applying state-of-the-art unsupervised mapping algorithms to these modified distances. For instance, the Supervised Locally Linear Embedding (S-LLE) [35, 58], the Local Fisher Embedding (FLLE) [59], the Enhanced Supervised Locally Linear Embedding (ES-LLE) [83] and the Probability-based Locally Linear Embedding (PLLE) [85] use this approach based on the unsupervised LLE. S-LLE turns the distance d_ij into d_ij + αD (where D is the maximum distance in the dataset) if the data i and j do not belong to the same class, while d_ij remains unchanged otherwise. The parameter α must be set within [0,1]. S-LLE with α = 0 corresponds to the original unsupervised LLE (the distances are not changed); otherwise, the greater α, the stronger the over-separation of the classes. The Supervised Isomap (S-Isomap) [25, 77] extends Isomap to the supervised case, running the classical Isomap algorithm on a modified distance:

d'_ij = √(1 − e^(−d_ij² / β)) if the data i and j belong to the same class, and d'_ij = √(e^(d_ij² / β)) − α otherwise.

Default parameters are proposed for α, and β is set equal to the average of all the pairwise distances. Supervised NeRV (S-NeRV) [74] performs the unsupervised NeRV algorithm [74] with distances weighted by the Fisher information matrix, which locally stretches or shrinks the original coordinate axes depending on their conditional class distribution [31, 33, 34, 55].
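As an illustration of this distance-modification approach, the following sketch implements the S-LLE and S-Isomap transformations as we reconstructed them above. The function names and the NumPy formulation are ours, and D denotes the matrix of original pairwise distances.

```python
import numpy as np

def slle_distances(D, labels, alpha):
    """S-LLE-style dissimilarities: add alpha times the dataset diameter
    to every between-class distance and leave within-class distances
    unchanged. alpha = 0 recovers plain LLE; the greater alpha, the
    stronger the over-separation of the classes."""
    labels = np.asarray(labels)
    between = labels[:, None] != labels[None, :]
    return D + alpha * D.max() * between

def sisomap_distances(D, labels, alpha, beta=None):
    """S-Isomap-style dissimilarities: exponentially shrink within-class
    distances and stretch between-class ones. beta defaults to the
    average pairwise distance, as in the proposed default setting.
    (Note: exp(D**2 / beta) may overflow for very large distances.)"""
    if beta is None:
        beta = D.mean()
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    within = np.sqrt(1.0 - np.exp(-D**2 / beta))
    between = np.sqrt(np.exp(D**2 / beta)) - alpha
    return np.where(same, within, between)
```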

In most of these latter techniques, a supervised pre-processing of the original distances is performed in order to take the classes into account within the subsequent unsupervised mapping. We ran another toy experiment, displayed in Figure 4, to complement the one depicted in Figure 3; it makes clear that these supervised mapping techniques, which change the original similarities by modifying the metric locally, first tend to artificially tear the classes, and second do so even when a perfect mapping exists as a trivial solution. In this example, the original data points are located on a 2-dimensional square grid with a random class assignment. The original and the projection spaces are both a 2-dimensional plane, so it is obvious that the optimal mapping is the identity function, which maps each data point onto itself and comes with no distortion. We used the original data x as the initial positions of their images x* through the mapping and then started the optimization process for each technique. This simple experiment is a sanity check: we expect the mapping of the original data and the original data themselves to be exactly identical, because no dimension reduction occurs. Figure 4 shows that all the supervised techniques except ClassiMap fail to preserve the original pairwise similarities and even tend to separate the classes artificially while both classes overlap in the original space. We claim that such a transformation of the original distances is not desirable in an EDALD framework for the following reasons:

- Change of the original similarities depending on the class labels: the primary available information (i.e. the original similarities) is changed depending on the class labels before the application of the mapping itself; hence the mapping based on the supervised similarities can show distortions with respect to the original similarities even when the labeled data could have been mapped with no distortion with respect to these original similarities (see the toy example described above and displayed in Figure 4).

- Class separation enforcement: class separation is enforced on the map (high between-CCPC) even when classes actually overlap in the original space (false between-CCCA), purposely changing the structure of the classes (typically un-mixing the mixed classes) to improve the class prediction of unseen data. This can generate false inferences (see Table 2, case XII), letting the user think that labels and similarity are correct (erroneous inference that the between-CCCA is valid) while ignoring possibly erroneous labels or possibly missing discriminant variables.

Considering these two limits of supervised DR techniques, we propose that a DR technique designed for EDALD should have the following desirable properties:

- No change of the original similarities (DP2): the original similarities shall not be modified depending on the class labels, so that perfect maps can be obtained if they exist;
- Class overlapping preservation (DP3): in order to preserve the mixed-class areas, class separation shall not be systematically enforced, because in the EDALD framework, depicting the actual class structure in the map is important to evaluate the validity of the CCCA by relying on the CCPC;
- Class separation enhancement (DP4): while class separation enforcement can be detrimental to a correct inference of the between-class structure, it is nevertheless desirable for class separation to be enhanced if and only if the classes are already separated in the original space (see Table 2, case IV). Indeed, enhancing separation is likely to ease the visual perception of the between-class structure, as suggested by the user experiments in [28], while being qualitatively not detrimental to the correctness of the inference brought by this perception (inferring that two classes are actually separated is visually more efficient in terms of cognitive load, due to pre-attentive processing, if these classes appear as distant pure-class regions rather than as adjacent pure-class regions).

We shall further analyze Table 2 in order to derive the technical machinery to obtain the desirable properties DP1 to DP4.

3.3 Setting the technical objectives of DR for EDALD

The desirable property DP1 obviously requires that the existence of different labels impacts the objective function of the intended DR technique. As in standard supervised techniques, we will consider the within-class and between-class similarities distinctively for this sake. However, the desirable property DP2 imposes that this class distinction occurs only in proportion to the distortions necessary to achieve the mapping. Therefore we propose to match DP2 by driving the unavoidable distortions in such a way that the desirable properties DP3 and DP4 are matched too. Regarding the four possible distortions described in section 2.5, exploratory data analysis through the data mapping will be optimally trustworthy if cases WFN, BFN, WT and BT are avoided, as this preserves all the pairwise similarities and so the class structure at the scale of the neighborhood width. However, in practice distortions cannot be avoided; therefore we have to choose which distortions are more detrimental than others to the class structures. Looking at Table 2, we see that there are two distinct and independent analyses depending on whether we focus on a given class (within-class patterns in cases I, II, IX and X) or on the relation between two classes (between-class patterns in cases III to VIII, XI and XII). Remember that in the context of EDALD, the DR technique is only a means to enable visual inference about the actual multidimensional data and the structure of the classes. Therefore, the actual within-class structure can be analyzed on a class-by-class basis, mapping each class separately. However, the between-class structure necessarily involves at least two classes mapped together. And DP3 and DP4 clearly involve between-class relations and so will be impacted by between-class distortions. Therefore we might consider that, when mapping several classes together, we must strongly penalize all the between-class distortions while we may tolerate the within-class distortions, because the latter can be handled for each class separately. Although this sounds rational, we shall analyze Table 2 in more detail. As can be seen, only cases IX, X, XI and XII correspond to false inferences or to overlooked information about the actual class structure. Given that we should focus on between-class structure preservation, only cases XI and XII shall be considered in priority. It must be noted that case XII corresponds to the case where an actually mixed-class region is artificially separated by the DR technique. This is exactly what most of the supervised DR techniques tend to achieve (see Figure 4), but they do so by forcing the separation, modifying the original similarities, and so by going against DP3. However, this un-mixing process is very unlikely to occur without forcing the between-class separation. Indeed, mapping the points of a mixed-class region either by shattering it into several still mixed-class parts (case VII) or by collapsing it further into a still mixed-class smaller region (case VIII) are both far better solutions with respect to similarity preservation than shattering the mixed-class region into distinct pure-class clusters, which would generate many between-class tears (case XII). In other words, DP3 would be naturally achieved if class separation were not enforced.
On the contrary, case XI, where well-separated classes end up overlapping in the map, is very likely, as the reduction of dimension generally imposes distance reductions and so goes against DP4. Therefore we propose to avoid case XI by penalizing between-class false neighbors (BFN) and within-class tears (WT). This approach achieves the four desirable properties DP1 to DP4 we identified in the context of EDALD.

This version is not the camera ready. Please refer to the final published version for accuracy and completeness. In summary, we propose to penalize BFN and WT more than BT and WFN. Penalizing BFN will make false neighborhoods more likely to happen within classes (WFN, case X), which is of low consequence in terms of the inference regarding the within-CCCA as we can analyze the within-class structure for each class separately with any unsupervised DR technique. Penalizing WT will also make tears more likely to happen between classes (BT, case IV), which is anyway unlikely to un-mix mixed-class regions (case XII) as we discussed above so achieving the class overlapping preservation (DP3). Penalizing WT, because it favors BT will enable bigger separation of already separated classes (case IV) achieving the class separation enhancement (DP4) still without being detrimental to the correctness of the inference regarding the between-CCCA (case IV). Now we shall describe how ClassiMap is designed to achieve these technical objectives.

4. THE CLASSIMAP TECHNIQUE

4.1. The ClassiMap global stress

Mapping techniques cannot preserve all the original pairwise distances in general [44, 45, 62], so they come with distortions, as explained in section 2.5. It is well known [42, 44, 45, 73] that minimizing the Sammon stress function [61] more strongly penalizes tears than false neighborhoods, while minimizing the CCA stress function [16] more strongly penalizes false neighborhoods than tears (see Table 2). Their respective local stress functions are:

S_Sammon(i,j) = |d_ij − d*_ij|^p · F(d_ij)    (3)

S_CCA(i,j) = |d_ij − d*_ij|^p · F(d*_ij)    (4)

For instance, let us consider the Sammon local stress (eq. 3) and suppose two different cases where the mapping error |d_ij − d*_ij| is similar: in the first case, the data i and j are close points mapped far apart (a tear), while in the second case, i and j are far apart data mapped as neighbors (a false neighborhood). In the former case F(d_ij) will be higher than in the latter case; therefore the minimization of the corresponding global stress (eq. 2) will avoid the first case and, in so doing, will favor the second one as the only alternative, generating false neighborhoods. For symmetric reasons, the minimization of the CCA global stress favors tears. The choice of the mapping technique thus gives access to a certain level of control on the type of distortions. Therefore the way the different kinds of mapping distortions are distributed among the pairs of data is a degree of freedom we can act on to achieve the desired goals of supervised mappings, without the need to modify the original distances prior to the projection (DP2). ClassiMap is based on this idea. ClassiMap drives the unavoidable distortions to places where their impact is minimal with respect to the class separation enhancement and class overlapping preservation DPs, as discussed in section 3.3. This is achieved by applying the Sammon or CCA local stress functions alternatively, depending on the class co-membership of the data pairs (eq. 5) (DP1). Let us define A, the co-membership matrix of the data to the classes. A is a square n×n matrix, where n is the number of data points. The value A_ij quantifies the class similarity between data i and j: A_ij is set to 1 if the data i and j belong to the same class, and to 0 otherwise. The ClassiMap local stress is written:

S ClassiMapi, j   

d ij  d ij*

p



 

 Aij  F d ij   1  Aij  F d ij*

Aij  S Sammoni, j   1  Aij  S CCA i, j 

(5)

While minimizing the ClassiMap global stress function E_ClassiMap, points from the same class tend to be grouped together, while points from different classes tend to be separated, fulfilling the class separation enhancement (DP4). However, this occurs only if the mapping of local pairwise distances cannot be faithfully achieved (DP2). If classes overlap in the original space, there is no significant difference between local within-class and between-class distances; hence it is unlikely that a tear will occur specifically between the classes to separate them on the map while leaving the within-class distances unchanged, which fulfills the class overlapping preservation (DP3). In brief, since ClassiMap strongly penalizes within-class tears (WT) and between-class false neighborhoods (BFN), it drives the unavoidable mapping distortions toward between-class tears (BT) and within-class false neighborhoods (WFN), thus where they maximally preserve the original between-class structure (DP3 and DP4). Table 2 summarizes in its right columns the strength of the penalization applied by CCA, Sammon and ClassiMap to the different cases (the darker the red color, the stronger the penalization).
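A direct transcription of eqs. (2) and (5) into code may make the mechanism explicit. This is a minimal sketch with names of our own: D and D_map are the original and map distance matrices, and F is any vectorized weighting function such as the one of eq. (7) below.

```python
import numpy as np

def classimap_stress(D, D_map, labels, F, p=1):
    """Global ClassiMap stress (eqs. 2 and 5): the Sammon-like weight
    F(d_ij) is applied to same-class pairs (A_ij = 1, strongly penalizing
    within-class tears) and the CCA-like weight F(d*_ij) to different-class
    pairs (A_ij = 0, strongly penalizing between-class false neighbors)."""
    labels = np.asarray(labels)
    A = (labels[:, None] == labels[None, :]).astype(float)
    S = np.abs(D - D_map) ** p * (A * F(D) + (1 - A) * F(D_map))
    iu = np.triu_indices(len(D), k=1)   # count each pair (i, j) once
    return S[iu].sum()
```

Replacing A by a constant matrix recovers the Sammon (A_ij = 1), CCA (A_ij = 0) and LMDS stresses discussed below.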


Note that ClassiMap is a multidimensional scaling technique: the input of the algorithm is a distance or dissimilarity matrix, with no need for a Euclidean hypothesis. It is similar to Local MDS (LMDS) [73], whose local stress function also balances between the Sammon and CCA stresses with a parameter τ:

S_LMDS(i,j) = τ · S_Sammon(i,j) + (1 − τ) · S_CCA(i,j)    (6)

where  lies within [0, 1] ( = 0 strongly penalizes false neighborhoods while  = 1 strongly penalizes tears). However, the LMDS’s parameter  is a global constant that the user has to set arbitrarily and which is independent of the class comemberships making LMDS an unsupervised mapping technique. On the contrary, the ClassiMap’s parameter Aij is local and automatically tuned depending on the class labels of data pairs resulting in maps totally different from LMDS ones. Therefore ClassiMap better preserves the class structure and so is more suitable to EDALD than LMDS. Therefore, if we set Aij =  for any i and j then the Sammon’s mapping, CCA and Local MDS are all special cases of ClassiMap: if  = 1 every data belong to the same unique class and ClassiMap behaves like Sammon’s mapping, while if  = 0, each data is the single member of its own class and ClassiMap behaves like CCA. 4.2. Parameters setting and optimization Many functions can be used for F. In the following, we use the one proposed in [42] for its robustness to the concentration of measure phenomenon [1, 8, 17, 26, 52]: x (7) F x   1  f u,  , du





where

  meandij   21    std dij  ,   2 std dij  i, j

and f u,  ,   is the Normal distribution with mean μ and

i, j

i, j
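As a sketch, F can be built from the distance statistics as follows, with scipy's Normal CDF playing the role of the integral in eq. (7); the helper name is ours, and the σ setting follows the reconstruction above.

```python
import numpy as np
from scipy.stats import norm

def make_F(D, gamma):
    """Weighting function of eq. (7): one minus the CDF of a Normal whose
    parameters are derived from the statistics of the original pairwise
    distances. A high gamma lets large distances influence the mapping;
    a low gamma emphasizes short ones."""
    iu = np.triu_indices(len(D), k=1)
    m, s = D[iu].mean(), D[iu].std()
    mu = m - 2.0 * (1.0 - gamma) * s
    sigma = 2.0 * s
    return lambda x: 1.0 - norm.cdf(x, loc=mu, scale=sigma)
```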

The parameter p is set to 1. The parameter γ controls the influence of the neighborhood; it decreases linearly from 0.9 to 0.1 during the map optimization. Notice that using such a function F does not assume that the distances are normally distributed, as explained in [42]. There are many different possibilities for minimizing E_ClassiMap, but we chose the following one because we are familiar with it. Recall that we do not claim to propose the best optimization technique for minimizing E_ClassiMap, but rather to propose E_ClassiMap as a relevant objective function whose minimization achieves a mapping enabling EDALD through a scatterplot graphical representation. The minimization of E_ClassiMap is achieved thanks to a Force-Directed Placement (FDP) technique [14, 18, 53] similar to the one used in [42], except for the stress function. FDP is a technique introduced by Eades [18] for graph drawing, where graphs are compared to spring systems: nodes are associated to masses, and edges to springs between these masses. The springs cause forces that move the masses; after relaxation of the system, the final configuration of the masses is assumed to draw a fair representation of the graph. The similarity matrix that a map attempts to represent is equivalent to the adjacency matrix of a weighted complete graph. Based on this fact, FDP has been used in dimensionality reduction techniques [42, 14, 53]. Masses are then associated to data, and springs link each mass to every other one. The length at rest of the spring between two masses equals the distance between the corresponding data in the original space. The resulting placement of the masses is a map of the data in the output space. Most stress-driven mapping techniques can be optimized through an FDP algorithm by adapting the stiffness of the springs: indeed, controlling the stiffness of a spring allows controlling the risk of distortion on the corresponding distance. Here, the stiffness is set to A_ij·F(d_ij) + (1 − A_ij)·F(d*_ij). The force acting on the mass j due to the mass i is defined as:

φ(i,j) = ((d_ij − d*_ij) / |d_ij − d*_ij|) · S_ClassiMap(i,j) · u_ij    (8)

where u_ij is the unit vector oriented from x*_i to x*_j on the map. Forces generate data movements on the map that minimize the global energy of the system. At time t, a mass i is characterized by its position in the output space (noted x*_i(t)), its speed (noted v_i(t)) and its acceleration (noted a_i(t)). a_i(t) is given by the resultant of the forces applied on the mass i: a_i(t) = Σ_{j≠i} φ(j,i).

At time t + Δt, the speed v_i(t) is modified according to a damping coefficient α (set to 0.7), in such a way that v_i(t + Δt) = a_i(t)·Δt + α·v_i(t), and this induces the movement of the mass i as x*_i(t + Δt) = v_i(t + Δt)·Δt + x*_i(t). The stability level of the mass-spring system may be measured by the kinetic energy

Energy(t) = (1/2) Σ_i ||v_i(t)||²

sum of the forces on each mass vanishes (i.e. the level of stability becomes lower than a given threshold). The resulting map can reach a local minimum of the cost function [42, 14, 53]. Such an algorithm may be started from a random map, but for the sake of efficiency, an iterative process has been preferred. The algorithm first builds a global representation based on a small subset of the data, and progressively increases the number of considered data during the mapping process. Each addition of a new subset of data is followed by a mass-spring relaxation phase in order to improve the mapping. The order of selection of the data is the same as the one used in DD-HDS [42]: The first selected data is the one for which the sum of the distances to all other data is minimum. The next selected data is the one that would minimize the quantization error if it were selected. The quantization error is defined as the sum of the distances in the original space between all unselected data and their nearest neighbor among the selected data. The selection is carried on similarly until depletion of the set of unselected data. At first, The D+1 firstly selected data are positioned in the output space (where D is the dimensionality of the output space). Indeed, a perfect placement of (D+1) data in a D-dimensional output space is always possible. Next, the number of data in the subset that will be mapped onto the output is increased (doubled for instance) according to the selection order. The FDP algorithm allows reaching a stable configuration for the subset in the output space. The data selection phases and subset representation phases are alternated until depletion of the set of unmapped data. For the selection phases, we chose to systematically double the number of selected data in the subset. Indeed, in our experience, a faster increase of the number of data may lead to a higher risk of performing a poor map and a slower one would increase the computation time. At the beginning of the algorithm, selected data are far from one another, due to the data ordering procedure. A high value is then chosen for the parameter  (see equation (7)), so that large distances influence the mapping. When the number of data is increased, the value of  is decreased so as to emphasize smaller distances. The final  value reflects the user-driven trade-off between the efficiency of the local and global data configurations. In the present work,  was linearly decreased from 0.9 to 0.1 step-by-step, one step at each mapping phase.
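The data-ordering procedure can be sketched as follows (a quadratic-memory, unoptimized version; the function name is ours):

```python
import numpy as np

def selection_order(D):
    """Greedy selection order used before the incremental FDP relaxation.

    D: (n x n) matrix of original-space distances. The first point minimizes
    the sum of distances to all others; each subsequent point is the one
    whose selection would minimize the quantization error, i.e. the sum over
    the unselected points of the distance to their nearest selected point."""
    n = D.shape[0]
    order = [int(np.argmin(D.sum(axis=1)))]
    nearest = D[order[0]].copy()  # distance of each point to its nearest selected point
    while len(order) < n:
        remaining = [i for i in range(n) if i not in order]
        # quantization error obtained if candidate i were selected
        errors = [np.minimum(nearest, D[i])[remaining].sum() for i in remaining]
        best = remaining[int(np.argmin(errors))]
        order.append(best)
        nearest = np.minimum(nearest, D[best])
    return order
```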

5. EVALUATION CRITERIA
In our experiments, we shall compare ClassiMap with other supervised and unsupervised techniques. We use standard unsupervised criteria to compare the way the DR techniques and ClassiMap preserve the original pairwise similarities (DP2), purposely ignoring the class labels. Regarding labeled data (DP1), we also have to evaluate how well the Class overlapping preservation (DP3) and the Class separation enhancement (DP4) are achieved by the different DR techniques. Although many supervised mapping techniques have been proposed, there is no unified framework for their evaluation. In most papers, the evaluation consists in a visual inspection of the map and in measuring the accuracy of the map used as a classifier. We propose the Class OverLapping Index (COLI) to measure how much the classes are clustered in the map, i.e. how much they are un-mixed with the other classes and form pure-class regions. This index quantifies the achievement of the Class separation enhancement (DP4). Different indices could be used but, in order to stay coherent with the EDALD framework we aim at, we propose to use a descriptive measure of class purity originally defined in [70] and used here for the first time in the context of the evaluation of DR techniques for labeled data. Finally, we propose a new criterion named the Class Over-Separation Index (COSI), dedicated to the objective evaluation of mappings of labeled data with respect to the Class overlapping preservation (DP3). COSI is derived from COLI and, like COLI, it is to be considered in the framework of EDA of labeled data, not as a measure in a predictive modeling framework.
5.1. Evaluation of similarity preservation
5.1.1. Trustworthiness and continuity indices
Trustworthiness and continuity [71, 72] are used for the evaluation of structure preservation in non-linear mappings. This pair of indices is based on neighborhood ranks: it compares the sets of closest neighbors found in the original space and on the map. The trustworthiness index is sensitive to false neighborhoods while the continuity index is sensitive to tears. Other pairs of indices, such as Precision and Recall [73, 74] or the RankVisu indices [43], could be used as well.

E_{Trustworthiness}(k) = \frac{2}{n k (2n - 3k - 1)} \sum_{i} \sum_{j \in U_k(i)} ( r(i,j) - k )    (9)

E_{Continuity}(k) = \frac{2}{n k (2n - 3k - 1)} \sum_{i} \sum_{j \in V_k(i)} ( r^*(i,j) - k )    (10)

where n is the number of data points, k is the user-defined number of nearest neighbors, r(i,j) is the neighborhood rank (1 for the nearest neighbor, 2 for the second nearest, etc.) of data point j with respect to data point i in the original space; r*(i,j) is the neighborhood rank of data point j with respect to data point i on the map; U_k(i) and V_k(i) are the sets of data points which are among the k-nearest neighbors of i on the map but not in the original space, and in the original space but not on the map, respectively. These indices lie within [0, 1]; the lower the index, the better the map preserves neighborhood ranks and the better the map: E_Trustworthiness and E_Continuity are equal to 0 for maps with no false neighborhoods and no tears, respectively. However, note that a map that fairly preserves ranks can still be hardly readable (see S-Isomap in Figure 6 for example). For that reason, considering the distance preservation can also be fruitful.
5.1.2. CCA and Sammon's indices
Similarly to trustworthiness and continuity, the CCA global stress E_CCA is sensitive to false neighborhoods while the Sammon's global stress E_Sammon is sensitive to tears. However, they focus on distances rather than on ranks and have been used in [44, 45] to evaluate unsupervised mapping techniques. The function F is chosen to be a Heaviside step function (see section 2.2.) and p = 2, as for most of the stress-based techniques including Local MDS, CCA, and Sammon's mapping. Keep in mind that several techniques based on ranks or other similarity measures cannot be expected to accurately preserve the original distances (including LLE and S-NeRV). In these cases, distance preservation indices are less relevant, as we noticed in our experiments.
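As an illustration, a minimal Python sketch of the rank-based pair of indices (eqs. (9) and (10)), assuming NumPy and SciPy; the names are ours:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def trust_continuity(X, X_map, k):
    """E_Trustworthiness and E_Continuity of eqs. (9)-(10); 0 is best."""
    n = X.shape[0]

    def ranks(D):
        # rank[i, j] = 0 for j = i, 1 for the nearest neighbor of i, etc.
        order = np.argsort(D, axis=1)
        r = np.empty_like(order)
        r[np.arange(n)[:, None], order] = np.arange(n)[None, :]
        return r

    r_hd = ranks(squareform(pdist(X)))       # ranks in the original space
    r_ld = ranks(squareform(pdist(X_map)))   # ranks on the map
    scale = 2.0 / (n * k * (2 * n - 3 * k - 1))
    trust = cont = 0.0
    for i in range(n):
        U = (r_ld[i] <= k) & (r_hd[i] > k)   # k-NN on the map, not originally
        V = (r_hd[i] <= k) & (r_ld[i] > k)   # original k-NN, lost on the map
        trust += (r_hd[i][U] - k).sum()
        cont += (r_ld[i][V] - k).sum()
    return scale * trust, scale * cont
```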

5.2. Evaluation of classes' structure preservation
5.2.1. The Class OverLapping Index (COLI)
According to the Class separation enhancement (DP4), supervised mappings should emphasize class separation on the map so that the analyst can clearly visualize the non-overlapping classes at one glance. In order to quantify the degree of undesired class overlapping on the map, we define a class-purity index based on the one proposed in [70] in another context. This index, called the Class OverLapping Index (COLI) in the following, measures the average proportion of data points from the same class in the neighborhood of the points on the map:

COLI(k, C, M) = 1 - \frac{1}{n} \sum_{i=1}^{n} \delta_{k,C,M}(i)    (11)

where

\delta_{k,C,M}(i) = \frac{1}{k} \sum_{q \in kNN_M(i)} \mathbf{1}[C_q = C_i]

and 1[.] denotes the indicator function. Given a set of n class labels C assigned to the n data points (C_i is the class label of data point i), an integer k between 1 and n, and a metric space M, δ_{k,C,M}(i) is the proportion of the k-nearest neighbors of point i on the map M which belong to the same class as point i. The COLI can also be computed in the original space in order to get a baseline score for comparison, in which case we write it COLI(X, C, k). COLI lies between 0 and 1: the lower the COLI, the lower the class overlapping, the higher the class purity and so the class separation, the higher the CCPC and the better the achievement of the Class separation enhancement (DP4). Conversely, a high value of COLI indicates a map where the classes remain mixed.
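A minimal sketch of the COLI computation on a map (eq. (11)); `labels` is assumed to be an integer NumPy array of class labels:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def coli(X_map, labels, k):
    """Class OverLapping Index of eq. (11): one minus the mean fraction of
    same-label points among each point's k nearest neighbors on the map."""
    D = squareform(pdist(X_map))
    np.fill_diagonal(D, np.inf)              # a point is not its own neighbor
    knn = np.argsort(D, axis=1)[:, :k]       # indices of the k-NN of each point
    same = labels[knn] == labels[:, None]    # same-class indicator, shape (n, k)
    return 1.0 - same.mean()
```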

Increasing the class separation is good, but too much class separation is not, especially when the original data overlap but are artificially separated by standard supervised mapping techniques. Therefore we define the Class Over-Separation Index to measure when too much separation occurs.
5.2.2. The Class Over-Separation Index (COSI)
As explained in section 3.2 and shown in the Figure 4, supervised mapping techniques tend to over-separate the classes, i.e. to separate the classes even when they strongly overlap in the original space. We propose the Class Over-Separation Index (COSI) to evaluate this tendency. The idea is to measure, using the COLI, the average degree of separation on the maps obtained after random permutations of the data labels: a strong separation while the class labels have been randomly assigned means that the mapping technique is prone to over-separation. For each of z trials, the class labels are permuted at random and the data with these shuffled class labels are projected with the supervised mapping technique to be evaluated (i.e. z supervised mappings are computed from scratch). The COSI is one minus the average COLI purity index computed over all these maps. It is formally defined as:

COSI(X, C, M) = 1 - \frac{1}{z} \sum_{i=1}^{z} COLI(M_i(X, R_i(C)), R_i(C))    (12)

where M_i(X, C) is the set of positions X* on the map outcome of the i-th supervised mapping M of the data X with class labels C, R_i(C) is a random permutation of the original class labels C, and z is set to 15 in our experiments. The COSI can also be computed in the original space, where the mapping M is the identity, to give a baseline for comparison, in which case we write it COSI(X, C, Id). The COSI lies between 0 and 1: the lower the COSI, the less the mapping technique is prone to over-separation of the classes, the better the achievement of the Class overlapping preservation (DP3) and the better the map. A similar technique has been applied in [86] in the context of exploratory data analysis and feature selection. Note that the same random permutations are used for every map, for the sake of fair comparison across all the experiments.

5.3. Standardized criteria
In order to get an overall measure of quality across different datasets, we standardize the structure preservation indices defined above with respect to the Isomap [67] non-linear mapping technique. The standardized distance preservation index is given by:

DP(m) = \frac{E^{(m)}_{CCA} + E^{(m)}_{Sammon}}{E^{(Isomap)}_{CCA} + E^{(Isomap)}_{Sammon}}    (13)

and the standardized rank preservation index is given by:

RP(m) = \frac{E^{(m)}_{Trustworthiness} + E^{(m)}_{Continuity}}{E^{(Isomap)}_{Trustworthiness} + E^{(Isomap)}_{Continuity}}    (14)

The classes' structure indices computed on the map M are also standardized, but with respect to the values computed in the original data space:

sCOLI = \frac{COLI(X^*, C, k)}{COLI(X, C, k)}    (15)

sCOSI = \frac{COSI(X, C, M)}{COSI(X, C, Id)}    (16)
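Once the raw indices are available, this standardization reduces to simple ratios; a sketch with hypothetical dictionaries holding the raw values:

```python
def standardized_indices(E, E_isomap, coli_map, coli_hd, cosi_map, cosi_hd):
    """Standardized criteria of eqs. (13)-(16). E and E_isomap hold the raw
    stress and rank indices for the evaluated technique and for Isomap;
    coli_hd and cosi_hd are the baselines computed in the original space."""
    DP = (E["cca"] + E["sammon"]) / (E_isomap["cca"] + E_isomap["sammon"])  # eq. (13)
    RP = (E["trust"] + E["cont"]) / (E_isomap["trust"] + E_isomap["cont"])  # eq. (14)
    sCOLI = coli_map / coli_hd                                              # eq. (15)
    sCOSI = cosi_map / cosi_hd                                              # eq. (16)
    return DP, RP, sCOLI, sCOSI
```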

5.4. Evaluation criteria in brief
The pair of trustworthiness and continuity indices (eq. 9 and 10) quantifies the preservation of the neighborhood ranks, and the pair of CCA and Sammon's indices (eq. 2, 3 and 4) quantifies the preservation of the distances. The Class OverLapping Index (eq. 11) quantifies the degree of class overlapping on the map, and the Class Over-Separation Index (eq. 12) the tendency to separate classes which should not be separated. Standardized indices (eq. 13, 14, 15 and 16) are used to get comparable values across datasets (see Figure 9). In all cases, the lower the value of the index, the better the map.



6. EXPERIMENTS
In the following, we compare the performance of several unsupervised and supervised mapping techniques versus ClassiMap on different datasets, with respect to the evaluation indices defined above.
6.1. Datasets
We use 9 datasets in our experiments. A summary of their main characteristics is given in the Table 4. The first one is the 2D-square dataset already presented in section 2.3. and in the Figure 4. This 2-dimensional dataset with overlapping classes is aimed at showing the over-separation tendency of state-of-the-art supervised mapping techniques. The second is the Oil Flow synthetic dataset [66], obtained from a physical simulation of oil flow in a pipeline. Three classes correspond to three different mixtures of oil, water and gas in pipes: "stratified", "annular" and "homogeneous". The other datasets are extracted from the UCI Machine Learning Repository [2] (http://archive.ics.uci.edu/ml/datasets/). The optical recognition of handwritten digits dataset is a set of black and white pictures (8 × 8 pixels) of handwritten digits (numbers from "0" to "9"). The E. coli dataset contains protein localization sites for strains of the bacterium Escherichia coli. The glass identification dataset corresponds to types of glass defined in terms of their oxide content. The iris dataset contains measures of sepals and petals of 3 types of iris flowers [20]. The isolet dataset contains features including spectral coefficients, contour features, sonorant features, pre-sonorant features, and post-sonorant features measured on sound recordings of subjects speaking the name of each letter of the alphabet. The libras movement dataset contains hand movement types in LIBRAS, the official Brazilian sign language. The wine dataset corresponds to chemical measures from wines derived from three different cultivars. We normalized the glass identification, wine and isolet datasets. Due to the very high number of data points in the oil flow, digits and isolet datasets, we chose to use random subsamples of 500, 200 and 120 data points respectively, preserving the original proportions of the classes.
6.2. Mapping techniques
For each dataset, 12 mapping techniques have been tested. Their main characteristics are summarized in the Table 5. The codes used in the present work come from different sources: LDA and PCA: Matlab statistical toolbox; Isomap: available at http://isomap.stanford.edu/ (J.B. Tenenbaum, V. de Silva, and J.C. Langford); DD-HDS: http://sy.lespi.free.fr/DD-HDS-homepage.html (S. Lespinats); CCA, Local MDS and Sammon: our own Matlab code is used, but a C-code from the Statistical Machine Learning and Bioinformatics group, University of Helsinki, is available at http://research.ics.tkk.fi/mi/software.shtml; LLE and S-LLE: kindly provided by D. de Ridder; S-Isomap: http://lamda.nju.edu.cn/code_S-Isomap.ashx (X. Geng); S-NeRV: kindly provided by Bjørn Sand Jensen. In the following, the various parameters have been set as advised by the authors (see Table 5 for values), or according to what seemed the best choice to us when no recommendation was found. The computational time given in the Table 5 is measured on the digits dataset. Please notice that this time is strongly related to the implementation and to the dataset, so it is only indicative of the order of magnitude of the computation time one can expect for each mapping technique. CCA, Local MDS0.5 and Sammon's mapping techniques share the same weighting function F described in section 2.2.
The value of the parameter σ of this function F, which is also used in the distance preservation indices E_CCA and E_Sammon, only depends on the dataset and is given in the Table 4. The parameter k used in the rank preservation indices E_Trustworthiness and E_Continuity and in COLI and COSI is set to 5. Different runs of a stochastic optimization technique with exactly the same parameters may lead to different maps due to local minima. Therefore, each mapping technique relying on such a stochastic optimization process is run 15 times in order to display

the distribution of the generated maps with respect to the different indices. In order to distinguish visually between supervised and unsupervised techniques, the names of the supervised techniques appear in a frame in every Figure.
6.3. Results with the 2D-square dataset
The main result we can observe in the Figure 4 is that ClassiMap provides an optimal identity map, just as the other distance-preserving unsupervised mapping techniques. In particular, as opposed to the other supervised mapping techniques, it shows no tendency to over-separate the classes. Beyond this good property, the fact that ClassiMap is able to perfectly map data lying in a plane is also a qualitative benefit over other supervised or rank-based mapping techniques.
6.4. Results with the Digit dataset
The digit dataset shows two interesting properties: (1) the classes overlap very little in the high-dimensional space (COLI = 0.1 in the original space; this point is confirmed by a more detailed ground-truth analysis described in Figure 5); however, (2) the dataset is difficult to unfold into a 2-dimensional space (see Figure 6). In such a case we expect supervised DR techniques to lead to a better quality of the mapping in terms of class structure preservation. Figure 6 shows the maps obtained with the different mapping techniques applied to the Digit dataset. Figure 7 shows that LLE, S-LLE, PCA, LDA, S-NeRV and Sammon's mappings badly preserve the structures. Even if S-Isomap appears to be good at rank preservation, we can observe in the Figure 6 that its map is hardly readable because most of the data points are grouped together in the center of the map (as shown in the Figure 8, which displays a distance-based evaluation). Here, the poor efficiency of S-Isomap at preserving the local structure in terms of distances is corroborated. Conversely, ClassiMap better preserves the distances. Moreover, S-Isomap shows in the Figure 9 a very high tendency to over-separate the classes. The Figure 9 shows that ClassiMap is the only supervised mapping technique which performs well on both the Class separation enhancement and the Class overlapping preservation DPs.
6.5. Other datasets
The 12 mapping techniques have been evaluated over the seven other datasets. Several global tendencies can be observed in the Figure 10. The stress-based mapping techniques generally show high performance in terms of distance preservation; Isomap is often competitive too. Comparing the techniques with respect to rank preservation is less straightforward, because somewhat similar values are often reached. ClassiMap outperforms the other supervised mapping techniques in terms of distance preservation and is also among the best techniques in terms of rank preservation. We observe that ClassiMap is able to reach performance somewhat similar to the ones obtained with DD-HDS and LMDS (which are unsupervised) regarding the distance and rank preservation, although some distortions are less penalized; maybe because accounting for the class labels can help unfolding the global underlying data structure. S-LLE and S-Isomap show little overlapping (low sCOLI), but often at the cost of a high over-separation (high sCOSI). LDA and S-NeRV are robust to over-separation (low sCOSI), but they do not separate the classes when they could do so (high sCOLI). In comparison to supervised non-linear mapping techniques, unsupervised techniques are not sensitive to over-separation, as expected, but the classes are clearly less separated.
ClassiMap appears to be among the most efficient techniques at preserving the classes (low sCOLI), just after S-LLE and S-Isomap. However, ClassiMap is less prone to over-separating the classes than S-LLE and S-Isomap (clearly lower sCOSI), although ClassiMap achieves more over-separation than the unsupervised techniques which, by definition, cannot over-separate. One could be especially interested in the ability of supervised mapping techniques to highlight the classes. Such users should focus on the COLI and COSI, which rely on a class purity index to evaluate a map according to the achievement of the Class separation enhancement and the Class overlapping preservation DPs respectively. For a mapping technique which enhances the class separation, the purity index on the map should be lower than in the original space (the lower, the better). The sCOLI and sCOSI indices presented in the Figure 10 (bars in the third and fourth columns) have been standardized in order to be comparable across datasets (section 5.3). S-LLE and S-Isomap are almost always the best in terms of class purity (very often in 1st or 2nd position wrt COLI), but the most prone to over-separation (very often in 4th or 5th position wrt COSI). LDA and

S-NeRV are the most robust to over-separation (very often in 1st or 2nd position wrt COSI) but have the worst class purity (very often in 4th or 5th position wrt COLI). ClassiMap is the unique supervised mapping technique to perform equally well with respect to both COLI and COSI, being in the 3rd position in most of the cases.
6.6. Sensitivity analysis of all quality indices with respect to the value of the parameter k
The parameter k appears in the definitions of the E_Trustworthiness and E_Continuity, COLI and COSI indices (see sections 5.1.1, 5.2.1 and 5.2.2 respectively). It has been set to 5 to be in agreement with a similar choice made for some mapping techniques (Isomap [67] among others). We have however to check that k = 5 is also a good setting for these various indices. We use the Digit data to analyse the influence of k on all the quality indices. In order to do so, we compare the outcomes for k set to 2, 5 and 10 in the Figures 11 and 12. We observe that the relative and absolute positions of the points representing the quality of the maps are qualitatively the same for the different values of k.
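The check itself amounts to recomputing the k-dependent indices for several neighborhood sizes; a short sketch reusing the trust_continuity and coli functions given in section 5, with X, X_map and labels assumed to hold the Digit data, its map and its class labels:

```python
for k in (2, 5, 10):
    e_trust, e_cont = trust_continuity(X, X_map, k)
    print(f"k={k}: E_Trustworthiness={e_trust:.3f}, "
          f"E_Continuity={e_cont:.3f}, COLI={coli(X_map, labels, k):.3f}")
```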

7. DISCUSSION
7.1. User guidelines
As our experiments showed, ClassiMap is currently the best available DR technique suited to EDALD. Therefore we advise using ClassiMap when the objective is to explore labeled data by mapping them onto the plane while preserving both their similarities and the class structure. However, no DR technique can preserve in general all the original similarities and class structure, due to the reduction of the dimension. The COSI and COLI indices are designed to evaluate ClassiMap and other DR techniques regarding the preservation of the class structure as a whole; they tell us about the global quality of a map, but they cannot tell us which patterns in the map itself are trustworthy and which are due to mapping artifacts. Therefore we advise the user to always use complementary techniques like ProxiViz [3], ProxiLens [29], CheckViz [45] or others [50] to analyze the mapping itself before drawing any conclusion regarding the class structure and the CCCA based on their own visual class-cluster correlation perception. We also advise the reader to cross-check the visual cues with automatic machine learning approaches, as proposed for instance in [5, 11, 23].
7.2. Extension of the co-membership matrix
ClassiMap can handle fuzzy classifications. Indeed, the co-membership matrix A can be filled with 0 and 1 for a crisp classification, but also with values in between for a soft assignment of the labels. For instance, ClassiMap can take into account uncertainties in the class labelling: the probability of a class given a data point can be used to estimate the probability that two data points belong to the same class, i.e. their co-membership probability, and the co-membership matrix A can then be filled with these probabilities. In another setting, data points can belong to several classes simultaneously. Let us suppose that the data point i belongs to two classes at the same time; the co-membership between i and the data points that belong to both these classes can be set equal to 1. ClassiMap can also handle hierarchical classifications, by setting a value in A for pairs of points belonging to the same branch of the classification tree: the longer the branch, the lower the value, and the greater the separation on the map. ClassiMap can be extended to the semi-supervised case as well, where the dataset is only partially labeled, using an ad hoc setting of the co-membership matrix for the unlabeled data points. In that case, the co-membership probability of an unlabeled data point with a labeled one can be set to the inverse of the number of classes, or can take into account the number of data points in each class.
7.3. A posteriori positioning
Projecting a new data point a posteriori on the map is considered by Zhao and Zhang as one of the biggest issues for existing mapping techniques [85]. In a predictive modeling framework, a posteriori positioning is used to classify a new data point based on the location of the points already mapped. In the EDALD framework, the objective is not to classify new data points, but being able to position new points a posteriori would:
- ease the mapping of streaming data and so the tracking of changes in the class structure;
- avoid the computational cost of mapping all the data again and again while only little information has changed;
- ensure the visual stability of the map as a mental reference for the user, which is not expected to change for each newly added data point.
ClassiMap, like other stress-based techniques, could handle the a posteriori positioning of new data [16, 42]. Indeed, the local

stresses between a new labeled data point and the data already mapped can easily be computed, and the sum of these local stresses can easily be used to drive the new data point towards its natural position on the map (a minimal sketch is given below). There is yet another possibility: not projecting the new point on the map at all. One of the authors has used this idea in a former application paper [4] in the context of information retrieval. It consists in showing the original similarity of the new data point (the query) with respect to the labeled data already mapped (the references) by coloring the Voronoi cells of the latter on the map with shades of grey proportional to this similarity. In that way, if the original neighbors of the new data point are mapped in different locations due to a tear, the actual similarity to these separated data points can be displayed in its full complexity. This approach is based on the proximity visualization technique ProxiViz proposed in [3].
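A minimal sketch of such a stress-driven a posteriori positioning (our own simplification: gradient descent with a quadratic local stress and the stiffness treated as fixed at each step; all names are ours):

```python
import numpy as np

def place_new_point(d_new, a_new, X_map, F, steps=200, lr=0.05):
    """Position one new labeled data point on an existing map by descending
    the sum of its local ClassiMap stresses.

    d_new[i]: original-space distance from the new point to mapped point i;
    a_new[i]: class co-membership of the new point with point i;
    F: the weighting function of eq. (7)."""
    y = X_map.mean(axis=0).copy()                 # start from the map's center
    for _ in range(steps):
        diff = y - X_map                          # shape (n, 2)
        d_star = np.linalg.norm(diff, axis=1) + 1e-12
        w = a_new * F(d_new) + (1 - a_new) * F(d_star)   # per-pair stiffness
        # gradient of sum_i w_i * (d*_i - d_i)^2 with respect to y
        grad = (2.0 * w * (d_star - d_new) / d_star)[:, None] * diff
        y -= lr * grad.sum(axis=0)
    return y
```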

8. CONCLUSION
In the present paper, we emphasized and addressed substantial weaknesses of existing supervised DR techniques in the context of Exploratory Data Analysis of Labeled Data (EDALD). We proposed a new framework to understand which kind of information can be inferred about multidimensional labeled data based on the visual analysis of their graphical depiction as a scatterplot. To this end, we reframed the concepts of Class-Cluster Correlation Assumption (CCCA) and Class-Cluster Perceptual Correlation (CCPC), defined in a previous work, to the present context of EDALD through low-dimensional representations. We specifically took into account the mapping distortions and the types of inferences that one can derive from labeled data regarding the correctness of the labeling, the relevance of the space or of the similarity metric. A thorough study of the state of the art in connection to this new framework enabled us to define the goals of EDALD and the specific mapping distortions, namely the between-class false neighbors and the within-class tears, to be penalized in order to fulfill these goals. We derived an original cost function which effectively penalizes these specific distortions and defined ClassiMap, a new DR technique for labeled data which minimizes this cost function. Finally, we proposed the Class Over-Separation Index (COSI), a new index to quantify how much maps unduly separate originally overlapping classes, in addition to the Class OverLapping Index (COLI), an index new to this context, to quantify how much maps avoid unfounded class overlapping. These two indices, together with four other standard unsupervised indices, made possible a fair and objective comparison of ClassiMap with other supervised and unsupervised mapping techniques in the perspective of EDALD. In our experiments, ClassiMap provides maps where the level of distortions often remains close to the ones of the best unsupervised DR techniques. However, the separation of the classes is better preserved with ClassiMap than with unsupervised DR techniques, which ignore the class labels. Conversely, the state-of-the-art supervised DR techniques separate the classes better than ClassiMap, but at the price of artificially un-mixing actually overlapping classes that only ClassiMap was able to preserve. Overall, ClassiMap appears as the DR technique best suited to EDALD. Beyond the possible extensions of ClassiMap to the a posteriori positioning of new data points or to the case of fuzzy labeling or partially labeled datasets, we think that efforts should be put into the evaluation of ClassiMap and of DR techniques in general with real users. This is a far from trivial work that would deserve a dedicated study on its own, to define reasonable user tasks and the way to evaluate their achievement in terms of effectiveness and accuracy. A recent work involving one of the authors [64] suggests focusing research efforts on finding quality measures which would mimic at best the human visual perception of specific patterns. Designing and assessing these measures would require labeled "meta-data" sets where each meta-datum is a scatterplot resulting from a DR technique, and its label encodes the perception level of some specific pattern by a human user.
Given these labeled meta-data, we could find the measure best modeling the human judgment for a given pattern, and then legitimately use this measure to evaluate how new DR techniques render these patterns, without further relying on user-study evaluations, which are always difficult to set up and still rarely conducted in practice.

Acknowledgment The authors would like to thank Bjørn Sand Jensen (Technical University of Denmark) for kindly providing codes (and support) to calculate the supervised distance used in S-NeRV and D. de Ridder (Delft University of Technology) for his codes for Supervised LLE.



References
[1] Aggarwal, C. C., Hinneburg, A., Keim, D. A., 2001. On the Surprising Behavior of Distance Metrics in High Dimensional Space. In: Lecture Notes in Computer Science: 8th International Conference on Database Theory (ICDT 2001), London (United Kingdom), Springer Berlin Heidelberg, 420-434.
[2] Asuncion, A., Newman, D.J., 2007. UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. http://www.ics.uci.edu/~mlearn/MLRepository.html
[3] Aupetit, M., 2007. Visualizing distortions and recovering topology in continuous projection techniques. Neurocomputing 70(7-9):1304-1330.
[4] Aupetit, M., Allano, L., Espagnon, I., Sannie, G., 2010. Visual Analytics to Check Marine Containers in the Eritr@c Project. Proc. of the International Symposium on Visual Analytics Science and Technology (EuroVAST), J. Kohlhammer and D. Keim (Editors), pp. 57-60, Bordeaux, June 2010.
[5] Aupetit, M., 2014. Sanity check for class-coloring-based evaluation of dimension reduction techniques. In Proceedings of the Fifth Workshop on Beyond Time and Errors: Novel Evaluation Methods for Visualization (BELIV '14), Heidi Lam, Petra Isenberg, Tobias Isenberg, and Michael Sedlmair (Eds.). ACM, New York, NY, USA, 134-141. DOI=10.1145/2669557.2669578
[6] Bai, X.M., Yin, B.C., Shi, Q., Sun, Y.F., 2005. Face recognition based on supervised locally linear embedding method. J. Inform. Comput. Sci. 4, 641-646.
[7] Balasubramanian, V.N., Krishna, S., Panchanathan, S., 2008. Person-independent head pose estimation using biased manifold embedding. EURASIP Journal on Advances in Signal Processing.
[8] Bellman, R., 1961. Adaptive Control Processes: A Guided Tour. Princeton Univ. Press, New Jersey.
[9] Belkin, M., Niyogi, P., 2003. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation 15(6): 1373-1396.
[10] Bishop, C.M., 2006. Pattern recognition and machine learning. Springer, New York.
[11] Bruneau, P., Pinheiro, P., Broeksema, B., Otjacques, B., 2015. Cluster Sculptor, an interactive visual clustering system. Neurocomputing 150: 627-644.
[12] Buja, A., Swayne, D.F., Littman, M.L., Dean, N., Hofmann, H., Chen, L., 2008. Data Visualization With Multidimensional Scaling. Journal of Computational and Graphical Statistics 17(2).
[13] Carlsson, G., 2009. Topology and data. Bulletin (new series) of the American Mathematical Society 46(2): 255-308. doi:10.1090/s0273-0979-09-01249-x.
[14] Chalmers, M., 1996. A linear iteration time layout algorithm for visualizing high-dimensional data. In: Yagel, R., Nielson, G. M. (Eds.), Proc. 7th Conf. Visualization, San Francisco/Los Alamitos (California), 127-132.
[15] Chapelle, O., Schölkopf, B., Zien, A., 2006. Semi-supervised learning. Cambridge, Mass.: MIT Press. ISBN 978-0-262-03358-9.
[16] Demartines, P., Hérault, J., 1997. Curvilinear component analysis: A self-organizing neural network for nonlinear mapping of data sets. IEEE Trans. Neural Netw., vol. 8, no. 1, pp. 148-154.
[17] Donoho, D. L., 2000. High-dimensional data analysis: The curses and blessings of dimensionality. In: Amer. Math. Soc. Lecture: "Math challenges of the 21st century", Los Angeles (California). Available: http://www-stat.stanford.edu/~donoho/
[18] Eades, P., 1984. A heuristic for graph drawing. In Proc. 13th Manitoba Conf. Numer. Math. Comput. (Congressus Numerantium), D. S. Meek and G. H. J. V. Rees, Eds., Winnipeg, MB, Canada, vol. 42, pp. 149-160.
[19] Fan, Z., Xu, Y., Zhang, D., 2011. Local linear discriminant analysis framework using sample neighbors. IEEE Transactions on Neural Networks, vol. 22, no. 7, pp. 1119-1132.
[20] Fisher, R.A., 1936. The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, no. 7, pp. 179-188.
[21] France, S.L., Caroll, J.D., 2010. Two-Way Multidimensional Scaling: A Review. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. PP (99), pp. 1-18.
[22] Fukunaga, K., 1990. Introduction to Statistical Pattern Recognition, second ed. Academic Press.
[23] Gaillard, P., Aupetit, M., Govaert, G., 2008. Learning topology of a labeled data set with the supervised generative gaussian graph. Neurocomputing 71(7), 1283-1299.
[24] Gao, X., Wang, X., Tao, D., Li, X., 2011. Supervised Gaussian Process Latent Variable Model for Dimensionality Reduction. IEEE Transactions on Systems, Man, and Cybernetics, Part B 41(2): 425-434.
[25] Geng, X., Zhan, D.C., Zhou, Z.H., 2005. Supervised nonlinear dimensionality reduction for visualization and classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B 35(6): 1098-1107.
[26] Gromov, M., 1999. Metric Structures for Riemannian and Non-Riemannian Spaces. Progress in Mathematics 152, Birkhauser Verlag.
[27] Gu, Y.J., Hsieh, P.F., Yang, M.H., Wu, C.H., 2006. Multi-view Hand Shape Recognition Using Statistical LLE. Proceedings of the International Computer Symposium, vol. 3, pp. 980-984.
[28] Haroz, S., Whitney, D., 2012. How Capacity Limits of Attention Influence Information Visualization Effectiveness. IEEE Transactions on Visualization and Computer Graphics 18(12): 2402-2410.
[29] Heulot, N., Aupetit, M., Fekete, J.D., 2013. ProxiLens: Interactive Exploration of High-Dimensional Data using Projections. EuroVis 2013 Workshop on Visual Analytics using Multidimensional Projections, Leipzig, Germany, June 2013.
[30] Hou, C., Zhang, C., Wu, Y., Jiao, Y., 2009. Stable local dimensionality reduction approaches. Pattern Recognition 42(9): 2054-2066.
[31] Jensen, B. S., 2006. Exploratory Datamining in Music. Master thesis, Technical University of Denmark.
[32] Jolliffe, I., 2002. Principal Component Analysis. Springer-Verlag, New York.
[33] Kaski, S., Sinkkonen, J., Peltonen, J., 2001. Bankruptcy analysis with self-organizing maps in learning metrics. IEEE Transactions on Neural Networks, 12:936-947.
[34] Kaski, S., Sinkkonen, J., 2004. Principle of learning metrics for exploratory data analysis. Journal of VLSI Signal Processing, special issue on Machine Learning for Signal Processing, 37:177-188.
[35] Kouropteva, O., Okun, O., Pietikäinen, M., 2003. Supervised locally linear embedding algorithm for pattern recognition. In: Proc. Pattern Recognition and Image Analysis, IbPRIA 2003. LNCS, vol. 2652. Springer, pp. 386-394.
[36] Kruskal, J.B., 1964. Non-metric multidimensional scaling: A numerical method. Psychometrika, vol. 29, pp. 115-129.
[37] Kruskal, J.B., 1964. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, vol. 29, pp. 1-27.
[38] Laanaya, H., Martin, A., Aboutajine, D., Khenchaf, A., 2005. "A New Dimensionality Reduction Method for Seabed Characterization: Supervised Curvilinear Component Analysis". IEEE OCEANS'05 EUROPE, Brest, France, 20-23 June 2005.
[39] Laanaya, H., Martin, A., Khenchaf, A., Aboutajine, D., 2007. "Une nouvelle méthode pour l'extraction de paramètres : l'analyse en composante curvilinéaire supervisée". Atelier Fouille de données complexes dans un processus d'extraction de connaissance, Extraction et Gestion des Connaissances (EGC), pp. 21-32, Namur, Belgique, 24-26 January 2007.
[40] Lee, J.A., Verleysen, M., 2007. Nonlinear dimensionality reduction. Springer, New York, NY, USA.
[41] Le Roux, N., Bengio, Y., Lamblin, P., Joliveau, M., Kégl, B., 2007. Learning the 2-D topology of images. In Advances in Neural Information Processing Systems, Volume 20, pages 841-848. The MIT Press.
[42] Lespinats, S., Verleysen, M., Giron, A., Fertil, B., 2007. DD-HDS: a tool for visualization and exploration of high-dimensional data. IEEE Trans. Neural Netw., vol. 18, no. 5, pp. 1265-1279.
[43] Lespinats, S., Fertil, B., Villemain, P., Herault, J., 2009. RankVisu: Mapping from the neighbourhood network. Neurocomputing, vol. 72(13-15), pp. 2964-2978.
[44] Lespinats, S., Aupetit, M., 2010. "Mapping without visualizing local default is nonsense." Proceedings of the 18th European Symposium on Artificial Neural Networks (ESANN 2010), Bruges, Belgium, pp. 111-116.
[45] Lespinats, S., Aupetit, M., 2011. "CheckViz: Sanity Check and Topological Clues for Linear and Non-Linear Mappings". Computer Graphics Forum, 30(1), pp. 113-125.
[46] Li, J.X., 2004. Visualization of high-dimensional data with relational perspective map. Inf. Visualization 3, 49-59.
[47] Li, C.G., Guo, J., 2006. "Supervised isomap with explicit mapping". In Proceedings of the 1st IEEE International Conference on Innovative Computing, Information and Control, ICICIC '06, Beijing, China, August 2006.
[48] van der Maaten, L.J.P., Postma, E.O., van den Herik, H.J., 2009. Dimensionality reduction: A comparative review. Tilburg University Technical Report, TiCC-TR 2009-005.
[49] Martinetz, T., Schulten, K., 1994. Topology representing networks. Neural Networks 7(3), 507-522.
[50] Martins, R.M., Coimbra, D. B., Minghim, R., Telea, A.C., 2014. Visual analysis of dimensionality reduction quality for parameterized projections. Computers & Graphics, 41 (Jun. 2014), 26-42. DOI: 10.1016/j.cag.2014.01.006.
[51] Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Müller, K.-R., 1999. Fisher Discriminant Analysis with Kernels. Neural Networks for Signal Processing, vol. 9, pp. 41-48.
[52] Milman, V. D., 1988. The heritage of P. Lévy in geometric functional analysis. Astérisque 157-158, 273-301.
[53] Morrison, A., Ross, G., Chalmers, M., 2003. Fast Multidimensional Scaling through Sampling, Springs and Interpolation. Inf. Visualization 2, 68-77.
[54] Pearson, K., 1901. On lines and planes of closest fit to systems of points in space. Philosophical Magazine no. 2, pp. 559-572.
[55] Peltonen, J., Klami, A., Kaski, S., 2004. Improved learning of Riemannian metrics for exploratory analysis. Neural Networks, 17:1087-1100.
[56] Pillati, M., Viroli, C., 2005. Supervised locally linear embedding for classification: An application to gene expression data analysis. In: Sergio Zani, Andrea Cerioli (Eds.), Book of Short Papers, CLADAG 2005, Parma, 6-8 Giugno, MUP, pp. 147-150.
[57] Ribeiro, B., Vieira, A., Carvalho das Neves, J., 2008. Supervised Isomap with Dissimilarity Measures in Embedding Learning. Lecture Notes in Computer Science, Volume 5197/2008, pp. 389-396.
[58] de Ridder, D., Kouropteva, O., Okun, O., 2003. Supervised locally linear embedding. Lecture Notes in Artificial Intelligence, 2714:333-341.
[59] de Ridder, D., Loog, M., Reinders, M.J.T., 2004. "Local Fisher embedding". In Proceedings of the 17th International Conference on Pattern Recognition, pp. 295-298.
[60] Roweis, S.T., Saul, L.K., 2000. Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500), 2323-2326.
[61] Sammon, J.W., 1969. A nonlinear mapping for data structure analysis. IEEE Trans. Comput., vol. C-18, no. 5, pp. 401-409.
[62] Schreck, T., von Landesberger, T., Bremm, S., 2010. "Techniques for Precision-Based Visual Analysis of Projected Data." Information Visualization, vol. 9(3), pp. 181-193.
[63] Sedlmair, M., Tatu, A., Munzner, T., Tory, M., 2012. A Taxonomy of Visual Cluster Separation Factors. Computer Graphics Forum (Proc. EuroVis), 31(3), 1335-1344.
[64] Sedlmair, M., Aupetit, M., 2015. Data-driven Evaluation of Visual Quality Measures. Eurographics Conference on Visualization (EuroVis) 2015, H. Carr, K.-L. Ma, and G. Santucci (Eds.), Computer Graphics Forum 34(3).
[65] Sips, M., Neubert, B., Lewis, J.P., Hanrahan, P., 2009. Selecting good views of high-dimensional data using class consistency. In Proceedings of the 11th Eurographics / IEEE - VGTC Conference on Visualization (EuroVis'09), Hans-Christian Hege, Ingrid Hotz, and Tamara Munzner (Eds.). Eurographics Association, Aire-la-Ville, Switzerland, 831-838.
[66] Svensén, M. Generative Topographic Map homepage. http://www.ncrg.aston.ac.uk/GTM/index.html
[67] Tenenbaum, J.B., de Silva, V., Langford, J.C., 2000. A global geometric framework for nonlinear dimensionality reduction. Science 290(5500), 2319-2323.
[68] Torgerson, W.S., 1952. Multidimensional scaling: 1. Theory and method. Psychometrika, vol. 17, pp. 401-419.
[69] Tukey, J.W., 1977. Exploratory Data Analysis. Addison-Wesley. ISBN 0-201-07616-0.
[70] Varini, C., Degenhard, A., Nattkemper, T.W., 2006. Visual exploratory analysis of DCE-MRI data in breast cancer by dimensional data reduction: A comparative study. Biomedical Signal Processing and Control 1(1): 56-63.
[71] Venna, J., Kaski, S., 2001. Neighborhood preservation in nonlinear projection methods: An experimental study. In G. Dorffner, H. Bischof, and K. Hornik, editors, Proceedings of ICANN 2001, International Conference on Artificial Neural Networks, pp. 485-491, Berlin, Springer.
[72] Venna, J., Kaski, S., 2006. Local multidimensional scaling. Neural Networks, vol. 19, no. 6-7, pp. 889-899.
[73] Venna, J., Kaski, S., 2007. Nonlinear dimensionality reduction as information retrieval. In: M. Meila, X. Shen (Eds.), Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS 2007), San Juan, Puerto Rico, pp. 568-575.
[74] Venna, J., Peltonen, J., Nybo, K., Aidos, H., Kaski, S., 2010. Information retrieval perspective to nonlinear dimensionality reduction for data visualization. The Journal of Machine Learning Research 11, 451-490.
[75] Wang, M., Yang, J., Xu, Z.J., Chou, K.C., 2005. SLLE for predicting membrane protein types. J. Theor. Biology 232, 7-15.
[76] Ware, C., 2004. Information visualization: perception for design, pp. 189-190. 2nd Edition, Morgan Kaufmann, Elsevier.
[77] Weng, S., Zhang, C., Lin, Z., 2005. Exploring the structure of supervised data by discriminant isometric mapping. Pattern Recognition 38, 599-601.
[78] Wismüller, A., Verleysen, M., Aupetit, M., Lee, J. A., 2010. "Recent Advances in Nonlinear Dimensionality Reduction, Manifold and Topological Learning." Proceedings of the 18th European Symposium on Artificial Neural Networks (ESANN 2010), Bruges, Belgium, pp. 71-80.
[79] Witten, D.M., Tibshirani, R., 2011. Supervised multidimensional scaling for visualization, classification, and bipartite ranking. Computational Statistics & Data Analysis, Volume 55, Issue 1, pp. 789-801.
[80] Yang, X., Fu, H., Zha, H., Barlow, J., 2006. "Semi-supervised nonlinear dimensionality reduction". In: Proceedings of the International Conference on Machine Learning, pp. 1065-1072.
[81] Zaki, M.J., Meira, W. Jr., 2014. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press, May 2014. ISBN: 9780521766333.
[82] Zha, H., Zhang, Z., 2005. "Spectral analysis of alignment in manifold learning." IEEE International Conference on Acoustics, Speech, and Signal Processing.
[83] Zhang, S.Q., 2009. Enhanced supervised locally linear embedding. Pattern Recognition Letters 30, pp. 1208-1218.
[84] Zhang, Z., Zha, H., 2004. Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. SIAM Journal on Scientific Computing, 26(1), 313-338.
[85] Zhao, L., Zhang, Z., 2009. Supervised locally linear embedding with probability-based distance for classification. Computers & Mathematics with Applications 57(6), pp. 919-926.
[86] Zighed, D.A., Lallich, S., Muhlenbach, F., 2005. A statistical approach of class separability. Applied Stochastic Models in Business and Industry, Vol. 21(2), pp. 187-197.



Table 1 – The Exploratory Data Analysis of Labeled Data (EDALD) framework: summary of the main structural information which can be inferred about the labels or the data space or similarities from the within-class and between-class patterns in the actual multidimensional space.

MD Pattern | Between-class pattern: several classes mixed in a single cluster | Between-class and within-class pattern: each class in a single cluster | Within-class pattern: one class shattered in several clusters
CCCA | False between-CCCA, low between-CCPC | Valid CCCA, high CCPC | False within-CCCA, low within-CCPC
Inference, assuming labels are correct | Missing discriminant variables | Correct space/similarity | Irrelevant variable
Inference, assuming space/similarity is correct | Erroneous labels | Correct labels | Subclass/Missing observations

[Figure 1: three panels — Original data, Map A (tear), Map B (false neighborhood).]
Fig. 1. Distortions relative to the star-shaped data point. The original dataset (left) is a regular grid of points in the plane (standing for a usually high-dimensional space). The triangles within the circle belong to the original star's neighborhood. Map A (center) shows a tear, where some of the original neighbors are no longer neighbors on the map (green triangles). Map B (right) shows a false neighborhood, where some neighbors on the map (magenta circles) are not among the original neighbors.



[Table 2: twelve cases (I to XII) of matching between the multidimensional (MD) patterns and their 2D depiction, with the associated main distortions: cases I, II, III and VI show no distortion; case IV: between-class tear (BT); case V: between-class false neighborhood (BFN); case VII: WT+BT; case VIII: WFN+BFN; case IX: within-class tear (WT); case X: within-class false neighborhood (WFN); case XI: WT+BFN; case XII: WFN+BT. For each case, the table indicates which MD inference based on the 2D visualization is correct, false or overlooked, assuming either a good similarity or good labels, and how much CCA, Sammon and ClassiMap penalize the case.]
Table 2. Exploratory Data Analysis of Labeled Data (EDALD) framework considering mapping distortions: we show all the possible mapping evaluation cases regarding the matching between grouping perception (CCPC) based on classes (blue or red colors) and clusters (proximity) in the map (2D), and class-cluster assumptions (CCCA) in the data space (MD). The mapping can be correct with no distortions (No), or distorted with false neighbors (FN) or tears (T) related to within-class (W) or between-class (B) similarities (MD) or proximities (2D). Depending on the confidence we have in the data space or the labeling as correct models of the idealized space where the CCCA is valid, and depending on what we observe on the map based on the CCPC, we can infer different characteristics regarding the MD data. These characteristics, already presented in Table 1, are indicated in blue or green, and their validity is displayed as correct, overlooked or wrong. For instance, in the case of within-class false neighbors (WFN) mapping distortions (case X, yellow cells), the user might see a high within-CCPC of the blue class appearing as a single cluster, while the blue class is actually shattered in several clusters in the multidimensional space. Then, regarding the blue class, the user will wrongly infer a valid CCCA, while she will overlook either the possible presence of irrelevant variables splitting the blue class, the lack of blue-class samples, or the possibility of dividing the blue class into several subclasses. The right columns indicate how much the CCA, Sammon and ClassiMap DR techniques penalize the different cases (the darker the red, the higher the penalization, and the lower the likeliness of the case to appear in the map) (see section 4.1).


Fig. 2. Types of mapping distortions with labeled data. The left part shows the data configuration in the original space; the right part shows a map of these data. The red and blue colors indicate the two class labels. The focus is on the blue star. Whatever the class labels, cases WFN and BFN are false neighborhoods within and between classes respectively, and cases WT and BT are tears within and between classes respectively. All these distortions should be penalized to enable a trustworthy visual EDALD. However, as some distortions must occur due to the dimension reduction, we shall assign priorities to them. ClassiMap puts a stronger penalization on cases BFN and WT.



Fig. 3. Maps of the "cube" dataset. The first row of the figure shows how the 3-dimensional data are built; the color codes the class labels. The 12 maps obtained with supervised (framed name) and unsupervised (unframed name) DR techniques are shown in the next 3 rows. ClassiMap preserves better than the others both the within-class structure (one blue and one red component) and the between-class structure (partial overlap of the red and blue classes).


[Figure 4: maps of the 2D-square dataset — Original data; unsupervised: CCA, DD-HDS, Isomap, LLE; supervised: ClassiMap, S-Isomap, S-NeRV, S-LLE 0.5, S-LLE 1.]
Fig. 4. Examples of over-separation by supervised DR techniques: original 2-dimensional data (top left) with randomly assigned class labels (red triangles and blue circles) are mapped onto a plane with unsupervised (top row) and supervised (bottom row, name in a frame) DR techniques. CCA and DD-HDS, which preserve distances, give exactly the same data distribution as the original one (just as Sammon's mapping or Local MDS would), and Isomap and LLE give a similar distribution. Conversely, S-Isomap, S-NeRV and S-LLE, supervised mapping techniques which first modify the original similarities based on the class labels, tend to unduly separate the classes. In this case, the proposed ClassiMap DR technique gives the same correct map as the distance-preserving unsupervised mapping techniques.

TABLE 3: NOTATION SUMMARY

n : Number of data points
Ci : Class label of data point i (positive integer)
xi : Position of data point i in the original space
x*i : Position of data point i on the map
dij : Distance between xi and xj
d*ij : Distance between x*i and x*j
ri(j) : Rank of the distance dij among the diz in increasing order (1 for the nearest neighbor, 2 for the second nearest, ...)
r*i(j) : Rank of the distance d*ij among the d*iz in increasing order
Fm : Non-negative decreasing function of the distance for the mapping technique m
σ : Bandwidth parameter of the function Fm
k : Size of the neighborhood in both spaces
Uk(i) : Set of data points among the k-nearest neighbors of x*i but not of xi
Vk(i) : Set of data points among the k-nearest neighbors of xi but not of x*i
Em(i,j) : Stress function between data points i and j for the mapping m
Aij : Class co-membership of data points i and j (1 if Ci = Cj and 0 otherwise)


TABLE 4: DATASETS ESSENTIAL FEATURES.

Dataset  | # data (n) | # classes | Dimension | σ
Square   | 100        | 2 (b)     | 2         | 1.2
Oil flow | 1000       | 3         | 12        | 0.5
Digits   | 5264       | 10 (b)    | 64        | 35
E. coli  | 336        | 8         | 7         | 0.1
Glass    | 214        | 6         | 10        | 0.75
Iris     | 150        | 3 (b)     | 4         | 0.75
Isolet   | 7797       | 6 (b)     | 617       | 20
Libras   | 360        | 15 (b)    | 90        | 0.75
Wine     | 178        | 3         | 13        | 2

The parameter σ (column 5) refers to our setting for the neighborhood width (see section 2.2. for details). In the "# classes" column, (b) means that the classes are balanced.

TABLE 5: ESSENTIAL FEATURES OF THE MAPPING TECHNIQUES.

Technique   | [Ref]  | Linear / non-linear | Supervised / unsupervised | Stochastic / exact (a) | Parameters | Comput. time (sec)
LDA         | 20, 22 | Linear     | supervised   | Exact      | .
PCA         | 54, 32 | Linear     | unsupervised | Exact      | .
Isomap      | 67     | Non-linear | unsupervised | Exact      | k = 5
DD-HDS      | 42     | Non-linear | unsupervised | Stochastic | λfinal = 0.1
CCA         | 16     | Non-linear | unsupervised | Stochastic | σ (depends on the dataset)
LocalMDS0.5 | 72     | Non-linear | unsupervised | Stochastic | σ (depends on the dataset)
Sammon      | 61     | Non-linear | unsupervised | Stochastic | σ (depends on the dataset)
LLE         | 60     | Non-linear | unsupervised | Exact      | k = 20 (as advised by the authors)
S-LLE0.25   | 35, 58 | Non-linear | supervised   | Exact      | k = 20
S-Isomap    | 77, 25 | Non-linear | supervised   | Exact      | α = 0.5, β = average distance
ClassiMap   | .      | Non-linear | supervised   | Stochastic | λfinal = 0.1
S-NeRV      | 74     | Non-linear | supervised   | Stochastic | 3 for distance, 1 for mapping