IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL. 14, NO. 2, MARCH/APRIL 2008

Least Square Projection: A Fast High-Precision Multidimensional Projection Technique and Its Application to Document Mapping

Fernando V. Paulovich, Luis Gustavo Nonato, Rosane Minghim, and Haim Levkowitz, Member, IEEE

Abstract—The problem of projecting multidimensional data into lower dimensions has been pursued by many researchers due to its potential application to data analysis of various kinds. This paper presents a novel multidimensional projection technique based on least square approximations. The approximations compute the coordinates of a set of projected points based on the coordinates of a reduced number of control points with defined geometry. We name the technique Least Square Projection (LSP). From an initial projection of the control points, LSP defines the positioning of their neighboring points through a numerical solution that aims at preserving a similarity relationship between the points given by a metric in mD. In order to perform the projection, a small number of distance calculations are necessary, and no repositioning of the points is required to obtain a final solution with satisfactory precision. The results show the capability of the technique to form groups of points by degree of similarity in 2D. We illustrate that capability through its application to mapping collections of textual documents from varied sources, a strategic yet difficult application. LSP is faster and more accurate than other existing high-quality methods, particularly where it was mostly tested, that is, for mapping text sets.

Index Terms—Document and text processing, visualization, simulation, modeling and visualization, data and knowledge visualization, information visualization, visualization techniques and methodologies.

F.V. Paulovich, L.G. Nonato, and R. Minghim are with the Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, Av. Trabalhador São-carlense, 400, São Carlos, SP, Brazil, CP: 668, CEP: 13560-970. E-mail: {paulovic, gnonato, rminghim}@icmc.usp.br.

H. Levkowitz is with the Computer Science Department, University of Massachusetts Lowell, 1 University Avenue, Lowell, MA 01854. E-mail: [email protected].

Manuscript received 18 Apr. 2007; revised 4 July 2007; accepted 11 Sept. 2007; published online 15 Oct. 2007. Recommended for acceptance by G.G. Robertson. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TVCG-2007-04-0039. Digital Object Identifier no. 10.1109/TVCG.2007.70443. 1077-2626/08/$25.00 © 2008 IEEE. Published by the IEEE Computer Society.

1 INTRODUCTION

Data sources have increased substantially both in size and complexity, so extracting useful information from them has become a challenge. One measure of data complexity is the number of attributes associated with each instance of data. Consider, for example, data from a demographic census: a data instance records attributes such as age, sex, education, occupation, income, etc. Considering each data attribute as a data dimension, if we have m such attributes, each data instance can be interpreted as an m-dimensional vector placed in an m-dimensional definition space. In traditional statistical analysis, data instances with four or more dimensions are known as multivariate or hypervariate data. Texts on Information Visualization refer to such data as multidimensional. Conventional methods for multidimensional data visualization such as scatterplots, parallel coordinates, or pixel-oriented methods, which are normally employed to assist data interpretation, can fail if they are directly applied to high-dimensional data (a good review of data visualization methods can be found in [1]). Moreover, identification of patterns and models grows more difficult as dimensionality increases (the dimensionality curse [2]), and the lack of proper representations can severely impair interpretation.

A common way of handling dimensionality is to reduce the number of dimensions so that strategies that are known to work well with low-dimensional data can be applied. One method of dimensionality reduction that has been successfully applied is the multidimensional projection technique. A multidimensional projection technique typically maps the data into a d-dimensional space, with d ∈ {1, 2, 3}, while retaining in the projected space some information about distance relationships among the data items in their original definition space. This way, a graphical representation can be created to take advantage of the human visual ability to recognize structures or patterns based on similarity, such as clusters of elements. The problem of multidimensional projection has been the concern of many researchers due to the large variety of applications that could benefit from visual representations of data sets with a large number of dimensions.

Multidimensional projection techniques are either based on linear combinations of data attributes, defining them in a new orthogonal basis of small dimension, or on a process that attempts to minimize a function of the information loss incurred during projection. Linear techniques may fail to capture the relevant patterns of data with nonlinear structures such as clusters of arbitrary shapes or curved manifolds. In such cases, minimization techniques are better candidates. However, they can be computationally expensive to achieve high-precision projections. Thus, faster


projection techniques that can capture nonlinear relationships must be sought. Visual representation of document collections is a special challenge and has been tackled in various forms [3]. Most techniques draw their display from extracted data (such as citation, cocitation, authoring, and keywords). However, less success has been achieved in techniques that build maps of documents based on their content. Well-built projections can be used in this case to display the underlying global structure or local trends of a document collection by mapping similarity in content to neighboring regions on the map. Textual data sets yield vector representations with high dimensionality and sparsity, rendering the application particularly challenging for visualization techniques. Techniques built to handle extremely large document collections quickly lose too much information when seeking scalability. Precise techniques are useful for applications where users have to examine a large quantity of documents, but are too slow, even for data sets of moderate sizes. This paper presents a novel multidimensional projection technique, called Least Square Projection (LSP), that encompasses good features of both linear and nonlinear projection methods. Aside from being fast, LSP can be successfully employed on high-dimensional sparse spaces, resulting in a precise final positioning of points. Its cost effectiveness comes from solving a linear system by fast methods. The initial control-point definition lends a nonlinear character to the method that helps reflect complex correlations. For the case of text documents, LSP can generate high-precision layouts for applications that must rely on the examination of content (thus helping in locating interesting material to focus on), and yet, it is fast enough to handle a reasonable number of documents.
We have presented an initial version of LSP before [4], now significantly extended to employ automatically defined weights, to improve control-point definition, and to reduce the computational complexity. The result is a new, improved projection technique with better performance, an extra degree of flexibility, added precision, and added uniformity of behavior when applied to multiple types of text collections and other data sets. LSP is made available publicly for use, implemented in a tool called Projection Explorer (PEx) [5].1

2 RELATED WORK

Formalizing the concept of distance-based multidimensional projections: let X = {x1, x2, ..., xn} be a set of m-dimensional data instances, with δ(xi, xj) a dissimilarity (distance) measure between two m-dimensional data instances, and let Y = {y1, y2, ..., yn} be a set of points in a d-dimensional space, with d ∈ {1, 2, 3}, and d(yi, yj) the (Euclidean) distance between two points of the projected space. A multidimensional projection technique can be described as an injective function f : X → Y that seeks to make |δ(xi, xj) − d(f(xi), f(xj))| as close to zero as possible for all xi, xj ∈ X [6]. Multidimensional projection techniques can be divided into two major groups according to the functions f

1. http://infoserver.lcad.icmc.usp.br/.


employed: linear projection techniques and nonlinear projection techniques. Linear projection techniques create linear combinations of the data attributes, defining them in a new orthogonal basis of lower dimension. Among such techniques, the most widely known are the second-order techniques, such as Principal Component Analysis (PCA) [7] (also known as the Karhunen-Loève Expansion). Second-order techniques are those that employ information embedded in the covariance matrix of the data, being particularly suitable for data presenting Gaussian (normal) distributions, since in this case the covariance captures almost all of the data distribution [8]. Nonlinear techniques attempt to minimize a function of the information loss incurred by the projection. Normally, this function is based on the dissimilarities among the m-dimensional instances and on the distances among the d-dimensional points. Hence, it does not require representing the original data as vectors: it is sufficient to have a mechanism for measuring instance dissimilarity in the high-dimensional space. Since nonlinear techniques perform an optimization process, their iterative nature is an additional advantage: a user can observe the execution of the projection process and interrupt it if convenient. Another interesting feature is that adding new subsets of instances only requires a limited number of additional iterations. Linear techniques, on the other hand, require the overall process to be entirely redone. One example of a nonlinear projection technique is Multidimensional Scaling (MDS) [9]. Originating from the psychophysics domain, MDS actually comprises a class of techniques aimed at mapping instances belonging to an m-dimensional space into instances in a d-dimensional space (d ≤ m), striving to keep some distance relations. A well-known example of an MDS technique is Sammon's Mapping [10].
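The quality criterion underlying these nonlinear methods — keeping |δ(xi, xj) − d(f(xi), f(xj))| close to zero for all pairs — can be made concrete with a small sketch. The function names below are ours, not from the paper; this is an illustration, not the paper's implementation:

```python
import numpy as np

def pairwise(points):
    """Euclidean distance matrix for points stored as rows."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def projection_error(X, Y):
    """Sum over unique pairs of |delta(x_i, x_j) - d(y_i, y_j)|,
    the quantity a projection f: X -> Y tries to keep near zero."""
    Dx, Dy = pairwise(X), pairwise(Y)
    iu = np.triu_indices(len(X), k=1)
    return np.abs(Dx[iu] - Dy[iu]).sum()

# An isometric map (here: the identity) has zero error.
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
assert projection_error(X, X.copy()) == 0.0
```

Minimizing a (usually weighted) variant of this quantity is what distinguishes the nonlinear techniques discussed next.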
It starts by defining a function that indicates the amount of information lost in the projection and then applies an iterative nonlinear optimization method based on the gradient of this function to find a (local) minimum. Although Sammon's Mapping can unfold data belonging to manifolds of high dimension, since large distances are taken into account in the optimization, it can fail for highly twisted spaces. In this sense, an improvement is the Curvilinear Component Analysis (CCA) [11], which employs a new optimization function that ignores distances greater than a threshold. Another possibility is, instead of using the Euclidean distance, employing geodesic distances; GeoNLM [12] and ISOMAP [13] are examples. A recurrent problem related to Sammon's Mapping is its computational complexity, O(n^2). In order to improve this complexity, Pekalska et al. [14] suggest some strategies. Their main idea is to project only a subset of t multidimensional instances, with t ≪ n, settle such points, and interpolate the n − t remaining points. As interpolation strategies, triangulation, distance mapping, and artificial neural networks have been suggested. Among the various MDS techniques, the simplest ones are those based on Force-Directed Placement (FDP) [15]. Originally proposed as a graph-drawing heuristic by Eades [16], the FDP model aims at bringing a system composed of instances


connected by "virtual" springs into an equilibrium state. Instances are initially placed randomly, and the spring forces iteratively push and pull them until reaching an equilibrium. To apply the FDP model as an MDS technique, the spring forces must be proportional to the difference between the dissimilarity δ(xi, xj) among the m-dimensional instances and the distance d(f(xi), f(xj)) among the d-dimensional points. In the general case, where each instance is connected to all other instances, each iteration of the FDP model has complexity O(n^2). Since at least n iterations are needed in order to reach the equilibrium state, the FDP model's complexity is O(n^3). Aiming at reducing this complexity, two different strategies can be employed: reducing the number of iterations necessary to reach the equilibrium state or reducing the complexity of each iteration. One example of the former strategy was presented in [6], called the Force Scheme. Different from the original idea of Eades, where each instance is moved once per iteration, Force moves each instance n − 1 times per iteration. Thus, fewer iterations are necessary to bring the system to an equilibrium state. Although it reduces the model's complexity, each iteration is still O(n^2). Aiming at reducing this complexity, Paulovich and Minghim [17] proposed a new method, where the instances are first clustered and Force is applied considering the instances of each separate cluster, defining a model whose complexity is O(n^(3/2)). An example of a technique that reduces the complexity of the iterations was presented by Chalmers [18]. This approach reduces the complexity of an iteration by using data samples in order to determine which instances are connected by the imaginary springs. Although this approach makes the iterations linear, the model complexity is still high due to the n iterations necessary for creating the layout, that is, O(n^2).
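A toy version of the spring model just described — every pair joined by a virtual spring whose rest length is the original dissimilarity — might look as follows. This is a naive O(n^2)-per-iteration sketch with names and a step size of our own choosing, not the implementation of any of the cited methods:

```python
import numpy as np

def force_layout(X, d=2, iters=100, step=0.1, seed=0):
    """Naive force-directed placement: each pair (i, j) is joined by a
    'virtual spring' with rest length delta(x_i, x_j); every iteration
    moves each point along its net spring force."""
    rng = np.random.default_rng(seed)
    Dx = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))  # rest lengths
    Y = rng.normal(size=(len(X), d))                     # random start
    for _ in range(iters):
        diff = Y[:, None] - Y[None]                      # y_i - y_j
        Dy = np.sqrt((diff ** 2).sum(-1))
        np.fill_diagonal(Dy, 1.0)                        # avoid /0 on i == j
        # force magnitude ~ (current distance - rest length); pulls
        # together when too far apart, pushes apart when too close
        f = ((Dy - Dx) / Dy)[:, :, None] * diff
        Y -= step * f.sum(axis=1) / len(X)
    return Y
```

A fixed iteration count stands in for a real equilibrium test here; the methods above differ precisely in how many such iterations they need and how much work each one does.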
Aiming at reducing this complexity, another approach was presented in [19] (and extended in [20]), which defines an FDP model with complexity O(n^(3/2)). In this approach, a random sample S of √n instances is first projected using the Chalmers approach. Then, the remaining instances are interpolated from these instances. The interpolation process is what makes this technique O(n^(3/2)); therefore, Morrison and Chalmers [21] suggest a modification of this interpolation, reducing the final complexity to O(n^(5/4)), whereas Jourdan and Melançon [22] suggest a further approach to reduce it to O(n log n). The original projection techniques mentioned here are usually quite precise in terms of the quality of the projection to locate highly related groups of data items. Approaches to reduce the complexity of the techniques aim at making them more scalable to very large data sets, since this is one challenge of information visualization tasks. However, as will be shown in Section 4, this results in projections with lower quality as far as grouping of highly related items is concerned. In this context, we aim at projections that do well at relating similar data elements by positioning them in neighboring regions of the display. However, we also aim at reduced cost, as compared to the basic algorithms of the projections with the highest quality in these terms.


For the case of textual documents, high precision is important for supporting applications in which a user has dozens to a few thousand documents to examine. Although scalability to millions of documents is desirable, techniques designed to accomplish that thus far (such as the ones adopted in the systems InfoSky [23] and IN-SPIRE [24], [25], [26]) can help locate greater pockets and areas of interest but lose intergroup and intragroup relationships by making grouping decisions too early in the process. In order to achieve high precision with reasonable computation cost for reasonably sized data sets, we present a novel projection technique, called LSP. LSP is nonlinear in that the definition of control points is outside the solution of the linear system. It projects the points in a way that is based on their neighborhood. LSP is therefore suitable for nonlinear sparse spaces, is fast, and has high precision.

3 LEAST SQUARE PROJECTION

Given a set of points S = {p1, ..., pn} in ℝ^m, the LSP algorithm aims at representing the points of S in a lower dimensional space ℝ^d, d ≪ m, so as to preserve the neighborhood relationships among the points as much as possible. Two main steps are involved in the projection process. First, a subset of the points in S, called "control points," is projected onto ℝ^d by an MDS method. Then, making use of the neighborhood relationship of the points in ℝ^m and the Cartesian coordinates of the control points in ℝ^d, it is possible to build a linear system whose solutions are the Cartesian coordinates of the points pi in ℝ^d. The idea of generating the coordinates of a set of points from control points has already been explored in the context of mesh recovery and mesh editing by Sorkine and Cohen-Or [27], [28]. In fact, the LSP method generalizes the ideas of Sorkine and Cohen-Or in order to deal with high-dimensional spaces while avoiding the need for a mesh to define the linear system.

3.1 Building the Linear System

Let Vi = {pi1, ..., piki} be a set of ki points in a neighborhood of a point pi, and let p̃i be the coordinates of pi in ℝ^d. Suppose that p̃i is given by the following:

  p̃i − Σ_{pj ∈ Vi} αij p̃j = 0,   with 0 ≤ αij ≤ 1 and Σj αij = 1.   (1)

If (1) holds for the points in S, then each pi is in the convex hull of the points in Vi. In particular, when αij = 1/ki, pi is the centroid of the points in Vi. This particular case has some interesting properties, as discussed in [29] and [30]. Equation (1) gives rise to a set of linear systems from which it is possible to compute the coordinates of the points p̃i, that is,

  Lx1 = 0,   Lx2 = 0,   ...,   Lxd = 0,   (2)

where x1, x2, ..., xd are the vectors containing the Cartesian coordinates (x1, ..., xn) of the points, and L is the n × n matrix whose entries are given by


  lij = 1, if i = j;   lij = −αij, if pj ∈ Vi;   lij = 0, otherwise.

When αij = 1/ki, the matrix L is usually called the Laplacian matrix. The rank of L depends on the neighborhood relationship among the points and on the weights αij. When a mesh is given, the neighborhood of the points can be obtained from the incidence relationship of the mesh. In this case, supposing that αij ≠ 0, the rank of L is n − r, where r is the number of connected components of the mesh [27]. In our case, where no mesh exists, it is important to define the neighborhood of the points so as to ensure the overlap condition, stated as follows:

Definition. Let S = {p1, ..., pn} be a set of points and V = {V1, ..., Vn} be the set of neighborhood relationships of the points in S. The set V is said to satisfy the overlap condition if, for every two points pi and pj, there exists a sequence of neighborhoods V1, ..., Vq such that V1 = Vi, Vq = Vj, and Vk ∩ Vk+1 ≠ ∅ for k = 1, ..., q − 1.

Fig. 1. Matrix A with control points p3 and p6.

The overlap condition ensures the same properties of L as a mesh with a single connected component; that is, L has rank n − 1, ensuring a nontrivial solution. The problem is that no geometrical information is contained in L; thus, the solutions of the linear systems may not be useful. In order to render the solutions of the linear system more attractive, we must add some geometrical information to the system. This is done through control points, which can be obtained by projecting some points of S onto ℝ^d. The control points are built into the linear system as new rows in the matrix. The Cartesian coordinates of the control points are added to the right-hand side of the system, giving rise to a nonzero vector. Thus, given a set of control points Sc = {pc1, ..., pcnc}, we can rewrite (2) in the form

  Ax = b,   (3)

where A is a rectangular (n + nc) × n matrix,

  A = [L; C],   with cij = 1 if pj is the ith control point, and cij = 0 otherwise,

and b is the vector with entries

  bi = 0 for i ≤ n,   bi = xpci for n < i ≤ n + nc,

where xpci is one of the Cartesian coordinates of the control point pci. Fig. 1 shows an example of the matrix A for a set S with six points. The neighbors of each point are presented on the right, and the bold circles are the control points (in this example, L is the Laplacian matrix). The linear system with the control points has full rank, and it can be solved in the least squares sense, which means that we find the x that minimizes ‖Ax − b‖^2, that is, x = (AᵀA)⁻¹Aᵀb. The system AᵀAx = Aᵀb that must be solved is symmetric and sparse, facilitating its solution [27]. Another important characteristic of the above construction is that if a new point pn+1 is added to S, the linear system (3) can easily be updated by adding a single new row to the L part of A. This strategy makes it possible to project points onto ℝ^d iteratively, making it necessary only to

define a neighborhood relationship for each newly added point.
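The construction above can be sketched with a tiny dense implementation. All names here are ours, and a dense solver is used only for illustration; the paper's system would be solved with sparse least-squares machinery on AᵀAx = Aᵀb:

```python
import numpy as np

def lsp_project(neighbors, alphas, ctrl_idx, ctrl_pos, n, d=2):
    """Solve the LSP system A x = b in the least squares sense.
    neighbors[i] -> indices of V_i; alphas[i] -> matching weights a_ij;
    ctrl_idx     -> indices of the control points in S;
    ctrl_pos     -> their (nc, d) coordinates from an MDS step."""
    nc = len(ctrl_idx)
    A = np.zeros((n + nc, n))
    for i in range(n):
        A[i, i] = 1.0
        for j, a in zip(neighbors[i], alphas[i]):
            A[i, j] = -a          # Laplacian rows: p_i - sum a_ij p_j = 0
    for r, j in enumerate(ctrl_idx):
        A[n + r, j] = 1.0         # control rows pin known coordinates
    b = np.zeros((n + nc, d))
    b[n:] = ctrl_pos
    Y, *_ = np.linalg.lstsq(A, b, rcond=None)  # minimizes ||Ax - b||^2
    return Y

# Four points on a chain; the two ends serve as control points.
nbrs = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
alph = {0: [1.0], 1: [0.5, 0.5], 2: [0.5, 0.5], 3: [1.0]}
Y = lsp_project(nbrs, alph, [0, 3], np.array([[0.0, 0.0], [3.0, 0.0]]), n=4)
# interior points spread monotonically between the pinned ends
```

With αij = 1/ki as here, each interior point is pulled toward the centroid of its neighbors while the control rows anchor the layout, which is the geometric intuition behind (3).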

3.2 Control Points

In order to determine the set of control points, a sample of nc points must be carefully chosen from the set S. These points must be selected so as to represent, as well as possible, the data distribution in ℝ^m. To perform this selection, the data set is split into nc clusters using the k-medoids method [2], and the medoid (the point nearest to the centroid) of each cluster is used as a control point. Other clustering methods could also be applied, but k-medoids was chosen due to its ability to deal with outliers [2]. Furthermore, it does not require a vector representation of the data set: only a dissimilarity function between the data instances is necessary. Other formulations might need the particular attributes of each point, which are not always available in some applications. Although this kind of process is more expensive (O(n × nc)) than a simple random sample, in our tests it brings better results, and a small number of control points is needed to reach good-quality projections. Additionally, these clusters can also be used to define the neighborhood relationship of the points, as explained in the next section. Once the control points have been defined, they must be projected onto ℝ^d through an MDS method [9] such as the Force Scheme [6] or Sammon's Mapping [10]. This control-point placement has an important impact on LSP, since the remaining points will be interpolated for the final layout based on this initial layout. Thus, the algorithm should guarantee that the information loss during this process is as small as possible. In order to keep the computational complexity of LSP manageable, the number nc of control points must be chosen according to the MDS technique employed. If an O(n^2) MDS technique is used, nc = √n will make the complexity of this step O(n^(3/2)), due to the clustering method employed. Note that when nc is close to the number of points in S, the final projection will resemble the projection


obtained if the MDS technique were applied to all points of S (and, of course, will be just as slow).
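The selection step can be sketched as follows. This is a hedged illustration, not the paper's algorithm: the paper uses k-medoids on a dissimilarity function, whereas this sketch substitutes a simpler k-means-style loop over vectors and then takes, per cluster, the member nearest the centroid, matching the "nearest point to the centroid" description; all names are ours:

```python
import numpy as np

def select_control_points(X, nc, iters=10, seed=0):
    """Choose nc control points as cluster representatives:
    k-means-style clustering, then the member nearest each
    cluster centroid (the 'medoid' in the paper's sense)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=nc, replace=False)].astype(float)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)            # nearest center per point
        for k in range(nc):
            members = X[labels == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    medoids = []
    for k in range(nc):
        idx = np.flatnonzero(labels == k)
        if len(idx):
            dc = ((X[idx] - centers[k]) ** 2).sum(-1)
            medoids.append(int(idx[dc.argmin()]))  # member nearest centroid
    return np.array(medoids), labels
```

The cluster labels are kept because, as the next section explains, the same clusters drive the neighborhood search.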

3.3 Neighborhood Relationship

Together with the Cartesian coordinates of the control points, it is also necessary to define a list of points Vi ⊂ S for each point pi ∈ S. Since pi is placed in the convex hull of Vi, this list must reflect a neighborhood of pi, basing the final layout on local relations in ℝ^m. This is an important feature if ℝ^m is a high-dimensional sparse space, as obtained in document vector representations. In such spaces, objects are frequently arranged along local subspaces and are related to a small number of nearest neighbors inside the same subspace [31]. Normally, the procedure of finding the nearest neighbors of each point is prohibitively expensive (O(n^2)). However, there are some ways to reduce this complexity. Chávez et al. [32] survey a number of techniques for nearest neighbor searches, splitting them into two major classes: pivot-based algorithms and clustering techniques. In the former, some points are chosen to act as pivots, making it possible to avoid distance calculations. In the latter, the space is split into clusters, so some clusters and their points can be discarded during a search. Here, we employ a simple clustering-based technique to find a point's neighborhood, since the space is already split into clusters by the control-point definition process (see Section 3.2). In this technique, first, a nearest neighbors search over the clusters' medoids is performed, defining the k nearest clusters of each cluster. Then, when a nearest neighbors search for a point pi is performed, only the cluster to which pi belongs and the nearest clusters of that cluster are examined. This is an approximation of the nearest neighbor search, but it yields good results: even when the nearest neighbors returned are not the true ones, they are still very close to pi. The complexity of this technique is determined by the number of clusters: with √n clusters, it is O(n^(3/2)).
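Assuming the cluster labels and medoids from the control-point step are available, the restricted search might be sketched as below. Names are ours, and the sketch ranks clusters per query rather than precomputing each cluster's k nearest clusters as the paper does:

```python
import numpy as np

def approx_knn(X, labels, medoid_idx, query, k=3, n_clusters_search=2):
    """Approximate k-nearest-neighbor search restricted to the
    clusters whose medoids are closest to the query point."""
    # rank clusters by the distance from their medoid to the query
    md = ((X[medoid_idx] - X[query]) ** 2).sum(axis=-1)
    near = np.argsort(md)[:n_clusters_search]
    # candidates: points belonging to those clusters, minus the query
    cand = np.flatnonzero(np.isin(labels, near))
    cand = cand[cand != query]
    d = ((X[cand] - X[query]) ** 2).sum(axis=-1)
    return cand[np.argsort(d)[:k]]
```

Only the candidate clusters are scanned, which is where the savings over a full O(n) scan per query come from.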

4 RESULTS

In order to create the linear system presented in Section 3.1, it is necessary to define how the αij influence the final position of the points. Consider two one-dimensional points xk, xn ∈ [0, 1], and another point xi with xk and xn as its neighbors. In order to place xi (on a line), we have to solve xi = αk xk + αn xn, from (1). It is possible to show that xi will be closer to xk (and farther from xn) when αk > αn. Since our focus is to represent in the reduced space the same distance relations of the original space, the αij must be inversely proportional to the distances between the points in the original space. Here, we employ the following to calculate the αij:

  αij = (1 / δ(xi, xj)) / Σ_{pk ∈ Vi} (1 / δ(xi, xpk)).   (4)
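The normalized inverse-distance weights of (4) are a one-liner; a small sketch (the function name is ours):

```python
import numpy as np

def lsp_weights(dists):
    """Weights alpha_ij of (4) for one point p_i: normalized inverse
    distances to its neighbors, so closer neighbors pull harder."""
    inv = 1.0 / np.asarray(dists, dtype=float)
    return inv / inv.sum()

w = lsp_weights([1.0, 3.0])
# the neighbor at distance 1 gets weight 0.75; the one at 3 gets 0.25
assert abs(w[0] - 0.75) < 1e-12 and abs(w.sum() - 1.0) < 1e-12
```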

We now show some examples of the projections created using LSP, with emphasis on the projections of document collections. In order to be able to project documents using LSP, it is necessary to determine a way to measure the dissimilarities

Fig. 2. Projection of a document collection composed of scientific papers in four different areas (colors indicate the areas). (a) Whole map. (b) Zoomed part.

among documents. Here, we employ the vector space model [33] to represent the documents as vectors in a multidimensional space and a cosine-based distance, as defined in [34], to determine the dissimilarities among the documents as the distance between the vectors that represent them. In the vector space model, the terms that occur in the document collection are the space dimensions, and the frequencies of these terms in each document are the coordinates. The process of defining this set of terms normally involves three steps: 1) removing stopwords, that is, noninformative words such as articles, prepositions, and such, plus any words known to lack relevance to the context; 2) stemming, which reduces words to their radicals; and 3) frequency counting and weighting (where available) to remove terms that occur too sparsely or too often and hence have little differential capability. The first example aims at verifying that LSP keeps points (documents) considered to be in the same general area close to each other, for data sets with documents in various "general areas." We looked at the positioning of papers within a group to see whether papers previously known to address that subject and to have highly similar content were mapped close to each other. Fig. 2a shows a projection of a collection composed of 680 documents, which include the title, authors, abstract, and references of scientific papers in four different subjects: Case-Based Reasoning (CBR), Inductive Logic Programming (ILP), Information Retrieval (IR), and Sonification (SON). In this figure, the points were preclassified into one of these subjects by source and were colored according to that class. The points in red represent CBR articles, the points in yellow represent ILP articles, those in light blue represent IR articles, and those in dark blue represent SON articles.
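The vector space model and cosine-based dissimilarity described above can be sketched as follows. The tiny stopword list and all names are ours, purely for illustration; a real pipeline would also stem words and weight term frequencies:

```python
import numpy as np
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "in"}  # tiny illustrative list

def term_vectors(docs):
    """Bag-of-words document vectors after stopword removal
    (no stemming or frequency weighting in this sketch)."""
    tokens = [[w for w in doc.lower().split() if w not in STOPWORDS]
              for doc in docs]
    vocab = {w: i for i, w in
             enumerate(sorted({w for t in tokens for w in t}))}
    V = np.zeros((len(docs), len(vocab)))
    for i, t in enumerate(tokens):
        for w, c in Counter(t).items():
            V[i, vocab[w]] = c                # term frequency coordinate
    return V, vocab

def cosine_dist(u, v):
    """Cosine-based dissimilarity: 1 - cos(angle between term vectors)."""
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
```

Documents sharing no terms are at distance 1, identical term distributions at distance 0, which is the dissimilarity LSP consumes for text.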
It can be seen that the projection was able to position members of the same group together and also to separate the groups, with few “apparent” mismatches. Fig. 2b shows a view of part of Fig. 2a that mostly contains the SON papers. In this view, it is possible to see a subgroup, circled and identified as B, of IR articles inside the SON group. Closer examination reveals that these IR articles deal with “Audio Analysis,” showing that their positioning is actually more correct according to the goal of


Fig. 5. Part of Fig. 2a, with new labels.

Fig. 3. Other examples of document collection projections, colored based on the subject of the documents. (a) Projection of messages of three different discussion groups. (b) Projection of scientific papers in four distinct subjects.

Fig. 6. A projection of the voting records of members of the US House of Representatives.

Fig. 4. The same projection of Fig. 2a, but with four labels to help identify the content of the document collection.

the projection than simply matching the neighborhood of other papers classified as IR. In the case of the IR articles, the documents were retrieved from a Web search on the subject "IR," naturally leading to papers that are better qualified as something else. In another test, also shown in Fig. 2b, five papers coauthored by some of our team members were added, colored green, circled, and identified as A. These papers' main subject is sonification, and they share one or two authors with one another. They present the evolution of the same system, so they have reasonably high correlation in content. It can be seen that LSP has placed them well, positioning the five papers within the sonification group and close to one another. Fig. 3 shows other projections, also colored by manual classification, demonstrating the capability of the technique to map related text documents close together. Fig. 3a shows a map of a document collection composed of 300 messages from three different discussion groups [35]. Fig. 3b presents a projection of a data set obtained from an Internet repository that includes 1,624 scientific papers on the subjects of Bibliographic Coupling (BC), Cocitation Analysis (SC), Milgrams (MG), and Information Visualization (IV).2 The subsets BC and SC (red and blue in the picture, respectively) are not distinguishable, since they treat the

2. http://ella.slis.indiana.edu/~katy/outgoing/hitcite/{bc, sc, mb, iv}.txt.

same subjects, so they are plotted together, which is an expected outcome for a proper mapping. Aside from being able to create projections of different types of data, the software that implements LSP (PEx) has some mechanisms to help the user interact with the projections and extract useful information from them, particularly for displays of document collections. One tool available in PEx creates labels from covariant terms in order to help the user understand what topic is commonly present in a certain group. Fig. 4 shows the same map presented in Fig. 2a, but with four labels automatically generated by the tool. There, it is possible to see that the generated labels correspond to the four areas of the articles. Using the label information, the user can focus on a part of the projection or, in the case of documents, on a group of documents of interest, separate it, and examine it as an isolated projection. Fig. 5 shows the part of the projection identified as (information, retrieval). In this figure, some new labels are generated for the most noticeable clusters of points, showing coherence of content among the points present in the groups. LSP can also be applied to other data sets; it is only necessary to provide a dissimilarity function among the data instances. Fig. 6 shows a projection of the voting records of members of the US House of Representatives. The data was obtained from the 1984 Congressional Voting Records Database [35]. Each data instance contains information about the party affiliation of a representative, along with his/her voting record on 16 key issues. The blue points identify Democrats, and the red ones identify Republicans. It is possible to see that there is a separation between the two parties and that, for those issues, the Republicans seemed to vote in a more uniform fashion than the Democrats did.


Fig. 7. Comparing projections of the Iris data set using the Force Scheme on all points and the LSP with different numbers of control points. (a) Force Algorithm applied to all points. (b) LSP with 50 percent of control points. (c) LSP with 10 percent of control points.

4.1 Analysis of the Effect of the Control Points on the Projection

As discussed in Section 3.2, the choice of control points plays an important role in the LSP projection process. Here, we show examples designed to highlight the impact of the control points on the quality of the final projection. In summary, small changes in the choice and positioning of control points do not affect the quality of the layout, whereas the number of control points and a strong bias in their choice do.

First, let us observe that as the number of control points approaches the number of points to be projected, the final projection generated by LSP becomes closer to the projection generated by the MDS technique employed to place the control points in the first step of the projection. Fig. 7 shows three distinct projections of the Iris data set.3 Fig. 7a shows a projection generated using the MDS technique (Force Scheme) for all points. Fig. 7b shows an LSP projection generated using 50 percent of the data points as control points, and Fig. 7c shows another LSP projection, employing only 10 percent of the data points as control points. In that picture, when the number of control points dropped to 10 percent, the groups did not change, but points became closer to one another in the layout; that is, they clustered more tightly together.

The quality of LSP projections remains consistent, provided that the amount and the distribution of control points are reasonable. As far as decreases in the quality of the final layout are concerned, in our tests, almost all of them are related to choosing a set of control points that does not match the data distribution well. Two such situations are the most relevant: 1) selecting a set of control points that does not contain at least one representative of an important group of points in the original data, and 2) selecting a set of control points that has a strong bias in the original space, for instance, all of them lying approximately on a line.
The first situation is exemplified in Fig. 8a, which shows a projection of the Iris data set. Colors on the projection indicate the flowers’ classes. In order to generate such a projection, all control points were intentionally chosen from the green and blue classes; the red class was not represented by any control point. That bias induced poor positioning of the red class, impairing its distinction due to extensive overlap. Although this scenario is a possibility, using a basic clustering technique to choose control points, as suggested here (see Section 3.2), will make it very unusual for representatives of important groups not

3. The Iris data set [35] is composed of 150 instances of flowers of three different classes (Setosa, Versicolour, and Virginica), with 50 flowers of each class.


Fig. 8. Problems that can occur with strong bias in the choice or positioning of control points. (a) No control points represent the red group. (b) All control points lie on a line.

to be part of the control point set. That scenario is more likely when too few data points are chosen as control points. In our tests, a number of control points approaching 10 percent of the data set has always proved to be enough.

We have also developed an example to illustrate the second problem with LSP regarding a bad choice of control points. Since the interpolation strategy employed by LSP is based on the idea of placing each point in the convex hull of its neighbors, if the control points are projected onto a line, the final layout will also degenerate toward a line. In Fig. 8b, we first project the control points onto a line to simulate a real (yet unlikely) case where that might happen, and then, we perform the interpolation of the remaining data points. Again, this scenario only happens if the selected control points lie on a straight line in the original data space or if their initial placement in the target 2D space is that poor. Indeed, these scenarios are more likely if the total number of data points is small (data sets with fewer than 200 points). In these situations, other projection techniques are preferable, such as Sammon’s Mapping or a straight FDP model such as the Force Scheme (both are fast enough to display small data sets in real time, and both are also available in PEx).

Since the resulting layout depends on the set of control points, if they change, the final layout produced by LSP will probably be different, depending on the way that they change. This is an effect of the choice of the MDS technique for the initial layout of control points. Since that positioning is normally based on an optimization process, two projections are similar if the amount of information lost during control point placement is the same for both (for instance, in terms of stress). In other words, if the placement algorithm for control points is not affected by minor changes, neither is the resulting LSP layout.
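The convex hull property behind this behavior can be sketched as a small least squares system. This is a simplified illustration with names of our own choosing: uniform neighbor weights are assumed, whereas the actual method may weight neighbors by their similarity. Each point contributes one row pulling it toward the average of its neighbors, and extra rows pin the control points to their precomputed 2D positions; the stacked, overdetermined system is solved in the least squares sense.

```python
import numpy as np

def lsp_layout(n, neighbors, ctrl_idx, ctrl_pos):
    """Simplified sketch of LSP's interpolation step. Row i of the
    Laplacian-like block encodes x_i - mean(neighbors of i) = 0, so each
    point is placed inside the convex hull of its neighbors; additional
    rows tie the control points to their given positions."""
    L = np.zeros((n, n))
    for i, nbrs in neighbors.items():
        L[i, i] = 1.0
        for j in nbrs:
            L[i, j] = -1.0 / len(nbrs)
    C = np.zeros((len(ctrl_idx), n))
    for row, i in enumerate(ctrl_idx):
        C[row, i] = 1.0
    A = np.vstack([L, C])
    b = np.vstack([np.zeros((n, 2)), np.asarray(ctrl_pos, dtype=float)])
    X, *_ = np.linalg.lstsq(A, b, rcond=None)
    return X

# A chain of four points with its endpoints as control points: the two
# interior points land between them, ordered along the segment.
nbrs = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
X = lsp_layout(4, nbrs, [0, 3], [[0.0, 0.0], [3.0, 0.0]])
```

Note that the control rows act as soft constraints, so the control points end up only approximately at their given positions, which reflects the least squares character of the method. It also makes the degenerate case visible: if all pinned positions lie on a line, every interpolated point inherits that line.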

4.2 Evaluating the Results

A common way of evaluating the quality of a projection in order to compare projections analytically is known as stress [36]. Stress aims at measuring the amount of information lost during projection as the difference between the dissimilarities in the m-dimensional space and the distances in the d-dimensional space. The stress function defined by Kruskal [36] is presented as follows:

$$S = \sqrt{\frac{\sum_{i<j} \bigl(d(f(x_i), f(x_j)) - \delta(x_i, x_j)\bigr)^2}{\sum_{i<j} d(f(x_i), f(x_j))^2}}$$

where $f$ is the projection mapping, $\delta(x_i, x_j)$ the dissimilarity between two instances in the original space, and $d(f(x_i), f(x_j))$ the distance between their projections.
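Given the two pairwise matrices, the measure can be computed directly. This is a sketch with names of our own choosing, assuming the Kruskal stress-1 normalization by the projected distances:

```python
import numpy as np

def kruskal_stress(D_high, D_low):
    """Kruskal stress-1 between pairwise dissimilarities in the original
    space (D_high) and distances in the projection (D_low). Both are
    symmetric n x n matrices; only the i < j pairs are used."""
    iu = np.triu_indices_from(np.asarray(D_high), k=1)
    diff = np.asarray(D_low)[iu] - np.asarray(D_high)[iu]
    return np.sqrt(np.sum(diff ** 2) / np.sum(np.asarray(D_low)[iu] ** 2))

# A perfect projection (identical distance matrices) has zero stress.
D = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])
print(kruskal_stress(D, D))  # 0.0
```

Lower values indicate a more faithful projection; zero means the projected distances reproduce the original dissimilarities exactly.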
