Tree Preserving Embedding

Albert D. Shieh [email protected]
Tatsunori B. Hashimoto [email protected]
Edoardo M. Airoldi [email protected]
Department of Statistics & FAS Center for Systems Biology, Harvard University, Cambridge, MA 02138

Abstract

Visualization techniques for complex data are a workhorse of modern scientific pursuits. The goal of visualization is to embed high-dimensional data in a low-dimensional space while preserving structure in the data relevant to exploratory data analysis, such as clusters. However, existing visualization methods often either fail to separate clusters due to the crowding problem or can only separate clusters at a single resolution. Here, we develop a new approach to visualization, tree preserving embedding (TPE). Our approach uses the topological notion of connectedness to separate clusters at all resolutions. We provide a formal guarantee of cluster separation for our approach that holds for finite samples. Our approach requires no parameters and can handle general types of data, making it easy to use in practice.

1. Introduction

Visualization is an important first step in the analysis of high-dimensional data. High-dimensional data often has low intrinsic dimensionality, making it possible to embed the data in a low-dimensional space while preserving much of its structure. However, it is rarely possible to preserve all types of structure in the embedding. Therefore, dimensionality reduction methods can only aim to preserve particular types of structure. Linear methods such as principal component analysis (PCA) and multidimensional scaling (MDS) (Mardia et al., 1979) preserve global distances, while nonlinear methods such as manifold learning (Tenenbaum et al., 2000; Roweis & Saul, 2000; Belkin & Niyogi, 2003) preserve local distances defined by kernels or neighborhood graphs.

However, most dimensionality reduction methods fail to preserve clusters (Venna et al., 2010), which are often of greatest interest. Clusters are difficult to preserve in embeddings due to the so-called crowding problem (van der Maaten & Hinton, 2008). When the intrinsic dimensionality of the data exceeds the embedding dimensionality, there is not enough space in the embedding to allow clusters to separate. Therefore, clusters are forced to collapse on top of each other in the embedding. As the embedding dimensionality increases, there is more space in the embedding for clusters to separate and the crowding problem disappears, making it possible to preserve clusters exactly (Roth et al., 2003). However, since the embedding dimensionality is at most two or three for visualization purposes, the crowding problem is prevalent in practice. When the clusters are known, they can be used to guide the embedding to avoid the crowding problem (Xing et al., 2002). However, the clusters are often difficult to find; in fact, the embedding is often used to help find the clusters in the first place. Therefore, it is important to solve the crowding problem without knowledge of the clusters. Force-based methods such as stochastic neighbor embedding (SNE) (Hinton & Roweis, 2003), variants of SNE (Cook et al., 2007; van der Maaten & Hinton, 2008; Carreira-Perpiñán, 2010; Venna et al., 2010), and local MDS (Chen & Buja, 2009) have been proposed to overcome the crowding problem. Force-based methods use attractive forces to pull together similar points and repulsive forces to push apart dissimilar points. SNE and its variants use forces based on kernels, while local MDS uses forces based on neighborhood graphs. Force-based methods have long been used in graph drawing to separate clusters (Di Battista et al., 1998; Kaufmann & Wagner, 2001). Although force-based methods are effective, it is difficult to balance the relative strength of attractive and repulsive forces. When repulsive forces are too weak, they will fail to separate clusters, but when repulsive forces are too strong, they will artificially create clusters. Therefore, force-based methods are sensitive to intrinsic resolution parameters, such as kernel bandwidths and neighborhood graph sizes, that control the amount of separation between points in the embedding.

We introduce tree preserving embedding (TPE) in order to overcome the limitations of force-based methods. TPE aims to preserve both distances and clusters by preserving the single linkage (SL) dendrogram in the embedding. SL is a hierarchical clustering method that merges clusters with minimum nearest neighbor distance. The SL dendrogram is the associated tree with clusters as vertices and merge distances as vertex heights. TPE preserves the SL dendrogram in the sense that both the data and the embedding have the same SL dendrogram. Embeddings and dendrograms have long been used as representations for dissimilarities (Shepard, 1980). However, there is no guarantee that embeddings and dendrograms will be consistent when used separately. In particular, clusters found by dendrograms may not be found in embeddings due to the crowding problem. TPE combines embeddings and dendrograms in a common representation.

[Figure 1 shows two panels, "SL Dendrogram" (vertical axis: Height) and "Embedding" (axes: Dimension 1, Dimension 2), for five numbered points.]

Figure 1. Relation between the SL dendrogram and connectedness in the embedding. Cutting the SL dendrogram at a height of ε = 0.3 produces three clusters of ε-connected points. Points 3 and 5 are ε-connected by point 4 because the ε-ball centered at point 4 contains points 3 and 5, while points 1 and 2 are not ε-connected to any other points because they are not contained in any ε-balls centered at other points.

Preserving the SL dendrogram is a natural choice for several reasons. First, the SL dendrogram is the only dendrogram consistent with the topology of the minimum spanning tree (MST) (Gower & Ross, 1969; Zadeh & Ben-David, 2009). Preserving the topologies of neighborhood graphs has been shown to help overcome the crowding problem (Shaw & Jebara, 2009). However, while the topologies of neighborhood graphs such as the MST can only be preserved approximately in general (Eades, 1996), we show that the SL dendrogram can be preserved exactly. Second, the SL dendrogram represents both global and local structure due to its hierarchical nature. Preserving global structure allows TPE to separate clusters, while preserving local structure prevents TPE from artificially creating clusters. Finally, TPE can separate clusters even when the SL dendrogram cannot. Although SL is often criticized as a clustering method for finding poor clusters in practice (Hartigan, 1975), it finds poor clusters due to the instability of cutting the SL dendrogram at a particular height (Stuetzle, 2003). Since TPE preserves the SL dendrogram at all heights, it is not sensitive to the instabilities of the SL dendrogram at any particular height.

We make cluster separation in TPE precise using the topological notion of connectedness. A natural and commonly used notion of a cluster is a set of points that are connected at a particular resolution. It is well known that the SL dendrogram finds clusters of connected points at different resolutions for different heights (Hartigan, 1975). We show that TPE preserves connectedness in the sense that points in the embedding are connected if and only if they are connected in the data. Preserving connectedness guarantees that clusters separated in the data remain separated in the embedding. Thus, TPE is guaranteed to separate clusters at all resolutions, rather than at a single resolution.

Besides its theoretical basis, TPE has several attractive features in practice. First, in contrast to many nonlinear dimensionality reduction methods that require careful parameter selection to perform well, TPE requires no parameters. Second, TPE inherits the generality of MDS, which can handle data represented only in terms of dissimilarities rather than as vectors. Dissimilarities arise in many applications where vector representations are either poor or unavailable. Finally, since the merge order of the SL dendrogram is preserved under monotonic transformations of the dissimilarities, TPE is more sensitive to the rank order of the dissimilarities than to their values. Therefore, TPE can handle non-metric dissimilarities.
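To illustrate this last point: because the SL merge order depends only on the ranks of the dissimilarities, any monotone transformation leaves the dendrogram topology unchanged. The following is a minimal sketch of this invariance using scipy's single linkage (our own illustration, not code from the paper):

    import numpy as np
    from scipy.spatial.distance import pdist
    from scipy.cluster.hierarchy import linkage

    rng = np.random.default_rng(0)
    points = rng.normal(size=(20, 5))
    d = pdist(points)  # condensed pairwise dissimilarities

    # Single linkage on the original and on monotonically transformed
    # dissimilarities: the merge order (tree topology) is identical;
    # only the merge heights change.
    z1 = linkage(d, method="single")
    z2 = linkage(d ** 2, method="single")  # squaring is monotone for d >= 0

    assert np.array_equal(z1[:, :2], z2[:, :2])  # same merge order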

2. Tree Preserving Embedding

In this section, we introduce TPE as a set of constraints in MDS that preserve the SL dendrogram. The constraints arise from a characterization of the SL dendrogram using a notion of connectedness. We introduce an algorithm similar to hierarchical clustering to implement TPE. In order to make the algorithm practical, we propose a variant based on a greedy approximation that maintains the constraints. Finally, we show that TPE preserves connectedness in a precise sense that corresponds well with separating clusters.

2.1. Algorithm

TPE preserves the SL dendrogram in MDS. Given an $n \times n$ dissimilarity matrix $D$ for a set of $n$ objects $S = \{1, \ldots, n\}$, MDS (Buja et al., 2008) finds an embedding $X = \{x_1, \ldots, x_n\} \subseteq \mathbb{R}^p$ of the objects in $p$ dimensions that minimizes the stress function

$$\sigma(X) = \sum_{x_i, x_j \in X} (d_{i,j}(X) - D_{i,j})^2,$$

the sum of squared errors between the Euclidean distances $d_{i,j}(X) = \|x_i - x_j\|$ in the embedding and the dissimilarities $D_{i,j}$. In general, stress is a natural objective that aims to preserve the underlying metric of the dissimilarities. However, since stress emphasizes approximating large dissimilarities well, minimizing stress without constraints on the embedding leads to the crowding problem. TPE preserves the SL dendrogram in order to overcome the crowding problem. SL is a hierarchical clustering method that iteratively merges pairs of clusters $A, B \subseteq S$ with minimum nearest neighbor distance

$$d(A, B) = \min_{i \in A, j \in B} D_{i,j},$$

starting with singleton clusters and ending with the trivial cluster.

The SL dendrogram is the associated binary tree of depth $n - 1$ with singleton clusters as leaf vertices, the trivial cluster as the root vertex, merged clusters as internal vertices, and merge distances as vertex heights. There are many equivalent characterizations of the SL dendrogram (Hartigan, 1985). We use the following notion of connectedness to express the SL dendrogram as a set of constraints on pairs of objects.

Definition 1. Objects $i, j \in S$ are ε-connected if there exists a path $\alpha_1 = i, \ldots, \alpha_m = j \in S$ such that $D_{\alpha_l, \alpha_{l+1}} \le \varepsilon$ for $l = 1, \ldots, m - 1$.

Definition 2. Points $x_i, x_j \in X$ are ε-connected if there exists a path $x_{\alpha_1} = x_i, \ldots, x_{\alpha_m} = x_j \in X$ such that $d_{\alpha_l, \alpha_{l+1}}(X) \le \varepsilon$ for $l = 1, \ldots, m - 1$.

Intuitively, objects are connected if there exists a path with short hops between them. The SL dendrogram contains the paths with short hops between objects. Cutting the SL dendrogram at a height of ε produces a partition of the objects into clusters with heights at most ε. It is well known that cutting the SL dendrogram at a height of ε produces clusters of ε-connected objects (Hartigan, 1975). Therefore, objects are ε-connected if there exists a path of vertices with heights at most ε between their singleton clusters in the SL dendrogram.
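To make Definitions 1 and 2 concrete: two objects are ε-connected exactly when they fall in the same connected component of the graph whose edges are the pairs with dissimilarity at most ε. Below is a minimal sketch of this check (our own illustration, assuming a dense numpy dissimilarity matrix):

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import connected_components

    def epsilon_components(D, eps):
        # Objects i, j are eps-connected iff they lie in the same
        # connected component of the graph with an edge wherever
        # D[i, j] <= eps; this is cutting the SL dendrogram at eps.
        adjacency = csr_matrix(D <= eps)
        _, labels = connected_components(adjacency, directed=False)
        return labels

    # Three points on a line: 0 and 2 are 1.0-connected through 1,
    # even though D[0, 2] = 2.0 > 1.0.
    D = np.array([[0.0, 1.0, 2.0],
                  [1.0, 0.0, 1.0],
                  [2.0, 1.0, 0.0]])
    print(epsilon_components(D, 1.0))  # -> [0 0 0], a single cluster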

Algorithm 1 Tree Preserving Embedding
Input: dissimilarity matrix $D$, embedding dimensionality $p$
Output: embedding $X$
  // Initialize clusters
  Set $S_1 = \{1\}, \ldots, S_n = \{n\}$
  Set $I_1 = \{1, \ldots, n\}$
  Set $X_1 = \{x_1 = 0\}, \ldots, X_n = \{x_n = 0\}$
  for $k = 1$ to $n - 1$ do
    // Find the next cluster merge
    Set $a_k, b_k = \arg\min_{a, b \in I_k} d(S_a, S_b)$ s.t. $a \ne b$
    Set $d_k = d(S_{a_k}, S_{b_k})$
    // Merge the clusters
    Set $S_{n+k} = S_{a_k} \cup S_{b_k}$
    Set $I_{k+1} = \{i \in I_k : i \ne a_k, b_k\} \cup \{n + k\}$
    // Find the ultrametric distances for the merged cluster
    for $i, j \in S_{n+k}$ do
      Set $U_{n+k,i,j} = U_{a_k,i,j}$ if $i, j \in S_{a_k}$; $U_{b_k,i,j}$ if $i, j \in S_{b_k}$; $d_k$ otherwise
    end for
    // Embed the merged cluster
    Set $X_{n+k} = \arg\min_{X = \{x_i \in \mathbb{R}^p : i \in S_{n+k}\}} \sigma(X)$
      s.t. $x_i, x_j$ are $U_{n+k,i,j}$-connected $\forall i, j \in S_{n+k}$
           $d_{i,j}(X) \ge U_{n+k,i,j}$ $\forall i, j \in S_{n+k}$
  end for
  Return $X_{2n-1}$
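The merge bookkeeping in Algorithm 1, finding $a_k, b_k, d_k$ and propagating the ultrametric distances $U$, is straightforward to replay in code. Below is a minimal sketch (our own illustration; the constrained stress minimization that actually embeds each merged cluster, the hard part, is left as a comment):

    import numpy as np

    def sl_merge_bookkeeping(D):
        # Replay the merges of Algorithm 1 on a dissimilarity matrix D,
        # returning the merge sequence and the sub-dominant ultrametric U.
        n = D.shape[0]
        clusters = {i: [i] for i in range(n)}  # active cluster -> members
        U = np.zeros_like(D)
        merges = []
        for k in range(n - 1):
            # Find the pair of active clusters with minimum nearest
            # neighbor distance d(A, B) = min_{i in A, j in B} D[i, j].
            best = None
            active = list(clusters)
            for s, a in enumerate(active):
                for b in active[s + 1:]:
                    d_ab = D[np.ix_(clusters[a], clusters[b])].min()
                    if best is None or d_ab < best[0]:
                        best = (d_ab, a, b)
            d_k, a, b = best
            # Pairs joined at this iteration get ultrametric distance d_k.
            for i in clusters[a]:
                for j in clusters[b]:
                    U[i, j] = U[j, i] = d_k
            clusters[a] = clusters[a] + clusters.pop(b)
            merges.append((a, b, d_k))
            # The constrained embedding of the merged cluster (minimize
            # stress subject to the connectedness constraints) would
            # happen here.
        return merges, U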

The relation between the SL dendrogram and connectedness in an embedding is illustrated in Figure 1. Cluster merges connect objects in the SL dendrogram. The ultrametric distance between objects is the distance at which they are merged into the same cluster in the SL dendrogram (Johnson, 1967). It is well known that the ultrametric distance in the SL dendrogram is equivalent to the sub-dominant ultrametric distance

$$U_{i,j} = \min_{\alpha_1 = i, \ldots, \alpha_m = j} \ \max_{l=1}^{m-1} D_{\alpha_l, \alpha_{l+1}},$$

the maximum hop in a minimum path between the objects (Carlsson & Mémoli, 2010). The SL dendrogram can be characterized by two constraints on each pair of objects. First, each pair of objects must be connected by its ultrametric distance. Second, each pair of objects must not be connected by any distance less than its ultrametric distance. The first constraint guarantees that clusters are merged at the same distances as the SL dendrogram, while the second constraint guarantees that clusters are merged in the same order as the SL dendrogram. TPE uses the constraints on pairs of objects as constraints on the associated pairs of points in the embedding.

Algorithm 1 implements TPE. The algorithm proceeds similarly to hierarchical clustering. There are $n - 1$ iterations, one for each depth of the SL dendrogram. At each iteration, a pair of clusters is merged and the merged cluster is embedded by minimizing stress subject to the connectedness constraints. At the last iteration, the trivial cluster is embedded and returned. The number of objects being embedded changes at each iteration depending on the size of the merged cluster. Since the embeddings at each iteration are independent, only the embedding at the last iteration is needed. However, earlier embeddings can be used to help initialize later embeddings in practice.

The connectedness constraints in the optimization problem may appear too rigid to allow TPE to find a low stress embedding, since each pair of objects can be connected by an arbitrary sequence of objects. However, the connectedness constraints do not specify that the paths connecting pairs of points must be the same as the paths connecting the associated pairs of objects. Preserving the paths that fulfill the connectedness constraints would preserve the MST, which is not possible in general (Eades, 1996). Moreover, the flexibility in choosing the paths to fulfill the connectedness constraints allows points to move freely in the embedding to lower stress. However, this flexibility comes at a cost: due to the combinatorial nature of choosing paths, the optimization problem is intractable. Nevertheless, we can obtain a tractable approximation by restricting the types of paths that can be chosen.
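Because the SL merge distances coincide with the MST edge lengths (Gower & Ross, 1969), the sub-dominant ultrametric can also be computed by processing the MST edges in Kruskal order. A minimal sketch follows (our own illustration, assuming strictly positive off-diagonal dissimilarities, since scipy's sparse MST routine treats zero entries as missing edges):

    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree

    def subdominant_ultrametric(D):
        # U[i, j] is the maximum hop on the minimax path between i and j,
        # i.e. the largest MST edge on the unique tree path between them.
        n = D.shape[0]
        mst = minimum_spanning_tree(D).toarray()
        edges = sorted(
            (mst[i, j], i, j)
            for i in range(n) for j in range(n) if mst[i, j] > 0
        )
        U = np.zeros((n, n))
        label = list(range(n))              # object -> component id
        members = {i: [i] for i in range(n)}
        for w, i, j in edges:               # Kruskal order = SL merge order
            a, b = label[i], label[j]
            for u in members[a]:            # every newly connected pair
                for v in members[b]:        # merges at exactly height w
                    U[u, v] = U[v, u] = w
            members[a] += members.pop(b)
            for v in members[a]:
                label[v] = a
        return U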

2.2. Greedy Approximation

The optimization problem allows all points in a merged cluster to be rearranged in the embedding at each iteration. However, since each cluster being merged has already been embedded in prior iterations, it is wasteful to allow each cluster to be rearranged internally. The connectedness constraints within the clusters are already fulfilled in the prior embeddings. Since connectedness is preserved under rigid transformations, if we restrict the placement of the clusters to rigid transformations, then we only need to fulfill the connectedness constraints between the clusters. The paths that fulfill the connectedness constraints between the clusters must pass through their nearest neighbors. Therefore, placing the clusters exactly their merge distance apart fulfills the connectedness constraints between them. The remaining flexibility in placing the clusters can be used to minimize the stress between them.

In place of the optimization problem at each iteration $k$, the greedy approximation proceeds as follows. We find a rigid transformation that aligns the prior embeddings of the clusters while keeping them separated by their merge distance:

$$T^* = \arg\min_{T \in E(p)} \sum_{x_i \in X_{a_k}, x_j \in X_{b_k}} (d_{i,j}(T) - D_{i,j})^2 \quad \text{s.t.} \quad \min_{x_i \in X_{a_k}, x_j \in X_{b_k}} d_{i,j}(T) = d_k,$$

where $d_{i,j}(T) = \|T(x_i) - x_j\|$ is the Euclidean distance in the embedding after alignment and $E(p)$ is the set of rigid transformations in $p$ dimensions. We align the clusters by setting $x_i = T^*(x_i)$ for all $x_i \in X_{a_k}$ and return the merged cluster $X_{n+k} = X_{a_k} \cup X_{b_k}$.

The greedy approximation is reminiscent of Procrustes analysis (Gower & Dijksterhuis, 2004), which can be used to merge different embeddings (Quist & Yona, 2004). However, Procrustes analysis aligns embeddings without constraints, making it sensitive to the crowding problem. In contrast, the greedy approximation has a constraint that keeps the clusters separated in order to preserve the SL dendrogram. The constraint makes the greedy approximation more difficult to solve than Procrustes analysis. Nevertheless, the greedy approximation can be solved efficiently in practice using constrained optimization methods.

The efficiency of the greedy approximation comes at the cost of sensitivity to the merge order of the SL dendrogram. Since the greedy optimization cannot change the shapes of the clusters from prior embeddings, it cannot always align the clusters well. For small cluster merges, the prior embeddings will have little effect on the alignment. However, for large cluster merges, there may not be enough empty space in the prior embeddings to allow the clusters to be aligned well. Nevertheless, we found that a good alignment of the clusters can usually be found in practice.

The greedy approximation has a time complexity of $O(n^3)$, since there are $O(n)$ iterations, each of which requires minimizing the $O(n^2)$ stress between the clusters in order to find the alignment. This is the same time complexity as dimensionality reduction methods based on a spectral decomposition of a dissimilarity matrix. While the cubic time complexity may be prohibitive for some applications, methods developed to improve the scalability of MDS, such as landmark points (de Silva & Tenenbaum, 2003), can also be applied to TPE in principle.
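Returning to the alignment step: the following deliberately simplified sketch only translates one cluster along the line through the current nearest-neighbor pair so that this pair sits exactly the merge distance $d_k$ apart. The full greedy step would also search over rotations and reflections in $E(p)$ to minimize the between-cluster stress, and would re-verify the minimum-distance constraint over all pairs (our own illustration, not the authors' implementation):

    import numpy as np

    def place_apart(X_a, X_b, d_k):
        # Translate cluster X_b so that its nearest-neighbor pair with
        # X_a sits exactly d_k apart, fulfilling the separation
        # constraint for that pair.
        dists = np.linalg.norm(X_a[:, None, :] - X_b[None, :, :], axis=2)
        i, j = np.unravel_index(np.argmin(dists), dists.shape)
        direction = X_b[j] - X_a[i]
        norm = np.linalg.norm(direction)
        if norm == 0.0:                    # degenerate overlap: pick an axis
            direction = np.eye(X_a.shape[1])[0]
            norm = 1.0
        shift = (d_k - norm) * (direction / norm)
        # Caution: when d_k < norm this pulls the clusters together, and a
        # different pair could end up closer than d_k; a real implementation
        # must re-check the constraint over all pairs after the move.
        return X_b + shift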

[Figure 2 shows three scatterplot panels, "TPE", "Non-metric MDS", and "t-SNE", each plotting Dimension 2 against Dimension 1.]

Figure 2. Embeddings of protein sequences by TPE, non-metric MDS, and t-SNE. Each point is a protein sequence labeled by the domain it belongs to, where 'A' denotes Archaea, 'B' denotes Bacteria, and 'E' denotes Eukaryota.

2.3. Connectedness

TPE preserves clusters at all resolutions rather than just a single resolution due to the hierarchical nature of the SL dendrogram. Since the clusters found by cutting the SL dendrogram at different heights can be characterized by connectedness at different resolutions, TPE preserves connectedness in the following sense.

Theorem 1. For any ε > 0, points $x_i, x_j \in X$ are ε-connected if and only if objects $i, j \in S$ are ε-connected.

Proof. First, we show that if objects $i, j \in S$ are ε-connected, then points $x_i, x_j \in X$ are ε-connected. We know that $x_i, x_j$ are δ-connected by the distance $\delta = d_k$ at which $x_i, x_j$ were merged into the same cluster at some iteration $k$. Let $\alpha_1, \ldots, \alpha_m \in S$ be the path that ε-connects $i, j$. Since $i \in S_{a_k}$ and $j \in S_{b_k}$, there exists some $l$ such that $\alpha_l \in S_{a_k}$ and $\alpha_{l+1} \in S_{b_k}$. Since $\alpha_l$ and $\alpha_{l+1}$ are in different clusters, $\delta \le D_{\alpha_l, \alpha_{l+1}} \le \varepsilon$. Therefore, $x_i, x_j$ are ε-connected.

Now, we show that if points $x_i, x_j \in X$ are ε-connected, then objects $i, j \in S$ are ε-connected. Without loss of generality, let $x_i, x_j$ be points merged into the same cluster at iteration $k$ such that $x_i, x_j$ are ε-connected by the distance $\varepsilon = d_k$. Let $x_{\alpha_1}, \ldots, x_{\alpha_m} \in X$ be the path that ε-connects $x_i, x_j$. We know that there exists some $l$ such that $d_{\alpha_l, \alpha_{l+1}}(X) = d_k$. Let $k_l \le k$ be the minimum number of iterations such that $\alpha_1, \ldots, \alpha_l$ are in the same cluster. Then $\alpha_1, \ldots, \alpha_l$ are δ-connected by $\delta = d_{k_l}$. Since $d_k$ is monotonically increasing in $k$, $\delta \le \varepsilon$.

Therefore, $\alpha_1, \ldots, \alpha_l$ are ε-connected. Similarly, $\alpha_{l+1}, \ldots, \alpha_m$ are ε-connected. Since $\alpha_l$ and $\alpha_{l+1}$ are in different clusters, $D_{\alpha_l, \alpha_{l+1}} \le \varepsilon$. Therefore, $i, j$ are ε-connected.

TPE preserves connectedness in the sense that points are connected in the embedding if and only if they are connected in the data. Clusters at any resolution can be neither too close together nor too far apart in the embedding without violating connectedness at some resolution. Therefore, preserving connectedness guarantees that clusters separated in the data remain separated in the embedding. Preserving connectedness is of more than just theoretical interest: since connectedness applies to finite samples, it provides a formal guarantee of cluster separation for TPE in practice. To our knowledge, TPE is the first method with a formal guarantee of this kind.

Preserving connectedness can be used to obtain distance bounds on points in the embedding. Points in the embedding can be no closer than the merge distance between their clusters and no further apart than the total merge distances between and within their clusters. Therefore, points that are merged into the same cluster earlier in the SL dendrogram will tend to be closer in the embedding.

Theorem 2. We have

$$\max_{l=1}^{m-1} D_{\alpha_l, \alpha_{l+1}} \le d_{i,j}(X) \le \sum_{l=1}^{m-1} D_{\alpha_l, \alpha_{l+1}},$$

where $\alpha_1 = i, \ldots, \alpha_m = j$ is the path between the objects in the MST.

[Figure 3 shows three scatterplot panels, "TPE", "PCA", and "t-SNE", each plotting Dimension 2 against Dimension 1.]

Figure 3. Embeddings of radar signals by TPE, PCA, and t-SNE. Each point is a radar signal labeled by its quality, where 'G' denotes a good radar signal and 'B' denotes a bad radar signal.

Proof. Let $x_i, x_j$ be points merged into the same cluster at iteration $k$. Since the merge distances of the SL dendrogram are equivalent to the edge lengths of the MST (Gower & Ross, 1969), the lower bound is equivalent to the merge distance between the clusters $a_k, b_k$ and the upper bound is equivalent to the total merge distances between and within the clusters $a_k, b_k$. Therefore, the lower and upper bounds follow from the connectedness constraints.

The distance bounds can be used to obtain an upper bound on the stress of the embedding. Therefore, preserving connectedness prevents the embedding from having arbitrarily high stress.

Corollary 1. We have

$$\sigma(X) \le \sum_{x_i, x_j \in X} \max\{(D_{i,j} - U_{i,j})^2, (D_{i,j} - L_{i,j})^2\},$$

where $U_{i,j}$ and $L_{i,j}$ denote the lower and upper distance bounds of Theorem 2, respectively.

3. Results

In this section, we demonstrate the applicability of TPE by analyzing simulated and real examples drawn from molecular biology, signal processing, and computer vision. Rather than being exhaustive, our goal is to highlight some of the properties of TPE through each example. We compare TPE to traditional methods (PCA, non-metric MDS, and Isomap (Tenenbaum et al., 2000)) and to a force-based method, t-SNE (van der Maaten & Hinton, 2008), which recent studies have found separates clusters well (Venna et al., 2010). We demonstrate that TPE most robustly preserves different types of structure.

3.1. Protein Sequences

As a first example, we analyzed 124 protein sequences of 3-phosphoglycerate kinases (3-PGKs) belonging to the domains Archaea, Bacteria, and Eukaryota, collected from public databases by Pollack et al. (2005). Since protein sequences cannot be represented as vectors, methods such as PCA that require such a representation cannot be used. Finding a good metric for protein sequences is a difficult and longstanding problem (Atchley et al., 2005). We used sequence alignment scores from the basic local alignment search tool (BLAST) (Altschul et al., 1990) as dissimilarities. Since BLAST scores can be highly non-metric (Roth et al., 2003), they are notoriously difficult to embed without collapsing points on top of each other. Figure 2 shows the embeddings by TPE, non-metric MDS, and t-SNE. TPE clearly separates all three domains, while non-metric MDS and t-SNE mix members of different domains together.

3.2. Radar Signals

As a second example, we analyzed 351 radar signals targeting free electrons in the ionosphere, collected by Sigillito et al. (1989). Each radar signal consisted of 34 integer- and real-valued measurements. We treated each radar signal as a 34-dimensional vector and used Euclidean distances as dissimilarities. Good radar signals were defined as those that returned evidence of free electrons in the ionosphere, while bad radar signals were defined as those that passed through the ionosphere and returned background noise. Therefore, good radar signals are highly similar, while bad radar signals can be highly dissimilar. Figure 3 shows the embeddings by TPE, PCA, and t-SNE. TPE clearly separates good and bad radar signals, while PCA and t-SNE mix them together.

[Figure 4 shows three scatterplot panels, "TPE", "PCA", and "t-SNE", each plotting Dimension 2 against Dimension 1.]

Figure 4. Embeddings of handwritten digits by TPE, PCA, and t-SNE. Each point is an image labeled by the digit it represents.

3.3. Handwritten Digits

As a third example, we analyzed 1000 images of handwritten digits collected by the United States Postal Service (Hull, 1994). Each image was 16 × 16 pixels in greyscale. We treated each image as a 256-dimensional vector and used Euclidean distances as dissimilarities. Since the intrinsic dimensionality of handwritten digits is thought to be much higher than two or three (Saul & Roweis, 2003), it is notoriously difficult to separate all ten digits in an embedding due to the crowding problem. Figure 4 shows the embeddings by TPE, PCA, and t-SNE. TPE and t-SNE separate all ten digits, while PCA only separates some.

3.4. Stability

In order to test the sensitivity of TPE to sampling variability, we generated 100 samples of the barbell, a popular example of a two-dimensional non-convex manifold (Saul & Roweis, 2003). For each sample, we generated 150 points as follows: we sampled 50 points each from multivariate normal distributions with means (0, 0) and (10, 10) and standard deviation 1, and 50 points with coordinates (u, u) + z, where u is sampled uniformly from the interval [0, 10] and z is sampled from a multivariate normal distribution with mean (0, 0) and standard deviation 0.05. The barbell is notoriously difficult to embed due to the presence of both clusters in the bells and continuity in the bar (Saul & Roweis, 2003), making it a good example to test for stability. We compared TPE, Isomap, and t-SNE by using Procrustes analysis to align their embeddings to the exact embedding.

The average sample variance of the coordinates of the points in the embeddings was 3.45 for the exact embedding, 2.72 for TPE, 3.40 for Isomap, and 984.54 for t-SNE. TPE and Isomap had stability comparable to the exact embedding, while t-SNE was two orders of magnitude less stable than either.
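For reference, the barbell sampling described above is easy to reproduce; a minimal numpy sketch (our own illustration):

    import numpy as np

    def sample_barbell(rng):
        # 50 points per bell plus 50 points along the bar: 150 in total.
        bell1 = rng.normal(loc=(0.0, 0.0), scale=1.0, size=(50, 2))
        bell2 = rng.normal(loc=(10.0, 10.0), scale=1.0, size=(50, 2))
        u = rng.uniform(0.0, 10.0, size=50)
        bar = np.stack([u, u], axis=1) + rng.normal(scale=0.05, size=(50, 2))
        return np.vstack([bell1, bell2, bar])

    samples = [sample_barbell(np.random.default_rng(s)) for s in range(100)]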

4. Discussion

Revealing clusters is one of the main goals of visualization. However, most dimensionality reduction methods have difficulty preserving clusters due to the crowding problem. In three difficult examples, TPE separated the clusters of interest well compared to other dimensionality reduction methods. It is important to emphasize that the success of TPE is not a mere consequence of the ability of the SL dendrogram to separate clusters: in all of the examples, the clusters found by cutting the SL dendrogram were no better than random clusters in terms of accuracy with respect to the clusters of interest. TPE succeeds by preserving clusters at all resolutions rather than at a single resolution.

TPE is a promising approach to visualization because it has a formal guarantee of cluster separation, requires no parameters, and can handle general types of data. However, a few issues limit its applicability. First, TPE has a cubic time complexity, which can be prohibitively slow for large data sets. Second, since TPE only provides an embedding rather than a mapping, it cannot be applied to out-of-sample data. Finally, although we have found that the greedy approximation works well in practice, better optimization methods may significantly improve the performance of TPE. We hope that these issues will be addressed by future research.


References

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. Basic local alignment search tool. J Mol Biol, 215:403–410, 1990.
Atchley, W. R., Zhao, J., Fernandes, A. D., and Drüke, T. Solving the protein sequence metric problem. Proc Natl Acad Sci USA, 102:6395–6400, 2005.
Belkin, M. and Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput, 15:1373–1396, 2003.
Buja, A., Swayne, D. F., Littman, M. L., Dean, N., Hofmann, H., and Chen, L. Data visualization with multidimensional scaling. J Comput Graph Stat, 17:444–472, 2008.
Carlsson, G. and Mémoli, F. Characterization, stability and convergence of hierarchical clustering methods. J Mach Learn Res, 11:1425–1470, 2010.
Carreira-Perpiñán, M. A. The elastic embedding algorithm for dimensionality reduction. In Proceedings of the 27th International Conference on Machine Learning, 2010.
Chen, L. and Buja, A. Local multidimensional scaling for nonlinear dimension reduction, graph drawing, and proximity analysis. J Am Stat Assoc, 104:209–219, 2009.
Cook, J. A., Sutskever, I., Mnih, A., and Hinton, G. E. Visualizing similarity data with a mixture of maps. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics, 2007.
de Silva, V. and Tenenbaum, J. B. Global versus local methods for nonlinear dimensionality reduction. In Advances in Neural Information Processing Systems, volume 15, 2003.
Di Battista, G., Eades, P., Tamassia, R., and Tollis, I. G. Graph Drawing: Algorithms for the Visualization of Graphs. Prentice Hall, New York, 1998.
Eades, P. The realization problem for Euclidean minimum spanning trees is NP-hard. Algorithmica, 16:60–82, 1996.
Gower, J. C. and Dijksterhuis, G. B. Procrustes Problems. Oxford University Press, Oxford, 2004.
Gower, J. C. and Ross, G. J. S. Minimum spanning trees and single linkage cluster analysis. Appl Stat, 18:54–64, 1969.
Hartigan, J. A. Clustering Algorithms. Wiley, New York, 1975.
Hartigan, J. A. Statistical theory in clustering. J Classif, 2:63–76, 1985.
Hinton, G. E. and Roweis, S. T. Stochastic neighbor embedding. In Advances in Neural Information Processing Systems, volume 15, 2003.
Hull, J. J. A database for handwritten text recognition research. IEEE T Pattern Anal, 16:550–554, 1994.
Johnson, S. C. Hierarchical clustering schemes. Psychometrika, 32:241–254, 1967.
Kaufmann, M. and Wagner, D. Drawing Graphs: Methods and Models. Springer-Verlag, New York, 2001.
Mardia, K. V., Kent, J. T., and Bibby, J. M. Multivariate Analysis. Academic Press, London, 1979.
Pollack, J. D., Li, Q., and Pearl, D. K. Taxonomic utility of a phylogenetic analysis of phosphoglycerate kinase proteins of Archaea, Bacteria, and Eukaryota: Insights by Bayesian analyses. Mol Phylogenet Evol, 35:420–430, 2005.
Quist, M. and Yona, G. Distributional scaling: An algorithm for structure-preserving embedding of metric and non-metric spaces. J Mach Learn Res, 5:399–420, 2004.
Roth, V., Laub, J., Kawanabe, M., and Buhmann, J. M. Optimal cluster preserving embedding of nonmetric proximity data. IEEE T Pattern Anal, 25:1540–1551, 2003.
Roweis, S. T. and Saul, L. K. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.
Saul, L. K. and Roweis, S. T. Think globally, fit locally: Unsupervised learning of low dimensional manifolds. J Mach Learn Res, 4:119–155, 2003.
Shaw, B. and Jebara, T. Structure preserving embedding. In Proceedings of the 26th International Conference on Machine Learning, 2009.
Shepard, R. N. Multidimensional scaling, tree-fitting, and clustering. Science, 210:390–398, 1980.
Sigillito, V. G., Wing, S. P., Hutton, L. V., and Baker, K. B. Classification of radar returns from the ionosphere using neural networks. Johns Hopkins APL Tech Dig, 10:262–266, 1989.
Stuetzle, W. Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. J Classif, 20:25–47, 2003.
Tenenbaum, J. B., de Silva, V., and Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, 2000.
van der Maaten, L. and Hinton, G. E. Visualizing data using t-SNE. J Mach Learn Res, 9:2579–2605, 2008.
Venna, J., Peltonen, J., Nybo, K., Aidos, H., and Kaski, S. Information retrieval perspective to nonlinear dimensionality reduction for data visualization. J Mach Learn Res, 11:451–490, 2010.
Xing, E. P., Ng, A. Y., Jordan, M. I., and Russell, S. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems, volume 14, 2002.
Zadeh, R. B. and Ben-David, S. A uniqueness theorem for clustering. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 2009.
