Journal of Machine Learning Research (2012)

Submitted 11/12; Published x/xx

A Convex Approach for Learning Near-Isometric Linear Embeddings

Chinmay Hegde

[email protected]

CSAIL, MIT, Cambridge, MA 02139

Aswin C. Sankaranarayanan

[email protected]

ECE Department, Rice University, Houston, TX 77005

Wotao Yin

[email protected]

CAAM Department, Rice University, Houston, TX 77005

Richard G. Baraniuk

[email protected]

ECE Department, Rice University, Houston, TX 77005

Editor: TBD

Abstract

We propose a novel framework for the deterministic construction of linear, near-isometric embeddings of a finite set of data points. Given a set of training points X ⊂ R^N, we consider the secant set S(X) that consists of all pairwise difference vectors of X, normalized to lie on the unit sphere. We formulate an affine rank minimization problem to construct a matrix Ψ that preserves the norms of all the vectors in S(X) up to a distortion parameter δ. While affine rank minimization is NP-hard, we show that this problem can be relaxed to a convex formulation that can be solved using a tractable semidefinite program (SDP). In order to enable scalability of our proposed SDP to very large-scale problems, we adopt a two-stage approach. First, in order to reduce compute time, we develop a novel algorithm based on the Alternating Direction Method of Multipliers (ADMM) that we call Nuclear norm minimization with Max-norm constraints (NuMax) to solve the SDP. Second, we develop a greedy, approximate version of NuMax based on the column generation method commonly used to solve large-scale linear programs. We demonstrate that our framework is useful for a number of applications in machine learning and signal processing via a range of experiments on large-scale synthetic and real datasets.

Keywords: Linear Embeddings, Compressive Sensing, Approximate Nearest Neighbors

1. Introduction

In many machine learning and signal processing applications, we seek a low-dimensional representation (or embedding) of data that are elements of a high-dimensional ambient space. The classical approach to constructing such an embedding is principal components analysis (PCA) (Moore, 1981), which involves linearly mapping the N-dimensional data into the K-dimensional subspace spanned by the dominant eigenvectors of the data covariance matrix, typically with K ≪ N. A key appeal of PCA is its computational efficiency; it can be very efficiently performed using a singular value decomposition (SVD) on the data.


Another key appeal is its generalizability; PCA produces a smooth, globally defined mapping that can be easily applied to unseen, out-of-sample test data points. Nevertheless, PCA has an important drawback: the produced embedding can arbitrarily distort pairwise distances between sample data points. This phenomenon is exacerbated when the data arise from a nonlinear submanifold of the signal space (Achlioptas, 2001). Due to this behavior, PCA can potentially map two distinct points in the ambient signal space to a single point in the low-dimensional embedding space, rendering them indistinguishable. This hampers the application of PCA-like techniques to important signal processing problems such as reconstruction and parameter estimation.

An alternative to PCA is the approach of random projections. Consider X, a cloud of Q points in a high-dimensional Euclidean space R^N. The Johnson-Lindenstrauss Lemma (Johnson and Lindenstrauss, 1982) states that X can be linearly mapped to a subspace of dimension M = O(log Q) with minimal distortion of the Q(Q − 1)/2 pairwise distances between the Q points (in other words, the mapping is near-isometric). Further, this linear mapping can be easily implemented in practice; one simply constructs a matrix Φ ∈ R^{M×N} with M ≪ N whose elements are drawn randomly from a certain probability distribution. Then, with high probability, Φ is near-isometric under a certain lower bound on M (Johnson and Lindenstrauss, 1982; Achlioptas, 2001). This approach can be extended to more general signal classes beyond finite point clouds. For example, random linear projections provably preserve, up to a given distortion, all pairwise distances between points lying on compact, differentiable low-dimensional manifolds (Baraniuk and Wakin, 2009; Clarkson, 2008) as well as pairwise distances between all sparse signals (Candès, 2006; Baraniuk et al., 2008). This intuition is a fundamental component of compressive sensing (CS), an emergent framework for signal acquisition and reconstruction.

Despite their simplicity, random projections suffer from certain shortcomings. Their theoretical guarantees are probabilistic and asymptotic. Further, a random mapping is independent of the data under consideration and hence cannot leverage any special geometric structure of the data if present.

In this paper, we propose a novel deterministic framework for constructing linear, near-isometric embeddings of a finite high-dimensional dataset. Given a set of training points X ⊂ R^N, we consider the secant set S(X) consisting of all pairwise difference vectors of X normalized to lie on the unit sphere. We formulate an affine rank minimization problem (3) to construct a matrix Ψ that preserves the norms of all of the vectors in S(X) up to a desired distortion parameter δ. The affine rank minimization problem is known to be NP-hard, and so we perform a convex relaxation to obtain a trace-norm minimization (4), which is equivalent to a tractable semidefinite program (SDP). The SDP (4) can be solved using any generic interior-point method for convex programming (for example, the solvers SDPT3 (Tütüncü et al., 2003) and SeDuMi (Polik, 2010)). However, the convergence of such generic SDP solvers, although guaranteed, is typically very slow, even for small problem sizes.
Further, the presence of the max-norm constraints in (4), though convex, precludes the direct application of existing first-order methods for large-scale semidefinite programming (Wen, 2009).

Our specific contributions are three-fold. First, as mentioned above, we propose a convex optimization framework for learning near-isometric linear embeddings for finite datasets. Second, we develop a novel algorithm that we call Nuclear norm minimization with Max-norm constraints (NuMax) to solve (4).

NuMax is based on the Alternating Direction Method of Multipliers (ADMM); it decouples the complex SDP formulation (4) into a sequence of easy-to-solve subproblems and enables much faster rates of convergence than standard approaches. Third, in order to achieve scalability to very large-scale problems, we propose a modified, greedy version of NuMax that mirrors the column generation approach commonly used to solve large-scale linear programs (Dantzig and Wolfe, 1960). With this modification, NuMax can efficiently solve problems where the number of elements in the secant set S(X), i.e., the number of constraints in (4), is extremely large (e.g., 10^7 or greater).

We demonstrate that the NuMax framework is useful for a number of applications in machine learning and signal processing. First, if the training set X comprises sufficiently many points that are uniformly drawn from a low-dimensional smooth manifold M, then we show that the matrix Ψ represents a near-isometric linear embedding over all pairwise secants of M. In other words, Ψ satisfies the restricted isometry property (RIP) for signals belonging to M and therefore enables the design of efficient measurement matrices for the compressive sensing of manifold-modeled datasets. Second, since the embedding Ψ (approximately) preserves all pairwise secants in the training set X, it is also guaranteed to (approximately) preserve nearest neighbors of all points of X. Therefore, NuMax produces an efficient method to design linear hash functions for high-dimensional data retrieval. Third, by carefully pruning the secant set S(X), we can tailor Ψ for more general signal inference tasks, such as supervised binary classification. Several numerical experiments in Section 5 demonstrate the advantages of our approach.

We organize this paper as follows. In Section 2 we provide a brief background on existing methods for linear dimensionality reduction and highlight some connections with compressive sensing. In Section 3 we introduce our main theoretical contributions and propose the SDP formulation for designing "good" linear embeddings. In Section 4 we develop efficient algorithms that can solve our proposed SDP for large-scale problems. In Section 5 we apply our linear embedding framework to a number of diverse problems and demonstrate its efficiency both on synthetic and real-world datasets. In Section 6 we provide concluding remarks and list potential directions for future work.

2. Background

2.1 Notation

For the rest of this paper, we will exclusively work with real-valued vectors and matrices, although our techniques can be extended to the complex case mutatis mutandis. We use lowercase boldface letters to denote vectors, uppercase boldface letters to denote matrices, and calligraphic letters to denote sets or set-valued operators. The ℓ_p-norm of a vector x = [x_1, ..., x_N]^T ∈ R^N is defined as

    ‖x‖_p = (Σ_{i=1}^N |x_i|^p)^{1/p}  for p ∈ [1, ∞),    and    ‖x‖_∞ = max_{i=1,2,...,N} |x_i|  for p = ∞.

Given two symmetric matrices X, Y ∈ R^{N×N}, we write X ⪰ Y if the matrix X − Y is positive semidefinite (PSD). Denote the singular value decomposition (SVD) of a matrix X ∈ R^{N×N} as X = UΣV^T, where Σ = diag(σ) is a diagonal, non-negative matrix and σ is the vector of (sorted) singular values of X. The rank of X is equal to the number of nonzero entries in σ. The Frobenius norm of X, denoted by ‖X‖_F, is the square root of the sum of squared entries of X, or equivalently, the ℓ_2-norm of σ. The nuclear norm of X, denoted by ‖X‖_*, is equal to the sum of its singular values, or equivalently, the ℓ_1-norm of σ. When X is a positive semidefinite (PSD) symmetric matrix, ‖X‖_* is equal to the trace, or the sum of diagonal values, of X.

2.2 Principal Components Analysis (PCA)

Consider a set of Q data vectors X = {x_1, x_2, ..., x_Q} ⊂ R^N, where N, Q are potentially very large. We group the elements of X as columns of the matrix X ∈ R^{N×Q}, which we term the data matrix. Given such a data matrix, a natural question is whether the Q points can be projected or embedded into a lower-dimensional space R^M, M < N, with minimal distortion. Such an embedding would enable tractable storage and processing of the dataset without sacrificing accuracy.

One such embedding can be obtained via a popular statistical technique known as principal components analysis (PCA), also known as the Karhunen-Loève Transform (Moore, 1981). PCA proceeds as follows: given X, we perform an SVD of X, i.e., compute X = UΣV^T, and then linearly project the columns of X onto the subspace spanned by the r leftmost columns of U (termed the PCA basis vectors). Denote the projected columns by the rank-r matrix X_r (of size N × Q). Then, it is well known that X_r is the optimal rank-r approximation to X in terms of the Frobenius norm, i.e., ‖X − X_r‖_F ≤ ‖X − Y‖_F for any other rank-r matrix Y. Since the Frobenius norm measures the error aggregated over all columns of a matrix, PCA can be viewed as an efficient linear embedding method that incurs minimal distortion of the data on average. PCA is ubiquitous in machine learning and statistics and has been successfully applied for signal acquisition, compression, estimation, and classification (Dony and Haykin, 1995; Tipping and Bishop, 1999).

PCA is conceptually simple and efficient. The primary computational requirements for PCA merely involve performing the SVD of the data matrix. Additionally, owing to its linear nature, PCA can easily be extended to out-of-sample points: given a training dataset X, a low-dimensional embedding of a new, unseen data vector x′ ∉ X simply involves a linear projection of x′ onto the PCA basis vectors. Further, PCA can be adapted to account for problem-specific requirements. For example, if the data vectors originate from one of two classes and the goal is to maintain class separability, then PCA can be modified to produce new, related techniques such as Fisher's Linear Discriminant Analysis (LDA) or Factor Analysis (Fisher, 1936; Harman, 1976). These properties highlight the considerable advantages of PCA when compared to more sophisticated nonlinear data embedding (or "manifold learning") methods that have emerged over the last decade (Tenenbaum et al., 2000; Roweis and Saul, 2000; Donoho and Grimes, 2003; Weinberger and Saul, 2006).
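As a concrete companion to the PCA construction just described, the following minimal sketch (our own illustration in NumPy; the toy data, dimensions, and the distortion check are not from the paper) computes a rank-r PCA embedding via the SVD and measures the worst-case distortion of pairwise squared distances, the quantity that motivates the drawbacks discussed next.

```python
import numpy as np

rng = np.random.default_rng(0)
N, Q, r = 50, 200, 5                              # ambient dim, number of points, PCA rank

# Toy data: points near an r-dimensional subspace plus a little noise (illustrative only).
X = rng.standard_normal((N, r)) @ rng.standard_normal((r, Q)) \
    + 0.05 * rng.standard_normal((N, Q))

# PCA via the SVD of the centered data matrix; keep the r leftmost columns of U.
Xc = X - X.mean(axis=1, keepdims=True)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Ur = U[:, :r]                                     # PCA basis vectors
Y = Ur.T @ Xc                                     # r-dimensional embedding

# Worst-case distortion of pairwise (squared) distances under the PCA map.
worst = 0.0
for i in range(Q):
    for j in range(i + 1, Q):
        d_full = np.sum((Xc[:, i] - Xc[:, j]) ** 2)
        d_emb = np.sum((Y[:, i] - Y[:, j]) ** 2)
        worst = max(worst, abs(d_emb - d_full) / d_full)
print("empirical isometry constant of the PCA embedding:", worst)
```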

Nevertheless, the convenience of PCA is balanced by certain drawbacks. Crucially, the optimality of PCA is not accompanied by any guarantees regarding the local geometric properties of the resulting embedding (Achlioptas, 2001). Therefore, any information contained in the geometric inter-relationships between data points is irrevocably lost. One can easily generate examples of datasets where the distance between the PCA embeddings of two distinct high-dimensional points is vanishingly small. In other words, PCA embeddings are not guaranteed to be isometric (i.e., distance-preserving) or even invertible. This profoundly affects both algorithm design and analysis.

2.3 Random Projections

The problem of constructing low-dimensional isometric embeddings of data, i.e., embeddings that preserve all pairwise distances between data points, has been studied in depth and is quickly becoming classical (for an excellent introduction to this subject, see Linial et al. (1995)). Concretely, we seek an embedding that satisfies the following relaxed notion of isometry:

Definition 1 Suppose M ≤ N and consider X ⊂ R^N. An embedding operator P : X → R^M satisfies the restricted isometry property (RIP) on X if there exists a positive constant δ > 0 such that, for every x, x′ in X, the following relations hold:

    (1 − δ)‖x − x′‖_2^2  ≤  ‖Px − Px′‖_2^2  ≤  (1 + δ)‖x − x′‖_2^2.                  (1)

The quantity δ encapsulates the deviation from perfect isometry and is called the isometry constant. We (trivially) observe that the identity operator on X always satisfies the RIP with δ = 0; however, in this case M = N. It is less clear whether embedding operators that satisfy the RIP even exist for M < N. The celebrated Johnson-Lindenstrauss (JL) Lemma answers this question in the affirmative (Johnson and Lindenstrauss, 1982). A simplified version of the JL Lemma is as follows.

Lemma 1 Consider a dataset X = {x_1, x_2, ..., x_Q} ⊂ R^N. Let M ≥ O(δ^{-2} log Q). Construct a matrix Φ ∈ R^{M×N} by drawing each element of Φ independently from a Gaussian distribution with zero mean and variance 1/M. Then, with high probability, the linear operator Φ : R^N → R^M satisfies the RIP on X.

Thanks to Lemma 1, we can construct a (near) isometric embedding of a dataset X, with high probability, simply by forming a matrix Φ whose elements are chosen randomly from a Gaussian distribution. We highlight some important features of the JL Lemma. First, like PCA, the linear embedding Φ is computationally efficient and can be applied to out-of-sample points. Second, unlike PCA, the constructed embedding Φ is universal, i.e., fully independent of the dataset X. Instead of projecting the data onto the basis vectors of the subspace formed by the singular vectors of X, one simply picks a few basis vectors at random and projects the data onto these vectors. The JL Lemma guarantees that such an embedding preserves the local geometric structure of X. Third, the dimension M of the lower-dimensional embedding is only logarithmic in the number of data points and is independent of the ambient dimension N; therefore, potentially M ≪ N.
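The JL construction of Lemma 1 is equally easy to try empirically. The sketch below is our own illustration (the point cloud and the choice of constants are arbitrary): it draws a Gaussian matrix with variance 1/M and reports the worst-case deviation from isometry over all pairwise differences.

```python
import numpy as np

rng = np.random.default_rng(1)
N, Q, delta = 1000, 50, 0.3
M = int(np.ceil(delta ** -2 * np.log(Q)))         # JL-style choice of M (up to constants)

X = rng.standard_normal((N, Q))                   # arbitrary point cloud
Phi = rng.standard_normal((M, N)) / np.sqrt(M)    # entries ~ N(0, 1/M)

worst = 0.0
for i in range(Q):
    for j in range(i + 1, Q):
        v = X[:, i] - X[:, j]
        ratio = np.sum((Phi @ v) ** 2) / np.sum(v ** 2)
        worst = max(worst, abs(ratio - 1.0))
print(f"M = {M}, empirical isometry constant = {worst:.3f}")
```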

The method of random projections can be extended to more general signal classes beyond finite point clouds. For example, random linear projections provably satisfy the RIP for data modeled as compact, differentiable low-dimensional submanifolds (Baraniuk and Wakin, 2009; Clarkson, 2008). A particularly striking connection has been made with compressive sensing (CS), an emergent paradigm for efficient acquisition and processing of K-sparse signals, i.e., signals that can be expressed as the sum of only K elements from a basis (Baraniuk et al., 2008). The central result of CS asserts that if a matrix Φ ∈ R^{M×N} satisfies the RIP on the set of all K-sparse signals, then it is possible to stably recover a sparse signal x from the linear embedding (or "measurements") y = Φx, even when M is only proportional to K log(N/K). Further, this recovery can be achieved efficiently via a convex program or a greedy pursuit (Candès, 2006; Donoho, 2006; Tropp and Gilbert, 2007). From a practical signal acquisition perspective, it is even possible to build practical signal acquisition systems where the embedding y = Φx is performed in real-time (Wakin et al., 2006; Laska et al., 2007).

Random projections provide a simple method to construct embeddings that satisfy the RIP for arbitrary datasets. It can be shown that, in the worst case for a given isometry constant δ, there exist datasets that cannot be embedded into any M-dimensional space where M ≤ δ^{-2} log(δ^{-1}) log Q (Alon, 2003). However, this worst case only occurs for a specific configuration of data points that rarely arises in practice. Further, the universality property of random projections precludes them from constructing embeddings that leverage the intrinsic geometry of a given set of data vectors. In light of the above, we pose the following question: for a specific dataset X and a parameter δ, what is the dimension of the smallest possible space into which X can be embedded with distortion δ? Our proposed framework in Section 3 takes some initial steps towards answering this question.

2.4 Secant-Preserving Embeddings

Our motivation for constructing efficient near-isometric linear embeddings stems from a result in differential geometry known as Whitney's Embedding Theorem (Hirsch, 1976). The theorem states that any smooth K-dimensional submanifold M ⊂ R^N can be embedded into R^{2K+1} such that the geometric structure of M is preserved. An important notion in the proof of Whitney's Theorem is the normalized secant manifold of M:

    S(M) = { (x − x′) / ‖x − x′‖_2 : x, x′ ∈ M, x ≠ x′ }.                  (2)

The secant manifold S(M) is a 2K-dimensional submanifold of the (N − 1)-dimensional unit sphere in R^N. Any point on the unit sphere corresponds to a projection from R^N to R^{N−1}. Therefore, by choosing a projection direction that does not belong to S(M), one can map M into R^{N−1} injectively (i.e., without overlap). If 2K ≤ N − 1, then this is always possible. Whitney's (weak) Embedding Theorem formalizes this intuition; unfortunately, the proof of Whitney's Embedding Theorem is non-constructive and therefore does not lend itself to easy implementation in practice.

In Section 3 below, we develop an efficient computational framework to produce a low-dimensional linear embedding that satisfies the RIP (1) on a given dataset X. We build and improve upon the Whitney Reduction Network (WRN) approach for computing autoassociative graphs (Broomhead and Kirby, 2001, 2005). The WRN is a heuristic that is algorithmically similar to PCA; our approach, in contrast, follows a completely different path.

In Section 3 we develop an optimization formulation (4) based on convex programming that is guaranteed to produce a near-isometric, linear embedding. Then, in Section 4, we develop efficient algorithms to solve the optimization (4) for large-scale problems.
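Before turning to that formulation, note that the normalized secant set in (2), restricted to a finite dataset as used throughout Section 3, is straightforward to construct. A minimal sketch (our own; function and variable names are illustrative):

```python
import numpy as np

def secant_set(X):
    """Normalized secant set of a finite dataset, cf. (2): columns of X are the data
    points; returns an S x N array whose rows are (x_i - x_j)/||x_i - x_j||_2, i < j."""
    N, Q = X.shape
    secants = []
    for i in range(Q):
        for j in range(i + 1, Q):
            d = X[:, i] - X[:, j]
            nrm = np.linalg.norm(d)
            if nrm > 0:                            # skip duplicate points
                secants.append(d / nrm)
    return np.asarray(secants)

# Example on a small random point cloud (illustrative only).
rng = np.random.default_rng(2)
print(secant_set(rng.standard_normal((16, 10))).shape)   # (45, 16): Q(Q-1)/2 secants
```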

3. Near-Isometric Linear Embeddings

3.1 Optimization Framework

Given a dataset X ⊂ R^N, our goal is to find a linear embedding P : R^N → R^M, M ≪ N, that satisfies the RIP (1) on X with parameter δ > 0. Following Baraniuk and Wakin (2009), we will refer to δ as the isometry constant. We form the secant set S(X) using (2) to obtain a set of S = Q(Q − 1)/2 unit vectors S(X) = {v_1, v_2, ..., v_S}. Then, we seek a projection matrix Ψ ∈ R^{M×N} with as few rows as possible that satisfies the RIP on S(X).

We cast this problem in terms of an optimization over the space of PSD symmetric matrices. Let S^{N×N} be the set of symmetric N × N matrices. Define P := Ψ^T Ψ ∈ S^{N×N}; then, rank(P) = M. We also have the constraints that |‖Ψv_i‖_2^2 − 1| = |v_i^T P v_i − 1| is no greater than δ for every secant v_i in S(X). Let 1_S denote the S-dimensional all-ones vector, and let A denote the linear operator that maps a symmetric matrix X to the S-dimensional vector A : X ↦ {v_i^T X v_i}_{i=1}^S. Then, the matrix P that we seek is the solution to the optimization problem

    minimize    rank(P)
    subject to  ‖A(P) − 1_S‖_∞ ≤ δ,   P ⪰ 0.                  (3)

Rank minimization is a non-convex problem and is known to be NP-hard, in general. Therefore, following Fazel (2002), we propose to instead solve a nuclear-norm relaxation of (3):

    minimize    ‖P‖_*
    subject to  ‖A(P) − 1_S‖_∞ ≤ δ,   P ⪰ 0.                  (4)
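For small problem sizes, (4) can be prototyped directly with an off-the-shelf convex modeling package. The sketch below is our own illustration and assumes the cvxpy package with the SCS solver installed; it is not the NuMax algorithm developed in Section 4.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
N, Q, delta = 12, 8, 0.2

# Toy data and its normalized secant set (rows of V), cf. (2).
X = rng.standard_normal((N, Q))
V = []
for i in range(Q):
    for j in range(i + 1, Q):
        d = X[:, i] - X[:, j]
        V.append(d / np.linalg.norm(d))
V = np.asarray(V)                                   # S x N

# Relaxation (4): for a PSD matrix, the nuclear norm equals the trace.
P = cp.Variable((N, N), PSD=True)
AP = cp.sum(cp.multiply(V @ P, V), axis=1)          # A(P)_i = v_i^T P v_i
problem = cp.Problem(cp.Minimize(cp.trace(P)),
                     [cp.abs(AP - 1) <= delta])
problem.solve(solver=cp.SCS)

evals = np.linalg.eigvalsh(P.value)
print("numerical rank of P*:", int(np.sum(evals > 1e-6 * evals.max())))
```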

Since P is a PSD symmetric matrix, the nuclear norm of P is equal to its trace. Thus, the problem (4) consists of minimizing a linear objective function subject to linear inequality constraints over the cone of PSD symmetric matrices.¹ Hence, it is equivalent to a semidefinite program (SDP) and can be solved in polynomial time (Alizadeh, 1995). Once the solution P* = UΛU^T to (4) is found, rank(P*) determines the value of M, the dimensionality of the linear embedding. The desired linear embedding Ψ can then be calculated using a simple matrix square root

    Ψ = Λ_M^{1/2} U_M^T,                  (5)

where Λ_M = diag{λ_1, ..., λ_M} contains the M leading (non-zero) eigenvalues of P*, and U_M denotes the matrix of corresponding eigenvectors.

1. A similar minimization formulation, called PhaseLift, has been recently proposed in Candès et al. (2012), although in a different context; specifically, they focus on the classical problem of recovering high-dimensional signals from magnitude-only linear measurements.
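Given any PSD solution P*, whether from a generic SDP solver as above or from NuMax in Section 4, the square-root step (5) takes only a few lines of NumPy. A hedged sketch (names and tolerances are our own):

```python
import numpy as np

def embedding_from_P(P, tol=1e-6):
    """Extract Psi = Lambda_M^{1/2} U_M^T from a PSD matrix P, as in (5).
    M is the numerical rank of P; rows of the returned matrix span its range."""
    evals, evecs = np.linalg.eigh(P)
    keep = evals > tol * evals.max()                # leading (non-zero) eigenvalues
    return np.diag(np.sqrt(evals[keep])) @ evecs[:, keep].T

# Sanity check on a random PSD matrix: ||Psi v||^2 should equal v^T P v.
rng = np.random.default_rng(1)
B = rng.standard_normal((10, 3))
P = B @ B.T                                         # rank-3 PSD matrix
Psi = embedding_from_P(P)
v = rng.standard_normal(10)
print(Psi.shape[0], np.allclose(np.linalg.norm(Psi @ v) ** 2, v @ P @ v))
```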

In this manner, we obtain a low-rank matrix Ψ ∈ R^{M×N} that satisfies the RIP on the secant set S(X) with isometry constant δ. The convex optimization formulation (4) is conceptually very simple, the only inputs being the dataset X and the desired isometry constant δ > 0.

3.2 Analysis

Since we seek an embedding matrix Ψ with a minimal number of rows, a natural question to ask is whether the nuclear-norm relaxation (4) is guaranteed to produce solutions P* of minimum rank. The efficiency of nuclear-norm minimization for low-rank matrix recovery has been thoroughly examined in a number of different settings (Recht et al., 2010; Candès and Recht, 2012). However, we highlight two unique aspects of the optimization problem (4). First, the ℓ_∞-norm constraints in (4) are non-standard. Second, the best known theoretical results make certain restrictive assumptions on the linear operator A in (4); for example, one common assumption is that the entries of the matrix representation of A are independently drawn from a standard normal distribution. This assumption is clearly violated in our case, since A is a function of the secant set S(X) and in general depends heavily on the geometry of the data at hand. Nevertheless, the following classical result from semidefinite programming provides an upper bound on the rank of the optimum P* in (4).

Proposition 1 (Barvinok, 1995; Moscato et al., 1998) Let r* be the rank of the optimum to the SDP (4). Then,

    r* ≤ ⌈ (√(8|S(X)| + 1) − 1) / 2 ⌉.                  (6)

In essence, the rank of P* grows at most as the square root of the cardinality of the secant set S(X). The upper bound on the optimal rank r* provided in (6) can be loose, since the cardinality of S(X) can potentially be very large. Additionally, one might intuitively expect the optimal rank r* to depend upon the geometric arrangement of the data vectors in X, as well as the input isometry constant δ; however, the bound in Proposition 1 does not reflect this dependence. A full analytical characterization of the optimal rank obtained by the program (4) is of considerable interest both in theory and practice, but we will not pursue this direction further in this paper.

3.3 Linear Embeddings of Manifolds

The optimization framework (4) provides a novel method for producing an efficient, low-dimensional, linear embedding of an arbitrary high-dimensional dataset X. More specifically, suppose X comprises points that are sampled from a K-dimensional smooth, compact manifold M ⊂ R^N. In such a scenario, we can make a stronger claim: under certain assumptions on M and X, the near-isometry property of the proposed embedding Ψ holds not only on pairwise secants of X, but more generally on all pairwise secants of M. In other words, the near-isometry property of Ψ can be extended to new, unseen data from the underlying manifold M.

More precisely, suppose that Ψ satisfies the RIP with constant δ over a training set X ⊂ R^N. If the training set X represents a sufficiently high-resolution sampling of points on or close to M, then Ψ provably satisfies the RIP with a slightly larger constant over the entire manifold M; see Section 3.2.5 of Baraniuk and Wakin (2009) for a detailed derivation. We numerically validate this phenomenon on synthetic manifold-modeled data below in Section 5.

Several SDP-based approaches for processing manifold-modeled data have been developed in the literature (Weinberger and Saul, 2006; Kulis et al., 2007), but their focus has been on producing general nonlinear mappings of high-dimensional datasets into a low-dimensional space. In contrast, our optimization formulation (4) produces an explicit linear embedding operator Ψ ∈ R^{M×N}, thereby enabling easy application to out-of-sample, unseen data points. Furthermore, since Ψ satisfies the RIP for all signals belonging to M, Ψ can be interpreted as a highly optimized compressive sensing matrix specifically tailored to signals belonging to M. We explore this further in our numerical experiments below in Section 5.

An important challenge in such a context is the size of the sampled dataset X. The proof techniques of Baraniuk and Wakin (2009) assume that the training set X is an ε-cover of M, i.e., for every m ∈ M we have min_{x ∈ X} d_M(m, x) ≤ ε for a small constant ε > 0. However, covering results from high-dimensional geometry state that the cardinality of such a set X, in the worst case, can be exponential in the manifold dimension K, i.e.,

    |X| = O((1/δ)^K).

For practical real-world problems, computations involving such large training sets X may be intractable, and thus one may have to resort to heuristic sub-optimal methods. In Section 4 below, we develop one such sub-optimal but efficient method. Nevertheless, our approach can be viewed as an initial step towards obtaining provably efficient linear embeddings that preserve the isometric structure of arbitrary nonlinear submanifolds of the signal space.

3.4 Class-Specific Linear Embeddings

We observe that the inequality constraints in (4) are derived by enforcing an approximate isometry condition on all pairwise secants {v_i}_{i=1}^S. However, this could prove to be too restrictive for some applications. For example, consider a supervised classification scenario where the signal of interest x can arise from one of two classes that are each modeled by the low-dimensional manifolds M and M′. The goal is to infer the signal class of x from the linear embedding Ψx. Here, the embedding matrix Ψ should be designed such that signals from different classes are mapped to sufficiently distant points in the low-dimensional space. However, it does not matter if signals from the same class are mapped to the same point. Geometrically speaking, we seek a Ψ that preserves merely the inter-class secants

    S(M, M′) = { (x − x′) / ‖x − x′‖_2 : x ∈ M, x′ ∈ M′ }                  (7)

up to an isometry constant δ. This represents a reduced set of linear constraints in (4). Since the solution space of (4) is larger, the optimal P* will necessarily be of lower rank,

thus leading to a further reduction in the dimension of the embedding space. We examine this application further in our numerical simulations below in Section 5.
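The inter-class secant set (7) is constructed just like S(X), except that differences are taken only across the two classes. A minimal sketch (our own illustration):

```python
import numpy as np

def interclass_secants(X0, X1):
    """Normalized inter-class secants, cf. (7): differences taken only between points
    of two classes. X0 and X1 are N x Q0 and N x Q1 arrays of training points."""
    secants = []
    for i in range(X0.shape[1]):
        for j in range(X1.shape[1]):
            d = X0[:, i] - X1[:, j]
            secants.append(d / np.linalg.norm(d))
    return np.asarray(secants)

# Toy two-class example (illustrative only).
rng = np.random.default_rng(3)
A = rng.standard_normal((32, 20))
B = rng.standard_normal((32, 25)) + 2.0
print(interclass_secants(A, B).shape)              # (500, 32) inter-class secants only
```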

4. Efficient Algorithms for Designing Embeddings

The SDP (4) admits a tractable solution in polynomial time using interior-point methods. However, for a generic SDP with S constraints and a matrix variable of size N × N, interior-point methods incur memory costs that scale as O(S^2) and time-complexity costs that scale as O(N^6). Therefore, solving (4) using traditional SDP solvers (Tütüncü et al., 2003; Polik, 2010) quickly becomes infeasible. Here, we develop two algorithms that exploit the special structure of the optimization problem (4) to produce very efficient solutions at vastly reduced costs.

4.1 ADMM

We develop an efficient algorithm to solve (4) based on the Alternating Direction Method of Multipliers (ADMM). We dub our algorithm NuMax, an abbreviation for Nuclear norm minimization with Max-norm constraints. We rewrite (4) by introducing the auxiliary variables L ∈ S^{N×N} and q ∈ R^S to obtain the optimization problem

    min_{P, L, q}   ‖P‖_*
    subject to      P = L,   A(L) = q,   ‖q − 1_S‖_∞ ≤ δ,   P ⪰ 0.                  (8)

This approach can be viewed as an instance of the Douglas-Rachford variable splitting method in convex programming (Douglas and Rachford, 1956). Next, we relax the linear constraints and form an augmented Lagrangian of (8) as follows:

    min_{P, L, q}   ‖P‖_* + (β_1/2)‖P − L − Λ‖_F^2 + (β_2/2)‖A(L) − q − ω‖_2^2
    subject to      ‖q − 1_S‖_∞ ≤ δ,   P ⪰ 0.                  (9)

Here, the symmetric matrix Λ ∈ S^{N×N} and the vector ω ∈ R^S represent the scaled Lagrange multipliers. The optimization in (9) is carried out over the variables P, L ∈ S^{N×N} and q ∈ R^S, while Λ and ω are iteratively updated as well. Instead of jointly optimizing over all three variables, we optimize over one variable at a time while keeping the others fixed. That is, we can solve the optimization (9) via a sequence of three sub-problems, each of which admits a computationally efficient solution. Let the subscript k denote the estimate of a variable at the k-th iteration of the algorithm. The following steps are performed until convergence (a minimal runnable sketch of the full update loop is given after Algorithm 1).

1. Update q: Isolating the terms that involve q, we obtain a new estimate q_{k+1} as the solution of the constrained optimization problem

    q_{k+1} ← arg min_q  (β_2/2)‖A(L_k) − ω_k − q‖_2^2,   s.t.  ‖q − 1_S‖_∞ ≤ δ.

This problem has a closed-form solution using a component-wise truncation procedure for the entries in q. Denote z = A(L_k) − ω_k − 1_S. Then, it is easily seen that

    q_{k+1} = 1_S + sign(z) · min(|z|, δ),                  (10)

where the sign and min operators are applied component-wise. Therefore, this step can be performed in O(S) operations.

2. Update P: Isolating the terms that involve P, we obtain a new estimate P_{k+1} as the solution of the constrained optimization problem

    P_{k+1} ← arg min_P  ‖P‖_* + (β_1/2)‖P − L_k − Λ_k‖_F^2,   s.t.  P ⪰ 0.

This problem also admits an efficient closed-form solution via the singular value shrinkage operator (Ma et al., 2011). Denote P_0 = L_k + Λ_k and perform the SVD P_0 = UΣV^T, where Σ = diag(σ). Then, the optimum P_{k+1} can be expressed as

    P_{k+1} = U D_α(Σ) V^T,   D_α(Σ) = diag({(σ_i − α)_+}),                  (11)

where α = 1/β_1 and t_+ represents the positive part of t, i.e., t_+ = max(t, 0). The dominant computational cost for this update is incurred by performing the SVD of P_0 ∈ S^{N×N}; in general this step can be carried out in O(N^3) operations. This step can potentially be made even faster by using fast, approximate SVD methods (Halko et al., 2011).

3. Update L: Isolating the terms that involve L, we obtain a new estimate L_{k+1} as the solution of the unconstrained optimization problem

    L_{k+1} ← arg min_L  (β_1/2)‖P_k − L − Λ_k‖_F^2 + (β_2/2)‖A(L) − q_{k+1} − ω_k‖_2^2.                  (12)

This is a least-squares problem, and the minimum is achieved by solving the linear system of equations

    β_1 (P_k − L − Λ_k) = β_2 A*(A(L) − q_{k+1} − ω_k),                  (13)

where A* represents the adjoint of A. The dominant cost in this step arises due to the linear operator A*A; a single application of this operator incurs a complexity of O(N^2 S^2). The least-squares solution to (13) can be calculated using a number of existing methods for solving large-scale linear equations, such as conjugate gradients (Meijerink and van der Vorst, 1977; Liu and Nocedal, 1989).

4. Update Λ, ω: Finally, as is standard in augmented Lagrangian methods, we update the parameters Λ and ω according to the equations

    Λ_{k+1} ← Λ_k − η(P_k − L_k),   ω_{k+1} ← ω_k − η(A(L_k) − q_k).

The overall NuMax method is summarized in pseudocode form in Algorithm 1. The convergence properties of NuMax, both in terms of precision and speed, are affected by the user-defined parameters η, β_1, and β_2. In all of the experiments below in Section 5, we set η = 1.618 and β_1 = β_2 = 1.

Algorithm 1 NuMax
Inputs: Secant set S(X) = {v_i}_{i=1}^S, isometry constant δ
Parameters: Weights β_1, β_2, step size η
Output: Symmetric PSD matrix P̂
Initialize: P_0, L_0, ω_0, q_0, k ← 0, b ← 1_S, set A : X ↦ {v_i^T X v_i}_{i=1}^S
while not converged do
    k ← k + 1
    z ← A(L_k) − ω_k − b
    q_{k+1} ← b + sign(z) · min(|z|, δ)                          {Truncation}
    P_0 ← L_k + Λ_k,  P_0 = UΣV^T
    P_{k+1} ← U D_α(Σ) V^T                                       {Singular value shrinkage}
    Z ← β_2 A*(q_{k+1} + ω_k),  Z_0 ← β_1 (P_k − Λ_k)
    L_{k+1} ← (β_2 A*A + β_1 I)^† (Z + Z_0)                      {Least squares}
    Λ_{k+1} ← Λ_k − η(P_k − L_k)
    ω_{k+1} ← ω_k − η(A(L_k) − q_k)                              {Update Lagrange multipliers}
end while
P̂ ← P_k
return P̂
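The following is a minimal, self-contained sketch of the NuMax iterations, written by us for illustration; it mirrors the updates (10)-(13) and Algorithm 1 on a small dense problem, but makes no claim to match the authors' implementation or its tuning.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def numax(V, delta, beta1=1.0, beta2=1.0, eta=1.618, n_iter=300):
    """Illustrative NuMax (Algorithm 1) sketch for small, dense problems.
    V is an S x N array of unit-norm secants; returns a PSD matrix P."""
    S, N = V.shape
    ones_S = np.ones(S)

    def A_op(L):                               # A(L)_i = v_i^T L v_i
        return np.einsum('si,ij,sj->s', V, L, V)

    def A_adj(y):                              # A*(y) = sum_i y_i v_i v_i^T
        return (V * y[:, None]).T @ V

    def matvec(vec_l):                         # (beta2 A*A + beta1 I) applied to vec(L)
        Lmat = vec_l.reshape(N, N)
        return (beta2 * A_adj(A_op(Lmat)) + beta1 * Lmat).ravel()

    op = LinearOperator((N * N, N * N), matvec=matvec)
    P, L = np.eye(N), np.eye(N)
    Lam, omega = np.zeros((N, N)), np.zeros(S)

    for _ in range(n_iter):
        # Step 1 (10): project A(L_k) - omega_k onto the l_inf ball around 1_S.
        z = A_op(L) - omega - ones_S
        q = ones_S + np.sign(z) * np.minimum(np.abs(z), delta)

        # Step 2 (11): nuclear-norm prox over the PSD cone via eigenvalue shrinkage.
        P0 = L + Lam
        evals, evecs = np.linalg.eigh(0.5 * (P0 + P0.T))
        P = (evecs * np.maximum(evals - 1.0 / beta1, 0.0)) @ evecs.T

        # Step 3 (12)-(13): matrix-free conjugate gradients for the L-update.
        rhs = beta1 * (P - Lam) + beta2 * A_adj(q + omega)
        vec_l, _ = cg(op, rhs.ravel(), x0=L.ravel())
        L = vec_l.reshape(N, N)

        # Step 4: scaled Lagrange multiplier updates.
        Lam = Lam - eta * (P - L)
        omega = omega - eta * (A_op(L) - q)
    return P

# Tiny illustrative run (convergence behavior is not tuned).
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 8))
V = []
for i in range(8):
    for j in range(i + 1, 8):
        d = X[:, i] - X[:, j]
        V.append(d / np.linalg.norm(d))
P_hat = numax(np.asarray(V), delta=0.4)
print("numerical rank of P:", np.linalg.matrix_rank(P_hat, tol=1e-3))
```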

4.2 Column Generation

NuMax (Algorithm 1) dramatically decreases the time complexity of solving the SDP (4). However, for a problem with S input secants, the memory complexity of NuMax still remains O(S^2), and this could be prohibitive in applications involving millions (or billions) of secants. We now develop a heuristic optimization method that only approximately solves (4) but that scales very well to such problem sizes.

Our key idea is based on the Karush-Kuhn-Tucker (KKT) conditions describing the optimum of (4). Recall that (4) consists of optimizing a linear objective subject to inequality constraints over the cone of PSD matrices. Suppose that strong duality holds, i.e., the primal and dual optimal values of (4) are equal. Then, classical results in optimization theory (Boyd and Vandenberghe, 2004) state that complementary slackness holds and that the optimal solution is entirely specified by the set of constraints that hold with equality. Such constraints are also known as active constraints. We therefore propose a simple, greedy method to rapidly find the active constraints of (4). We prescribe the following steps:

1. Solve (9) with only a small subset S_0 of the input secants S(X) using NuMax (Alg. 1) to obtain an initial estimate P̂.

Identify the set Ŝ of secants that correspond to active constraints, i.e.,

    Ŝ ← {v_i ∈ S_0 : |v_i^T P̂ v_i − 1| = δ}.

2. Select additional secants S_1 ⊂ S that were not selected previously and identify all the secants among S_1 that violate the infinity-norm constraints at the current estimate P̂. Append these secants to the set of active constraints Ŝ to obtain an augmented set

    Ŝ ← Ŝ ∪ {v_i ∈ S_1 : |v_i^T P̂ v_i − 1| ≥ δ}.

3. Solve (4) with the augmented set Ŝ using NuMax (Alg. 1) to obtain a new estimate P̂.

4. Identify the secants that correspond to active constraints. Repeat Steps 2 and 3 until convergence is reached in the estimated optimal matrix P̂.

Instead of performing a single large numerical optimization procedure on the entire set of secants S(X), we perform a sequence of optimization procedures on small subsets of S(X). When the number of active constraints is a small fraction of the overall number of secants, the computational gains are significant. This approach is analogous to the column generation (CG) method used to solve very large-scale linear programs (Dantzig and Wolfe, 1960). Therefore, we dub our overall algorithm NuMax-CG; it is listed in pseudocode form in Algorithm 2. A key benefit of NuMax-CG is that the set of secants upon which NuMax acts within each iteration never needs to be explicitly stored in memory and can in fact be generated on the fly. This can lead to significant improvements in the memory complexity of the overall procedure. An important caveat is that we are no longer guaranteed to converge to the optimal solution of (4); nevertheless, as we see below in Section 5, NuMax-CG yields excellent results on massively sized real-world datasets.

In practice, evaluating the KKT conditions for NuMax (and NuMax-CG) is computationally expensive. As a consequence, we use a notion of infeasibility as our main halting criterion. Specifically, we measure the errors in the strict enforcement of the equality constraints P = L and q = A(L),

    e_1 = 2‖P − L‖_F / (‖P‖_F + ‖L‖_F),     e_2 = 2‖q − A(L)‖_2 / (‖q‖_2 + ‖A(L)‖_2).

When max(e_1, e_2) is less than a user-specified tolerance η, we proclaim convergence. For the numerical experiments below in Section 5, we use η = 5 × 10^{-5}.
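A compact sketch of the NuMax-CG outer loop is given below; it is our own illustration, treats the NuMax solver as a black box passed in by the caller, and, for simplicity, re-checks all remaining secants in each pass rather than streaming batches of size S″ as in Algorithm 2.

```python
import numpy as np

def numax_cg(secants, delta, numax, batch_size=1000, tol=1e-3):
    """Greedy column-generation wrapper in the spirit of Algorithm 2 (our sketch).
    `secants`: S x N array of unit-norm secants. `numax(V, delta)`: assumed callable
    returning a PSD matrix that solves (4) on the rows of V (e.g., Algorithm 1)."""
    S = secants.shape[0]
    rng = np.random.default_rng(0)
    active = secants[rng.choice(S, size=min(batch_size, S), replace=False)]
    P = numax(active, delta)
    while True:
        # For simplicity we re-check every secant; Algorithm 2 streams batches instead.
        resid = np.abs(np.einsum('si,ij,sj->s', secants, P, secants) - 1.0)
        violating = secants[resid > delta + tol]
        if violating.shape[0] == 0:
            return P                               # all constraints satisfied
        kept = np.abs(np.einsum('si,ij,sj->s', active, P, active) - 1.0) >= delta - tol
        active = np.vstack([active[kept], violating[:batch_size]])
        P = numax(active, delta)                   # re-solve on the augmented set

# Smoke test with a stub solver: the identity is feasible since secants have unit norm.
rng = np.random.default_rng(1)
W = rng.standard_normal((500, 20))
W /= np.linalg.norm(W, axis=1, keepdims=True)
P = numax_cg(W, delta=0.1, numax=lambda V, d: np.eye(V.shape[1]))
print(P.shape)
```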

5. Numerical Experiments

We illustrate the performance of our proposed NuMax framework and algorithms via a number of numerical experiments and show that our approach enables improved performance in machine learning applications such as approximate nearest neighbor (ANN)-based data retrieval and supervised binary classification. Recall that we use η = 1.618 and β_1 = β_2 = 1 for all our numerical simulations. Further, we use Alg. 1 (NuMax) when S, the number of secants, is less than 5000, and Alg. 2 (NuMax-CG) for larger problems. For the rest of this section, we use the terms "projections" and "measurements" interchangeably whenever the context is clear.

Algorithm 2 NuMax-CG
Inputs: Secant set S = {v_i}_{i=1}^S, isometry constant δ, the NuMax algorithm
Parameters: Sizes S′, S″ of the selected secant sets
Output: Symmetric PSD matrix P̂
Initialize: Select a subset S_0 of S′ secants, set A : X ↦ {v_i^T X v_i}_{v_i ∈ S_0}
Obtain initial estimate P̂ ← NuMax(S_0, δ)
while not converged do
    Ŝ ← {v_i ∈ S_0 : |v_i^T P̂ v_i − 1| = δ}                      {Retain active constraints}
    S_1 ← {v_i ∈ S : v_i ∉ S_0}, |S_1| = S″                       {Select additional test secants}
    Ŝ ← Ŝ ∪ {v_i ∈ S_1 : |v_i^T P̂ v_i − 1| ≥ δ}                   {Add secants that violate constraints}
    P̂ ← NuMax(Ŝ, δ)                                              {Update estimate}
    S_0 ← Ŝ
end while
return P̂

5.1 Linear Low-Dimensional Embeddings

We first demonstrate that NuMax can be used to derive linear, low-dimensional embeddings of possibly complicated image datasets. Given a dataset X of images, we estimate the performance of the three different types of low-dimensional embeddings outlined in the Introduction. First, we perform an empirical estimation of the isometry constant obtained via PCA. We achieve this by projecting the secants onto M PCA basis functions learned on the secant set S(X) and calculating the norms of the projected secants; the worst-case deviation from unity gives the estimate of the isometry constant. Second, we perform a similar isometry constant calculation using M random Gaussian projections, where each entry of the M × N linear embedding matrix is sampled independently from a Gaussian distribution with zero mean and variance 1/M. Third, for a desired value of the isometry constant δ, we solve (4) using NuMax (Alg. 1) to obtain a positive semidefinite symmetric matrix P*; we measure the rank of P* and denote it by M. We are interested in characterizing the variation of the isometry constant δ with the number of measurements M.

We first consider a synthetic manifold M comprised of N = 16 × 16 = 256-dimensional images of translations of a white square on a black background (see Fig. 1(a)). The degrees of freedom for each image are simply the 2D coordinates of the center of the square, and hence the dimension of the manifold is K = 2. We construct a training set S(X) of S = 1000 secants by randomly sampling pairs of images from this manifold and normalizing the secants using (2). Figure 1(b) plots the variation of the number of measurements M as a function of the isometry constant δ.


Figure 1: (a) Example images from the manifold of translating squares. Each image is a point in N = 16 × 16 = 256-dimensional space. (b) Empirical isometry constant δ vs. number of measurements M using NuMax, PCA, and random embeddings. The SDP secant embedding approach ensures global approximate isometry using by far the fewest measurements.

Figure 1 can be viewed as analogous to the rate-distortion curve commonly studied in information theory; here, δ represents the distortion and the undersampling factor M/N represents the compression rate. We observe that the NuMax embedding Ψ achieves the desired isometry constant on the secants using by far the fewest measurements. Interestingly, both the NuMax and PCA embeddings far outperform the random projection approach. For illustration purposes, we display (in montage form) the measurement basis functions (i.e., rows of Ψ) obtained by NuMax in Fig. 2.

Next, we consider a more challenging real-world dataset. The MNIST dataset (LeCun and Cortes, 1998) contains a large number of digital images of handwritten digits and is commonly used as a benchmark for various machine learning algorithms. The images themselves exhibit dramatic variations (see Fig. 3(a)), and presumably the set of all images forms a highly nonlinear, non-differentiable manifold. We construct a training dataset S(X) comprising S = 3000 secants and estimate the variation of the isometry constant δ with the number of measurements M. The results of this experiment are plotted in Fig. 3(b). Once again, we observe that NuMax provides the best linear embedding for a given value of δ in terms of reduced dimensionality, and that both NuMax and PCA outperform random projections. For instance, for a distortion parameter δ = 0.2, NuMax produces an embedding with 8× fewer measurements than PCA. In essence, NuMax provides the best possible rate-distortion curve in terms of compressing the given image database.

Next, we compare the runtime performance of NuMax and NuMax-CG by testing them on subsets of the MNIST dataset. We use the training dataset associated with the digit "5" and generate problems of different sizes by varying the number of secants. For each ensuing collection of secants, we solve both NuMax and NuMax-CG and observe the individual running times as well as the fraction of constraints that are active at the solution of NuMax-CG.

Figure 2: Montage of basis functions (rows of Ψ) obtained by NuMax for the dataset in Fig. 1 with different values of δ: (a) δ = 0.50, M = 12; (b) δ = 0.25, M = 23; (c) δ = 0.1, M = 38; (d) δ = 0.05, M = 62.

For each problem size, we perform 10 different trials and compile average statistics. Figure 4(a) demonstrates that the number of active secants can be significantly smaller than the total number of secants, suggesting that NuMax-CG can be considerably faster than NuMax. Figure 4(b) confirms this fact: for a problem size of S = 5 × 10^4 secants, NuMax-CG outperforms NuMax in terms of running time by a full order of magnitude. Moreover, despite the heuristic nature of NuMax-CG, we observed in practice that the solutions obtained by NuMax and NuMax-CG were virtually identical.

5.2 Approximate Nearest Neighbors (ANN)

The notion of nearest neighbors (NN) is vital to numerous problems in estimation, classification, and regression (Cover and Hart, 1967); the ubiquity of NN-based machine learning stems in part from its conceptual simplicity. Suppose that a large dataset of training examples is available.


Figure 3: (a) Example images from the MNIST dataset. Each image is a point in N = 28 × 28 = 784-dimensional space. (b) Empirical isometry constant δ vs. number of measurements M using NuMax, PCA, and random embeddings. NuMax ensures global approximate isometry using the fewest measurements; for example, for a distortion parameter δ = 0.2, it produces an embedding with 8× fewer measurements than PCA.


Figure 4: Performance of NuMax-CG on the MNIST handwritten digit database (LeCun and Cortes, 1998). (a) Ratio of active secants to total number of secants for problems of different sizes. As the problem size (number of secants) increases, the ratio of active secants decreases exponentially; this implies dramatic improvements in computational cost for NuMax-CG over NuMax. (b) Timing plots comparing NuMax-CG and NuMax for problems of different sizes.

Then, given a new (query) data point, NN-based machine learning techniques identify the k points in the training dataset closest to the query point and use these points for further processing.

Suppose that the data points are modeled as elements of a vector space. As the dimension N of the data grows, the computational cost of finding the k nearest neighbors becomes challenging (Arya et al., 1998). To counter this challenge, instead of computing nearest neighbors of the query data point, one can construct a near-isometric embedding of the data into an M-dimensional space and estimate approximate nearest neighbors (ANN) in the embedded space. By carefully controlling the distortion in distance caused by the lower-dimensional embedding, efficient inference techniques can be performed with little loss in performance. The ANN principle forms the core of locality sensitive hashing (LSH), a popular technique for high-dimensional pattern recognition and information retrieval (Shakhnarovich et al., 2005; Indyk and Motwani, 1998).

Given a fixed dataset, the time complexity of a particular ANN method directly depends upon the dimension M of the embedded space: the smaller the embedding dimension, the faster the ANN method. Most existing ANN methods (including LSH) compute either a randomized linear dimensionality reduction or a PCA decomposition of the data. In contrast, our proposed NuMax algorithm provides a linear near-isometric embedding that achieves a given distortion δ while minimizing M. In other words, NuMax can potentially enable far more efficient ANN computations than conventional approaches.

We test the efficiency of our approach on a set of Q = 4000 images taken from the LabelMe database (Russell et al., 2008). This database consists of high-resolution photographs of both indoor and outdoor scenes (see Fig. 5 for several examples). We compute GIST feature descriptors (Oliva and Torralba, 2001) for every image. In our case, the GIST descriptors are vectors of size N = 512 that coarsely express the dominant spatial statistics of the scene; such descriptors have been shown to be very useful for image retrieval purposes. Therefore, our "ground truth" data consists of a matrix of size N × Q. Since the number of pairwise secants in this case is extremely large (S = Q(Q − 1)/2 ≈ 8 × 10^6), we use NuMax-CG to estimate the linear embedding of lowest rank for a given distortion parameter δ. We record M, the rank of the optimal linear embedding, and for comparison purposes we also compute M-dimensional random linear projections of the data as well as the best M-term PCA approximation of the data. Subsequent ANN computations for a set of 1000 test query points are performed in the corresponding M-dimensional space.

Figure 6 displays the benefits of ANN computations using the linear embedding generated by NuMax-CG. For a given neighborhood size k, we plot the fraction of k-nearest neighbors computed using the full (ground truth) N-dimensional data that are also k-nearest neighbors in the corresponding M-dimensional embedding. We observe from Fig. 6 that the linear embedding obtained by NuMax-CG provides the best embedding results for a wide range of measurements M and neighborhood sizes k. In particular, for embedding dimensions M > 45, NuMax-CG outperforms both PCA and random projections for all values of k by a significant margin.
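The evaluation metric used in Fig. 6, namely the fraction of k-nearest neighbors preserved by an embedding, can be computed as follows; the sketch is our own and assumes SciPy for the pairwise distances.

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_preserved(X, Y, k):
    """Average fraction of k-nearest neighbors (computed on the columns of the full
    data X, N x Q) that remain k-nearest neighbors of the embedded columns Y, M x Q."""
    D_full, D_emb = cdist(X.T, X.T), cdist(Y.T, Y.T)
    np.fill_diagonal(D_full, np.inf)               # exclude self-matches
    np.fill_diagonal(D_emb, np.inf)
    nn_full = np.argsort(D_full, axis=1)[:, :k]
    nn_emb = np.argsort(D_emb, axis=1)[:, :k]
    overlap = [len(set(a) & set(b)) for a, b in zip(nn_full, nn_emb)]
    return np.mean(overlap) / k

# Example with a random projection of random GIST-sized vectors (illustrative only).
rng = np.random.default_rng(2)
X = rng.standard_normal((512, 300))
Phi = rng.standard_normal((40, 512)) / np.sqrt(40)
print(knn_preserved(X, Phi @ X, k=10))
```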
5.3 Compressive Sensing of Manifold-Modeled Signals

We demonstrate the utility of our framework for designing efficient compressive sensing measurement matrices. As discussed in Section 2, the canonical approach in CS theory and practice is to construct matrices Φ ∈ R^{M×N}, with M as small as possible, that satisfy the RIP (with distortion parameter δ) on the set of signals of interest.

Figure 5: Example images from the LabelMe dataset (Russell et al., 2008). RIP (with distortion parameter δ) on the set of signals of interest. Typically, such matrices are constructed simply by drawing elements from, say, a standard normal probability distribution. Such matrices are universal in the sense that they can be constructed independently from the signal set of interest. Our proposed framework and NuMax algorithm suggests an alternate approach for constructing CS measurement matrices that are tailored to specific signal models. We perform the following numerical experiment. Given a set of example signals originating from a low-dimensional manifold, we divide it into training and test datasets. Using the training dataset, we learn a measurement matrix Ψ that satisfies RIP for all secants generated from the training dataset using NuMax-CG for a pre-chosen value of δ. Given such a measurement matrix, we are interested in (a) characterizing the RIP of the matrix Ψ when applied to secants from the test dataset, and (b) characterizing the efficiency of CS recovery using Ψ on signals belonging to the test dataset. Figure 7 displays the results of this experiment on an image dataset corresponding to a 2D manifold of a translating Gaussian blob. Each element on this 2D-manifold corresponds to an image of size N = 32 × 32 = 256 pixels. The standard deviation of the blob is chosen as 6 pixels. As the training dataset, we select images where the center pixel of the Gaussian blob is on an even row and column. All other images are considered to comprise the test dataset. Figure 7(a) compares the number of measurements required to reach a specified isometry constant δlearn for both NuMax and random measurement matrices. As in earlier experiments, NuMax requires significantly lesser number of measurements, as compared to more conventional (random) CS matrices, to achieve the same value of δlearn . Figure 7(b) demonstrates the variation of the empirical isometry constant of both on new, unseen secants from the test dataset. In Fig. 7(b), δlearn is the parameter used for applying NuMax to the training dataset, while δtest is the worst-case distortion among all pairwise secants from the test dataset. Thanks to their universality, we observe that traditional (random) CS matrices enjoy the same isometry constant on both training and 19

1 Percentage of preserved NN

Percentage of preserved NN

1

0.8

0.6

0.4

0.2

0 0

NuMax PCA Random 20

40 60 Number of NN

0.8

0.6

0.4

0.2

0 0

80

δ = 0.5, M = 38

40 60 Number of NN

80

1 Percentage of preserved NN

Percentage of preserved NN

20

δ = 0.3, M = 46

1

0.8

0.6

0.4

0.2

0 0

NuMax PCA Random

NuMax PCA Random 20

40 60 Number of NN

0.8

0.6

0.4

0.2

0 0

80

δ = 0.2, M = 59

NuMax PCA Random 20

40 60 Number of NN

80

δ = 0.1, M = 86

Figure 6: Approximate Nearest Neighbors (ANN) for the LabelMe dataset using various linear embedding methods. A set of 4000 images are chosen and GIST features of size N = 512 are computed for every image. For a given number of nearest neighbors k, we plot the average fraction of k-nearest neighbors that are retained in an M -dimensional embedding relative to the full N -dimensional data. NuMaxCG provides the best embedding results for a wide range of measurements M and neighborhood sizes k.

test datasets. However, we observe that for the matrix Ψ generated by NuMax, δtest is marginally greater than δlearn . This suggests a moderate loss of universality using Ψ, but a significant gain in terms of lowering the number of measurements. Finally, we demonstrate the improved performance of CS recovery using the NuMax embeddings. We obtain (noisy) compressive measurements using both random Gaussian matrices and the matrices obtained by NuMax for different values of measurement SNR and M . Using the noisy measurements, we perform CS recovery via Manifold Iterative Pursuit (MIP), a projected-gradient type method for the recovery of manifold-modeled signals (Shah and Chandrasekharan, 2011). Figure 8 compares the recovery performance for different SNRs and different number of measurements for both random Gaussian and NuMax measurement matrices. We observe that in terms of recovered signal MSE, NuMax outperforms random Gaussian measurements for all values of SNR and for all values of M . 20

0

1000

10 NuMax Random

NuMax Random

−1

test

100

δ

M

10

−2

10

10 0

−3

0.1

0.2

0.3

δ

0.4

10 −3 10

0.5

−2

10

learn

(a)

−1

δlearn

10

0

10

(b)

Figure 7: (a) Number of measurements M vs. input isometry constant δlearn . (b) Empirical (observed) isometry constant δtest vs. input isometry constant δlearn for NuMax and random projections.

NuMax: 20 dB NuMax: 6 dB NuMax: 0 dB Random: 20 dB Random: 6 dB Random: 0 dB

20

MSE

15

10

5

0

15

20

25 M

30

35

Figure 8: CS recovery performance for NuMax and random projections. NuMax far outperforms random Gaussian projections in terms of recovered signal MSE for all ranges of measurements M as well as SNR.

5.4 Supervised Binary Classification We consider a supervised classification problem, where our training classes consist of binary images of shifts of a translating disk and a translating square (several examples are shown in Fig. 9(a)). We construct a training dataset of S = 2000 inter-class secants and obtain a measurement matrix Ψinter via NuMax. Using a small number of measurements of a test 21

0.5

NuMax PCA Random

Probability of error

0.4 0.3 0.2 0.1 0 0

10

20 30 40 50 Number of measurements M

60

(b)

(a)

Figure 9: Binary classification from low-dimensional linear embeddings. (a) The signals of interest comprise shifted images of a white disk/square on a black background. We observe M linear measurements of a test image using different matrices, and classify the observed samples using a GMLC approach. (b) Observed probability of classification error as a function of M . NuMax approach yields high classification rates using very few measurements.

signal, we estimate the class label using a Generalized Maximum Likelihood Classification (GMLC) approach following the framework in (Davenport et al., 2007); assuming the availability of sufficiently many training examples, this formulation is essentially equivalent to nearest neighbors (NN)-based classification. We repeat this classification experiment using measurement basis functions learned via PCA on the inter-class secants as well as random projections. Figure 9(c) plots the variation of the number of measurements M vs. the probability of error. Again, we observe that NuMax significantly outperforms PCA and random projections. Finally, we test our approach on the more challenging CVDomes dataset, a collection of simulated X-band radar scattering databases for civilian vehicles (Dungan et al., 2010). The database for each vehicle class consists of (complex-valued) short-time signatures of length N = 128 for the entire 360◦ azimuth, sampled every 0.0625◦ . We model each of the two classes as a 1D manifold, construct S = 2000 training inter-class secants, and compute the measurement matrix Ψinter using NuMax. Additionally, we obtain a different measurement matrix Ψjoint from a training dataset of S = 2000 vectors that comprise both inter- and intra-class secants. Given a new test signal, we acquire M linear measurements and perform maximum likelihood classification. From Fig. 10, we infer that the SDP approach of optimizing (4) over the inter-class secants yields the best classification performance among the differerent measurement techniques. In particular, Ψinter produces the best classification performance, proving the potential benefit of our approach. 22

0.5

NuMax Inter NuMax Joint PCA Random

Probability of error

0.4 0.3 0.2 0.1 0 0

5

10 15 20 25 Number of measurements M

30

Figure 10: Classification of a Toyota Camry versus a Nissan Maxima using M linear measurements of length-128 radar signatures from the CVDomes dataset (Dungan et al., 2010). The SDP approach produces > 95% classification rates using only a small number of measurements M ≤ 15.

6. Discussion

In this paper, we have taken some initial steps towards constructing a comprehensive algorithmic framework that creates a linear, isometry-preserving embedding of a high-dimensional dataset. Our framework is based on a convex optimization formulation (in particular, the SDP (4)) that approximately preserves the norms of all pairwise secants of the given dataset. We have developed two efficient algorithms, NuMax and NuMax-CG, that construct the desired embedding with considerably smaller computational complexity than existing approaches. Our NuMax methods can be easily adapted to perform more complicated machine learning tasks, such as approximate nearest neighbors (ANN) and supervised binary classification. In addition, the NuMax embeddings can be successfully used in compressive sensing applications where the signals of interest can be modeled as elements lying on a smooth, low-dimensional manifold.

The problem of constructing linear, information-preserving embeddings of high-dimensional signals into a low-dimensional space is of central importance in a number of machine learning and signal processing applications. Despite their practical significance, surprisingly little is known about near-isometric linear embeddings beyond the Johnson-Lindenstrauss Lemma (Matousek, 2011). For example, it is unclear, even from a theoretical perspective, how to establish the dimension M of the smallest possible subspace into which a specific dataset X can be embedded using a linear mapping. The framework proposed in this paper adopts a deterministic, algorithmic approach to answering such important questions. While we do not provide a rigorous analytical characterization for our framework, we hope to initiate a line of work that might lead to some interesting theoretical conclusions.

Several challenges remain. First, our approach relies on the efficiency of the nuclear norm as a proxy for the matrix rank in the objective function in (4). A natural question is under what conditions the optimum of the convex relaxation (4) equals the optimum of the nonconvex problem (3). Moreover, while the convergence speed of our proposed algorithms (NuMax and NuMax-CG) has been empirically shown to be far better than that of traditional methods, a theoretical analysis of our algorithms remains a challenging task. Finally, from a practical perspective, it is common nowadays in machine learning to encounter datasets that involve millions (or even billions) of training signals, and optimization on such datasets is only feasible when performed in a highly parallel, decentralized, and distributed fashion. How, then, to extend our proposed algorithms to such scenarios? We defer these important challenges to future research.
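To illustrate the relaxation in question, the following sketch solves a small instance of the trace-minimization SDP on the normalized secant set of a toy dataset and factors the optimal matrix into an embedding. It assumes that (4) takes the form: minimize trace(P) subject to |vᵀPv − 1| ≤ δ for every secant v, with P positive semidefinite; the cvxpy-based code, the toy data, and all names are illustrative and are not the NuMax implementation itself.

import itertools
import numpy as np
import cvxpy as cp

def normalized_secants(X):
    """Return the secant set S(X): normalized pairwise differences of the rows of X."""
    diffs = [x - y for x, y in itertools.combinations(X, 2)]
    return np.array([d / np.linalg.norm(d) for d in diffs if np.linalg.norm(d) > 0])

def trace_min_embedding(secants, delta):
    """Solve the trace-norm relaxation and factor the result into an embedding Psi."""
    N = secants.shape[1]
    P = cp.Variable((N, N), PSD=True)
    constraints = [cp.abs(cp.quad_form(v, P) - 1) <= delta for v in secants]
    cp.Problem(cp.Minimize(cp.trace(P)), constraints).solve()

    # Factor P = Psi^T Psi, keeping only significant eigenvalues; their count gives M.
    eigvals, eigvecs = np.linalg.eigh(P.value)
    keep = eigvals > 1e-6 * eigvals.max()
    return np.sqrt(eigvals[keep])[:, None] * eigvecs[:, keep].T

# Toy example: 20 points lying on a 2D subspace of R^10.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 10))
Psi = trace_min_embedding(normalized_secants(X), delta=0.1)
print("Embedding dimension M =", Psi.shape[0])

At realistic problem sizes, the generic solver call above would be replaced by NuMax or NuMax-CG, as discussed above.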

Acknowledgments

This work was supported in part by the grants NSF CCF-0431150, CCF-0926127, CCF-1117939, DMS-0748839 and ECCS-1028790; DARPA/ONR N66001-11-C-4092 and N66001-11-1-4090; ONR N00014-08-1-1101, N00014-08-1-1112, N00014-10-1-0989, and N00014-11-1-0714; AFOSR FA9550-09-1-0432; ARO MURI W911NF-07-1-0185 and W911NF-09-1-0383; and the Texas Instruments Leadership University Program. A preliminary conference version of this paper appeared in the Proceedings of the Statistical Signal Processing Workshop (SSP) in August 2012 (Hegde et al., 2012).

References

D. Achlioptas. Database-friendly random projections. In Proc. Symp. Principles of Database Systems (PODS), Santa Barbara, CA, May 2001.

F. Alizadeh. Interior-point methods in semidefinite programming with applications to combinatorial optimization. SIAM J. Optimization, 5(1), 1995.

N. Alon. Problems and results in extremal combinatorics. Discrete Math., 273(1):31–53, 2003.

S. Arya, D. Mount, N. Netanyahu, R. Silverman, and A. Wu. An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. J. ACM, 45(6):891–923, 1998.

R. Baraniuk and M. Wakin. Random projections of smooth manifolds. Found. Comput. Math., 9(1):51–77, 2009.

R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin. A simple proof of the restricted isometry property for random matrices. Const. Approx., 28(3):253–263, 2008.

A. Barvinok. Problems of distance geometry and convex properties of quadratic maps. Discrete and Comput. Geometry, 13(1):189–202, 1995.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge Univ. Press, Cambridge, England, 2004.

D. Broomhead and M. Kirby. The Whitney Reduction Network: A method for computing autoassociative graphs. Neural Comput., 13:2595–2616, 2001.

D. Broomhead and M. Kirby. Dimensionality reduction using secant-based projection methods: The induced dynamics in projected systems. Nonlinear Dynamics, 41(1):47–67, 2005.

E. Candès. Compressive sampling. In Proc. Int. Congress of Math., Madrid, Spain, Aug. 2006.

E. Candès and B. Recht. Simple bounds for low-complexity model reconstruction. To appear in Math. Prog., 2012.

E. Candès, T. Strohmer, and V. Voroninski. PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming. To appear in Comm. Pure Appl. Math., 2012.

K. Clarkson. Tighter bounds for random projections of manifolds. In Proc. Symp. Comp. Geom., pages 39–48. ACM, 2008.

T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Trans. Inform. Theory, 13(1):21–27, 1967.

G. Dantzig and P. Wolfe. Decomposition principle for linear programs. Math. Operations Research, 8(1):101–111, 1960.

M. Davenport, M. Duarte, M. Wakin, J. Laska, D. Takhar, K. Kelly, and R. Baraniuk. The smashed filter for compressive classification and target recognition. In Proc. IS&T/SPIE Symp. Elec. Imag.: Comp. Imag., San Jose, CA, Jan. 2007.

D. Donoho. Compressed sensing. IEEE Trans. Inform. Theory, 52(4):1289–1306, 2006.

D. Donoho and C. Grimes. Hessian Eigenmaps: Locally linear embedding techniques for high-dimensional data. Proc. Natl. Acad. Sci., 100(10):5591–5596, 2003.

R. Dony and S. Haykin. Optimally adaptive transform coding. IEEE Trans. Image Processing, 4(10):1358–1370, 1995.

J. Douglas and H. Rachford. On the numerical solution of heat conduction problems in two and three space variables. Trans. Amer. Math. Soc., 82:421–439, 1956.

K. Dungan, C. Austin, J. Nehrbass, and L. Potter. Civilian vehicle radar data domes. In Proc. SPIE Algo. Synth. Aperture Radar Imagery, Orlando, FL, Apr. 2010.

M. Fazel. Matrix Rank Minimization With Applications. PhD thesis, Stanford Univ., 2002.

R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Human Genetics, 7(2):179–188, 1936.

N. Halko, P. Martinsson, and J. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev., 53(2):217–288, 2011.

H. Harman. Modern Factor Analysis. U. Chicago Press, 1976.

C. Hegde, A. Sankaranarayanan, and R. Baraniuk. Near-isometric linear embeddings of manifolds. In Proc. IEEE Work. Stat. Signal Processing (SSP), Ann Arbor, MI, Aug. 2012.

M. Hirsch. Differential Topology. Springer, 1976.

P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. ACM Symp. Theory of Comput., pages 604–613, New York, NY, 1998.

W. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Proc. Conf. Modern Anal. and Prob., New Haven, CT, Jun. 1982.

B. Kulis, A. Surendran, and J. Platt. Fast low-rank semidefinite programming for embedding and clustering. In Proc. Int. AISTATS Conf., 2007.

J. Laska, S. Kirolos, M. Duarte, T. Ragheb, R. Baraniuk, and Y. Massoud. Theory and implementation of an analog-to-information conversion using random demodulation. In Proc. IEEE Int. Symposium on Circuits and Systems (ISCAS), New Orleans, LA, May 2007.

Y. LeCun and C. Cortes. MNIST handwritten digit database, 1998. Available online at http://yann.lecun.com/exdb/mnist.

N. Linial, E. London, and Y. Rabinovich. The geometry of graphs and some of its algorithmic applications. Combinatorica, 15(2):215–245, 1995.

D. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Math. Prog., 45(1):503–528, 1989.

S. Ma, D. Goldfarb, and L. Chen. Fixed point and Bregman iterative methods for matrix rank minimization. Math. Prog., 128(1):321–353, 2011.

J. Matousek. Open problems on embeddings of finite metric spaces, 2011. Available online at http://kam.mff.cuni.cz/~matousek/metrop.ps.

J. Meijerink and H. van der Vorst. An iterative solution method for linear systems of which the coefficient matrix is a symmetric M-matrix. Math. Comp., 31(137):148–162, 1977.

B. Moore. Principal component analysis in linear systems: Controllability, observability, and model reduction. IEEE Trans. Automat. Control, 26(1):17–32, 1981.

P. Moscato, M. Norman, and G. Pataki. On the rank of extreme matrices in semidefinite programs and the multiplicity of optimal eigenvalues. Math. Operations Research, 23(2):339–358, 1998.

A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vision, 42(3):145–175, 2001.

I. Polik. SeDuMi 1.3, 2010. Available online at http://sedumi.ie.lehigh.edu.

B. Recht, M. Fazel, and P. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–500, 2010.

S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.

B. Russell, A. Torralba, K. Murphy, and W. Freeman. LabelMe: A database and web-based tool for image annotation. Int. J. Comput. Vision, 77(1):157–173, 2008.

P. Shah and V. Chandrasekharan. Iterative projections for signal identification on manifolds. In Proc. Allerton Conf. on Comm., Contr., and Comp., Monticello, IL, Sept. 2011.

G. Shakhnarovich, T. Darrell, and P. Indyk. Nearest-Neighbor Methods in Learning and Vision: Theory and Practice. MIT Press, Cambridge, MA, 2005.

J. Tenenbaum, V. de Silva, and J. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, 2000.

M. Tipping and C. Bishop. Probabilistic principal component analysis. J. Royal Statist. Soc. B, 61(3):611–622, 1999.

J. Tropp and A. Gilbert. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Trans. Inform. Theory, 53(12):4655–4666, 2007.

R. Tütüncü, K. Toh, and M. Todd. Solving semidefinite-quadratic-linear programs using SDPT3. Math. Prog., 95(2):189–217, 2003.

M. Wakin, J. Laska, M. Duarte, D. Baron, S. Sarvotham, D. Takhar, K. Kelly, and R. Baraniuk. An architecture for compressive imaging. In Proc. IEEE Int. Conf. Image Processing (ICIP), Atlanta, GA, Oct. 2006.

K. Weinberger and L. Saul. Unsupervised learning of image manifolds by semidefinite programming. Int. J. Comput. Vision, 70(1):77–90, 2006.

Z. Wen. First-order Methods for Semidefinite Programming. PhD thesis, Columbia University, 2009.
