Sparse distance metric learning for embedding compositional data
Zachary D. Kurtz (zachary.kurtz@med.nyu.edu), Sackler Institute of Graduate Biomedical Sciences, NYU School of Medicine, New York, NY 10016
Christian L. Müller (cmueller@simonsfoundation.org) and Richard A. Bonneau (rbonneau@simonsfoundation.org), Simons Center for Data Analysis, Simons Foundation, New York, NY 10011
Abstract. We propose a novel method for distance metric learning and generalized Aitchison embedding of multi-class data on the (p − 1)-simplex. We consider the p > n setting where p is the number of variables and n the number of data points. This problem setup is motivated by common learning tasks on data sets arising in microbial ecology, where the relative abundance of p species is measured across n environments or patients. Our approach can specifically handle data that contain a large number of zero measurements (zero inflation), a common property for data acquired from targeted high-throughput and single-cell sequencing.
We are given n data pairs $(x_k, y_k)$, where the $x_k$ are p-dimensional compositional data vectors restricted to the simplex $\mathcal{S}_+^{p-1} = \{x = (x_1, \ldots, x_i, \ldots, x_p) : x_i \geq 0, \sum_{i=1}^{p} x_i = 1\}$, and each vector $x_k$ is associated with a class label $y_k \in \{1, \ldots, K\}$. We seek to learn an embedding of the compositional data in Euclidean space under the constraint that distances between $x_i$ and $x_j$ are small when $y_i = y_j$ and large for all data points across classes, in order to improve multi-class learning using k-nearest-neighbor classification (Weinberger & Saul, 2009).

We propose to use Generalized Aitchison Embeddings (GAEs) (Le & Cuturi, 2013) of the form
$$a(x) \equiv P \log(x + b) \in \mathbb{R}^m,$$
where $P \in \mathbb{R}^{m \times p}$, $m \leq p$, is a projection matrix and $b \in \mathbb{R}_+^p$ a positive vector of "pseudocounts". Le & Cuturi (2013; 2014) cast the objective of learning the matrix $P$ and the vector $b$ as a metric learning problem for the Euclidean distance between two compositions under an Aitchison embedding:
$$d_a(x_i, x_j) = d_E(a(x_i), a(x_j)) = \|P \log(x_i + b) - P \log(x_j + b)\|_2.$$
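As a minimal illustration of the embedding just defined, the following NumPy sketch evaluates $a(x)$ and the induced Euclidean distance; the projection matrix and pseudocount vector here are placeholders for illustration, not quantities learned by our method.

```python
import numpy as np

def gae_embed(x, P, b):
    """Generalized Aitchison embedding a(x) = P log(x + b).
    x: (p,) composition on the simplex, P: (m, p) projection, b: (p,) pseudocounts."""
    return P @ np.log(x + b)

def gae_distance(xi, xj, P, b):
    """Euclidean distance between two compositions under the embedding."""
    return np.linalg.norm(gae_embed(xi, P, b) - gae_embed(xj, P, b))

# Illustrative example with p = 5 species embedded into m = 3 dimensions.
rng = np.random.default_rng(0)
xi = rng.dirichlet(np.ones(5))           # random point on the 4-simplex
xj = rng.dirichlet(np.ones(5))
P = rng.standard_normal((3, 5))          # placeholder projection (learned in practice)
b = np.full(5, 0.1)                      # placeholder pseudocounts (learned in practice)
print(gae_distance(xi, xj, P, b))
```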
To (i) learn embeddings for high-dimensional compositions when p > n and (ii) adapt the vector of pseudocounts to the underlying sparsity of individual samples $x_k$, we adopt a novel regularization framework for distance metric learning with the following (squared) Mahalanobis distance (Mateu-Figueras et al., 2013):
$$d_M^2(x_i, x_j) = (a(x_i) - a(x_j))^T \Sigma^{-1} (a(x_i) - a(x_j)) = \Delta_{ij}^T Q \Delta_{ij},$$
where $\Sigma$ denotes the covariance matrix of the Aitchison-embedded variables, $\Delta_{ij} = \log(x_i + b_i) - \log(x_j + b_j)$, and $Q = P^T \Lambda P$ is the Mahalanobis metric of the Aitchison embedding, with $\Lambda$ the covariance of the log-transformed data.
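A corresponding sketch of this squared Mahalanobis distance is given below; $Q$ and the sample-specific pseudocounts are again fixed placeholders here, whereas in our method they are learned from data.

```python
import numpy as np

def mahalanobis_sq(xi, xj, bi, bj, Q):
    """Squared Mahalanobis distance d_M^2(x_i, x_j) = delta^T Q delta,
    with delta = log(x_i + b_i) - log(x_j + b_j)."""
    delta = np.log(xi + bi) - np.log(xj + bj)
    return float(delta @ Q @ delta)

# Placeholder inputs for illustration (p = 5); Q must be positive definite.
rng = np.random.default_rng(1)
xi, xj = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))
bi, bj = np.full(5, 0.05), np.full(5, 0.2)   # sample-specific pseudocounts
A = rng.standard_normal((5, 5))
Q = A @ A.T + 5 * np.eye(5)                  # generic positive definite metric
print(mahalanobis_sq(xi, xj, bi, bj, Q))
```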
Let $X = (x_1, \ldots, x_i, \ldots, x_n) \in \mathbb{R}^{p \times n}$ be the given data and $B = (b_1, \ldots, b_i, \ldots, b_n) \in \mathbb{R}_+^{p \times n}$ the matrix of unknown pseudocounts. Our novel regularized distance metric learning approach is based on the following non-convex optimization problem:
$$\min_{Q \succeq 0,\; B \in \mathbb{R}_+^{p \times n}} \; -\log\det Q + \mathrm{tr}(S^T Q) \quad \text{subject to} \quad \sum_{(x_i, x_j) \in \mathcal{D}} d_M(x_i, x_j) > 1, \quad \|Q\|_1 \leq s, \quad \|B\|_* \leq r,$$
where $S = \log(X + B)\log(X + B)^T$ and $\mathcal{D}$ is the set of dissimilar pairs. The $\ell_1$-penalty on the positive definite matrix Q promotes sparsity, which is a common structural assumption in the p > n regime (Friedman et al., 2008). Similarly, the nuclear norm $\|B\|_* = \mathrm{tr}\sqrt{BB^T}$ promotes low-rank structure in the pseudocount matrix, implying that there is a low-dimensional subspace that can accurately capture the variability of the sample sparsity. The positive scalar quantities s and r are tuning parameters that require data-driven selection.
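For concreteness, the snippet below evaluates the objective $-\log\det Q + \mathrm{tr}(S^T Q)$ and checks the three constraints on toy inputs; $Q$ and $B$ are fixed, illustrative matrices here, whereas in the actual problem they are the optimization variables.

```python
import numpy as np

def objective(Q, X, B):
    """Penalized log-det objective -log det Q + tr(S^T Q),
    with S = log(X + B) log(X + B)^T and X, B of shape (p, n)."""
    L = np.log(X + B)
    S = L @ L.T
    _, logdet = np.linalg.slogdet(Q)
    return -logdet + np.trace(S.T @ Q)

def constraints_satisfied(Q, X, B, dissimilar, s, r):
    """Check the three constraints: the summed Mahalanobis distance over
    dissimilar pairs exceeds 1, ||Q||_1 <= s, and the nuclear norm ||B||_* <= r."""
    L = np.log(X + B)
    d_sum = sum(np.sqrt((L[:, i] - L[:, j]) @ Q @ (L[:, i] - L[:, j]))
                for i, j in dissimilar)
    return d_sum > 1, np.abs(Q).sum() <= s, np.linalg.norm(B, ord='nuc') <= r

# Toy data: p = 6 variables, n = 4 compositions (columns of X lie on the simplex).
rng = np.random.default_rng(2)
X = rng.dirichlet(np.ones(6), size=4).T
B = np.full_like(X, 0.1)          # placeholder pseudocount matrix (learned in practice)
Q = np.eye(6)                     # placeholder positive definite metric (learned in practice)
print(objective(Q, X, B))
print(constraints_satisfied(Q, X, B, dissimilar=[(0, 2), (1, 3)], s=10.0, r=5.0))
```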
To efficiently solve the proposed optimization problem to local optimality, we develop a novel iterative Split-Bregman-type algorithm. We demonstrate the validity and superior performance of our embedding method on a variety of synthetic benchmarks as well as real-world microbial abundance data across different habitats.
References

Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.

Le, Tam and Cuturi, Marco. Generalized Aitchison embeddings for histograms. Asian Conference on Machine Learning, pp. 293–308, 2013. URL http://jmlr.org/proceedings/papers/v29/Le13.html.

Le, Tam and Cuturi, Marco. Adaptive Euclidean maps for histograms: generalized Aitchison embeddings. Machine Learning, 2014. doi: 10.1007/s10994-014-5446-z. URL http://link.springer.com/10.1007/s10994-014-5446-z.

Mateu-Figueras, G., Pawlowsky-Glahn, Vera, and Egozcue, Juan José. The normal distribution in some constrained sample spaces. SORT, 37(1):29–56, 2013. URL http://arxiv.org/abs/0802.2643.

Weinberger, K. Q. and Saul, L. K. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10:207–244, 2009. URL http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2005_265.pdf.